… we can just use one layer that acts the same as if we had 3 different layers, one for q, k, and v. Because there is no interdependence between layers we can think of these as modular. We’ll see a similar trick again later but with convolutions.
Steven Walton
Sayak Paul
Could you expand a bit more on this one?
ML at 🤗 | Netflix Nerd | Personal site: https://sayak.dev/
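One possible reading of the quoted passage, sketched under stated assumptions rather than taken from the article: if q, k, and v are each produced by an independent linear layer on the same input, stacking the three weight matrices into one larger matrix gives a single layer whose output, once split into three chunks, matches the three separate projections. The helper names (`linear`, `random_layer`), the tiny dimension, and the use of a bias term are all illustrative assumptions, not the author's code.

```python
import random

def linear(x, weight, bias):
    """Apply y = weight @ x + bias to a single input vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]

def random_layer(d_out, d_in, rng):
    """Hypothetical helper: random (d_out x d_in) weights and a d_out bias."""
    weight = [[rng.uniform(-1, 1) for _ in range(d_in)] for _ in range(d_out)]
    bias = [rng.uniform(-1, 1) for _ in range(d_out)]
    return weight, bias

rng = random.Random(0)
d_model = 4  # assumed toy embedding size

# Three separate projections, one each for q, k, and v.
(wq, bq), (wk, bk), (wv, bv) = (
    random_layer(d_model, d_model, rng) for _ in range(3)
)

# One fused layer: stack the weight rows and concatenate the biases.
w_qkv = wq + wk + wv   # shape (3 * d_model, d_model)
b_qkv = bq + bk + bv   # length 3 * d_model

x = [rng.uniform(-1, 1) for _ in range(d_model)]

# A single projection, then split the output into three equal chunks.
qkv = linear(x, w_qkv, b_qkv)
q, k, v = qkv[:d_model], qkv[d_model:2 * d_model], qkv[2 * d_model:]

# Compare against running the three layers independently.
separate = linear(x, wq, bq) + linear(x, wk, bk) + linear(x, wv, bv)
matches = all(abs(a - b) < 1e-12 for a, b in zip(qkv, separate))
print(matches)
```

Each output row depends only on its own weights and the shared input, so no row of the fused matrix affects another. That is the sense in which the three layers can be treated as modular: the fused layer is just the three of them stacked, computed in one matrix multiply.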