We should only have two questions left: “why the residual connections?” and “why the normalizing layers?” Residual connections are a common deep learning technique that lets us build deeper networks in a more stable manner (see ResNet). Residual layers are easier to optimize, so be sure to include them if you are using deep neural networks. The normalization rescales our data to a mean of 0 and a variance of 1; it standardizes the values rather than forcing them into a normal distribution. Layer normalization is another technique that helps stabilize our network and speeds up training.
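To make the passage above concrete, here is a minimal PyTorch sketch of a residual connection followed by layer normalization. The class name `ResidualNormBlock`, the feed-forward sublayer, and the sizes are hypothetical choices for illustration. It assumes a post-norm ordering (normalize after the residual add) as in the original Transformer; the article's own encoder may order these steps differently.

```python
import torch
from torch import nn


class ResidualNormBlock(nn.Module):
    """Hypothetical block: feed-forward sublayer + residual add + layer norm."""

    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.ff = nn.Sequential(
            nn.Linear(dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, dim),
        )
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection: add the sublayer output back onto its input.
        y = x + self.ff(x)
        # Layer norm: standardize each token's feature vector on its own.
        return self.norm(y)


torch.manual_seed(0)
block = ResidualNormBlock(dim=16, hidden_dim=32)
x = torch.randn(2, 5, 16)  # (batch, tokens, features)
out = block(x)

# Per-token statistics over the feature dimension after normalization.
token_mean = out.mean(dim=-1)
token_var = out.var(dim=-1, unbiased=False)
```

With the default LayerNorm parameters (scale 1, shift 0), each token's features should come out with a mean close to 0 and a variance close to 1, while the output keeps the input's shape.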
Training Compact Transformers from Scratch in 30 Minutes with PyTorch
Steven Walton
Sayak Paul
Jun 29, 2021
It might have been a bit more informative to explain why layer normalization in particular.
Transformers were proposed primarily for neural machine translation (NMT), and in NMT normalizing with batch statistics is not helpful. Layer normalization was used instead: it normalizes each sample with that sample's own statistics rather than statistics shared across the batch.
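The difference between batch statistics and per-sample statistics can be checked directly. The following is a rough sketch, not taken from the article: the tensor sizes and variable names are made up for illustration. It passes the same sample through `nn.LayerNorm` and `nn.BatchNorm1d` (in training mode) inside two different batches and compares the results.

```python
import torch
from torch import nn

torch.manual_seed(0)
features = 8
layer_norm = nn.LayerNorm(features)
# A fresh BatchNorm1d is in training mode, so it normalizes with batch statistics.
batch_norm = nn.BatchNorm1d(features)

# One sample placed into two batches with very different companions.
sample = torch.randn(1, features)
batch_a = torch.cat([sample, torch.randn(3, features)])
batch_b = torch.cat([sample, 5 * torch.randn(3, features) + 2])

# Output for the shared sample (row 0) under each normalization.
ln_a, ln_b = layer_norm(batch_a)[0], layer_norm(batch_b)[0]
bn_a, bn_b = batch_norm(batch_a)[0], batch_norm(batch_b)[0]

ln_same = torch.allclose(ln_a, ln_b, atol=1e-6)
bn_same = torch.allclose(bn_a, bn_b, atol=1e-6)
print("layer norm output unchanged across batches:", ln_same)
print("batch norm output unchanged across batches:", bn_same)
```

Layer normalization should give the same output for the sample in both batches, because it only looks at that sample's features. Batch normalization should not, because its mean and variance depend on whatever else is in the batch.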