It might have been more informative to explain why layer normalization in particular was chosen.
Transformers were proposed primarily for NMT, and in NMT normalizing with batch statistics is not helpful. Layer normalization was used instead because it normalizes each sample with its own statistics, so the result for one sample does not depend on the other samples in the batch.
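
To make the distinction concrete, here is a minimal NumPy sketch (not the Transformer implementation itself). The function names `layer_norm` and `batch_norm`, the `eps` value, and the toy 4x8 input are assumptions for illustration, and the learnable gain and bias are left out.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Per-sample statistics: mean and variance over the feature axis (last axis).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def batch_norm(x, eps=1e-5):
    # Batch statistics: mean and variance of each feature across samples (axis 0).
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))  # hypothetical batch: 4 samples, 8 features

# Layer norm: a sample gets the same output alone or inside a batch.
ln_alone = layer_norm(x[:1])
ln_in_batch = layer_norm(x)[:1]
layer_norm_is_batch_independent = np.allclose(ln_alone, ln_in_batch)

# Batch norm: the same sample's output changes when the batch changes.
bn_full_batch = batch_norm(x)[:1]
bn_half_batch = batch_norm(x[:2])[:1]
batch_norm_depends_on_batch = not np.allclose(bn_full_batch, bn_half_batch)

print(layer_norm_is_batch_independent, batch_norm_depends_on_batch)
```

The only difference between the two functions is the axis over which the mean and variance are taken, which is the whole point: layer normalization never looks at other samples.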