This is possible by observing that while adaptive layer norm blocks account for a negligible share of FLOPs, they carry a large parameter count, around 670M. Because the adaptive layer norm input includes the timestep conditioning, the FLOPs themselves cannot be reduced: each timestep needs its own modulation parameters. However, since the computation has no dependency on intermediate model activations, we can batch the adaptive layer norm computation for every timestep at the start of diffusion sampling, converting many matrix-vector multiplications into a single matrix-matrix multiplication. This is slightly more efficient because the large weight matrices are read from memory once for the whole schedule rather than once per step.
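To make the batching concrete, below is a minimal PyTorch sketch of this precomputation. The names (`AdaLNModulation`, `precompute_adaln`) and the dimensions are illustrative assumptions, not the actual implementation: the key point is that because the modulation depends only on the timestep embedding, the embeddings for the full sampling schedule can be stacked and pushed through each block's projection as one matrix-matrix multiply before the denoising loop begins.

```python
import torch
import torch.nn as nn

class AdaLNModulation(nn.Module):
    """Hypothetical per-block adaLN projection: timestep embedding -> (shift, scale, gate) terms.
    Across all transformer blocks, these linear layers dominate the parameter count."""
    def __init__(self, embed_dim: int, hidden_dim: int):
        super().__init__()
        self.proj = nn.Linear(embed_dim, 6 * hidden_dim)

    def forward(self, t_emb: torch.Tensor) -> torch.Tensor:
        return self.proj(nn.functional.silu(t_emb))

@torch.no_grad()
def precompute_adaln(blocks, t_embs: torch.Tensor):
    """Run every block's adaLN projection on the full timestep schedule at once.

    t_embs: (num_steps, embed_dim) -- one embedding per sampling step.
    Returns one (num_steps, 6 * hidden_dim) tensor per block: a single
    matrix-matrix multiply per block instead of num_steps matrix-vector multiplies.
    """
    return [blk(t_embs) for blk in blocks]

# Usage sketch: cache once before sampling, index per step inside the loop.
blocks = [AdaLNModulation(embed_dim=256, hidden_dim=1024) for _ in range(4)]
t_embs = torch.randn(50, 256)             # embeddings for a 50-step schedule
cache = precompute_adaln(blocks, t_embs)  # one GEMM per block, done up front
for step in range(50):
    for blk_idx in range(len(blocks)):
        mod = cache[blk_idx][step]        # cheap lookup; no adaLN GEMV per step
```

The total FLOPs are unchanged, as noted above; the win is arithmetic intensity, since each block's weight matrix is loaded once and reused across all timesteps in the batched multiply.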