This is possible by observing that while adaptive layer norm blocks account for a negligible share of FLOPs, they carry a large parameter count, around 670M. Because the adaptive layer norm input includes the timestep conditioning, the FLOPs themselves cannot be reduced: each timestep needs its own modulation parameters. However, since the computation has no dependency on intermediate model activations, we can batch the adaptive layer norm computation for every timestep at the start of diffusion sampling, converting many matrix-vector multiplications into a single matrix-matrix multiplication. This is slightly more efficient because the large weight matrices are read from memory once for the whole schedule rather than once per step.
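To make the batching concrete, below is a minimal PyTorch sketch of this precomputation. The names (`AdaLNModulation`, `precompute_adaln`) and the dimensions are illustrative assumptions, not the actual implementation: the key point is that because the modulation depends only on the timestep embedding, the embeddings for the full sampling schedule can be stacked and pushed through each block's projection as one matrix-matrix multiply before the denoising loop begins.

```python
import torch
import torch.nn as nn

class AdaLNModulation(nn.Module):
    """Hypothetical per-block adaLN projection: timestep embedding -> (shift, scale, gate) terms.
    Across all transformer blocks, these linear layers dominate the parameter count."""
    def __init__(self, embed_dim: int, hidden_dim: int):
        super().__init__()
        self.proj = nn.Linear(embed_dim, 6 * hidden_dim)

    def forward(self, t_emb: torch.Tensor) -> torch.Tensor:
        return self.proj(nn.functional.silu(t_emb))

@torch.no_grad()
def precompute_adaln(blocks, t_embs: torch.Tensor):
    """Run every block's adaLN projection on the full timestep schedule at once.

    t_embs: (num_steps, embed_dim) -- one embedding per sampling step.
    Returns one (num_steps, 6 * hidden_dim) tensor per block: a single
    matrix-matrix multiply per block instead of num_steps matrix-vector multiplies.
    """
    return [blk(t_embs) for blk in blocks]

# Usage sketch: cache once before sampling, index per step inside the loop.
blocks = [AdaLNModulation(embed_dim=256, hidden_dim=1024) for _ in range(4)]
t_embs = torch.randn(50, 256)             # embeddings for a 50-step schedule
cache = precompute_adaln(blocks, t_embs)  # one GEMM per block, done up front
for step in range(50):
    for blk_idx in range(len(blocks)):
        mod = cache[blk_idx][step]        # cheap lookup; no adaLN GEMV per step
```

The total FLOPs are unchanged, as noted above; the win is arithmetic intensity, since each block's weight matrix is loaded once and reused across all timesteps in the batched multiply.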