One of the most obvious patterns when training transformers is that activations take up most of the memory: