Training large neural networks requires careful consideration of how to scale various hyperparameters as model size increases. In this post, we’ll explore some key recipes and insights for efficient large-scale training.

Why Scaling Matters

As models grow larger, the relationship between learning rate, batch size, and model width becomes increasingly important. Naive scaling can lead to training instabilities or suboptimal performance.

Key Scaling Laws

The seminal work of Kaplan et al. (2020) established power-law relationships between loss and compute, dataset size, and model size. Holding data and compute ample, the loss falls as a power of parameter count:

\[L(N) \propto N^{-0.076}\]

where $N$ is the number of (non-embedding) parameters and $L$ is the test loss, with the exponent fit empirically.
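To get a feel for how gentle this exponent is, consider doubling the parameter count:

\[\frac{L(2N)}{L(N)} = 2^{-0.076} \approx 0.949\]

i.e., each doubling of $N$ buys only about a 5% reduction in loss, which is why meaningful gains require order-of-magnitude increases in scale.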

Practical Recommendations

  1. Learning Rate Scaling: Scale the learning rate with the square root of the batch size (a common heuristic for Adam-style optimizers; SGD with momentum is often scaled linearly instead)
  2. Initialization: Use width-dependent initialization scales, e.g. weight variance inversely proportional to fan-in, so activations stay well-behaved as width grows
  3. Warmup: Use longer warmup periods for larger models, which are more prone to instability in the first few thousand steps
  4. Gradient Clipping: Clip the global gradient norm; essential for stability at scale (all four pieces are combined in the sketch after this list)
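
To make these concrete, here is a minimal PyTorch sketch that puts all four recommendations together. The constants (BASE_LR, BASE_BATCH, WARMUP_STEPS, CLIP_NORM) and the toy two-layer model are hypothetical placeholders for illustration, not recommended values; tune them for your own setup.

```python
import torch
import torch.nn as nn

# Hypothetical settings for illustration only.
BASE_LR, BASE_BATCH = 3e-4, 256      # learning rate tuned at a reference batch size
BATCH_SIZE = 4096                    # actual (larger) batch size
WARMUP_STEPS = 2000                  # longer warmup for larger models
CLIP_NORM = 1.0                      # common default for global-norm clipping

# 1. Square-root learning-rate scaling (common heuristic for Adam-style optimizers).
lr = BASE_LR * (BATCH_SIZE / BASE_BATCH) ** 0.5

# 2. Width-dependent initialization: weight variance inversely proportional to fan-in.
def init_weights(module: nn.Module) -> None:
    if isinstance(module, nn.Linear):
        fan_in = module.weight.shape[1]
        nn.init.normal_(module.weight, mean=0.0, std=fan_in ** -0.5)
        if module.bias is not None:
            nn.init.zeros_(module.bias)

model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))
model.apply(init_weights)

optimizer = torch.optim.AdamW(model.parameters(), lr=lr)

# 3. Linear warmup from zero to the target learning rate over WARMUP_STEPS.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda step: min(1.0, (step + 1) / WARMUP_STEPS)
)

def training_step(batch: torch.Tensor, target: torch.Tensor) -> None:
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(batch), target)
    loss.backward()
    # 4. Clip the global gradient norm before the optimizer step.
    torch.nn.utils.clip_grad_norm_(model.parameters(), CLIP_NORM)
    optimizer.step()
    scheduler.step()
```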

μP (Maximal Update Parameterization)

The μP framework (Yang et al., "Tensor Programs V") provides principled, width-dependent scaling rules for initialization and per-layer learning rates, allowing hyperparameters tuned on a small proxy model to transfer directly to much wider ones (a procedure known as μTransfer).
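
As a flavor of what these rules look like, here is a heavily simplified sketch of the hidden-layer rule for Adam: relative to a base width at which hyperparameters were tuned, hidden weights are initialized with variance proportional to 1/fan_in, and their learning rate is scaled down by the width multiplier. BASE_WIDTH, WIDTH, and BASE_LR are illustrative values, and the full parameterization also treats input weights, output weights, and biases differently; the mup package (github.com/microsoft/mup) handles those details.

```python
import torch
import torch.nn as nn

BASE_WIDTH = 256   # proxy width at which hyperparameters were tuned
WIDTH = 4096       # target width
BASE_LR = 1e-3     # Adam learning rate found on the proxy model

width_mult = WIDTH / BASE_WIDTH

# Hidden (matrix-like) weight: init variance ~ 1/fan_in, so std ~ fan_in ** -0.5.
hidden = nn.Linear(WIDTH, WIDTH)
nn.init.normal_(hidden.weight, mean=0.0, std=hidden.weight.shape[1] ** -0.5)
nn.init.zeros_(hidden.bias)

# Under μP with Adam, hidden-weight learning rates shrink as 1/width relative
# to the base model; biases (and input weights) would keep BASE_LR in their
# own parameter group, and output weights would be downscaled further (omitted).
optimizer = torch.optim.Adam(
    [{"params": [hidden.weight], "lr": BASE_LR / width_mult}]
)
```

The essential point is that initialization scale and per-layer learning rates co-vary with width, which is what lets a hyperparameter sweep on the small proxy carry over to the full-size model.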


More details on implementation coming soon.