Gradient-based Optimization Algorithms: Overview

Vanilla SGD has not been used for state-of-the-art LLM training for many years. In this post, we’ll discuss the technical details and derivation behind modern SGD variants such as Adam.

| Optimizer | Optimizer type | Used in (paper, year) |
| --- | --- | --- |
| Adam | Adaptive first-order gradient optimizer | Attention Is All You Need / Transformer (2017); GPT-1 (2018); GPT-3 (2020) |
| Adam with weight decay / Adam-style WD | Adaptive optimizer with regularization / decay | BERT (2018); RoBERTa (2019); Megatron-LM (2019) |
| AdamW | Decoupled weight-decay Adam | Chinchilla (2022); LLaMA (2023); Llama 2 pretraining (2023); OLMo (2024); DeepSeek-V3 (2024/2025) |
| Adafactor | Memory-efficient adaptive optimizer | T5 (2019/2020); PaLM (2022); FLAN / instruction-tuned T5-family work (2022) |

SGD

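SGD has a number of issues, chief among them that it is reactive: each update depends only on the current minibatch gradient, with no memory of earlier gradients and no per-parameter scaling. Karpathy has an older blog post that covers this nicely.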

Adam

Adam (Kingma & Ba, 2014) is basically SGD with two running statistics of the gradient:

  1. a running average of the gradient direction, like momentum;
  2. a running average of squared gradients, used to scale the step size per parameter.

The update consists of just 5 lines, after initializing \(m_0\) and \(v_0\) to zero-filled vectors:

\[\begin{align} m_t &\leftarrow \beta_1 m_{t-1} + (1 - \beta_1) g_t \\ v_t &\leftarrow \beta_2 v_{t-1} + (1 - \beta_2) g_t^2 \\ \hat{m}_t &\leftarrow \frac{m_t}{1 - \beta_1^t} \\ \hat{v}_t &\leftarrow \frac{v_t}{1 - \beta_2^t} \\ \theta_t &\leftarrow \theta_{t-1} - \alpha \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} \end{align}\]

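Here \(g_t\) is the gradient at step \(t\), and \(m_t\) and \(v_t\) are vectors with the same shape as \(\theta\); the square, square root, and division are all applied elementwise. Now, we’ll examine the derivation and intuition for each line.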

1. The first moment \(m_t\): “which direction have gradients been pointing?”

\[m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t\]

This is a moving average of gradients. \(\beta_1\) is often set to 0.9, meaning Adam keeps 90% of the previous moving average and mixes in 10% of the new gradient.

Instead of trusting only the current gradient \(g_t\), Adam remembers recent gradients. If gradients keep pointing in the same direction, \(m_t\) builds up. If gradients are noisy and alternate directions, \(m_t\) smooths them out.

So \(m_t\) is like momentum.

Intuition:

Current gradient says: go left.
Recent gradients also said: go left.
Adam says: okay, confidently go left.

But if the gradients are inconsistent:

Step 1: go left.
Step 2: go right.
Step 3: go left.
Step 4: go right.
Adam says: this direction is noisy, slow down.
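
To make the two cases concrete, here is a minimal sketch that runs the first-moment update on a consistent and an alternating sequence of scalar gradients. The helper name `ema_of_gradients` and the sequence length of 20 are illustrative assumptions, not part of Adam itself.

```python
import numpy as np

def ema_of_gradients(grads, beta1=0.9):
    """Return the (uncorrected) first moment m_t after each gradient in `grads`."""
    m = 0.0
    history = []
    for g in grads:
        m = beta1 * m + (1 - beta1) * g
        history.append(m)
    return np.array(history)

# Hypothetical scalar gradients: -1 means "go left", +1 means "go right".
consistent = np.full(20, -1.0)             # left, left, left, ...
alternating = np.array([-1.0, 1.0] * 10)   # left, right, left, right, ...

m_consistent = ema_of_gradients(consistent)
m_alternating = ema_of_gradients(alternating)

print("consistent:  final m =", m_consistent[-1])
print("alternating: final m =", m_alternating[-1])
```

With consistent gradients, \(m_t\) grows toward the full gradient value; with alternating gradients, the contributions largely cancel and \(m_t\) stays close to zero, which is what shrinks the step.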

2. The second moment \(v_t\): “how large have gradients been for this parameter?”

\[v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2\]

This tracks a moving average of squared gradients. \(\beta_2\) is often set to 0.999.

Squaring removes sign, so Adam is no longer asking “which direction?” It is asking “how big are the gradients usually?”

If a parameter has consistently huge gradients, \(v_t\) becomes large. If a parameter has tiny gradients, \(v_t\) stays small.

This is used to normalize the update:

\[\frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}\]

So parameters with large gradients get smaller effective steps, and parameters with small gradients get larger effective steps.

Intuition:

This parameter usually has huge gradients.
Don't take a massive step just because the raw gradient is big.
Scale it down.

And:

This parameter usually has tiny gradients.
Don't ignore it completely.
Scale it up relative to its usual size.
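
As a rough illustration, the sketch below tracks the second moment for two hypothetical parameters whose gradients are random noise at very different scales (100 and 0.01). The helper `second_moment`, the noise scales, and the 1000-step horizon are assumptions made for the example.

```python
import numpy as np

def second_moment(grads, beta2=0.999):
    """Bias-corrected running average of squared gradients after the last step."""
    v = 0.0
    t = 0
    for g in grads:
        t += 1
        v = beta2 * v + (1 - beta2) * g ** 2
    return v / (1 - beta2 ** t)

rng = np.random.default_rng(0)
big = 100.0 * rng.standard_normal(1000)    # parameter with huge gradients
small = 0.01 * rng.standard_normal(1000)   # parameter with tiny gradients

v_big = second_moment(big)
v_small = second_moment(small)

# sqrt(v) is the quantity that ends up in Adam's denominator.
print("sqrt(v) for big-gradient parameter:  ", np.sqrt(v_big))
print("sqrt(v) for small-gradient parameter:", np.sqrt(v_small))
```

The square root of each estimate lands near that parameter's typical gradient magnitude, regardless of sign, so dividing by it puts both parameters on a comparable footing.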

3. Why divide by \(\sqrt{v_t}\)?

Because \(v_t\) tracks squared gradients, its square root is back on the scale of the gradient itself: if \(v_t \approx g_t^2\), then \(\sqrt{v_t} \approx |g_t|\) (elementwise).

So Adam roughly normalizes the gradient by its recent magnitude. That means the update is less sensitive to the absolute scale of the gradient.

For example, suppose two parameters have gradients: \(g_1 = 100, \qquad g_2 = 0.01\)

Vanilla SGD would update parameter 1 much more aggressively than parameter 2. Adam says: “relative to each parameter’s usual gradient scale, how strong is this update?”

That is why Adam is helpful when different parameters have very different gradient magnitudes.
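
The example above can be checked numerically. This sketch assumes the two gradients stay constant for 10 steps and uses the common defaults \(\alpha = 10^{-3}\), \(\beta_1 = 0.9\), \(\beta_2 = 0.999\), \(\epsilon = 10^{-8}\); those choices are assumptions for illustration.

```python
import numpy as np

lr, beta1, beta2, eps = 1e-3, 0.9, 0.999, 1e-8
g = np.array([100.0, 0.01])  # the two per-parameter gradients, held constant

m = np.zeros_like(g)
v = np.zeros_like(g)
for t in range(1, 11):
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)

sgd_step = lr * g                                  # proportional to the raw gradient
adam_step = lr * m_hat / (np.sqrt(v_hat) + eps)    # normalized per parameter

print("SGD step sizes: ", sgd_step)
print("Adam step sizes:", adam_step)
```

Under these assumptions the SGD steps differ by a factor of 10,000, while both Adam steps come out at roughly \(\alpha\), because each gradient is divided by its own recent magnitude.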

4. Why the bias correction?

At the beginning: \(m_0 = 0, \qquad v_0 = 0\), so early moving averages are biased toward zero.

For example, at \(t = 1\):

\[m_1 = \beta_1 \cdot 0 + (1 - \beta_1)g_1\]

If \(\beta_1 = 0.9\), then:

\[m_1 = 0.1 g_1\]

That is much smaller than the actual gradient. So Adam corrects it:

\[\hat{m}_1 = \frac{m_1}{1 - \beta_1^1} = \frac{0.1g_1}{0.1} = g_1\]

Same idea for \(v_t\):

\[\hat{v}_t = \frac{v_t}{1 - \beta_2^t}\]

This matters especially because \(\beta_2\) is usually very close to 1, like 0.999. Without bias correction, the second moment estimate would be tiny at the start, which could make early updates unstable or incorrectly scaled.
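
A small numeric sketch shows the size of the effect. It assumes a constant gradient of 1, so the true moments are both 1 and the ideal ratio \(m / \sqrt{v}\) is exactly 1; the chosen checkpoints are arbitrary.

```python
beta1, beta2 = 0.9, 0.999
g = 1.0  # constant gradient, so the ideal value of m / sqrt(v) is 1

m, v = 0.0, 0.0
rows = []
for t in range(1, 1001):
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    if t in (1, 10, 100, 1000):
        uncorrected = m / v ** 0.5          # what the step would use without correction
        corrected = m_hat / v_hat ** 0.5    # what Adam actually uses
        rows.append((t, m, v, uncorrected, corrected))
        print(f"t={t:4d}  m={m:.4f}  v={v:.4f}  "
              f"uncorrected={uncorrected:.4f}  corrected={corrected:.4f}")
```

In this setting the uncorrected ratio at \(t = 1\) is \(0.1 / \sqrt{0.001} \approx 3.16\), so the first step would be about three times too large, and it is still noticeably above 1 after 1000 steps because \(v_t\) warms up so slowly. The corrected ratio stays at 1 throughout.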

5. What Adam is doing in one sentence

Adam says:

Move in the direction of the smoothed gradient, but scale each parameter’s step by how large its gradients usually are.

More compactly:

\[\theta_t = \theta_{t-1} - \alpha \frac{\text{smoothed gradient}}{\text{smoothed gradient magnitude}}\]

6. Mental model

Vanilla SGD:

Take a step proportional to the current gradient.

Momentum:

Take a step in the direction gradients have consistently pointed.

RMSProp / adaptive scaling:

Shrink updates for parameters with large historical gradients.
Boost updates for parameters with small historical gradients.

Adam:

Momentum + adaptive per-parameter learning rates.

That is why Adam often works well out of the box: it smooths noisy gradients and automatically rescales updates parameter by parameter.
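
The first three rules in this mental model can be sketched as single-step functions. This is a simplified illustration, not a reference implementation: the momentum step uses the same moving-average form as \(m_t\) (other formulations accumulate the raw gradient instead), the RMSProp decay of 0.9 is an assumed default, and the function names are hypothetical.

```python
import numpy as np

def sgd_step(theta, g, lr):
    """Vanilla SGD: step proportional to the current gradient."""
    return theta - lr * g

def momentum_step(theta, g, m, lr, beta=0.9):
    """Momentum (moving-average form): step along the smoothed gradient."""
    m = beta * m + (1 - beta) * g
    return theta - lr * m, m

def rmsprop_step(theta, g, v, lr, beta=0.9, eps=1e-8):
    """RMSProp-style scaling: divide the gradient by its running RMS."""
    v = beta * v + (1 - beta) * g ** 2
    return theta - lr * g / (np.sqrt(v) + eps), v

theta = np.array([1.0])
g = np.array([2.0])

theta_sgd = sgd_step(theta, g, lr=0.1)
theta_mom, m = momentum_step(theta, g, m=np.zeros_like(theta), lr=0.1)
theta_rms, v = rmsprop_step(theta, g, v=np.zeros_like(theta), lr=0.1)
print(theta_sgd, theta_mom, theta_rms)
```

Adam combines the state of the last two: it keeps both the `m` buffer from the momentum step and the `v` buffer from the RMSProp-style step, and adds bias correction to each.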

import numpy as np


class Adam:
  def __init__(self, params, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """
    params: dict of parameter arrays, e.g. {"W": W, "b": b}
    lr: learning rate
    beta1: decay rate for first moment estimate
    beta2: decay rate for second moment estimate
    eps: small constant for numerical stability
    """
    self.params = params
    self.lr = lr
    self.beta1 = beta1
    self.beta2 = beta2
    self.eps = eps

    self.t = 0

    # First and second moment buffers
    self.m = {k: np.zeros_like(v) for k, v in params.items()}
    self.v = {k: np.zeros_like(v) for k, v in params.items()}

  def step(self, grads):
    """
    grads: dict of gradients with same keys as params
    """
    self.t += 1

    for k in self.params:
      g = grads[k]

      # Update biased first moment estimate
      self.m[k] = self.beta1 * self.m[k] + (1 - self.beta1) * g

      # Update biased second raw moment estimate
      self.v[k] = self.beta2 * self.v[k] + (1 - self.beta2) * (g ** 2)

      # Bias correction
      m_hat = self.m[k] / (1 - self.beta1 ** self.t)
      v_hat = self.v[k] / (1 - self.beta2 ** self.t)

      # Parameter update
      self.params[k] -= self.lr * m_hat / (np.sqrt(v_hat) + self.eps)

  def zero_grad(self):
    """
    Optional placeholder, useful if mirroring PyTorch style.
    In NumPy, gradients are usually recomputed each iteration.
    """
    pass

Example usage, fitting a small linear regression with the Adam class above:

import numpy as np

np.random.seed(0)

# Fake data: y = Xw + b + noise
N, D = 100, 3
X = np.random.randn(N, D)
true_w = np.array([[2.0], [-3.0], [1.0]])
true_b = 0.5
y = X @ true_w + true_b + 0.1 * np.random.randn(N, 1)

# Parameters
params = {
    "W": np.random.randn(D, 1),
    "b": np.zeros((1,))
}

optimizer = Adam(params, lr=0.05)

for step in range(1000):
    # Forward pass
    y_pred = X @ params["W"] + params["b"]

    # MSE loss
    loss = np.mean((y_pred - y) ** 2)

    # Backward pass
    dloss = 2 * (y_pred - y) / N
    grads = {
        "W": X.T @ dloss,
        "b": np.sum(dloss, axis=0)
    }

    # Adam update
    optimizer.step(grads)

    if step % 100 == 0:
        print(f"step {step}, loss = {loss:.6f}")

print("Learned W:", params["W"].ravel())
print("Learned b:", params["b"])

References.

  1. Diederik P. Kingma, Jimmy Ba. Adam: A Method for Stochastic Optimization. 2014.