Optimization Algorithms
Gradient-based Optimization Algorithms: Overview
Vanilla SGD has not been used for state-of-the-art LLM training for many years. In this post, we’ll discuss the technical details and derivation behind modern SGD variants such as Adam.
| Optimizer | Optimizer type | Papers / models (year) |
|---|---|---|
| Adam | Adaptive first-order gradient optimizer | Attention Is All You Need / Transformer (2017); GPT-1 (2018); GPT-3 (2020) |
| Adam with weight decay / Adam-style WD | Adaptive optimizer with regularization / decay | BERT (2018); RoBERTa (2019); Megatron-LM (2019) |
| AdamW | Decoupled weight-decay Adam | Chinchilla (2022); LLaMA (2023); Llama 2 pretraining (2023); OLMo (2024); DeepSeek-V3 (2024/2025) |
| Adafactor | Memory-efficient adaptive optimizer | T5 (2019/2020); PaLM (2022); FLAN / instruction-tuned T5-family work (2022) |
SGD
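SGD has a number of issues, the main one being that it is purely reactive: each step depends only on the current mini-batch gradient, with no memory of earlier gradients and a single learning rate shared by every parameter. Karpathy has an older blog post that covers this nicely.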
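A minimal sketch of what "reactive" looks like in practice. The function name `sgd_step`, the single scalar parameter, and the hand-picked alternating gradients are all illustrative assumptions, not taken from a real model:

```python
import numpy as np

def sgd_step(theta, grad, lr=0.1):
    """Vanilla SGD: the step depends only on the current gradient."""
    return theta - lr * grad

theta = np.array([0.0])
trajectory = []
# Hypothetical noisy gradients that flip sign every step.
for g in [1.0, -1.0, 1.0, -1.0]:
    theta = sgd_step(theta, np.array([g]))
    trajectory.append(float(theta[0]))

print(trajectory)  # the parameter zig-zags between -0.1 and 0.0
```

Because nothing is remembered between steps, every sign flip in the gradient turns into a full-size step in the opposite direction.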
Adam
Adam (Kingma & Ba, 2014) is basically SGD with two running statistics of the gradient:
- a running average of the gradient direction, like momentum;
- a running average of squared gradients, used to scale the step size per parameter.
The update consists of just 5 lines, after initializing \(m_0\) and \(v_0\) to zero-filled vectors:
\[\begin{align} m_t &\leftarrow \beta_1 m_{t-1} + (1 - \beta_1) g_t \\ v_t &\leftarrow \beta_2 v_{t-1} + (1 - \beta_2) g_t^2 \\ \hat{m}_t &\leftarrow \frac{m_t}{1 - \beta_1^t} \\ \hat{v}_t &\leftarrow \frac{v_t}{1 - \beta_2^t} \\ \theta_t &\leftarrow \theta_{t-1} - \alpha \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} \end{align}\]
Here \(g_t\) is the gradient at step \(t\), \(\alpha\) is the learning rate, \(\epsilon\) is a small constant for numerical stability, and \(m_t\) and \(v_t\) are vectors with the same shape as \(\theta\). The square, square root, and division are applied element-wise. Now, we’ll examine the derivation and intuition for each line.
1. The first moment (\(m_t\)): “which direction have gradients been pointing?”
\[m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t\]
This is a moving average of gradients. \(\beta_1\) is often set to 0.9, meaning Adam keeps about 90% of the previous momentum and mixes in 10% of the new gradient.
Instead of trusting only the current gradient \(g_t\), Adam remembers recent gradients. If gradients keep pointing in the same direction, \(m_t\) builds up. If gradients are noisy and alternate directions, \(m_t\) smooths them out.
So \(m_t\) is like momentum.
Intuition:
Current gradient says: go left.
Recent gradients also said: go left.
Adam says: okay, confidently go left.
But if the gradients are inconsistent:
Step 1: go left.
Step 2: go right.
Step 3: go left.
Step 4: go right.
Adam says: this direction is noisy, slow down.
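The two cases above can be checked numerically. This is a small sketch, assuming scalar gradients of magnitude 1 and ten steps; the helper name `first_moment` is hypothetical:

```python
import numpy as np

def first_moment(grads, beta1=0.9):
    """Bias-corrected moving average of a sequence of scalar gradients."""
    m = 0.0
    for g in grads:
        m = beta1 * m + (1 - beta1) * g
    return m / (1 - beta1 ** len(grads))

consistent = np.ones(10)                      # same direction every step
alternating = np.array([1.0, -1.0] * 5)       # direction flips every step

m_consistent = first_moment(consistent)
m_alternating = first_moment(alternating)
print(m_consistent, m_alternating)
```

With consistent gradients the smoothed value stays at full strength (about 1), while with alternating gradients most of the signal cancels and the smoothed value is close to zero.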
2. The second moment (\(v_t\)): “how large have gradients been for this parameter?”
\[v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2\]
This tracks a moving average of squared gradients. \(\beta_2\) is often set to 0.999.
Squaring removes sign, so Adam is no longer asking “which direction?” It is asking “how big are the gradients usually?”
If a parameter has consistently huge gradients, \(v_t\) becomes large. If a parameter has tiny gradients, \(v_t\) stays small.
This is used to normalize the update:
\[\frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}\]
So parameters with large gradients get smaller effective steps, and parameters with small gradients get larger effective steps.
Intuition:
This parameter usually has huge gradients.
Don't take a massive step just because the raw gradient is big.
Scale it down.
And:
This parameter usually has tiny gradients.
Don't ignore it completely.
Scale it up relative to its usual size.
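To see that \(\sqrt{\hat{v}_t}\) really does track "how big the gradients usually are", here is a sketch with synthetic gradients. The two gradient scales, the Gaussian noise, and the step count are assumptions made up for the demonstration:

```python
import numpy as np

rng = np.random.default_rng(0)
beta2 = 0.999
scales = np.array([100.0, 0.01])   # assumed typical gradient size per parameter

v = np.zeros(2)
steps = 2000
for t in range(1, steps + 1):
    g = scales * rng.standard_normal(2)        # synthetic noisy gradients
    v = beta2 * v + (1 - beta2) * g ** 2       # second-moment moving average

v_hat = v / (1 - beta2 ** steps)               # bias correction
rms = np.sqrt(v_hat)
print(rms)   # roughly recovers the two scales
```

Dividing by `rms` therefore shrinks the first parameter's update by a factor of about 100 and boosts the second by about 100.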
3. Why divide by \(\sqrt{v_t}\)?
Because \(v_t\) tracks squared gradients. If \(v_t \approx g_t^2\), then \(\sqrt{v_t} \approx |g_t|\) (the element-wise absolute value).
So Adam roughly normalizes the gradient by its recent magnitude. That means the update is less sensitive to the absolute scale of the gradient.
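For example, suppose two parameters have gradients \(g_1 = 100, \qquad g_2 = 0.01\) (here the subscript indexes the parameter, not the step).
Vanilla SGD would update parameter 1 much more aggressively than parameter 2: its step is 10,000 times larger. Adam says: “relative to each parameter’s usual gradient scale, how strong is this update?” If both gradients are typical for their parameter, both steps come out at roughly \(\alpha\).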
That is why Adam is helpful when different parameters have very different gradient magnitudes.
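A quick numerical check of this example, assuming both gradients stay constant for ten steps (a simplification chosen so the result is easy to read) and the usual default hyperparameters:

```python
import numpy as np

lr, beta1, beta2, eps = 1e-3, 0.9, 0.999, 1e-8
g = np.array([100.0, 0.01])        # constant gradient for each parameter

m = np.zeros(2)
v = np.zeros(2)
for t in range(1, 11):
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)

sgd_update = lr * g                                  # proportional to raw gradient
adam_update = lr * m_hat / (np.sqrt(v_hat) + eps)    # normalized per parameter
print(sgd_update, adam_update)
```

The SGD updates differ by four orders of magnitude, while both Adam updates are approximately `lr`.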
4. Why the bias correction?
At the beginning: \(m_0 = 0, \qquad v_0 = 0\), so early moving averages are biased toward zero.
For example, at \(t = 1\):
\[m_1 = \beta_1 \cdot 0 + (1 - \beta_1)g_1\]
If \(\beta_1 = 0.9\), then:
\[m_1 = 0.1 g_1\]
That is much smaller than the actual gradient. So Adam corrects it:
\[\hat{m}_1 = \frac{m_1}{1 - \beta_1^1} = \frac{0.1g_1}{0.1} = g_1\]
Same idea for \(v_t\):
\[\hat{v}_t = \frac{v_t}{1 - \beta_2^t}\]
This matters especially because \(\beta_2\) is usually very close to 1, like 0.999. Without bias correction, the second moment estimate would be tiny at the start (at \(t = 1\), \(v_1 = 0.001\, g_1^2\)), which could make early updates unstable or incorrectly scaled.
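The effect on the very first step can be worked out directly. This sketch uses an arbitrary first gradient `g1 = 2.0` (an assumption; any nonzero value gives the same ratios) and compares the normalized step with and without bias correction, ignoring \(\epsilon\):

```python
import math

beta1, beta2 = 0.9, 0.999
g1 = 2.0                                  # arbitrary first gradient

m1 = (1 - beta1) * g1                     # biased first moment
v1 = (1 - beta2) * g1 ** 2                # biased second moment

uncorrected = m1 / math.sqrt(v1)
corrected = (m1 / (1 - beta1)) / math.sqrt(v1 / (1 - beta2))
print(uncorrected, corrected)
```

Without correction the first step is scaled by \((1 - \beta_1) / \sqrt{1 - \beta_2} = \sqrt{10} \approx 3.16\) instead of 1, because \(v_1\) is shrunk more than \(m_1\).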
5. What Adam is doing in one sentence
Adam says:
Move in the direction of the smoothed gradient, but scale each parameter’s step by how large its gradients usually are.
More compactly:
\[\theta_t = \theta_{t-1} - \alpha \frac{\text{smoothed gradient}}{\text{smoothed gradient magnitude}}\]
6. Mental model
Vanilla SGD:
Take a step proportional to the current gradient.
Momentum:
Take a step in the direction gradients have consistently pointed.
RMSProp / adaptive scaling:
Shrink updates for parameters with large historical gradients.
Boost updates for parameters with small historical gradients.
Adam:
Momentum + adaptive per-parameter learning rates.
That is why Adam often works well out of the box: it smooths noisy gradients and automatically rescales updates parameter by parameter.
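The first three rules in this mental model can be sketched side by side. The formulations below are one common way to write each rule; the function names, the `state` dict, the hyperparameter values, and the toy quadratic loss are all assumptions for illustration:

```python
import numpy as np

def sgd(theta, g, state, lr=0.1):
    # Step proportional to the current gradient only.
    return theta - lr * g

def momentum(theta, g, state, lr=0.1, mu=0.9):
    # Accumulate a velocity from past gradients (heavy-ball form).
    state["vel"] = mu * state.get("vel", 0.0) - lr * g
    return theta + state["vel"]

def rmsprop(theta, g, state, lr=0.1, rho=0.9, eps=1e-8):
    # Divide by a running root-mean-square of past gradients.
    state["sq"] = rho * state.get("sq", 0.0) + (1 - rho) * g ** 2
    return theta - lr * g / (np.sqrt(state["sq"]) + eps)

# Toy loss: 0.5 * sum(c * theta^2), with very different curvature per parameter.
c = np.array([10.0, 0.1])
initial_loss = 0.5 * np.sum(c * np.ones(2) ** 2)

final_loss = {}
for name, rule in [("sgd", sgd), ("momentum", momentum), ("rmsprop", rmsprop)]:
    theta, state = np.array([1.0, 1.0]), {}
    for _ in range(100):
        theta = rule(theta, c * theta, state)   # gradient of the toy loss
    final_loss[name] = 0.5 * np.sum(c * theta ** 2)

print(initial_loss, final_loss)
```

Adam combines the velocity idea from `momentum` with the per-parameter scaling from `rmsprop`, and adds bias correction on top.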
```python
import numpy as np


class Adam:
    def __init__(self, params, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        """
        params: dict of parameter arrays, e.g. {"W": W, "b": b}
        lr: learning rate
        beta1: decay rate for first moment estimate
        beta2: decay rate for second moment estimate
        eps: small constant for numerical stability
        """
        self.params = params
        self.lr = lr
        self.beta1 = beta1
        self.beta2 = beta2
        self.eps = eps
        self.t = 0
        # First and second moment buffers
        self.m = {k: np.zeros_like(v) for k, v in params.items()}
        self.v = {k: np.zeros_like(v) for k, v in params.items()}

    def step(self, grads):
        """
        grads: dict of gradients with same keys as params
        """
        self.t += 1
        for k in self.params:
            g = grads[k]
            # Update biased first moment estimate
            self.m[k] = self.beta1 * self.m[k] + (1 - self.beta1) * g
            # Update biased second raw moment estimate
            self.v[k] = self.beta2 * self.v[k] + (1 - self.beta2) * (g ** 2)
            # Bias correction
            m_hat = self.m[k] / (1 - self.beta1 ** self.t)
            v_hat = self.v[k] / (1 - self.beta2 ** self.t)
            # Parameter update (in place, so the caller's arrays change)
            self.params[k] -= self.lr * m_hat / (np.sqrt(v_hat) + self.eps)

    def zero_grad(self):
        """
        Optional placeholder, useful if mirroring PyTorch style.
        In NumPy, gradients are usually recomputed each iteration.
        """
        pass
```
```python
import numpy as np

np.random.seed(0)

# Fake data: y = Xw + b + noise
N, D = 100, 3
X = np.random.randn(N, D)
true_w = np.array([[2.0], [-3.0], [1.0]])
true_b = 0.5
y = X @ true_w + true_b + 0.1 * np.random.randn(N, 1)

# Parameters
params = {
    "W": np.random.randn(D, 1),
    "b": np.zeros((1,))
}

optimizer = Adam(params, lr=0.05)

for step in range(1000):
    # Forward pass
    y_pred = X @ params["W"] + params["b"]
    # MSE loss
    loss = np.mean((y_pred - y) ** 2)
    # Backward pass
    dloss = 2 * (y_pred - y) / N
    grads = {
        "W": X.T @ dloss,
        "b": np.sum(dloss, axis=0)
    }
    # Adam update
    optimizer.step(grads)
    if step % 100 == 0:
        print(f"step {step}, loss = {loss:.6f}")

print("Learned W:", params["W"].ravel())
print("Learned b:", params["b"])
```
References.
- Diederik P. Kingma, Jimmy Ba. Adam: A Method for Stochastic Optimization. 2014.