what does adam actually do?
so adam and adamw are adaptive learning rate optimization algorithms. they adapt the learning rate of each parameter individually based on that parameter’s gradient history.
the core math
adam tracks two things:
- momentum: exponential moving average of gradients
- variance: exponential moving average of squared gradients
the equations are:

m_t = β1 · m_{t-1} + (1 - β1) · g_t        (momentum)
v_t = β2 · v_{t-1} + (1 - β2) · g_t²       (variance)

where g_t is the current gradient.
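in code, those two updates are just a couple of lines. a minimal sketch with my own variable names, not any library’s API:

```python
def update_moments(m, v, g, beta1=0.9, beta2=0.999):
    """one update of a single parameter's momentum (m) and variance (v)."""
    m = beta1 * m + (1 - beta1) * g       # momentum: EMA of gradients
    v = beta2 * v + (1 - beta2) * g ** 2  # variance: EMA of squared gradients
    return m, v
```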
so what do momentum and variance actually mean?
momentum (m_t) = “which direction should i go?”
- it’s a weighted average of recent gradients
- tells you the overall trend direction
- filters out noise and random fluctuations
- like if gradients are [5, -2, 4, -1, 6], momentum might be ~3, saying “overall trend is positive”
variance (v_t) = “how confident am i about this direction?”
- tracks how much gradients have been jumping around
- high variance = gradients are chaotic, be careful
- low variance = gradients are stable, move confidently
- stable gradients [3, 3.1, 2.9, 3.2] → low variance → take big steps
- chaotic gradients [5, -2, 8, -3] → high variance → take small steps
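to make that concrete, here’s a small python sketch (my own helper, not a library function) that runs the variance EMA over those two example gradient lists and prints the resulting step scale 1/√v̂:

```python
def ema_of_squares(grads, beta2=0.999):
    """bias-corrected EMA of squared gradients (adam's v_hat) over a short list."""
    v = 0.0
    for t, g in enumerate(grads, start=1):
        v = beta2 * v + (1 - beta2) * g ** 2
    return v / (1 - beta2 ** t)

for name, grads in [("stable", [3, 3.1, 2.9, 3.2]), ("chaotic", [5, -2, 8, -3])]:
    v_hat = ema_of_squares(grads)
    # stable gradients -> smaller v_hat -> larger step scale, and vice versa
    print(f"{name}: v_hat ~ {v_hat:.1f}, step scale 1/sqrt(v_hat) ~ {1 / v_hat ** 0.5:.2f}")
```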
wait, doesn’t gradient descent already give us direction?
yeah it does, but the problem is gradients are noisy as hell.
raw gradients jump around:
- step 1: gradient = +5 (go right!)
- step 2: gradient = -3 (go left!)
- step 3: gradient = +4 (go right!)
- step 4: gradient = -1 (go left!)
following raw gradients = zigzagging back and forth
adam’s momentum smooths this:
- momentum = ~+1.25 (average says “go right, but gently”)
- instead of zigzag, you get smooth progress toward optimum
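here’s that smoothing in a tiny sketch (assuming β1 = 0.9, momentum starting at zero, and no bias correction, so the values start small; the point is the sign stops flipping):

```python
grads = [5, -3, 4, -1]   # the zigzagging raw gradients from above
m, beta1 = 0.0, 0.9
for g in grads:
    m = beta1 * m + (1 - beta1) * g
    # momentum stays positive every step, even though the raw gradient flips sign
    print(f"raw gradient {g:+d}, momentum {m:+.2f}")
```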
why are gradients noisy?
- mini-batches (different data each step)
- saddle points and plateaus
- numerical precision issues
so adam = “don’t trust one gradient, trust the pattern of recent gradients”
the full weight update equation
standard gd:

θ_t = θ_{t-1} - α · g_t
adam’s full update:

m̂_t = m_t / (1 - β1^t),   v̂_t = v_t / (1 - β2^t)        (bias correction)
θ_t = θ_{t-1} - α · m̂_t / (√v̂_t + ε)
so what’s happening:
- m̂_t (the smoothed gradient) replaces the raw gradient entirely
- dividing by √v̂_t scales the learning rate per parameter: noisy gradients (big v̂_t) → smaller steps, stable gradients (small v̂_t) → bigger steps
- adam uses momentum for direction and variance for step size
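putting the whole thing together, one adam step for a single parameter tensor might look like this (a sketch, not any library’s actual implementation):

```python
import numpy as np

def adam_step(theta, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """one adam update for parameter theta with gradient g; t is the step count starting at 1."""
    m = beta1 * m + (1 - beta1) * g           # update momentum
    v = beta2 * v + (1 - beta2) * g ** 2      # update variance
    m_hat = m / (1 - beta1 ** t)              # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```

the ε term just guards against dividing by zero when v̂_t is tiny.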
what are these beta values?
β1 and β2 are decay rates - they control how much history to remember.
β1 (momentum decay) = 0.9:
- controls how much old momentum vs new gradient
- 0.9 means “90% old momentum + 10% new gradient”
- higher = more smoothing, slower to change direction
β2 (variance decay) = 0.999:
- controls how much old variance vs new gradient²
- 0.999 means “99.9% old variance + 0.1% new gradient²”
- needs a longer memory because squared gradients are much noisier, so the variance estimate needs more history before it’s reliable
these values were found empirically to work well.
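a rough way to read those numbers: an EMA with decay β mostly remembers the last ~1/(1 - β) steps, so β1 = 0.9 averages over roughly the last 10 gradients while β2 = 0.999 averages over roughly the last 1000. a quick check:

```python
# effective memory of an EMA: roughly the last 1 / (1 - beta) values matter
for beta in (0.9, 0.999):
    print(f"beta = {beta}: remembers roughly the last {round(1 / (1 - beta))} steps")
```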
does adam track this for all parameters?
yes. every single parameter gets its own m_t and v_t.
so if you have 1 million parameters, adam tracks 2 million values (one m and one v for each parameter).
some parameters need to move fast (stable gradients), others need to move slow (noisy gradients). adam figures this out automatically for each one.
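as a sketch of what that state looks like (hypothetical layer shapes, plain numpy instead of a real framework):

```python
import numpy as np

# one m and one v per parameter tensor, each with exactly the parameter's shape
params = {"w1": np.zeros((784, 256)), "w2": np.zeros((256, 10))}
state = {name: {"m": np.zeros_like(p), "v": np.zeros_like(p)} for name, p in params.items()}

n_params = sum(p.size for p in params.values())
print(f"{n_params} parameters -> {2 * n_params} extra values for adam to track")
```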
what’s adamw then?
adamw fixes how weight decay is applied in adam.
original adam with weight decay (L2 mixed into the gradient, λ = weight decay coefficient):

g_t = ∇L(θ_{t-1}) + λ · θ_{t-1}        (then g_t goes through the usual m/v machinery)
adamw (decoupled weight decay):

θ_t = θ_{t-1} - α · m̂_t / (√v̂_t + ε) - α · λ · θ_{t-1}
the problem with original adam:
- weight decay gets mixed with adaptive learning rates
- parameters with large gradients (large v̂_t) end up with less effective regularization, because the decay term gets divided by √v̂_t along with everything else
adamw solution:
- apply weight decay directly to parameters (not gradients)
- every parameter gets the same relative weight decay
- regularization is independent of gradient magnitude
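side by side, the two updates differ only in where the decay term λ · θ enters (a sketch with my own variable names, not the exact pseudocode from the adamw paper):

```python
import numpy as np

def adam_l2_step(theta, g, m, v, t, lr, wd, beta1=0.9, beta2=0.999, eps=1e-8):
    g = g + wd * theta                          # decay folded into the gradient...
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2        # ...so it also gets rescaled by 1/sqrt(v_hat)
    m_hat, v_hat = m / (1 - beta1 ** t), v / (1 - beta2 ** t)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

def adamw_step(theta, g, m, v, t, lr, wd, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat, v_hat = m / (1 - beta1 ** t), v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta - lr * wd * theta, m, v        # decay applied to the weights directly
```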
adamw works much better, especially for large models like transformers. it’s now the standard choice.