what does adam actually do?
so adam and adamw are adaptive learning rate optimization algorithms. they adapt the learning rate of each parameter individually based on that parameter’s gradient history.
the core math
adam tracks two things:
- momentum: exponential moving average of gradients
- variance: exponential moving average of squared gradients
the equations are:

m_t = β1 · m_{t-1} + (1 - β1) · g_t        (momentum)
v_t = β2 · v_{t-1} + (1 - β2) · g_t²       (variance)

where g_t is the current gradient.
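in code, those two updates are just a couple of lines. a minimal sketch with my own variable names, not any library’s API:

```python
def update_moments(m, v, g, beta1=0.9, beta2=0.999):
    """one update of a single parameter's momentum (m) and variance (v)."""
    m = beta1 * m + (1 - beta1) * g       # momentum: EMA of gradients
    v = beta2 * v + (1 - beta2) * g ** 2  # variance: EMA of squared gradients
    return m, v
```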
so what do momentum and variance actually mean?
momentum (m_t) = “which direction should i go?”
- it’s a weighted average of recent gradients
- tells you the overall trend direction
- filters out noise and random fluctuations
- like if gradients are [5, -2, 4, -1, 6], momentum might be ~3, saying “overall trend is positive”
variance (v_t) = “how confident am i about this direction?”
- tracks how much gradients have been jumping around
- high variance = gradients are chaotic, be careful
- low variance = gradients are stable, move confidently
- stable gradients [3, 3.1, 2.9, 3.2] → low variance → take big steps
- chaotic gradients [5, -2, 8, -3] → high variance → take small steps
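to make that concrete, here’s a small python sketch (my own helper, not a library function) that runs the variance EMA over those two example gradient lists and prints the resulting step scale 1/√v̂:

```python
def ema_of_squares(grads, beta2=0.999):
    """bias-corrected EMA of squared gradients (adam's v_hat) over a short list."""
    v = 0.0
    for t, g in enumerate(grads, start=1):
        v = beta2 * v + (1 - beta2) * g ** 2
    return v / (1 - beta2 ** t)

for name, grads in [("stable", [3, 3.1, 2.9, 3.2]), ("chaotic", [5, -2, 8, -3])]:
    v_hat = ema_of_squares(grads)
    # stable gradients -> smaller v_hat -> larger step scale, and vice versa
    print(f"{name}: v_hat ~ {v_hat:.1f}, step scale 1/sqrt(v_hat) ~ {1 / v_hat ** 0.5:.2f}")
```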
wait, doesn’t gradient descent already give us direction?
yeah it does, but the problem is gradients are noisy as hell.
raw gradients jump around:
- step 1: gradient = +5 (go right!)
- step 2: gradient = -3 (go left!)
- step 3: gradient = +4 (go right!)
- step 4: gradient = -1 (go left!)
following raw gradients = zigzagging back and forth
adam’s momentum smooths this:
- momentum = ~+1.25 (average says “go right, but gently”)
- instead of zigzag, you get smooth progress toward optimum
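here’s that smoothing in a tiny sketch (assuming β1 = 0.9, momentum starting at zero, and no bias correction, so the values start small; the point is the sign stops flipping):

```python
grads = [5, -3, 4, -1]   # the zigzagging raw gradients from above
m, beta1 = 0.0, 0.9
for g in grads:
    m = beta1 * m + (1 - beta1) * g
    # momentum stays positive every step, even though the raw gradient flips sign
    print(f"raw gradient {g:+d}, momentum {m:+.2f}")
```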
why are gradients noisy?
- mini-batches (different data each step)
- saddle points and plateaus
- numerical precision issues
so adam = “don’t trust one gradient, trust the pattern of recent gradients”
the full weight update equation
standard gd:

θ_t = θ_{t-1} - α · g_t
adam’s full update:

m̂_t = m_t / (1 - β1^t),   v̂_t = v_t / (1 - β2^t)        (bias correction)
θ_t = θ_{t-1} - α · m̂_t / (√v̂_t + ε)
so what’s happening:
- m̂_t (the smoothed gradient) replaces the raw gradient entirely
- dividing by √v̂_t scales the learning rate per parameter: noisy gradients (big v̂_t) → smaller steps, stable gradients (small v̂_t) → bigger steps
- adam uses momentum for direction and variance for step size
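putting the whole thing together, one adam step for a single parameter tensor might look like this (a sketch, not any library’s actual implementation):

```python
import numpy as np

def adam_step(theta, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """one adam update for parameter theta with gradient g; t is the step count starting at 1."""
    m = beta1 * m + (1 - beta1) * g           # update momentum
    v = beta2 * v + (1 - beta2) * g ** 2      # update variance
    m_hat = m / (1 - beta1 ** t)              # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```

the ε term just guards against dividing by zero when v̂_t is tiny.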
what are these beta values?
β1 and β2 are decay rates - they control how much history to remember.
β1 (momentum decay) = 0.9:
- controls how much old momentum vs new gradient
- 0.9 means “90% old momentum + 10% new gradient”
- higher = more smoothing, slower to change direction
β2 (variance decay) = 0.999:
- controls how much old variance vs new gradient²
- 0.999 means “99.9% old variance + 0.1% new gradient²”
- needs a longer memory because squared gradients are much noisier, so the variance estimate needs more history before it’s reliable
these values were found empirically to work well.
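a rough way to read those numbers: an EMA with decay β mostly remembers the last ~1/(1 - β) steps, so β1 = 0.9 averages over roughly the last 10 gradients while β2 = 0.999 averages over roughly the last 1000. a quick check:

```python
# effective memory of an EMA: roughly the last 1 / (1 - beta) values matter
for beta in (0.9, 0.999):
    print(f"beta = {beta}: remembers roughly the last {round(1 / (1 - beta))} steps")
```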
does adam track this for all parameters?
yes. every single parameter gets its own m_t and v_t.
so if you have 1 million parameters, adam tracks 2 million values (one m and one v for each parameter).
some parameters need to move fast (stable gradients), others need to move slow (noisy gradients). adam figures this out automatically for each one.
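as a sketch of what that state looks like (hypothetical layer shapes, plain numpy instead of a real framework):

```python
import numpy as np

# one m and one v per parameter tensor, each with exactly the parameter's shape
params = {"w1": np.zeros((784, 256)), "w2": np.zeros((256, 10))}
state = {name: {"m": np.zeros_like(p), "v": np.zeros_like(p)} for name, p in params.items()}

n_params = sum(p.size for p in params.values())
print(f"{n_params} parameters -> {2 * n_params} extra values for adam to track")
```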
what’s adamw then?
adamw fixes how weight decay is applied in adam.
original adam with weight decay (L2 mixed into the gradient, λ = weight decay coefficient):

g_t = ∇L(θ_{t-1}) + λ · θ_{t-1}        (then g_t goes through the usual m/v machinery)
adamw (decoupled weight decay):

θ_t = θ_{t-1} - α · m̂_t / (√v̂_t + ε) - α · λ · θ_{t-1}
the problem with original adam:
- weight decay gets mixed with adaptive learning rates
- parameters with large gradients (large v̂_t) end up with less effective regularization, because the decay term gets divided by √v̂_t along with everything else
adamw solution:
- apply weight decay directly to parameters (not gradients)
- every parameter gets the same relative weight decay
- regularization is independent of gradient magnitude
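side by side, the two updates differ only in where the decay term λ · θ enters (a sketch with my own variable names, not the exact pseudocode from the adamw paper):

```python
import numpy as np

def adam_l2_step(theta, g, m, v, t, lr, wd, beta1=0.9, beta2=0.999, eps=1e-8):
    g = g + wd * theta                          # decay folded into the gradient...
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2        # ...so it also gets rescaled by 1/sqrt(v_hat)
    m_hat, v_hat = m / (1 - beta1 ** t), v / (1 - beta2 ** t)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

def adamw_step(theta, g, m, v, t, lr, wd, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat, v_hat = m / (1 - beta1 ** t), v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta - lr * wd * theta, m, v        # decay applied to the weights directly
```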
adamw works much better, especially for large models like transformers. it’s now the standard choice.