Initialization

First, to understand normalization layers, we need to understand how network inputs and weights are initialized, since every layer (not just the first) needs a stable input distribution.

Xavier (Glorot) initialization: scale weights so that Var(W) = 2 / (fan_in + fan_out), keeping activation variance roughly constant for linear/tanh units.

Kaiming (He) initialization: Var(W) = 2 / fan_in, which compensates for ReLU zeroing out half of the activations.

If weights are instead initialized to large values, activations and gradients grow with depth, so training behaves as if the learning rate were far too large.

Why do we want whitened data (linearly transformed to have mean 0 and unit variance)?

  • Why mean 0?

  • Why unit variance?

It has long been known (LeCun et al., 1998b; Wiesler & Ney, 2011) that network training converges faster if its inputs are whitened.

(1) In short, due to the limited numerical precision in the network. By normalizing, you make the bias-fitting task much easier (the bias is close to 0), and at a minimum the “usable precision” of the float parameters can be meaningfully spent fitting the data.
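A minimal NumPy sketch of the weak form of whitening used above, i.e. per-feature standardization to zero mean and unit variance (no decorrelation step); the array names are only for this illustration:

```python
import numpy as np

# Per-feature standardization: zero mean, unit variance.
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(10_000, 8))  # raw inputs with a large offset

mu = X.mean(axis=0)
sigma = X.std(axis=0)
X_std = (X - mu) / (sigma + 1e-8)  # small eps guards against constant features

print(X_std.mean(axis=0).round(3))  # ~0 for every feature
print(X_std.std(axis=0).round(3))   # ~1 for every feature
```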

What if we zero-initialize all layer weights? -> In backprop, gradients are multiplied by zero weights and zero activations, so they vanish and the network never moves.

What if we initialize all weights to the same nonzero value? -> Every neuron in a layer computes the same output and receives the same gradient, so the weights stay identical; the extra parameters are redundant because the symmetry is never broken.
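As a quick check on the claims above, here is a minimal NumPy sketch (function names like activation_stds are just for this illustration) comparing Xavier, Kaiming, and a deliberately oversized Gaussian init on a stack of linear + ReLU layers:

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier(fan_in, fan_out):
    # Glorot/Xavier: Var(W) = 2 / (fan_in + fan_out)
    return rng.normal(0.0, np.sqrt(2.0 / (fan_in + fan_out)), size=(fan_in, fan_out))

def kaiming(fan_in, fan_out):
    # He/Kaiming: Var(W) = 2 / fan_in, compensating for ReLU zeroing half the units
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))

def oversized(fan_in, fan_out):
    # Deliberately large init to show the instability mentioned above
    return rng.normal(0.0, 1.0, size=(fan_in, fan_out))

def activation_stds(init_fn, depth=10, width=256, n=1024):
    """Push Gaussian inputs through `depth` linear + ReLU layers, return per-layer std."""
    x = rng.standard_normal((n, width))
    stds = []
    for _ in range(depth):
        x = np.maximum(x @ init_fn(width, width), 0.0)  # linear layer followed by ReLU
        stds.append(float(x.std()))
    return stds

for name, fn in [("xavier", xavier), ("kaiming", kaiming), ("oversized", oversized)]:
    print(f"{name:>9}:", " ".join(f"{s:.2g}" for s in activation_stds(fn)))
# Kaiming keeps the activation scale roughly constant under ReLU, Xavier decays
# steadily, and the oversized init explodes within a few layers.
```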

Normalization layers

(2) Internal “covariate shift”: the distribution of each layer’s inputs (network activations) changes during training as the parameters of the previous layers change. See Ioffe & Szegedy, 2015.

  • BatchNorm -> fixes the means and variances of layer inputs by normalizing each activation and then applying a learned scale and shift (ε for numerical stability):
\[\hat{x}^{(k)} = \frac{x^{(k)} - \mathrm{E}[x^{(k)}]}{\sqrt{\mathrm{Var}[x^{(k)}] + \epsilon}}, \qquad y^{(k)} = \gamma^{(k)} \hat{x}^{(k)} + \beta^{(k)}\]
  • GroupNorm -> normalizes over groups of channels within each sample, independent of batch size.

  • LayerNorm -> normalizes over the feature dimension within each sample; the standard choice in transformers.

[ADD GRAPHIC]

Why no BatchNorm in transformers? Likely because batch statistics are unreliable over variable-length, padded sequences (and small per-device batches), so LayerNorm, which does not depend on the batch, is used instead.

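Minimal NumPy sketches of the three normalization layers above, training-mode forward pass only (no running statistics); the learned γ, β from the equation are shown for BatchNorm, and an NCHW activation layout is assumed for BatchNorm/GroupNorm:

```python
import numpy as np

# LayerNorm and GroupNorm apply the same learned scale/shift in practice;
# it is omitted here to keep the sketch short.

def batch_norm(x, gamma, beta, eps=1e-5):
    # Normalize each channel over the batch and spatial dims (N, H, W).
    mean = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

def layer_norm(x, eps=1e-5):
    # Normalize each sample over its feature dimension.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def group_norm(x, num_groups=8, eps=1e-5):
    # Normalize each sample over groups of channels (plus spatial dims).
    n, c, h, w = x.shape
    xg = x.reshape(n, num_groups, c // num_groups, h, w)
    mean = xg.mean(axis=(2, 3, 4), keepdims=True)
    var = xg.var(axis=(2, 3, 4), keepdims=True)
    return ((xg - mean) / np.sqrt(var + eps)).reshape(n, c, h, w)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 16, 8, 8))
gamma = np.ones((1, 16, 1, 1))
beta = np.zeros((1, 16, 1, 1))
print(batch_norm(x, gamma, beta).mean(axis=(0, 2, 3)).round(3))       # ~0 per channel
print(layer_norm(rng.standard_normal((4, 32))).std(axis=-1).round(3))  # ~1 per sample
print(group_norm(x).reshape(4, 8, -1).std(axis=-1).round(3))           # ~1 per (sample, group)
```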

The weights and biases are represented with limited precision, so you are limited in essentially the number of significant digits of any quantity you can reliably fit.

Layers & Numerical Precision

Which layers need the highest precision?

  • Softmax / cross-entropy -> yes: exponentials and logarithms overflow or underflow easily in low precision (see the float16 sketch below).

  • Norm layers -> yes: the mean/variance reductions and the rsqrt are sensitive to rounding.

  • MatMul -> no: fp16/bf16 is usually fine, typically with fp32 accumulation.

  • ReLU / Leaky ReLU -> no: elementwise comparisons and scaling are precision-insensitive.
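A small float16 sketch of the softmax issue (the logits are just illustrative values):

```python
import numpy as np

# In float16, exp() overflows for inputs above ~11 (float16 max is 65504), so a
# naive softmax produces NaNs; subtracting the max first is mathematically
# identical but stays finite. (NumPy also emits overflow/invalid warnings here.)
logits = np.array([5.0, 20.0, 30.0], dtype=np.float16)

naive = np.exp(logits) / np.exp(logits).sum()   # exp(20), exp(30) -> inf; inf/inf -> nan
shifted = np.exp(logits - logits.max())
stable = shifted / shifted.sum()

print(naive)   # [0. nan nan]
print(stable)  # ~[0. 4.5e-05 1.0]
```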

References

  1. Sergey Ioffe, Christian Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. ICML 2015. [PDF].

  2. Yann LeCun, Léon Bottou, Genevieve B. Orr, Klaus-Robert Müller. Efficient BackProp. In Neural Networks: Tricks of the Trade, 1998.

  3. Simon Wiesler, Hermann Ney. A Convergence Analysis of Log-Linear Training. NIPS 2011.