Normalization Layers for Deep Learning
Table of Contents:
- Initialization
- Normalization Layers
- Layers & Numerical Precision
- References
Initialization
First, to understand normalization layers, we need to understand weight initialization, since every layer (not just the first) needs its inputs to follow a stable distribution.
- Xavier (Glorot) initialization
- Kaiming (He) initialization (see the NumPy sketch of both below)
- If weights are initialized to values that are too large, training behaves as if the learning rate were too large: activations and gradients blow up.
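As a rough sketch (the function names and the uniform/normal variants are my own choices; see the original papers for the exact recommendations):

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_uniform(fan_in, fan_out):
    # Glorot & Bengio (2010): scale by fan_in + fan_out so the variance of the
    # activations stays roughly constant in both the forward and backward pass.
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

def kaiming_normal(fan_in, fan_out):
    # He et al. (2015): variance 2 / fan_in compensates for ReLU
    # zeroing out roughly half of the activations.
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))

W_tanh = xavier_uniform(512, 256)   # typical choice for tanh/sigmoid layers
W_relu = kaiming_normal(512, 256)   # typical choice for ReLU layers
print(W_tanh.std(), W_relu.std())   # ≈ sqrt(2/768) and sqrt(2/512)
```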
Why do we want whitened inputs (linearly transformed to have mean 0 and unit variance)?
- Why mean 0?
- Why unit variance?
It has long been known (LeCun et al., 1998b; Wiesler & Ney, 2011) that network training converges faster if its inputs are whitened.
(1) In short, because of the limited numerical precision in the network. By normalizing, you make the bias-fitting task much easier (the bias stays close to 0), and the “usable precision” of the floating-point parameters can be meaningfully spent fitting the data rather than absorbing a large offset.
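A minimal illustration of the standardization half of this (full whitening would also decorrelate the features), assuming a design matrix X of shape (num_examples, num_features):

```python
import numpy as np

X = np.random.default_rng(1).normal(loc=50.0, scale=10.0, size=(1000, 3))

mu = X.mean(axis=0)
sigma = X.std(axis=0) + 1e-8        # epsilon guards against constant features
X_std = (X - mu) / sigma            # zero mean, unit variance per feature

print(X_std.mean(axis=0))           # ~0 for every feature
print(X_std.std(axis=0))            # ~1 for every feature
```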
What if we zero-initialize all layer weights? → Activations become zero, and backprop multiplies by those zero activations and weights, so the gradients are 0 and nothing is learned.
What if we initialize all weights to the same value? → By symmetry, every weight receives the same gradient and they all stay identical, so the extra parameters are redundant (see the numerical check below).
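A quick numerical check of the symmetry argument, using a toy two-layer ReLU network with a squared-error loss (all names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=(8, 4))          # batch of 8 inputs, 4 features
y = rng.normal(size=(8, 1))          # regression targets

W1 = np.full((4, 3), 0.5)            # every hidden weight identical
W2 = np.full((3, 1), 0.5)

h = np.maximum(0.0, x @ W1)          # ReLU hidden layer
pred = h @ W2
d_pred = 2.0 * (pred - y) / len(x)   # gradient of the mean squared error

dW2 = h.T @ d_pred
dW1 = x.T @ ((d_pred @ W2.T) * (h > 0))

# Every hidden unit computed the same value, so every column of dW1 (and every
# row of dW2) is identical: the units can never differentiate from each other.
print(np.allclose(dW1, dW1[:, :1]))  # True
print(np.allclose(dW2, dW2[:1]))     # True
```

Re-running this with W1 = W2 = 0 makes both gradients exactly zero, which is the zero-initialization failure mode above.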
Normalization Layers
(2) Internal “covariate shift”: the distribution of each layer’s inputs (the network activations) changes during training as the parameters of the previous layers change. See Ioffe & Szegedy, 2015.
- BatchNorm → fixes the means and variances of layer inputs by normalizing each channel over the batch (and spatial) dimensions.
- GroupNorm → normalizes each example over groups of channels, independent of the batch.
- LayerNorm → normalizes each example over all of its features, independent of the batch.
[ADD GRAPHIC]
Why no BatchNorm in transformers? Likely because padding and variable lengths along the sequence dimension make the batch statistics unreliable.
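A toy check of that concern (the zero-padding convention on the time axis is an assumption made for the sake of the example):

```python
import numpy as np

rng = np.random.default_rng(4)
# Two "sentences" of different lengths embedded in a (batch, seq, feat) tensor.
real = [rng.normal(size=(5, 8)), rng.normal(size=(2, 8))]
padded = np.zeros((2, 5, 8))
for i, seq in enumerate(real):
    padded[i, : len(seq)] = seq

# BatchNorm-style statistics over (batch, seq) include the zero padding...
bn_mean = padded.mean(axis=(0, 1))
# ...so they differ from the statistics of the real tokens alone.
true_mean = np.concatenate(real).mean(axis=0)
print(np.abs(bn_mean - true_mean).max())   # clearly non-zero

# LayerNorm normalizes each token independently (over the feature axis),
# so padded positions never contaminate the statistics of real tokens.
```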
NumPy implementations:
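Minimal forward-pass sketches of the three layers, inference-style only (no learned running statistics, no backward pass), assuming NCHW activations of shape (N, C, H, W); the function names and the per-channel gamma/beta shapes are my own conventions:

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # Normalize over (N, H, W) for each channel: the statistics are shared
    # across the batch, which is why BatchNorm depends on batch composition.
    mean = x.mean(axis=(0, 2, 3), keepdims=True)     # shape (1, C, 1, 1)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize over (C, H, W) for each example: no batch statistics at all,
    # which is why it is the default choice in transformers.
    mean = x.mean(axis=(1, 2, 3), keepdims=True)     # shape (N, 1, 1, 1)
    var = x.var(axis=(1, 2, 3), keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def group_norm(x, gamma, beta, num_groups, eps=1e-5):
    # Normalize over groups of channels for each example; reduces to LayerNorm
    # when num_groups == 1 and to InstanceNorm when num_groups == C.
    n, c, h, w = x.shape
    xg = x.reshape(n, num_groups, c // num_groups, h, w)
    mean = xg.mean(axis=(2, 3, 4), keepdims=True)
    var = xg.var(axis=(2, 3, 4), keepdims=True)
    x_hat = ((xg - mean) / np.sqrt(var + eps)).reshape(n, c, h, w)
    return gamma * x_hat + beta

# Toy usage: gamma/beta are learnable scale/shift parameters, broadcast per channel.
x = np.random.default_rng(3).normal(size=(8, 16, 4, 4))
gamma, beta = np.ones((1, 16, 1, 1)), np.zeros((1, 16, 1, 1))
print(batch_norm(x, gamma, beta).std())                  # ~1
print(layer_norm(x, gamma, beta).std())                  # ~1
print(group_norm(x, gamma, beta, num_groups=4).std())    # ~1
```

In transformers, LayerNorm is usually applied over just the last axis of (batch, seq_len, d_model) activations, but the idea is the same.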
Layers & Numerical Precision
The weights and biases are represented with limited precision, so you are limited in essentially the number of digits of a quantity you can reliably predict.
Which layers need the highest precision?
- Softmax / cross-entropy? → Yes? (exponentials and logs overflow/underflow easily; see the float16 sketch below)
- Norm layers? → Yes? (mean/variance reductions accumulate rounding error)
- MatMul? → No? (commonly run in fp16/bf16, often with fp32 accumulation)
- ReLU? → No? Leaky ReLU? → No? (elementwise, largely insensitive to precision)
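A small float16 illustration of the softmax point (the dtype choice just exaggerates an effect that also exists, more mildly, in other low-precision formats):

```python
import numpy as np

logits = np.array([10.0, 11.0, 12.0], dtype=np.float16)

# Naive softmax in float16: exp(12) ≈ 162755 overflows the float16 max (~65504).
naive = np.exp(logits) / np.exp(logits).sum()
print(naive)                                 # [0., 0., nan] -- useless

# Standard fix: subtract the max and accumulate in a wider dtype.
shifted = logits.astype(np.float32) - logits.max()
stable = np.exp(shifted) / np.exp(shifted).sum()
print(stable)                                # ~[0.09, 0.245, 0.665]
```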
References
1. Sergey Ioffe, Christian Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. ICML 2015. [PDF]
2. Yann LeCun, Léon Bottou, Genevieve B. Orr, Klaus-Robert Müller. Efficient BackProp. In Neural Networks: Tricks of the Trade, Springer, 1998.
3. Simon Wiesler, Hermann Ney. A Convergence Analysis of Log-Linear Training. NIPS 2011.