Improving the batch normalization method
There is a recently introduced method called batch normalization ("batchnorm") [1]. This method is effective at reducing training time, and it also improves the performance and stability of neural networks. It works well for feed-forward neural networks trained with large mini-batches of independent samples. However, because its statistics are computed per batch, it cannot be applied directly to recurrent neural networks. The problem, then, is what to do when we want to train a recurrent neural network, use a small mini-batch size, or train on dependent samples. In this paper we address this problem.
We know that normalizing the inputs of a neural network helps it learn better. Consider, for example, training a neural network to recognize cat pictures: the quality of the training pictures may vary, since some may come from good cameras, some from mobile phones, and some may be noisy. The distributions of the input pictures are therefore not the same. By normalizing the input data, you can reduce these effects and make the neural network fit the data better.
The idea of batch normalization is that you not only standardize the input data but treat each layer as the input of a sub-network and normalize it as well, because the distribution of each layer's input changes as the parameters of the preceding layers change. This makes training slower and harder: we must initialize the parameters carefully and use a smaller learning rate [3]. In a feed-forward neural network, batch normalization stores the statistics for each layer separately. For recurrent neural networks, however, whose inputs are sequences of varying length, applying batch normalization would require different statistics for different time-steps, so batch normalization cannot be used there directly. In this paper, we review a method called layer normalization which solves this problem [2].
Consider a node x, and suppose it depends on other parameters that are used to compute it. When those parameters change, the distribution of x also changes; this phenomenon is called covariate shift. We can address it by normalizing the values of this node over a mini-batch. So consider a mini-batch $B = \{x_1, \ldots, x_m\}$; we then normalize these values using the following equations [1]:

$$\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i, \qquad \sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m}(x_i - \mu_B)^2$$

$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad y_i = \gamma\,\hat{x}_i + \beta$$
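As a minimal sketch, the per-mini-batch normalization described above can be written in NumPy as follows (the function name and shapes are illustrative, not from the original papers):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch dimension (axis 0)."""
    mu = x.mean(axis=0)                    # per-feature mini-batch mean
    var = x.var(axis=0)                    # per-feature mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # standardized activations
    return gamma * x_hat + beta            # learned scale and shift

x = np.random.randn(32, 100)               # mini-batch of 32 samples, 100 features
y = batchnorm_forward(x, gamma=np.ones(100), beta=np.zeros(100))
```

With gamma = 1 and beta = 0, each feature of the output has (approximately) zero mean and unit variance over the mini-batch.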
Here ε is a small constant used to avoid division by zero, and γ and β are parameters learned along with the other parameters of the network. Notice that we are estimating the expected data statistics using the mean and standard deviation of the current mini-batch. This puts constraints on the size of the mini-batch and makes the method hard to apply to recurrent neural networks [2]; we address this problem with a method called layer normalization [2]. Furthermore, batch normalization processes one mini-batch at a time, but at test time you probably want to process the data one example at a time, so we need a different way to compute μ and σ. What we do is estimate μ and σ using an exponentially weighted average over mini-batches. As a result, the activations are computed in different ways at training and test time, and the model's layer inputs depend on all mini-batches; we address this second problem with a method called batch renormalization [3].
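The train/test asymmetry described above can be sketched as a small class that uses mini-batch statistics during training while maintaining an exponentially weighted average for inference (class and attribute names are illustrative; the momentum value is an assumption):

```python
import numpy as np

class BatchNorm:
    """Mini-batch statistics at training time; EMA statistics at test time."""
    def __init__(self, dim, momentum=0.9, eps=1e-5):
        self.gamma, self.beta = np.ones(dim), np.zeros(dim)
        self.running_mu, self.running_var = np.zeros(dim), np.ones(dim)
        self.momentum, self.eps = momentum, eps

    def forward(self, x, training=True):
        if training:
            mu, var = x.mean(axis=0), x.var(axis=0)
            # exponentially weighted average of the statistics over mini-batches
            self.running_mu = self.momentum * self.running_mu + (1 - self.momentum) * mu
            self.running_var = self.momentum * self.running_var + (1 - self.momentum) * var
        else:
            # inference uses the averaged statistics, one example at a time if needed
            mu, var = self.running_mu, self.running_var
        return self.gamma * (x - mu) / np.sqrt(var + self.eps) + self.beta
```

Note the asymmetry: the training-time output of a given example depends on the other examples in its mini-batch, while the test-time output does not.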
3. Layer normalization
In the layer normalization method we compute μ and σ over the hidden units in each layer; hence we have [2]:

$$\mu = \frac{1}{H}\sum_{i=1}^{H} a_i, \qquad \sigma = \sqrt{\frac{1}{H}\sum_{i=1}^{H}(a_i - \mu)^2}$$
Note that H in these formulas is the number of hidden units in the layer. Here we compute μ and σ over the layer, not over the mini-batch, so all the hidden units in a layer share the same μ and σ.
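A minimal NumPy sketch of these equations: the only change from batch normalization is the axis over which the statistics are taken (the function name is illustrative):

```python
import numpy as np

def layernorm_forward(x, gamma, beta, eps=1e-5):
    """Normalize over the H hidden units of each sample (axis 1), not the batch."""
    mu = x.mean(axis=1, keepdims=True)   # one mean per sample, shared by its H units
    var = x.var(axis=1, keepdims=True)   # one variance per sample
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

# Works even for a "mini-batch" of a single sample, unlike batch normalization.
x = np.random.randn(1, 1000)
y = layernorm_forward(x, gamma=np.ones(1000), beta=np.zeros(1000))
```

Because the statistics depend only on the current sample, the same computation applies at every time-step of a recurrent network and at any mini-batch size.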
4. Batch Renormalization
In batch normalization, it can be observed that the activations are computed in different ways at training and test time. Consider a node x, let μ and σ be estimates of its mean and standard deviation computed with an exponentially weighted average over all mini-batches, and let $\mu_B$ and $\sigma_B$ be the mini-batch statistics. We then have [3]:

$$\frac{x - \mu}{\sigma} = \frac{x - \mu_B}{\sigma_B}\cdot r + d, \qquad \text{where } r = \frac{\sigma_B}{\sigma}, \quad d = \frac{\mu_B - \mu}{\sigma}$$
Now if $\mu_B = \mu$ and $\sigma_B = \sigma$, then $r = 1$ and $d = 0$. In batch norm we in effect assume that r and d are the constants r = 1, d = 0. In batch renormalization we drop this assumption: at each step we estimate their values and treat them as constants in the gradient computation. Algorithm 1 presents the batch renormalization method [3].
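The correction can be sketched as follows: r and d are computed from the mini-batch and running statistics, and the mini-batch-normalized value is mapped onto the normalization by the running statistics. This is a forward-pass sketch only (names are illustrative); in an autodiff framework, r and d would be wrapped in a stop-gradient so no gradient flows through them, as Algorithm 1 specifies.

```python
import numpy as np

def batch_renorm_correction(x, running_mu, running_sigma, eps=1e-5):
    """Compute r and d so that ((x - mu_B) / sigma_B) * r + d
    equals (x - running_mu) / running_sigma."""
    mu_b = x.mean(axis=0)
    sigma_b = np.sqrt(x.var(axis=0) + eps)
    r = sigma_b / running_sigma              # r = sigma_B / sigma, treated as constant
    d = (mu_b - running_mu) / running_sigma  # d = (mu_B - mu) / sigma, treated as constant
    x_hat = (x - mu_b) / sigma_b * r + d
    return x_hat, r, d
```

Since r and d are constants for the gradient, training still backpropagates through the mini-batch statistics, but the normalized output matches what inference will compute from the running averages.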
5.1. Layer Normalization
We trained models on the MNIST dataset (Figure 1) with layer normalization and with batch normalization, and as you can see, the performance of layer normalization is far better than that of batch normalization. Note that here we used a feed-forward network of size 784-1000-1000-10, but for convolutional neural networks batch normalization performs better [2].
5.2. Batch Renormalization
Batch normalization achieved 78.3% accuracy on Inception-v3 [4] with batch size 32, but only 74.2% with batch size 4. Batch renormalization achieved 76.5% with batch size 4. The results can be found in Figure 2 [3].
You can also see a comparison between batch normalization and batch renormalization on non-i.i.d. training data in Figure 3.
Algorithm 1: Training (top) and inference (bottom) with Batch Renormalization, applied to activation x over a mini-batch. During backpropagation, the standard chain rule is used. The values marked with stop_gradient are treated as constant for a given training step, and the gradient is not propagated through them [3].
In this paper we reviewed two methods that improve on batch normalization. Layer normalization is
Figure 1: Training performance of a model with batch sizes of 128 and 4. Note that negative log-likelihood is used here.
a simple and effective method for applying normalization to recurrent neural networks and to small mini-batches, although it may not be as good for training convolutional neural networks. The second method, batch renormalization, is very effective with small or non-i.i.d. mini-batches; it also improves any model that uses batch norm, such as Residual Networks or Generative Adversarial Networks, and it can also be used for training recurrent neural networks.
Figure 2: Validation accuracy for models trained using batchnorm or Batch Renormalization, where normalization is performed over sets of 4 samples. Batch Renormalization makes the model train faster and achieve higher accuracy, although normalizing over sets of 32 samples achieves better accuracy still [3].
Figure 3: Validation accuracy when training on non-i.i.d. minibatches, obtained by sampling 2 images for each of 16 (out of total 1000) random labels. This distribution bias results not only in a low test accuracy, but also low accuracy on the training set, with an eventual drop. This indicates overfitting to the particular minibatch distribution, which is confirmed by the improvement when the test minibatches also contain 2 images per label, and batchnorm uses minibatch statistics during inference. It
improves further if batchnorm is applied separately to 2 halves of a training minibatch, making each of them more i.i.d. Finally, by using Batch Renorm, we are able to just train and evaluate normally, and achieve the same validation accuracy as we get for i.i.d. minibatches 3.
[1] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 448–456, 2015.
[2] J. L. Ba, J. R. Kiros, and G. E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
[3] S. Ioffe. Batch renormalization: Towards reducing minibatch dependence in batch-normalized models. arXiv preprint arXiv:1702.03275, 2017.
[4] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2818–2826, 2016.