• Train




          Data Collection

          Building Blocks​

          Device Enrollment

          Monitoring Dashboards

          Video Annotation​

          Application Editor​

          Device Management

          Remote Maintenance

          Model Training

          Application Library

          Deployment Manager

          Unified Security Center

          AI Model Library

          Configuration Manager

          IoT Edge Gateway

          Privacy-preserving AI

          Ready to get started?

          Expert Services
  • Customers
  • Company
Close this search box.

The Role of Batch Normalization in CNNs

Build, deploy, operate computer vision at scale
  • One platform for all use cases
  • Scale on robust infrastructure
  • Enterprise security

One of the main issues for computer vision (CV) developers is to avoid overfitting. That’s a situation when your CV model works great on the training data but is inefficient in predicting the test data. The reason is that your model is overfitting. To fix this problem you need regularization, most often done by Batch Normalization (BN).

The regularization methods enable CV models to converge faster and prevent over-fitting. Therefore, the learning process becomes more efficient. There are several techniques for regularization and here, we present the concept of batch normalization.

About us: Viso Suite is the computer vision solution for enterprises. By covering the entire ML pipeline in a simple interface, teams can easily integrate computer vision into their business workflows. To learn how your team can get started with computer vision, book a demo of Viso Suite.

One unified infrastructure to build deploy scale secure

real-world computer vision

What is Batch Normalization?

Batch normalization is a method that can enhance the efficiency and reliability of deep neural network models. It is very effective in training convolutional neural networks (CNN), providing faster neural network convergence. As a supervised learning method, BN normalizes the activation of the internal layers during training. The next layer can analyze the data more effectively by resetting the output distribution from the previous layer.


Neural network weighted input with formula to compute the output signal
Neural network layers and activations from the Artificial Neural Networks blog by


Internal covariate shift denotes the effect that parameters change in the previous layers have on the inputs of the current layer. This makes the optimization process more complex and slows down the model convergence.

In batch normalization – the activation value doesn’t matter, making each layer learn separately. This approach leads to faster learning rates. Also, the amount of information lost between processing stages may decrease. That will provide a significant increase in the precision of the network.


How Does Batch Normalization Work?

The batch normalization method enhances the efficiency of a deep neural network by discarding the batch mean and dividing it into the batch standard deviation. The gradient descent method scales the outputs by a parameter if the loss function is large. Subsequently – it updates the weights in the next layer.

Batch normalization aims to improve the training process and increase the model’s generalization capability. It reduces the need for precise initialization of the model’s weights and enables higher learning rates. That will accelerate the training process.

Batch normalization multiplies its output by a standard deviation parameter (γ). Also, it adds a mean parameter (beta) when applied to a layer. Due to the coaction between batch normalization and gradient descent, data may be disarranged when adjusting these two weights for each output. As a result, a reduction of data loss and improved network stability will be achieved by setting up the other relevant weights.


Batch normalization gradient magnitudes
Gradient magnitudes at initialization for layer 55 of a network with and without BN – Source


Commonly, CV experts apply batch normalization before the layer’s activation. It is often used in conjunction with other regularization functions. Also, deep learning methods, including image classification, natural language processing, and machine translation utilize batch normalization.


Batch Normalization in CNN Training

Internal Covariate Shift Reduction

Google researchers Sergey Ioffe and Christian Szegedy defined the internal covariate shift as a change in the order of network activations due to the change in network parameters during training. To improve the training, they aimed to reduce the internal covariate shift. Their goal was to increase the training speed by optimizing the distribution of layer inputs as training progressed.

Previous researchers (Lyu, Simoncelli, 2008) applied statistics over a single training example, or, in the case of image networks, over different feature maps at a given location. They wanted to preserve the information in the network. Therefore, they normalized the activations in a training sample relative to the statistics of the entire training dataset.


Training, testing, ResNet
Training and testing phases using a ResNet-50 with BN on ImageNetSource


The gradient descent optimization doesn’t take into account the fact that the normalization will happen. Ioffe and Szegedy wanted to ensure that the network always produces activations for parameter values. Due to the gradient loss, they applied the normalization and calculated its dependence on the model parameters.

Training and Inference with Batch-Normalized CNNs

Training can be more efficient by normalizing activations that depend on a mini-batch, but it is not necessary during inference. (Mini-batch is a portion of the training dataset). The researchers needed the output to depend only on the input. By applying moving averages, they tracked the accuracy of a model, while it trained.

Normalization can be done by applying a linear transformation to each activation, as means and variances are fixed during inference. To batch-normalize a CNN, researchers specified a subset of activations and inserted the BN transform for each (Algorithm below).

Authors considered a mini-batch B of size m. They performed normalization to each activation independently, by focusing on a particular activation x(k) and omitting k for clarity. They got m values for each activation in the mini-batch:
B = {x1…m}
They denoted normalized values as x1…m, and their linear transformations were y1…m. Researchers have defined the transform
BN γ,β : x1…m → y1…m

to be the Batch Normalizing transform. They conducted the BN Transform algorithm given below. In the algorithm, σ is a constant added to the mini-batch variance for numerical stability.


BN transform algorithm
Batch Normalizing Transform, applied to activation x over a mini-batch – Source


The BN transform has been added to a network to manipulate all activations. By y = BN γ,β(x), researchers indicated that the parameters γ and β should be learned. However, they noted that the BN transform does not independently process the activation in each training example.

Consequently, BN γ,β(x) depends both on the training example and the other samples in the mini-batch. They passed the scaled and shifted values y to other network layers. The normalized activations xb were internal to the transformation, but their presence was crucial. The distributions of values of all xb had the expected value of 0 and the variance of 1.

All layers that previously received x as the input – now receive BN(x). Batch normalization allows for the training of a model using batch gradient descent, or stochastic gradient descent with a mini-batch size m > 1.

Batch-Normalized Convolutional Networks

Szegedy et al., (2014) used batch normalization to create a new Inception network, trained on the ImageNet classification task. The network had a large number of convolutional and pooling layers.

  • They included a SoftMax layer to predict the image class, out of 1000 possibilities. Convolutional layers include ReLU as the nonlinearity.
  • The main difference between their CNN was that the 5 × 5 convolutional layers were replaced by two consecutive layers of 3 × 3 convolutions.
  • By using batch normalization researchers matched the accuracy of Inception in less than half the number of training steps.
  • With slight modifications, they significantly increased the training speed of the network. BN-x5 needed 14 times fewer steps than Inception to reach 72.2% accuracy.


accuracy Inception batch normalization
Validation accuracy of Inception and its batch-normalized variants – Source


By increasing the learning rate further (BN-x30) they caused the model to train slower at the start. Still, it was able to reach a higher final accuracy. It reached 74.8% after 6·106 steps, i.e. 5 times fewer steps than required by Inception.


Benefits of Batch Normalization

Batch normalization brings multiple benefits to the learning process:

  • Higher learning rates. The training process is faster since batch normalization enables higher learning rates.
  • Improved generalization. BN reduces overfitting and improves the model’s generalization ability. Also, it normalizes the activations of a layer.
  • Stabilized training process. Batch normalization reduces the internal covariate shift that occurs during training, improving the stability of the training process. Thus, it makes it easier to optimize the model.
  • Model Regularization. Batch normalization treats the training example together with other examples in the mini-batch. Therefore, the training network no longer produces deterministic values for a given training example.
  • Reduced need for careful initialization. Batch normalization decreases the model’s dependence on the initial weights, making it easier to train.


What’s Next?

Batch normalization offers a solution to address challenges with training deep neural networks for computer systems. By normalizing the activations of each layer, batch normalization allows for smoother and more stable optimization, resulting in faster convergence and improved generalization performance. Because it can mitigate issues like internal covariate shifts it enables the development of more robust and efficient neural network architectures.

For other relevant topics in computer vision, check out our other blogs:

Enterprise Computer Vision with Viso Suite

Viso Suite omits the need for enterprises to approach process automation with point solutions. Traditionally, computer vision solutions have only tackled a single step in the computer vision pipeline. This has led to companies dealing with disjointed computer vision initiatives.

As a result, using mixed infrastructure leads to a breakdown of the computer vision project at hand. Conversely, as a development platform, Viso Suite consolidates the entire machine learning pipeline into a single interface. With a unified initiative, enterprises have been able to realize a 695% ROI based on improved efficiency, productivity, and process automation. To learn more, check out our economic impact study.

Finally, Viso Suite was developed on the premise of being:

  • Fast and accurate
  • Easy to use
  • Open and extensible
  • Efficient and scalable
  • Private and secure

To learn more about Viso Suite and the value it provides enterprise ML teams, book a demo with our team.

Viso Suite Computer Vision Enterprise Platform
Viso Suite is the Computer Vision Enterprise Platform
Play Video