
EfficientNet: Pushing the Boundaries of Deep Learning Efficiency



EfficientNet is a Convolutional Neural Network (CNN) architecture that utilizes a compound scaling method to uniformly scale depth, width, and resolution, providing high accuracy with computational efficiency.

CNNs (Convolutional Neural Networks) power computer vision tasks like object detection and image classification. Their ability to learn from raw images has led to breakthroughs in autonomous vehicles, medical diagnosis, and facial recognition. However, as the size and complexity of datasets grow, CNNs need to become deeper and more complex to maintain high accuracy.

Increasing the complexity of CNNs can improve accuracy, but it also demands more computational resources.

This increased computational demand makes CNNs impractical for real-time applications and for devices with limited processing capabilities, such as smartphones and IoT devices. This is the problem EfficientNet tries to solve: it provides a sustainable and efficient way to scale CNNs.


Introducing Viso Suite, the end-to-end computer vision platform for enterprises. By consolidating the entire machine learning pipeline into a single infrastructure, Viso Suite allows ML teams to manage and control the entire application lifecycle. Learn more about Viso Suite by booking a demo with our team.

Viso Suite is the end-to-end, no-code computer vision solution.


The Path to EfficientNet

The popular strategy of increasing accuracy through growing model size yielded impressive results in the past, with models like GPipe achieving state-of-the-art accuracy on the ImageNet dataset.

From GoogLeNet to GPipe (2018), ImageNet top-1 accuracy jumped from 74.8% to 84.3%, while parameter counts ballooned from 6.8M to 557M, leading to excessive computational demands.


Model size vs accuracy parameters
Model size vs accuracy –Source


Model scaling can be achieved in three ways: by increasing model depth, width, or image resolution.

  • Depth (d): Scaling network depth is the most commonly used method. The idea is simple: a deeper ConvNet captures richer and more complex features and also generalizes better. However, deeper networks run into the vanishing gradient problem.
Depth scaling depiction
Depth scaling –Source


  • Width (w): Width scaling is commonly used in smaller models. Widening a model allows it to capture more fine-grained features. However, very wide but shallow models struggle to capture higher-level features.
Width Scaling used in EfficientNet
Width Scaling –Source


  • Image resolution (r): Higher-resolution images enable the model to capture more fine-grained patterns. Earlier models typically used 224 × 224 images, while newer models tend to use higher resolutions. However, higher resolution also increases computation requirements.


Resolution scaling in EfficientNet
Resolution scaling – Source


Problem with Scaling

As we have seen, scaling a model has been the go-to method, but it comes with significant computational overhead. Here is why:

More Parameters: Increasing depth (adding layers) or width (adding channels within layers) leads to a significant increase in the number of parameters in the network. Each parameter requires computation during training and prediction. More parameters translate to more calculations, increasing the overall computational burden.

Moreover, scaling also creates a memory bottleneck: larger models with more parameters require more memory to store the model weights and activations during processing.
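To make the parameter growth concrete, here is a small back-of-the-envelope sketch (an illustration, not from the original text) of how the parameter count of a standard convolution layer grows as width increases:

```python
def conv2d_params(in_ch, out_ch, kernel_size, bias=True):
    """Parameter count of a standard 2D convolution layer."""
    weights = out_ch * in_ch * kernel_size * kernel_size
    return weights + (out_ch if bias else 0)

# Doubling the width (channel count) of a 3x3 conv layer roughly
# quadruples its parameters:
base = conv2d_params(64, 64, 3)      # 36,928 parameters
wide = conv2d_params(128, 128, 3)    # 147,584 parameters
print(base, wide)
```

Every one of those parameters participates in the computation of each output position, so the FLOP count grows alongside the parameter count.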


What is EfficientNet?

EfficientNet proposes a simple and highly effective compound scaling method, which enables it to easily scale up a baseline ConvNet to any target resource constraints, in a more principled and efficient way.

What is Compound Scaling?

The creators of EfficientNet observed that the different scaling dimensions (depth, width, image size) are not independent.

High-resolution images require deeper networks to capture large-scale features with more pixels. Additionally, wider networks are needed to capture the finer details present in these high-resolution images. To pursue better accuracy and efficiency, it is critical to balance all dimensions of network width, depth, and resolution during ConvNet scaling.

Scaling all dimensions together using a fixed set of ratios, rather than scaling one dimension arbitrarily, yields better results. This is what compound scaling does.


Compound Scaling used in EfficientNet.
Compound scaling – Source


The compound scaling coefficient method uniformly scales all three dimensions (depth, width, and resolution) in a proportional manner using a predefined compound coefficient ɸ.

Here is the mathematical expression for the compound scaling method:


Compound scaling expression involving depth, width and resolution.
Compound scaling expression – Source

α: Scaling factor for network depth (α ≥ 1)
β: Scaling factor for network width (β ≥ 1)
γ: Scaling factor for image resolution (γ ≥ 1)
ɸ (phi): Compound coefficient that controls how many additional resources are available for scaling.

The factors are constrained so that α · β² · γ² ≈ 2, which means total FLOPs roughly double with each unit increase of ɸ (FLOPs scale with d · w² · r²).

This formulation tells us how much to scale each dimension (depth, width, resolution) to get the best performance out of a given resource budget.
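The scaling rule is easy to sketch in code. A minimal illustration, using the factor values α = 1.2, β = 1.1, γ = 1.15 that the EfficientNet paper found via a small grid search:

```python
# Compound scaling: for a chosen compound coefficient phi, each
# dimension is scaled by its factor raised to the power phi.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # values from the paper's grid search

def compound_scale(phi):
    depth_mult = ALPHA ** phi   # multiply the layer count by this
    width_mult = BETA ** phi    # multiply the channel count by this
    res_mult = GAMMA ** phi     # multiply the input resolution by this
    return depth_mult, width_mult, res_mult

# The constraint alpha * beta^2 * gamma^2 ~= 2 means FLOPs roughly
# double for each unit increase of phi (FLOPs scale with d * w^2 * r^2):
flops_growth = ALPHA * BETA**2 * GAMMA**2
print(compound_scale(1), round(flops_growth, 3))
```

Each EfficientNet variant (B1, B2, ...) corresponds to a larger value of ɸ applied to the B0 baseline.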

Benefits of Compound Scaling
  • Optimal Resource Utilization: By scaling all three dimensions proportionally, EfficientNet avoids the limitations of single-axis scaling (vanishing gradients or saturation).
  • Flexibility: The predefined coefficients allow for creating a family of EfficientNet models (B0, B1, B2, etc.) with varying capacities. Each model offers a different accuracy-efficiency trade-off, making them suitable for diverse applications.
  • Efficiency Gains: Compared to traditional scaling, compound scaling achieves similar or better accuracy with significantly fewer parameters and FLOPs (floating-point operations), making these models ideal for resource-constrained devices.

Moreover, the advantage of compound scaling can be visualized using an activation map.


Advantages of compound scaling can be visualised using activation map.
Class Activation Map – Source


However, compound scaling needs a good starting point. The authors of EfficientNet therefore designed a new baseline network, EfficientNet-B0, which is then scaled up in steps to obtain a family of larger networks (EfficientNet-B1 to EfficientNet-B7).


The EfficientNet Family

EfficientNet consists of 8 models, going from EfficientNet-B0 to EfficientNet-B7.

EfficientNet models
Overview of EfficientNet models


EfficientNet-B0 is the foundation upon which the entire EfficientNet family is built. It’s the smallest and most efficient model within the EfficientNet variants.
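For reference, the commonly cited configurations of the family can be collected into a small lookup table; the figures below are approximate values as reported in the EfficientNet paper:

```python
# EfficientNet family: variant -> (input resolution, parameters in millions)
EFFICIENTNET_VARIANTS = {
    "B0": (224, 5.3),
    "B1": (240, 7.8),
    "B2": (260, 9.2),
    "B3": (300, 12),
    "B4": (380, 19),
    "B5": (456, 30),
    "B6": (528, 43),
    "B7": (600, 66),
}

for name, (resolution, params_m) in EFFICIENTNET_VARIANTS.items():
    print(f"{name}: {resolution}x{resolution} input, ~{params_m}M parameters")
```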


EfficientNet Architecture

EfficientNet-B0, discovered through Neural Architecture Search (NAS), is the baseline model. The main components of the architecture are:

  • MBConv block (Mobile Inverted Bottleneck Convolution)
  • Squeeze-and-excitation optimization


EfficientNet architecture
EfficientNet Architecture –Source


What is the MBConv Block?

The MBConv block is an evolved inverted residual block inspired by MobileNetV2.

What is a Residual Network?

Residual networks (ResNets) are a type of CNN architecture that addresses the vanishing gradient problem: as a network gets deeper, gradients diminish during backpropagation. ResNets solve this problem and allow very deep networks to be trained. This is achieved by adding the original input to the output of the transformation applied by the layer, improving gradient flow through the network.

Residual learning, a diagram
Residual Learning –Source
What is an inverted residual block?

In the residual blocks used in ResNets, the main pathway applies convolutions that reduce the dimensionality of the input feature map. A shortcut (residual) connection then adds the original input to the output of this convolutional pathway, allowing gradients to flow through the network more freely.


Depiction of the residual block
Residual Block –Source


An inverted residual block, by contrast, starts by expanding the input feature map into a higher-dimensional space using a 1×1 convolution. It then applies a depthwise convolution in this expanded space, and finally uses another 1×1 convolution to project the feature map back to a lower-dimensional space, the same as the input dimension. The "inverted" aspect comes from this expansion of dimensionality at the beginning of the block and reduction at the end, which is the opposite of the traditional residual block.


An inverted residual block.
Inverted Residual Block –Source
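The expand → depthwise → project sequence can be traced at the level of channel counts. A shape-level sketch, assuming the expansion factor of 6 used by MobileNetV2 (the function name is illustrative):

```python
def inverted_residual_shapes(in_ch, expansion=6):
    """Trace channel counts through an MBConv-style inverted residual block."""
    expanded = in_ch * expansion   # 1x1 conv: expand into a higher-dim space
    after_dw = expanded            # depthwise conv: channel count unchanged
    projected = in_ch              # 1x1 conv: project back down to the input dim
    return expanded, after_dw, projected

# A block with 24 input channels works in a 144-channel space internally:
print(inverted_residual_shapes(24))
```

Because the expensive depthwise convolution operates per-channel, working in the expanded space stays cheap while the 1×1 projections keep the block's inputs and outputs compact.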


What is Squeeze-and-Excitation?

Squeeze-and-Excitation (SE) simply allows the model to emphasize useful features, and suppress the less useful ones. This is done in two steps:

  • Squeeze: This phase aggregates the spatial dimensions (width and height) of the feature maps across each channel into a single value, using global average pooling. This results in a compact feature descriptor that summarizes the global distribution for each channel, reducing each channel to a single scalar value.
  • Excitation: In this step, a fully-connected layer applied after the squeeze step produces a set of per-channel weights (activations or scores). The final step is to apply these learned importance scores to the original input feature map channel-wise, effectively scaling each channel by its corresponding score.


Squeeze-and-excitation block.
Squeeze-and-Excitation block –Source


This process allows the network to emphasize more relevant features and diminish less important ones, dynamically adjusting the feature maps based on the learned content of the input images.
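The two SE steps can be sketched in a few lines of NumPy. This is a minimal illustration; the weight names (`w_reduce`, `w_expand`) are placeholders, not from any particular library:

```python
import numpy as np

def squeeze_excite(x, w_reduce, w_expand):
    """Squeeze-and-Excitation over a feature map x of shape (C, H, W).

    w_reduce has shape (C//r, C) and w_expand has shape (C, C//r),
    where r is the channel reduction ratio.
    """
    s = x.mean(axis=(1, 2))                    # squeeze: global avg pool -> (C,)
    z = np.maximum(w_reduce @ s, 0.0)          # excitation: FC + ReLU -> (C//r,)
    e = 1.0 / (1.0 + np.exp(-(w_expand @ z)))  # FC + sigmoid -> scores in (0, 1)
    return x * e[:, None, None]                # rescale each channel by its score

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4, 4))             # 8 channels, 4x4 spatial grid
out = squeeze_excite(x, rng.standard_normal((2, 8)), rng.standard_normal((8, 2)))
print(out.shape)
```

The output keeps the input's shape; only the relative magnitude of each channel changes.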

Moreover, EfficientNet also incorporates the Swish activation function as part of its design to improve accuracy and efficiency.

What is the Swish Activation Function?

Swish is a smooth, continuous function defined as swish(x) = x · sigmoid(x), unlike the Rectified Linear Unit (ReLU), which is a piecewise linear function. Swish allows a small number of negative values to be propagated through, while ReLU thresholds all negative values to zero.


Swish and ReLU activation functions – Source
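A minimal sketch contrasting the two activations (pure Python; function names are illustrative):

```python
import math

def swish(x):
    """Swish activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

def relu(x):
    """ReLU activation: thresholds negative values to zero."""
    return max(0.0, x)

# Unlike ReLU, Swish lets small negative values pass through
# instead of zeroing them out:
print(relu(-1.0), round(swish(-1.0), 4))
```

For large positive inputs Swish behaves like the identity (as ReLU does), while its smooth, non-zero response for negative inputs helps gradient flow.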


EfficientNet incorporates all the above elements into its architecture. Finally, the architecture looks like this:


EfficientNet architecture
EfficientNet Architecture –Source

Performance and Benchmarks

The EfficientNet family, starting from EfficientNet-B0 to EfficientNet-B7 and beyond, offers a range of models that scale in complexity and accuracy. Here are some key performance benchmarks for EfficientNet on the ImageNet dataset, reflecting the balance between efficiency and accuracy.

Here are a few key insights from the benchmarks:

  • Higher accuracy with fewer parameters: EfficientNet models achieve high accuracy with fewer parameters and lower FLOPs than other convolutional neural networks (CNNs). For example, EfficientNet-B0 achieves 77.1% top-1 accuracy on ImageNet with only 5.3M parameters, while ResNet-50 achieves 76.0% top-1 accuracy with 26M parameters. Additionally, the B7 model performs on par with GPipe, but with far fewer parameters (66M vs 557M).
  • Fewer Computations: EfficientNet models can achieve similar accuracy to other CNNs with significantly fewer FLOPs. For example, EfficientNet-B1 achieves 79.1% top-1 accuracy on ImageNet with 0.70 billion FLOPs, while Inception-v3 achieves 78.8% top-1 accuracy with 5.7 billion FLOPs.

As the EfficientNet model size increases (B0 to B7), the accuracy and FLOPs also increase. However, the increase in accuracy is smaller for the larger models. For example, EfficientNet-B0 achieves 77.1% top-1 accuracy, while EfficientNet-B7 achieves 84.3% top-1 accuracy.


EfficientNet benchmarks for the various models.
EfficientNet Benchmark –Source

Applications of EfficientNet

EfficientNet’s strength lies in its ability to achieve high accuracy while maintaining efficiency. This makes it an important tool in scenarios where computational resources are limited. Here are some of the use cases for EfficientNet models:

  • Human Emotion Analysis on Mobile Devices: Video-based facial analysis of human affective behavior was performed with an EfficientNet model on mobile devices, achieving an F1-score of 0.38. Read here.


Emotion Recognition with Deep Learning
Human emotion analysis and recognition with computer vision


  • Health and Medicine: The B0 model was used for cancer diagnosis, obtaining an accuracy of 91.18%. Read here.


Use case of EfficientNet in cancer detection
Cancer Detection – source


  • Plant Leaf Disease: In plant leaf disease classification, the B5 and B4 EfficientNet models achieved the highest scores among the compared deep learning models on the augmented and original datasets, with 99.91% accuracy and 99.39% precision, respectively.
Use of EfficientNet model in plant leaf disease identification.
Plant Disease –Source


  • Mobile and Edge Computing: EfficientNet’s lightweight architecture, especially the B0 and B1 variants, makes it perfect for deployment on mobile devices and edge computing platforms with limited computational resources. This allows EfficientNet to be used in real-time applications like augmented reality, enhancing mobile photography, and performing real-time video analysis.
  • Embedded Systems: EfficientNet models can be used in resource-constrained embedded systems for tasks like image recognition in drones or robots. Their efficiency allows for on-board processing without requiring powerful hardware.
  • Faster Experience: EfficientNet's efficiency allows for faster processing on mobile devices, leading to a smoother user experience in applications like image recognition and augmented reality, along with reduced battery consumption.
