
Vision Transformers (ViT) in Image Recognition – 2024 Guide



Vision Transformers (ViTs) have recently emerged as a competitive alternative to Convolutional Neural Networks (CNNs), which have long been the state of the art in image recognition tasks. When pre-trained on sufficiently large datasets, ViT models match or exceed the accuracy of comparable state-of-the-art CNNs while requiring roughly four times fewer computational resources to pre-train.

Transformer models have become the de facto standard in Natural Language Processing (NLP). For example, the popular ChatGPT AI chatbot is a transformer-based language model. Specifically, it is based on the GPT (Generative Pre-trained Transformer) architecture, which uses self-attention mechanisms to model the dependencies between words in a text.

In computer vision research, there has recently been a rise in interest in Vision Transformers (ViTs) and Multilayer Perceptrons (MLPs).

This article will cover the following topics:

  • What is a Vision Transformer (ViT)?
  • Using ViT models in Image Recognition
  • How do Vision Transformers work?
  • Use Cases and applications of Vision Transformers




Vision Transformer (ViT) in Image Recognition

While the Transformer architecture has become the standard for tasks involving Natural Language Processing (NLP), its applications in Computer Vision (CV) remain comparatively limited. In many computer vision tasks, attention is either used in conjunction with convolutional neural networks (CNNs) or used to replace certain components of convolutional networks while keeping their overall structure intact. Popular image recognition algorithms include ResNet, VGG, YOLOv3, YOLOv7 or YOLOv8, and Segment Anything (SAM).


Convolutional Neural Networks Concept
The concept of widely popular Convolutional Neural Networks (CNN)

However, this dependency on CNN is not mandatory, and a pure transformer applied directly to sequences of image patches can work exceptionally well on image classification tasks.


Performance of Vision Transformers in Computer Vision

Vision Transformers (ViT) have recently achieved highly competitive performance in benchmarks for several computer vision applications, such as image classification, object detection, and semantic image segmentation.

CSWin Transformer is an efficient and effective Transformer-based backbone for general-purpose vision tasks. It uses a new technique called “Cross-Shaped Window self-attention” to analyze different parts of the image simultaneously, making it much faster.

The CSWin Transformer has surpassed previous state-of-the-art methods like the Swin Transformer. In benchmark tasks, CSWin achieved excellent performance, including 85.4% Top-1 accuracy on ImageNet-1K, 53.9 box AP and 46.4 mask AP on the COCO detection task, and 52.2 mIoU on the ADE20K semantic segmentation task.


Computer vision application for segmentation


What is a Vision Transformer (ViT)?

The Vision Transformer (ViT) model architecture was introduced in the research paper “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale”, published as a conference paper at ICLR 2021. It was developed and published by Alexey Dosovitskiy, Neil Houlsby, and 10 more authors of the Google Research Brain Team.

The fine-tuning code and pre-trained ViT models are available on the GitHub of the Google Research team. The ViT models were pre-trained on the ImageNet and ImageNet-21k datasets.


Origin and history of vision transformer models

In the following, we highlight some of the most significant vision transformers that have been developed over the years. They are based on the transformer architecture, which was originally proposed for natural language processing (NLP) in 2017.

Date | Model | Description | Vision Transformer?
2017 Jun | Transformer | A model based solely on an attention mechanism. It demonstrated excellent performance on NLP tasks. | No
2018 Oct | BERT | Pre-trained transformer models started dominating the NLP field. | No
2020 May | DETR | DETR is a simple yet effective framework for high-level vision that views object detection as a direct set prediction problem. | Yes
2020 May | GPT-3 | GPT-3 is a huge transformer model with 175B parameters that takes a significant step towards a general NLP model. | No
2020 Jul | iGPT | The transformer model, originally developed for NLP, can also be used for image pre-training. | Yes
2020 Oct | ViT | Pure transformer architectures that are effective for visual recognition. | Yes
2020 Dec | IPT/SETR/CLIP | Transformers have been applied to low-level vision, segmentation, and multimodality tasks, respectively. | Yes
2021 – today | ViT Variants | Several ViT variants include DeiT, PVT, TNT, Swin, and CSWin (2022). | Yes


Are Transformers a Deep Learning method?

A transformer in machine learning is a deep learning model that uses the mechanisms of attention, differentially weighing the significance of each part of the input sequence of data. Transformers in machine learning are composed of multiple self-attention layers. They are primarily used in the AI subfields of natural language processing (NLP) and computer vision (CV).

Transformers hold strong promise as a generic learning method that can be applied to various data modalities. This includes the recent breakthroughs in computer vision, where they achieve state-of-the-art accuracy with better parameter efficiency.


Vision Transformer and Image Classification

Image classification is a fundamental task in computer vision that involves assigning a label to an image based on its content. Over the years, deep convolutional neural networks (CNNs) like ResNet have been the state-of-the-art method for image classification.

However, recent advancements in transformer architecture, which was originally introduced for natural language processing (NLP), have shown great promise in achieving competitive results in image classification tasks.


Example of classification in medical imaging
A practical example of Image Classification in Medical Imaging and Healthcare

An example is CrossViT, a cross-attention Vision Transformer for Image Classification. Computer vision research indicates that when pre-trained with a sufficient amount of data, ViT models are at least as robust as ResNet models.

Other papers showed that Vision Transformer Models have great potential for privacy-preserving image classification and outperform state-of-the-art methods in terms of robustness against attacks and classification accuracy.


Difference between CNN and ViT (ViT vs. CNN)

Vision Transformer (ViT) achieves remarkable results compared to convolutional neural networks (CNNs) while requiring substantially fewer computational resources for pre-training. In comparison to CNNs, ViT shows a generally weaker inductive bias, resulting in increased reliance on model regularization or data augmentation (AugReg) when training on smaller datasets.

The ViT is a visual model based on the architecture of a transformer originally designed for text-based tasks. The ViT model represents an input image as a series of image patches, like the series of word embeddings used when applying transformers to text, and directly predicts class labels for the image. ViT exhibits extraordinary performance when trained on enough data, outperforming a comparable state-of-the-art CNN while using about 4x fewer computational resources.


CNN versus VIT benchmark chart
CNN vs. ViT: FLOPs and throughput comparison of CNN and Vision Transformer Models  – Source


Transformers have been highly successful in NLP and are now also applied to images for image recognition tasks. A CNN operates on pixel arrays, whereas ViT splits the input image into visual tokens. The vision transformer divides an image into fixed-size patches, linearly embeds each of them, and adds positional embeddings before passing the sequence to the transformer encoder. With sufficient pre-training data, ViT models reach accuracy comparable to or better than CNNs at roughly a quarter of the pre-training compute.

The self-attention layer in ViT makes it possible to embed information globally across the overall image. The model also learns from training data to encode the relative location of the image patches to reconstruct the structure of the image.

The transformer encoder includes the following:

  • Multi-Head Self-Attention Layer (MSA): This layer runs several attention heads in parallel, concatenates their outputs, and linearly projects the result to the right dimensions. The many attention heads help the model learn both local and global dependencies in an image.
  • Multi-Layer Perceptron (MLP) Layer: This layer contains two linear layers with a Gaussian Error Linear Unit (GELU) non-linearity.
  • Layer Norm (LN): This is added prior to each block, as it does not introduce any new dependencies between the training images. It thereby helps improve training time and overall performance.

Moreover, residual connections are included after each block, as they allow information and gradients to flow through the network directly without passing through non-linear activations.
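The encoder components described above can be sketched as a single block in NumPy. This is a minimal illustration, not the reference implementation: the parameter names (`w_q`, `w_o`, `w_1`, ...), the dimensions, and the random initialization are assumptions made for the example, and the learnable scale and shift of Layer Norm are omitted.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each token vector (learnable scale/shift omitted for brevity)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def gelu(x):
    """Tanh approximation of the Gaussian Error Linear Unit."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_self_attention(x, p, num_heads):
    n, d = x.shape
    head_dim = d // num_heads

    def split_heads(t):
        # (tokens, dim) -> (heads, tokens, head_dim)
        return t.reshape(n, num_heads, head_dim).transpose(1, 0, 2)

    q = split_heads(x @ p["w_q"])
    k = split_heads(x @ p["w_k"])
    v = split_heads(x @ p["w_v"])
    weights = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(head_dim))  # (heads, n, n)
    heads = weights @ v                                              # (heads, n, head_dim)
    concat = heads.transpose(1, 0, 2).reshape(n, d)                  # concatenate heads
    return concat @ p["w_o"]                                         # project to model dim

def mlp(x, p):
    # Two linear layers with a GELU in between.
    return gelu(x @ p["w_1"] + p["b_1"]) @ p["w_2"] + p["b_2"]

def encoder_block(x, p, num_heads):
    # Layer Norm before each sub-layer, residual connection after it.
    x = x + multi_head_self_attention(layer_norm(x), p, num_heads)
    x = x + mlp(layer_norm(x), p)
    return x

def init_params(rng, dim, mlp_dim):
    # Hypothetical initialization, chosen only for this example.
    def w(*shape):
        return rng.normal(size=shape) * 0.02
    return {
        "w_q": w(dim, dim), "w_k": w(dim, dim), "w_v": w(dim, dim), "w_o": w(dim, dim),
        "w_1": w(dim, mlp_dim), "b_1": np.zeros(mlp_dim),
        "w_2": w(mlp_dim, dim), "b_2": np.zeros(dim),
    }

rng = np.random.default_rng(0)
x = rng.normal(size=(17, 64))                 # 16 patch tokens + 1 class token
params = init_params(rng, dim=64, mlp_dim=256)
y = encoder_block(x, params, num_heads=4)
print(y.shape)                                # (17, 64)
```

Because of the residual connections, the block reduces to the identity when the output projections are zero, which is one reason such blocks are comparatively easy to stack.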

In the case of image classification, an MLP implements the classification head. It uses one hidden layer at pre-training time and a single linear layer for fine-tuning.


What is the self-attention of Vision Transformers?

The self-attention mechanism is a key component of the transformer architecture, which is used to capture long-range dependencies and contextual information in the input data. The self-attention mechanism allows a ViT model to attend to different regions of the input data, based on their relevance to the task at hand.

Therefore, the self-attention mechanism computes a weighted sum of the input data, where the weights are computed based on the similarity between the input features. This allows the model to give more importance to the relevant input features, which helps it capture more informative representations of the input data.
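As a minimal sketch of this weighted sum, the following NumPy example implements single-head scaled dot-product self-attention. The projection matrices `w_q`, `w_k`, and `w_v` and all sizes are assumptions chosen for illustration; similarity is measured here as the dot product between query and key vectors.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a token sequence."""
    q = tokens @ w_q                           # queries
    k = tokens @ w_k                           # keys
    v = tokens @ w_v                           # values
    scores = q @ k.T / np.sqrt(q.shape[-1])    # pairwise similarity between tokens
    weights = softmax(scores, axis=-1)         # each row sums to 1
    return weights @ v, weights                # weighted sum of values

rng = np.random.default_rng(0)
num_tokens, dim = 5, 8
tokens = rng.normal(size=(num_tokens, dim))
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) * 0.1 for _ in range(3))

output, weights = self_attention(tokens, w_q, w_k, w_v)
print(output.shape, weights.shape)             # (5, 8) (5, 5)
```

Row `i` of `weights` states how much token `i` draws from every other token, so each output token is a mixture of all value vectors.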

Hence, self-attention is a computational primitive used to quantify pairwise entity interactions that help a network to learn the hierarchies and alignments present inside input data. Attention has proven to be a key element for vision networks to achieve higher robustness.


Vision Transformer Attention Map
Raw images (left) with attention maps of the ViT-S/16 model (right). – Source


What are the attention maps of ViT?

The attention maps of Vision Transformer (ViT) are matrices that represent the importance of different parts of an input image to different parts of the model’s learned representations. In ViT, the entire image of the input data is first divided into non-overlapping patches, which are then flattened and fed into the transformer encoder (more about the architecture below).

Attention maps refer to the visualizations of the attention weights that are calculated between each token (or patch) in the image and all other tokens. These attention maps are calculated using a self-attention mechanism, where each token attends to all other tokens to obtain a weighted sum of their representations.

The attention maps can be visualized as a grid of heatmaps, where each heatmap represents the attention weights between a given token and all other tokens. The brighter the color of a pixel in the heatmap, the higher the attention weight between the corresponding tokens. By analyzing the attention maps, we can gain insights into which parts of the image are most important for the classification task at hand.
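One simple way to obtain such a heatmap is sketched below. It takes the attention weights from the class token to all patch tokens and reshapes them into the patch grid. The example uses a single head of a single layer with random weights and a hypothetical 4x4 grid; published visualizations typically aggregate over heads and layers, so treat this as an illustration of the reshaping step only.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
grid, dim = 4, 16                                   # assumed 4x4 patch grid
tokens = rng.normal(size=(1 + grid * grid, dim))    # index 0 is the class token
w_q = rng.normal(size=(dim, dim))
w_k = rng.normal(size=(dim, dim))

# Attention weights between every pair of tokens: shape (17, 17).
attn = softmax((tokens @ w_q) @ (tokens @ w_k).T / np.sqrt(dim))

# Attention from the class token to each patch, arranged on the patch grid.
cls_to_patches = attn[0, 1:]
heatmap = cls_to_patches.reshape(grid, grid)

# Rescale to [0, 1] so the map can be rendered as brightness values.
heatmap = (heatmap - heatmap.min()) / (heatmap.max() - heatmap.min())
print(heatmap.shape)                                # (4, 4)
```

To overlay the map on the input image, it would be upsampled from the patch grid to the image resolution.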


Visualization of attention maps of ViT on images from ImageNet-A- Source


Vision Transformer ViT Architecture

Several vision transformer models have been proposed in the literature. The overall structure of the vision transformer architecture consists of the following steps:

  1. Split an image into patches (fixed sizes)
  2. Flatten the image patches
  3. Create lower-dimensional linear embeddings from these flattened image patches
  4. Include positional embeddings
  5. Feed the sequence as an input to a state-of-the-art transformer encoder
  6. Pre-train the ViT model with image labels in a fully supervised manner on a large dataset
  7. Fine-tune on the downstream dataset for image classification
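Steps 1 to 4 above can be sketched in a few lines of NumPy. The image size (224), patch size (16), and embedding width (768) correspond to a ViT-Base/16 style configuration and are assumptions for this example, as are the random weights. The sketch also prepends a class token, which the ViT paper uses to carry the classification output.

```python
import numpy as np

def image_to_patches(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    gh, gw = h // patch_size, w // patch_size
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)        # (gh, gw, p, p, c)
    return patches.reshape(gh * gw, patch_size * patch_size * c)

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))
patch_size, embed_dim = 16, 768

# Steps 1 and 2: split into patches and flatten them.
patches = image_to_patches(image, patch_size)          # (196, 768)

# Step 3: linear embedding of each flattened patch.
w_embed = rng.normal(size=(patches.shape[1], embed_dim)) * 0.02
tokens = patches @ w_embed                             # (196, 768)

# Prepend the class token (learned in practice, zeros here).
cls_token = np.zeros((1, embed_dim))
tokens = np.concatenate([cls_token, tokens], axis=0)   # (197, 768)

# Step 4: add positional embeddings (learned in practice, random here).
pos_embed = rng.normal(size=tokens.shape) * 0.02
encoder_input = tokens + pos_embed
print(patches.shape, encoder_input.shape)              # (196, 768) (197, 768)
```

A 224x224 image with 16x16 patches gives a 14x14 grid, so the encoder sees 196 patch tokens plus the class token.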


Vision Transformer ViT Architecture – Source


Vision Transformers (ViT) is an architecture that uses self-attention mechanisms to process images. The Vision Transformer Architecture consists of a series of transformer blocks. Each transformer block consists of two sub-layers: a multi-head self-attention layer and a feed-forward layer.

The self-attention layer calculates attention weights for each image patch (token) based on its relationship with all other patches, while the feed-forward layer applies a non-linear transformation to the output of the self-attention layer. The multi-head attention extends this mechanism by allowing the model to attend to different parts of the input sequence simultaneously.

ViT also includes an additional patch embedding layer, which divides the image into fixed-size patches and maps each patch to a high-dimensional vector representation. These patch embeddings are then fed into the transformer blocks for further processing.

The final output of the ViT architecture is a class prediction, obtained by passing the output of the last transformer block through a classification head, which typically consists of a single fully connected layer.


Performance benchmark comparison of Vision Transformers (ViT) with ResNet and MobileNet when trained from scratch on ImageNet. – Source


While the ViT full-transformer architecture is a promising option for vision processing tasks, the performance of ViTs is still inferior to that of similar-sized CNN alternatives (such as ResNet) when trained from scratch on a mid-sized dataset such as ImageNet. Overall, the ViT architecture allows for a more flexible and efficient way to process images, without relying on pre-defined handcrafted features.


How does a Vision Transformer (ViT) work?

The performance of a vision transformer model depends on decisions such as that of the optimizer, network depth, and dataset-specific hyperparameters. Compared to ViT, CNNs are easier to optimize.

An alternative to a pure transformer is to marry a transformer to a CNN front end. The usual ViT stem uses a 16x16 convolution with a stride of 16. In comparison, a stem built from 3x3 convolutions with stride 2 increases training stability and improves accuracy.
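The arithmetic behind the two stems can be checked with the standard convolution output-size formula. The sketch below assumes a 224x224 input, four stacked 3x3 convolutions, and a padding of 1; these values are illustrative choices, not details taken from this article.

```python
def conv_output_size(size, kernel, stride, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1

image_size = 224

# Standard ViT stem: one 16x16 convolution with stride 16.
patchify_grid = conv_output_size(image_size, kernel=16, stride=16)

# Convolutional stem: four 3x3 convolutions with stride 2 (padding 1 assumed).
conv_grid = image_size
for _ in range(4):
    conv_grid = conv_output_size(conv_grid, kernel=3, stride=2, padding=1)

print(patchify_grid, conv_grid)       # 14 14
print(patchify_grid * patchify_grid)  # 196 tokens in both cases
```

Both stems reduce the image by a factor of 16 per side, so the transformer receives the same number of tokens either way; only the way those tokens are computed differs.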

The CNN turns raw pixels into a feature map. A tokenizer then translates the feature map into a sequence of tokens that are fed into the transformer. The transformer applies attention to create a sequence of output tokens.

Finally, a projector maps the output tokens back onto the feature map, which lets subsequent processing retain potentially crucial pixel-level details. Because attention operates on a small set of tokens rather than on every pixel, the number of tokens to process drops, which lowers computational cost significantly.

In particular, if the ViT model is pre-trained on huge datasets of more than 14M images, it can outperform CNNs. If not, the best option is to stick to ResNet or EfficientNet. The vision transformer model is pre-trained on a huge dataset before it is fine-tuned. The only change for fine-tuning is to discard the pre-trained prediction head and attach a new D × K linear layer, where D is the hidden dimension of the model and K is the number of classes of the smaller downstream dataset.
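The head replacement can be sketched as follows. The hidden dimension `D = 768`, the class count `K = 10`, and the random "encoder outputs" are assumptions for illustration. The new layer is zero-initialized here, as described in the ViT paper, so the fine-tuned model starts from a uniform prediction.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

hidden_dim = 768       # D: width of the pre-trained encoder (assumed)
num_classes = 10       # K: classes in the downstream dataset (assumed)

rng = np.random.default_rng(0)
# Stand-in for the class-token outputs of the pre-trained encoder (batch of 4).
cls_features = rng.normal(size=(4, hidden_dim))

# New D x K head, zero-initialized; the pre-training head is discarded.
w_head = np.zeros((hidden_dim, num_classes))
b_head = np.zeros(num_classes)

logits = cls_features @ w_head + b_head
probs = softmax(logits)
print(logits.shape)    # (4, 10)
print(probs[0])        # uniform: every class has probability 0.1
```

During fine-tuning, `w_head` and `b_head` are trained together with the encoder weights on the downstream labels.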

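To fine-tune at a higher resolution, 2D interpolation of the pre-trained position embeddings is performed. This is necessary because the position embeddings are trainable parameters learned for the sequence length used during pre-training, and a higher-resolution image produces more patches.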
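A minimal sketch of this resizing is shown below, assuming bilinear interpolation and a class-token embedding in the first row. The helper name `resize_pos_embed` and the grid sizes (14x14 for 224 pixel inputs, 24x24 for 384 pixel inputs at patch size 16) are choices made for this example; actual implementations may use a different interpolation scheme.

```python
import numpy as np

def resize_pos_embed(pos_embed, old_grid, new_grid):
    """Bilinearly resize patch position embeddings from old_grid^2 to new_grid^2."""
    cls_pos, patch_pos = pos_embed[:1], pos_embed[1:]   # keep class-token embedding as is
    dim = patch_pos.shape[1]
    grid = patch_pos.reshape(old_grid, old_grid, dim)   # restore the 2D layout

    # Sample positions of the new grid, expressed in old-grid coordinates.
    coords = np.linspace(0, old_grid - 1, new_grid)
    lo = np.floor(coords).astype(int)
    hi = np.minimum(lo + 1, old_grid - 1)
    frac = coords - lo

    # Interpolate along rows, then along columns.
    rows = grid[lo] * (1 - frac)[:, None, None] + grid[hi] * frac[:, None, None]
    out = rows[:, lo] * (1 - frac)[None, :, None] + rows[:, hi] * frac[None, :, None]

    return np.concatenate([cls_pos, out.reshape(new_grid * new_grid, dim)], axis=0)

rng = np.random.default_rng(0)
dim = 32
pos_embed = rng.normal(size=(1 + 14 * 14, dim))         # pre-trained at 224x224
resized = resize_pos_embed(pos_embed, old_grid=14, new_grid=24)
print(resized.shape)                                    # (577, 32)
```

The corner embeddings are preserved exactly, and the embeddings in between are blended from their nearest neighbours in the original grid.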


Challenges of Vision Transformers

The challenges of vision transformers are many, and they include issues related to architecture design, generalization, robustness, interpretability, and efficiency.

In general, transformers lack some inductive biases compared to CNNs and rely heavily on massive datasets for large-scale training, which is why the quality of data significantly influences the generalization and robustness of transformers in computer vision tasks.

Whilst ViT shows exceptional performance on downstream image classification tasks, for example, VTAB and CIFAR, directly applying the ViT backbone to object detection has failed to surpass the results of CNNs.

Additionally, it still remains a challenge to fully understand why transformers work well on visual tasks. Furthermore, developing efficient transformer models for computer vision that can be deployed on resource-limited devices is a challenging issue.


Real-World Vision Transformer (ViT) Use Cases and Applications

Vision transformers have extensive applications in popular image recognition tasks such as object detection, segmentation, image classification, and action recognition. Moreover, ViTs are applied in generative modeling and multi-modal tasks, including visual grounding, visual question answering, and visual reasoning.

Video forecasting and activity recognition are video processing tasks that make use of ViT models. Image enhancement, colorization, and image super-resolution also use ViT models. Last but not least, ViTs have numerous applications in 3D analysis, such as segmentation and point cloud classification.

A custom model for image segmentation in sports
An example of image segmentation in Sports



The vision transformer model uses multi-head self-attention in computer vision without requiring image-specific inductive biases. The model splits the image into a sequence of patches, adds positional embeddings to them, and processes the resulting sequence with the transformer encoder.

It does so to capture both the local and global features of the image. Last but not least, ViT achieves higher accuracy on large datasets with reduced training time.


What’s next?

Read more about related topics and other state-of-the-art methods in machine learning, image processing, and recognition.
