
YOLO (You Only Look Once) is a family of real-time object detection machine-learning algorithms. Object detection is a computer vision task that uses neural networks to localize and classify objects in images. This task has a wide range of applications, from medical imaging to self-driving cars. Several machine-learning approaches are used for object detection, one of the most prominent being convolutional neural networks (CNNs).

CNNs are the base for any YOLO model; researchers and engineers use these models for tasks like object detection and segmentation. YOLO models are open-source and widely used in the field. They have improved from one version to the next, resulting in better accuracy, better performance, and additional capabilities. This article will explore the entire YOLO family from the original to the latest, covering their architectures, use cases, and demos.

About us: Viso Suite is the complete computer vision platform for enterprises. Within the platform, teams can seamlessly build, deploy, manage, and scale real-time applications for implementation across industries. Viso Suite is use case-agnostic, meaning that it performs all visual AI-associated tasks, including people counting, defect detection, and safety tracking. To learn more, book a demo with our team.

Viso Suite is the end-to-end, no-code computer vision solution.

YOLOv1: The Original

Before YOLO, researchers used convolutional neural network (CNN) based detectors like R-CNN and Fast R-CNN. Those approaches used a two-step process that first proposed candidate regions and then classified the objects in those regions, refining the bounding boxes with regression. This approach was slow and resource-intensive, but YOLO models revolutionized object detection. When Joseph Redmon and Ali Farhadi introduced the first YOLO back in 2016, its new architecture overcame most problems of traditional object detection algorithms.

Architecture
YOLO v1 architecture
The Architecture of YOLOv1. Source.

The original YOLO architecture consisted of 24 convolutional layers followed by 2 fully connected layers, inspired by the GoogLeNet model for image classification. At the time, this single-network design was the first of its kind.

The initial convolutional layers of the network extract features from the image while the fully connected layers predict the output probabilities and coordinates. This means that both the bounding boxes and the classification happen in one step. This one-step process streamlines the operation and achieves real-time efficiency. In addition, the YOLO architecture used the following optimization techniques.

  • Leaky ReLU Activation: Leaky ReLU helps to prevent the “dying ReLU” problem, where neurons can get stuck in an inactive state during training (see the sketch after this list).
  • Dropout Regularization: YOLOv1 applies dropout regularization after the first fully connected layer to prevent overfitting.
  • Data Augmentation: Random scaling and translation, plus adjustments to exposure and saturation, help the model generalize beyond the training images.
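As a quick illustration of the first point, here is a minimal sketch of a leaky ReLU; the 0.1 slope for negative inputs matches the activation described in the YOLOv1 paper, while the sample values are arbitrary. Because negative inputs still produce a small output and gradient, neurons cannot get permanently stuck at zero.

import numpy as np

def leaky_relu(x, negative_slope=0.1):
    # A plain ReLU would output 0 for negative x and pass back a zero
    # gradient (the "dying ReLU" problem); the leaky slope avoids that.
    return np.where(x > 0, x, negative_slope * x)

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(leaky_relu(x))  # [-0.2 -0.05 0. 1.5]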
How It Works

The essence of YOLO models is treating object detection as a regression problem. The YOLO approach is to apply a single convolutional neural network (CNN) to the full image. This network divides the image into regions and predicts bounding boxes and probabilities for each region.

These bounding boxes are weighted by the predicted probabilities. Those weights can then be thresholded to show only the high-scoring detections.

How does YOLOv1 work
The detection process in YOLOv1. Source.

YOLOv1 divides the input image into an S×S grid, and each grid cell is responsible for predicting bounding boxes and class probabilities for objects whose centers fall inside it. Each bounding box prediction includes a confidence score indicating the likelihood of an object being present in the box. The researchers compute these confidence scores using intersection over union (IoU) against the ground-truth boxes, and the scores can later be thresholded to filter predictions.
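In the paper's PASCAL VOC setup, S = 7, each cell predicts B = 2 boxes, and there are C = 20 classes, so the network outputs a 7×7×30 tensor. To make the scoring step concrete, here is a minimal sketch of IoU computation and confidence filtering; the box coordinates and the 0.5 threshold are illustrative choices, not values fixed by the paper.

def iou(box_a, box_b):
    # Boxes are (x1, y1, x2, y2); compute the overlap rectangle first.
    inter_x1, inter_y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    inter_x2, inter_y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, inter_x2 - inter_x1) * max(0, inter_y2 - inter_y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# Keep only detections whose confidence clears a threshold.
detections = [((10, 10, 50, 50), 0.9), ((12, 14, 48, 52), 0.3)]
kept = [(box, conf) for box, conf in detections if conf >= 0.5]
print(kept)  # only the 0.9-confidence box survives

Despite the novelty and speed of the YOLO approach, it faced some limitations, like the following.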

  • Generalization: YOLOv1 struggles to accurately detect objects in new or unusual configurations not seen in training.
  • Spatial constraints: In YOLOv1, each grid cell predicts only two boxes and can only have one class, which makes it struggle with small objects that appear in groups, such as flocks of birds.
  • Loss Function Limitations: The YOLOv1 loss function treats errors the same in small and large bounding boxes. A small error in a large box is generally tolerable, but the same error in a small box has a much greater effect on IoU.
  • Localization Errors: One major issue with YOLOv1 was accuracy; it often mislocalized objects in the image.

Now that we have covered the basic mechanism of YOLO, let’s look at how the researchers upgraded the model’s capabilities in the next version.

YOLOv2 (YOLO9000)

YOLO9000 came one year after YOLOv1 to address the limitations in object detection datasets at the time. YOLO9000 was named that way because it can detect over 9000 different object categories. This was transformative in terms of accuracy and generalization.

The researchers behind YOLO9000 proposed a unique joint training algorithm that trains object detectors on both detection and classification data. This method leverages labeled detection images to learn to precisely localize objects and uses classification images to increase its vocabulary and robustness.

The YOLOv2 WordTree hierarchy.
YOLO9000 combines datasets using the WordTree hierarchy. Source.

By combining the features of different datasets, for both classification and detection, YOLO9000 showed a great improvement over its predecessor, YOLOv1. The paper announced it as better, faster, and stronger.

  • Hierarchical Classification: A method used in YOLO9000 based on the WordTree structure, allowing increased generalization to unseen objects and an increased vocabulary, or range of detectable objects (see the sketch after this list).
  • Architectural Changes: YOLO9000 introduced a few changes, like batch normalization for faster and more stable training, anchor boxes for predicting bounding boxes, and Darknet-19 as a backbone. Darknet-19 is a 19-layer CNN designed to be both accurate and fast.
  • Joint Training: An algorithm that allows the model to utilize the hierarchical classification framework and learn from both classification and detection datasets like COCO and ImageNet.
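The WordTree idea is easy to see in code: the absolute probability of a leaf class is the product of conditional probabilities along its path to the root, so YOLO9000 can back off to a confident parent ("dog") when a specific child ("Norfolk terrier") is uncertain. Below is a minimal sketch with a toy, hand-made tree; the node names follow the paper's example, but the probabilities are made up for illustration.

# Toy WordTree: each node maps to (parent, conditional probability given the parent).
word_tree = {
    "physical object": (None, 1.0),  # root; the paper ties this to the box's objectness score
    "animal":          ("physical object", 0.7),
    "dog":             ("animal", 0.6),
    "terrier":         ("dog", 0.5),
    "Norfolk terrier": ("terrier", 0.4),
}

def absolute_probability(node):
    # Multiply conditional probabilities along the path up to the root.
    prob = 1.0
    while node is not None:
        parent, cond_prob = word_tree[node]
        prob *= cond_prob
        node = parent
    return prob

print(absolute_probability("Norfolk terrier"))  # 0.7 * 0.6 * 0.5 * 0.4 = 0.084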

The YOLO family continued to improve; next, we will look at YOLOv3.

YOLOv3: Little Changes, Big Effects

A couple of years later, the researchers behind YOLO came up with the next version, YOLOv3. While YOLO9000 was a state-of-the-art model, object detection in general still had its limitations. Improving accuracy and speed is always a goal of object detection models, and that was the aim of YOLOv3: small adjustments here and there that add up to a better-performing model.

The improvements start with the bounding boxes. While YOLOv3 still uses anchor boxes, it introduced multi-scale predictions, where bounding boxes are predicted at three different scales. This makes it more effective at detecting objects of different sizes. This, among other improvements, put YOLO back on the map of state-of-the-art models, with a favorable speed-accuracy trade-off.

YOLOv3 performance.
YOLOv3 compared to other state-of-the-art models at the time. Source.

As seen in the graph, YOLOv3 offered one of the best combinations of speed and accuracy, measured by the mean average precision (mAP-50) metric. Additionally, YOLOv3 introduced other improvements, as follows.

  • Backbone: YOLOv3 uses a bigger and better CNN backbone, Darknet-53. It consists of 53 layers and is a hybrid between Darknet-19 and residual networks (ResNets), while being more efficient than ResNet-101 or ResNet-152.
  • Predictions Across Scales: YOLOv3 predicts bounding boxes at three different scales, similar to feature pyramid networks. This allows the model to detect objects of various sizes more effectively.
  • Classifier: Independent logistic classifiers are used instead of a softmax function, allowing multiple labels per box (see the sketch after this list).
  • Dataset: The researchers train YOLOv3 on the COCO dataset only.
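The classifier change is easy to picture: a softmax forces classes to compete for one label, while independent sigmoids make each class a separate yes/no decision, so a box can carry several labels at once (the paper's motivation was overlapping labels such as "woman" and "person"). A minimal sketch with made-up logits:

import numpy as np

logits = np.array([2.0, 1.5, -3.0])   # per-class scores for one box (illustrative)
classes = ["person", "woman", "car"]

# Softmax: probabilities compete and sum to 1, so only one label can win.
softmax = np.exp(logits) / np.exp(logits).sum()

# Independent sigmoids: each class is scored on its own.
sigmoids = 1 / (1 + np.exp(-logits))
multi_labels = [c for c, p in zip(classes, sigmoids) if p > 0.5]
print(multi_labels)  # ['person', 'woman'] -- both labels survive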

Also, while less significant, YOLOv3 fixed a small data-loading bug in YOLOv2, which helped by about 2 mAP points. Up next, let’s see how the YOLO model evolved into YOLOv4.

YOLOv4: Optimization is Key

Continuing the legacy of the YOLO family, YOLOv4 introduced multiple improvements and optimizations. Let’s dive deeper into the YOLOv4 mechanism.

Architecture

The most notable change is the three-part architecture. While YOLOv4 is still a one-stage object detection network, its architecture involves three main components: backbone, neck, and head. This split was a very important step in the evolution of YOLO, and each component has its own functionality.

The backbone is the feature extraction part, usually a CNN that learns features across layers. Then the neck refines and combines the extracted features from different levels of the backbone, creating a rich and informative feature representation. Finally, the head does the actual prediction, and outputs bounding boxes, class probabilities, and objectness scores.
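The split is easiest to see as code. Below is a minimal structural sketch in PyTorch of how a backbone, neck, and head compose into a one-stage detector; the stub modules are illustrative stand-ins, not YOLOv4's actual CSPDarknet53, SPP/PAN, or YOLOv3 head.

import torch
import torch.nn as nn

class TinyDetector(nn.Module):
    def __init__(self, num_classes=80, num_anchors=3):
        super().__init__()
        # Backbone: extracts features across layers (stand-in for CSPDarknet53).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Neck: refines and combines the extracted features (stand-in for SPP + PAN).
        self.neck = nn.Sequential(nn.Conv2d(64, 64, 1), nn.ReLU())
        # Head: predicts 4 box offsets, 1 objectness score, and class scores per anchor.
        self.head = nn.Conv2d(64, num_anchors * (5 + num_classes), 1)

    def forward(self, x):
        return self.head(self.neck(self.backbone(x)))

out = TinyDetector()(torch.randn(1, 3, 416, 416))
print(out.shape)  # torch.Size([1, 255, 104, 104]) -- a raw prediction map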

YOLO family, YOLOv4 architecture.
Backbone, neck, and head for object detection. Source.

For YOLOv4 the researchers used the following components for the backbone, neck, and head.

  • Backbone: CSPDarknet53, a convolutional neural network that applies a Cross Stage Partial Network (CSPNet) strategy to Darknet-53.
  • Neck: A modified Spatial Pyramid Pooling (SPP) and Path Aggregation Network (PAN) were used for YOLOv4, producing more granular feature extraction, better training, and better performance.
  • Head: YOLOv4 employs the anchor-based head of YOLOv3.

This was not all that YOLOv4 introduced; a lot of work went into optimization and picking the right methods and techniques. Let’s explore those next.

Optimization

The YOLOv4 model came with two bags of methods, as the researchers introduced in the paper: the Bag of Freebies (BoF) and the Bag of Specials (BoS). Those methods were instrumental to the performance of YOLOv4. In this section, we will explore the most important ones.

  1. Mosaic Data Augmentation: This data augmentation method combines 4 training images into one, enabling the model to learn to detect objects in a wider variety of contexts and reducing the need for large mini-batch sizes. The researchers used it as part of the BoF for backbone training (a sketch follows this list).
  2. Self-Adversarial Training (SAT): A two-stage data augmentation technique in which the network first alters its own input image so that the desired object appears to be absent, and then trains on this modified image, improving robustness and generalization. Researchers used it as part of the BoF for the detector, or head, of the network.
  3. Cross mini-batch Normalization (CmBN): A modification of Cross-Iteration Batch Normalization (CBN) that makes training more suitable for a single GPU. Used by researchers as part of the BoF for the detector.
  4. Modified Spatial Attention Module (SAM): The researchers modify the original SAM from spatial-wise attention to point-wise attention, enhancing the model’s ability to focus on important features without increasing computational costs. Used as part of the BoS as an additional block for the detector of the network.
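To make the Mosaic idea concrete, here is a minimal NumPy sketch that stitches four equally sized training images into one canvas. Real implementations also jitter the center point, rescale the tiles, and remap the bounding-box labels, all omitted here for brevity.

import numpy as np

def simple_mosaic(img_a, img_b, img_c, img_d):
    # All four inputs are HxWx3 arrays of the same size (a simplification).
    top = np.concatenate([img_a, img_b], axis=1)
    bottom = np.concatenate([img_c, img_d], axis=1)
    return np.concatenate([top, bottom], axis=0)

imgs = [np.random.randint(0, 255, (320, 320, 3), dtype=np.uint8) for _ in range(4)]
print(simple_mosaic(*imgs).shape)  # (640, 640, 3) -- one image, four contexts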

However, this is not all; YOLOv4 used many other techniques within the BoS and BoF, like the Mish activation and cross-stage partial connections (CSP) for the backbone as part of the BoS. All these optimizations resulted in state-of-the-art performance for YOLOv4, especially in speed, but also in accuracy.

YOLO v4 benchmark performance.
Comparison of the proposed YOLOv4 with other state-of-the-art object detectors. YOLOv4 runs twice as fast as EfficientDet with comparable accuracy. Source.

YOLOv5: The Most Loved

While YOLOv5 didn’t come with a dedicated research paper, this model has impressed all developers, engineers, and researchers. YOLOv5 came only a few months after YOLOv4, there wasn’t much improvement, but it was slightly faster. Ultralytics designed YOLOv5 for easier implementation, and more detailed documentation with multiple languages support, most notably YOLOv5 was built on Pytorch making it easily usable for developers.

At the same time, its predecessors were slightly harder to implement. Ultralytics announced YOLOv5 as the world’s most loved vision AI. In addition, YOLOv5 came with some great features, like different formats for model export, a training script to train on your own data, and multiple training tricks, such as Test-Time Augmentation (TTA) and Model Ensembling.

YOLOv5 versions compared.
Performance of different YOLOv5 variations. Source.
Demo

Since YOLOv5 is very similar in architecture and working mechanism to YOLOv4 but much easier to implement, we can try it out. This section will test a pre-trained YOLOv5 model on our images, using a Google Colab notebook with PyTorch, and is beginner-friendly. Let’s get started.

import torch

# Load the compact, fast YOLOv5s model from the Ultralytics hub
model = torch.hub.load('ultralytics/yolov5', 'yolov5s', force_reload=True, trust_repo=True)

First, we import PyTorch and load the model using torch.hub.load. We will use the YOLOv5s model, which is compact and very fast. Now that we have the model, let’s load some test images and run inference.

im = '/content/n09835506_ballplayer.jpeg'  # file path, PIL.Image, OpenCV image, numpy array, or list
results = model(im)  # inference
results.save()  # or .print(), .show(), .crop(), .pandas()

I will be using some sample images from the ImageNet dataset; the model took around 17 milliseconds per image, including preprocessing. Below are the results of 3 samples.

YOLOv5 demo results
Sample results from YOLOv5 small.

Ease of use, continuous updates, a big community, and good documentation make YOLOv5 the perfect compact model: it runs on light hardware and gives decent accuracy in almost real-time. YOLO models continued to evolve after YOLOv5 into YOLOv6; let’s explore that in the next section.

YOLOv6: Industrial Quality

YOLOv6 emerged as a significant evolution in the YOLO series, introducing key architectural and training changes to achieve a better balance between speed and accuracy. Notably, YOLOv6 distinguishes itself by focusing on industrial applications, offering deployment-ready networks and better consideration of the constraints of real-world environments.

With its balance between speed and accuracy, it can run on commonly used hardware, such as the Tesla T4 GPU, making the deployment of object detection in industrial settings easier than ever. YOLOv6 was not the only model available at the time; YOLOv5, YOLOX, and YOLOv7 were all competing candidates for efficient, deployable detectors. Now, let’s discuss the changes that YOLOv6 introduced.

Architecture
YOLO family, YOLOv6 architecture.
The YOLOv6 framework. Source.
  • Backbone: The researchers built the backbone using EfficientRep, a hardware-aware CNN featuring the RepBlock for small models (N and S) and the CSPStackRep Block for larger models (M and L).
  • Neck: Uses a Rep-PAN topology, enhancing the modified PAN topology of YOLOv4 and YOLOv5 with RepBlocks or CSPStackRep Blocks, which gives more efficient feature aggregation from different levels of the backbone.
  • Head: YOLOv6 introduced the Efficient Decoupled Head, simplifying the design for improved efficiency. It utilizes a hybrid-channel strategy, reducing the number of middle 3×3 convolutional layers and scaling the width jointly with the backbone and neck (see the sketch after this list).
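The decoupling is easiest to see in code: classification and box regression get separate branches instead of sharing one output. This is a generic PyTorch sketch of the concept, not YOLOv6's exact Efficient Decoupled Head.

import torch
import torch.nn as nn

class DecoupledHead(nn.Module):
    def __init__(self, in_channels=64, num_classes=80):
        super().__init__()
        # Classification branch: per-class scores.
        self.cls_branch = nn.Sequential(
            nn.Conv2d(in_channels, in_channels, 3, padding=1), nn.SiLU(),
            nn.Conv2d(in_channels, num_classes, 1),
        )
        # Regression branch: 4 box offsets plus 1 objectness score.
        self.reg_branch = nn.Sequential(
            nn.Conv2d(in_channels, in_channels, 3, padding=1), nn.SiLU(),
            nn.Conv2d(in_channels, 4 + 1, 1),
        )

    def forward(self, feat):
        # The two tasks no longer share weights, reducing their conflict.
        return self.cls_branch(feat), self.reg_branch(feat)

cls_out, reg_out = DecoupledHead()(torch.randn(1, 64, 52, 52))
print(cls_out.shape, reg_out.shape)  # (1, 80, 52, 52) and (1, 5, 52, 52)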

YOLOv6 also incorporates several other techniques to enhance performance.

  • Label Assignment: Utilizes Task Alignment Learning (TAL) to address the misalignment between classification and box regression tasks.
  • Self-Distillation: It applies self-distillation to both the classification and regression tasks, improving accuracy even further (a generic sketch of the idea follows this list).
  • Loss Function: It employs VariFocal Loss for classification and SIoU or GIoU loss for regression, depending on the model size.
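At its core, self-distillation adds one extra loss term: the student is pulled toward the soft predictions of a teacher (here, the model's own pre-trained copy) alongside the usual hard-label loss. The sketch below shows that generic composition with a classification-style KL term; it illustrates the idea, not YOLOv6's exact formulation.

import torch
import torch.nn.functional as F

def self_distillation_loss(student_logits, teacher_logits, targets, alpha=0.5, temp=2.0):
    # Standard task loss against the hard labels.
    task_loss = F.cross_entropy(student_logits, targets)
    # KL term pulling the student toward the teacher's softened predictions.
    kd_loss = F.kl_div(
        F.log_softmax(student_logits / temp, dim=1),
        F.softmax(teacher_logits / temp, dim=1),
        reduction="batchmean",
    ) * (temp * temp)
    return task_loss + alpha * kd_loss

student = torch.randn(8, 80, requires_grad=True)   # batch of 8, 80 classes
teacher = torch.randn(8, 80)                       # frozen teacher predictions
labels = torch.randint(0, 80, (8,))
print(self_distillation_loss(student, teacher, labels))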
YOLO family, YOLOv6 performance.
Comparison of state-of-the-art efficient object detectors. Source.

YOLOv6 represents a refined and enhanced approach to object detection, building upon the strengths of its predecessors while introducing innovative solutions to address the challenges of real-world deployment. Its focus on efficiency, accuracy, and deployability makes it a valuable tool for industrial applications.

YOLOv7: Trainable Bag-of-Freebies

While technically YOLOv6 was introduced before YOLOv7, the production version of YOLOv6 came after YOLOv7 and surpassed it in performance. However, YOLOv7 introduced a novel concept it called the trainable bag-of-freebies (BoF): a series of fine-grained refinements rather than a complete overhaul.

These refinements are primarily focused on optimizing the training process and enhancing the model’s ability to learn effective representations without significantly increasing the computational costs. Following are some of the key features YOLOv7 introduced.

  • Model Re-parameterization: YOLOv7 proposes planned re-parameterization, a strategy applicable to layers in different networks based on the concept of the gradient propagation path (a simplified sketch follows this list).
  • Dynamic Label Assignment: The training of the model with multiple output layers presents a new issue: “How to assign dynamic targets for the outputs of different branches?” To solve this problem, YOLOv7 introduces a new label assignment method called coarse-to-fine lead guided label assignment.
  • Extended and Compound Scaling: YOLOv7 proposes “extend” and “compound scaling” methods for the object detector that can effectively utilize parameters and computation.
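Planned re-parameterization builds on RepVGG-style structural re-parameterization: during training a block holds parallel 3×3, 1×1, and identity branches, and at inference time they are algebraically fused into a single 3×3 convolution. Here is a simplified sketch of that fusion, with batch normalization omitted for brevity; it illustrates the underlying principle rather than YOLOv7's planned variant.

import torch
import torch.nn.functional as F

def fuse_branches(k3, k1):
    # k3: (C, C, 3, 3) kernel; k1: (C, C, 1, 1) kernel.
    channels = k3.shape[0]
    k1_as_3x3 = F.pad(k1, [1, 1, 1, 1])      # embed the 1x1 kernel at the 3x3 center
    k_identity = torch.zeros_like(k3)
    for c in range(channels):
        k_identity[c, c, 1, 1] = 1.0         # identity branch as a 3x3 kernel
    return k3 + k1_as_3x3 + k_identity       # one kernel, same outputs

c = 8
x = torch.randn(1, c, 32, 32)
k3, k1 = torch.randn(c, c, 3, 3), torch.randn(c, c, 1, 1)
multi_branch = F.conv2d(x, k3, padding=1) + F.conv2d(x, k1) + x
fused = F.conv2d(x, fuse_branches(k3, k1), padding=1)
print(torch.allclose(multi_branch, fused, atol=1e-4))  # True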
YOLOs, YOLOv7 performance comparison
Comparison with other real-time object detectors. Source.

YOLOv7 focuses on fine-grained refinements and optimization strategies to enhance the performance of real-time object detectors. Its emphasis on the trainable bag-of-freebies, deep supervision, and architectural improvements leads to a significant boost in accuracy without sacrificing speed, making it a valuable advancement in the YOLO series. The evolution keeps going, however, producing YOLOv8, our subject next.

YOLOv8

YOLOv8 is an iteration in the YOLO series of real-time object detectors, offering cutting-edge performance in terms of accuracy and speed. Like YOLOv5, YOLOv8 does not have an official paper; it is a user-friendly, enhanced YOLO object detection model. Developed by Ultralytics, YOLOv8 introduces new features and optimizations that make it an ideal choice for object detection tasks in a wide range of applications. Here is a quick overview of its features.

  • Advanced Backbone and Neck Architectures: Refined backbone and neck designs improve feature extraction and overall detection performance.
  • Anchor-free Split Ultralytics Head: YOLOv8 adopts an anchor-free split Ultralytics head, which contributes to better accuracy and a more efficient detection process compared to anchor-based approaches.
  • Optimized Accuracy-Speed Tradeoff: A focus on balancing accuracy and speed makes YOLOv8 suitable for real-time applications.

In addition to all that, YOLOv8 is well maintained by Ultralytics, which offers a diverse range of models, each specialized for a specific computer vision task, like detection, segmentation, classification, and pose estimation. Since YOLOv8 is easy to use through the Ultralytics library, let’s try it in a demo.

Demo

This demo will simply use the Ultralytics library in Python to run inference with YOLOv8 models. There are many ways and options to run YOLO models; the Ultralytics documentation of the prediction options covers them in detail.

from ultralytics import YOLO
# Load a COCO-pretrained YOLOv8s model
model = YOLO("yolov8s.pt")

This imports the Ultralytics library and loads the YOLOv8 small model, which has a good balance between accuracy and speed. If the Ultralytics library is not already installed, install it in your notebook with: !pip install ultralytics. Now, let’s test the model on some images.

results = model("/content/city.jpg", save=True, conf=0.5)

This will run the detection task, save the result, and only include bounding boxes of at least 0.5 confidence. Below is the result of the sample image I used.

YOLOv8 results
Inference example with YOLOv8.

YOLOv8 models achieve SOTA performance across various benchmarking datasets. For instance, the YOLOv8n model achieves a mAP (mean Average Precision) of 37.3 on the COCO dataset and a speed of 0.99 ms on A100 TensorRT. Next, let’s see how the YOLO family evolved further with YOLOv9.

YOLOv9: Accurate Learning

YOLOv9 takes a different approach compared to its predecessors, by directly addressing the issue of information loss in deep neural networks. It introduces the concept of Programmable Gradient Information (PGI) and a new architecture called Generalized Efficient Layer Aggregation Network (GELAN) to combat information bottlenecks and ensure reliable gradient flow during training.

The researchers introduced YOLOv9 because existing methods ignore the fact that when input data undergoes layer-by-layer feature extraction and spatial transformation, a large amount of information will be lost. This information loss leads to unreliable gradients and hinders the model’s ability to learn accurate representations.

YOLOv9 introduces PGI, a novel method for generating reliable gradients by using an auxiliary reversible branch. This auxiliary branch provides complete input information for calculating the objective function, ensuring that the gradients used to update the network weights are more informative. The reversible nature of the auxiliary branch ensures that no information is lost during the feedforward process.

YOLO family, YOLOv9 PGI
PGI and related network architectures and methods. Source.

YOLOv9 also proposes GELAN, a new lightweight architecture designed to maximize information flow and facilitate the acquisition of relevant information for prediction. GELAN is a generalized version of the ELAN architecture that allows the use of arbitrary computational blocks while maintaining efficiency and performance. The researchers designed it based on gradient path planning, ensuring efficient information flow through the network.

YOLOs, YOLOv9 GELAN
Visualization results of random initial weight output feature maps for different network architectures. Source.

YOLOv9 presents a refreshed perspective on object detection by focusing on information flow and gradient quality. The introduction of PGI and GELAN sets YOLOv9 apart from its predecessors. This focus on the fundamentals of information processing in deep neural networks leads to improved performance and better explainability of the learning process in object detection.

YOLOv10: Real-Time Object Detection

The introduction of YOLOv10 was revolutionary for real-time, end-to-end object detection. YOLOv10 surpassed previous speed and accuracy benchmarks, and its NMS-free detection eliminates the need for non-maximum suppression (NMS) post-processing.

This not only improves inference speed but also simplifies the deployment process. YOLOv10 introduced a few key features like NMS-free training and a holistic design approach that made it excel in all metrics.

YOLOs, YOLOv10 performance benchmarks
Comparisons with others in terms of latency-accuracy. Source.
  • NMS-Free Detection: YOLOv10 delivers a novel NMS-free training strategy based on consistent dual assignments. It employs dual label assignments (one-to-many and one-to-one) and a consistent matching metric to provide rich supervision during training while eliminating NMS during inference. During inference, only the one-to-one head is used, enabling NMS-free detection (for contrast, a sketch of the classic NMS step being removed follows the figure below).
  • Holistic Efficiency-Accuracy Driven Design: YOLOv10 adopts a holistic approach to model design, optimizing various components for both efficiency and accuracy. It introduces a lightweight classification head, spatial-channel decoupled downsampling, and rank-guided block design to reduce computational cost.
YOLOs, YOLOv10 architecture
Consistent dual assignments for NMS-free training. Source.
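For contrast, here is a minimal sketch of the classic NMS post-processing that earlier YOLO versions required after every forward pass: boxes are kept greedily by confidence, and any box overlapping a kept one too much is discarded. This sequential, threshold-sensitive step is what YOLOv10's one-to-one head makes unnecessary; the sample boxes and thresholds are illustrative.

def iou(a, b):
    # a, b: (x1, y1, x2, y2)
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union

def nms(detections, iou_threshold=0.5):
    # detections: list of (box, confidence), pruned greedily by confidence.
    detections = sorted(detections, key=lambda d: d[1], reverse=True)
    kept = []
    for box, conf in detections:
        if all(iou(box, k_box) < iou_threshold for k_box, _ in kept):
            kept.append((box, conf))
    return kept

dets = [((10, 10, 50, 50), 0.9), ((12, 12, 52, 52), 0.8), ((200, 200, 240, 240), 0.7)]
print(nms(dets))  # the 0.8 box is suppressed as a duplicate; the other two survive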

YOLO11: Architectural Enhancements

YOLO11 was released in September 2024. It brings a series of architectural refinements and a focus on enhancing computational efficiency without sacrificing accuracy.

It introduces novel components like the C3k2 block and the C2PSA block, which contribute to improved feature extraction and processing. This results in a slightly better performance, but with much fewer parameters for the model. Following are the key features of YOLO11.

  • C3k2 Block: YOLO11 introduces the C3k2 block, a computationally efficient implementation of the Cross Stage Partial (CSP) Bottleneck. It replaces the C2f block in the backbone and neck and employs two smaller convolutions instead of one large convolution, reducing processing time (see the arithmetic sketch after this list).
  • C2PSA Block: Introduces the Cross Stage Partial with Spatial Attention (C2PSA) block after the Spatial Pyramid Pooling – Fast (SPPF) block to enhance spatial attention. This attention mechanism allows the model to focus more effectively on important regions within the image, potentially improving detection accuracy.
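The "two smaller convolutions instead of one large convolution" trade is simple parameter arithmetic, sketched below with generic kernel sizes: two stacked 3×3 convolutions cover the same 5×5 receptive field as one 5×5 convolution with roughly 28% fewer weights. The channel count is an arbitrary example, and C3k2's exact kernel configuration may differ.

def conv_params(in_ch, out_ch, k):
    # Weight count of a single k x k convolution (bias omitted).
    return in_ch * out_ch * k * k

c = 64
one_large = conv_params(c, c, 5)      # one 5x5 conv
two_small = 2 * conv_params(c, c, 3)  # two stacked 3x3 convs
print(one_large, two_small)  # 102400 vs 73728 -- same receptive field, fewer weights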

With this, we have discussed the whole YOLO family of object detection models. But something tells me the evolution won’t stop there; innovation will continue, and we will see even better performance in the future.

The Future of YOLO Models

The YOLO family has consistently pushed the boundaries of computer vision. It has evolved from a simple architecture to a sophisticated system. Each version has introduced novel features and expanded the range of supported tasks.

Looking ahead, the trend of increasing accuracy, speed, and multi-task capabilities will likely continue. Potential areas of development include the following.

  • Improved Explainability: Making the model’s decision-making process more transparent.
  • Enhanced Robustness: Making the model more resilient to challenging conditions.
  • Efficient Deployment: Optimizing the model for various hardware platforms.

The advancements in YOLO models have significant implications for various industries. YOLO’s ability to perform real-time object detection has the potential to change how we interact with the visual world. However, it is important to address ethical considerations and potential biases. Ensuring fairness, accountability, and transparency is crucial for responsible innovation.