YOLOv6: Single-Stage Object Detection


Object detection is one of the crucial tasks in computer vision (CV). A detector such as YOLOv6 localizes objects within an image and classifies each marked region into a predefined category. The output is typically a set of bounding boxes, each with a class label.
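
As a minimal sketch of what such an output looks like in code, the snippet below models one detection as a box plus a label and computes intersection over union (IoU), the overlap measure that mAP scores are built on. The `Detection` class, its field names, and the sample values are illustrative assumptions, not part of YOLOv6.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """Hypothetical container: a box in pixel coordinates, a label, a score."""
    x1: float
    y1: float
    x2: float
    y2: float
    label: str
    score: float


def iou(a: Detection, b: Detection) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a.x2, b.x2) - max(a.x1, b.x1))
    inter_h = max(0.0, min(a.y2, b.y2) - max(a.y1, b.y1))
    inter = inter_w * inter_h
    area_a = (a.x2 - a.x1) * (a.y2 - a.y1)
    area_b = (b.x2 - b.x1) * (b.y2 - b.y1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Made-up example values: a prediction and the matching ground-truth box.
predicted = Detection(10, 10, 60, 60, "forklift", 0.91)
ground_truth = Detection(20, 20, 70, 70, "forklift", 1.0)
print(f"IoU = {iou(predicted, ground_truth):.3f}")
```

A prediction usually counts as correct only if its label matches and its IoU with a ground-truth box exceeds a chosen threshold.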

Computer vision researchers introduced the YOLO (You Only Look Once) architecture as an object-detection algorithm in 2015. It is a single-pass algorithm that uses one neural network to predict bounding boxes and class probabilities from a full image.


 

Figure: Object detection in an industrial environment, performed by YOLOv6

What is YOLOv6?

In September 2022, C. Li, L. Li, H. Jiang, et al. (Meituan Inc.) published the YOLOv6 paper. Their goal was to create a single-stage object-detection model for industry applications.

They introduced quantization methods to boost inference speed with little performance degradation, including Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT). YOLOv6 uses these methods to produce deployment-ready networks.

The researchers designed an efficient re-parameterizable backbone denoted as EfficientRep. For small models, the main component of the backbone is RepBlock during the training phase. During the inference phase, they converted each RepBlock to 3×3 convolutional layers (RepConv) with ReLU activation functions.
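
The idea behind re-parameterization can be illustrated with a small sketch. It assumes a single channel, no batch normalization, and a RepVGG-style block with three parallel branches (3×3 convolution, 1×1 convolution, identity). The real blocks are multi-channel and also fold batch-norm statistics into the kernel, which is omitted here. All function names are hypothetical.

```python
def conv3x3_same(image, kernel):
    """Single-channel 3x3 cross-correlation with zero padding (stride 1)."""
    height, width = len(image), len(image[0])
    out = [[0.0] * width for _ in range(height)]
    for i in range(height):
        for j in range(width):
            acc = 0.0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    y, x = i + di, j + dj
                    if 0 <= y < height and 0 <= x < width:
                        acc += kernel[di + 1][dj + 1] * image[y][x]
            out[i][j] = acc
    return out


def training_block(image, k3, k1):
    """Training-time form: 3x3 branch + 1x1 branch + identity branch."""
    y3 = conv3x3_same(image, k3)
    return [[y3[i][j] + k1 * image[i][j] + image[i][j]
             for j in range(len(image[0]))]
            for i in range(len(image))]


def fuse_kernels(k3, k1):
    """Fold the 1x1 and identity branches into the centre tap of the 3x3 kernel."""
    fused = [row[:] for row in k3]
    fused[1][1] += k1 + 1.0
    return fused


# Made-up input and weights, only to compare the two forms.
image = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
k3 = [[0.1, -0.2, 0.3], [0.0, 0.5, -0.1], [0.2, 0.1, -0.3]]
k1 = 0.25

multi_branch = training_block(image, k3, k1)
single_conv = conv3x3_same(image, fuse_kernels(k3, k1))
max_diff = max(abs(a - b)
               for row_a, row_b in zip(multi_branch, single_conv)
               for a, b in zip(row_a, row_b))
print(f"max difference between the two forms: {max_diff:.2e}")
```

Because all three branches are linear, the fused single convolution gives the same output (up to floating-point rounding) while needing only one kernel at inference time.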

 

Figure: RepBlocks in the training and the inference stage

YOLO Version History

YOLOv1

The YOLOv1 architecture reached a mean average precision (mAP) of 63.4 at an inference speed of 45 FPS on the open-source Pascal VOC 2007 dataset, far faster than the R-CNN family of detectors at the time. The model treats object detection as a regression task, predicting bounding boxes and class probabilities from a single pass over an image.

YOLOv2

Released in 2016, YOLOv2 could detect 9000+ object categories. It introduced anchor boxes: predefined bounding boxes, called priors, that the model adjusts to fit the position and size of an object. YOLOv2 achieved 76.8 mAP at 67 FPS on the Pascal VOC 2007 dataset.
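
As a rough illustration of how a prior is turned into a box, the sketch below follows the decoding described in the YOLOv2 paper: the centre is a sigmoid offset inside a grid cell, and the width and height scale the prior exponentially. The function and argument names are hypothetical, and all values are in grid-cell units.

```python
import math


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def decode_box(raw, cell, prior):
    """Turn raw network outputs into a box (centre x, centre y, width, height).

    raw   -- (tx, ty, tw, th) predicted by the network
    cell  -- (cx, cy) top-left corner of the grid cell
    prior -- (pw, ph) width and height of the anchor box
    """
    tx, ty, tw, th = raw
    cx, cy = cell
    pw, ph = prior
    return (cx + sigmoid(tx),     # centre stays inside the cell
            cy + sigmoid(ty),
            pw * math.exp(tw),    # prior is stretched or shrunk
            ph * math.exp(th))


# With all-zero outputs the box sits at the cell centre with the prior's size.
print(decode_box((0.0, 0.0, 0.0, 0.0), cell=(3, 4), prior=(1.5, 2.0)))
```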

YOLOv3

The authors released YOLOv3 in 2018. It boasted higher accuracy than the previous versions, with an mAP of 28.2 at 22 milliseconds on the COCO dataset. The model uses Darknet-53 as the backbone. To predict classes, it replaces softmax with independent logistic classifiers trained with binary cross-entropy (BCE) loss.

YOLOv4

Alexey Bochkovskiy et al. (2020) released YOLOv4, introducing the concept of a Bag of Freebies (BoF) and a Bag of Specials (BoS). BoF is a set of training-time techniques, such as data augmentation, that increase accuracy at no additional inference cost. BoS is a set of modules that significantly enhance accuracy with a slight increase in inference cost. The model achieved 43.5 mAP at 65 FPS on the COCO dataset.

 

Figure: Comparison among different YOLO-series detectors on the COCO 2017 dataset

YOLOv5

Without an official research paper, Ultralytics released YOLOv5 in 2020, the same year YOLOv4 was released. The model is easy to train since it is implemented in PyTorch. The architecture uses Cross-Stage Partial (CSP) connection blocks in the backbone, which improve gradient flow and reduce computational cost. YOLOv5 uses YAML files instead of CFG files for model size configuration.

YOLOv6

YOLOv6 is another unofficial version, introduced in 2022 by Meituan, a Chinese shopping platform. The company targeted the model at industrial applications, with better performance than its predecessor. In its updated release, YOLOv6-N (nano) achieves an mAP of 37.5 at 1187 FPS on the COCO dataset, and YOLOv6-S (small) achieves 45 mAP at 484 FPS. The figures reported in the original paper are listed in the performance section below.

YOLOv7

In July 2022, a group of researchers released the open-source model YOLOv7. At release, its authors reported that it surpassed known real-time object detectors in both speed and accuracy in the range from 5 to 160 FPS, with a highest mAP of 56.8%. YOLOv7 is based on the Extended Efficient Layer Aggregation Network (E-ELAN), which improves training by enabling the model to learn diverse features with efficient computation.

YOLOv8

YOLOv8 has no official paper (as with YOLOv5) but boasts higher accuracy and faster speed. For instance, YOLOv8 (medium) scores 50.2 mAP on the MS COCO dataset at 1.83 milliseconds per image on an A100 GPU with TensorRT. YOLOv8 also ships as a Python package with a CLI, making it easy to use and develop with.

 

Figure: YOLOv8 output for duck detection

YOLOv9

YOLOv9, released in February 2024 by C.Y. Wang, I.H. Yeh, and H.Y.M. Liao, is an improved real-time object detection model. To improve accuracy, the researchers utilized programmable gradient information (PGI) and the Generalized Efficient Layer Aggregation Network (GELAN).

YOLOv6 Architecture

The YOLOv6 architecture is composed of three parts: a backbone, a neck, and a head. The backbone mainly determines the feature representation ability. Its design also has a critical influence on inference efficiency, since it carries a large portion of the computation cost.

The neck’s purpose is to aggregate the low-level physical features with high-level semantic features and build up pyramid feature maps at all levels. The head consists of several convolutional layers and predicts final detection results according to multi-level features assembled by the neck.

Structurally, detection heads can be categorized as anchor-based or anchor-free, or alternatively as parameter-coupled or parameter-decoupled.

 

Figure: YOLOv6 architecture

Based on the principle of hardware-friendly network design, researchers proposed two scaled re-parameterizable backbones and necks to accommodate models of different sizes. Also, they introduced an efficient decoupled head with the hybrid-channel strategy. The overall architecture of YOLOv6 is shown in the figure above.

Performance of YOLOv6

Researchers used the same optimizer and learning schedule as YOLOv5, i.e. Stochastic Gradient Descent (SGD) with momentum and cosine decay on the learning rate. Also, they utilized a warm-up, grouped weight decay strategy, and the Exponential Moving Average (EMA).
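
The schedule described above can be sketched in a few lines. This is a minimal illustration of linear warm-up followed by cosine decay, plus an EMA update of the weights. The function names and all numeric values are assumptions chosen for readability, not the hyperparameters used for YOLOv6.

```python
import math


def learning_rate(step, total_steps, warmup_steps, base_lr, final_lr):
    """Linear warm-up to base_lr, then cosine decay down to final_lr."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return final_lr + 0.5 * (base_lr - final_lr) * (1 + math.cos(math.pi * progress))


def ema_update(ema_weights, weights, decay):
    """Exponential moving average of the weights, kept alongside the model."""
    return [decay * e + (1 - decay) * w for e, w in zip(ema_weights, weights)]


# Illustrative values only.
for step in (0, 5, 10, 55, 100):
    lr = learning_rate(step, total_steps=100, warmup_steps=10,
                       base_lr=0.01, final_lr=0.0001)
    print(f"step {step:3d}: lr = {lr:.5f}")

print(ema_update([1.0, 2.0], [0.0, 4.0], decay=0.9))
```

The EMA copy of the weights changes more smoothly than the raw weights, which is why it is commonly the copy used for evaluation.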

They adopted two strong data augmentations (Mosaic and Mixup), following previous YOLO versions. A complete list of hyperparameter settings is available on GitHub. They trained the model on the COCO 2017 training set and evaluated its accuracy on the COCO 2017 validation set.
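
For readers unfamiliar with the two augmentations, here is a deliberately simplified sketch. It assumes equally sized single-channel tiles stored as nested lists and a fixed 2×2 layout. Real implementations typically add a random mosaic centre, rescaling, and cropping. The function names are hypothetical.

```python
def mosaic(tiles, boxes_per_tile, tile_size):
    """Stitch four square tiles (top-left, top-right, bottom-left, bottom-right)
    into one image and shift each tile's boxes into the new coordinates."""
    tl, tr, bl, br = tiles
    canvas = ([left + right for left, right in zip(tl, tr)] +
              [left + right for left, right in zip(bl, br)])
    offsets = [(0, 0), (tile_size, 0), (0, tile_size), (tile_size, tile_size)]
    boxes = []
    for (dx, dy), tile_boxes in zip(offsets, boxes_per_tile):
        for x1, y1, x2, y2, label in tile_boxes:
            boxes.append((x1 + dx, y1 + dy, x2 + dx, y2 + dy, label))
    return canvas, boxes


def mixup(image_a, image_b, lam):
    """Blend two images pixel by pixel; detection pipelines usually keep
    the boxes of both images for the blended result."""
    return [[lam * a + (1 - lam) * b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(image_a, image_b)]


# Tiny 2x2 tiles filled with a constant so the layout is easy to see.
tiles = [[[v, v], [v, v]] for v in (1, 2, 3, 4)]
boxes_per_tile = [[(0, 0, 1, 1, "a")], [], [], [(0, 0, 2, 2, "b")]]
canvas, boxes = mosaic(tiles, boxes_per_tile, tile_size=2)
print(canvas)
print(boxes)
print(mixup([[0.0, 2.0]], [[4.0, 6.0]], lam=0.5))
```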

The researchers utilized eight NVIDIA A100 GPUs for training. In addition, they measured speed on an NVIDIA Tesla T4 GPU with TensorRT version 7.2.

 

Figure: Comparison of YOLOv6 with other object detectors on the COCO dataset

  • The developed YOLOv6-N achieves 35.9% AP on the COCO dataset at a throughput of 1234 FPS on an NVIDIA Tesla T4 GPU.
  • YOLOv6-S strikes 43.5% AP at 495 FPS, outperforming other mainstream detectors at the same scale (YOLOv5-S, YOLOX-S, and PPYOLOE-S).
  • The quantized version of YOLOv6-S even brings a new state-of-the-art 43.3% AP at 869 FPS.
  • Furthermore, YOLOv6-M/L achieves better accuracy performance (i.e., 49.5%/52.3%) than other detectors with similar inference speeds.
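
To put the throughput figures above in perspective, the short calculation below converts FPS into an amortized time per image and compares the quantized YOLOv6-S with the unquantized one. It only uses the numbers quoted in the list. Note that throughput may be measured with batching, so the per-image time is an average rather than a single-image latency.

```python
# (AP in percent, throughput in FPS) as quoted in the list above.
REPORTED = {
    "YOLOv6-N": (35.9, 1234),
    "YOLOv6-S": (43.5, 495),
    "YOLOv6-S (quantized)": (43.3, 869),
}


def per_image_ms(fps):
    """Amortized processing time per image in milliseconds."""
    return 1000.0 / fps


for name, (ap, fps) in REPORTED.items():
    print(f"{name}: {ap}% AP, {fps} FPS, about {per_image_ms(fps):.2f} ms per image")

speedup = REPORTED["YOLOv6-S (quantized)"][1] / REPORTED["YOLOv6-S"][1]
ap_drop = REPORTED["YOLOv6-S"][0] - REPORTED["YOLOv6-S (quantized)"][0]
print(f"quantization: {speedup:.2f}x throughput for {ap_drop:.1f} points of AP")
```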

Industry-Related Improvements

The authors introduced additional common practices and tricks to improve the performance including self-distillation and more training epochs.

  • For self-distillation, they used the teacher model to supervise both classification and box regression loss. They implemented the distillation of box regression using DFL (Distribution Focal Loss).
  • In addition, the balance between information from the soft labels (the teacher's predictions) and the hard labels (the ground truth) was adjusted dynamically via cosine decay. This helped the student selectively acquire knowledge at different phases of the training process.
  • Moreover, they encountered the problem of impaired performance without adding extra gray borders at evaluation, for which they provided some remedies.
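
The cosine-decayed weighting can be sketched as follows. This assumes the total loss has the form of a detection loss plus a weighted distillation term, with the weight falling from 1 to 0 over training. The exact schedule, the start and end values, and the function names are assumptions for illustration.

```python
import math


def distill_weight(epoch, total_epochs, start=1.0, end=0.0):
    """Cosine decay of the distillation weight from start to end."""
    progress = epoch / max(1, total_epochs - 1)
    return end + 0.5 * (start - end) * (1 + math.cos(math.pi * progress))


def total_loss(detection_loss, distillation_loss, epoch, total_epochs):
    """Hard-label loss plus a soft-label (teacher) term that fades out."""
    return detection_loss + distill_weight(epoch, total_epochs) * distillation_loss


# Illustrative loss values: early epochs lean on the teacher, late ones do not.
for epoch in (0, 150, 299):
    weight = distill_weight(epoch, total_epochs=300)
    loss = total_loss(2.0, 1.0, epoch, total_epochs=300)
    print(f"epoch {epoch:3d}: weight = {weight:.3f}, total loss = {loss:.3f}")
```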

 

Figure: YOLOv6 channel-wise distillation loss

Quantization Results

For industrial deployment, it is common practice to apply quantization to further speed up inference, ideally without much loss in model performance. Post-Training Quantization (PTQ) directly quantizes the model with only a small calibration set.
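
A minimal sketch of the PTQ idea, assuming symmetric per-tensor int8 quantization with a scale taken from the largest absolute value in the calibration set. Production toolchains use more elaborate calibration, and this is not the scheme used in YOLOv6. The function names are hypothetical.

```python
def calibrate_scale(calibration_values, num_bits=8):
    """Pick a scale so the largest calibration value maps to the int8 limit."""
    qmax = 2 ** (num_bits - 1) - 1          # 127 for int8
    max_abs = max(abs(v) for v in calibration_values)
    return max_abs / qmax if max_abs > 0 else 1.0


def quantize(values, scale, num_bits=8):
    """Round to the nearest integer step and clamp to the representable range."""
    qmax = 2 ** (num_bits - 1) - 1
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]


def dequantize(quantized, scale):
    """Map integers back to approximate floating-point values."""
    return [q * scale for q in quantized]


# Made-up calibration data and activations.
scale = calibrate_scale([-1.27, 0.5, 1.0])
activations = [1.27, -2.0, 0.004, 0.333]
quantized = quantize(activations, scale)
restored = dequantize(quantized, scale)
print(quantized)
print([round(v, 4) for v in restored])
```

Values outside the calibrated range are clipped (here -2.0), which is one reason the choice of calibration set matters. The quantize-then-dequantize round trip is also what a "fake quantizer" simulates in floating point during quantization-aware training.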

Quantization-Aware Training (QAT), by contrast, further improves performance with access to the training set and is typically used jointly with distillation. However, due to the heavy use of re-parameterization blocks in YOLOv6, previous PTQ techniques fail to produce high performance. It is also hard to incorporate QAT, because the fake quantizers must match between training and inference.

After the removal of quantization-sensitive layers in the v2.0 release, the researchers directly applied full QAT to YOLOv6-S trained with RepOptimizer.

They eliminated inserted quantizers through graph optimization to obtain higher accuracy and faster speed. Finally, they compared their results with the distillation-based quantization results from PaddleSlim (table below).

 

Table: QAT performance of YOLOv6-S compared with other quantized detectors

YOLOv6 – 2023 Update

In this latest release, the researchers renovated the network design and the training strategy. They showed the comparison of YOLOv6 with other models at a similar scale in the figure below. The new features of YOLOv6 include the following:

  • They renewed the neck of the detector with a Bi-directional Concatenation (BiC) module to provide more accurate localization signals. SPPF was simplified to form the SimCSPSPPF Block, which brings performance gains with negligible speed degradation.
  • They proposed an anchor-aided training (AAT) strategy to enjoy the advantages of both anchor-based and anchor-free paradigms without touching inference efficiency.
  • They deepened YOLOv6 to have another stage in the backbone and the neck. Therefore, it achieved a new state-of-the-art performance on the COCO dataset at a high-resolution input.
  • They introduced a new self-distillation strategy to boost the performance of the small YOLOv6 models. The heavier branch for DFL serves as an enhanced auxiliary regression branch during training and is removed at inference to avoid a marked speed decline.

Feature integration at multiple scales is a critical and effective component of object detection. A Feature Pyramid Network (FPN) aggregates high-level semantic features and low-level features via a top-down pathway, providing more accurate localization.
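
The top-down pathway can be sketched with plain nested lists. This simplified version assumes single-channel feature maps whose sides halve at each level, nearest-neighbour upsampling, and element-wise addition. The lateral 1×1 convolutions and smoothing convolutions of a real FPN are left out, and the function names are hypothetical.

```python
def upsample2x(feature_map):
    """Nearest-neighbour upsampling: repeat every value in a 2x2 block."""
    out = []
    for row in feature_map:
        wide = [value for value in row for _ in range(2)]
        out.append(wide)
        out.append(wide[:])
    return out


def add_maps(a, b):
    """Element-wise sum of two equally sized feature maps."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def top_down(levels):
    """levels is ordered fine -> coarse; returns merged maps in the same order."""
    merged = [levels[-1]]
    for finer in reversed(levels[:-1]):
        merged.insert(0, add_maps(finer, upsample2x(merged[0])))
    return merged


# Constant maps make it easy to see how coarse information flows downward.
fine = [[1] * 4 for _ in range(4)]
middle = [[10] * 2 for _ in range(2)]
coarse = [[100]]
pyramid = top_down([fine, middle, coarse])
print(pyramid[0][0], pyramid[1][0], pyramid[2][0])
```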

 

Figure: (a) The neck of YOLOv6 (N, S). For M/L, CSPStackRep replaces RepBlocks. (b) The structure of a BiC module

Further Enhancements

Subsequently, other works on Bi-directional FPN enhance the ability of hierarchical feature representation. PANet (Path Aggregation Network) adds an extra bottom-up pathway on top of FPN to shorten the information path of low-level and top-level features. That facilitates the propagation of accurate signals from low-level features.

BiFPN introduces learnable weights for different input features and simplifies PAN to achieve better performance with high efficiency. PRB-FPN was proposed to retain high-quality features for accurate localization through a parallel feature pyramid structure with bi-directional fusion and associated improvements.
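
The learnable weighting in BiFPN can be illustrated with the "fast normalized fusion" rule from the EfficientDet paper: each input gets a non-negative weight, and the weighted sum is divided by the sum of the weights plus a small epsilon. The sketch below assumes single-channel maps of equal size, and the names are hypothetical.

```python
def fast_normalized_fusion(feature_maps, raw_weights, eps=1e-4):
    """Weighted average of feature maps with ReLU-clipped, normalized weights."""
    weights = [max(0.0, w) for w in raw_weights]   # ReLU keeps weights >= 0
    norm = eps + sum(weights)
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    return [[sum(w * fm[i][j] for w, fm in zip(weights, feature_maps)) / norm
             for j in range(width)]
            for i in range(height)]


# Made-up inputs: one map from the top-down path, one from the same level.
top_down_map = [[1.0, 2.0]]
same_level_map = [[3.0, 4.0]]

# Equal weights give a plain average (eps set to 0 for an exact result).
print(fast_normalized_fusion([top_down_map, same_level_map], [1.0, 1.0], eps=0.0))
# A negative raw weight is clipped to 0, so that input is ignored.
print(fast_normalized_fusion([top_down_map, same_level_map], [-5.0, 2.0], eps=0.0))
```

In a trained network the raw weights are parameters learned by gradient descent. Here they are fixed by hand to show the effect.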

Final Thoughts on YOLOv6

The YOLO models are the standard in object detection methods with their great performance and wide applicability. Here are the conclusions about YOLOv6:

  • Usage: YOLOv6 is available on GitHub, so users can run it quickly from the command line or from Python.
  • YOLOv6 tasks: With real-time object detection and improved accuracy and speed, YOLOv6 is very efficient for industry applications.
  • YOLOv6 contributions: A key deployment-oriented contribution of YOLOv6 is its quantization pipeline, which eliminates inserted quantizers through graph optimization to obtain higher accuracy and faster speed.
