YOLOv3: Real-Time Object Detection Algorithm (What’s New?)

Contents

Object detection is considered to be one of the most challenging tasks in the computer vision field. While there are a handful of different object detection algorithms, in this article we will have a closer look at YOLOv3 (You Only Look Once). In particular, we will look at:

  1. What is YOLOv3?
  2. How does it work?
  3. What’s new in YOLOv3?
  4. Disadvantages vs. Other Algorithms?
  5. What’s Next?

What is YOLOv3?

YOLOv3 (You Only Look Once, Version 3) is a real-time object detection algorithm that identifies specific objects in videos, live feeds, or images. Versions 1-3 of YOLO were created by Joseph Redmon and Ali Farhadi. The first version of YOLO was released in 2016, and version 3, which is discussed extensively in this article, followed two years later in 2018. The original implementation was written in the authors' Darknet framework, and YOLOv3 is commonly run through deep learning libraries such as Keras or OpenCV's DNN module.
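
As a quick illustration, here is a minimal sketch of loading a pre-trained YOLOv3 model with OpenCV's DNN module. The config, weights, and image file names are placeholders; the official Darknet files must be downloaded separately.

```python
import cv2

# Load the network from the Darknet config and weights files
net = cv2.dnn.readNetFromDarknet("yolov3.cfg", "yolov3.weights")

# YOLOv3 expects a square input blob (e.g. 416x416), scaled to [0, 1],
# with channels swapped from OpenCV's BGR to RGB
image = cv2.imread("traffic.jpg")
blob = cv2.dnn.blobFromImage(image, 1 / 255.0, (416, 416),
                             swapRB=True, crop=False)
net.setInput(blob)

# Run a forward pass through the three YOLO output layers
outputs = net.forward(net.getUnconnectedOutLayersNames())
print([o.shape for o in outputs])  # one prediction array per detection scale
```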

YOLOv3 Computer Vision Example – Source: Medium

Object classification systems are used by Artificial Intelligence (AI) programs to identify specific objects in a class as subjects of interest. These systems sort objects in images into groups, placing objects with similar characteristics together while ignoring others, unless programmed to do otherwise.

How does YOLOv3 work? (Overview)

YOLO is a Convolutional Neural Network (CNN) for performing object detection in real time. CNNs are classifier-based systems that process input images as structured arrays of data and recognize patterns within them. YOLO has the advantage of being much faster than other networks while still maintaining accuracy.

How YOLOv3 works – Source: Mini-YOLOv3: Real-Time Object Detector for Embedded Applications

This single-stage design allows the model to look at the whole image at test time, so its predictions are informed by the global context of the image. YOLO and other convolutional neural network algorithms “score” regions based on their similarities to predefined classes. High-scoring regions are flagged as positive detections of whichever class they most closely resemble. For example, in a live feed of traffic, YOLO can be used to detect different kinds of vehicles depending on which regions of the video score highly in comparison to predefined classes of vehicles.
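
Continuing the sketch above, this is one common way the raw scores become detections, assuming each output row follows the Darknet convention [center_x, center_y, width, height, objectness, class scores...], with coordinates normalized to the image size:

```python
import numpy as np

CONF_THRESHOLD = 0.5  # keep only regions that score highly against a class
NMS_THRESHOLD = 0.4   # overlap threshold for non-maximum suppression

h, w = image.shape[:2]
boxes, scores, class_ids = [], [], []
for output in outputs:
    for row in output:
        class_scores = row[5:]
        class_id = int(np.argmax(class_scores))
        # A region's score is its objectness times its best class probability
        confidence = float(row[4] * class_scores[class_id])
        if confidence > CONF_THRESHOLD:
            cx, cy, bw, bh = row[:4] * np.array([w, h, w, h])
            boxes.append([int(cx - bw / 2), int(cy - bh / 2), int(bw), int(bh)])
            scores.append(confidence)
            class_ids.append(class_id)

# Non-maximum suppression drops overlapping boxes for the same object
keep = cv2.dnn.NMSBoxes(boxes, scores, CONF_THRESHOLD, NMS_THRESHOLD)
```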

The Architecture at a Glance

The YOLOv3 algorithm first separates an image into a grid. Each grid cell predicts a fixed number of boundary boxes (sometimes referred to as anchor boxes) around objects that score highly against the aforementioned predefined classes. Each boundary box carries a confidence score for how accurate it assumes its prediction to be, and detects only one object per box. The boundary box priors are generated by clustering the dimensions of the ground-truth boxes from the original dataset to find the most common shapes and sizes, as sketched below.
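
A minimal sketch of that clustering step, in the spirit of the YOLO papers: k-means over ground-truth box dimensions with 1 - IoU as the distance, so box shape rather than absolute size drives the grouping. The `gt_wh` input is a hypothetical (N, 2) NumPy array of box widths and heights:

```python
import numpy as np

def iou_wh(wh, anchors):
    """IoU between one (w, h) pair and each anchor, both anchored at the origin."""
    inter = np.minimum(wh[0], anchors[:, 0]) * np.minimum(wh[1], anchors[:, 1])
    union = wh[0] * wh[1] + anchors[:, 0] * anchors[:, 1] - inter
    return inter / union

def kmeans_anchors(gt_wh, k=9, iters=100):
    """YOLOv3 uses k=9 anchor priors: three per detection scale."""
    anchors = gt_wh[np.random.choice(len(gt_wh), k, replace=False)].astype(float)
    for _ in range(iters):
        # Assign each ground-truth box to the anchor it overlaps most
        assign = np.array([np.argmax(iou_wh(wh, anchors)) for wh in gt_wh])
        # Move each anchor to the median shape of its assigned boxes
        for i in range(k):
            if np.any(assign == i):
                anchors[i] = np.median(gt_wh[assign == i], axis=0)
    return anchors
```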

Other comparable algorithms that carry out the same objective are R-CNN (Region-based Convolutional Neural Networks, introduced in 2014), Fast R-CNN (an R-CNN improvement developed in 2015), and Mask R-CNN (2017). However, unlike two-stage systems such as R-CNN and Fast R-CNN, which first propose regions and then classify them, YOLO is trained to do classification and bounding box regression at the same time.

What’s New in YOLOv3?

There are major differences between YOLOv3 and older versions in terms of speed, precision and specificity of classes. The following paragraphs will give you an overview of what’s new in YOLOv3.

Speed

YOLOv2 used Darknet-19 as its backbone feature extractor, while YOLOv3 uses Darknet-53. Darknet-53 is a backbone also made by the YOLO creators, Joseph Redmon and Ali Farhadi. Darknet-53 has 53 convolutional layers instead of the previous 19, which makes it more powerful than Darknet-19 and more efficient than competing backbones (ResNet-101 or ResNet-152).
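
To make the backbone concrete, here is a hedged sketch of Darknet-53's two building blocks in Keras (one of the libraries mentioned above). Each Darknet convolution is a convolution plus batch normalization and a leaky ReLU, and each residual unit squeezes the channels with a 1x1 convolution before restoring them with a 3x3 and adding the skip connection:

```python
from tensorflow.keras import layers

def darknet_conv(x, filters, kernel_size, strides=1):
    """Convolution + batch norm + leaky ReLU, the basic Darknet-53 unit."""
    x = layers.Conv2D(filters, kernel_size, strides=strides,
                      padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    return layers.LeakyReLU(alpha=0.1)(x)

def darknet_residual(x, filters):
    """Residual block: 1x1 squeeze, 3x3 expand, then a ResNet-style skip add."""
    shortcut = x
    x = darknet_conv(x, filters // 2, 1)
    x = darknet_conv(x, filters, 3)
    return layers.Add()([shortcut, x])
```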

Comparison of backbones. Accuracy, billions of operations (Ops), billion floating point operations per second (BFLOP/s), and frames per second (FPS) for various networks – Source: YOLOv3 Paper

Using the chart provided in the YOLOv3 paper by Redmon and Farhadi, we can see that Darknet-53 is 1.5 times faster than ResNet-101. The depicted accuracy doesn't entail any trade-off between accuracy and speed between the Darknet backbones either, since Darknet-53 is still as accurate as ResNet-152 yet two times faster.

YOLOv3 is fast and accurate in terms of mean average precision (mAP) and intersection over union (IoU) values as well. It runs significantly faster than other detection methods with comparable performance (hence the name: You Only Look Once). Moreover, you can easily trade off between speed and accuracy simply by changing the input size of the model, with no retraining required, as the sketch below shows.
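
Because the network is fully convolutional, the same weights can be run at several input resolutions. Reusing the `net` and `image` from the earlier sketch, the speed/accuracy dial is just the blob size:

```python
# Smaller inputs run faster; larger inputs detect more accurately
for size in (320, 416, 608):  # common YOLOv3 input resolutions
    blob = cv2.dnn.blobFromImage(image, 1 / 255.0, (size, size),
                                 swapRB=True, crop=False)
    net.setInput(blob)
    outputs = net.forward(net.getUnconnectedOutLayersNames())
```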

YOLOv3 runs much faster than other detection methods with a comparable performance using an M40/Titan X GPU – Source: Focal Loss for Dense Object Detection

Precision for Small Objects

The chart below (taken and modified from the YOLOv3 paper) shows the average precision (AP) of detecting small, medium, and large objects with various algorithms and backbones. The higher the AP, the more accurate the detector is for that object size.

The precision for small objects in YOLOv2 lagged far behind other algorithms because of how inaccurate YOLO was at detecting small objects. With an AP of 5.0, it paled in comparison to other algorithms like RetinaNet (21.8) or even SSD513 (10.2), which had the second-lowest AP for small objects.

YOLOv3 comparison for different object sizes showing the average precision (AP) for AP-S (small object size), AP-M (medium object size), AP-L (large object size) – Source: Focal Loss for Dense Object Detection

YOLOv3 increased the AP for small objects by 13.3, a massive advance from YOLOv2, largely because it now predicts boxes at three different scales. However, the average precision (AP) for all objects (small, medium, large) is still less than RetinaNet's.

Specificity of Classes

YOLOv3 uses independent logistic classifiers and binary cross-entropy loss for the class predictions during training. These changes make it possible to use complex datasets such as Microsoft's Open Images Dataset (OID) for YOLOv3 model training. OID contains dozens of overlapping labels, such as “man” and “person”, for images in the dataset.

YOLOv3 uses a multilabel approach, which allows the classes for an individual bounding box to be more specific and to be multiple. Meanwhile, YOLOv2 used a softmax, which is a mathematical function that converts a vector of numbers into a vector of probabilities proportional to the relative scale of each value in the vector. Using a softmax forces each bounding box to belong to exactly one class, which is sometimes not the case, especially with datasets like OID.
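
The difference is easy to see numerically. In this minimal sketch, the logits are hypothetical raw class scores for one bounding box; softmax makes the classes compete, while independent sigmoids let overlapping labels like “person” and “man” both fire:

```python
import numpy as np

logits = np.array([2.0, 1.8, -1.0])  # hypothetical scores: person, man, car

# Softmax: probabilities sum to 1, so only one label can dominate
softmax = np.exp(logits) / np.exp(logits).sum()
print(softmax)   # ~[0.54, 0.44, 0.03]

# Independent sigmoids: each class is a separate yes/no decision
sigmoid = 1 / (1 + np.exp(-logits))
print(sigmoid)   # ~[0.88, 0.86, 0.27] -- "person" and "man" both score highly
```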

Disadvantages of YOLOv3 vs. Other Algorithms

The YOLOv3 AP does indicate a trade-off between speed and accuracy when compared to RetinaNet: RetinaNet is more accurate, but its training time is greater than YOLOv3's. The accuracy of detecting objects with YOLOv3 can be brought on par with RetinaNet by training on a larger dataset, which makes it an ideal option for models that can be trained with large datasets. An example of this would be common detection tasks like traffic detection, where plenty of data can be used to train the model, since images of different vehicles are plentiful. Conversely, YOLOv3 may not be ideal for niche models where large datasets can be hard to obtain.

What’s Next?

If you enjoyed reading this article, we recommend exploring our other articles on object detection and computer vision.
