Object detection is considered to be one of the most challenging tasks in the computer vision field. While there are a handful of different object detection algorithms, in this article we will have a closer look at YOLOv3 (You Only Look Once). In particular, we will look at:
- What is YOLOv3?
- How does it work?
- What’s new in YOLOv3?
- Disadvantages vs. other Algorithms?
- What’s Next?
What is YOLOv3?
YOLOv3 (You Only Look Once, Version 3) is a real-time object detection algorithm that identifies specific objects in videos, live feeds, or images. Versions 1-3 of YOLO were created by Joseph Redmon and Ali Farhadi. The first version of YOLO was created in 2016, and version 3, which is discussed extensively in this article, followed two years later in 2018. The original implementation is written in the authors' own Darknet framework, and YOLOv3 models can also be run through the Keras or OpenCV deep learning libraries.
Object classification systems are used by Artificial Intelligence (AI) programs to perceive specific objects in a class as subjects of interest. The systems sort objects in images into groups where objects with similar characteristics are placed together, while others are neglected unless programmed to do otherwise.
How does YOLOv3 work? (Overview)
YOLO is a Convolutional Neural Network (CNN) for object detection. CNNs are classifier-based systems that process input images as structured arrays of data and identify patterns in them. YOLO has the advantage of being much faster than other networks while still maintaining accuracy.
YOLO looks at the whole image at test time, so its predictions are informed by the global context of the image. YOLO and other convolutional neural network algorithms “score” regions based on their similarity to predefined classes. High-scoring regions are recorded as positive detections of whichever class they most closely match. For example, in a live feed of traffic, YOLO can be used to detect different kinds of vehicles depending on which regions of the video score highly against predefined classes of vehicles.
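The scoring step described above can be sketched in a few lines. This is only an illustration, not YOLO's actual code: the `pick_detections` function, the 0.5 threshold, and the traffic scores are all hypothetical values chosen for the example.

```python
def pick_detections(region_scores, threshold=0.5):
    """Return (region_index, class_name, score) for every region whose
    best class score reaches the threshold (threshold value is an assumption)."""
    detections = []
    for index, scores in enumerate(region_scores):
        # The class this region most closely matches.
        best_class = max(scores, key=scores.get)
        best_score = scores[best_class]
        if best_score >= threshold:
            detections.append((index, best_class, best_score))
    return detections


# Hypothetical class scores for three regions of one traffic frame.
regions = [
    {"car": 0.91, "truck": 0.30, "bus": 0.05},
    {"car": 0.10, "truck": 0.12, "bus": 0.08},  # background: nothing scores highly
    {"car": 0.20, "truck": 0.15, "bus": 0.77},
]

for region_index, class_name, score in pick_detections(regions):
    print(region_index, class_name, score)
```

Only the first and third regions pass the threshold, so the second region is ignored as background.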
The Architecture at a Glance
The YOLOv3 algorithm first divides an image into a grid. Each grid cell predicts a fixed number of bounding boxes around objects that score highly against the aforementioned predefined classes. Each bounding box carries a confidence score that reflects how accurate the model believes the prediction to be, and each box detects only one object. The boxes are predicted as adjustments to a set of predefined shapes called anchor boxes, which are generated by clustering the dimensions of the ground truth boxes from the original dataset to find the most common shapes and sizes.
Other comparable algorithms that can carry out the same objective are R-CNN (Region-based Convolutional Neural Networks, introduced in 2014), Fast R-CNN (an improvement on R-CNN published in 2015), and Mask R-CNN. However, unlike systems like R-CNN and Fast R-CNN, YOLO is trained to do classification and bounding box regression at the same time.
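To make the clustering step concrete, here is a minimal sketch of k-means over ground truth box sizes, using 1 - IoU as the distance so that large and small boxes are treated evenly. The function names, the random initialization, and the toy box sizes are assumptions made for this example, and it is not the authors' implementation. The YOLOv3 paper uses 9 clusters; the toy data below uses 3.

```python
import random


def wh_iou(box, centroid):
    """IoU of two (width, height) boxes that share the same top-left corner."""
    inter = min(box[0], centroid[0]) * min(box[1], centroid[1])
    union = box[0] * box[1] + centroid[0] * centroid[1] - inter
    return inter / union


def kmeans_anchors(boxes, k, iterations=100, seed=0):
    """Cluster (width, height) pairs; highest IoU means closest centroid."""
    rng = random.Random(seed)
    centroids = rng.sample(boxes, k)  # assumption: start from k random boxes
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for box in boxes:
            best = max(range(k), key=lambda i: wh_iou(box, centroids[i]))
            clusters[best].append(box)
        new_centroids = []
        for i, cluster in enumerate(clusters):
            if not cluster:  # keep the old centroid if nothing was assigned
                new_centroids.append(centroids[i])
                continue
            mean_w = sum(b[0] for b in cluster) / len(cluster)
            mean_h = sum(b[1] for b in cluster) / len(cluster)
            new_centroids.append((mean_w, mean_h))
        if new_centroids == centroids:  # converged
            break
        centroids = new_centroids
    # Sort from smallest to largest area for readability.
    return sorted(centroids, key=lambda c: c[0] * c[1])


# Hypothetical ground truth box sizes in pixels (width, height).
boxes = [(10, 12), (12, 10), (11, 11),
         (50, 100), (52, 98), (48, 102),
         (200, 90), (210, 95), (190, 85)]
anchors = kmeans_anchors(boxes, k=3)
print(anchors)
```

Because the starting centroids are random, the exact anchors depend on the seed; on real data the clustering is run over every labeled box in the training set.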
What’s New in YOLOv3?
There are major differences between YOLOv3 and older versions in terms of speed, precision and specificity of classes. The following paragraphs will give you an overview of what’s new in YOLOv3.
YOLOv2 used Darknet-19 as its backbone feature extractor, while YOLOv3 uses Darknet-53. Darknet-53 is a backbone also made by the YOLO creators, Joseph Redmon and Ali Farhadi. It has 53 convolutional layers instead of the previous 19, which makes it more powerful than Darknet-19 and more efficient than competing backbones such as ResNet-101 or ResNet-152.
Using the chart provided in the YOLOv3 paper by Redmon and Farhadi, we can see that Darknet-53 is 1.5 times faster than ResNet-101. This speed does not come at the cost of accuracy either: Darknet-53 is about as accurate as ResNet-152 yet two times faster.
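As a rough check on where the number 53 comes from, the layers can be counted from the backbone's layout. The stage sizes below are my reading of Table 1 in the YOLOv3 paper and should be treated as an assumption, as should the function name; by that reading the total is 52 convolutions plus one final fully connected layer.

```python
# Residual blocks per stage, as read from Table 1 of the YOLOv3 paper (assumption).
RESIDUAL_BLOCKS_PER_STAGE = [1, 2, 8, 8, 4]


def count_darknet53_layers(blocks_per_stage):
    """Return (convolutional_layers, total_weighted_layers)."""
    conv_layers = 1  # initial 3x3 convolution
    for blocks in blocks_per_stage:
        conv_layers += 1           # stride-2 convolution that halves the resolution
        conv_layers += 2 * blocks  # each residual block: 1x1 conv then 3x3 conv
    fully_connected = 1            # classification layer used for ImageNet training
    return conv_layers, conv_layers + fully_connected


conv_count, total_count = count_darknet53_layers(RESIDUAL_BLOCKS_PER_STAGE)
print(conv_count, total_count)
```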
YOLOv3 is fast and accurate in terms of mean average precision (mAP) and intersection over union (IOU) values as well. It runs significantly faster than other detection methods with comparable performance (hence the name, You Only Look Once). Moreover, you can easily trade off between speed and accuracy simply by changing the size of the model, with no retraining required.
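Since IOU underlies the accuracy figures quoted in this article, a short sketch of how it is computed may help. The `(x_min, y_min, x_max, y_max)` box format and the function name are assumptions made for the example.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    # Corners of the overlapping rectangle, if any.
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])

    # Clamp at zero so non-overlapping boxes get an empty intersection.
    inter_w = max(0.0, ix_max - ix_min)
    inter_h = max(0.0, iy_max - iy_min)
    intersection = inter_w * inter_h

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


predicted = (0, 0, 10, 10)
ground_truth = (5, 5, 15, 15)
print(iou(predicted, ground_truth))  # 25 / 175, roughly 0.143
```

A prediction is usually counted as correct only when its IOU with a ground truth box exceeds a chosen threshold, such as 0.5.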
Precision for Small Objects
The chart below (taken and modified from the YOLOv3 paper) shows the average precision (AP) of detecting small, medium, and large objects with various algorithms and backbones. The higher the AP, the more accurate the detector is for that object size.
The precision for small objects in YOLOv2 was not competitive with other algorithms because of how inaccurate YOLO was at detecting small objects. With an AP of 5.0, it fell well behind algorithms like RetinaNet (21.8) and even SSD513 (10.2), which had the second-lowest AP for small objects.
YOLOv3 increased the AP for small objects by 13.3 points to 18.3, which is a massive advance over YOLOv2. However, the average precision (AP) across all objects (small, medium, large) is still lower than RetinaNet's.
Specificity of Classes
The new YOLOv3 uses independent logistic classifiers and binary cross-entropy loss for the class predictions during training. These changes make it possible to use complex datasets such as Google’s Open Images Dataset (OID) for YOLOv3 model training. OID contains dozens of overlapping labels, such as “man” and “person” for images in the dataset.
YOLOv3 uses a multilabel approach, which allows classes to be more specific and lets a single bounding box carry several labels. Meanwhile, YOLOv2 used a softmax, which is a mathematical function that converts a vector of numbers into a vector of probabilities that sum to 1, where each probability is proportional to the exponential of the corresponding value. Using a softmax forces each bounding box to belong to exactly one class, which is sometimes not the case, especially with datasets like OID.
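The difference between the two approaches can be sketched with a small example. The labels, raw scores, and 0.5 threshold below are hypothetical, and the helper functions are written out by hand for illustration rather than taken from any YOLO implementation.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    """Independent logistic classifier: one probability per class."""
    return 1.0 / (1.0 + math.exp(-x))


def binary_cross_entropy(probs, targets, eps=1e-7):
    """Mean binary cross-entropy over all classes of one box."""
    loss = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        loss += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return loss / len(probs)


labels = ["person", "man", "car"]
logits = [3.0, 2.5, -4.0]  # hypothetical raw class scores for one box

softmax_probs = softmax(logits)
sigmoid_probs = [sigmoid(x) for x in logits]

# With softmax the classes compete, so only one can exceed 0.5.
print([l for l, p in zip(labels, softmax_probs) if p > 0.5])
# With independent sigmoids, overlapping labels can both be active.
print([l for l, p in zip(labels, sigmoid_probs) if p > 0.5])

# Training target where both "person" and "man" are correct for the box.
print(binary_cross_entropy(sigmoid_probs, [1, 1, 0]))
```

The softmax output keeps only “person”, while the sigmoid outputs keep both “person” and “man”, which is the behavior needed for overlapping labels.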
Disadvantages of YOLOv3 vs. Other Algorithms
The YOLOv3 AP does indicate a trade-off between speed and accuracy when compared to RetinaNet: RetinaNet is more accurate overall but, according to the YOLOv3 paper, takes considerably longer to process an image. The accuracy of YOLOv3 may be brought closer to that of RetinaNet by training on a larger dataset, which makes it a good option for models that can be trained with large datasets. An example of this would be common detection models like traffic detection, where images of different vehicles are plentiful. YOLOv3 may not be ideal for niche models where large datasets are hard to obtain.