Object detection is considered to be one of the most challenging tasks in the computer vision field. While there are a handful of different object detection algorithms, in this article, we will have a closer look at YOLOv3 (You Only Look Once).
In particular, we will look at:
- What is YOLOv3?
- How does it work?
- What’s new in YOLOv3?
- Disadvantages vs. other Algorithms?
- What’s Next?
What is YOLOv3?
YOLOv3 (You Only Look Once, Version 3) is a real-time object detection algorithm that identifies specific objects in videos, live feeds, or images. YOLO uses features learned by a deep convolutional neural network to detect an object. Versions 1-3 of YOLO were created by Joseph Redmon and Ali Farhadi.
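The first version of YOLO was published in 2016 (its preprint appeared in 2015), and version 3, which is discussed extensively in this article, followed two years later in 2018. The original implementation is written in Darknet, the authors' own framework; YOLOv3 can also be run through libraries such as Keras or OpenCV.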
Object classification systems are used by Artificial Intelligence (AI) programs to perceive specific objects in a class as subjects of interest. The systems sort objects in images into groups where objects with similar characteristics are placed together, while others are neglected unless programmed to do otherwise.
Why the name “you only look once”?
As typical for object detectors, the features learned by the convolutional layers are passed onto a classifier which makes the detection prediction. In YOLO, the prediction is based on a convolutional layer that uses 1×1 convolutions.
YOLO is named “you only look once” because the network makes all of its predictions in a single forward pass over the image. Because the prediction layer uses 1×1 convolutions, the size of the prediction map is exactly the size of the feature map before it.
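To illustrate why a 1×1 convolution leaves the spatial size unchanged, here is a minimal pure-Python sketch. The toy feature map and the weight matrix are made-up values, and `conv1x1` is a hypothetical helper rather than part of any YOLO library: each output pixel is computed only from the input pixel at the same position, so only the channel depth changes.

```python
def conv1x1(feature_map, weights):
    """Apply a 1x1 convolution to an H x W x C_in feature map.

    weights is a C_out x C_in matrix. Each output pixel depends only on
    the input pixel at the same (row, col) position.
    """
    return [
        [
            [sum(w * x for w, x in zip(w_row, pixel)) for w_row in weights]
            for pixel in row
        ]
        for row in feature_map
    ]


# Toy 2 x 3 feature map with 2 input channels (illustrative values).
feature_map = [
    [[1.0, 2.0], [0.0, 1.0], [3.0, 1.0]],
    [[2.0, 2.0], [1.0, 0.0], [0.0, 0.0]],
]
# Hypothetical weights mapping 2 input channels to 4 output channels.
weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]

prediction_map = conv1x1(feature_map, weights)
height = len(prediction_map)
width = len(prediction_map[0])
depth = len(prediction_map[0][0])
print(height, width, depth)  # spatial size matches the input; depth is C_out
```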
How does YOLOv3 work? (Overview)
YOLO is a Convolutional Neural Network (CNN) for performing object detection in real-time. CNNs are classifier-based systems that can process input images as structured arrays of data and identify patterns between them. YOLO has the advantage of being much faster than other networks and still maintains accuracy.
The network looks at the whole image at test time, so its predictions are informed by the global context in the image. YOLO and other convolutional neural network algorithms “score” regions based on their similarities to predefined classes.
High-scoring regions are noted as positive detections of whatever class they most closely identify with. For example, in a live feed of traffic, YOLO can be used to detect different kinds of vehicles depending on which regions of the video score highly in comparison to predefined classes of vehicles.
The Architecture at a Glance
The YOLOv3 algorithm first separates an image into a grid. Each grid cell predicts some number of bounding boxes around objects that score highly with the aforementioned predefined classes. Each of these boxes is predicted relative to a predefined box shape called an anchor box.
Each bounding box has a confidence score that reflects how accurate the model expects that prediction to be, and each bounding box detects only one object. The anchor boxes are generated by clustering the dimensions of the ground truth boxes from the original dataset to find the most common shapes and sizes.
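The clustering step described above can be sketched as a small k-means loop over box widths and heights. This is only an illustration under stated assumptions: it uses an IoU-based similarity (the YOLOv2 paper describes a 1 − IoU distance for this purpose), and the box list, the starting anchors, and the function names are all made up for the example.

```python
def shape_iou(box, anchor):
    """IoU of two (width, height) pairs aligned at the same corner."""
    inter = min(box[0], anchor[0]) * min(box[1], anchor[1])
    union = box[0] * box[1] + anchor[0] * anchor[1] - inter
    return inter / union


def cluster_anchors(boxes, initial_anchors, iterations=20):
    """K-means over (width, height) pairs using IoU as the similarity."""
    anchors = list(initial_anchors)
    for _ in range(iterations):
        # Assign every box to the anchor it overlaps most.
        groups = [[] for _ in anchors]
        for box in boxes:
            best = max(range(len(anchors)),
                       key=lambda i: shape_iou(box, anchors[i]))
            groups[best].append(box)
        # Move each anchor to the mean shape of its group.
        new_anchors = []
        for anchor, group in zip(anchors, groups):
            if group:
                mean_w = sum(b[0] for b in group) / len(group)
                mean_h = sum(b[1] for b in group) / len(group)
                new_anchors.append((mean_w, mean_h))
            else:
                new_anchors.append(anchor)  # keep an anchor with no boxes
        if new_anchors == anchors:
            break  # assignments are stable
        anchors = new_anchors
    return anchors


# Made-up ground truth box sizes in pixels: three small, three large.
boxes = [(10, 12), (12, 10), (11, 11), (50, 100), (54, 96), (52, 104)]
anchors = cluster_anchors(boxes, initial_anchors=[(10, 12), (50, 100)])
print(anchors)
```

A real dataset would use many more boxes and more clusters; the loop itself stays the same.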
Other comparable algorithms that pursue the same objective are R-CNN (Region-based Convolutional Neural Networks, 2014), Fast R-CNN (an R-CNN improvement from 2015), and Mask R-CNN (2017).
However, unlike systems like R-CNN and Fast R-CNN, YOLO is trained to do classification and bounding box regression at the same time.
What’s New in YOLOv3?
The major differences between YOLOv3 and older versions are in speed, precision, and specificity of classes. YOLOv2 and YOLOv3 are worlds apart in terms of accuracy, speed, and architecture. YOLOv2 came out in 2016, two years before YOLOv3.
The following sections will give you an overview of what’s new in YOLOv3.
YOLOv2 used Darknet-19 as its backbone feature extractor, while YOLOv3 now uses Darknet-53. Darknet-53 is a backbone also made by the YOLO creators Joseph Redmon and Ali Farhadi.
Darknet-53 has 53 convolutional layers instead of the previous 19, making it more powerful than Darknet-19 and more efficient than competing backbones (ResNet-101 or ResNet-152).
Using the chart provided in the YOLOv3 paper by Redmon and Farhadi, we can see that Darknet-53 is 1.5 times faster than ResNet-101. This speed does not come at the cost of accuracy either, since Darknet-53 is still about as accurate as ResNet-152 yet two times faster.
YOLOv3 is fast and accurate in terms of mean average precision (mAP) and intersection over union (IOU) values as well. It runs significantly faster than other detection methods with comparable performance.
Moreover, you can easily trade off between speed and accuracy simply by changing the model’s size, with no retraining required.
Precision for Small Objects
The chart below (taken and modified from the YOLOv3 paper) shows the average precision (AP) of detecting small, medium, and large objects with various algorithms and backbones. The higher the AP, the more accurate the detector is for that object size.
YOLOv2 could not compete with other algorithms on small objects because of how inaccurate it was at detecting them. With an AP of 5.0, it paled compared to other algorithms like RetinaNet (21.8) or SSD513 (10.2), which had the second-lowest AP for small objects.
YOLOv3 increased the AP for small objects by 13.3, which is a massive advance from YOLOv2. However, the average precision (AP) for all objects (small, medium, large) is still less than RetinaNet.
Specificity of Classes
The new YOLOv3 uses independent logistic classifiers and binary cross-entropy loss for the class predictions during training. These changes make it possible to use complex datasets such as Google’s Open Images Dataset (OID) for YOLOv3 model training. OID contains dozens of overlapping labels, such as “man” and “person” for images in the dataset.
YOLOv3 uses a multilabel approach which allows classes to be more specific and be multiple for individual bounding boxes. Meanwhile, YOLOv2 used a softmax, which is a mathematical function that converts a vector of numbers into a vector of probabilities, where the probabilities of each value are proportional to the relative scale of each value in the vector.
Using a softmax makes it so that each bounding box can only belong to one class, which is sometimes not the case, especially with datasets like OID.
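The difference between the two approaches can be seen in a small sketch. The label names and logit values below are made up for illustration, and the 0.5 cutoff is an assumption; the point is only that softmax scores compete with each other and sum to 1, while independent logistic (sigmoid) scores let several overlapping labels be confident at once.

```python
import math


def softmax(logits):
    """Convert logits to probabilities that sum to 1 (single-label)."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]  # shift for stability
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z):
    """Independent logistic score for one class."""
    return 1.0 / (1.0 + math.exp(-z))


def binary_cross_entropy(probs, targets):
    """Mean binary cross-entropy over independent per-class predictions."""
    losses = [
        -(t * math.log(p) + (1 - t) * math.log(1 - p))
        for p, t in zip(probs, targets)
    ]
    return sum(losses) / len(losses)


labels = ["person", "man", "car"]   # overlapping labels, as in OID
logits = [3.0, 2.5, -4.0]           # made-up raw class scores for one box

softmax_probs = softmax(logits)
sigmoid_probs = [sigmoid(z) for z in logits]

# With sigmoid scores, every label above the cutoff is kept.
multilabel = [name for name, p in zip(labels, sigmoid_probs) if p > 0.5]
loss = binary_cross_entropy(sigmoid_probs, targets=[1, 1, 0])

print(softmax_probs)   # "person" and "man" have to share the probability mass
print(multilabel)      # both overlapping labels survive
```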
Disadvantages of YOLOv3 vs. Other Algorithms
The YOLOv3 AP does indicate a trade-off between speed and accuracy when compared to RetinaNet: RetinaNet reaches a higher AP, but its training time is greater than that of YOLOv3. However, the accuracy of detecting objects with YOLOv3 can be made equal to the accuracy when using RetinaNet by having a larger dataset, making it an ideal option for models that can be trained with large datasets.
An example of this would be common detection models like traffic detection, where plenty of data can be used to train the model since images of different vehicles are plentiful. On the other hand, YOLOv3 may not be ideal for niche models where large datasets can be hard to obtain.
The YOLOv3 installation is relatively straightforward. Installing some dependencies and libraries is necessary, and after that, it can easily be used for training models. YOLOv3 can be installed either directly onto a computer or through a notebook (such as Google Colaboratory or Jupyter). For both implementations, the commands remain the same.
The command for installing YOLOv3, assuming all libraries have been installed, is pip install YOLOv3. Before installing anything, however, I’d advise that you make sure pip belongs to a Python 3 installation. You can check this with the command pip -V, which prints the pip version along with the Python version it is tied to.
If for any reason you are unable to make pip point to Python 3, you can use the command “pip3 install ___” rather than just “pip.” Libraries to be installed before installing YOLOv3 using pip (along with their installation commands) are:
- OpenCV (Version 3.4 or more recent): pip install opencv-python
- Python (Version 3.6 or more recent):
- Check if you already have python: python --version
- Install python for the first time on Mac or Linux: brew install python (will need homebrew first if you don’t already have it: /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)")
- Install python for the first time on Windows: use this guide. You will need admin privileges on your computer.
- Tensorflow-gpu (Version 1.5.0 or later): pip install tensorflow
- Keras 2.1.3: pip install keras
Once you’ve downloaded all the above libraries, you can install YOLOv3 with the command pip install YOLOv3.
How to Use YOLOv3
The first step to using YOLOv3 would be to decide on a specific project. YOLOv3 performs real-time detections, so choosing a simple project that has an easy premise, such as detecting a certain kind of animal or car in a video, is ideal for beginners to get started with YOLOv3.
In this section, we will go over the essential steps and what you have to know for using YOLOv3 successfully.
Weights and cfg (or configuration) files can be downloaded from the website of the original creator of YOLOv3: https://pjreddie.com/darknet/yolo. You can also (more easily) use YOLO’s COCO pretrained weights by initializing the model with model = YOLOv3().
Using COCO’s pretrained weights means that you can only use YOLO for object detection with any of the 80 pretrained classes that come with the COCO dataset. This is a good option for beginners because it requires the least amount of new code and customization.
The following 80 classes are available using COCO’s pretrained weights:
'person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus', 'train', 'truck', 'boat', 'traffic light', 'fire hydrant', 'stop sign', 'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow', 'elephant', 'bear', 'zebra', 'giraffe', 'backpack', 'umbrella', 'handbag', 'tie', 'suitcase', 'frisbee', 'skis','snowboard', 'sports ball', 'kite', 'baseball bat', 'baseball glove', 'skateboard', 'surfboard', 'tennis racket', 'bottle', 'wine glass', 'cup', 'fork', 'knife', 'spoon', 'bowl', 'banana', 'apple', 'sandwich', 'orange', 'broccoli', 'carrot', 'hot dog', 'pizza', 'donut', 'cake', 'chair', 'couch', 'potted plant', 'bed', 'dining table', 'toilet', 'tv', 'laptop', 'mouse', 'remote', 'keyboard', 'cell phone', 'microwave', 'oven', 'toaster', 'sink', 'refrigerator', 'book', 'clock', 'vase', 'scissors', 'teddy bear', 'hair drier', 'toothbrush'
Making a Prediction
The convolutional layers included in the YOLOv3 architecture produce a detection prediction after passing the features learned onto a classifier or regressor. These features include the class label, coordinates of the bounding boxes, sizes of the bounding boxes, and more.
Since the prediction with YOLO uses 1 x 1 convolutions, the size of the prediction map is exactly the size of the feature map before it.
In YOLOv3 and its other versions, the way this prediction map is interpreted is that each cell predicts a fixed number of bounding boxes. Then, whichever cell contains the center of the ground truth box of an object of interest is designated as the cell that will be finally responsible for predicting the object. There is a ton of mathematics behind the inner workings of the prediction architecture.
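As a rough sketch of that assignment rule, the snippet below finds the grid cell that contains a ground truth box center. The function name, the 416×416 image size, and the 13×13 grid are assumptions chosen for the example.

```python
def responsible_cell(box_center, image_size, grid_size):
    """Return (row, col) of the grid cell containing the box center.

    box_center is (x, y) in pixels, image_size is (width, height).
    """
    x, y = box_center
    width, height = image_size
    # Scale pixel coordinates to grid units, then take the integer part.
    col = min(int(x / width * grid_size), grid_size - 1)
    row = min(int(y / height * grid_size), grid_size - 1)
    return row, col


# A ground truth box centered at (208, 100) in a 416 x 416 image.
cell = responsible_cell((208, 100), (416, 416), grid_size=13)
print(cell)
```

The `min(...)` guard only matters for a center lying exactly on the right or bottom image edge, which would otherwise index one cell past the grid.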
- Anchor Boxes
Although anchor boxes were discussed a little bit at the beginning of this article, there is a bit more detail about implementing them and using them with YOLOv3. Object detectors like YOLOv3 usually predict log-space transforms, which are offsets to predefined “default” bounding boxes. Those predefined bounding boxes are called anchors. The transforms are later applied to the anchor boxes to obtain a prediction. YOLOv3 in particular has three anchors per detection scale. This results in the prediction of three bounding boxes per cell at each scale (the cell is also called a neuron in more technical terms).
- Non-Maximum Suppression
Objects can sometimes be detected multiple times when more than one bounding box detects the object as a positive class detection. Non-maximum suppression (NMS) avoids this by keeping the highest-scoring box and discarding other boxes that overlap it too much. Boxes below the confidence threshold are dropped first, and the NMS threshold sets how much overlap is tolerated before a box counts as a duplicate. It is an essential part of using YOLOv3 effectively.
Here, we briefly described a few of the features that make the predictions possible, such as anchor boxes and non-maximum suppression (NMS) values. This is, however, not a complete representation of all the features that go into creating a successful prediction with YOLOv3. For full descriptions of YOLOv3’s mathematical background, I suggest reading the official YOLOv3 paper linked at the end of this article.
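A minimal single-class sketch of non-maximum suppression is shown below. It is not the code of any particular YOLOv3 implementation: the function names are hypothetical, boxes are assumed to be (x1, y1, x2, y2) corner coordinates, the 0.45 NMS threshold is a commonly used example value, and the detections are made up.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(detections, conf_threshold=0.25, nms_threshold=0.45):
    """Keep the best box per object from a list of (box, score) pairs."""
    # Drop low-confidence boxes, then visit the rest from best to worst.
    candidates = sorted(
        (d for d in detections if d[1] >= conf_threshold),
        key=lambda d: d[1],
        reverse=True,
    )
    kept = []
    for box, score in candidates:
        # Keep a box only if it does not overlap a better box too much.
        if all(iou(box, other) <= nms_threshold for other, _ in kept):
            kept.append((box, score))
    return kept


detections = [
    ((10, 10, 110, 110), 0.9),    # best box for the first object
    ((12, 12, 112, 112), 0.8),    # near-duplicate of the box above
    ((200, 200, 300, 300), 0.7),  # a second, separate object
    ((0, 0, 50, 50), 0.1),        # below the confidence threshold
]
kept = non_max_suppression(detections)
print(kept)
```

For multiple classes, a common approach is to run the same procedure separately per class so that overlapping objects of different classes do not suppress each other.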
Interpreting the results of a YOLO model prediction is just as nuanced as the actual implementation of the model. Multiple factors go into a successful interpretation and accuracy rating, such as the box confidence score and class confidence score used when creating a YOLOv3 computer vision model.
There are many other ways and features used when interpreting results, but these are just a few. Other YOLOv3 prediction features include the classification loss, loss function, objectness score, and more.
Class Confidence and Box Confidence Scores
Each bounding box has an x, y, w, h, and box confidence score value. The confidence score reflects how likely it is that the box contains an object, as well as how accurate that bounding box is.
The bounding box width and height (w and h) are normalized by the width and height of the image given. Then, x and y are offsets relative to the cell in question, so all 4 bounding box values are between 0 and 1. In addition, each cell has a set of conditional class probabilities, one per class: 20 for the PASCAL VOC dataset used with the original YOLO, or 80 for COCO.
The class confidence score for each final boundary box used as a positive prediction is equal to the box confidence score multiplied by the conditional class probability. The conditional class probability in this context is the probability that the detected object is part of a certain class (the class being the object of interest’s identification). YOLOv3’s prediction is, therefore, a 3-D tensor with height, width, and depth dimensions.
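As a rough illustration of those three dimensions, the sketch below computes the shape of each YOLOv3 output tensor. It follows the N × N × [3 × (4 + 1 + 80)] layout given in the YOLOv3 paper for COCO; the 416×416 input size and the strides of 32, 16, and 8 are assumptions about a typical configuration, and the function name is hypothetical.

```python
def yolo_output_shapes(input_size=416, num_classes=80,
                       anchors_per_scale=3, strides=(32, 16, 8)):
    """Return (height, width, depth) for each detection scale."""
    # Per anchor: 4 box values + 1 objectness score + one score per class.
    depth = anchors_per_scale * (4 + 1 + num_classes)
    return [(input_size // s, input_size // s, depth) for s in strides]


shapes = yolo_output_shapes()
# Every grid cell at every scale predicts one box per anchor.
total_boxes = sum(h * w * 3 for h, w, _ in shapes)

print(shapes)
print(total_boxes)
```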
There is some math that then takes place involving the spatial dimensions of the images and the tensors used in order to produce boundary box predictions, but that is complicated. If you are interested in learning what happens during this stage, I suggest the YOLOv3 Arxiv paper linked at the end of this article.
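For readers who want a taste of that math, here is a small sketch of how a raw prediction is turned into a box, using the formulas from the YOLOv3 paper: b_x = σ(t_x) + c_x, b_y = σ(t_y) + c_y, b_w = p_w · e^(t_w), b_h = p_h · e^(t_h). The function name, the pixel-based anchor, and the stride used to convert grid units to pixels are assumptions made for the example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def decode_box(raw, cell, anchor, stride):
    """Turn raw outputs (tx, ty, tw, th) into a box in pixels.

    cell is the (column, row) offset of the grid cell, anchor is the
    prior (width, height) in pixels, stride is pixels per grid cell.
    Returns (center_x, center_y, width, height).
    """
    tx, ty, tw, th = raw
    cx, cy = cell
    pw, ph = anchor
    # The sigmoid keeps the center inside the responsible cell.
    bx = (sigmoid(tx) + cx) * stride
    by = (sigmoid(ty) + cy) * stride
    # Width and height scale the anchor in log space.
    bw = pw * math.exp(tw)
    bh = ph * math.exp(th)
    return bx, by, bw, bh


# All-zero raw outputs give the cell center and the anchor shape itself.
box = decode_box((0.0, 0.0, 0.0, 0.0), cell=(6, 6),
                 anchor=(116, 90), stride=32)
print(box)
```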
For the final step, the boundary boxes with high confidence scores (more than 0.25) are kept as final predictions.
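The scoring and filtering described in this section can be sketched in a few lines. The class names, probabilities, and helper names are made up for illustration; the 0.25 cutoff is the value mentioned above.

```python
def class_confidences(box_confidence, class_probs):
    """Class confidence = box confidence x conditional class probability."""
    return [box_confidence * p for p in class_probs]


def best_detection(box_confidence, class_probs, labels, threshold=0.25):
    """Return (label, score) for the top class, or None if below threshold."""
    scores = class_confidences(box_confidence, class_probs)
    best = max(range(len(scores)), key=lambda i: scores[i])
    if scores[best] > threshold:
        return labels[best], scores[best]
    return None


labels = ["car", "truck", "bus"]  # made-up subset of classes

confident = best_detection(0.8, [0.9, 0.05, 0.05], labels)
uncertain = best_detection(0.3, [0.5, 0.3, 0.2], labels)

print(confident)  # kept: the "car" score is well above 0.25
print(uncertain)  # dropped: the best score is 0.3 * 0.5 = 0.15
```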
The YOLOv3 algorithm has a multitude of credible resources created by the author and makers of the algorithm itself. For any purpose, primary resources are always best for getting accurate information on the topic, but for YOLOv3, these resources are even more important because of all the second-hand information available on its use – especially about newer YOLO versions that weren’t created by the original author.
In researching for this article, the most useful primary resources were:
- YOLOv1, paper on the first version of the architecture: Redmon, Joseph, Santosh Divvala, Ross Girshick, and Ali Farhadi. “You Only Look Once: Unified, Real-Time Object Detection.” (2015) – Find it here.
- YOLOv3, paper on the third version of YOLO: Redmon, Joseph, and Ali Farhadi. “YOLOv3: An Incremental Improvement.” (2018) – Find it here.
- YOLOv3 source code and algorithm specifics by the original author (Joseph Redmon) – Find it here.
- Results from the Paper for YOLOv3: Paperswithcode, YOLOv3: An Incremental Improvement (uploaded by Redmon and Farhadi) – Find it here.
YOLO is just one of many algorithms used extensively in artificial intelligence. There is another article we have written on the new version of YOLO, YOLOv5, discussing the controversy around the new architecture and its validity.
We suggest you check it out for more information about YOLO, as well as why the original author of YOLO did not make the new versions 4 and 5: YOLOv5 Is Here! Is It Real or a Fake?