YOLOv3: Real-Time Object Detection Algorithm (Guide)


Object detection, a fundamental task in computer vision, focuses on recognizing and locating objects within visual data, enabling machines to interpret and understand their surroundings. While there are a handful of different object detection algorithms, this article examines YOLOv3 (You Only Look Once).

YOLOv3’s ability to provide accurate and rapid object detection has made it a prominent algorithm in computer vision applications, particularly in scenarios where real-time processing is crucial, such as autonomous vehicles, surveillance systems, and robotics.

This article examines:

  1. What is YOLOv3? (definition of YOLO in computer vision)
  2. How does it work?
  3. What’s new in YOLOv3?
  4. What are its disadvantages vs. other algorithms?
  5. What’s next?

 

About us: Viso.ai provides the leading end-to-end Computer Vision Platform Viso Suite. Leading organizations worldwide use it to develop, deploy, and scale all computer vision applications in one place, with automated infrastructure and no code. Use YOLOv3 out of the box and future-proof your systems. Get a demo for your organization.

 

Viso Suite – End-to-End Computer Vision and No-Code for Computer Vision Teams

 

What is YOLOv3? – Definition of YOLO

YOLOv3 (You Only Look Once, Version 3) is a real-time object detection algorithm that identifies specific objects in videos, live feeds, or images. YOLO uses features learned by a deep convolutional neural network to detect objects located in an image. Joseph Redmon and Ali Farhadi created YOLO versions 1-3, with the third version being the most accurate iteration of the original algorithm.

Redmon and Farhadi released the first version of YOLO in 2016 and YOLOv3 two years later, in 2018. YOLOv3 is an improved version of YOLO and YOLOv2. YOLO can be implemented using deep learning libraries such as Keras or OpenCV.

The official successors of YOLOv3 are YOLOv4 and YOLOv7 (2022), which marked the state of the art in object detection at its release. However, unofficial architectures, including YOLOv5, YOLOv6, and YOLOv8, have also been released.

 

YOLOv3 computer vision example in restaurants

 

Object classification systems, used by Artificial Intelligence (AI) programs, aim to perceive specific objects in a class as subjects of interest. These systems sort objects in images into groups based on shared characteristics, neglecting other objects unless programmed to do otherwise. The resulting groups help identify and categorize objects by their features, contributing to a more nuanced understanding of the predicted class to which each object belongs.

 

Why the Name “You Only Look Once”?

As is typical for object detectors, the features learned by the convolutional layers are passed to a classifier that makes the detection prediction. In YOLO, this prediction head is itself a convolutional layer that uses 1×1 convolutions.

YOLO stands for “you only look once” because the network looks at the image just once: a single forward pass produces all detections, instead of repeatedly evaluating image regions. Because the prediction head uses 1×1 convolutions, the prediction map has exactly the same spatial size as the feature map before it, which keeps the prediction step compact and streamlined.
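To make that shape relationship concrete, here is a minimal sketch of a YOLO-style 1×1 convolutional detection head. PyTorch is an assumption here (the original implementation is written in C on the Darknet framework), as are the exact layer dimensions:

```python
# Minimal sketch (assumed shapes): a 1x1 convolution as a YOLO-style
# detection head. For B anchors and C classes, each grid cell needs
# B * (5 + C) output channels: 4 box coordinates + 1 objectness + C classes.
import torch
import torch.nn as nn

B, C = 3, 80                                 # anchors per scale, COCO classes
feature_map = torch.randn(1, 1024, 13, 13)   # backbone output (batch, channels, h, w)

head = nn.Conv2d(1024, B * (5 + C), kernel_size=1)
prediction = head(feature_map)

print(prediction.shape)   # torch.Size([1, 255, 13, 13]) -- same 13x13 grid as the input
```

Because the kernel is 1×1, the spatial grid is untouched: every cell of the 13×13 feature map maps to exactly one cell of the 13×13 prediction map.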

 

How does YOLOv3 work? (Overview)

YOLO is a Convolutional Neural Network (CNN), a type of deep neural network, for performing object detection in real time. CNNs are classifier-based systems that process input images as structured arrays of data and recognize patterns in them. YOLO has the advantage of being much faster than other networks while still maintaining accuracy.

It allows the object detection model to look at the whole image at test time. This means that the global context in the image informs the predictions. YOLO and other convolutional neural network algorithms “score” regions based on their similarities to predefined classes.

High-scoring regions are noted as positive detections of whatever class they most closely identify with. For example, in self-driving car footage, YOLO can detect different kinds of vehicles depending on which regions of the video score highly against predefined classes of vehicles. This scoring mechanism enables precise and efficient object detection across varied scenes.

 

How YOLOv3 works – Source: Mini-YOLOv3: Real-Time Object Detector for Embedded Applications

 

 

The YOLO Architecture at a Glance

The YOLOv3 algorithm first separates an image into a grid. Each grid cell predicts a fixed number of bounding boxes (relative to predefined anchor boxes) around objects that score highly against the aforementioned predefined classes.

Each bounding box has a respective confidence score indicating how accurate the model assumes that prediction to be, and only one object is identified per bounding box. The anchor boxes are generated by clustering the dimensions of the ground-truth boxes from the original dataset to find the most common shapes and sizes, as in the sketch below.
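As a rough illustration of that clustering step, here is a hypothetical sketch that derives anchor sizes by running k-means on ground-truth box dimensions. Note that the actual YOLO procedure uses 1 − IoU as the distance metric rather than the plain Euclidean distance used here, and the box values below are invented for the example:

```python
# Hypothetical sketch: derive anchor box sizes by clustering the
# (width, height) of ground-truth boxes with plain k-means.
import numpy as np

def kmeans_anchors(box_wh: np.ndarray, k: int = 9, iters: int = 100) -> np.ndarray:
    rng = np.random.default_rng(0)
    centers = box_wh[rng.choice(len(box_wh), k, replace=False)]
    for _ in range(iters):
        # assign each box to the nearest center (Euclidean, for simplicity)
        dists = np.linalg.norm(box_wh[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for i in range(k):
            if np.any(labels == i):
                centers[i] = box_wh[labels == i].mean(axis=0)
    return centers

# toy ground-truth (width, height) pairs in pixels
boxes = np.array([[30, 60], [35, 70], [120, 90],
                  [110, 100], [300, 200], [280, 220]], dtype=float)
print(kmeans_anchors(boxes, k=3))   # three "typical" box shapes for this toy set
```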

Other comparable algorithms that carry out the same objective are R-CNN (Region-based Convolutional Neural Networks, 2014), Fast R-CNN (an R-CNN improvement from 2015), and Mask R-CNN (2017). However, unlike systems such as R-CNN and Fast R-CNN, YOLO performs classification and bounding box regression at the same time, ensuring efficient and accurate prediction of bounding boxes.

Update: Check out our article about the new YOLOv7 model. YOLOv7 is widely expected to become the new industry standard for object detection.

 

What’s New in YOLO v3?

There are major differences between YOLOv3 and older versions in terms of speed, precision, and specificity of classes. YOLOv2 and YOLOv3 are worlds apart regarding accuracy, speed, and architecture; YOLOv2 came out in 2016, two years before YOLOv3.

The following sections will give you an overview of what’s new in YOLOv3.

 

Speed

YOLOv2 used Darknet-19 as its backbone feature extractor, while YOLOv3 uses Darknet-53, a backbone also created by YOLO authors Joseph Redmon and Ali Farhadi.

Darknet-53 has 53 convolutional layers instead of the previous 19, making it more powerful than Darknet-19 and more efficient than competing backbones such as ResNet-101 or ResNet-152.

 

Comparison of backbones: accuracy, billions of operations (Ops), billion floating-point operations per second (BFLOP/s), and frames per second (FPS) for various networks – Source: YOLOv3 paper

Using the chart in Redmon and Farhadi’s YOLOv3 paper, we can see that Darknet-53 is 1.5 times faster than ResNet-101. Nor does this speed come at the cost of accuracy: Darknet-53 is about as accurate as ResNet-152 yet two times faster.

YOLOv3 is fast and accurate in terms of mean average precision (mAP) and intersection over union (IOU) values. It runs significantly faster than other detection methods with comparable performance (hence the name – You only look once).

Moreover, you can easily trade off between speed and accuracy simply by changing the model’s size, without any retraining, showcasing the versatility of the YOLOv3 architecture.

 

YOLOv3 runs much faster than previous detection methods with comparable performance using an M40/Titan X GPU – Source
Precision for Small Objects

The chart below (taken and modified from the YOLOv3 paper) shows the average precision (AP) of detecting small, medium, and large objects with various algorithms and backbones. The higher the AP, the more accurate the detector is for that object size.

The precision for small objects in YOLOv2 was poor compared to other algorithms. With an AP of 5.0, it paled next to RetinaNet (21.8) and even SSD513 (10.2), which had the second-lowest AP for small objects.

 

YOLOv3 comparison for different object sizes, showing the average precision (AP) for AP-S (small objects), AP-M (medium objects), and AP-L (large objects) – Source: Focal Loss for Dense Object Detection

YOLOv3 increased the AP for small objects by 13.3, which is a massive advance from YOLOv2. However, the average precision (AP) for all objects (small, medium, large) is still less than RetinaNet.

 

Specificity of Classes

The new YOLOv3 uses independent logistic classifiers and binary cross-entropy loss for the class predictions during training. These changes make it possible to train YOLOv3 on complex datasets such as Google’s Open Images Dataset (OID). OID contains dozens of overlapping labels, such as “man” and “person”, for images in the dataset.

YOLOv3 uses a multilabel approach, which allows classes to be more specific and allows multiple labels for a single bounding box. Meanwhile, YOLOv2 used a softmax, a mathematical function that converts a vector of numbers into a vector of probabilities, where each probability is proportional to the relative scale of the corresponding value in the vector.

Using a softmax means that each bounding box can only belong to one class, which is sometimes not the case, especially with datasets like OID.
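A small numeric sketch (with made-up logits) shows the difference: under a softmax, overlapping labels like “person” and “man” compete for probability mass, while independent sigmoids can assign both a high score:

```python
# Illustrative sketch: softmax vs. independent logistic (sigmoid) classifiers.
import numpy as np

logits = np.array([2.0, 1.8, -1.0])   # raw scores for ["person", "man", "dog"]

softmax = np.exp(logits) / np.exp(logits).sum()
sigmoid = 1 / (1 + np.exp(-logits))

print(softmax.round(3))   # [0.535 0.438 0.027] -> probabilities compete, sum to 1
print(sigmoid.round(3))   # [0.881 0.858 0.269] -> "person" and "man" can both fire
```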

 

Object detection to recognize animals with YOLO in an agriculture application built on Viso Suite

Disadvantages of YOLO v3 vs. Other Algorithms

YOLOv3’s AP does indicate a trade-off between speed and accuracy compared to RetinaNet: RetinaNet scores higher but takes considerably longer to train. However, YOLOv3 can match RetinaNet’s detection accuracy when trained on a larger dataset, which makes it an ideal option for models trained on large datasets.

For example, in common object detection tasks like traffic detection, there is plenty of data available for model training, since organic data and vehicle images are readily available. However, YOLOv3 may not be ideal for niche models where large datasets are hard to obtain.

 

Installing YOLOv3

The YOLOv3 installation, including pre-trained models, is relatively straightforward, although some dependencies and libraries must be installed first. YOLOv3 can be installed either directly onto a computer or through a notebook (such as Google Colaboratory or Jupyter); the commands are the same for both. Assuming all libraries are installed, the command for installing YOLOv3 is pip install yolov3.

I will briefly guide you through installing YOLOv3 with the required libraries.

  1. Before installing anything, make sure you are using Python 3’s pip. You can check the version with the command pip -V.
    If, for any reason, you cannot uninstall older versions of pip or can’t use it directly, use the command “pip3 install ___” rather than just “pip.”
  2. Next, we install the required libraries one by one, starting with OpenCV (version 3.4 or more recent): pip install opencv-python
  3. Python (version 3.6 or more recent): check whether you already have Python with python --version
  4. Install Python for the first time on Mac or Linux: brew install python (you will need Homebrew first if you don’t already have it: /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)")
    Install Python for the first time on Windows: use this guide. You will need admin privileges on your computer.
  5. TensorFlow with GPU support (version 1.5.0 or later): pip install tensorflow-gpu
  6. Keras 2.1.3: pip install keras

Once you’ve installed all the above libraries, you can install YOLOv3 with the command pip install yolov3.
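As a quick sanity check after the steps above, the following snippet simply imports the libraries and prints their versions. This is an illustrative sketch; exact package names and import paths can vary with your setup:

```python
# Assumed setup check: confirm the libraries installed above import
# correctly and print their versions.
import sys
import cv2
import tensorflow as tf
import keras

print("Python:", sys.version.split()[0])
print("OpenCV:", cv2.__version__)
print("TensorFlow:", tf.__version__)
print("Keras:", keras.__version__)
```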

 

How to Use YOLOv3

The first step to using YOLOv3 is to decide on a specific object detection project. Since YOLOv3 performs real-time detections, it is ideal for beginners choosing a simple project with an easy premise.

The Viso Suite platform provides development, training, and deployment of applications based on YOLOv3 AI models

 

Model Weights

Weights and cfg (configuration) files are downloadable from the website of the original creator of YOLOv3. Download the model weights and place them into your current directory with the filename “yolov3.weights”. You can also (more easily) use YOLO’s COCO pre-trained weights by initializing the model with model = YOLOv3().
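As an alternative to the pip package, the downloaded files can also be loaded with OpenCV’s dnn module. Here is a minimal sketch, assuming the filenames above and a hypothetical test image “dog.jpg”:

```python
# Minimal sketch: load the downloaded Darknet files with OpenCV's dnn
# module; assumes "yolov3.cfg" and "yolov3.weights" sit in the current
# directory, per the naming convention above.
import cv2

net = cv2.dnn.readNetFromDarknet("yolov3.cfg", "yolov3.weights")

img = cv2.imread("dog.jpg")   # hypothetical test image
blob = cv2.dnn.blobFromImage(img, 1 / 255.0, (416, 416), swapRB=True, crop=False)
net.setInput(blob)

# one raw output array per detection scale
outputs = net.forward(net.getUnconnectedOutLayersNames())
print([o.shape for o in outputs])
```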

Using COCO’s pre-trained weights means that you can use YOLO for object detection with the 80 pre-trained classes that come with the COCO dataset. This is a good option for beginners because it requires the least amount of new code and customization.

The following 80 classes are available using COCO’s pre-trained weights:

'person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus', 'train', 'truck', 'boat', 'traffic light', 'fire hydrant', 'stop sign', 'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow', 'elephant', 'bear', 'zebra', 'giraffe', 'backpack', 'umbrella', 'handbag', 'tie', 'suitcase', 'frisbee', 'skis', 'snowboard', 'sports ball', 'kite', 'baseball bat', 'baseball glove', 'skateboard', 'surfboard', 'tennis racket', 'bottle', 'wine glass', 'cup', 'fork', 'knife', 'spoon', 'bowl', 'banana', 'apple', 'sandwich', 'orange', 'broccoli', 'carrot', 'hot dog', 'pizza', 'donut', 'cake', 'chair', 'couch', 'potted plant', 'bed', 'dining table', 'toilet', 'tv', 'laptop', 'mouse', 'remote', 'keyboard', 'cell phone', 'microwave', 'oven', 'toaster', 'sink', 'refrigerator', 'book', 'clock', 'vase', 'scissors', 'teddy bear', 'hair drier', 'toothbrush'

Object detection with YOLO using the COCO pre-trained classes “dog”, “bicycle”, and “truck”
Making a Prediction With YOLO v3

The convolutional layers included in the YOLOv3 architecture produce a detection prediction after passing the learned features to a classifier or regressor. The prediction includes the class label, the coordinates of the bounding box, the size of the bounding box, and more.

Since the prediction with the YOLO machine-learning algorithm uses 1 x 1 convolutions (hence the name, “you only look once”), the size of the prediction map is exactly the size of the feature map before it.

In YOLOv3, the prediction map is interpreted by each cell predicting a fixed number of bounding boxes. Then, whichever cell contains the center of the ground truth box of an object of interest is designated as the cell that will finally be responsible for predicting the object. There is a ton of mathematics behind the inner workings of the prediction architecture.
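To make the prediction map sizes concrete, here is a small worked example for a standard 416×416 input, assuming the usual three detection scales, three anchors per cell, and 80 COCO classes:

```python
# Worked example of YOLOv3 prediction map sizes for a 416x416 input:
# the network predicts at three scales (strides 32, 16, 8), with
# 3 anchors per cell and 5 + C values per anchor (C = 80 for COCO).
anchors_per_cell, C = 3, 80
depth = anchors_per_cell * (5 + C)   # 255 channels

for stride in (32, 16, 8):
    grid = 416 // stride
    print(f"stride {stride:2d}: {grid}x{grid}x{depth} "
          f"({grid * grid * anchors_per_cell} boxes)")
# stride 32: 13x13x255 (507 boxes)
# stride 16: 26x26x255 (2028 boxes)
# stride  8: 52x52x255 (8112 boxes)
```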

  1. Anchor Boxes
    Although we touched on bounding boxes and classes previously in this article, their implementation in YOLOv3 deserves more detail. YOLOv3 predicts log-space transforms, which are offsets to predefined “default” bounding boxes known as anchors. The transforms are then applied to the anchor boxes to obtain a prediction. YOLOv3 uses three anchors per detection scale, which results in the prediction of three bounding boxes per cell (a cell is also referred to as a neuron in more technical terms).
  2. Non-Maximum Suppression (NMS)
    Objects can sometimes be detected multiple times when more than one bounding box registers the object as a positive class detection. Non-maximum suppression avoids this by discarding detections that overlap a higher-confidence detection of the same object. Using an NMS threshold value and a confidence threshold value, we implement NMS to prevent double detections; a sketch follows this list. It is an imperative part of utilizing YOLOv3 effectively.

Anchor boxes and non-maximum suppression are only a few of the features that make the predictions possible, not a complete account of everything that goes into a successful prediction with YOLOv3. For a full description of YOLOv3’s mathematical background, I suggest reading the official YOLOv3 paper linked at the end of this article.

Interpreting Results

Interpreting the results of a YOLO model prediction is just as nuanced as the implementation of the model itself. Multiple factors go into a successful interpretation, such as the box confidence score and the class confidence score used when creating a YOLOv3 computer vision model. Other YOLOv3 prediction features include the classification loss, the loss function, the objectness score, and more.

 

Class Confidence and Box Confidence Scores

Each bounding box has x, y, w, and h values and a box confidence score. The box confidence score expresses how probable it is that the box contains an object, as well as how accurate the bounding box is.

The bounding box width and height (w and h) are normalized by the width and height of the image, and x and y are offsets relative to the cell in question, so all four bounding box values are between 0 and 1. Each cell then predicts a conditional class probability for every class (80 classes when trained on COCO; the original YOLO predicted 20 for PASCAL VOC).

The class confidence score for each final bounding box used as a positive prediction is equal to the box confidence score multiplied by the conditional class probability. The conditional class probability in this context is the probability that the detected object belongs to a certain class (the class being the object of interest’s identification). YOLOv3’s prediction is therefore a tensor with a height, a width, and a depth.
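As a tiny worked example of that multiplication (the numbers are invented for illustration):

```python
# Illustrative numbers only: class confidence = box confidence x
# conditional class probability.
box_confidence = 0.9        # how sure the box contains an object, and how tight it is
p_dog_given_object = 0.8    # P(class = "dog" | object present)

class_confidence = box_confidence * p_dog_given_object
print(round(class_confidence, 2))   # 0.72 -> kept, since it exceeds the 0.25 cutoff
```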

The spatial dimensions of the images and tensors used to produce bounding box predictions involve some high-level math. To learn more about this stage, see the YOLOv3 arXiv paper linked at the end of this article.

In the final step, the bounding boxes with high confidence scores (more than 0.25) are kept as final predictions.

YOLOv3 prediction results: with a low threshold (0.1), too many boxes are kept and detection accuracy suffers – Source

YOLOv3 Resources

The YOLOv3 algorithm has a multitude of credible resources created by the authors of the algorithm itself. For any topic, primary resources are best for getting accurate information. For YOLOv3, they are even more important because of all the second-hand information available on its use.

In researching for this article, the most useful primary resources were:

  • YOLOv1, the paper on the first version of the architecture: Redmon, Divvala, Girshick, and Farhadi. “You Only Look Once: Unified, Real-Time Object Detection.” (2015) – Find it here.
  • YOLOv3, the paper on the third version of YOLO: Redmon, Joseph, and Ali Farhadi. “YOLOv3: An Incremental Improvement.” (2018) – Find it here.
  • YOLOv3 source code and algorithm specifics by the original author (Joseph Redmon) – Find it here.
  • Results from the paper: Papers with Code, “YOLOv3: An Incremental Improvement” (uploaded by Redmon and Farhadi) – Find it here.

 

Why Use YOLOv3

Since the release of YOLOv3 in April 2018, several other official and unofficial YOLO versions have been released. These include:

  • YOLOv4: Alexey Bochkovskiy, Chien-Yao Wang, and Hong-Yuan Mark Liao (April 2020)
  • YOLOv5: Ultralytics (May 2020)
  • YOLOX: Megvii (July 2021)
  • YOLOv6: Meituan Technical Team (June 2022)
  • YOLOv7: Chien-Yao Wang, Alexey Bochkovskiy, and Hong-Yuan Mark Liao (July 2022)
  • YOLOv8: Ultralytics (January 2023)
  • YOLOv9: Chien-Yao Wang, I-Hau Yeh, and Hong-Yuan Mark Liao (February 2024)

Users may choose to stick with YOLOv3 over newer iterations like YOLOv4 or YOLOv5 due to project requirements. One significant consideration is stability and maturity. YOLOv3 has been in circulation for a more extended period, undergoing extensive testing and validation across diverse applications. This accumulated experience contributes to a perception of stability and reliability, particularly in situations where the latest features of newer versions are not critical.

Additionally, the choice may hinge on practical considerations such as model size and deployment. YOLOv3 is a comparatively lightweight model, making it a suitable option for deployment on edge devices with limited storage capacity or bandwidth constraints. Users who prioritize model compactness and ease of deployment may find YOLOv3 preferable over later versions that might demand more storage and computational resources.

 

Comparing YOLOv3 and YOLOv5

YOLOv5 was published by the company Ultralytics and is therefore not part of the official YOLO series, which has sparked some controversy in the computer vision community. The architecture is similar to the official YOLOv4 but is built on a different framework: PyTorch instead of Darknet.

According to the creator of the official YOLOv4, the performance of v5 is similar to the official v4. Check out our article about YOLOv5.

 

Object detection applied to face detection and face recognition tasks
Comparing YOLOv3 and YOLOR

YOLO is just one of many algorithms used extensively in artificial intelligence. We’ve discussed the new version of YOLO, YOLOv5, and its surrounding controversy regarding the new architecture and validity. Check out our analysis to learn more about the history of YOLO, and why the original author of YOLO did not make the new versions 4 and 5.

YOLOR (You Only Learn One Representation) is a different, high-performing object detection algorithm. It provides significant performance gains over YOLOv3 and performs very well on the COCO benchmark. YOLOR had been considered state-of-the-art before the release of YOLOv7 in July 2022.

 

Comparing YOLOv3 and YOLOv7

At its release in 2022, YOLOv7 boasted that it surpassed all known real-time object detectors in terms of accuracy and speed. Check out our article about YOLOv7 and learn how to implement it for business solutions.

 

Comparison of YOLOv7, YOLOR, PP-YOLO, YOLOX, YOLOv4, and YOLOv5 on the COCO dataset; YOLOv7 was released in July 2022

 

Comparing YOLOv3 and YOLOv8

YOLOv8 from Ultralytics was considered state-of-the-art upon its January 2023 release. Newer versions of YOLOv8 are more lightweight and outperform older YOLO versions on the COCO dataset. However, since YOLOv8 is considered an “unofficial” version and is licensed under the AGPL-3.0 license, many organizations still use YOLOv3 for real-time object detection tasks.

 

Comparing various YOLO versions – Source

 

What’s Next for YOLOv3?

While faster and more efficient models have since been released, YOLOv3 remains a popular and reliable object detection model.

If you are looking for a business solution to implement a custom computer vision application based on YOLOv3 or other AI models, check out the next-gen computer vision platform Viso Suite. The end-to-end solution covers the entire computer vision lifecycle, with image annotation, AI model training, and AI model management for YOLOv3 and all other popular deep learning models, including YOLOv7 and YOLOv8.

Viso Suite provides a powerful enterprise solution to power private on-device AI vision. It leverages Edge AI to avoid storing or sending video data to the cloud. The Suite provides the most comprehensive features and no-code automation.

Get in touch and request a demo with our team.

 
