Object Detection in 2021: The Definitive Guide

People in meeting room, example of object detection

Object detection is considered one of the most challenging problems in the field of computer vision, as it combines object classification and object localization within images or videos. Since computer vision is a branch of artificial intelligence that allows computer systems to “see” their environments, detecting objects in surrounding areas is imperative, and object detection algorithms automate that process.

In this article you will learn about:

  • Basics of object detection
  • Use cases of object detection
  • Popular object detection algorithms
    • YOLO
    • R-CNN
    • Mask R-CNN
    • SqueezeDet
    • MobileNet

What is Object Detection?

Object detection has been undergoing rapid change within the computer science community due to its growing number of use cases. As computer vision techniques are used more frequently to automate processes such as remote animal monitoring, object detection advances at a similar pace.


Object Detection with bounding boxes

Object recognition refers to a collection of related tasks for identifying objects in digital photographs. The goal of object detection, one of these tasks, is to recognize instances of a predefined set of object classes (e.g. people, vehicles, animals) and record the location of each detected object in the image. It accomplishes this by producing bounding boxes around the detected instances.
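
A detector's output, as described above, can be represented as a list of labeled boxes. The following is a minimal sketch of such a structure, not the format of any particular library: the `Detection` class, the `(x_min, y_min, x_max, y_max)` pixel convention, and the helpers `keep_confident` and `box_area` are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    label: str    # one of the predefined object classes, e.g. "person"
    score: float  # model confidence, assumed to lie in [0, 1]
    box: tuple    # assumed convention: (x_min, y_min, x_max, y_max) in pixels


def keep_confident(detections, classes, min_score=0.5):
    """Keep detections of the wanted classes whose score reaches min_score."""
    return [d for d in detections if d.label in classes and d.score >= min_score]


def box_area(box):
    """Area of a box in square pixels; degenerate boxes count as zero."""
    x_min, y_min, x_max, y_max = box
    return max(0, x_max - x_min) * max(0, y_max - y_min)


# Hypothetical detector output for a single image.
detections = [
    Detection("person", 0.92, (10, 20, 110, 220)),
    Detection("vehicle", 0.81, (200, 150, 420, 300)),
    Detection("person", 0.31, (300, 40, 340, 120)),
]
people = keep_confident(detections, {"person"})
```

Here `people` holds only the first detection, because the second person falls below the assumed confidence threshold of 0.5.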


Object Detection Use Cases and Applications

Use cases for object detection are varied and plentiful. It has been implemented in computer vision programs for a range of applications, from sports production to productivity analytics. For example, Tesla’s Autopilot AI relies heavily on object detection to perceive threats in its surroundings, such as oncoming vehicles or obstacles. Here, I will review a few common use cases; this is an overview, not a comprehensive list.

  • Retail. People-counting devices placed strategically throughout a retail store can use deep learning to gather data about where customers spend their time, and for how long. Customer analytics can improve retail stores’ understanding of consumer interaction and help optimize store layout.
  • Agriculture. Object detection is used in agriculture for tasks such as counting, animal monitoring, and produce evaluation. Damaged produce can be detected while it is in processing using machine learning algorithms.
  • Healthcare. Object detection has allowed for many breakthroughs in the medical community. Because medical diagnostics rely heavily on the study of images, scans, and photographs, object detection involving CT and MRI scans has become extremely useful for diagnosing disease.
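
To make the retail case above concrete, here is a rough sketch of how per-frame person detections could be turned into time spent per store zone. It assumes a detector has already produced person boxes for every video frame; the function `dwell_seconds`, the zone names, and the rule of assigning a person to a zone by the center of the box are illustrative assumptions, not a description of any real product.

```python
from collections import defaultdict


def dwell_seconds(frames, zones, fps):
    """Sum person-seconds per zone.

    frames: one list of person boxes (x_min, y_min, x_max, y_max) per video frame
    zones:  dict mapping a zone name to its rectangle (x_min, y_min, x_max, y_max)
    fps:    frames per second of the analysed video
    """
    totals = defaultdict(float)
    for boxes in frames:
        for x0, y0, x1, y1 in boxes:
            # Assumption: a person is "in" the zone containing the box center.
            cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
            for name, (zx0, zy0, zx1, zy1) in zones.items():
                if zx0 <= cx < zx1 and zy0 <= cy < zy1:
                    totals[name] += 1 / fps  # each frame contributes 1/fps seconds
    return dict(totals)


# Hypothetical store layout and two frames of detections.
zones = {"entrance": (0, 0, 100, 100), "checkout": (100, 0, 200, 100)}
frames = [
    [(10, 10, 30, 50)],
    [(10, 10, 30, 50), (120, 10, 140, 50)],
]
totals = dwell_seconds(frames, zones, fps=2)
```

Because the sketch counts person-seconds without tracking individuals, it answers "how busy is each zone" rather than "how long did one customer stay"; the latter would additionally need object tracking.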


Object Detection in Farming
Object Detection Use Case in Farming – Animal Monitoring


Most Popular Object Detection Algorithms

Popular algorithms used to carry out object detection include R-CNN (Region-Based Convolutional Neural Networks), Fast R-CNN, and YOLO (You Only Look Once). R-CNN and Fast R-CNN belong to the R-CNN family, while YOLO is part of the single-shot detector family. In the following, we provide an overview of the popular object detection algorithms and the differences between them.


Object detection overview of popular algorithms

YOLO – You Only Look Once

As a real-time object detection system, YOLO utilizes a single neural network. The latest release of ImageAI, v2.1.0, supports training a custom YOLO model to detect any kind and number of objects. Earlier detection systems are classifier-based: they repurpose classifiers or localizers to perform detection and apply the detection model to an image at multiple locations and scales. Using this process, “high scoring” regions of the image are considered detections. Simply put, the regions which look most like the training images given are identified positively.

The YOLO system predicts bounding boxes using dimension clusters as anchor boxes. Most bounding boxes have certain height-width ratios, so instead of directly predicting a bounding box, YOLO predicts offsets from a predetermined set of boxes with particular height-width ratios, called anchor boxes. The anchor boxes are generated by clustering the dimensions of the ground truth boxes from the original dataset to find the most common shapes and sizes. The network also predicts an objectness score for each bounding box using logistic regression. The target for this score is 1 when the bounding box prior overlaps a ground truth object by more than any other bounding box prior.
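
The two ideas in this paragraph, clustering box dimensions into anchors and predicting offsets from those anchors, can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than YOLO's actual code: the function names `iou_wh`, `kmeans_anchors` and `decode_box` are made up for this example, the deterministic initialization is a simplification, the clustering distance 1 - IoU follows the description in the YOLOv2 paper, and the decoding uses the parameterization given in the YOLOv2 and YOLOv3 papers (center offsets passed through a sigmoid, width and height scaled exponentially).

```python
import math


def iou_wh(box, anchor):
    """IoU of two (width, height) pairs, as if both boxes shared one corner."""
    inter = min(box[0], anchor[0]) * min(box[1], anchor[1])
    union = box[0] * box[1] + anchor[0] * anchor[1] - inter
    return inter / union


def kmeans_anchors(boxes, k, iterations=50):
    """Cluster ground-truth (width, height) pairs using distance 1 - IoU."""
    # Simplification: start from k boxes spread evenly over the area-sorted list
    # instead of a random initialization, so the result is reproducible.
    ordered = sorted(boxes, key=lambda b: b[0] * b[1])
    step = len(ordered) / k
    anchors = [ordered[int(i * step + step / 2)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for box in boxes:
            # Highest IoU is the same as smallest 1 - IoU distance.
            best = max(range(k), key=lambda i: iou_wh(box, anchors[i]))
            clusters[best].append(box)
        updated = []
        for i, members in enumerate(clusters):
            if not members:  # keep an anchor that attracted no boxes
                updated.append(anchors[i])
                continue
            mean_w = sum(b[0] for b in members) / len(members)
            mean_h = sum(b[1] for b in members) / len(members)
            updated.append((mean_w, mean_h))
        if updated == anchors:  # converged
            break
        anchors = updated
    return anchors


def decode_box(offsets, anchor, cell):
    """Turn predicted offsets (tx, ty, tw, th) into a box (bx, by, bw, bh).

    anchor is the prior (pw, ph); cell is the grid cell corner (cx, cy).
    Units are grid cells for the center and anchor units for the size.
    """
    tx, ty, tw, th = offsets
    pw, ph = anchor
    cx, cy = cell
    sigmoid = lambda v: 1.0 / (1.0 + math.exp(-v))
    return (sigmoid(tx) + cx, sigmoid(ty) + cy, pw * math.exp(tw), ph * math.exp(th))


# Hypothetical ground-truth box sizes: three tall boxes and three wide ones.
sizes = [(10, 20), (12, 22), (11, 21), (50, 30), (52, 28), (48, 32)]
anchors = kmeans_anchors(sizes, k=2)
box = decode_box((0.0, 0.0, 0.0, 0.0), anchor=(2.0, 3.0), cell=(4, 5))
```

With all offsets at zero, `decode_box` returns the anchor itself, centered in its grid cell, which is why predicting offsets is an easier learning problem than predicting raw box coordinates.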

YOLO allows the model to look at the whole image at test time, so its predictions are informed by the global context in the image.

Unlike systems like R-CNN and Fast R-CNN (CNN meaning Convolutional Neural Network), YOLO performs classification and bounding box regression at the same time, in a single pass of one network. This makes it a more elegant and streamlined algorithm, and a much faster one: its authors report it to be more than 1000x faster than R-CNN and 100x faster than Fast R-CNN. However, this design also makes it more difficult to pick out smaller objects.

Unfortunately, this does produce a trade-off between speed and accuracy. Systems that use YOLO as the primary object detection algorithm require thousands of training examples to reach high accuracy rates.

R-CNN – Region-based Convolutional Neural Networks

Region-based convolutional neural networks, or regions with CNN features (R-CNNs), are a pioneering approach that applies deep models to object detection. R-CNN models first select several proposed regions from an image (for example, anchor boxes are one type of selection method) and then label their categories and bounding boxes (e.g. offsets). These labels are based on the predefined classes given to the program. The models then use a convolutional neural network to perform forward computation and extract features from each proposed region.

In R-CNN, the input image is first divided into nearly two thousand region proposals, and a convolutional neural network is then applied to each region in turn. Each region is resized to the input size the network expects before being passed in. Such a detailed method is slow: training time is far greater than YOLO’s because R-CNN classifies and creates bounding boxes individually, applying the neural network to one region at a time.

In 2015, Fast R-CNN was developed with the intention of cutting down significantly on training time. While the original R-CNN independently computed the neural network features on each of as many as two thousand regions of interest, Fast R-CNN runs the neural network once on the whole image. This is comparable to YOLO’s architecture, but YOLO remains a faster alternative to Fast R-CNN because of its simpler pipeline.

At the end of the network is a novel method known as ROI Pooling, which slices out each Region of Interest from the network’s output tensor, reshapes, and classifies it. This makes Fast R-CNN more accurate than the original R-CNN, but still slower than YOLO. However, fewer data inputs are required to train Fast R-CNN and R-CNN because of the thorough analysis techniques they utilize.
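
The ROI Pooling step described above can be illustrated with a small pure-Python sketch. It is a simplification under stated assumptions: the feature map is a single-channel 2D list rather than a multi-channel tensor, the region is already given in feature-map cells as `(x0, y0, x1, y1)` with exclusive upper bounds, and the function name `roi_max_pool` is hypothetical. The point it shows is that regions of any size are reduced to the same fixed output size, so one classifier head can process all of them.

```python
def roi_max_pool(feature_map, roi, out_h, out_w):
    """Max-pool one region of a 2D feature map down to out_h x out_w values."""
    x0, y0, x1, y1 = roi
    roi_h, roi_w = y1 - y0, x1 - x0
    pooled = []
    for i in range(out_h):
        # Bin edges: floor for the start, ceiling for the end, so that
        # every cell of the region falls into at least one bin.
        y_start = y0 + (i * roi_h) // out_h
        y_end = y0 + -(-((i + 1) * roi_h) // out_h)
        row = []
        for j in range(out_w):
            x_start = x0 + (j * roi_w) // out_w
            x_end = x0 + -(-((j + 1) * roi_w) // out_w)
            row.append(max(
                feature_map[y][x]
                for y in range(y_start, y_end)
                for x in range(x_start, x_end)
            ))
        pooled.append(row)
    return pooled


# Hypothetical 4x4 feature map produced by one pass over the whole image.
feature_map = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
whole = roi_max_pool(feature_map, (0, 0, 4, 4), 2, 2)   # 4x4 region
corner = roi_max_pool(feature_map, (1, 1, 4, 4), 2, 2)  # 3x3 region
```

Both calls return a 2x2 grid even though the regions differ in size, and both reuse the same feature map, which reflects why Fast R-CNN only needs to run the network once per image.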

Mask R-CNN

Mask R-CNN is an advancement of Faster R-CNN. The difference between the two is that Mask R-CNN adds a branch for predicting an object mask in parallel with the existing branch for bounding box recognition. Mask R-CNN is simple to train and adds only a small overhead to Faster R-CNN, running at 5 fps.


SqueezeDet

SqueezeDet is the name of a deep neural network for computer vision that was released in 2016. SqueezeDet was specifically developed for autonomous driving, where it performs object detection using computer vision techniques. Like YOLO, it is a single-shot detector algorithm. In SqueezeDet, convolutional layers are used not only to extract feature maps but also as the output layer to compute bounding boxes and class probabilities. The detection pipeline of a SqueezeDet model contains only a single forward pass of a neural network, allowing it to be extremely fast.


MobileNet

MobileNet, in its MobileNet-SSD form, is a single-shot multibox detection (SSD) network used to run object detection tasks. This model is implemented using the Caffe framework. The model output is a typical vector containing the data for each detected object.


What’s Next?

Object detection is increasingly important for computer vision applications in any industry.
