What is Object Tracking? – An Introduction

Object detection with multiple cars in a desert setting

What is Object Tracking?

Object tracking is a computer vision task, today typically built on deep learning, in which a program takes an initial set of object detections, assigns a unique ID to each of them, and then tracks the detected objects as they move across the frames of a video.

In other words, object tracking is the task of automatically identifying objects in a video and interpreting them as a set of trajectories with high accuracy.

Often, a visual marker indicates the object being tracked, for example, a bounding box that follows the object and shows the user where it is on the screen.

Uses and Types of Object Tracking

Object tracking serves a variety of use cases involving different types of input footage. Whether the anticipated input is an image or a video, and whether that video is real-time or prerecorded, affects which algorithms are suitable for building an object tracking application.

The kind of input also impacts the category, use cases, and applications of object tracking. Here, we will briefly describe a few popular uses and types of object tracking, such as video tracking, visual tracking, and image tracking.

Video Tracking

Video tracking is an application of object tracking where moving objects are located within video data. It covers both live, real-time footage and recorded video files.

The processes used to execute video tracking tasks differ based on which type of video input is targeted. This will be discussed more in-depth when we compare batch and online tracking methods later in this article.


Visual Tracking

Visual tracking, or visual target tracking, is a research topic in computer vision that is applied in a large range of everyday scenarios. The goal of visual tracking is to estimate the future position of a visual target that was initialized in an earlier frame, without access to the rest of the video.

Image Tracking

Image tracking is meant for detecting two-dimensional images of interest in a given input. Each such image is then continuously tracked as it moves in the setting.

Image tracking is ideal for datasets with highly contrasting images (e.g., black and white), asymmetry, few patterns, and multiple identifiable differences between the image of interest and other images in the image set.

Image tracking relies on computer vision to detect and augment images after image targets are predetermined.

Pros and Cons of Object Tracking

What are the common challenges and advantages of Object Tracking? The main challenges usually stem from issues in the image that make it difficult for object tracking models to effectively perform detections on the images.

Here, we will discuss a few of the most common issues in tracking objects, along with methods of preventing or dealing with them.

1. Training and Tracking Speed

Algorithms for tracking objects are supposed to not only accurately perform detections and localize objects of interest but also do so in the least amount of time possible. Enhancing tracking speed is especially imperative for real-time object tracking models.

To keep inference time manageable, the algorithm behind the object tracking model needs to be chosen carefully or customized. For example, Fast R-CNN and Faster R-CNN are successors of the original R-CNN approach that were designed to make it faster.

Since CNNs (Convolutional Neural Networks) are commonly used for object detection, CNN modifications can be the differentiating factor between a faster object tracking model and a slower one. Design choices besides the detection framework also influence the balance between speed and accuracy of an object detection model.

2. Background Distractions

The backgrounds of input images, or of the images used to train object tracking models, also impact the accuracy of the model. A busy background behind the objects to be tracked can make small objects harder to detect.

With a blurry or single-color background, it is easier for an AI system to detect and track objects. Backgrounds that are cluttered, or that have the same color as the object, can make it hard to track a small or lightly colored object.

3. Multiple Spatial Scales

Objects meant to be tracked can come in a variety of sizes and aspect ratios. These ratios can confuse the object tracking algorithms into believing objects are scaled larger or smaller than their actual size. The size misconceptions can negatively impact detections or detection speed.

To combat the issue of varying spatial scales, programmers can implement techniques such as feature maps, anchor boxes, image pyramids, and feature pyramids.

  • Anchor Boxes: Anchor boxes are a set of predefined bounding boxes, each with a specified height and width. The boxes are meant to capture the scales and aspect ratios of the objects of interest, and they are chosen based on the typical object sizes in a given dataset. Anchor boxes allow objects of various shapes and sizes to be detected, because the detector only has to refine the nearest anchor during localization rather than predict box coordinates from scratch.
  • Feature Maps: A feature map is the output of a layer in a Convolutional Neural Network (CNN), that is, the result of applying the filters of that layer to its input. Feature maps allow a deeper understanding of the features being detected by a CNN. Single-shot detectors have to take into account the issue of multiple scales because they detect objects with just one pass through a CNN framework. This can reduce detection accuracy for small objects: small objects lose signal during downsampling in the pooling layers, so little information about them remains in the deeper, low-resolution layers, and the problem is worse if the CNN was trained on only a small number of such objects. To counter this, multiple feature maps can be used so that single-shot detectors look for objects at several CNN layers, including earlier layers with higher resolution. Single-shot detectors are still not an ideal option for small object tracking because of the difficulty they experience when detecting small objects. Tight groupings can prove especially difficult. For instance, a herd of animals filmed from an overhead drone will be difficult to track using single-shot detectors.
  • Image and Feature Pyramid Representations: Image pyramids and feature pyramids are both strategies for handling object scale variation. An image pyramid consists of a set of resized versions of one input image, each of which is sent to the detector at test time. A feature pyramid, also known as a multi-level feature map because of its pyramidal structure, is instead built from the multi-scale feature maps of a single pass through the network, and therefore requires less computation than an image pyramid. Feature pyramids combine the most useful information about objects of different sizes in a top-down representation, which makes it easier to detect objects of varying sizes.
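
As a rough illustration of the anchor box idea described above, the following sketch generates a set of anchor boxes for a single image location from a base size, a list of scales, and a list of aspect ratios. The function name `make_anchors` and the specific scales and ratios are assumptions chosen for illustration; they are not taken from any particular detector.

```python
import math

def make_anchors(cx, cy, base_size, scales, aspect_ratios):
    """Return anchor boxes (x1, y1, x2, y2) centered on (cx, cy).

    Each anchor has area (base_size * scale) ** 2 and a
    width / height ratio equal to the given aspect ratio.
    """
    anchors = []
    for scale in scales:
        area = (base_size * scale) ** 2
        for ratio in aspect_ratios:
            width = math.sqrt(area * ratio)
            height = area / width
            anchors.append((cx - width / 2, cy - height / 2,
                            cx + width / 2, cy + height / 2))
    return anchors

# Hypothetical settings: two scales and three aspect ratios give six anchors.
anchors = make_anchors(cx=100, cy=100, base_size=32,
                       scales=[1, 2], aspect_ratios=[0.5, 1.0, 2.0])
for box in anchors:
    print([round(v, 1) for v in box])
```

In a real detector, such a set would typically be repeated at every position of a feature map, and the scales and ratios would be tuned to the object sizes found in the dataset.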

Figure: The concept of feature pyramid execution frameworks

4. Occlusion

The term occlusion has several definitions. In medicine, occlusion is the “blockage of a blood vessel”; in deep learning, it has a similar meaning: something blocks the view of an object.

In AI vision tasks using deep learning, occlusion happens when an object is partly or fully hidden, for instance when multiple objects come so close together that they appear to merge. This causes issues for object tracking systems because the occluded objects may be seen as one, or may be tracked incorrectly. When an object reappears, the system can get confused and identify the initially tracked object as a new object.

Occlusion sensitivity helps to address this misidentification by allowing the user to understand which parts of an image are the most important for the object tracking system to classify. Occlusion sensitivity is a measure of the network’s sensitivity to occlusion in different regions of the data. It is computed by masking small regions of the input, one at a time, and recording how the output of the network changes.
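
To make the idea more concrete, here is a minimal sketch of an occlusion sensitivity map in plain Python. It masks one patch of the input at a time and records how much a score drops. The function `toy_score` is only a stand-in for a trained network, and all names and the patch size are assumptions made for this example.

```python
def occlusion_sensitivity(image, score_fn, patch=2, fill=0.0):
    """Return a map of score drops when each patch of the image is masked.

    image: 2D list of floats. score_fn: callable that takes an image and
    returns a float. Larger drops mark regions the score depends on.
    """
    height, width = len(image), len(image[0])
    baseline = score_fn(image)
    heatmap = []
    for top in range(0, height, patch):
        row = []
        for left in range(0, width, patch):
            occluded = [list(r) for r in image]  # copy so the input is untouched
            for y in range(top, min(top + patch, height)):
                for x in range(left, min(left + patch, width)):
                    occluded[y][x] = fill
            row.append(baseline - score_fn(occluded))
        heatmap.append(row)
    return heatmap

# Toy stand-in for a model: the score is the mean brightness of the
# top-left 2x2 corner, so only that region should matter.
def toy_score(img):
    return (img[0][0] + img[0][1] + img[1][0] + img[1][1]) / 4

image = [[1.0] * 4 for _ in range(4)]
heatmap = occlusion_sensitivity(image, toy_score)
print(heatmap)
```

In this toy case, only the patch covering the top-left corner produces a score drop, which is the kind of signal a user would look for when checking which image regions a real model relies on.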

Levels of Object Tracking

Object Tracking consists of multiple subtypes because it is such a broad application. Levels of object tracking differ depending on the number of objects being tracked.

Multiple Object Tracking

The definition of “multiple object tracking” is right in the name. Instead of tracking just one object, multiple object tracking (also sometimes called “multi-target tracking”) analyzes videos in order to identify and follow several objects at once. These objects may belong to one or more predetermined classes.

What Distinguishes Multiple Object Tracking (MOT) from General Object Detection? Object detection typically produces a collection of bounding boxes as output, each described by its height, width, coordinates, and other parameters. MOT algorithms additionally assign a target ID to each bounding box. This target ID is important because it allows the model to distinguish among objects within a class and to link detections of the same object across frames. MOT often has to do this with little to no prior knowledge of the appearance and number of targets.

For example, instead of identifying all cars in a photo where multiple cars are present as just “car,” MOT algorithms attempt to identify different cars as being different from each other rather than all falling under the “car” label. For a visual representation of this example, refer to the image below.
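
The difference between the two kinds of output can be sketched as a pair of small data structures. The class and field names below are hypothetical and only illustrate that an MOT result carries a target ID on top of the usual detection fields.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    """Output of an object detector for one frame: a box and a class label."""
    x: float
    y: float
    width: float
    height: float
    label: str

@dataclass
class TrackedDetection(Detection):
    """The same box after MOT: it additionally carries a target ID."""
    target_id: int

detections = [Detection(10, 20, 50, 30, "car"),
              Detection(200, 25, 55, 32, "car")]

# A detector alone cannot tell these two cars apart: both are just "car".
# Here the IDs are simply numbered; a real tracker would assign them by
# associating detections across frames.
tracked = [TrackedDetection(d.x, d.y, d.width, d.height, d.label, target_id=i)
           for i, d in enumerate(detections, start=1)]
for t in tracked:
    print(t.label, t.target_id)
```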


Single Object Tracking

In Single Object Tracking (SOT), the tracker is given the bounding box of the target in the first frame of the input video. Single Object Tracking is also sometimes known as Visual Object Tracking.

SOT implies that one singular object is tracked, even in environments involving other objects. Single Object Trackers are meant to focus on one given object rather than multiple. The object of interest is determined in the first frame, which is where the object to be tracked is initialized for the first time. The tracker is then tasked with locating that unique target in all other given frames.

SOT falls under the detection-free tracking category, which means that it requires manual initialization of a fixed number of objects in the first frame. These objects are then localized in subsequent frames. A drawback of detection-free tracking is that it cannot deal with scenarios where new objects appear in later frames. On the other hand, because no class-specific detector is involved, SOT models should be able to track any given object.

Algorithm Specifics

Multiple Object Tracking (MOT) Algorithm Introduction

Most multiple object tracking algorithms incorporate an approach called tracking-by-detection. The tracking-by-detection method involves an independent detector that is applied to all image frames to obtain likely detections, and then a tracker, which is run on the set of detections. The tracker attempts to perform data association, that is, linking the detections to obtain complete trajectories. The detections extracted from video inputs guide the tracking process: they are connected across frames, and bounding boxes containing the same target are assigned identical IDs. MOT algorithms can be divided into batch and online methods:

  • Batch method: Batch tracking algorithms use information from future video frames when deducing the identity of an object in a certain frame. Batch tracking algorithms use non-local information regarding the object. This methodology results in a better quality of tracking.
  • Online method: While batch tracking algorithms access future frames, online tracking algorithms only use present and past information to come to conclusions regarding a certain frame.

Online tracking methods for performing MOT generally perform worse than batch methods because online methods are constrained to the present and past frames. However, this methodology is sometimes necessary because of the use case.

For example, real-time problems requiring the tracking of objects, like navigation or autonomous driving, do not have access to future video frames, which is why online tracking methods are still a viable option.
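
The practical difference between the two methods can be illustrated with a toy example in which a target is missed by the detector for two frames. The sketch below is not a tracking algorithm; it only contrasts what each method is allowed to use. The function names and the one-dimensional trajectory are assumptions made for illustration.

```python
def fill_online(positions):
    """Fill gaps using only past frames: repeat the last known position."""
    filled, last = [], None
    for p in positions:
        if p is not None:
            last = p
        filled.append(last)
    return filled

def fill_batch(positions):
    """Fill gaps using past and future frames: linear interpolation."""
    filled = list(positions)
    known = [i for i, p in enumerate(positions) if p is not None]
    for left, right in zip(known, known[1:]):
        for i in range(left + 1, right):
            t = (i - left) / (right - left)
            filled[i] = positions[left] + t * (positions[right] - positions[left])
    return filled

# x-coordinate of one target per frame; None marks a missed detection.
observed = [0.0, 1.0, None, None, 4.0]
print(fill_online(observed))
print(fill_batch(observed))
```

The online version can only hold the last known position until the target is seen again, while the batch version can use the later observation to reconstruct a smoother path through the gap.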

Multiple Object Tracking Algorithm Stages

Most multiple object tracking algorithms contain a basic set of steps that remain constant as algorithms vary. They typically share the following stages:

  • Stage #1: Designation or Detection: Targets of interest are noted and highlighted in the designation phase. The algorithm analyzes input frames to identify objects that belong to target classes. Bounding boxes are used to perform detections as part of the algorithm.
  • Stage #2: Motion: Feature extraction algorithms analyze the detections to extract appearance and interaction features. In most cases, a motion predictor is also used to predict the subsequent position of each tracked target.
  • Stage #3: Recall: The extracted features and motion predictions are used to calculate similarity scores between pairs of detections. Those scores are then used to associate detections that belong to the same target. Detections judged to belong to the same target are assigned the same ID, and detections that cannot be paired with an existing target receive a new ID.
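
The stages above can be sketched in a few lines of Python. This is a simplified illustration, not the method of any specific tracker: it assumes the detections for the current frame are already available, uses a constant-velocity shift as the motion predictor, uses intersection over union (IoU) as the similarity score, and matches greedily. The function names and the `min_iou` threshold are assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def predict(box, velocity):
    """Motion step: shift a box by its per-frame velocity (dx, dy)."""
    dx, dy = velocity
    return (box[0] + dx, box[1] + dy, box[2] + dx, box[3] + dy)

def associate(tracks, detections, min_iou=0.3):
    """Match predicted track boxes to detections, best IoU first.

    tracks: dict of id -> (box, velocity). Returns a dict that maps each
    detection index to a target ID; unmatched detections get new IDs.
    """
    pairs = sorted(((iou(predict(box, vel), det), tid, j)
                    for tid, (box, vel) in tracks.items()
                    for j, det in enumerate(detections)), reverse=True)
    assigned, used = {}, set()
    for score, tid, j in pairs:
        if score < min_iou:
            break  # remaining pairs are even less similar
        if tid in used or j in assigned:
            continue
        assigned[j] = tid
        used.add(tid)
    next_id = max(tracks, default=0) + 1
    for j in range(len(detections)):
        if j not in assigned:
            assigned[j] = next_id
            next_id += 1
    return assigned

# Two existing targets and three detections in the new frame.
tracks = {1: ((0, 0, 10, 10), (5, 0)), 2: ((50, 50, 60, 60), (0, 0))}
detections = [(51, 50, 61, 60), (5, 1, 15, 11), (100, 100, 110, 110)]
print(associate(tracks, detections))
```

Here the first two detections overlap strongly with the predicted positions of targets 2 and 1, so they inherit those IDs, while the third detection overlaps with nothing and receives a new ID. Production trackers usually replace the greedy matching with an optimal assignment and combine several similarity cues.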

Some object tracking models are created using these steps separately from each other, while others combine and use the steps in conjunction. These differences in algorithm processing create unique models where some are more accurate than others.

Popular Object Tracking Algorithms

Convolutional Neural Networks (CNN) remain the most used and reliable network for object tracking. However, multiple architectures and algorithms are being explored as well. Among these algorithms are Recurrent Neural Networks (RNNs), Autoencoders (AEs), Generative Adversarial Networks (GANs), Siamese Neural Networks (SNNs), and custom neural networks.

Although an object detector can be used to track objects if it is applied frame by frame, this is computationally expensive and therefore a rather inefficient method of performing object tracking. Instead, object detection can be applied once, and then the object tracker can handle every frame after the first. This is a more computationally efficient and less cumbersome process of performing object tracking.

1. OpenCV

OpenCV is a popular choice because it has many built-in algorithms that are specifically optimized for the needs and objectives of object tracking.

Specific object trackers within OpenCV include the BOOSTING, MIL, KCF, CSRT, MedianFlow, TLD, MOSSE, and GOTURN trackers. Each of these trackers is best for different goals. For example, CSRT is best when the user requires a higher object tracking accuracy and can tolerate slower FPS throughput.

Picking a tracker depends on the advantages and disadvantages of that tracker and the benefits you would like your object tracker to have. For a complete article on the similarities and differences of OpenCV object tracking algorithm processes, we suggest you read OpenCV Object Tracking by Adrian Rosebrock.

2. DeepSORT

DeepSORT is one of the most widely used object tracking frameworks. It extends the SORT (Simple Online and Realtime Tracking) algorithm by integrating appearance information, which vastly improves tracking performance. Because of this integration, objects can be tracked through longer periods of occlusion, which reduces the number of identity switches.
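
The role of appearance information can be sketched as follows. The example compares a stored appearance vector for each track with the appearance vector of a detection that shows up after an occlusion, using cosine distance. This is a simplified illustration and not the DeepSORT implementation: the three-dimensional vectors are made up, the `max_distance` threshold is an assumption, and the real algorithm obtains its appearance descriptors from a trained CNN and also takes motion information into account.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity between two appearance feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def match_by_appearance(track_features, detection_feature, max_distance=0.2):
    """Return the ID of the stored track whose appearance is closest to the
    detection, or None if no track is within max_distance."""
    best_id, best_dist = None, max_distance
    for track_id, feature in track_features.items():
        dist = cosine_distance(feature, detection_feature)
        if dist < best_dist:
            best_id, best_dist = track_id, dist
    return best_id

# Hypothetical appearance vectors remembered for two tracks before an occlusion.
track_features = {7: [0.9, 0.1, 0.0], 8: [0.0, 0.2, 0.9]}
reappeared = [0.8, 0.2, 0.1]  # detection seen after the occlusion ends
print(match_by_appearance(track_features, reappeared))
```

Because the reappearing detection looks much more like track 7 than track 8, it can keep its old ID instead of being registered as a new object, which is how appearance cues help to reduce identity switches.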

For complete information on the inner workings of DeepSORT and specific algorithmic differences between DeepSORT and other algorithms, we suggest the article “Object Tracking using DeepSORT in TensorFlow 2” by Anushka Dhiman.


3. MATLAB

MATLAB is a numeric computing platform, which makes it different in its implementation compared to DeepSORT and OpenCV, but it is nevertheless a fine choice for tracking tasks.

The Computer Vision Toolbox in MATLAB provides video tracking algorithms, such as continuously adaptive mean shift (CAMShift) and Kanade-Lucas-Tomasi (KLT) for tracking a single object or for use as building blocks in a more complex tracking system.

4. MDNet

MDNet is an accurate, CNN-based visual tracking algorithm inspired by the R-CNN object detection network. It functions by sampling candidate regions and passing them through a CNN. The CNN is typically pre-trained on a vast dataset and refined at the first frame in an input video.

Because every candidate region has to pass through the network, MDNet suffers from high computational complexity in terms of speed and space, which limits its usefulness for real-time object tracking use cases. It is nevertheless an accurate option.

The computation-heavy aspects of MDNet can be reduced by performing RoI (Region of Interest) Pooling, which is a relatively effective way of avoiding repetitive computation and accelerating inference.
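
The candidate sampling step described above can be sketched as follows. This is a rough illustration under stated assumptions, not the MDNet code: candidate boxes are drawn from a Gaussian around the previous target box, a stand-in function `toy_score` replaces the CNN that would normally score each candidate, and the sample count and noise levels are arbitrary.

```python
import random

def sample_candidates(box, n=64, pos_sigma=5.0, scale_sigma=0.05, rng=random):
    """Draw n candidate boxes (cx, cy, w, h) around the previous target box."""
    cx, cy, w, h = box
    candidates = []
    for _ in range(n):
        s = 1.0 + rng.gauss(0.0, scale_sigma)  # small random change in size
        candidates.append((cx + rng.gauss(0.0, pos_sigma),
                           cy + rng.gauss(0.0, pos_sigma),
                           w * s, h * s))
    return candidates

def track_step(prev_box, score_fn, rng=random):
    """Score every candidate and keep the best one as the new target box."""
    return max(sample_candidates(prev_box, rng=rng), key=score_fn)

# Stand-in for the CNN's target score: higher when a candidate is closer
# to a hidden "true" target center.
true_center = (108.0, 96.0)

def toy_score(candidate):
    return -((candidate[0] - true_center[0]) ** 2
             + (candidate[1] - true_center[1]) ** 2)

rng = random.Random(0)  # fixed seed so the run is repeatable
new_box = track_step((100.0, 100.0, 40.0, 40.0), toy_score, rng=rng)
print(new_box)
```

The sketch also hints at where the cost comes from: the scoring function is called once per candidate in every frame, which is expensive when that function is a neural network.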

What’s Next?

Object tracking is used to identify objects in a video and interpret them as a set of trajectories with high accuracy. The main challenge is to strike a balance between computational efficiency and performance.

Check our article People Counting System: How To Make Your Own in Less Than 10 Minutes where we detect and track people in real-time video footage.
