Person re-identification (Re-ID) is used to retrieve a person of interest across multiple, non-overlapping cameras. With the advancement of Deep Neural Networks and the increasing demand for intelligent video surveillance, this problem has attracted significant interest in the computer vision community.
This article will cover the following aspects:
- What is Person Re-Identification?
- What are the main challenges?
- How does Re-Identification with Deep Learning work?
- The next step: Unsupervised Re-Identification
- Outlook and what to expect in the future
What Is Person Re-Identification?
Person Re-Identification Problem
Person re-identification is a specific person retrieval problem across non-overlapping, disjoint cameras. Re-ID aims to determine whether a person of interest has appeared in another place at a distinct time, captured by a different camera, or even by the same camera at a different time instant. The query can be represented by an image, a video sequence, or even a text description.
Re-identification is a widely studied research field. With the urgent demand for public safety and an increasing number of surveillance cameras, re-identifying people is also a goal of great practical importance.
Challenges of Person Re-Identification
Re-identification is challenging due to varying viewpoints, low image resolutions, illumination changes, unconstrained poses, occlusions, heterogeneous modalities, complex camera environments, background clutter, unreliable bounding box generation, and more. All of these factors lead to greatly varying settings and uncertainty.
Additionally, practical model deployment faces further difficulties: dynamically updated camera networks, large-scale galleries that require efficient retrieval, group uncertainty, unseen testing scenarios, incremental model updating, and clothing changes.
These challenges are the main reason that re-identification is still considered to be an unsolved problem for real-world applications.
Re-ID with Deep Learning Methods
Early approaches mainly focused on hand-crafted feature construction with body structures or on distance metric learning. However, with the advancement of deep learning, person re-identification has achieved promising performance on the popular benchmarks.
However, there is still a large gap between the research-oriented scenarios and practical vision re-identification applications.
How Re-Identification With Deep Learning Works
The following shows the concept of a practical person re-identification system to solve the problem of pedestrian retrieval across multiple surveillance cameras. Generally, building a person re-identification system requires five main steps:
- Video Data Collection: The primary requirement is the availability of raw video data from surveillance cameras. Such cameras are usually placed in different places under varying environments. Often, the raw visual data contains a large amount of complex and noisy background clutter.
- Bounding Box Generation: People in the video data are detected using person detection and tracking algorithms. Bounding boxes that contain the person images are extracted from the video data.
- Training Data Annotation: The cross-camera labels are annotated. Training data annotation is usually essential for discriminative Re-identification model learning due to the large cross-camera variations. For large domain shifts, the training data usually needs to be annotated in every new scenario.
- Model Training: In the training phase, a discriminative and robust Re-ID model is trained with the previously annotated person images or videos. This is the core of the development of a re-identification system and is widely researched. Extensive models have been developed to handle the various challenges, concentrating on feature representation learning, distance metric learning, or their combinations.
- Pedestrian Retrieval: The testing phase conducts the pedestrian retrieval. Given a query for a person-of-interest and a gallery set, the Re-ID model extracts feature representations learned in the previous stage. A ranking list is obtained by sorting the calculated query-to-gallery similarity (probability of ID-match).
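The retrieval step above can be sketched in a few lines: rank the gallery by cosine similarity between the query feature and every gallery feature. This is a minimal illustration, not a production system; the feature vectors below are toy placeholders, whereas a real Re-ID model would output high-dimensional CNN embeddings (typically 512-D or 2048-D).

```python
import numpy as np

def rank_gallery(query_feat, gallery_feats):
    """Rank gallery entries by cosine similarity to the query feature."""
    # L2-normalize so that the dot product equals cosine similarity
    q = query_feat / np.linalg.norm(query_feat)
    g = gallery_feats / np.linalg.norm(gallery_feats, axis=1, keepdims=True)
    similarities = g @ q                  # query-to-gallery similarity scores
    ranking = np.argsort(-similarities)   # highest similarity first
    return ranking, similarities[ranking]

# Toy example: 4 gallery persons with 3-D placeholder features
gallery = np.array([[0.10, 0.90, 0.20],
                    [0.80, 0.10, 0.10],
                    [0.15, 0.85, 0.25],
                    [0.00, 0.10, 0.90]])
query = np.array([0.15, 0.85, 0.25])

ranking, scores = rank_gallery(query, gallery)
print(ranking)  # gallery indices sorted by decreasing similarity
```

The returned ranking list corresponds directly to the "query-to-gallery similarity" sorting described above; in practice, re-ranking techniques are often applied on top of this initial list.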
State-of-the-Art Re-Identification: Closed-World
The widely studied “closed-world” setting is usually applied under research assumptions and has achieved relevant advances using deep learning techniques on several datasets. Typically, a standard closed-world Re-ID system contains three main components:
- Feature Representation Learning, which focuses on developing feature construction strategies.
- Deep Metric Learning for designing the training objectives with different loss functions or sampling strategies.
- Ranking Optimization to optimize the retrieved ranking list.
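The deep metric learning component is commonly trained with a triplet loss, which pulls features of the same identity together and pushes features of different identities apart by at least a margin. The following is a minimal numpy sketch of the loss; the 2-D embeddings and the margin value are illustrative assumptions, not taken from a specific paper.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.3):
    """Triplet loss: keep the anchor closer to the positive (same identity)
    than to the negative (different identity) by at least `margin`."""
    d_pos = np.linalg.norm(anchor - positive)  # distance to same identity
    d_neg = np.linalg.norm(anchor - negative)  # distance to other identity
    return max(0.0, d_pos - d_neg + margin)

anchor   = np.array([1.0, 0.0])
positive = np.array([0.9, 0.1])   # same person, slightly different view
negative = np.array([0.0, 1.0])   # a different person

loss = triplet_loss(anchor, positive, negative)
```

When the anchor is already much closer to the positive than to the negative, the loss is zero and the triplet contributes no gradient; hard-triplet mining strategies therefore select informative triplets during training.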
Outlook in the Next Era of Re-Identification: Open-World
With the performance saturation in a closed-world setting, the research focus for person Re-ID has recently moved to the open-world setting, facing more challenging issues:
- Heterogeneous Re-ID by matching person images across heterogeneous modalities. This includes re-identification between depth and RGB images, text-to-image re-identification, visible-to-infrared re-identification, and cross-resolution re-identification.
- End-to-end Re-ID from the raw images or videos. This alleviates the reliance on the additional step of bounding box generation.
- Noise-robust Re-ID. This includes partial Re-ID with heavy occlusion, Re-ID with sample noise caused by detection or tracking errors, and Re-ID with label noise caused by annotation error.
- Open-set person Re-ID, where the correct match may not appear in the gallery at all. Open-set Re-ID is usually formulated as a person verification problem, i.e., discriminating whether two person images belong to the same identity.
- Semi- or unsupervised Re-ID with limited or unavailable annotated labels.
Unsupervised Re-Identification with Deep Learning
In recent years, video-based re-identification has made great advances. Video sequences provide visual and temporal information that can be obtained using object tracking algorithms in practical video surveillance applications.
However, the annotation difficulty limits the scalability of supervised methods in large-scale camera networks, which drives the need for unsupervised video re-identification.
The difference between unsupervised learning and supervised learning is the availability of labels (annotated data). An intuitive idea for unsupervised learning is to estimate Re-identification labels as accurately as possible, which is called “cross-camera label estimation”.
The estimated labels are subsequently used in feature learning to train robust re-ID models.
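Cross-camera label estimation is typically implemented by clustering the extracted features and using the cluster indices as pseudo-identity labels for the next round of feature learning. The sketch below uses a naive greedy distance-threshold clustering purely for illustration; real pipelines usually apply DBSCAN or k-means to CNN embeddings, and the 2-D features and threshold here are toy assumptions.

```python
import numpy as np

def estimate_pseudo_labels(features, threshold=0.5):
    """Greedy clustering: assign each sample to the first cluster whose
    exemplar (its first sample) is within `threshold`, else open a new
    cluster. Cluster indices serve as pseudo-identity labels."""
    exemplars, labels = [], []
    for f in features:
        dists = [np.linalg.norm(f - e) for e in exemplars]
        if dists and min(dists) < threshold:
            labels.append(int(np.argmin(dists)))
        else:
            exemplars.append(f)          # start a new cluster
            labels.append(len(exemplars) - 1)
    return labels

# Toy features from two "identities" observed by different cameras
feats = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.1]])
pseudo_labels = estimate_pseudo_labels(feats)
print(pseudo_labels)  # two clusters recovered as pseudo-identity labels
```

In an actual unsupervised pipeline, the model alternates between this label-estimation step and supervised-style training on the pseudo labels, refining both over several iterations.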
With the success of deep learning, unsupervised Re-ID has attracted increasing attention in recent years. Within three years, unsupervised Re-ID performance on the Market-1501 dataset has improved significantly: Rank-1 accuracy increased from 54.5% to 90.3%, and mAP from 26.3% to 76.7%. Despite these promising achievements, unsupervised Re-identification is still underdeveloped and needs further improvement.
There is still a large gap between unsupervised and supervised Re-ID. For example, supervised ConsAtt has achieved a rank-1 accuracy of 96.1% on the Market-1501 dataset, while the highest accuracy of unsupervised SpCL is about 90.3%. Recently, however, researchers demonstrated that unsupervised learning with large-scale unlabeled training data has the ability to outperform supervised learning on various tasks.
Person Re-identification (Re-ID) solves a visual retrieval problem by searching for the queried person from a gallery of disjoint cameras. Deep learning techniques paved the way for important breakthroughs in recent years.
In the future, we expect to see several breakthroughs: supervised Re-identification methods extending to open-world settings, and unsupervised Re-identification techniques overcoming the bottleneck of data annotation.
If you want to learn more about related topics, we recommend the following articles:
- Read about Federated Learning for distributed training
- A Guide to Deep Face Recognition Technology
- Learn about Edge Intelligence to deploy Deep Learning models
- The Deep Neural Network and three popular types of DNNs