Image segmentation is one of the key applications in computer vision. It has been applied in various domains, for example medical imaging and intelligent transportation. Since the emergence of Deep Neural Networks (DNNs), image segmentation has made tremendous progress.
This article aims to provide an easy-to-understand overview of image segmentation. In particular, you will learn about:
- What is Image Segmentation?
- What are popular applications?
- Semantic vs. Instance Segmentation
- Image Segmentation and Deep Learning
- Most popular Datasets
What is Image Segmentation?
Image segmentation is the task of grouping together the parts of an image that belong to the same object class. Because every pixel receives a label, the process is also called pixel-level classification. In other words, it partitions an image (or video frame) into multiple segments or objects.
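To make "pixel-level classification" concrete, here is a minimal sketch of what a segmentation result looks like as data: a label mask with the same height and width as the image, holding one class ID per pixel. The class IDs, the names in `CLASS_NAMES`, and the tiny 4x6 mask are made up for illustration and do not come from any particular dataset.

```python
# Hypothetical class IDs for illustration only (not taken from a real dataset).
CLASS_NAMES = {0: "background", 1: "road", 2: "car"}

# A label mask has the same shape as the image: one class ID per pixel.
mask = [
    [0, 0, 0, 0, 0, 0],
    [0, 0, 2, 2, 0, 0],
    [1, 1, 2, 2, 1, 1],
    [1, 1, 1, 1, 1, 1],
]


def segments_by_class(mask):
    """Group pixel coordinates (row, col) by their class ID."""
    segments = {}
    for r, row in enumerate(mask):
        for c, class_id in enumerate(row):
            segments.setdefault(class_id, []).append((r, c))
    return segments


segments = segments_by_class(mask)
for class_id, pixels in sorted(segments.items()):
    print(f"{CLASS_NAMES[class_id]}: {len(pixels)} pixels")
```

Every pixel ends up in exactly one segment, which is what "partitioning" the image means here.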
Applications of Image Segmentation
Image segmentation plays a central role in a broad range of real-world computer vision applications, including road sign detection, biology, the evaluation of construction materials, and video surveillance. It is also important for autonomous vehicles and Advanced Driver Assistance Systems (ADAS), which use it to detect navigable surfaces and pedestrians.
Furthermore, image segmentation is widely applied in medical applications, such as tumor boundary extraction or the measurement of tissue volumes. One opportunity in this field is to design standardized image databases that can be used to evaluate fast-spreading new diseases and pandemics (for example, for AI vision applications of coronavirus control).
Deep Learning based Image Segmentation has been successfully applied to segment satellite images in the field of remote sensing, including techniques for urban planning or precision agriculture. Also, images collected by drones (UAVs) have been segmented using Deep Learning based techniques, offering the opportunity to address important environmental problems related to climate change.
Semantic vs. Instance Segmentation
Image segmentation can be formulated either as classifying pixels with semantic labels (semantic segmentation) or as partitioning an image into individual objects (instance segmentation). Semantic segmentation assigns every pixel in the image one label from a set of object categories (for example people, trees, sky, cars).
Semantic segmentation is generally a more difficult undertaking than image classification, which predicts a single label for the entire image or frame. Instance segmentation extends the scope of semantic segmentation further by detecting and delineating each object of interest in the image separately, so that two objects of the same category receive different labels.
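The difference between the two outputs can be sketched with a toy example. In the semantic mask below, both cars share the class ID `2`; the instance mask gives each car its own ID. The helper `label_instances` is a hypothetical function that separates instances with simple 4-connected component labeling. This is only an illustration of the output format, not how instance segmentation models work: touching objects of the same class would be merged by this approach.

```python
from collections import deque


def label_instances(semantic_mask, target_class):
    """Give each 4-connected region of target_class its own instance ID (1, 2, ...)."""
    height, width = len(semantic_mask), len(semantic_mask[0])
    instance_mask = [[0] * width for _ in range(height)]
    next_id = 0
    for r in range(height):
        for c in range(width):
            if semantic_mask[r][c] != target_class or instance_mask[r][c] != 0:
                continue
            # Found an unvisited pixel of the class: flood-fill its region.
            next_id += 1
            instance_mask[r][c] = next_id
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and semantic_mask[ny][nx] == target_class
                            and instance_mask[ny][nx] == 0):
                        instance_mask[ny][nx] = next_id
                        queue.append((ny, nx))
    return instance_mask, next_id


CAR = 2  # illustrative class ID
semantic_mask = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 2, 2, 0, 0, 2, 2],
    [0, 2, 2, 0, 0, 2, 2],
    [0, 0, 0, 0, 0, 0, 0],
]

instance_mask, num_cars = label_instances(semantic_mask, CAR)
print("cars found:", num_cars)
for row in instance_mask:
    print(row)
```

The semantic mask answers "which pixels are car?", while the instance mask also answers "which car does each pixel belong to?".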
Image Segmentation and Deep Learning
Multiple image segmentation algorithms have been developed. Earlier methods include thresholding, histogram-based bundling, region growing, k-means clustering, and watersheds. More advanced algorithms are based on active contours, graph cuts, conditional and Markov random fields, and sparsity-based methods. In recent years, Deep Learning has produced a new generation of image segmentation models with remarkable performance improvements. Deep Learning based image segmentation models often achieve the best accuracy rates on popular benchmarks, resulting in a paradigm shift in the field.
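As an illustration of the earliest family of methods, the sketch below implements thresholding with Otsu's method, one common way to pick the threshold automatically: it chooses the gray level that maximizes the variance between the two resulting pixel groups. The article does not name a specific thresholding technique, so Otsu's method is our choice here, and the function name and the tiny 3x4 grayscale image are made up for the example.

```python
def otsu_threshold(pixels, levels=256):
    """Return the threshold t that maximizes between-class variance.

    Pixels with value <= t form one class, all other pixels the second class.
    """
    histogram = [0] * levels
    for value in pixels:
        histogram[value] += 1
    total = len(pixels)
    total_sum = sum(i * count for i, count in enumerate(histogram))

    best_threshold, best_variance = 0, -1.0
    weight_low, sum_low = 0, 0
    for t in range(levels):
        weight_low += histogram[t]
        if weight_low == 0:
            continue  # no pixels at or below t yet
        weight_high = total - weight_low
        if weight_high == 0:
            break  # all pixels are at or below t; no split possible
        sum_low += t * histogram[t]
        mean_low = sum_low / weight_low
        mean_high = (total_sum - sum_low) / weight_high
        # Between-class variance, up to a constant factor of 1 / total**2.
        variance = weight_low * weight_high * (mean_low - mean_high) ** 2
        if variance > best_variance:
            best_variance, best_threshold = variance, t
    return best_threshold


# Hypothetical 8-bit grayscale image: dark region on the left, bright on the right.
image = [
    [12, 15, 200, 210],
    [10, 14, 205, 220],
    [11, 13, 198, 215],
]

threshold = otsu_threshold([v for row in image for v in row])
binary_mask = [[1 if v > threshold else 0 for v in row] for row in image]
print("threshold:", threshold)
for row in binary_mask:
    print(row)
```

Thresholding like this only works when foreground and background differ clearly in intensity, which is one reason learned models took over for complex scenes.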
Most Popular Image Segmentation Datasets
Recently, due to the success of Deep Learning models in a wide range of vision applications, there has been a substantial amount of research aimed at developing image segmentation approaches using Deep Learning. At present, there are many general datasets related to image segmentation. The most popular image segmentation datasets are:
The PASCAL Visual Object Classes (VOC) Challenge provides publicly available image datasets and annotations. PASCAL VOC is one of the most popular datasets in computer vision, with annotated images available for 5 tasks: classification, segmentation, detection, action recognition, and person layout. A large number of popular segmentation algorithms have been evaluated on this dataset. For segmentation tasks, PASCAL VOC supports 21 label classes: 20 object classes plus background. The object classes cover vehicles (airplane, bicycle, boat, bus, car, motorbike, train), household objects (bottle, chair, dining table, potted plant, sofa, TV/monitor), animals (bird, cat, cow, dog, horse, sheep), and person. Pixels are labeled as background if they do not belong to any of these classes. The training/validation data of PASCAL VOC has 11’530 images containing 27’450 ROI annotated objects and 6’929 segmentations.
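A label mask for this dataset can be read with a simple lookup table. The sketch below assumes the index order commonly used with the VOC segmentation annotations (background as 0, then the 20 object classes in alphabetical order of their original VOC names) and assumes that the value 255 marks "void" pixels such as object boundaries, which are usually ignored. Both conventions, the helper name `class_pixel_counts`, and the toy mask are assumptions of this example, so check them against the devkit you actually use.

```python
from collections import Counter

# Assumed index order: 0 = background, then the 20 object classes.
# Names use this article's spelling (VOC itself writes "aeroplane", "tvmonitor").
VOC_CLASSES = [
    "background", "airplane", "bicycle", "bird", "boat", "bottle", "bus",
    "car", "cat", "chair", "cow", "dining table", "dog", "horse",
    "motorbike", "person", "potted plant", "sheep", "sofa", "train",
    "TV/monitor",
]
VOID_LABEL = 255  # assumed "ignore" value for boundary / difficult pixels


def class_pixel_counts(label_mask):
    """Count pixels per class name, skipping void pixels."""
    counts = Counter()
    for row in label_mask:
        for label in row:
            if label == VOID_LABEL:
                continue
            counts[VOC_CLASSES[label]] += 1
    return dict(counts)


# Hypothetical 3x5 label mask: a dog (12) and a person (15) on background (0).
label_mask = [
    [0, 0, 255, 15, 15],
    [0, 12, 255, 15, 15],
    [12, 12, 12, 255, 0],
]

print(class_pixel_counts(label_mask))
```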
Microsoft Common Objects in Context (MS COCO) is a large-scale object detection, segmentation, and captioning dataset. COCO includes images of complex everyday scenes, containing common objects in their natural contexts. In total, COCO provides 2.5 million labeled instances in 328k images, covering 91 object types that a 4-year-old child could easily recognize.
Cityscapes is a large-scale database that focuses on the semantic understanding of urban street scenes. It contains a diverse set of stereo video sequences recorded in street scenes from 50 cities, with 5’000 fully annotated images and a set of 20’000 coarsely annotated frames. The recordings span several months, covering spring, summer, and fall. Cityscapes includes semantic and dense pixel annotations of 30 classes, grouped into 8 categories (flat surfaces, humans, vehicles, constructions, objects, nature, sky, and void). The dataset is especially important for autonomous driving applications.
ADE20K offers a standard training and evaluation platform for scene parsing algorithms. The ADE20K dataset contains over 20’000 scene-centric images that are annotated with objects and object parts, and it provides 150 semantic categories. Unlike other datasets, ADE20K includes both object segmentation masks and part segmentation masks. There are 20’210 images in the training set, 2’000 images in the validation set, and 3’000 images in the testing set.
The YouTube-Objects Dataset is composed of videos collected from YouTube by querying for the names of 10 object classes from PASCAL VOC: airplane, bird, boat, car, cat, cow, dog, horse, motorbike, and train. The original dataset was developed for object detection with weak annotations and did not contain pixel-wise annotations. For pixel-level work, the fully annotated YouTube Video Object Segmentation dataset (YouTube-VOS) was later released; it contains 4’453 YouTube video clips and 94 object categories.
The KITTI dataset is one of the most popular datasets for mobile robotics and autonomous driving. It contains hours of videos of traffic scenarios captured by driving in and around the mid-sized city of Karlsruhe, including highways and rural areas. Up to 15 cars and 30 pedestrians are visible per image. The main tasks of this dataset are road detection, stereo reconstruction, optical flow, visual odometry, 3D object detection, and 3D tracking. The original dataset does not contain ground truth for semantic segmentation, but researchers have manually annotated parts of the dataset.
Multiple other datasets are available for image segmentation, such as the SUN database (16’873 fully annotated images), the Shadow detection/Texture segmentation vision dataset, the Berkeley segmentation dataset, the Semantic Boundaries Dataset (SBD), PASCAL Part, SYNTHIA, Adobe’s Portrait Segmentation, and the LabelMe images database.