• Train




          Data Collection

          Building Blocks​

          Device Enrollment

          Monitoring Dashboards

          Video Annotation​

          Application Editor​

          Device Management

          Remote Maintenance

          Model Training

          Application Library

          Deployment Manager

          Unified Security Center

          AI Model Library

          Configuration Manager

          IoT Edge Gateway

          Privacy-preserving AI

          Ready to get started?

          Expert Services
  • Why Viso Suite
  • Pricing
Close this search box.

Image Segmentation with Deep Learning (Guide)


Viso Suite is the all-in-one solution for teams to build, deliver, scale computer vision applications.

Need Computer Vision?

Viso Suite is the world’s only end-to-end computer vision platform. Request a demo.

Image segmentation is one of the key applications in the Computer Vision domain. This article aims to provide an easy-to-understand overview of image segmentation and instance segmentation. In particular, you will learn about the following:

  1. What is Image Segmentation?
  2. The meaning of Instance Segmentation
  3. What are popular applications?
  4. Semantic vs. Instance Segmentation
  5. Most popular image segmentation datasets


About us: Viso.ai provides the leading end-to-end Computer Vision Platform Viso Suite. Global organizations use it to develop, deploy, and scale all computer vision applications in one place, with automated infrastructure. Get a personal demo.

Viso Suite – End-to-End Computer Vision and No-Code for Computer Vision Teams


What is Image Segmentation?

One of the most important operations in Computer Vision is Segmentation. Image segmentation is the process of dividing an image into multiple parts or regions that belong to the same class. This task of clustering is based on specific criteria, for example, color or texture.

This process is also called pixel-level classification. In other words, it involves partitioning images (or video frames) into multiple segments or objects.


Semantic image segmentation of aerial drone images.
Semantic image segmentation of aerial drone images. The scene is parted with every pixel belonging to a specific class, such as “building”, “road”, or “tree”.


In the last 40 years, various segmentation methods have been proposed, ranging from MATLAB image segmentation and traditional computer vision methods to state-of-the-art deep learning methods. Especially with the emergence of Deep Neural Networks (DNN), image segmentation applications have made tremendous progress.


Image Segmentation Sample
Annotated image for semantic image segmentation tasks in autonomous driving – Source: Sample from the Mapillary Vistas Dataset


Image Segmentation Techniques

There are various image segmentation techniques available, and each technique has its own advantages and disadvantages.

  1. Thresholding: Thresholding is one of the simplest image segmentation techniques, where a threshold value is set, and all pixels with intensity values above or below the threshold are assigned to separate regions.
  2. Region growing: In region growing, the image is divided into several regions based on similarity criteria. This segmentation technique starts from a seed point and grows the region by adding neighboring pixels with similar characteristics.
  3. Edge-based segmentation: Edge-based segmentation techniques are based on detecting edges in the image. These edges represent boundaries between different regions and are detected using edge detection algorithms.
  4. Clustering: Clustering techniques group pixels into clusters based on similarity criteria. These criteria can be color, intensity, texture, or any other feature.
  5. Watershed segmentation: Watershed segmentation is based on the idea of flooding an image from its minima. In this technique, the image is treated as a topographic relief, where the intensity values represent the height of the terrain.
  6. Active contours: Active contours, also known as snakes, are curves that deform to find the boundary of an object in an image. These curves are controlled by an energy function that minimizes the distance between the curve and the object boundary.
  7. Deep learning-based segmentation: Deep learning techniques, such as Convolutional Neural Networks (CNNs), have revolutionized image segmentation by providing highly accurate and efficient solutions. These techniques use a hierarchical approach to image processing, where multiple layers of filters are applied to the input image to extract high-level features. Read more about the basics of a Convolutional Neural Network.
  8. Graph-based segmentation: This technique represents an image as a graph and partitions the image based on graph theory principles.
  9. Superpixel-based segmentation: This technique groups a set of similar image pixels to form larger, more meaningful regions, called superpixels.


Segment Anything Model (SAM) Example Application
The Segment Anything Model (SAM) is a cutting-edge algorithm developed by Meta AI for zero training image segmentation.


Applications of Image Segmentation

Image segmentation problems play a central role in a broad range of real-world computer vision applications, including road sign detection, biology, the evaluation of construction materials, or video security and surveillance. Also, self-driving cars and Advanced Driver Assistance Systems (ADAS) need to detect navigable surfaces or apply pedestrian detection.


KITTI image segmentation dataset
KITTI dataset sample for image segmentation – Source: KITTI


Furthermore, image segmentation is widely applied in medical imaging applications, such as tumor boundary extraction or measurement of tissue volumes. Here, an opportunity is to design standardized image databases that can be used to evaluate fast-spreading new diseases and pandemics (for example, for AI vision applications of coronavirus control).


Medical Imaging Application for Instance Segmentation
Medical imaging application for the instance segmentation in Dental Medicine using UNet


Deep Learning-based Image Segmentation has been successfully applied to segment satellite images in the field of remote sensing, including techniques for urban planning or precision agriculture. Also, images collected by drones (UAVs) have been segmented using Deep Learning techniques, offering the opportunity to address important environmental problems related to climate change.


YOLOv7-mask for instance segmentation
YOLOv7-mask algorithm for instance segmentation. YOLOv7 is one of the best-performing real-time algorithms.


Semantic vs. Instance Segmentation

Image segmentation can be formulated as a classification problem of pixels with semantic labels (semantic segmentation) or partitioning of individual objects (instance segmentation). Semantic segmentation performs pixel-level class labeling with a set of object categories (for example, people, trees, sky, cars) for all image pixels.

It is generally a more difficult undertaking than image classification, which predicts a single label for the entire image or frame. Instance segmentation extends the scope of semantic segmentation further by detecting and delineating all the objects of interest in an image.


Image segmentation mapping
Image Segmentation with different instances of the same class (individual buildings, houses)


Image Segmentation and Deep Learning

Multiple image segmentation algorithms have been developed. Earlier methods include thresholding, histogram-based bundling, region growing, k-means clustering, or watersheds. However, more advanced algorithms are based on active contours, graph cuts, conditional and Markov random fields, and sparsity-based methods.

Over the last few years, Deep Learning models have introduced a new segment of image segmentation models with remarkable performance improvements. Deep Learning based image segmentation models often achieve the best accuracy rates on popular benchmarks, resulting in a paradigm shift in the field.


ADE20K image segmentation dataset
ADE20K dataset for image segmentation – Source: ADE20K


Most Popular Image Segmentation Datasets

Due to the success of deep learning models in a wide range of vision applications, there has been a substantial amount of research aimed at developing image segmentation approaches using deep learning. At present, there are many general datasets related to image segmentation. The most popular image segmentation datasets are:



The PASCAL Visual Object Classes (VOC) Challenge provides publicly available image datasets and annotations. The PASCAL VOC is one of the most popular datasets in computer vision, with annotated images available for 5 tasks—classification, segmentation, detection, action recognition, and person layout. A high number of popular segmentation algorithms have been evaluated on this dataset.

For segmentation tasks, the PASCAL VOS supports 21 classes of object labels: vehicles, households, animals, airplanes, bicycle, boat, bus, car, motorbike, train, bottle, chair, dining table, potted plant, sofa, TV/monitor, bird, cat, cow, dog, horse, sheep, and person.

Pixels in the image are labeled as “background” if they do not belong to any of these classes. The training/validation data of the PASCAL VOC has 11’530 images containing 27’450 ROI annotated objects and 6’929 segmentations.



The Microsoft Common Objects in Context (MS COCO) is a large-scale object detection, segmentation, and captioning dataset. COCO includes images of complex everyday scenes containing common objects in their natural contexts.

Therefore, COCO is based on a total of 2.5 million labeled segmented instances in 328k images, containing photos of 91 object types that would be recognized easily by a 4-year-old person. For more information about COCO, check out our article What is the COCO Dataset? What you need to know.


MS Coco sample image segmentation
MS COCO dataset image segmentation example



The large-scale database focuses on the semantic understanding of urban street scenes. It contains a diverse set of stereo video sequences recorded in street scenes from 50 cities, 5’000 fully annotated images, and a set of 20’000 weakly annotated frames.

Also, the collection time spans several months, which covers the seasons of spring, summer, and fall. Cityscapes include semantic and dense pixel annotations of 30 classes, grouped into 8 categories (flat surfaces, humans, vehicles, constructions, objects, nature, sky, and void). The dataset is especially important for autonomous driving applications.



ADE20K offers a standard training and evaluation platform for scene parsing algorithms. The ADE20K dataset contains over 20’000 scene-centric images annotated with objects and object parts, and it provides 150 semantic categories.

Unlike other datasets, ADE20K includes an object segmentation mask and a parts segmentation mask. There are 20’210 color images in the training set, 2’000 images in the validation set, and 3’000 images in the testing set.



The YouTube-Objects Dataset is composed of videos collected from YouTube by querying for the names of 10 object classes. In particular, it includes objects from the 10 PASCAL VOC classes airplane, bird, boat, car, cat, cow, dog, horse, motorbike, and train.

The original dataset was developed for object detection with weak annotations and did not contain pixel-wise annotations. Therefore, a fully annotated YouTube Video Object Segmentation dataset (YouTube-VOS) was released containing 4’453 YouTube video clips and 94 object categories.



The KITTI dataset is one of the most popular datasets for mobile robotics and autonomous driving. It contains hours of videos of traffic scenarios captured by driving around the mid-sized city of Karlsruhe (on highways and in rural areas). On average, in every image, up to 15 cars and 30 pedestrians are visible.

The main tasks of this dataset are road detection, stereo reconstruction, optical flow, visual odometry, 3D object detection, and 3D tracking. The original dataset does not contain ground truth for semantic segmentation, but researchers have manually annotated parts of the dataset.


Other Datasets

There are multiple other datasets available for image segmentation purposes, such as the SUN database (16’873 fully annotated images), Shadow detection/Texture segmentation vision dataset, Berkeley segmentation dataset, the Semantic Boundaries Dataset (SBD), PASCAL Part, SYNTHIA, Adobe’s Portrait Segmentation or the LabelMe images database.



OMG-Seg is a framework proposed by Li, et. al. in 2024 that provides 10 segmentation tasks in one model. The tasks include image, video, open-vocabulary segmentation, and more.


What’s Next?

In past years, image and instance segmentation methods have made great progress. Hence, image segmentation accelerates the development of real-world applications across industries, including tumor detection, material detection on construction sites, and, most prominently, autonomous driving.

If you enjoyed reading this article, we recommend the following:

Follow us

Related Articles

Join 6,300+ Fellow
AI Enthusiasts

Get expert news and updates straight to your inbox. Subscribe to the Viso Blog.

Sign up to receive news and other stories from viso.ai. Your information will be used in accordance with viso.ai's privacy policy. You may opt out at any time.
Play Video

Join 6,300+ Fellow
AI Enthusiasts

Get expert AI news 2x a month. Subscribe to the most read Computer Vision Blog.

You can unsubscribe anytime. See our privacy policy.

Build any Computer Vision Application, 10x faster

All-in-one Computer Vision Platform for businesses to build, deploy and scale real-world applications.