Image Segmentation with Deep Learning (Guide)



Image segmentation is one of the key applications in the Computer Vision domain. This article aims to provide an easy-to-understand overview of image segmentation and instance segmentation. In particular, you will learn about:

  1. What is Image Segmentation?
  2. The meaning of Instance Segmentation
  3. What are popular applications?
  4. Semantic vs. Instance Segmentation
  5. Most popular image segmentation datasets


About us: Viso.ai provides the leading end-to-end Computer Vision Platform Viso Suite. Global organizations use it to develop, deploy and scale all computer vision applications in one place, with automated infrastructure. Get a personal demo.



What is Image Segmentation?

One of the most important operations in Computer Vision is Segmentation. Image segmentation is the task of clustering parts of an image together that belong to the same object class. This process is also called pixel-level classification. In other words, it involves partitioning images (or video frames) into multiple segments or objects.
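To make "pixel-level classification" concrete, here is a minimal sketch in plain Python. It assumes that some model has already produced one score per class for every pixel; the scores, the class list, and the `segment` helper are hypothetical and only illustrate the idea of assigning each pixel the class with the highest score.

```python
# Hypothetical class list, borrowed from the aerial-image example in this article.
CLASSES = ["building", "road", "tree"]

def segment(scores):
    """Return a label map: for every pixel, the index of the highest class score."""
    return [
        [max(range(len(pixel)), key=lambda c: pixel[c]) for pixel in row]
        for row in scores
    ]

# Made-up per-pixel scores for a 2x3 image (rows x columns x classes).
scores = [
    [[0.8, 0.1, 0.1], [0.2, 0.7, 0.1], [0.1, 0.2, 0.7]],
    [[0.6, 0.3, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4]],
]

label_map = segment(scores)                                # class index per pixel
named_map = [[CLASSES[c] for c in row] for row in label_map]
print(named_map)
```

The resulting label map has the same height and width as the image, which is what distinguishes segmentation from whole-image classification.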


Semantic image segmentation of aerial drone images. The scene is partitioned into classes such as “building”, “road”, and “tree”.

Over the last 40 years, many segmentation methods have been proposed, ranging from MATLAB image segmentation and traditional computer vision methods to state-of-the-art deep learning methods. With the emergence of Deep Neural Networks (DNNs) in particular, image segmentation has made tremendous progress.


Annotated image for semantic image segmentation – Source: Sample from the Mapillary Vistas Dataset


Applications of Image Segmentation

Image segmentation plays a central role in a broad range of real-world computer vision applications, including road sign detection, biology, the evaluation of construction materials, and video surveillance. Autonomous vehicles and Advanced Driver Assistance Systems (ADAS) also rely on it to detect navigable surfaces and pedestrians.


KITTI dataset sample for image segmentation – Source: KITTI

Furthermore, image segmentation is widely applied in medical applications, such as tumor boundary extraction or the measurement of tissue volumes. One opportunity here is to design standardized image databases that can be used to evaluate fast-spreading new diseases and pandemics (for example, for AI vision applications of coronavirus control).

Deep Learning-based image segmentation has been successfully applied to satellite images in the field of remote sensing, for example in urban planning and precision agriculture. Images collected by drones (UAVs) have also been segmented using Deep Learning-based techniques, offering the opportunity to address important environmental problems related to climate change.


YOLOv7-mask algorithm for instance segmentation. YOLOv7 is one of the best-performing real-time algorithms.


Semantic vs. Instance Segmentation

Image segmentation can be formulated as a classification problem of pixels with semantic labels (semantic segmentation) or partitioning of individual objects (instance segmentation). Semantic segmentation performs pixel-level labeling with a set of object categories (for example, people, trees, sky, cars) for all image pixels.

Semantic segmentation is generally a more difficult undertaking than image classification, which predicts a single label for the entire image or frame. Instance segmentation extends the scope of semantic segmentation further by detecting and delineating each object of interest individually, so that two objects of the same class receive separate masks.
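The difference between the two outputs can be sketched with a toy example. The code below takes a semantic mask and splits one class into separate instances using 4-connected component labeling. This is a deliberate simplification and not how deep learning instance segmentation models work: it fails as soon as two objects of the same class touch. The `label_instances` helper and the label values are assumptions made for illustration.

```python
from collections import deque

def label_instances(semantic_mask, target_class):
    """Give each 4-connected region of target_class its own id (1, 2, ...)."""
    height, width = len(semantic_mask), len(semantic_mask[0])
    instances = [[0] * width for _ in range(height)]
    next_id = 0
    for y in range(height):
        for x in range(width):
            if semantic_mask[y][x] != target_class or instances[y][x]:
                continue
            # Found an unvisited pixel of the class: flood-fill its region.
            next_id += 1
            instances[y][x] = next_id
            queue = deque([(y, x)])
            while queue:
                cy, cx = queue.popleft()
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and semantic_mask[ny][nx] == target_class
                            and not instances[ny][nx]):
                        instances[ny][nx] = next_id
                        queue.append((ny, nx))
    return instances, next_id

# Hypothetical semantic mask: 0 = background, 1 = building.
semantic = [
    [1, 1, 0, 0, 1],
    [1, 1, 0, 0, 1],
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
]

instance_mask, building_count = label_instances(semantic, target_class=1)
for row in instance_mask:
    print(row)
```

In the semantic mask all building pixels share the label 1, while the instance mask separates them into individual buildings.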


Image Segmentation with different instances (individual buildings, houses)


Image Segmentation and Deep Learning

Multiple image segmentation algorithms have been developed. Earlier methods include thresholding, histogram-based bundling, region growing, k-means clustering, and watersheds. More advanced algorithms are based on active contours, graph cuts, conditional and Markov random fields, and sparsity-based methods.
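As an illustration of the simplest of these classical methods, the sketch below implements thresholding with Otsu's method, one common way to choose the threshold automatically by maximizing the variance between the two resulting pixel groups. The article does not prescribe a specific thresholding rule, so the choice of Otsu's method, the `otsu_threshold` helper, and the tiny grayscale image are assumptions.

```python
def otsu_threshold(pixels, levels=256):
    """Return the threshold t that maximizes between-class variance.

    Pixels with intensity <= t form one group, pixels > t the other.
    """
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    count_low = 0   # number of pixels with intensity <= t
    sum_low = 0     # sum of their intensities
    for t in range(levels):
        count_low += hist[t]
        sum_low += t * hist[t]
        if count_low == 0:
            continue
        count_high = total - count_low
        if count_high == 0:
            break
        mean_low = sum_low / count_low
        mean_high = (sum_all - sum_low) / count_high
        var_between = count_low * count_high * (mean_low - mean_high) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t

# Made-up 3x3 grayscale image: dark background, bright object on the right.
image = [
    [12, 15, 200],
    [10, 210, 205],
    [14, 13, 198],
]

threshold = otsu_threshold([p for row in image for p in row])
binary_mask = [[1 if p > threshold else 0 for p in row] for row in image]
print(threshold, binary_mask)
```

Thresholding only separates two intensity groups and ignores spatial context, which is one reason the more advanced and learning-based methods were developed.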

Over the last few years, Deep Learning has introduced a new generation of image segmentation models with remarkable performance improvements. Deep Learning-based image segmentation models often achieve the best accuracy rates on popular benchmarks, resulting in a paradigm shift in the field.
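The article does not say how accuracy is measured on these benchmarks. A widely used metric for semantic segmentation is the mean Intersection over Union (mIoU), and the sketch below shows how it can be computed for a single image under that assumption. The `mean_iou` helper and the toy masks are illustrative; real benchmark tooling accumulates the counts over an entire dataset.

```python
def mean_iou(pred, truth, num_classes):
    """Mean Intersection over Union across classes present in pred or truth."""
    intersection = [0] * num_classes
    union = [0] * num_classes
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p == t:
                intersection[p] += 1
                union[p] += 1
            else:
                # A wrong pixel enlarges the union of both classes involved.
                union[p] += 1
                union[t] += 1
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious) if ious else 0.0

# Toy ground truth and prediction with two classes (0 and 1).
truth = [[0, 0, 1],
         [0, 1, 1]]
pred = [[0, 1, 1],
        [0, 1, 1]]

score = mean_iou(pred, truth, num_classes=2)
print(round(score, 4))
```

Here class 0 has an IoU of 2/3 and class 1 an IoU of 3/4, so a single misclassified pixel already lowers the score for both classes.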


ADE20K dataset for image segmentation – Source: ADE20K


Most Popular Image Segmentation Datasets

Due to Deep Learning models’ success in a wide range of vision applications, there has been a substantial amount of research aimed at developing image segmentation approaches using Deep Learning. At present, there are many general datasets related to image segmentation. The most popular image segmentation datasets are:



The PASCAL Visual Object Classes (VOC) Challenge provides publicly available image datasets and annotations. PASCAL VOC is one of the most popular datasets in computer vision, with annotated images available for 5 tasks: classification, segmentation, detection, action recognition, and person layout. Many popular segmentation algorithms have been evaluated on this dataset.

For segmentation tasks, PASCAL VOC supports 21 label classes: 20 object classes plus background. The object classes fall into four groups: vehicles (airplane, bicycle, boat, bus, car, motorbike, train), household objects (bottle, chair, dining table, potted plant, sofa, TV/monitor), animals (bird, cat, cow, dog, horse, sheep), and person.

Pixels are labeled as background if they do not belong to any of these classes. The training/validation data of the PASCAL VOC has 11’530 images containing 27’450 ROI annotated objects and 6’929 segmentations.



The Microsoft Common Objects in Context (MS COCO) is a large-scale object detection, segmentation, and captioning dataset. COCO includes images of complex everyday scenes containing common objects in their natural contexts.

In total, COCO contains 2.5 million labeled segmented instances in 328k images, covering 91 object types that would be easily recognized by a 4-year-old. For more information about COCO, check out our article What is the COCO Dataset? What you need to know.


MS COCO dataset image segmentation example



Cityscapes is a large-scale database that focuses on the semantic understanding of urban street scenes. It contains a diverse set of stereo video sequences recorded in street scenes from 50 cities, 5’000 fully annotated images, and a set of 20’000 weakly annotated frames.

The collection time spans several months, covering the seasons of spring, summer, and fall. Cityscapes includes semantic and dense pixel annotations of 30 classes, grouped into 8 categories (flat surfaces, humans, vehicles, constructions, objects, nature, sky, and void). The dataset is especially important for autonomous driving applications.
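The grouping of fine-grained classes into categories can be pictured as a lookup table applied to every pixel. The sketch below uses the eight category names listed above, but the class-to-category assignments are only an illustrative subset chosen for this example and should not be read as the dataset's official label definition. The `to_categories` helper is hypothetical.

```python
# Illustrative subset of class-to-category assignments (assumed for this example).
CLASS_TO_CATEGORY = {
    "road": "flat surfaces",
    "sidewalk": "flat surfaces",
    "person": "humans",
    "rider": "humans",
    "car": "vehicles",
    "bus": "vehicles",
    "building": "constructions",
    "wall": "constructions",
    "pole": "objects",
    "traffic sign": "objects",
    "vegetation": "nature",
    "terrain": "nature",
    "sky": "sky",
}

def to_categories(class_mask):
    """Map a mask of class names to category names; unknown classes become 'void'."""
    return [
        [CLASS_TO_CATEGORY.get(name, "void") for name in row]
        for row in class_mask
    ]

# Made-up 2x3 class mask.
class_mask = [
    ["road", "car", "sky"],
    ["sidewalk", "person", "unlabeled"],
]

category_mask = to_categories(class_mask)
print(category_mask)
```

Evaluating on the coarser categories is an easier task than evaluating on all classes, because confusions within a category (for example, car versus bus) no longer count as errors.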



ADE20K offers a standard training and evaluation platform for scene parsing algorithms. The ADE20K dataset contains over 20’000 scene-centric images annotated with objects and object parts, and it provides 150 semantic categories.

Unlike other datasets, ADE20K includes an object segmentation mask and a parts segmentation mask. There are 20’210 images in the training set, 2’000 images in the validation set, and 3’000 images in the testing set.



The YouTube-Objects Dataset is composed of videos collected from YouTube by querying for the names of 10 object classes. In particular, it includes objects from 10 PASCAL VOC classes: airplane, bird, boat, car, cat, cow, dog, horse, motorbike, and train.

The original dataset was developed for object detection with weak annotations and did not contain pixel-wise annotations. Therefore, a fully annotated YouTube Video Object Segmentation dataset (YouTube-VOS) was released containing 4’453 YouTube video clips and 94 object categories.



The KITTI dataset is one of the most popular datasets for mobile robotics and autonomous driving. It contains hours of video of traffic scenarios captured by driving around the mid-sized city of Karlsruhe, in rural areas, and on highways. Up to 15 cars and 30 pedestrians are visible per image.

The main tasks of this dataset are road detection, stereo reconstruction, optical flow, visual odometry, 3D object detection, and 3D tracking. The original dataset does not contain ground truth for semantic segmentation, but researchers have manually annotated parts of the dataset.


Other Datasets

There are multiple other datasets available for image segmentation purposes, such as the SUN database (16’873 fully annotated images), the Shadow detection/Texture segmentation vision dataset, the Berkeley segmentation dataset, the Semantic Boundaries Dataset (SBD), PASCAL Part, SYNTHIA, Adobe’s Portrait Segmentation, and the LabelMe images database.


What’s Next?

In recent years, image and instance segmentation methods have made great progress. As a result, image segmentation is accelerating the development of real-world applications across industries, including tumor detection, material detection on construction sites, and, most prominently, autonomous driving.
