
Everything about Mask R-CNN: A Beginner’s Guide


Mask R-CNN is a Convolutional Neural Network (CNN) architecture that set the state of the art for instance segmentation when it was introduced. This variant of a Deep Neural Network detects objects in an image and generates a high-quality segmentation mask for each instance.

In this article, I will provide a simple and high-level overview of Mask R-CNN. Then, we will discuss the basic concepts required to understand what Mask R-CNN is and how it works:

  1. Convolutional Neural Networks (CNN)
  2. Region-Based Convolutional Neural Networks (R-CNN)
  3. Faster R-CNN with Region Proposal Networks (RPN)
  4. Mask R-CNN and how it works
  5. Example projects and applications


Mask R-CNN Demo Sample

To understand the differences between R-CNN, Faster R-CNN, and Mask R-CNN, we first have to understand what a CNN is and how it works.

What is a Convolutional Neural Network (CNN)?

A Convolutional Neural Network (CNN) is a type of artificial neural network used in image recognition and processing that is optimized to process pixel data. Convolutional Neural Networks are therefore the basic building blocks for the computer vision task of image segmentation (CNN segmentation).

The Convolutional Neural Network architecture consists of three main types of layers:

  1. Convolutional layer: This layer abstracts the input image into a feature map by sliding filters (kernels) over it.
  2. Pooling layer: This layer downsamples feature maps by summarizing the presence of features in patches of the feature map.
  3. Fully connected layer: Fully connected layers connect every neuron in one layer to every neuron in the next layer.
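
The three layer types above can be sketched in a few lines of plain Python. This is a toy illustration only, not how a real framework implements them: the image, the edge-detecting kernel, and the dense-layer weights are hand-picked, hypothetical values, whereas a trained CNN learns its kernels and weights from data.

```python
def conv2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1) and sum the products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def max_pool2d(feature_map, size=2):
    """Downsample by keeping the maximum of each non-overlapping size x size patch."""
    out_h = len(feature_map) // size
    out_w = len(feature_map[0]) // size
    return [
        [
            max(feature_map[r * size + i][c * size + j]
                for i in range(size) for j in range(size))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def fully_connected(inputs, weights, biases):
    """Every output neuron is a weighted sum of every input, plus a bias."""
    return [sum(x * w for x, w in zip(inputs, row)) + b
            for row, b in zip(weights, biases)]


# Hypothetical 5x5 image with a vertical edge between columns 1 and 2.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
# Hand-picked kernel that responds to dark-to-bright vertical edges.
edge_kernel = [[-1, 1], [-1, 1]]

features = conv2d(image, edge_kernel)      # 4x4 feature map
pooled = max_pool2d(features, size=2)      # 2x2 after downsampling
flat = [v for row in pooled for v in row]  # flatten for the dense layer
scores = fully_connected(flat,
                         weights=[[1, 1, 1, 1], [-1, -1, -1, -1]],
                         biases=[0, 0])
```

The feature map is non-zero only where the edge sits, pooling keeps that response while shrinking the map, and the fully connected layer turns the pooled values into one score per (made-up) class.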

Combining the layers of a CNN enables the designed neural network to learn how to identify and recognize the object of interest in an image. Simple Convolutional Neural Networks are built for image classification and object detection with a single object in the image.

Concept of the CNN architecture: How a convolutional neural network works.

In a more complex situation with multiple objects in an image, a simple CNN architecture isn’t optimal. For those situations, Mask R-CNN is a state-of-the-art architecture that is based on R-CNN (also referred to as RCNN).

What is R-CNN?

R-CNN, or RCNN, stands for Region-Based Convolutional Neural Network. It is a type of machine learning model used for computer vision tasks, specifically object detection.

To understand what RCNN is, we will look next into the RCNN architecture.

How does R-CNN work?

The following image depicts the concept of region-based CNN (R-CNN). This approach proposes bounding boxes around candidate object regions and then evaluates a convolutional network independently on each Region of Interest (RoI) to classify it into one of the target classes.

The R-CNN architecture was designed to solve object detection tasks. It also forms the basis of Mask R-CNN, and it was later improved into what we know as Faster R-CNN.

Concept of R-CNN – Region-based Convolutional Networks

What is Faster R-CNN?

Faster R-CNN is an improved version of the R-CNN architecture with two stages:

  1. Region Proposal Network (RPN). The RPN is a neural network that proposes candidate object regions within an image.
  2. Fast R-CNN. This extracts features using RoIPool (Region of Interest Pooling) from each candidate box and performs classification and bounding-box regression. RoIPool is an operation that extracts a small, fixed-size feature map from each RoI, whatever the size or aspect ratio of the RoI.
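
To make the RoIPool step more concrete, here is a minimal sketch in plain Python. It is a simplified illustration, not the exact implementation of any library: the function name `roi_max_pool`, the end-exclusive `(top, left, bottom, right)` box format, and the integer bin edges are assumptions made for this example.

```python
def roi_max_pool(feature_map, roi, output_size=2):
    """Max-pool one region of a 2D feature map into an output_size x output_size grid.

    roi is (top, left, bottom, right) in feature-map cells, end-exclusive.
    Bin edges are snapped to whole cells, so neighboring bins may overlap
    by one cell when the region does not divide evenly.
    """
    top, left, bottom, right = roi
    height, width = bottom - top, right - left

    def bin_edges(start, length, index):
        lo = start + (index * length) // output_size            # floor
        hi = start - (-((index + 1) * length) // output_size)   # ceil
        return lo, hi

    pooled = []
    for i in range(output_size):
        r0, r1 = bin_edges(top, height, i)
        row = []
        for j in range(output_size):
            c0, c1 = bin_edges(left, width, j)
            row.append(max(feature_map[r][c]
                           for r in range(r0, r1) for c in range(c0, c1)))
        pooled.append(row)
    return pooled


# Hypothetical 4x6 feature map whose values simply count up row by row.
feature_map = [[r * 6 + c for c in range(6)] for r in range(4)]

whole_map = roi_max_pool(feature_map, (0, 0, 4, 6))  # a 4x6 region
small_box = roi_max_pool(feature_map, (1, 2, 4, 5))  # a 3x3 region
```

Both regions come out as 2x2 grids even though their sizes differ, which is what lets the following fully connected layers accept any proposal.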

    Concept of Region proposal Networks (RPN)


Faster R-CNN advances this line of work by combining a learned Region Proposal Network, which tells the detector where to look, with the Fast R-CNN architecture. Fast R-CNN is faster than R-CNN because it does not have to feed about 2,000 region proposals through the convolutional neural network for every image. Instead, the convolution operation is done only once per image, and a feature map is generated from it.

Furthermore, Faster R-CNN is an optimized form of R-CNN built for computation speed: it runs the same kind of region-based detection much faster.
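
The saving from sharing the convolution can be illustrated with a small counting experiment. This is a sketch under strong simplifying assumptions: `CountingBackbone` is a hypothetical stand-in that returns its input unchanged and only counts calls, and the proposals are made-up boxes. In a real network the feature map has a lower resolution than the image, so boxes would also need to be rescaled.

```python
class CountingBackbone:
    """Stand-in for a CNN feature extractor that only counts how often it runs."""

    def __init__(self):
        self.calls = 0

    def __call__(self, pixels):
        self.calls += 1
        return pixels  # a real backbone would return a feature map


def crop(grid, box):
    """Cut the (top, left, bottom, right) region, end-exclusive, out of a 2D grid."""
    top, left, bottom, right = box
    return [row[left:right] for row in grid[top:bottom]]


def rcnn_style(image, proposals, backbone):
    # R-CNN: crop every proposal from the image and run the CNN on each crop.
    return [backbone(crop(image, box)) for box in proposals]


def fast_rcnn_style(image, proposals, backbone):
    # Fast R-CNN: run the CNN once, then crop each proposal from the shared map.
    feature_map = backbone(image)
    return [crop(feature_map, box) for box in proposals]


image = [[r * 8 + c for c in range(8)] for r in range(8)]
proposals = [(0, 0, 4, 4), (2, 2, 6, 6)] * 1000  # 2,000 hypothetical proposals

per_crop_backbone = CountingBackbone()
shared_backbone = CountingBackbone()
per_crop = rcnn_style(image, proposals, per_crop_backbone)
shared = fast_rcnn_style(image, proposals, shared_backbone)

print(per_crop_backbone.calls, shared_backbone.calls)
```

With 2,000 proposals the first variant runs the backbone 2,000 times and the second runs it once, while both hand the same number of region features to the later stages.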


Fast R-CNN

The main difference between Fast R-CNN and Faster R-CNN is that Fast R-CNN uses selective search for generating Regions of Interest, while Faster R-CNN uses a “Region Proposal Network” (RPN).

Let’s move on to see how Faster R-CNN was used in building Mask R-CNN.

What is Mask R-CNN?

Mask R-CNN, or Mask RCNN, is a Convolutional Neural Network (CNN) architecture for image segmentation, and more specifically for instance segmentation. Mask R-CNN was developed on top of Faster R-CNN, a Region-Based Convolutional Neural Network.

The first step to understanding how Mask R-CNN works is to understand the concept of image segmentation.

Image segmentation is the computer vision task of partitioning a digital image into multiple segments (sets of pixels, also known as image objects). Segmentation is used to locate objects and boundaries (lines, curves, etc.).

Two main types of image segmentation are relevant to Mask R-CNN:

  1. Semantic Segmentation
  2. Instance Segmentation

Semantic Segmentation

Semantic segmentation classifies each pixel into a fixed set of categories without differentiating object instances. In other words, semantic segmentation identifies and classifies similar objects as a single class at the pixel level.

As shown in the image below, all objects are classified as a single entity (person). Semantic segmentation is otherwise known as background segmentation because it separates the subjects of the image from the background.

Differences between Semantic Segmentation and Instance Segmentation

Instance Segmentation

Instance Segmentation, or Instance Recognition, deals with the correct detection of all objects in an image while also precisely segmenting each instance. It is, therefore, the combination of object detection, object localization, and object classification. In other words, this type of segmentation goes further to give a clear distinction between each object classified as similar instances.

As shown in the example image above, for Instance Segmentation, all objects are persons, but this segmentation process separates each person as a single entity. Instance segmentation is otherwise known as foreground segmentation because it accentuates the subjects of the image instead of the background.
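
The difference between the two outputs is easy to see as data. The sketch below uses a made-up 2x5 image with two people; the class ids, the mask layout, and the helper names are assumptions for illustration, not the output format of any particular library.

```python
BACKGROUND, PERSON = 0, 1

# Instance segmentation output: one (class_id, binary mask) pair per object.
instances = [
    (PERSON, [[1, 1, 0, 0, 0],
              [1, 1, 0, 0, 0]]),
    (PERSON, [[0, 0, 0, 1, 1],
              [0, 0, 0, 1, 1]]),
]


def to_semantic_map(instances, height, width):
    """Collapse instances into one class id per pixel; object identity is lost."""
    semantic = [[BACKGROUND] * width for _ in range(height)]
    for class_id, mask in instances:
        for r in range(height):
            for c in range(width):
                if mask[r][c]:
                    semantic[r][c] = class_id
    return semantic


def to_instance_map(instances, height, width):
    """Label each pixel with the 1-based index of the object covering it."""
    labels = [[0] * width for _ in range(height)]
    for index, (_, mask) in enumerate(instances, start=1):
        for r in range(height):
            for c in range(width):
                if mask[r][c]:
                    labels[r][c] = index
    return labels


semantic_map = to_semantic_map(instances, height=2, width=5)
instance_map = to_instance_map(instances, height=2, width=5)
```

In `semantic_map` both people share the label 1 (person), so they cannot be told apart. In `instance_map` the first person is labeled 1 and the second 2, which is the extra information instance segmentation provides.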

How does Mask R-CNN work?

Mask R-CNN was built using Faster R-CNN. While Faster R-CNN has 2 outputs for each candidate object, a class label and a bounding-box offset, Mask R-CNN adds a third branch that outputs the object mask. The additional mask output is distinct from the class and box outputs, requiring the extraction of a much finer spatial layout of an object.

In other words, Mask R-CNN extends Faster R-CNN by adding a branch that predicts an object mask for each Region of Interest, in parallel with the existing branch for bounding box recognition.
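
The three outputs per candidate object can be sketched as a small data structure. This is an illustration with made-up numbers, assuming (as described in the Mask R-CNN paper) that the mask branch produces one small probability grid per class and that the grid for the predicted class is binarized at a threshold of 0.5. The names `RoIPrediction` and `select_binary_mask` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RoIPrediction:
    class_scores: List[float]            # one score per class
    box_offsets: List[float]             # bounding-box refinement (dx, dy, dw, dh)
    mask_probs: List[List[List[float]]]  # one m x m probability grid per class


def select_binary_mask(pred, threshold=0.5):
    """Pick the mask that belongs to the top-scoring class and binarize it."""
    class_id = max(range(len(pred.class_scores)),
                   key=pred.class_scores.__getitem__)
    mask = [[1 if p >= threshold else 0 for p in row]
            for row in pred.mask_probs[class_id]]
    return class_id, mask


# Hypothetical prediction for one RoI with two classes and 2x2 masks.
pred = RoIPrediction(
    class_scores=[0.2, 0.8],
    box_offsets=[0.1, -0.05, 0.0, 0.2],
    mask_probs=[
        [[0.9, 0.9], [0.9, 0.9]],  # mask for class 0 (ignored here)
        [[0.7, 0.4], [0.6, 0.1]],  # mask for class 1 (the predicted class)
    ],
)

class_id, binary_mask = select_binary_mask(pred)
```

Because the class comes from the classification branch, the mask branch only has to answer "which pixels belong to the object" for that class, so the class-0 grid is ignored even though its values are higher.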

Advantages of Mask R-CNN
  • Simplicity: Mask R-CNN is simple to train.
  • Performance: In the original paper, Mask R-CNN outperformed all existing single-model entries on every task of the COCO suite of challenges.
  • Efficiency: The method is very efficient and adds only a small overhead to Faster R-CNN.
  • Flexibility: Mask R-CNN is easy to generalize to other tasks. For example, it is possible to use Mask R-CNN for human pose estimation in the same framework.


Mask R-CNN – The Mask R-CNN Framework for Instance Segmentation


The key element of Mask R-CNN is pixel-to-pixel alignment, which is the main missing piece in Fast/Faster R-CNN. Mask R-CNN adopts the same two-stage procedure with an identical first stage (the RPN). In the second stage, in parallel to predicting the class and box offset, Mask R-CNN also outputs a binary mask for each RoI. This is in contrast to most recent systems, where classification depends on mask predictions.
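
To see why alignment matters, compare two ways of reading a feature map at a fractional position. The Mask R-CNN paper addresses the misalignment with a layer called RoIAlign, which samples the feature map with bilinear interpolation instead of snapping coordinates to whole cells. The sketch below shows only that sampling idea on a made-up 2x2 feature map; the function names are hypothetical, and a full RoIAlign layer additionally averages or max-pools several such samples per output bin.

```python
import math


def sample_quantized(feature_map, y, x):
    """Snap fractional coordinates down to a whole cell, as coarse pooling does."""
    return feature_map[int(math.floor(y))][int(math.floor(x))]


def sample_bilinear(feature_map, y, x):
    """Blend the four neighboring cells according to the fractional position."""
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1 = min(y0 + 1, len(feature_map) - 1)
    x1 = min(x0 + 1, len(feature_map[0]) - 1)
    wy, wx = y - y0, x - x0
    top = (1 - wx) * feature_map[y0][x0] + wx * feature_map[y0][x1]
    bottom = (1 - wx) * feature_map[y1][x0] + wx * feature_map[y1][x1]
    return (1 - wy) * top + wy * bottom


feature_map = [[0.0, 10.0],
               [20.0, 30.0]]

quantized = sample_quantized(feature_map, 0.5, 0.5)
interpolated = sample_bilinear(feature_map, 0.5, 0.5)
```

At the point halfway between all four cells, the quantized lookup returns the value of the top-left cell alone, while the bilinear sample returns the average of the four neighbors. Small shifts like this are harmless for a class label but matter when predicting a mask pixel by pixel.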

Furthermore, Mask R-CNN is simple to implement and train given the Faster R-CNN framework, which facilitates a wide range of flexible architecture designs. Additionally, the mask branch only adds a small computational overhead, enabling a fast system and rapid experimentation.

Projects with Mask R-CNN

Mask R-CNN was used to improve OpenStreetMap by adding baseball, soccer, tennis, football, and basketball fields.

OpenStreetMap Mapping Example with Mask RCNN

The project applied the Mask R-CNN algorithm to satellite images with the goal of identifying sports fields. Sports fields are a good fit for the Mask R-CNN algorithm: unlike buildings, they remain visible in satellite images regardless of tree cover. The method is also efficient because, unlike streets, sports fields are “blob” shaped rather than line shaped.

In a similar application, satellite imagery has been used to create maps for use by humanitarian organizations. In this project, Mask R-CNN was used for the mapping task.


Mask R-CNN applied to satellite imagery for neighborhood mapping

Mask R-CNN and similar image segmentation algorithms have many more use cases. Popular applications include autonomous vehicles and medical applications, such as tumor detection or even detecting features related to the coronavirus.
