Mask R-CNN: A Beginner’s Guide


Mask R-CNN is a convolutional neural network (CNN) and a state-of-the-art approach to image segmentation. This deep neural network detects objects in an image and generates a high-quality segmentation mask for each instance.

In this article, I will provide a simple and high-level overview of Mask R-CNN. We will discuss the basic concepts required to understand what Mask R-CNN is and how it works:

  1. Convolutional Neural Networks (CNN)
  2. Region-Based Convolutional Neural Networks (R-CNN)
  3. Faster R-CNN with Region Proposal Networks (RPN)
  4. Mask R-CNN and how it works
  5. Example projects and applications
Mask R-CNN Demo Sample

Convolutional Neural Networks

CNNs (Convolutional Neural Networks) are the fundamental building blocks for image segmentation. Three main layer types make up the CNN architecture:

  1. Convolutional layer: This layer abstracts the input image into a feature map by applying filters (kernels).
  2. Pooling layer: This layer downsamples feature maps by summarizing the presence of features in patches of the feature map.
  3. Fully connected layer: This layer connects every neuron in one layer to every neuron in the next layer.

Combining these layers enables the designed network to learn how to identify the object of interest in the image. Simple CNNs are built for image classification and object detection with a single object in the image.
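To make the three layer types more concrete, here is a minimal sketch in plain Python. It is not how a real framework implements them; the toy image, the hand-picked edge filter, and the fully connected weights are all hypothetical values chosen for illustration.

```python
def conv2d(image, kernel):
    """Valid (no padding) 2-D cross-correlation, the core of a convolutional layer."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def max_pool2d(fmap, size=2):
    """Non-overlapping max pooling; rows/columns that do not fill a window are dropped."""
    out_h, out_w = len(fmap) // size, len(fmap[0]) // size
    return [
        [
            max(fmap[i * size + a][j * size + b]
                for a in range(size) for b in range(size))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def fully_connected(features, weights, bias):
    """Every output neuron is a weighted sum of every input feature."""
    return [sum(w * x for w, x in zip(row, features)) + b
            for row, b in zip(weights, bias)]


# Toy 5x5 image with a vertical edge (dark on the left, bright on the right).
image = [[0, 0, 1, 1, 1]] * 5
# Hypothetical hand-picked filter that responds to vertical edges.
kernel = [[-1, 1], [-1, 1]]

feature_map = conv2d(image, kernel)          # 4x4 feature map
pooled = max_pool2d(feature_map)             # 2x2 after pooling
flat = [v for row in pooled for v in row]    # flatten for the dense layer
# Hypothetical weights for two output classes ("edge" vs. "no edge").
scores = fully_connected(flat, [[1, 1, 1, 1], [-1, -1, -1, -1]], [0, 0])
```

In a trained CNN the filter and weight values are learned from data rather than written by hand; only the order of operations (convolve, pool, then connect) carries over from this sketch.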

Concept of CNN (Convolutional Neural Networks)

In a more complex situation with multiple objects in an image, a simple CNN architecture isn’t optimal. For those situations, Mask R-CNN is a state-of-the-art architecture. Region-Based Convolutional Neural Networks (R-CNN) are a family of machine learning models for computer vision and specifically object detection.

To understand how Mask R-CNN works, we first need to understand the concept of image segmentation. Image segmentation is the process of partitioning a digital image into multiple segments (sets of pixels, also known as image objects). This is used to locate objects and boundaries (lines, curves, etc.). Two main types of image segmentation are relevant to Mask R-CNN:

  1. Semantic Segmentation
  2. Instance Segmentation

Semantic Segmentation

Semantic segmentation deals with the classification of each pixel into a fixed set of categories without differentiating object instances. In other words, semantic segmentation deals with identification/classification of similar objects as a single class from the pixel level.

As shown in the example image for semantic segmentation, all objects are classified as a single entity (person). Semantic segmentation is otherwise known as background segmentation because it separates the subjects of the image from the background.

Differences between Semantic Segmentation and Instance Segmentation

Instance Segmentation

Instance Segmentation deals with the correct detection of all objects in an image while also precisely segmenting each instance. It is therefore the combination of object detection, object localization, and object classification. In other words, this type of segmentation goes further to give a clear distinction between each object classified as similar instances.

As shown in the example image above for instance segmentation, all objects are persons, but this segmentation process separates each person as a single entity. Instance segmentation is otherwise known as foreground segmentation because it accentuates the subjects of the image instead of the background.
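The difference between the two output formats can be sketched with a toy label map. Note the assumption: connected-component labeling is used here only as a stand-in to produce per-object ids from a semantic mask. A real instance segmentation model such as Mask R-CNN predicts instances directly and can separate objects that touch, which this stand-in cannot.

```python
from collections import deque


def label_instances(semantic):
    """Give each 4-connected blob of non-background pixels its own id."""
    h, w = len(semantic), len(semantic[0])
    instances = [[0] * w for _ in range(h)]
    next_id = 0
    for r in range(h):
        for c in range(w):
            if semantic[r][c] == 0 or instances[r][c] != 0:
                continue  # background, or already assigned to an instance
            next_id += 1
            instances[r][c] = next_id
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < h and 0 <= nx < w
                            and semantic[ny][nx] == semantic[y][x]
                            and instances[ny][nx] == 0):
                        instances[ny][nx] = next_id
                        queue.append((ny, nx))
    return instances, next_id


# Semantic mask: 0 = background, 1 = person. Both people share the same value.
semantic_mask = [
    [1, 1, 0, 0, 1],
    [1, 1, 0, 0, 1],
    [0, 0, 0, 0, 1],
]

# Instance mask: same pixels, but each person gets a distinct id (1, 2, ...).
instance_mask, num_instances = label_instances(semantic_mask)
```

The semantic mask only says "these pixels are person"; the instance mask additionally says which person each pixel belongs to.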



The following image depicts the concept of region-based CNN (R-CNN). This approach places bounding boxes around candidate object regions and then evaluates a convolutional network independently on each Region of Interest (RoI) to classify the region into one of the proposed classes. The architecture of R-CNN forms the basis of how Mask R-CNN was built. Moreover, this architecture was later improved into what we know as Faster R-CNN.

Concept of R-CNN – Region-based Convolutional Networks


Faster R-CNN

Faster R-CNN is an improved version of the R-CNN architecture with two stages:

  1. Region Proposal Network (RPN). An RPN is a neural network that proposes candidate regions that are likely to contain objects within a particular image.
  2. Fast R-CNN. This extracts features using RoIPool (Region of Interest Pooling) from each candidate box and performs classification and bounding-box regression. RoIPool is an operation for extracting a small feature map from each RoI in detection.
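As a rough illustration of the RoIPool step described in the second stage, the sketch below max-pools an arbitrary RoI on a feature map into a fixed-size grid. The function name `roi_max_pool`, the integer RoI format `(x0, y0, x1, y1)` with exclusive end coordinates, and the 2x2 output size are assumptions made for this example, not the layout used by any particular library.

```python
import math


def roi_max_pool(fmap, roi, out_size=2):
    """Max-pool the RoI region of `fmap` into an out_size x out_size grid.

    `roi` is (x0, y0, x1, y1) in integer feature-map coordinates, with x1 and
    y1 exclusive. Bin edges are quantized with floor/ceil, so bins may overlap
    by one cell when the RoI size is not divisible by `out_size`.
    """
    x0, y0, x1, y1 = roi
    roi_h, roi_w = y1 - y0, x1 - x0
    pooled = []
    for i in range(out_size):
        ys = y0 + math.floor(i * roi_h / out_size)
        ye = y0 + math.ceil((i + 1) * roi_h / out_size)
        row = []
        for j in range(out_size):
            xs = x0 + math.floor(j * roi_w / out_size)
            xe = x0 + math.ceil((j + 1) * roi_w / out_size)
            row.append(max(fmap[y][x]
                           for y in range(ys, ye)
                           for x in range(xs, xe)))
        pooled.append(row)
    return pooled


# Toy 4x6 feature map, computed once for the whole image.
fmap = [
    [1, 2, 3, 4, 5, 6],
    [7, 8, 9, 10, 11, 12],
    [13, 14, 15, 16, 17, 18],
    [19, 20, 21, 22, 23, 24],
]

# Two proposals of different sizes both yield a 2x2 feature map.
pooled_a = roi_max_pool(fmap, (0, 0, 4, 4))  # 4x4 region
pooled_b = roi_max_pool(fmap, (1, 1, 6, 4))  # 5x3 region
```

Because every RoI ends up with the same output shape, the classification and box-regression layers that follow can have a fixed input size regardless of how large each proposal is.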

    Concept of Region proposal Networks (RPN)


Faster R-CNN advances this line of work by learning an attention mechanism: a Region Proposal Network tells the Fast R-CNN detector where to look. The reason why Fast R-CNN is faster than R-CNN is that you don’t have to feed about 2,000 region proposals to the convolutional neural network for every image. Instead, the convolution operation is done only once per image and a feature map is generated from it.


Fast R-CNN

Furthermore, Faster R-CNN is an optimized form of R-CNN in the sense that it is built to improve computation speed (it runs much faster than R-CNN). Let’s see how Faster R-CNN was used to build Mask R-CNN.



Mask R-CNN

Mask R-CNN was built on top of Faster R-CNN. Faster R-CNN has two outputs for each candidate object: a class label and a bounding-box offset.
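To illustrate what a bounding-box offset means in practice, the sketch below applies an offset to a proposal box. It assumes the center/size parameterization commonly used in the R-CNN family (shift the center by a fraction of the box size, scale width and height exponentially); the article does not specify the encoding, and the function name `decode_box` and the sample numbers are hypothetical.

```python
import math


def decode_box(proposal, deltas):
    """Refine a proposal (x0, y0, x1, y1) with offsets (dx, dy, dw, dh).

    Assumed encoding: dx and dy shift the box center by a fraction of the
    box width and height; dw and dh scale width and height via exp().
    """
    x0, y0, x1, y1 = proposal
    dx, dy, dw, dh = deltas
    w, h = x1 - x0, y1 - y0
    cx, cy = x0 + 0.5 * w, y0 + 0.5 * h

    new_cx, new_cy = cx + dx * w, cy + dy * h
    new_w, new_h = w * math.exp(dw), h * math.exp(dh)
    return (new_cx - 0.5 * new_w, new_cy - 0.5 * new_h,
            new_cx + 0.5 * new_w, new_cy + 0.5 * new_h)


# Hypothetical per-class scores for one candidate object.
class_scores = [0.1, 0.7, 0.2]
class_label = max(range(len(class_scores)), key=class_scores.__getitem__)

# Shift the proposal right by 10% of its width and up by 20% of its height.
refined = decode_box((10, 10, 50, 30), (0.1, -0.2, 0.0, 0.0))
```

With zero offsets the proposal is returned unchanged, which is why the network only has to learn a small correction for each candidate box.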

Mask R-CNN – The Mask R-CNN Framework for Instance Segmentation

Mask R-CNN adds a third branch that outputs the object mask. The additional mask output is distinct from the class and box outputs and requires extracting a much finer spatial layout of an object.
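One way to picture the three outputs per candidate object is as a small record with a class part, a box part, and a mask part. The sketch below is an assumption-laden illustration: the `RoIPrediction` type and its field names are hypothetical, and it assumes the mask branch produces one low-resolution probability map per class, from which the map of the predicted class is selected and thresholded at 0.5.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RoIPrediction:
    """Hypothetical container for the three per-RoI outputs."""
    class_scores: List[float]                      # one score per class
    box_offset: Tuple[float, float, float, float]  # (dx, dy, dw, dh)
    mask_probs: List[List[List[float]]]            # [num_classes][m][m]


def binary_mask(pred, threshold=0.5):
    """Pick the mask of the predicted class and binarize it."""
    label = max(range(len(pred.class_scores)),
                key=pred.class_scores.__getitem__)
    mask = [[1 if p >= threshold else 0 for p in row]
            for row in pred.mask_probs[label]]
    return label, mask


# Toy prediction with two classes and 2x2 masks.
pred = RoIPrediction(
    class_scores=[0.2, 0.8],
    box_offset=(0.0, 0.0, 0.0, 0.0),
    mask_probs=[
        [[0.9, 0.9], [0.9, 0.9]],   # mask for class 0 (ignored here)
        [[0.7, 0.1], [0.6, 0.4]],   # mask for class 1 (selected)
    ],
)
label, mask = binary_mask(pred)
```

The class branch decides which mask to use, while each mask itself only answers "object or not" per pixel, so the mask output does not have to compete across classes.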

Mask R-CNN extends Faster R-CNN by adding a branch that predicts an object mask on each Region of Interest (RoI), in parallel with the existing branch for bounding-box recognition. One simple advantage of Mask R-CNN over Faster R-CNN is that it is easy to generalize to other tasks, such as pose estimation.

The key element of Mask R-CNN is pixel-to-pixel alignment, which is the main missing piece of Fast/Faster R-CNN. Mask R-CNN adopts the same two-stage procedure with an identical first stage (the RPN). In the second stage, in parallel to predicting the class and box offset, Mask R-CNN also outputs a binary mask for each RoI. This is in contrast to most recent systems, where classification depends on mask predictions.
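The idea behind pixel-to-pixel alignment can be sketched by comparing two ways of reading a feature map at a fractional coordinate: snapping to the nearest cell versus bilinear interpolation. The Mask R-CNN paper names its alignment layer RoIAlign; the code below is not that layer, only a minimal illustration of the interpolation it relies on, with hypothetical function names and toy values.

```python
import math


def sample_rounded(fmap, y, x):
    """Quantized lookup: snap the coordinate to the nearest cell."""
    return fmap[math.floor(y + 0.5)][math.floor(x + 0.5)]


def sample_bilinear(fmap, y, x):
    """Bilinear interpolation at a fractional (y, x) inside the feature map."""
    y0, x0 = math.floor(y), math.floor(x)
    y1 = min(y0 + 1, len(fmap) - 1)
    x1 = min(x0 + 1, len(fmap[0]) - 1)
    wy, wx = y - y0, x - x0
    top = (1 - wx) * fmap[y0][x0] + wx * fmap[y0][x1]
    bottom = (1 - wx) * fmap[y1][x0] + wx * fmap[y1][x1]
    return (1 - wy) * top + wy * bottom


fmap = [
    [0.0, 10.0],
    [20.0, 30.0],
]

# A sampling point that does not fall on a cell center.
rounded = sample_rounded(fmap, 0.25, 0.75)   # reads cell (0, 1)
aligned = sample_bilinear(fmap, 0.25, 0.75)  # blends all four neighbors
```

Rounding shifts the sampling point by up to half a cell, which matters little for a class label but can visibly misplace a pixel-level mask; interpolation keeps the sampled value tied to the exact location.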

Mask R-CNN is simple to implement and train given the Faster R-CNN framework which facilitates a wide range of flexible architecture designs. Additionally, the mask branch only adds a small computational overhead, enabling a fast system and rapid experimentation.


Projects with Mask R-CNN

Mask R-CNN was used to improve OpenStreetMap by adding baseball, soccer, tennis, football, and basketball fields.

OpenStreetMap Mapping Example with Mask R-CNN

The project applied the Mask R-CNN algorithm to satellite images with the goal of identifying sports fields. Sports fields are a good fit for the Mask R-CNN algorithm: they are visible in satellite images regardless of tree cover (unlike buildings). The method is also efficient because sports fields have a “blob” shape rather than a line shape (unlike streets).

In a similar application, satellite imagery has been used to create maps for use by humanitarian organizations. In this project, Mask R-CNN was used for the mapping task.



Visit this article about image segmentation to explore more use cases and applications of Mask R-CNN and similar algorithms. Popular applications include autonomous vehicles and medical applications, for example, tumor detection or even the detection of features related to the coronavirus.


What’s Next?

If you enjoyed reading this article, I recommend:


  1. Region Proposal Network (RPN) — Backbone of Faster R-CNN – Source
  2. Demystifying Region Proposal Network (RPN) – Source
  3. Mask R-CNN – Source
  4. Introduction to Pooling Layers for Convolutional Neural Networks – Source
  5. Faster R-CNN: Down the rabbit hole of modern object detection – Source
  6. R-CNN, Fast R-CNN, Faster R-CNN, YOLO — Object Detection Algorithms – Source
  7. Mask R-CNN Demo – Source