• Train

          Develop

          Deploy

          Operate

          Data Collection

          Building Blocks​

          Device Enrollment

          Monitoring Dashboards

          Video Annotation​

          Application Editor​

          Device Management

          Remote Maintenance

          Model Training

          Application Library

          Deployment Manager

          Unified Security Center

          AI Model Library

          Configuration Manager

          IoT Edge Gateway

          Privacy-preserving AI

          Ready to get started?

          Overview
          Whitepaper
          Expert Services
  • Why Viso Suite
  • Pricing
Search
Close this search box.

MediaPipe: Google’s Open Source Framework for ML solutions (2024 Guide)

About

Viso Suite is the all-in-one solution for teams to build, deliver, scale computer vision applications.

Contents
Need Computer Vision?

Viso Suite is the world’s only end-to-end computer vision platform. Request a demo.

MediaPipe is a cross-platform pipeline framework to build custom machine learning solutions for live and streaming media. The framework was open-sourced by Google and is currently in the alpha stage.

In this article, we will cover the following:

  • What is MediaPipe?
  • What is a computer vision pipeline?
  • Difference between AI models and Applications
  • MediaPipe Architecture and Concept
  • How to get started

 

About us: viso.ai provides the leading end-to-end Computer Vision Platform Viso Suite. Global organizations use it to develop, deploy and scale all computer vision applications in one place. The solution integrates MediaPipe along with infrastructure to cover the entire AI vision application lifecycle. Get a demo for your organization.

Viso Suite – End-to-End Computer Vision and No-Code for Computer Vision Teams

 

What is MediaPipe?

MediaPipe is an open-source framework for building pipelines to perform computer vision inference over arbitrary sensory data such as video or audio. Using MediaPipe, such a perception pipeline can be built as a graph of modular components.

MediaPipe is currently in alpha at v0.7, and there may still be breaking API changes. Stable APIs are expected by v1.0.

 

Template-based feature matching (KNIFT) with MediaPipe to match US dollar bills
Template-based feature matching (KNIFT) with MediaPipe to match US dollar bills – Source

 

What is a computer vision pipeline?

In computer vision pipelines, those components include model inference, media processing algorithms, data transformations, etc. Sensory data such as video streams enter the graph, and perceived descriptions such as object-localization or face-keypoint streams exit the graph.

 

MediaPipe HandDetection
MediaPipe HandDetection Demo using the Visualizer Tool

 

Who can use MediaPipe?

MediaPipe was built for machine learning (ML) teams and software developers who implement production-ready ML applications, or students and researchers who publish code and prototypes as part of their research work.

 

What is MediaPipe used for?

The MediaPipe framework is mainly used for rapid prototyping of perception pipelines with AI models for inferencing and other reusable components. It also facilitates the deployment of computer vision applications into demos and applications on different hardware platforms.

The configuration language and evaluation tools enable teams to incrementally improve computer vision pipelines.

 

MediaPipe Box Tracking paired with ML inference
MediaPipe Box Tracking paired with ML inference for Object Detection – Source

 

 

What ML solutions can you build in MediaPipe?

C++JSPythonAndroidiOSCoral
Face Detection✔✔✔✔✔✔
Face Mesh✔✔✔✔✔
Iris Detection✔✔✔
Hands Detection✔✔✔✔✔
Pose Detection✔✔✔✔✔
Holistic✔✔✔✔✔
Selfie Segmentation✔✔✔✔✔
Hair Segmentation✔✔
Object Detection✔✔✔✔
Box Tracking✔✔✔
Instant Motion Trackingâś”
Objectron (3D)✔✔✔✔
KNIFTâś”
AutoFlipâś”
MediaSequenceâś”
YouTube 8Mâś”

Most MediaPipe solutions use different TFLite models (TensorFlow Lite) for hand, iris, or face landmarks, palm detection, pose detection, Template-based Feature Matching (KNIFT), or 3D Object Detection (Objectron) tasks.

 

What are the advantages of MediaPipe?

  • Advantage #1 – End-to-end acceleration: Use common hardware to build-in fast ML inference and video processing, including GPU, CPU, or TPU.
  • Advantage #2 – Build once, deploy anywhere: The unified framework is suitable for Android, iOS, desktop, edge, cloud, web, and IoT platforms.
  • Advantage #3 – Ready-to-use solutions: Prebuilt ML solutions demonstrate the full power of the MediaPipe framework.
  • Advantage #4 – Open source and free: The framework is licensed under Apache 2.0, fully extensible, and customizable.

 

How to use MediaPipe

Start Developing with MediaPipe

MediaPipe provides example code and demos for MediaPipe in Python and MediaPipe in JavaScript. The MediaPipe solutions can be run with only a few lines of code.

To further customize the solutions and build your own, use MediaPipe in C++, Android, and iOS. To get started, follow this guide to install MediaPipe and start building example applications in C++ or on mobile. You find the MediaPipe source code in the MediaPipe Github repo.

Can I use MediaPipe on Windows?

Currently, MediaPipe supports Ubuntu Linux, Debian Linux, macOS, iOS, and Android. The MediaPipe framework is based on a C++ library (C++ 11), making it relatively easy to port to additional platforms.

Business Applications – MediaPipe with Viso Suite

The Viso Suite enterprise computer vision platform seamlessly integrates MediaPipe. The Viso platform adds no-code development, infrastructure scaling, edge device management, model training, and more to cover the entire AI vision application lifecycle beyond the features of MediaPipe.

 

AI models vs. Applications

Typically, image or video input data is fetched as separate streams and analyzed using neural networks such as TensorFlow, PyTorch, CNTK, or MXNet. Such models process data in a simple and deterministic method: one input generates one output, which allows very efficient processing execution.

MediaPipe, on the other hand, operates at much higher-level semantics and allows more complex and dynamic behavior. For example, one input can generate zero, one, or multiple outputs, which cannot be modeled with neural networks. Video processing and AI perception require streaming processing compared to batching methods.

OpenCV 4.0 introduced the Graph API to build sequences of OpenCV image processing operations in the form of a graph. In contrast, MediaPipe allows operations on arbitrary data types and has native support for streaming time-series data, making it much more suitable for analyzing audio and sensor data.

 

ML solutions in MediaPipe
ML solutions in MediaPipe – Source

 

The Concept of MediaPipe

MeidaPipe Framework consists of three main elements:

  1. A framework for inference from sensory data (audio or video)
  2. A set of tools for performance evaluation, and
  3. Re-usable components for inference and processing (calculators)

The main components of MediaPipe:

  • Packet: The basic data flow unit is called a “packet.” It consists of a numeric timestamp and a shared pointer to an immutable payload.
  • Graph: Processing takes place inside a graph which defines the flow paths of packets between nodes. A graph can have any number of input and outputs, and branch or merge data.
  • Nodes: Nodes are where the bulk of the graph’s work takes place. They are also called “calculators” (for historical reasons) and produce or consume packets. Each node’s interface defines a number of in- and output ports.
  • Streams: A stream is a connection between two nodes that carries a sequence of packets with increasing timestamps.

There are more advanced components, such as Side packets, Packet ports, Input policies, etc., about which you can read more here. To visualize a graph, copy and paste the graph into the MediaPipe Visualizer.

 

MediaPipe Architecture

MediaPipe allows a developer to prototype a pipeline incrementally. A vision pipeline is defined as a directed graph of components, where each component is a node (“Calculator”). In the graph, the calculators are connected by data “Streams.” Each stream represents a time-series of data “Packets.”

Together, the calculators and streams define a data-flow graph. The packets which flow across the graph are collated by their timestamps within the time-series. Each input stream maintains its own queue to allow the receiving node to consume the packets at its own pace.

The pipeline can be refined incrementally by inserting or replacing calculators anywhere in the graph. Developers can also define custom calculators. While the graph executes calculators in parallel, each calculator executes on at most one thread at a time. This constraint ensures that the custom calculators can be defined without specialized expertise in multithreaded programming.

 

GPU Support

MediaPipe supports GPU computing and rendering nodes and allows to combine multiple GPU nodes and mix them with CPU-based nodes. There are several GPU APIs on mobile platforms (OpenGL ES, Metal, Vulkan).

There is no single cross-API GPU abstraction. Individual nodes can be written using different APIs, allowing them to take advantage of platform-specific features when needed. This enables GPU and CPU nodes to provide advantages of encapsulation and composability while maintaining efficiency.

 

Tracer Module

The MediaPipe tracer module records timing events across the graph to record events with several data fields (time, packet timestamp, data Id, node Id, stream ID). The tracer also reports histograms of various resources, such as the elapsed CPU time across each calculator and stream.

The tracer module records timing information on demand and can be enabled using a configuration setting (in GraphConfig). The user can also completely omit the tracer module code using a compiler flag.

The recorded timing data enables reporting and visualization of individual packet flows and individual calculator executions. The recorded timing data helps diagnose several problems, such as unexpected real-time delays, memory accumulation due to packet buffering, and collating packets at different frame rates.

The aggregated timing data can be used to report average and extreme latencies for performance tuning. Also, the timing data can be explored to identify the nodes along the critical path, whose performance determines end-to-end latency.

 

MediaPipe Visualizer
MediaPipe Visualizer with Timeline view (top) and Graph view (bottom) – Source.
Visualizer Tool

The MediaPipe visualizer is a tool for understanding the topology and overall behavior of the pipelines. It provides a timeline view and a graph view. In the timeline view, the user can load a pre-recorded trace file and see the precise timings of data as it moves through threads and calculators (nodes).

In the graph view, the user can visualize the topology of a graph at any point in time, including the state of each calculator and the packets being processed or being held in its input queues.

 

The Bottom Line

Google MediaPipe provides a framework for cross-platform, customizable machine learning solutions for live and streaming media to deliver live ML anywhere. The powerful tool is suitable for creating computer vision pipelines and complex applications.

MediaPipe is fully integrated into the Viso Suite platform that provides an end-to-end solution for AI vision applications. With Viso Suite, businesses can take full advantage of MediaPipe while covering the entire lifecycle of computer vision in one unified solution, with no-code and automated infrastructure.

If you are looking for an enterprise-grade solution to build customizable ML solutions, request a live demo or speak to our team.

 

Explore other articles about cutting-edge computer vision

Follow us

Related Articles

Join 6,300+ Fellow
AI Enthusiasts

Get expert news and updates straight to your inbox. Subscribe to the Viso Blog.

Sign up to receive news and other stories from viso.ai. Your information will be used in accordance with viso.ai's privacy policy. You may opt out at any time.
Play Video

Join 6,300+ Fellow
AI Enthusiasts

Get expert AI news 2x a month. Subscribe to the most read Computer Vision Blog.

You can unsubscribe anytime. See our privacy policy.

Build any Computer Vision Application, 10x faster

All-in-one Computer Vision Platform for businesses to build, deploy and scale real-world applications.