Robot Navigation with Vision Language Maps


Compared to robotic systems, humans are excellent navigators of the physical world. Physical processes aside, this largely comes down to innate cognitive abilities that most robots still lack:

The ability to localize landmarks at varying ontological levels, such as a “book” being “on a shelf” or “in the living room”
The ability to quickly determine whether there is a navigable path between two points based on the environment layout

Early robotic navigation systems relied on basic line-following systems. These eventually evolved into navigation based on visual perception, provided by cameras or LiDAR, to construct geometric maps. Later on, Simultaneous Localization and Mapping (SLAM) systems were integrated to provide the ability to plan routes through environments.


Multimodal Robot Navigation – Where Are We Now?

More recent attempts to endow robotics with the same capabilities have centered around building geometric maps for path planning and parsing goals from natural language commands. However, this approach struggles to generalize to new or previously unseen instructions, and it struggles even more in environments that change dynamically or are ambiguous in some way.

Learning-based methods take a different route and optimize navigation policies end-to-end, directly from language commands. While this method is not inherently bad, it does require vast amounts of data to train models.

Current Artificial Intelligence (AI) and deep learning models are adept at matching object images to natural language descriptions by leveraging training on internet-scale data. However, this capability does not translate well to mapping the environments that contain those objects.

New research aims to integrate multimodal inputs to enhance robotic navigation in complex environments. Instead of basing route planning on visual input alone, these systems combine visual, audio, and language cues. This creates a richer context and improves situational awareness.

Introducing AVLMaps and VLMaps – A New Paradigm for Robot Navigation?

One potentially groundbreaking area of study in this field relates to so-called VLMaps (Visual Language Maps) and AVLMaps (Audio Visual Language Maps). The recent papers “Visual Language Maps for Robot Navigation” and “Audio Visual Language Maps for Robot Navigation” by Chenguang Huang and colleagues explore the prospect of using these models for robotic navigation in great detail.

VLMaps directly fuses visual-language features from pre-trained models with 3D reconstructions of the physical environment. This enables precise spatial localization of navigation goals anchored in natural language commands. It can also localize landmarks and spatial references for landmarks.

The main advantage is that this allows for zero-shot spatial goal navigation without additional data collection or finetuning.

An image showing a self-navigating robot in an office environment. A text box shows the natural language instructions that the robot is meant to parse, "First move to the plant, then 3 meters south, then go between the keyboard and the bowl." The navigable path planned by the robot according to the instructions is shown using lines and dots. — The image shows an example of a robotic navigation system using VLMaps parsing a natural language instruction. Note that it localizes objects (“the plant,” “keyboard,” “the bowl”) as well as spatial references (“3 meters south,” “between the keyboard and the bowl”). – Source

This approach allows for more accurate execution of complex navigational tasks and the sharing of these maps with different robotic systems.

AVLMaps are based on the same approach but also incorporate audio cues to construct a 3D voxel grid using pre-trained multimodal models. This makes zero-shot multimodal goal navigation possible by indexing landmarks using textual, image, and audio inputs. For example, this would allow a robot to carry out a navigation goal such as “go to the table where the beeping sound is coming from.”

Image showing a laboratory simulating a scenario in which there are multiple audio inputs from the environment, such as a door knocking, glass breaking, and baby crying. The robot is given the instruction "go in between the sound of the glass breaking and this image." The image shows an object within the lab. — Laboratory example of an environment where audio signals may be helpful in disambiguating goals. By processing sound input from its surroundings, a robot will create new navigation goals based on instructions. – Source

Audio input can enrich the system’s world perception and help disambiguate goals in environments with multiple potential targets.

VLMaps: Integrating Visual-Language Features with Spatial Mapping

Related work in AI and computer vision has played a pivotal role in developing VLMaps. For instance, the maturation of SLAM techniques has greatly advanced the ability to translate semantic information into 3D maps. Traditional approaches relied either on densely annotated 3D volumetric maps built with 2D semantic segmentation Convolutional Neural Networks (CNNs) or on object-oriented methods for building 3D maps.

While progress has been made in generalizing these models, they remain heavily constrained because they operate on a predefined set of semantic classes. VLMaps overcomes this limitation by creating open-vocabulary semantic maps that allow natural language indexing.

Improvements in Vision and Language Navigation (VLN) have also led to the ability to learn end-to-end policies that follow route-based instructions on topological graphs of simulated environments. However, until now, their real-world applicability has been limited by a reliance on topological graphs and a lack of low-level planning capabilities. Another downside is the need for huge data sets for training.

For VLMaps, the researchers were influenced by systems built on pre-trained language and vision models, such as LM-Nav and CoW (CLIP on Wheels). The latter performs zero-shot language-based object navigation by leveraging CLIP-based saliency maps. While these systems can navigate to objects, they struggle with spatial queries, such as “to the left of the chair” and “in between the TV and the sofa.”

VLMaps extend these capabilities by supporting open-vocabulary obstacle maps and complex spatial language indexing. This allows navigation systems to build queryable scene representations for LLM-based robot planning.

An image showing examples of zero-shot navigation using VLMaps. From left to right, the robot is shown executing instructions, such as "navigate to the plant," "navigate to the keyboard and the laptop twice," etc. — VLMaps enables a robot to perform complex zero-shot spatial goal navigation tasks given natural language commands without additional data collection or model finetuning. – Source

Key Components of VLMaps

Several key components in the development of VLMaps allow for building a spatial map representation that localizes landmarks and spatial references based on natural language.

Building a Visual-Language Map

VLMaps uses a video feed from robots combined with standard exploration algorithms to build a visual-language map. The process involves:

Visual Feature Extraction: Using models like CLIP to extract visual-language features from image observations.
3D Reconstruction: Combining these features with 3D spatial data to create a comprehensive map.
Indexing: Enabling the map to support natural language queries, allowing for indexing and localization of landmarks.

Mathematically, if $V$ represents the visual features and $L$ represents the language features, their fusion can be written as $M = f(V, L)$, where $M$ is the resulting visual-language map.
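To make the map-building steps above more concrete, here is a minimal sketch of how per-pixel features might be accumulated into a top-down grid. It assumes the depth pixels have already been back-projected into world coordinates and that each pixel already carries a visual-language feature vector. The function name `build_feature_map`, the grid layout, and the choice to average features per cell are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def build_feature_map(points_world, pixel_features, grid_size, cell_size):
    """Average per-pixel features into a top-down grid (illustrative sketch).

    points_world:   (N, 3) back-projected 3D points (x, y, z) in meters.
    pixel_features: (N, D) visual-language feature for each point.
    Returns a (grid_size, grid_size, D) map; cells with no points stay zero.
    """
    dim = pixel_features.shape[1]
    feature_sum = np.zeros((grid_size, grid_size, dim))
    counts = np.zeros((grid_size, grid_size))

    # Map x/y coordinates to integer cell indices, with the origin at the grid center.
    cols = np.floor(points_world[:, 0] / cell_size).astype(int) + grid_size // 2
    rows = np.floor(points_world[:, 1] / cell_size).astype(int) + grid_size // 2
    inside = (rows >= 0) & (rows < grid_size) & (cols >= 0) & (cols < grid_size)

    # np.add.at accumulates correctly when several points land in the same cell.
    np.add.at(feature_sum, (rows[inside], cols[inside]), pixel_features[inside])
    np.add.at(counts, (rows[inside], cols[inside]), 1)

    # Assumption: features that fall into the same cell are simply averaged.
    occupied = counts > 0
    feature_sum[occupied] /= counts[occupied][:, None]
    return feature_sum
```

Points outside the grid are dropped rather than clipped, so the map only describes the area it actually covers.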

Diagram of the VLMap architecture. The left diagram shows the creation process of the VLMap while the right shows the mechanism for indexing landmarks. — Diagram of the VLMap architecture from the original research paper. – Source

Localizing Open-Vocabulary Landmarks

To localize landmarks in VLMaps using natural language, an input language list is defined with representations for each category in text form. Examples include [“chair”, “sofa”, “table”] or [“furniture”, “floor”]. This list is converted into vector embeddings using the pre-trained CLIP text encoder.

The map embeddings are then flattened into matrix form. The pixel-to-category similarity matrix is computed, with each element indicating the similarity value. Applying the argmax operator and reshaping the result gives the final segmentation map, which identifies the most related language-based category for each pixel.
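The flatten, similarity, argmax, and reshape steps described above can be sketched in a few lines of NumPy. This is a hedged illustration: the function name `segment_map` is hypothetical, the embeddings are assumed to come from some CLIP-style encoder that is not shown here, and cosine similarity is assumed as the similarity measure.

```python
import numpy as np

def segment_map(map_embeddings, text_embeddings):
    """Label every map pixel with its most similar language category.

    map_embeddings:  (H, W, D) feature map.
    text_embeddings: (C, D) one embedding per category in the language list.
    Returns an (H, W) integer array of category indices.
    """
    height, width, dim = map_embeddings.shape

    # Flatten the map into a (H*W, D) matrix, one row per pixel.
    flat = map_embeddings.reshape(height * width, dim)

    # Assumption: normalize both sides so the dot product is a cosine similarity.
    flat = flat / np.maximum(np.linalg.norm(flat, axis=1, keepdims=True), 1e-8)
    text = text_embeddings / np.maximum(
        np.linalg.norm(text_embeddings, axis=1, keepdims=True), 1e-8
    )

    # Pixel-to-category similarity matrix of shape (H*W, C).
    similarity = flat @ text.T

    # The best category per pixel, reshaped back into the map layout.
    return similarity.argmax(axis=1).reshape(height, width)
```

With a language list such as ["chair", "sofa", "table"], a value of 1 at a given pixel would mean that "sofa" is the closest match for that location.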

Generating Open-Vocabulary Obstacle Maps

Using a Large Language Model (LLM), VLMaps interprets commands and breaks them into subgoals, allowing for specific directives like “in between the sofa and the TV” or “three meters east of the chair.”

The LLM generates executable Python code for robots, translating high-level instructions into parameterized navigation tasks. For example, commands such as “move to the left side of the counter” or “move between the sink and the oven” are converted into precise navigation actions using predefined functions.

Image showing lines of code generated by the LLM from natural language instructions. The original natural language instructions are shown as comments. An example of the navigational goals generated in high-level code is "robot.move_in_between(‘sink’, ‘oven’)" — Example of the high-level code generated by the LLM from natural language commands (shown in comments). – Source
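As a rough idea of what a predefined function behind a call like `robot.move_in_between('sink', 'oven')` might compute, the sketch below derives a goal cell halfway between two landmarks on a segmentation map. Everything here is an assumption for illustration: the helper names `landmark_centroid` and `goal_in_between` are hypothetical, and using the centroid of all cells of a category ignores the case where several instances of the same object exist.

```python
import numpy as np

def landmark_centroid(segmentation, category_index):
    """Mean (row, col) of all map cells labeled with the given category."""
    rows, cols = np.nonzero(segmentation == category_index)
    if rows.size == 0:
        raise ValueError("category not found in map")
    return float(rows.mean()), float(cols.mean())

def goal_in_between(segmentation, categories, name_a, name_b):
    """Goal position halfway between the centroids of two named landmarks."""
    row_a, col_a = landmark_centroid(segmentation, categories.index(name_a))
    row_b, col_b = landmark_centroid(segmentation, categories.index(name_b))
    return (row_a + row_b) / 2, (col_a + col_b) / 2
```

A planner would still need to check that the resulting cell is free of obstacles before driving to it.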

AVLMaps: Enhancing Navigation with Audio, Visual, and Language Cues

AVLMaps largely builds on the same approach used in developing VLMaps but extends it with multimodal capabilities to process auditory input as well. In AVLMaps, objects can be directly localized from natural language instructions using both visual and audio cues.

For testing, the robot was also provided with an RGB-D video stream and odometry information, but this time with an audio track included.

AVLMaps’ architecture shows how visual, visual-language, and audio localization features are derived from raw data: in this case, an RGB-D video, audio stream, and odometry readings. – Source

Module Types

In AVLMaps, the system uses four modules to build a multimodal features database. They are:

Visual Localization Module: Localizes a query image in the map using a hierarchical scheme, computing both local and global descriptors in the RGB stream.
Object Localization Module: Uses open-vocabulary segmentation (OpenSeg) to generate pixel-level features from the RGB image, associating them with back-projected depth pixels in 3D reconstruction. It computes cosine similarity scores for all point and language features, selecting top-scoring points in the map for indexing.
Area Localization Module: The paper proposes a sparse topological CLIP features map to identify coarse visual concepts, like “kitchen area.” Also, using cosine similarity scores, the model calculates confidence scores for predicting locations.
Audio Localization Module: Partitions an audio clip from the stream into segments using silence detection. Then, it computes audio-lingual features for each using AudioCLIP to come up with matching scores for predicting locations based on odometry information.
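The silence-detection step in the Audio Localization Module can be illustrated with a simple energy threshold. This is a minimal sketch under stated assumptions: the function name `split_on_silence`, the fixed frame length, and the RMS threshold are all hypothetical choices, and the subsequent AudioCLIP feature extraction is not shown.

```python
import numpy as np

def split_on_silence(samples, frame_length, threshold):
    """Return (start, end) sample index pairs for the non-silent segments.

    A frame counts as silent when its RMS energy is at or below the threshold.
    Consecutive non-silent frames are merged into one segment.
    """
    n_frames = len(samples) // frame_length
    frames = np.asarray(samples[: n_frames * frame_length], dtype=float)
    frames = frames.reshape(n_frames, frame_length)
    loud = np.sqrt((frames ** 2).mean(axis=1)) > threshold

    segments = []
    start = None
    for index, is_loud in enumerate(loud):
        if is_loud and start is None:
            start = index  # a new segment begins
        elif not is_loud and start is not None:
            segments.append((start * frame_length, index * frame_length))
            start = None
    if start is not None:
        # The audio ended while a segment was still open.
        segments.append((start * frame_length, n_frames * frame_length))
    return segments
```

Each returned segment could then be embedded separately and tied to the robot's odometry reading at that point in the stream.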

The key differentiator of AVLMaps is its ability to disambiguate goals by cross-referencing visual and audio features. In the paper, this is achieved by creating heatmaps with probabilities for each voxel position based on the distance to the target. The model multiplies the results from heatmaps for different modalities to predict the target with the highest probabilities.
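The cross-modal step described above, where per-modality heatmaps are multiplied together, can be sketched on a 2D grid as follows. The function names are hypothetical and the exponential decay with distance is an assumption made for illustration; the paper may use a different decay function and operates on voxel positions rather than a flat grid.

```python
import numpy as np

def distance_heatmap(shape, targets, decay):
    """Heatmap that is 1.0 at each predicted target and decays with distance."""
    rows, cols = np.indices(shape)
    heat = np.zeros(shape)
    for target_row, target_col in targets:
        distance = np.hypot(rows - target_row, cols - target_col)
        # Assumption: exponential falloff; keep the strongest response per cell.
        heat = np.maximum(heat, np.exp(-distance / decay))
    return heat

def fuse_heatmaps(heatmaps):
    """Element-wise product of the heatmaps from the different modalities."""
    return np.prod(np.stack(heatmaps), axis=0)

def pick_target(candidates, fused):
    """Choose the candidate cell with the highest fused score."""
    return max(candidates, key=lambda cell: fused[cell])
```

For an instruction like “go to the table where the beeping sound is coming from,” the object heatmap would peak at every table, the sound heatmap would peak near the beeping, and the product would favor the table closest to the sound.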

Image showing the cross-modal reasoning process for AVLMaps. To the left, there are four separate heatmaps of an environment for sound GT, object GT, sound prediction, and object prediction. AVLMaps combines the sound and object prediction maps to produce a multimodal heatmap with combined probability scores for navigation goals. — Example of multimodal heatmaps created by AVLMaps to localize navigation goals with cross-modal probability scores. – Source

VLMaps and AVLMaps vs. Other Methods for Robot Navigation

Experimental results show the promise of utilizing techniques like VLMaps for robotic navigation. Looking at the object maps that various models generated for the object type “chair,” for example, it’s clear that VLMaps is more discerning in its predictions.

Image comparing the results of mapping the object type "chair" in a simulated environment. The top-down view is compared with the mappings of the "ground truth," CLIP, CoW, and VLMaps. — Resulting object maps for the object type “chair” created by different models for the same environment. GT represents the “ground truth” mapping in this scenario. – Source

In multi-object navigation, VLMaps significantly outperformed conventional models. This is largely because VLMaps generates fewer false positives than the other methods.

Table comparing the results, in percentages, of VLMaps against other goal navigation models. VLMaps achieved the highest success rates across all categories, achieving 59%, 34%, 22%, and 15% success rates for navigating to 1, 2, 3, and 4 subgoals in a row, respectively. — Results, in %, comparing VLMaps to other models in a multi-object navigational task with multiple sub-goals. – Source

VLMaps also achieves much higher zero-shot spatial goal navigation success rates than the other open-vocabulary zero-shot navigation baseline alternatives.

Table comparing the results, in percentages, of VLMaps against other goal navigation models for zero-shot navigation tasks. VLMaps achieved the highest success rates across all categories, achieving 62%, 33%, 14%, and 10% success rates for navigating to 1, 2, 3, and 4 subgoals in a row, respectively. — Results, in %, of VLMaps’ experimental test runs compared with other goal navigation models for zero-shot navigation tasks. – Source

Another area where VLMaps shows promising results is in cross-embodiment navigation to optimize route planning. In this case, VLMaps generated different obstacle maps for two robot embodiments: a ground-based LoCoBot and a flying drone. When provided with a drone-specific map, the drone significantly improved its performance by planning routes that fly over obstacles. This shows VLMaps’ efficiency at both 2D and 3D spatial navigation.

Image grid showing the navigation capabilities of VLMaps integrated with a ground-based LoCoBot and flying drone. Both are given the instruction "Navigate to the laptop." While the LoCoBot created a navigation map with a route to go around objects, the drone could take a direct path flying over them. — Given the same natural language instruction, a ground-based robot and a flying drone will come up with different navigation routes. – Source
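The idea of deriving a different obstacle map for each embodiment from the same underlying map can be sketched as a simple filter over category labels. This is a hedged illustration: the function name `obstacle_map` and the category lists are hypothetical, and a real system would also need to account for obstacle height and robot footprint.

```python
import numpy as np

def obstacle_map(segmentation, categories, obstacle_names):
    """Boolean grid that is True where a cell blocks the given embodiment.

    segmentation:   (H, W) integer array of category indices.
    categories:     list of category names; the index matches the segmentation.
    obstacle_names: categories this particular robot cannot pass through.
    """
    blocked = [categories.index(name) for name in obstacle_names if name in categories]
    return np.isin(segmentation, blocked)

# Illustrative category lists: a ground robot is blocked by furniture,
# while a drone that flies over furniture is only blocked by walls.
GROUND_ROBOT_OBSTACLES = ["table", "chair", "wall"]
DRONE_OBSTACLES = ["wall"]
```

Because both maps come from the same segmentation, only the list of obstacle categories has to change when the map is shared with a different robot.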

Similarly, during testing, AVLMaps outperformed VLMaps paired with either standard AudioCLIP or wav2clip in solving ambiguous goal navigation tasks. For the experiment, robots were made to navigate to one sound goal and one object goal in sequence.

Table showing the results of AVLMaps' multimodal ambiguous goal navigation. AVLMaps achieved 46.2% success rate for sound goals and 55.5% success rate for object goals. — Results, in %, of AVLMaps’ multimodal ambiguous goal navigation experiment. – Source

What’s Next for Robotic Navigation?

While models like VLMaps and AVLMaps show potential, there is still a long way to go. To mimic the navigational capabilities of humans and be useful in more real-life situations, we need systems with even higher success rates in carrying out complex, multi-goal navigational tasks.

Furthermore, these experiments used basic, drone-like robotics. We have yet to see how these advanced navigational models can be combined with more human-like systems.

Another active area of research is event-based SLAM. Rather than processing individual frames, these systems capture changes in lighting and other characteristics to identify environmental events, which they can then use to disambiguate goals or open up new navigational opportunities.

As these methods evolve, we can expect increased adoption in fields like autonomous vehicles, nanorobotics, agriculture, and even robotic surgery.

To learn more about the world of AI and computer vision, check out the viso.ai blog:

GoogLeNet Explained: The Inception Model that Won ImageNet
Object Tracking: Technical Guide and Use Cases
ImageNet Dataset: Evolution & Applications
Facial Recognition: An Easy-to-Understand Overview
