Monocular depth estimation is a computer vision task in which an AI model predicts the depth information of a scene from a single image, estimating the distance of objects from one camera viewpoint. It has many applications and is widely used in autonomous driving, robotics, and more. Depth estimation is considered one of the hardest computer vision tasks, as it requires the model to understand complex relationships between objects and their depth. Many factors come into play when estimating the depth of a scene: lighting conditions, occlusion, and texture can all greatly affect the results.
We will explore monocular depth estimation to understand how it works, where it’s used, and how to implement it with Python tutorials. So, let’s get started.
Understanding Monocular Depth Estimation
Depth estimation is a crucial step towards understanding scene geometry from 2D images. The goal of monocular depth estimation is to predict a depth value for each pixel, i.e., to infer depth information using only one RGB input image. Depth estimation techniques analyze visual cues such as perspective, shading, and texture to estimate the relative distances of objects in an image. The output of a depth estimation model is typically a depth map.
To train AI models on depth maps, we first have to generate depth maps. Depth estimation is a task that helps machines see the world in 3D, just like we do, giving an accurate sense of distances to the surroundings. A few common technologies are used to generate depth maps with cameras. For example, Time-of-Flight and Light Detection and Ranging (LiDAR) are popular depth-sensing technologies engineers use in fields like robotics, industrial automation, and autonomous vehicles. Next, let’s explain these important computer vision (CV) technologies.
How Does Depth Estimation Work?
Within the world of depth-sensing technologies, there is no single solution for every application; in some cases, engineers may even combine methods to achieve the desired results. A robot or an autonomous vehicle can use cameras and sensors with embedded software to sense depth information using a few popular methods. These methods usually emit a signal, which can be anything from light or sound to particles, and then apply algorithms to the measured time of flight to extract distance information.
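As a simple illustration of the time-of-flight principle, the distance follows directly from how long the emitted signal takes to bounce back. The sketch below assumes a light-based sensor, and the round-trip time used in the example is an illustrative value only.

# Minimal sketch of the time-of-flight principle for a light-based sensor.
# The pulse travels to the object and back, so the one-way distance is half
# the round trip.
SPEED_OF_LIGHT = 299_792_458  # meters per second

def tof_distance(round_trip_time_s: float) -> float:
    return SPEED_OF_LIGHT * round_trip_time_s / 2

print(tof_distance(20e-9))  # a ~20 ns round trip corresponds to roughly 3 meters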
A good example is stereo depth estimation. Unlike monocular depth estimation, it works by using two cameras with sensors capturing images in parallel. This resembles human binocular vision, because engineers set the two cameras a few centimeters apart. The embedded software detects matching features in the two images. Since each image shows the detected features at a different offset, the software uses this offset to calculate the depth of each point through a method called triangulation.
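To make the triangulation step concrete, here is a minimal sketch of the standard pinhole-stereo relationship, where depth is the focal length times the camera baseline divided by the disparity (the pixel offset between matched features). The focal length, baseline, and disparity values below are illustrative assumptions, not calibration data.

# Depth from disparity for a rectified stereo pair: Z = f * B / d
def depth_from_disparity(focal_length_px: float, baseline_m: float, disparity_px: float) -> float:
    return focal_length_px * baseline_m / disparity_px

# Example: 700 px focal length, 6 cm baseline, 20 px disparity -> ~2.1 m
print(depth_from_disparity(focal_length_px=700, baseline_m=0.06, disparity_px=20))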
Most stereo depth cameras use active sensing with a patterned light projector that casts a pattern on surfaces, which helps identify flat or textureless objects. These cameras typically use near-infrared (NIR) sensors, enabling them to detect both the projected infrared pattern and visible light. Other techniques like LiDAR use light in the form of a laser that turns on and off rapidly to measure distances, from which software can calculate depth measurements. This is often used to create 3D maps of places, from caves and historical sites to almost any surface on Earth. Monocular depth estimation, on the other hand, relies on a single image to predict the depth map, using AI techniques to make accurate predictions. Let’s look at the different AI techniques used in monocular depth estimation.
AI Techniques In Monocular Depth Estimation
While stereo depth estimation techniques are useful in some scenarios, advancements in artificial intelligence have opened the door to new depth estimation use cases, such as monocular depth estimation. With the power of machine learning, engineers can train models to predict depth maps from a single image. This, in turn, has led to advancements in fields like autonomous driving and augmented reality. The main advantage is that no specialized equipment is needed to sense depth information. In this section, we will explore the AI techniques used for monocular depth estimation.
Supervised Learning for Monocular Depth Estimation
Since their invention, artificial neural networks (ANNs) have been instrumental in solving problems like monocular depth estimation. There are multiple ways a neural network can be trained, and one of them is supervised learning. In supervised learning, the model is trained on labeled data, so the network can learn the relationships between images and their depth maps and make predictions based on those learned relationships. Researchers widely use convolutional neural networks (CNNs), which can learn an implicit relation between color pixels and depth. Combined with post-processing and deep-learning approaches, CNNs are now the most widely used backbone for depth estimation models.
Since building and training those CNNs from scratch is difficult, researchers usually start from a pre-trained model and apply the important concept of transfer learning: a model trained on a general dataset is adapted to a more specific use case. Popular architectures that researchers use as encoder backbones in U-Net-style monocular depth estimation models include the following.
- ResNet
- DenseNet
A more modern alternative is the Vision Transformer (ViT), introduced as an alternative to CNNs in computer vision applications. ViTs use self-attention blocks, giving them higher capacity and resulting in superior performance. These models mostly rely on an encoder-decoder architecture that can be adapted into different variations for different use cases. Compared to CNN-based architectures, ViT-based ones have shown more than a 28% performance increase in depth estimation.
While these methods work great with supervised learning, they rely heavily on large labeled datasets, which are expensive and time-consuming to collect and can contain biases. Next, let’s explore other training methods.
Unsupervised and Self-Supervised Learning for Monocular Depth Estimation
Most monocular depth estimation approaches treat depth prediction as a supervised regression problem and as a result, require vast quantities of ground-truth depth data for training. However, other unsupervised and self-supervised approaches achieve great results in depth prediction with easier-to-obtain binocular stereo footage. Researchers can leverage stereo-image pairs during the model’s training to allow neural networks to learn implicit relations between the pairs.
The core idea is to train the network to reconstruct one image of the stereo pair from the other. By learning to do this, the network implicitly learns about the depth of the scene. Combined with other techniques like left-right consistency, unsupervised approaches can reach state-of-the-art performance. Left-right consistency has the network predict disparity maps for both the left and right images, and the training loss encourages these disparity maps to be consistent with each other.
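As a rough illustration of left-right consistency, the sketch below warps the predicted right-view disparity map into the left view using the left disparities and penalizes any mismatch with an L1 term. It is a simplified NumPy version of the idea (nearest-neighbour sampling, and the sign of the horizontal shift depends on the stereo convention), not the exact loss from any particular paper.

import numpy as np

def warp_horizontal(src, disparity):
    """Sample `src` at columns shifted by `disparity` (in pixels), nearest-neighbour."""
    h, w = disparity.shape
    rows = np.arange(h)[:, None].repeat(w, axis=1)
    cols = np.arange(w)[None, :].repeat(h, axis=0)
    sample_cols = np.clip(np.round(cols - disparity).astype(int), 0, w - 1)
    return src[rows, sample_cols]

def left_right_consistency_loss(disp_left, disp_right):
    """L1 gap between the left disparities and the right disparities warped into the left view."""
    disp_right_in_left = warp_horizontal(disp_right, disp_left)
    return np.mean(np.abs(disp_left - disp_right_in_left))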
Self-supervised learning is another approach researchers have taken for depth estimation. One popular study uses video sequences to train the neural network: using pose estimation, the network learns the motion between frame A and frame B, tries to reconstruct frame B from frame A, compares the reconstruction to the real frame, and minimizes the error. Furthermore, the researchers of this study used a few techniques to improve performance.
Those techniques include auto-masking, where objects that remain stationary across frames are masked so they do not confuse the model, and full-resolution multi-scale sampling to improve quality and accuracy.
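To give a flavor of how the frame-reconstruction error can be measured, the sketch below combines an SSIM term and an L1 term into a photometric loss, similar in spirit to the losses used in self-supervised depth papers. The 0.85 weighting and the use of a per-image SSIM are simplifying assumptions for illustration.

import tensorflow as tf

def photometric_loss(target, reconstructed, alpha=0.85):
    """Weighted mix of an SSIM term and an L1 term between the target frame
    and the frame reconstructed by warping a neighbouring view."""
    ssim = tf.image.ssim(target, reconstructed, max_val=1.0)  # one value per image
    ssim_term = tf.reduce_mean((1.0 - ssim) / 2.0)
    l1_term = tf.reduce_mean(tf.abs(target - reconstructed))
    return alpha * ssim_term + (1.0 - alpha) * l1_term

That being said, depth estimation approaches are constantly evolving, and researchers are finding new ways to make accurate depth maps from single images. So, next, let’s get into a step-by-step tutorial on using a depth estimation model.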
Step-by-Step Tutorial: Using a Depth Estimation Model
Now that we have explored the theoretical concepts of monocular depth estimation, let’s roll up our sleeves for a practical implementation in Python. In this tutorial, we will go through the process of building and using a depth estimation model. We will use the Keras framework with TensorFlow, building on the official Keras example for monocular depth estimation. Some prior knowledge of Python and machine learning concepts will be beneficial for this section. We’ll adapt and improve the code from that tutorial and structure it as follows.
- Setup and Data Preparation
- Building the Data Pipeline
- Building the Model and Defining the Loss
- Model Training and Inference
So, let’s start with the setup and data preparation for this tutorial.
Setup and Data Preparation
For this tutorial, we will use Kaggle as our environment and the Dense Indoor and Outdoor Depth (DIODE) Dataset to train our model. So, let’s start by preparing our environment and importing the needed libraries. I created a new Kaggle notebook and enabled GPU acceleration.
import os

os.environ["KERAS_BACKEND"] = "tensorflow"

import sys
import tensorflow as tf
import keras
from keras import layers
from keras import ops
import pandas as pd
import numpy as np
import cv2
import matplotlib.pyplot as plt

keras.utils.set_random_seed(123)
Those imports give us all the libraries we need for the monocular depth estimation model: os, sys, OpenCV (cv2), TensorFlow, Keras, NumPy, Pandas, and Matplotlib. Keras and TensorFlow provide the deep learning backend, os and sys help with data loading, cv2 handles image processing, and NumPy and Pandas move the data between loading and processing.
Next, let’s download the data. As mentioned previously, we will use the DIODE dataset, but only its validation split, because the full dataset is over 80 GB, which is too large for our purpose. The validation data is about 2.6 GB, which is easier to handle, so we will use that.
annotation_folder = "/kaggle/working/dataset/"
if not os.path.exists(annotation_folder):
    annotation_zip = keras.utils.get_file(
        "val.tar.gz",
        cache_subdir=os.path.abspath(annotation_folder),  # Extract to /kaggle/working/dataset/
        origin="http://diode-dataset.s3.amazonaws.com/val.tar.gz",
        extract=True,
    )
This code downloads the validation set of the DIODE dataset to the /kaggle/working folder and extracts it into a folder named dataset there. Now that we have the dataset in our Kaggle workspace, let’s prepare and process the data so it is suitable for training our model.
df_list = []  # To store both indoor and outdoor scenes

for scene_type in ["indoors", "outdoor"]:
    path = os.path.join("/kaggle/working/dataset/val", scene_type)
    filelist = []
    for root, dirs, files in os.walk(path):
        for file in files:
            filelist.append(os.path.join(root, file))
    filelist.sort()
    data = {
        "image": [x for x in filelist if x.endswith(".png")],
        "depth": [x for x in filelist if x.endswith("_depth.npy")],
        "mask": [x for x in filelist if x.endswith("_depth_mask.npy")],
    }
    df = pd.DataFrame(data)
    df = df.sample(frac=1, random_state=42)
    df_list.append(df)  # Append the dataframe to the list

# Concatenate the dataframes
df = pd.concat(df_list, ignore_index=True)

# Check if the paths are correct
print(df.iloc[0]['image'])
print(df.iloc[0]['depth'])
print(df.iloc[0]['mask'])
Don’t be intimidated by the code: it simply walks through the files we downloaded and collects the file paths into a Pandas data frame. Since we use both indoor and outdoor images from the dataset, the outer loop goes through each scene folder and the nested loops walk its files; we then put the “.png” image files in one column, the depth files in another, and the masks in a third.
Building the Data Pipeline
For monocular depth estimation, we use the depth values and the mask to generate a depth map that we will use, alongside the original images, to train the model. We will build a pipeline function that essentially does the following.
- Reads a Pandas data frame row with paths for the RGB image, the depth file, and the depth mask file.
- Loads and resizes the RGB image.
- Reads the depth and depth mask files, processes them to generate the depth map image, and resizes it.
- Returns the RGB image and the depth map image for each sample.
Typically in machine learning, data pipelines are built as classes, which makes it easy to reuse the pipeline as many times as needed. In this tutorial, we will build it as a function that uses some popular data processing methods to help us train our model efficiently.
def load_and_preprocess_data(df_row, img_size=(256, 256)):
    """Loads and preprocesses an image and its depth map from a DataFrame row."""
    img_path = df_row['image']
    depth_path = df_row['depth']
    mask_path = df_row['mask']

    # Load and resize the RGB image
    img = cv2.imread(img_path)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    img = cv2.resize(img, img_size)
    img = tf.image.convert_image_dtype(img, tf.float32)

    # Load the depth values and the validity mask
    depth_map = np.load(depth_path).squeeze()
    mask = np.load(mask_path)
    mask = mask > 0

    # Clip the depth values and move them to log space
    max_depth = min(300, np.percentile(depth_map, 99))
    depth_map = np.clip(depth_map, 0.1, max_depth)
    depth_map = np.log(depth_map, where=mask)
    print("Min/Max depth before preprocessing:", np.min(depth_map), np.max(depth_map))

    # Mask out invalid pixels, clip again after masking, then resize
    depth_map = np.ma.masked_where(~mask, depth_map)
    depth_map = np.clip(depth_map, 0.1, np.log(max_depth))
    depth_map = cv2.resize(depth_map, img_size)
    depth_map = np.expand_dims(depth_map, axis=-1)
    depth_map = tf.image.convert_image_dtype(depth_map, tf.float32)
    print("Min/Max depth after preprocessing:", np.min(depth_map), np.max(depth_map))

    return img, depth_map
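If you prefer a streaming pipeline instead of loading everything into memory at once (as we do later in this tutorial), the per-row function above can be wrapped into a tf.data.Dataset. This is an optional sketch; it assumes the depth maps come out at the 256×256×1 shape produced by the function, and the batch size of 8 is just an example.

def row_generator():
    # Yield one (image, depth_map) pair per DataFrame row
    for _, row in df.iterrows():
        yield load_and_preprocess_data(row)

dataset = (
    tf.data.Dataset.from_generator(
        row_generator,
        output_signature=(
            tf.TensorSpec(shape=(256, 256, 3), dtype=tf.float32),
            tf.TensorSpec(shape=(256, 256, 1), dtype=tf.float32),
        ),
    )
    .batch(8)
    .prefetch(tf.data.AUTOTUNE)
)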
Now let’s visualize some of our data.
import matplotlib.pyplot as plt

def visualize_data(image, depth_map, mask):
    """Visualizes the image and its depth map, with and without the mask applied."""
    # Apply the mask to the depth map
    masked_depth = depth_map * mask

    fig, axes = plt.subplots(1, 3, figsize=(15, 5))
    axes[0].imshow(image)
    axes[0].set_title('Image')
    # Use the plt.cm.jet colormap for the depth maps (squeeze drops the channel dimension)
    axes[1].imshow(np.squeeze(depth_map), cmap=plt.cm.jet)
    axes[1].set_title('Raw Depth Map')
    axes[2].imshow(np.squeeze(masked_depth), cmap=plt.cm.jet)
    axes[2].set_title('Masked Depth Map')
    plt.savefig("visualization_example.jpg")
    plt.show()

# Example usage
for i in range(3):
    img, depth_map = load_and_preprocess_data(df.iloc[i])
    # Load the mask and resize it to match the image
    mask_path = df.iloc[i]['mask']
    mask = np.load(mask_path)
    mask = cv2.resize(mask, (img.shape[1], img.shape[0]))
    mask = np.expand_dims(mask, axis=-1)  # Add channel dimension
    visualize_data(img, depth_map, mask)
Building the Model and Defining the Loss
Now we have reached what may be the trickiest part of this tutorial, but it is interesting, so keep up. For this tutorial, the model architecture will be as follows.
- ResNet50 Encoder as a Backbone
- 5 Decoder Layers
- An Output layer
A possible improvement could be to add a bottleneck layer and to optimize the encoder/decoder layers. This architecture is simple and allows us to achieve decent results. Let’s get to the code.
from keras.applications import ResNet50
from keras.layers import Input, Conv2D, UpSampling2D, concatenate
from keras.models import Model

def create_depth_estimation_model(input_shape=(256, 256, 3)):
    """Creates a depth estimation model with a ResNet50 encoder and a U-Net-style decoder."""
    # Encoder
    inputs = Input(shape=input_shape)
    base_model = ResNet50(weights='imagenet', include_top=False, input_tensor=inputs)

    # Feature maps from the encoder used as skip connections
    skip_connections = [
        base_model.get_layer("conv1_relu").output,        # (None, 128, 128, 64)
        base_model.get_layer("conv2_block3_out").output,  # (None, 64, 64, 256)
        base_model.get_layer("conv3_block4_out").output,  # (None, 32, 32, 512)
        base_model.get_layer("conv4_block6_out").output,  # (None, 16, 16, 1024)
    ]

    # Decoder
    up1 = UpSampling2D(size=(2, 2))(base_model.output)         # (None, 16, 16, 2048)
    concat1 = concatenate([up1, skip_connections[3]], axis=-1)  # (None, 16, 16, 3072)
    conv1 = Conv2D(1024, 3, activation='relu', padding='same')(concat1)
    conv1 = Conv2D(1024, 3, activation='relu', padding='same')(conv1)

    up2 = UpSampling2D(size=(2, 2))(conv1)                      # (None, 32, 32, 1024)
    concat2 = concatenate([up2, skip_connections[2]], axis=-1)  # (None, 32, 32, 1536)
    conv2 = Conv2D(512, 3, activation='relu', padding='same')(concat2)
    conv2 = Conv2D(512, 3, activation='relu', padding='same')(conv2)

    up3 = UpSampling2D(size=(2, 2))(conv2)                      # (None, 64, 64, 512)
    concat3 = concatenate([up3, skip_connections[1]], axis=-1)  # (None, 64, 64, 768)
    conv3 = Conv2D(256, 3, activation='relu', padding='same')(concat3)
    conv3 = Conv2D(256, 3, activation='relu', padding='same')(conv3)

    up4 = UpSampling2D(size=(2, 2))(conv3)                      # (None, 128, 128, 256)
    concat4 = concatenate([up4, skip_connections[0]], axis=-1)  # (None, 128, 128, 320)
    conv4 = Conv2D(128, 3, activation='relu', padding='same')(concat4)
    conv4 = Conv2D(128, 3, activation='relu', padding='same')(conv4)

    up5 = UpSampling2D(size=(2, 2))(conv4)                      # (None, 256, 256, 128)
    conv5 = Conv2D(64, 3, activation='relu', padding='same')(up5)
    conv5 = Conv2D(64, 3, activation='relu', padding='same')(conv5)

    # Output layer
    output = Conv2D(1, 1, activation='linear')(conv5)  # or 'sigmoid'

    model = Model(inputs=inputs, outputs=output)
    return model
Using Keras and TensorFlow, we have built this architecture and wrapped it in a function. The image size used here is 256×256; it can be increased if needed, but that would increase the training time. Next, we should define a loss function to optimize the model as it trains. The loss function can be as complex or as simple as needed; in this tutorial, we take a moderate approach: a simple mean squared error loss combined with a Huber loss.
from tensorflow.keras import backend as K

def custom_loss(y_true, y_pred):
    mse_loss = K.mean(K.square(y_true - y_pred))
    huber_loss = tf.keras.losses.huber(y_true, y_pred)
    # Combine the losses
    total_loss = mse_loss + 0.1 * huber_loss
    return total_loss
The Huber term is given a weight, which we set to 0.1 here. Lastly, we need to split the data and run it through our preprocessing function so we can feed it to the model next.
from sklearn.model_selection import train_test_split

images = []
depth_maps = []
for index, row in df.iterrows():
    img, depth_map = load_and_preprocess_data(row)
    images.append(img)
    depth_maps.append(depth_map)

images = np.array(images)
depth_maps = np.array(depth_maps)

X_train, X_val, y_train, y_val = train_test_split(
    images, depth_maps, test_size=0.2, random_state=42
)
Model Training and Inference
To train the model we built, we will have to compile it and fit it to the data we have.
with tf.device('/GPU:0'):  # Use the first available GPU
    model = create_depth_estimation_model()
    model.compile(optimizer='adam', loss=custom_loss, metrics=['mae'])
    history = model.fit(
        X_train,
        y_train,
        epochs=60,
        batch_size=32,
        validation_data=(X_val, y_val),
        shuffle=True
    )
So, here we create the model on the GPU, compile it, and then fit it to the data. I did not set many hyperparameters in this case: I chose the number of epochs and enabled shuffling to try to prevent overfitting. The batch size of 32 is a good value for our Kaggle environment. There could be more hyperparameters in there, like the learning rate, and a small sketch of how to set it explicitly follows below. This training takes around 10-15 minutes.
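For example, if we wanted to control the learning rate rather than rely on the Adam default, we could compile the model with an explicit optimizer. The value 1e-4 below is an illustrative assumption, not a tuned setting.

# Optional: compile with an explicit learning rate instead of the string 'adam'
optimizer = keras.optimizers.Adam(learning_rate=1e-4)  # 1e-4 is an assumed starting point
model.compile(optimizer=optimizer, loss=custom_loss, metrics=['mae'])

Next, we can define a small function to prepare an input image to test the trained model.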
def load_and_preprocess_image(image_path, img_size=(256, 256)):
    """Loads and preprocesses a single image."""
    img = cv2.imread(image_path)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    img = cv2.resize(img, img_size)
    img = tf.image.convert_image_dtype(img, tf.float32)
    return img

new_image = load_and_preprocess_image("/kaggle/input/keras-depth-bee-image/bee.jpg")
This is a simple function that does something similar to the “load_and_preprocess_data” function. Then we can run inference with the model using the simple line below.
predicted_depth = model.predict(np.expand_dims(new_image, axis=0))
Now that we have run our image through the trained model, let’s view the results.
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 5))

plt.subplot(1, 2, 1)
plt.imshow(new_image)
plt.title("Original Image")

plt.subplot(1, 2, 2)
plt.imshow(predicted_depth[0, :, :, 0], cmap=plt.cm.jet)  # Remove batch and channel dimensions
plt.title("Predicted Depth Map")

plt.show()
In summary, building a monocular depth estimation model from scratch can be an extensive task, but it is a great way to learn a very important computer vision task. The model in this tutorial is just a simple demonstration; the results are not going to be great because of its simplicity and the shortcuts we took. Next, we can try a pre-trained model with a few lines of code and see the difference.
Running Inference with a Pre-Trained Model
In this section, we will run a simple inference with the Depth Anything V2 model, which achieves state-of-the-art results on benchmark datasets like KITTI. To use this model, we only need the few lines of code below.
from transformers import pipeline
from PIL import Image

# Load the depth estimation pipeline
pipe = pipeline(task="depth-estimation", model="depth-anything/Depth-Anything-V2-Small-hf")

# Load the image
url = '/kaggle/input/keras-depth-bee-image/bee.jpg'
image = Image.open(url)

# Inference
depth = pipe(image)["depth"]
If we save the “depth” variable, we can see the result of the depth estimation, as shown below. Inference is also quite fast, considering that we are using the small variant of the model.
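The Hugging Face depth-estimation pipeline returns the depth map under "depth" as a PIL image, so saving or displaying it takes only a line or two. The output filename below is just an example.

# The "depth" output is a PIL image, so it can be saved directly
depth.save("depth_map.png")

# Or displayed next to the input image with matplotlib
import matplotlib.pyplot as plt
plt.subplot(1, 2, 1); plt.imshow(image); plt.title("Input")
plt.subplot(1, 2, 2); plt.imshow(depth, cmap=plt.cm.jet); plt.title("Depth")
plt.show()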
With this, we have concluded the tutorial. However, this is only a starting step towards building monocular depth estimation models. These models are an active research area in CV and are seeing constant improvements, because they have a wide range of use cases: monocular depth estimation is important for autonomous vehicles, robotics, healthcare, and even agriculture and the study of historical sites.
The Future Of Monocular Depth Estimation
As we have seen, monocular depth estimation is a challenging but important task in computer vision, with applications spanning autonomous driving, robotics, augmented reality, and 3D modeling. The field is still improving, with researchers exploring new implementations and theories and pushing the boundaries. Deep learning with transformers is one promising direction, including architectures like Vision Transformers (ViT) that have shown strong results in many computer vision tasks, including monocular depth estimation.
Furthermore, researchers are trying to integrate monocular depth estimation with other computer vision tasks. Object detection, semantic segmentation, and scene understanding, combined with depth estimation, can create more comprehensive AI systems that interact with the world more effectively.
The future of monocular depth estimation is bright, with ongoing research promising to deliver more accurate, efficient, and versatile solutions. As these advancements continue, we can expect to see even more innovative applications emerge, transforming industries and enhancing our interaction with the world around us.
FAQs
Q1. What is monocular depth estimation?
Monocular depth estimation is a computer vision technique for estimating depth information from a single image.
Q2. Why is monocular depth estimation important?
Monocular depth estimation is crucial for various applications where understanding 3D scene geometry from a single image is necessary. This includes:
- Autonomous driving.
- Robotics.
- Augmented reality (AR).
Q3. What are the challenges in monocular depth estimation?
Estimating depth from a single image is inherently ambiguous, as multiple 3D scenes can produce the same 2D projection. This makes monocular depth estimation challenging. Key challenges include occlusions, textureless regions, and scale ambiguity.