What Is Self-Supervised Learning
Self-supervised learning has drawn massive attention for its excellent data efficiency and generalization ability. This approach allows neural networks to learn more from fewer labels, fewer samples, or fewer trials.
Recent self-supervised learning models include frameworks such as Pre-trained Language Models (PTM), Generative Adversarial Networks (GAN), Autoencoders and their extensions, Deep Infomax, and Contrastive Coding.
Background of Self-Supervised Learning
The term “self-supervised learning” was first introduced in robotics, where the training data is automatically labeled by finding and exploiting the relations between different input signals from sensors. The term was then borrowed by the field of machine learning.
The self-supervised learning approach can be described as "the machine predicts any part of its input for any observed part". Learning involves obtaining "labels" from the data itself through a "semiautomatic" process, and predicting parts of the data from other parts. Here, the "other parts" could be incomplete, transformed, distorted, or corrupted fragments. In other words, the machine learns to "recover" the whole of its original input, or parts of it, or merely some of its features.
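The idea of obtaining "labels" from the data itself can be made concrete with a small sketch. The example below is a hypothetical inpainting-style pretext task (the function name and masking scheme are illustrative, not from a specific framework): part of the input is blanked out, and the original values become the prediction target, with no manual annotation involved.

```python
import numpy as np

def make_pretext_pair(x, mask_frac=0.25, seed=0):
    """Return (corrupted_input, target, mask) for a 1-D sample x.

    The target is the original input itself - the "label" comes for free.
    """
    rng = np.random.default_rng(seed)
    mask = rng.random(x.shape) < mask_frac   # positions to "blank out"
    corrupted = x.copy()
    corrupted[mask] = 0.0                    # the machine must fill these in
    return corrupted, x, mask

x = np.arange(8, dtype=float)
corrupted, target, mask = make_pretext_pair(x)
# Unmasked positions are untouched; masked positions were zeroed out.
assert np.all(corrupted[~mask] == x[~mask])
assert np.all(corrupted[mask] == 0.0)
```

A model trained to reconstruct `target` from `corrupted` is "filling in the blanks" in exactly the sense described above.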
Self-Supervised Learning Is "Filling in the Blanks"
People often confuse the terms Unsupervised Learning (UL) and Self-Supervised Learning (SSL). Self-supervised learning can be considered a branch of unsupervised learning, since no manual labeling is involved. More precisely, unsupervised learning focuses on detecting specific data patterns (such as clustering, community discovery, or anomaly detection), while self-supervised learning aims at recovering missing parts, which still falls within the supervised paradigm.
The Bottlenecks of Supervised Learning for Computer Vision
Deep neural networks have shown excellent performance on various machine learning tasks, especially on supervised learning in computer vision. Modern computer vision systems achieve outstanding results by performing a wide range of challenging vision tasks, such as object detection, image recognition, or semantic image segmentation.
However, a supervised model is trained on a specific task with a large, manually labeled dataset that is randomly divided into training, validation, and test sets. Therefore, the success of deep learning-based computer vision relies on the availability of a large amount of annotated data, which is time-consuming and expensive to acquire.
Besides the expensive manual labeling, supervised learning also suffers from generalization error, spurious correlations, and adversarial attacks.
Advantages of Self-Supervised Learning
For some scenarios, building large labeled datasets to develop computer vision algorithms is not practically feasible:
- Most real-world computer vision applications involve visual categories that are not part of a standard benchmark dataset.
- Also, some applications exhibit a dynamic nature, where visual categories or their appearance change over time.
Hence, self-supervised learning methods could be developed that successfully learn to recognize new concepts by leveraging only a small amount of labeled examples.
The ultimate goal is to enable machines to understand new concepts quickly after seeing only a few labeled examples, similar to how quickly humans are able to learn.
Self-Supervised Visual Representation Learning
Learning from unlabeled data, which is much easier to acquire in real-world applications, is part of a large research effort. Recently, the field of self-supervised visual representation learning has demonstrated the most promising results.
Self-supervised learning techniques define pretext tasks that can be formulated using only unlabeled data but do require higher-level semantic understanding in order to be solved. Therefore, models trained for solving these pretext tasks learn representations that can be used for solving other downstream tasks of interest, such as image recognition.
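One widely used pretext task of this kind is rotation prediction: each unlabeled image is rotated by 0, 90, 180, or 270 degrees, and the model is trained to classify which rotation was applied. The rotation index acts as a free label that nonetheless requires semantic understanding of the image to predict. The sketch below only shows the pseudo-label generation step; the function name and batch layout are illustrative assumptions.

```python
import numpy as np

def rotation_pretext_batch(images, seed=0):
    """images: array of shape (N, H, W). Returns rotated images + labels.

    Each image is rotated by a random multiple of 90 degrees; the multiple
    (0-3) is the pretext "label", derived from the data transformation itself.
    """
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, 4, size=len(images))   # 4 rotation classes
    rotated = np.stack([np.rot90(img, k) for img, k in zip(images, labels)])
    return rotated, labels

imgs = np.arange(2 * 4 * 4, dtype=float).reshape(2, 4, 4)
rotated, labels = rotation_pretext_batch(imgs)
# Rotating back by the label recovers the original image.
assert all(np.array_equal(np.rot90(r, -k), img)
           for r, k, img in zip(rotated, labels, imgs))
```

A classifier trained on such pairs learns features that transfer to downstream tasks such as image recognition, typically via a linear probe or fine-tuning.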
In the computer vision community, multiple self-supervised methods have been introduced.
- Learned representations were able to linearly separate the 1,000 ImageNet categories.
- Diverse self-supervision techniques were used for predicting the spatial context, colorization, equivariance to transformations alongside unsupervised techniques such as clustering, generative modeling, and exemplar learning.
Recent research about self-supervised learning of image representations from videos:
- Methods were used to analyze the temporal context of frames in video data.
- Temporal coherence was exploited in a co-training setting by early work on learning convolutional neural networks (CNNs) for visual object detection and face detection.
- Self-supervised models are performing well on tasks such as surface normal estimation, detection, and navigation.
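The temporal-context idea above can also be sketched concretely. A common formulation (hypothetical names here, not from a specific paper) shuffles the frames of a short clip and uses the permutation itself as a free label, so a model can be trained to predict the correct temporal order.

```python
import numpy as np

def temporal_order_pair(clip, seed=0):
    """clip: array of shape (T, ...). Returns (shuffled_clip, order_label).

    The permutation applied to the frames is the pretext label - it is
    derived from the video itself, with no manual annotation.
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(clip))
    return clip[order], order

clip = np.arange(5)[:, None] * np.ones((5, 3))  # 5 "frames" of 3 pixels each
shuffled, order = temporal_order_pair(clip)
# Sorting the shuffled frames by the label restores the temporal order.
assert np.array_equal(shuffled[np.argsort(order)], clip)
```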
In summary, supervised learning works well but requires large amounts of manually labeled data. Self-supervised learning is about training a machine by showing it examples instead of programming it explicitly. This field is considered key to the future of deep learning-based systems. If you enjoyed reading this article, we recommend:
- How does Deep Reinforcement Learning work?
- Complete list of real-world computer vision applications
- Guide to Image Recognition