Today’s boom in computer vision (CV) started at the beginning of the 21st century with the breakthrough of deep learning models and convolutional neural networks (CNNs). The main CV tasks include image classification, image localization, object detection, and segmentation.
In this article, we dive into some of the most significant research papers that triggered the rapid development of computer vision. We split them into two categories – classical CV approaches and papers based on deep learning. We chose the following papers based on their influence, quality, and applicability.
Classic Computer Vision Papers
- Gradient-based Learning Applied to Document Recognition (1998)
- Distinctive Image Features from Scale-Invariant Keypoints (2004)
- Histograms of Oriented Gradients for Human Detection (2005)
- SURF: Speeded Up Robust Features (2006)
Papers Based on Deep-Learning Models
- ImageNet Classification with Deep Convolutional Neural Networks (2012)
- Very Deep Convolutional Networks for Large-Scale Image Recognition (2014)
- GoogLeNet – Going Deeper with Convolutions (2014)
- ResNet – Deep Residual Learning for Image Recognition (2015)
- Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks (2015)
- YOLO: You Only Look Once: Unified, Real-Time Object Detection (2016)
- Mask R-CNN (2017)
- EfficientNet – Rethinking Model Scaling for Convolutional Neural Networks (2019)
About us: Viso Suite is the end-to-end computer vision solution for enterprises. With a simple interface and features that give machine learning teams control over the entire ML pipeline, Viso Suite makes it possible to achieve a 3-year ROI of 695%. It is also compatible with all popular computer vision frameworks and models. Book a demo to learn more about how Viso Suite can help solve business problems.
All images in this article are attributed to the papers in which they were originally published.
Classic Computer Vision Papers
Gradient-based Learning Applied to Document Recognition (1998)
The authors Yann LeCun, Leon Bottou, Yoshua Bengio, and Patrick Haffner published the LeNet paper in 1998. They introduced the concept of a trainable Graph Transformer Network (GTN) for handwritten character and word recognition. They investigated discriminative and non-discriminative gradient-based techniques for training the recognizer as a whole, without manual segmentation and labeling.
Characteristics of the model:
- The LeNet-5 CNN comprises 7 trainable layers – alternating convolutional and subsampling layers followed by fully connected ones; its first convolutional layer has 6 feature maps and 156 trainable parameters.
- The input is a 32×32 pixel image, and the output layer is composed of Euclidean Radial Basis Function (RBF) units, one for each character class (see the sketch after this list).
- The MNIST training set consists of 60,000 examples, and the authors reached a 0.35% error rate on the training set after 19 passes.
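For readers who want to see the layer layout in code, here is a minimal PyTorch sketch of a LeNet-5-style network. It is an illustrative approximation rather than the exact 1998 model (which used trainable subsampling layers and RBF output units); the layer sizes follow the paper.

```python
import torch
import torch.nn as nn

class LeNet5(nn.Module):
    """Rough LeNet-5-style layout: conv/pool stages followed by fully connected layers."""
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5),    # C1: 32x32 -> 28x28, 156 trainable parameters
            nn.Tanh(),
            nn.AvgPool2d(2),                   # S2: 28x28 -> 14x14 (stands in for subsampling)
            nn.Conv2d(6, 16, kernel_size=5),   # C3: 14x14 -> 10x10
            nn.Tanh(),
            nn.AvgPool2d(2),                   # S4: 10x10 -> 5x5
            nn.Conv2d(16, 120, kernel_size=5), # C5: 5x5 -> 1x1
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Tanh(),
            nn.Linear(120, 84),                # F6
            nn.Tanh(),
            nn.Linear(84, num_classes),        # stand-in for the RBF output layer
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = LeNet5()
out = model(torch.randn(1, 1, 32, 32))  # 32x32 grayscale input as in the paper
print(out.shape)                        # torch.Size([1, 10])
```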
Find the LeNet paper here.
Distinctive Image Features from Scale-Invariant Keypoints (2004)
David Lowe (2004) proposed a method for extracting distinctive invariant features from images and used them to perform reliable matching between different views of an object or scene. The paper introduced the Scale-Invariant Feature Transform (SIFT), which transforms image data into scale-invariant coordinates relative to local features.
Model characteristics:
- The method generates large numbers of features that densely cover the image over the full range of scales and locations.
- To reliably detect small objects in cluttered backgrounds, the model needs to correctly match as few as 3 features from each object.
- For image matching and recognition, the model extracts SIFT features from a set of reference images stored in a database.
- The SIFT model matches a new image by individually comparing each of its features to this database using the Euclidean distance between feature vectors, as sketched below.
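As a rough illustration of this matching step, the sketch below detects SIFT keypoints in two images and matches them with Euclidean (L2) distance plus Lowe’s ratio test. It assumes a recent opencv-python build that includes SIFT, and the image file names are placeholders.

```python
import cv2

# Load a reference image and a new query image (hypothetical file names)
ref = cv2.imread("reference.jpg", cv2.IMREAD_GRAYSCALE)
query = cv2.imread("query.jpg", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp_ref, des_ref = sift.detectAndCompute(ref, None)        # keypoints + 128-D descriptors
kp_query, des_query = sift.detectAndCompute(query, None)

# Brute-force matching with Euclidean (L2) distance, keeping the two nearest neighbors
matcher = cv2.BFMatcher(cv2.NORM_L2)
matches = matcher.knnMatch(des_query, des_ref, k=2)

# Lowe's ratio test: keep matches whose best distance is clearly better than the second best
good = []
for pair in matches:
    if len(pair) == 2 and pair[0].distance < 0.75 * pair[1].distance:
        good.append(pair[0])
print(f"{len(good)} reliable SIFT matches")
```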
Find the SIFT paper here.
Histograms of Oriented Gradients for Human Detection (2005)
The authors Navneet Dalal and Bill Triggs studied feature sets for robust visual object recognition, using linear SVM-based human detection as a test case. They showed experimentally that grids of Histograms of Oriented Gradient (HOG) descriptors significantly outperform existing feature sets for human detection.
Authors’ achievements:
- The histogram method gave near-perfect separation on the original MIT pedestrian database.
- For good results, the model requires fine-scale gradients, fine orientation binning, relatively coarse spatial binning, and high-quality local contrast normalization in overlapping descriptor blocks.
- The researchers also introduced a more challenging dataset containing over 1,800 annotated human images with a large range of pose variations and backgrounds.
- In the standard detector, each HOG cell appears four times with different normalizations, and including this redundant information improves performance to 89%.
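OpenCV ships a HOG descriptor with a pre-trained linear SVM for pedestrian detection that closely follows the Dalal–Triggs pipeline. The sketch below shows one way to run it; the image path and the sliding-window parameters are illustrative, not values from the paper.

```python
import cv2

hog = cv2.HOGDescriptor()  # default 64x128 detection window with 9 orientation bins
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

img = cv2.imread("street_scene.jpg")
# Slide the detection window over an image pyramid and return pedestrian boxes
boxes, weights = hog.detectMultiScale(img, winStride=(8, 8), scale=1.05)

for (x, y, w, h) in boxes:
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
cv2.imwrite("detections.jpg", img)
```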
Find the HOG paper here.
SURF: Speeded Up Robust Features (2006)
Herbert Bay, Tinne Tuytelaars, and Luc Van Gool presented a scale- and rotation-invariant interest point detector and descriptor, called SURF (Speeded Up Robust Features). It outperforms previously proposed schemes with respect to repeatability, distinctiveness, and robustness, while being much faster to compute. The authors relied on integral images for image convolutions and built on the strengths of the leading existing detectors and descriptors.
Authors’ achievements:
- Applied a Hessian matrix-based measure for the detector, and a distribution-based descriptor, simplifying these methods to the essential.
- Presented experimental results on a standard evaluation set, as well as on imagery obtained in the context of a real-life object recognition application.
- SURF showed strong performance – SURF-128 with an 85.7% recognition rate, followed by U-SURF (83.8%) and SURF (82.6%).
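A key reason for SURF’s speed is the integral image, which lets any box filter be evaluated with four array lookups regardless of its size. The following NumPy sketch (not code from the paper) shows the idea.

```python
import numpy as np

def integral_image(img: np.ndarray) -> np.ndarray:
    """Cumulative sums over rows and columns, padded so box lookups stay simple."""
    ii = img.cumsum(axis=0).cumsum(axis=1)
    return np.pad(ii, ((1, 0), (1, 0)), mode="constant")

def box_sum(ii: np.ndarray, top: int, left: int, bottom: int, right: int) -> float:
    """Sum of img[top:bottom, left:right] in O(1) via four lookups."""
    return ii[bottom, right] - ii[top, right] - ii[bottom, left] + ii[top, left]

img = np.arange(16, dtype=np.float64).reshape(4, 4)
ii = integral_image(img)
print(box_sum(ii, 1, 1, 3, 3))   # 5 + 6 + 9 + 10 = 30.0
print(img[1:3, 1:3].sum())       # same result computed directly
```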
Find the SURF paper here.
Papers Based on Deep-Learning Models
ImageNet Classification with Deep Convolutional Neural Networks (2012)
Alex Krizhevsky and his team won the ImageNet Challenge in 2012 with a deep convolutional neural network. They trained one of the largest CNNs of its time on the ImageNet data used in the ILSVRC-2010 and ILSVRC-2012 challenges and achieved the best results reported on these datasets. They also wrote a highly optimized GPU implementation of 2D convolution and all the other operations required for CNN training, which they made publicly available.
Model characteristics:
- The final CNN contained five convolutional and three fully connected layers, and this depth proved to be important (see the sketch after this list).
- They found that removing any convolutional layer (each containing less than 1% of the model’s parameters) resulted in inferior performance.
- The same CNN, with an extra sixth convolutional layer, was used to classify the entire ImageNet Fall 2011 release (15M images, 22K categories).
- After fine-tuning on the ILSVRC-2012 data, it gave a top-5 error rate of 16.6%.
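The five-convolutional plus three-fully-connected layout is available off the shelf in torchvision, which makes it easy to inspect. The sketch below assumes a recent torchvision (0.13 or newer) and builds the architecture without downloading pretrained weights.

```python
import torch
from torchvision import models

# Instantiate the AlexNet architecture (weights=None builds it without pretrained weights)
model = models.alexnet(weights=None)
print(model)  # 5 convolutional layers in `features`, 3 linear layers in `classifier`

# The torchvision version expects 224x224 RGB crops
x = torch.randn(1, 3, 224, 224)
logits = model(x)
print(logits.shape)  # torch.Size([1, 1000]) -- one score per ImageNet class
```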
Find the ImageNet paper here.
Very Deep Convolutional Networks for Large-Scale Image Recognition (2014)
Karen Simonyan and Andrew Zisserman (Oxford University) investigated the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Their main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3×3) convolution filters, specifically focusing on very deep convolutional networks (VGG). They showed that a significant improvement over the prior-art configurations can be achieved by pushing the depth to 16–19 weight layers.
Authors’ achievements:
- Their ImageNet Challenge 2014 submission secured the first and second places in the localization and classification tracks respectively.
- They showed that their representations generalize well to other datasets, where they achieved state-of-the-art results.
- They made their two best-performing ConvNet models publicly available to facilitate further research on the use of deep visual representations in computer vision.
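The core VGG idea is to stack small 3×3 convolutions instead of using larger filters: two stacked 3×3 layers cover a 5×5 receptive field with fewer parameters and an extra non-linearity. The PyTorch sketch below shows an illustrative VGG-style stage, not the full published configuration.

```python
import torch
import torch.nn as nn

def vgg_block(in_ch: int, out_ch: int, num_convs: int) -> nn.Sequential:
    """A VGG-style stage: `num_convs` 3x3 convolutions followed by 2x2 max pooling."""
    layers = []
    for i in range(num_convs):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, kernel_size=3, padding=1),
                   nn.ReLU(inplace=True)]
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers)

# Stack a few stages as VGG-16 does, doubling channels while halving resolution
stem = nn.Sequential(vgg_block(3, 64, 2), vgg_block(64, 128, 2), vgg_block(128, 256, 3))
print(stem(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 256, 28, 28])
```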
Find the VGG paper here.
GoogLeNet – Going Deeper with Convolutions (2014)
The Google team (Christian Szegedy, Wei Liu, et al.) proposed a deep convolutional neural network architecture codenamed Inception. They intended to set the new state of the art for image classification and object detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14). The main hallmark of their architecture was the improved utilization of the computing resources inside the network.
Authors’ achievements:
- A carefully crafted design that allows for increasing the depth and width of the network while keeping the computational budget constant.
- Their submission for ILSVRC14 was called GoogLeNet, a 22-layer deep network. Its quality was assessed in the context of classification and detection.
- They added 200 region proposals coming from multi-box, increasing the coverage from 92% to 93%.
- Lastly, they used an ensemble of 6 ConvNets when classifying each region, which improved the results from 40% to 43.9% accuracy.
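The building block behind this efficiency is the Inception module, which runs 1×1, 3×3, and 5×5 convolutions plus pooling in parallel and concatenates their outputs, using 1×1 convolutions as cheap dimensionality reductions. Below is a simplified PyTorch sketch with the channel counts of the paper’s first Inception module (3a); the published GoogLeNet adds further ReLUs and auxiliary classifiers.

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    """Simplified Inception block: parallel branches concatenated along the channel axis."""
    def __init__(self, in_ch, c1x1, c3x3_reduce, c3x3, c5x5_reduce, c5x5, pool_proj):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, c1x1, kernel_size=1)
        self.b2 = nn.Sequential(nn.Conv2d(in_ch, c3x3_reduce, 1), nn.ReLU(inplace=True),
                                nn.Conv2d(c3x3_reduce, c3x3, 3, padding=1))
        self.b3 = nn.Sequential(nn.Conv2d(in_ch, c5x5_reduce, 1), nn.ReLU(inplace=True),
                                nn.Conv2d(c5x5_reduce, c5x5, 5, padding=2))
        self.b4 = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                nn.Conv2d(in_ch, pool_proj, 1))

    def forward(self, x):
        return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)

block = InceptionModule(192, 64, 96, 128, 16, 32, 32)  # channel counts of module 3a
out = block(torch.randn(1, 192, 28, 28))
print(out.shape)  # torch.Size([1, 256, 28, 28]) -- 64 + 128 + 32 + 32 channels
```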
Find the GoogLeNet paper here.
ResNet – Deep Residual Learning for Image Recognition (2015)
Microsoft researchers Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun presented a residual learning framework (ResNet) to ease the training of networks that are substantially deeper than those used previously. They reformulated the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions.
Authors’ achievements:
- They evaluated residual nets with a depth of up to 152 layers – 8× deeper than VGG nets, but still having lower complexity.
- This result won 1st place on the ILSVRC 2015 classification task.
- The team also presented an analysis of networks with 100 and 1,000 layers on CIFAR-10. Thanks to the deep residual representations, they obtained a 28% relative improvement on the COCO object detection dataset.
- Moreover – in ILSVRC & COCO 2015 competitions, they won 1st place on the tasks of ImageNet detection, ImageNet localization, COCO detection/segmentation.
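The core idea is small enough to sketch: a residual block adds its input back to the output of a couple of convolutions, so the layers only have to learn the residual F(x) rather than an unreferenced mapping. The PyTorch snippet below shows a simplified basic block; the deepest ResNets use a three-layer bottleneck variant and projection shortcuts when dimensions change.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic residual block: output = ReLU(F(x) + x), where F is two 3x3 convolutions."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        residual = x                              # identity shortcut
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + residual)          # add the input back before the final activation

block = ResidualBlock(64)
print(block(torch.randn(1, 64, 56, 56)).shape)  # torch.Size([1, 64, 56, 56])
```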
Find the ResNet paper here.
Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks (2015)
Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun introduced a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thereby enabling nearly cost-free region proposals. Their RPN was a fully convolutional network that simultaneously predicted object bounds and objectness scores at each position. They trained the RPN end-to-end to generate high-quality region proposals, which Fast R-CNN then used for detection.
Authors’ achievements:
- Merged the RPN and Fast R-CNN into a single network by sharing their convolutional features; in the terminology of neural networks with “attention” mechanisms, the RPN component tells the unified network where to look.
- For the very deep VGG-16 model, their detection system had a frame rate of 5 fps on a GPU.
- Achieved state-of-the-art object detection accuracy on PASCAL VOC 2007, 2012, and MS COCO datasets with only 300 proposals per image.
- In the ILSVRC and COCO 2015 competitions, Faster R-CNN and RPN were the foundations of the 1st-place winning entries in several tracks.
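Faster R-CNN is available as a pretrained detector in torchvision, which is a convenient way to see the boxes, labels, and scores it produces. The sketch below assumes torchvision 0.13 or newer and uses a placeholder image path.

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn, FasterRCNN_ResNet50_FPN_Weights
from torchvision.io import read_image

weights = FasterRCNN_ResNet50_FPN_Weights.DEFAULT
model = fasterrcnn_resnet50_fpn(weights=weights).eval()

img = read_image("street.jpg")            # uint8 tensor of shape [3, H, W]
batch = [weights.transforms()(img)]       # preprocess as the pretrained weights expect

with torch.no_grad():
    pred = model(batch)[0]                # dict with boxes, labels, and confidence scores

for box, label, score in zip(pred["boxes"], pred["labels"], pred["scores"]):
    if score > 0.8:
        print(weights.meta["categories"][int(label)], box.tolist(), round(float(score), 2))
```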
Find the Faster R-CNN paper here.
YOLO: You Only Look Once: Unified, Real-Time Object Detection (2016)
Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi developed YOLO, an innovative approach to object detection. Instead of repurposing classifiers to perform detection, the authors framed object detection as a regression problem to spatially separated bounding boxes and associated class probabilities. A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation. Since the whole detection pipeline is a single network, it can be optimized end-to-end directly on detection performance.
Redmon did not continue his work in AI due to ethics concerns. However, members of the open-source computer vision community have continued to release their own YOLO versions in the years since – some accompanied by a published paper and considered legitimate successors, some not. These YOLO models are fast and easy to use, making them a popular choice for students, hobbyists, and developers. Examples include YOLOv7, YOLOv8, and YOLOv10.
Model characteristics:
- The base YOLO model processed images in real-time at 45 frames per second.
- A smaller version of the network, Fast YOLO, processed 155 frames per second, while still achieving double the mAP of other real-time detectors.
- Compared to state-of-the-art detection systems, YOLO made more localization errors but was less likely to predict false positives on the background.
- YOLO learned very general representations of objects and outperformed other detection methods, including DPM and R-CNN, when generalizing from natural images to other domains such as artwork.
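Modern YOLO releases keep this single-network design and are straightforward to try. The sketch below uses the ultralytics package (one common way to run YOLOv8, assuming it is installed); the model file is downloaded on first use and the image path is a placeholder.

```python
# pip install ultralytics
from ultralytics import YOLO

model = YOLO("yolov8n.pt")       # small pretrained YOLOv8 model
results = model("street.jpg")    # one forward pass returns boxes and class probabilities

for box in results[0].boxes:
    cls_name = results[0].names[int(box.cls)]
    print(cls_name, box.xyxy.tolist(), float(box.conf))
```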
Find the YOLO paper here.
Mask R-CNN (2017)
Kaiming He, Georgia Gkioxari, Piotr Dollar, and Ross Girshick (Facebook) presented a conceptually simple, flexible, and general framework for object instance segmentation. Their approach could detect objects in an image, while simultaneously generating a high-quality segmentation mask for each instance. The method, called Mask R-CNN, extended Faster R-CNN by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition.
Model characteristics:
- Mask R-CNN is simple to train and adds only a small overhead to Faster R-CNN, running at 5 fps.
- Showed strong results in all three tracks of the COCO suite of challenges: instance segmentation, bounding-box object detection, and person keypoint detection.
- Mask R-CNN outperformed all existing single-model entries on every task, including the COCO 2016 challenge winners.
- The model served as a solid baseline and eased future research in instance-level recognition.
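Mask R-CNN is also exposed through torchvision, and compared with the Faster R-CNN sketch earlier, the prediction dictionary simply gains a per-instance masks entry produced by the parallel mask branch. This is again a hedged sketch, assuming torchvision 0.13 or newer and a placeholder image path.

```python
import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn, MaskRCNN_ResNet50_FPN_Weights
from torchvision.io import read_image

weights = MaskRCNN_ResNet50_FPN_Weights.DEFAULT
model = maskrcnn_resnet50_fpn(weights=weights).eval()

img = read_image("street.jpg")
with torch.no_grad():
    pred = model([weights.transforms()(img)])[0]

# Each detection carries a soft mask of shape [1, H, W] alongside its box and label
keep = pred["scores"] > 0.8
masks = (pred["masks"][keep] > 0.5).squeeze(1)  # threshold to binary instance masks
print(masks.shape)                              # e.g. torch.Size([num_instances, H, W])
```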
Find the Mask R-CNN paper here.
EfficientNet – Rethinking Model Scaling for Convolutional Neural Networks (2019)
The authors of EfficientNet (Mingxing Tan and Quoc V. Le) studied model scaling and identified that carefully balancing network depth, width, and resolution can lead to better performance. They proposed a new scaling method that uniformly scales all dimensions of depth, width, and resolution using a simple yet effective compound coefficient. They demonstrated the effectiveness of this method by scaling up MobileNets and ResNet.
Authors’ achievements:
- Designed a new baseline network and scaled it up to obtain a family of models, called EfficientNets. It had much better accuracy and efficiency than previous ConvNets.
- EfficientNet-B7 achieved state-of-the-art 84.3% top-1 accuracy on ImageNet, while being 8.4x smaller and 6.1x faster on inference than the best existing ConvNet.
- It also transferred well and achieved state-of-the-art accuracy on CIFAR-100 (91.7%), Flowers (98.8%), and 3 other transfer learning datasets, with an order of magnitude fewer parameters.
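The compound scaling rule itself is simple enough to write down: a single coefficient φ scales depth by α^φ, width by β^φ, and resolution by γ^φ, where the paper’s small grid search found α ≈ 1.2, β ≈ 1.1, γ ≈ 1.15 under the constraint α·β²·γ² ≈ 2. The sketch below just evaluates those multipliers.

```python
# Compound scaling as described in the EfficientNet paper:
# depth, width, and input resolution grow together, governed by one coefficient phi.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15   # values found by the paper's grid search

def compound_scale(phi: float) -> dict:
    """Return the multipliers applied to the EfficientNet-B0 baseline for a given phi."""
    return {
        "depth": ALPHA ** phi,         # number of layers per stage
        "width": BETA ** phi,          # number of channels per layer
        "resolution": GAMMA ** phi,    # input image side length
    }

for phi in (0, 1, 2, 3):               # roughly corresponds to B0 through B3
    s = compound_scale(phi)
    print(f"phi={phi}: depth x{s['depth']:.2f}, width x{s['width']:.2f}, "
          f"resolution x{s['resolution']:.2f}")
```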
Find the EfficientNet paper here.
Future Computer Vision Research
We’re seeing more and more computer vision applications adopted across industries as the technology matures. This is thanks to the achievements and research contributed by the papers covered in this article. We’re optimistic about the future of AI and eager to continue facilitating real-world applications.