This article covers image recognition, an application of Artificial Intelligence (AI) and computer vision. Specifically, you will learn about:
- What image recognition is
- How image recognition works
- Traditional and modern deep learning image recognition
- Image recognition with neural networks and deep learning
- The best and most popular image recognition algorithms
- How to use Python for image recognition
- Examples and applications
What is Image Recognition?
Image Recognition is the task of identifying objects of interest within an image and recognizing which category they belong to. The terms photo recognition and picture recognition are used interchangeably.
When we see an object or a scene, we automatically identify objects as different instances and associate them with individual definitions. However, visual recognition is a highly complex task for machines to perform.
Image recognition using artificial intelligence is a long-standing research problem in the computer vision field. While different methods evolved over time, the common goal of image recognition is the classification of detected objects into different categories. Therefore, it is also called object recognition.
In recent years, machine learning, in particular deep learning technology, has achieved major successes in many computer vision and image understanding tasks. Hence, deep learning image recognition methods achieve the best results in terms of performance (measured in frames per second, FPS) and flexibility. Later in this article, we will cover the best-performing deep learning algorithms and AI models for image recognition.
Meaning and Definition of Image Recognition
In the area of Computer Vision, terms such as Segmentation, Classification, Recognition, and Detection are often used interchangeably, and the different tasks overlap. While this is mostly unproblematic, things get confusing if your workflow requires you to specifically perform a particular task.
Image Recognition vs. Computer Vision
The terms image recognition and computer vision are often used interchangeably but are actually different. In fact, image recognition is an application of computer vision that includes a set of tasks, including object detection and image classification.
Image Recognition vs. Object Localization
Object localization is another subset of computer vision often confused with image recognition. Object localization refers to identifying the location of one or more objects in an image and drawing a bounding box around their perimeter. However, object localization does not include the classification of detected objects.
Image Recognition vs. Image Detection
The terms image recognition and image detection are often used in place of each other. However, there are important technical differences.
Image Detection is the task of taking an image as input and finding various objects within it. An example is face detection, where algorithms aim to find face patterns in images (see the example below). When we strictly deal with detection, we do not care whether the detected objects are significant in any way. The goal of image detection is only to distinguish one object from another to determine how many distinct entities are present within the picture. Thus, bounding boxes are drawn around each separate object.
On the other hand, image recognition is the task of identifying the objects of interest within an image and recognizing which category or class they belong to.
How does Image Recognition work?
Using traditional Computer Vision
The conventional computer vision approach of image recognition is a sequence of image filtering, segmentation, feature extraction, and rule-based classification.
However, the traditional computer vision approach requires a high level of expertise and a lot of engineering time, and it involves many parameters that must be determined manually, while portability to other tasks is quite limited.
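As a toy illustration (not a production method), the four stages above can be sketched in a few lines of NumPy, with hand-picked thresholds standing in for the manually determined parameters:

```python
import numpy as np

def classify_shape(image, threshold=0.5):
    """Toy version of the classic pipeline: filter -> segment -> features -> rules."""
    # 1. Filtering: a crude 3x3 box blur to suppress noise.
    padded = np.pad(image, 1, mode="edge")
    blurred = sum(
        padded[i:i + image.shape[0], j:j + image.shape[1]]
        for i in range(3) for j in range(3)
    ) / 9.0
    # 2. Segmentation: a global threshold separates object from background.
    mask = blurred > threshold
    # 3. Feature extraction: area and bounding-box fill ratio of the mask.
    ys, xs = np.nonzero(mask)
    if len(ys) == 0:
        return "empty"
    height = ys.max() - ys.min() + 1
    width = xs.max() - xs.min() + 1
    fill_ratio = mask.sum() / (height * width)
    # 4. Rule-based classification with a hand-tuned threshold.
    if fill_ratio > 0.8:
        return "rectangle-like"
    return "irregular"

img = np.zeros((10, 10))
img[2:8, 3:7] = 1.0          # a solid bright rectangle
print(classify_shape(img))   # rectangle-like
```

Every number in this sketch (blur size, both thresholds) had to be chosen by hand, which is exactly the engineering burden the paragraph above describes.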
Using Machine Learning and Deep Learning
Image recognition with machine learning, on the other hand, uses algorithms to learn hidden knowledge from a dataset of good and bad samples (Supervised Learning). The most popular machine learning method is deep learning, where multiple hidden layers are used in a model.
The introduction of deep learning, in combination with powerful AI hardware and GPUs, enabled great breakthroughs in the field of image recognition. With deep learning, image classification and face recognition algorithms achieve above-human-level performance and enable real-time object detection.
In addition, algorithm inference performance has jumped in recent years. In 2017, the Mask RCNN algorithm was a state-of-the-art object detector on the MS COCO benchmark, with an inference time of 330 ms per frame. In comparison, the YOLOR algorithm released in 2021 achieves inference times of about 12 ms on the same benchmark, surpassing even the popular YOLOv4 and YOLOv3 deep learning algorithms.
Compared to the traditional computer vision approach of early image processing some 20 years ago, deep learning requires only engineering knowledge of a machine learning tool, not expertise in specific machine vision areas to create handcrafted features. Also, some deep learning implementations need only tens of training samples.
However, deep learning requires manual labeling of data to annotate good and bad samples (Image Annotation). The process of learning from data that is labeled by humans is called supervised learning. The process of creating such labeled data to train AI models requires time-consuming human work, for example, to annotate standard traffic situations in autonomous driving.
The Process of Image Recognition Systems
A few steps form the backbone of how image recognition systems work.
- Dataset with training data
Image recognition models require training data (videos, pictures, photos, etc.). Neural networks need these training images from an acquired dataset to learn what certain classes look like.
For example, an image recognition model that detects different poses (pose estimation model) would need multiple instances of different human poses to understand what makes poses unique from each other.
- Training of Neural Networks
The images from the created dataset are fed into a neural network algorithm. This is the deep or machine learning aspect of creating an image recognition model. The training of an image recognition algorithm makes it possible for convolutional neural networks to identify specific classes. There are multiple well-tested frameworks that are widely used for these purposes today.
- Model Testing
The trained model needs to be tested with images that are not part of the training dataset. This is used to determine the usability, performance, and accuracy of the model. Typically, about 80-90% of the complete image dataset is used for model training, while the remaining data is reserved for model testing. Model performance is measured with a set of parameters that indicate the percent confidence per test image, incorrect identifications, and more. Read our article about how to evaluate the performance of machine learning models.
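The hold-out split described above can be sketched as follows; the 80/20 ratio, seed, and file names are illustrative:

```python
import random

def train_test_split(samples, train_fraction=0.8, seed=42):
    """Hold out a portion of the dataset for testing, as described above."""
    shuffled = samples[:]                  # don't mutate the caller's list
    random.Random(seed).shuffle(shuffled)  # fixed seed -> reproducible split
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

dataset = [f"image_{i:03d}.jpg" for i in range(100)]
train, test = train_test_split(dataset, train_fraction=0.8)
print(len(train), len(test))  # 80 20
```

Shuffling before the cut matters: without it, images collected in sequence (e.g., all photos of one class at the end) would land entirely in the test set.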
Image Recognition with Machine Learning
Before GPUs (Graphics Processing Units) became powerful enough to support the massively parallel computation of neural networks, traditional machine learning algorithms were the gold standard for image recognition.
Machine Learning Image Recognition Models
Let’s look at the three most popular image recognition machine learning models.
- Support Vector Machines
SVMs work by building histograms of images that contain the target objects and of images that don't. The algorithm then takes a test picture and compares the trained histogram values with those of various parts of the picture to check for matches.
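As a sketch of just the histogram-comparison step (omitting the actual SVM training), the idea can be illustrated with simple intensity histograms; the patches and bin count below are invented for the example:

```python
import numpy as np

def intensity_histogram(patch, bins=8):
    """Normalized grayscale histogram used as a simple feature vector."""
    hist, _ = np.histogram(patch, bins=bins, range=(0.0, 1.0))
    return hist / max(hist.sum(), 1)

def histogram_similarity(h1, h2):
    """Histogram intersection: 1.0 means identical intensity distributions."""
    return float(np.minimum(h1, h2).sum())

object_patch = np.linspace(0.6, 1.0, 64).reshape(8, 8)  # bright "object"
background   = np.zeros((8, 8))                         # dark background
reference    = intensity_histogram(object_patch)

print(histogram_similarity(reference, intensity_histogram(object_patch)))
print(histogram_similarity(reference, intensity_histogram(background)))
```

A real pipeline would feed such feature vectors into a trained SVM rather than compare them directly, but the matching intuition is the same.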
- Bag of Features Models
Bag of Features models build on feature extractors such as the Scale-Invariant Feature Transform (SIFT) and Maximally Stable Extremal Regions (MSER). They take the image to be scanned and a sample photo of the object to be found as a reference, then try to match features from the sample photo to various parts of the target image to see whether matches are found.
- Viola-Jones Algorithm
A widely-used facial recognition algorithm from pre-CNN (Convolutional Neural Network) times, Viola-Jones works by scanning faces and extracting features that are then passed through a boosting classifier. This, in turn, generates a number of boosted classifiers that are used to check test images. For a successful match to be found, a test image must generate a positive result from each of these classifiers.
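The cascade idea can be sketched as follows; the three hand-written stages below are hypothetical stand-ins for trained boosted classifiers:

```python
def cascade_match(image_features, stages):
    """A face is reported only if every boosted stage votes positive."""
    for stage in stages:
        if not stage(image_features):
            return False  # rejected early: most non-faces exit cheaply here
    return True

# Hypothetical hand-written stages standing in for trained boosted classifiers.
stages = [
    lambda f: f["eye_region_darker"],       # Haar-like contrast check
    lambda f: f["nose_bridge_brighter"],
    lambda f: f["symmetry_score"] > 0.7,
]

face     = {"eye_region_darker": True, "nose_bridge_brighter": True,
            "symmetry_score": 0.9}
non_face = {"eye_region_darker": True, "nose_bridge_brighter": False,
            "symmetry_score": 0.2}

print(cascade_match(face, stages))      # True
print(cascade_match(non_face, stages))  # False
```

The early-exit structure is the key design choice: since most image windows contain no face, rejecting them at the first failing stage keeps the algorithm fast.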
Deep Learning Image Recognition Models
In image recognition, the use of Convolutional Neural Networks (CNN) is also called Deep Image Recognition. CNNs are unmatched by traditional machine learning methods: they are not only faster and deliver the best detection results, but they can also detect multiple instances of an object within an image, even if the image is slightly warped, stretched, or otherwise altered.
In Deep Image Recognition, Convolutional Neural Networks even outperform humans in tasks such as classifying objects into fine-grained categories such as the particular breed of dog or species of bird.
The most popular deep learning models such as YOLO, SSD, and RCNN use convolution layers to parse an image or photo. During training, each layer of convolution acts like a filter that learns to recognize some aspect of the image before it is passed on to the next.
One layer processes colors, another layer shapes, and so on. In the end, a composite result of all these layers is collectively taken into account when determining if a match has been found.
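The filtering behavior of a single convolution layer can be illustrated with a plain 2-D convolution; the vertical-edge kernel below is a hand-picked example of the kind of low-level pattern early layers learn on their own:

```python
import numpy as np

def convolve2d(image, kernel):
    """'Valid' 2-D convolution: slide the kernel and sum elementwise products."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for y in range(oh):
        for x in range(ow):
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel)
    return out

# A vertical-edge kernel: responds where brightness changes left-to-right.
vertical_edge = np.array([[-1.0, 0.0, 1.0]] * 3)

image = np.zeros((5, 6))
image[:, 3:] = 1.0  # dark left half, bright right half
response = convolve2d(image, vertical_edge)
print(response)  # strongest values where the edge sits
```

In a trained network, the kernel weights are learned rather than hand-written, and each layer stacks many such filters whose responses feed the next layer.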
Popular Image Recognition Algorithms
For image recognition or photo recognition, a few algorithms are a cut above the rest. While all of these are deep learning algorithms, their fundamental approach towards how they recognize different classes of objects varies. Let’s take a look at some that are popularly used these days.
Faster Region-based CNN (Faster RCNN)
Faster RCNN (Region-based Convolutional Neural Network) is the best performer in the R-CNN family of image recognition algorithms, including R-CNN and Fast R-CNN.
It uses a Region Proposal Network (RPN) for feature detection along with Fast RCNN for image recognition, which makes it a significant upgrade over its predecessor (note: Fast RCNN vs. Faster RCNN). Faster RCNN can process an image in under 200 ms, while Fast RCNN takes 2 seconds or more.
Single Shot Detector (SSD)
RCNNs draw bounding boxes around a proposed set of points on the image, some of which may be overlapping. Single Shot Detectors discretize this concept by dividing the image up into default bounding boxes in the form of a grid over different aspect ratios.
It then combines the feature maps obtained from processing the image at the different aspect ratios to naturally handle objects of varying sizes. This makes SSDs very flexible, accurate, and easy to train. An implementation of SSD can process an image within 125ms.
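The default-box grid can be sketched as follows; the grid size, aspect ratios, and scale are illustrative values, not those of any particular SSD implementation:

```python
def default_boxes(grid_size=4, aspect_ratios=(1.0, 2.0, 0.5), scale=0.25):
    """Center-form default boxes (cx, cy, w, h) in [0, 1] image coordinates."""
    boxes = []
    for row in range(grid_size):
        for col in range(grid_size):
            cx = (col + 0.5) / grid_size  # box center in the cell's middle
            cy = (row + 0.5) / grid_size
            for ar in aspect_ratios:
                w = scale * (ar ** 0.5)   # wider boxes for ar > 1
                h = scale / (ar ** 0.5)   # taller boxes for ar < 1
                boxes.append((cx, cy, w, h))
    return boxes

boxes = default_boxes()
print(len(boxes))  # 4 * 4 * 3 = 48 candidate boxes for one feature map
```

A full SSD repeats this over several feature maps of different resolutions, which is how it handles both small and large objects.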
You Only Look Once (YOLO)
YOLO stands for You Only Look Once, and true to its name, the algorithm processes a frame only once using a fixed grid size and then determines whether a grid box contains an object or not.
For this purpose, the object detection algorithm uses a confidence metric and multiple bounding boxes within each grid box. However, it does not go into the complexities of multiple aspect ratios or feature maps, and thus, while this produces results faster, they may be somewhat less accurate than SSD.
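The confidence-based filtering step can be sketched like this; the prediction structure and threshold below are invented for the example:

```python
def filter_detections(grid_predictions, confidence_threshold=0.5):
    """Keep only grid-cell predictions whose confidence clears the threshold."""
    detections = []
    for cell in grid_predictions:
        for box in cell["boxes"]:
            if box["confidence"] >= confidence_threshold:
                detections.append((cell["row"], cell["col"], box))
    return detections

# Hypothetical raw output for a 2-cell slice of the grid.
predictions = [
    {"row": 0, "col": 0, "boxes": [{"confidence": 0.92, "class": "dog"},
                                   {"confidence": 0.10, "class": "dog"}]},
    {"row": 0, "col": 1, "boxes": [{"confidence": 0.30, "class": "cat"}]},
]

print(filter_detections(predictions))  # only the 0.92 "dog" box survives
```

Because every grid cell is evaluated in this single pass, the cost of one frame is fixed, which is why YOLO is fast.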
One of the most popular YOLO models is its third version, YOLOv3. The sleekest variant, Tiny YOLO, can process a video at up to 244 fps, or one image in about 4 ms.
How to apply Image Recognition
Image Recognition with Python
Python is the programming language of choice for most Computer Vision Engineers. It supports a huge number of libraries specifically designed for AI workflows – including image recognition.
- Step #1: To set up your computer for Python image recognition tasks, download Python and install the packages needed to run image recognition jobs, including Keras.
- Step #2: Keras is a high-level deep learning API for building AI applications. It runs on top of TensorFlow and helps end-users deploy machine learning and AI applications with easy-to-understand code.
- Step #3: If your machine does not have a graphics card, you can use free GPU instances online on Google Colab. For the purpose of classifying animals, there is a well-labeled dataset known as “Animals-10” that you can find on Kaggle. The dataset is totally free to download.
- Step #4: Once you have obtained the online dataset from Kaggle by getting an API token, you can then start coding in Python after reuploading the necessary files to Google Drive.
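Before any training code runs, the downloaded dataset has to be indexed into (image, label) pairs. Below is a minimal sketch of that step, assuming the common one-folder-per-class layout; the folder and file names are made up for the demonstration:

```python
import tempfile
from pathlib import Path

def index_image_folder(root):
    """Build (file path, class label) pairs from a class-per-folder dataset."""
    root = Path(root)
    class_names = sorted(p.name for p in root.iterdir() if p.is_dir())
    label_of = {name: i for i, name in enumerate(class_names)}
    samples = [
        (str(path), label_of[class_dir])
        for class_dir in class_names
        for path in sorted((root / class_dir).glob("*.jpg"))
    ]
    return samples, class_names

# Demo: fabricate a tiny dataset layout like <root>/<class>/<image>.jpg
base = Path(tempfile.mkdtemp())
for cls in ("cat", "dog"):
    (base / cls).mkdir()
    (base / cls / "001.jpg").touch()

samples, classes = index_image_folder(base)
print(classes)   # ['cat', 'dog']
print(samples)   # two (path, label) pairs
```

Sorting the class names makes the label assignment deterministic, so the same folder always maps to the same integer label across runs.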
For more details on platform-specific implementations, several well-written articles on the internet take you step-by-step through the process of setting up an environment for AI on your machine or on your Colab that you can use.
Image Recognition API (Cloud) vs. Edge AI
APIs provide an easy way to perform picture recognition by calling a cloud-based API service such as Amazon Rekognition (AWS Cloud). A popular way to perform object recognition on images is with the Google Vision API that can be used to perform object or face detection, text recognition, or handwriting recognition.
An Image Recognition API such as TensorFlow's Object Detection API is a powerful tool for developers to quickly build and deploy image recognition software, if the use case allows data offloading (sending visuals to a cloud server). Such an API can be used to retrieve information about the image itself (image classification) or about the objects it contains (object detection).
To learn about how image recognition APIs work, which one to choose, and the limitations of APIs for recognition tasks, I recommend you to check out our review of the best paid and free Computer Vision APIs in 2021.
While computer vision APIs can be used to process individual images, Edge AI systems are used to perform video recognition tasks in real-time, by moving machine learning in close proximity to the data source (Edge Intelligence). This allows real-time AI image processing as visual data is processed without data-offloading (uploading data to the cloud), allowing higher inference performance and robustness required for production-grade systems.
Image Recognition AI Platform
If you don’t want to start from scratch but would rather use pre-configured infrastructure, you might want to check out low-code AI vision platforms that provide popular open-source image recognition software out of the box. For example, Viso Suite is a computer vision platform to build and deploy real-time image recognition apps.
What is Image Recognition Used for?
Across industries, AI image recognition technology is becoming increasingly important. Its applications provide economic value in fields such as healthcare, retail, security, agriculture, and many more. For an extensive list of computer vision and image recognition applications, I recommend our list of the 56 Most Popular Computer Vision Applications in 2021.
Face Analysis and identification
Face analysis is a prominent image recognition application. Image recognition software employs deep convolutional neural networks for simultaneous face detection, face pose estimation, face alignment, gender recognition, smile detection, age estimation, and face recognition.
Facial analysis with computer vision allows systems to recognize identity, intentions, emotional and health states, age, or ethnicity. Some tools even aim to quantify levels of perceived attractiveness.
Related tasks include face image identification and face verification, which use vision processing methods to find and match a detected face against images of faces in a database. Deep learning recognition methods can identify people in photos or videos even as they age or under challenging illumination.
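The matching step can be sketched as a nearest-neighbor search over face embeddings; the 4-dimensional vectors and names below are purely illustrative (real models emit 128- or 512-dimensional embeddings):

```python
import numpy as np

def best_match(query_embedding, database):
    """Return the database identity whose embedding is closest (cosine)."""
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(database, key=lambda name: cosine(query_embedding, database[name]))

# Hypothetical enrolled identities with tiny made-up embeddings.
database = {
    "alice": np.array([0.9, 0.1, 0.0, 0.1]),
    "bob":   np.array([0.0, 0.8, 0.5, 0.1]),
}
query = np.array([0.85, 0.15, 0.05, 0.1])  # embedding of a new photo

print(best_match(query, database))  # alice
```

Embedding-based matching is what lets such systems tolerate aging and lighting changes: the network maps different photos of the same person to nearby vectors.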
One of the most popular and open-source software libraries to build AI face recognition applications is named DeepFace, which is able to analyze images and videos. To learn more about facial analysis with AI and video recognition, I recommend checking out our article about Deep Face Recognition.
Medical Image Analysis
Visual recognition technology is widely used in the medical industry to make computers understand images that are routinely acquired throughout the course of treatment. Medical image analysis is becoming a highly profitable subset of artificial intelligence. For example, there are multiple works regarding the identification of melanoma, a deadly skin cancer. Deep learning image recognition allows tumor monitoring across time, for example, to detect abnormalities in breast cancer scans.
Animal Monitoring in Farming
Agricultural visual AI systems use novel techniques trained to detect the type of animal and its actions. AI image recognition is used for animal monitoring in farming, where livestock can be monitored remotely for disease detection, changes in behavior, or giving birth.
Pattern and Objects Detection
AI photo recognition and video recognition technologies are useful for identifying people, patterns, logos, objects, places, colors, and shapes. The customizability of image recognition allows it to be used in conjunction with multiple software programs. For example, after an image recognition program is specialized to detect people, it can be used for people counting, a popular computer vision application in retail stores.
To learn everything you need to know about cutting-edge pattern detection, I recommend reading our article What is Pattern Recognition?.
Automated Plant Image Identification
Image-based plant identification has seen rapid development and is already used in research and nature management. A research paper from July 2021 analyzed the identification accuracy of image identification to determine plant family, growth forms, lifeforms, and regional frequency.
Results indicate high recognition accuracy, where 79.6% of the 542 species in about 1500 photos were correctly identified, while the plant family was correctly identified for 95% of the species.
Food Image Recognition
Deep learning image recognition of different types of food is applied for computer-aided dietary assessment. Computer vision systems were developed to improve the accuracy of current measurements of dietary intake by analyzing the food images captured by mobile devices.
The Future of Image Recognition
Currently, convolutional neural networks (CNN) such as ResNet and VGG are state-of-the-art in image recognition use cases.
In recent computer vision research, Vision Transformers (ViT) have been used for image recognition tasks and have shown promising results. ViT models match the accuracy of convolutional neural networks (CNNs) at up to 4x higher computational efficiency.
After reading about what image recognition is and how photo or picture recognition works, you might want to explore other articles related to this topic: