Top 6 Most Favored Object Detection Models in 2024 | YOLOv10, EfficientDet, DETR, etc.

DFRobot, Jun 29 2024

Object detection is an important field in computer vision and artificial intelligence, allowing computer programs to "see" their surroundings by identifying objects in images or videos. With advancements in deep learning technology, the accuracy of object detection has reached unprecedented levels. There are now many cutting-edge object detection models to choose from. This article will introduce and compare several popular object detection models for the year 2024. Whether you are a developer of computer vision or machine learning applications, or simply an enthusiast in the field, this article will help you choose the right model for your next project.

What are object detection models?

Object detection models are a class of machine learning models designed to automatically detect, locate, and identify specific objects in digital images or videos. These models learn features from data using deep learning techniques and apply the learned patterns to new input images to predict which objects are present in the image, as well as their exact locations and bounding boxes.

Common object detection algorithms fall into four main categories:

Traditional image processing techniques, such as edge detection for identifying boundaries, techniques for separating objects from the background, and the use of Histogram of Oriented Gradients (HOG) to describe shape and appearance based on gradient directions.
Single-stage deep learning algorithms, such as YOLO models, EfficientDet, and RetinaNet. These methods are generally faster than other types but often trade away some accuracy.
Two-stage deep learning algorithms, including various R-CNN models, which first propose candidate regions and then classify and refine them. They are generally slower than single-stage methods but often more accurate.
Transformer-based object detection algorithms, such as DETR, which utilize self-attention mechanisms to capture global dependencies in images, enabling end-to-end direct object recognition and localization.

Deep learning-based object detection models typically combine convolutional layers for feature extraction with specialized components like Region Proposal Networks (RPN) or anchor-based mechanisms to generate bounding boxes for objects of interest. Additionally, most of these models apply a post-processing step called Non-Maximum Suppression (NMS) to filter out redundant detections and improve overall detection accuracy.
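To make the NMS step concrete, here is a minimal sketch of the common greedy variant in plain Python. The box format (x1, y1, x2, y2), the function names, and the 0.5 IoU threshold are assumptions chosen for illustration; production libraries ship optimized versions.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Two near-duplicate detections of one object plus one separate object.
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
```

In this example the second box overlaps the first with an IoU of about 0.82, so it is suppressed and only the first and third boxes remain.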

These models can accurately and efficiently recognize objects in real-time, making them indispensable tools in applications such as autonomous driving, video surveillance, and object recognition.

Why is object detection needed?

Object detection models have applications in various fields, including:

Autonomous vehicles: Object detection systems can identify pedestrians, other vehicles, traffic lights, and obstacles in city roads, enabling autonomous vehicles to react correctly.
Surveillance and security: Object detection can be used at airports or border crossings to identify and track suspicious luggage or individuals, assisting security personnel in preventing and responding to security threats.
Robotics: Object detection on factory assembly lines allows robots to identify different parts and correctly pick and assemble them.
Augmented Reality (AR): In AR games, object detection can identify objects in the surrounding environment and overlay corresponding virtual game elements on them.
Retail and inventory management: Object detection in unmanned stores can track items taken by customers for automatic checkout.
Agriculture: Object detection using cameras carried by drones or robots can identify crops and weeds, guiding precise fertilization and weeding.
Healthcare: In radiology image analysis, object detection can automatically identify and label lung nodules or tumors, aiding in diagnosis by doctors.
Content moderation: Object detection on social media platforms can automatically identify and block explicit or violent image content.
Accessibility: Object detection in mobile applications can recognize the surrounding environment and provide voice navigation for visually impaired individuals.
Research analysis: In biological research, object detection can be used to automatically identify and count cells or other microscopic structures.

Object detection is a key component in automating tasks, enhancing security, and improving the efficiency of interpreting visual data. It plays a crucial role in developing intelligent systems across various industries. Therefore, the accuracy and processing speed of object detection are important metrics for evaluating computer vision application models.

There are various open-source and commercial models available in the market, and the following are some top object detection models to watch in 2024.

The 6 most popular object detection models in 2024

1. YOLO (YOLOv10)

YOLO (You Only Look Once) is a popular object detection model among computer vision and machine learning developers. YOLO adopts a single-stage object detection approach: it divides the image into a grid of equally sized cells and predicts, for each cell, the bounding boxes and class probabilities of the objects centered in it.

Originally developed by Joseph Redmon and colleagues, and continued in later versions by other teams including Ultralytics, YOLO represents a pioneering approach that combines speed and accuracy in object detection. YOLO treats object detection as a regression problem, directly predicting bounding boxes and class probabilities from input images in a single evaluation.
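As a rough illustration of the grid idea, the sketch below finds which cell is responsible for an object, given the center of its bounding box. The 7 x 7 grid and 448 x 448 input follow the original YOLOv1 setup; the function name and return format are hypothetical, and later YOLO versions use different grids and encodings.

```python
def responsible_cell(cx, cy, img_w, img_h, S=7):
    """Return (row, col) of the S x S grid cell containing the box center,
    plus the center's offsets measured from that cell's top-left corner."""
    gx = cx / img_w * S  # center x in grid units
    gy = cy / img_h * S  # center y in grid units
    col = min(int(gx), S - 1)  # clamp centers on the right/bottom edge
    row = min(int(gy), S - 1)
    return row, col, gx - col, gy - row


# An object centered at (224, 112) in a 448 x 448 image.
row, col, x_off, y_off = responsible_cell(cx=224, cy=112, img_w=448, img_h=448)
```

Here the center falls in row 1, column 3 of the grid, and the network would be trained to regress the offsets (0.5, 0.75) for that cell rather than absolute pixel coordinates.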

The main advantages of YOLO include:

Speed: By avoiding the region proposal step used in traditional object detectors, YOLO is very fast and can process images in real-time.
End-to-end training: YOLO trains on the entire image and directly optimizes detection performance.
Generalization: YOLO learns a generalized representation of objects and globally reasons about images.

The original YOLOv1 (2015) introduced this unified detection method. Subsequent versions have improved performance:

YOLOv2 (2016) introduced batch normalization, higher resolution, anchor boxes, and other enhancements.
YOLOv3 (2018) adopted the deeper Darknet-53 backbone and made predictions at three scales, which improved performance on small objects.
YOLOv4 (2020) improved backbone networks, activations, and loss functions.
YOLOv5 (2020) focused on simplicity and modularity for easier deployment.
YOLOv6 (2022) added new data augmentation, self-supervised methods, and model scaling.
YOLOv7 (2022) brought significant accuracy improvements through better backbone networks and training techniques.
YOLOv8 (2023) adopted an anchor-free detection head and a unified framework covering detection, segmentation, pose estimation, and classification.
YOLOv9 (2024) introduced the concept of Programmable Gradient Information (PGI) to reduce the information loss that occurs as data passes through deep networks.
YOLOv10 (2024) further improved YOLO's performance-efficiency boundary from post-processing and model architecture perspectives.

Figure: YOLOv8 tasks catalog

Although earlier versions like YOLOv3/v4 were once state-of-the-art, recent versions like YOLOv7/v8 have achieved top results on benchmarks like MS COCO while maintaining real-time speed suitable for applications like autonomous driving, surveillance, and robotics. YOLOv9, released in February 2024, combined Programmable Gradient Information (PGI) with the lightweight GELAN architecture, improving performance across model sizes from lightweight to large.

After the release of YOLOv9 in February, the baton of the YOLO (You Only Look Once) series passed to researchers at Tsinghua University in China.

At the end of May, YOLOv10 was launched. The research team proposed an overall efficiency-accuracy-driven model design strategy for YOLO, optimizing all components of YOLO from both efficiency and accuracy perspectives, greatly reducing computational costs and enhancing model capabilities.

Extensive experiments have shown that YOLOv10 achieves SOTA performance and efficiency across various model scales. For example, YOLOv10-S is 1.8 times faster than RT-DETR-R18 at a similar AP on COCO, while significantly reducing parameter count and FLOPs. Compared to YOLOv9-C, YOLOv10-B reduces latency by 46% and parameters by 25% while maintaining the same performance.

The simplicity, speed, and continuous improvements of the YOLO series have made it one of the most widely used and influential object detection frameworks to date.

2. EfficientDet:

Known for its efficiency and accuracy, leveraging EfficientNet as the backbone.

Figure: EfficientDet family architecture

EfficientDet, proposed by researchers at Google Brain in 2020, is a state-of-the-art object detection model that achieves high accuracy while being highly efficient in terms of model size and inference speed.

The key ideas behind EfficientDet include:

Compound Model Scaling: EfficientDet uses compound scaling, which jointly scales up all dimensions of the model (depth, width, resolution) using a single compound coefficient. This improves efficiency over conventional scaling methods that adjust one dimension at a time.
BiFPN (Bi-directional Feature Pyramid Network): It introduces a weighted bi-directional feature pyramid network that allows easy and accurate multi-scale feature fusion.
EfficientNet Backbone: EfficientDet leverages the powerful EfficientNet backbone, which is highly accurate and efficient compared to conventional backbones like ResNet.
AutoML: The EfficientNet backbone that EfficientDet builds on was found using neural architecture search, optimizing a metric that balances accuracy and efficiency.
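The weighted feature fusion used by BiFPN can be sketched as follows. This is a simplified illustration of the "fast normalized fusion" idea, assuming features are plain lists of equal length; in a real network they are tensors resized to a common resolution, and the weights are learned parameters. The function name is hypothetical.

```python
def fast_normalized_fusion(features, weights, eps=1e-4):
    """Fuse equally shaped feature vectors using non-negative weights
    that are normalized to sum to (approximately) one."""
    w = [max(0.0, wi) for wi in weights]  # ReLU keeps each weight >= 0
    total = sum(w) + eps                  # eps avoids division by zero
    n = len(features[0])
    return [
        sum(w[k] * features[k][i] for k in range(len(features))) / total
        for i in range(n)
    ]


# Two input features with equal weights: the result is their average.
fused = fast_normalized_fusion([[1.0, 2.0], [3.0, 4.0]], weights=[1.0, 1.0], eps=0.0)
```

Normalizing by the sum of the weights instead of applying a softmax keeps each weight's contribution bounded between 0 and 1 while avoiding the cost of exponentials.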

The EfficientDet architecture works as follows:

An EfficientNet backbone extracts multi-scale features from the input image.
The BiFPN integrates these multi-scale features in a bi-directional, top-down and bottom-up manner.
The integrated features are fed into a box/class prediction network to output the final detections.

EfficientDet models like EfficientDet-D7 achieved state-of-the-art accuracy on the challenging COCO dataset when introduced, while being considerably smaller and cheaper in computation than previous detectors of similar accuracy. The model family covers a wide range of resource constraints, from mobile devices to high-end accelerators.

The compound scaling method enables a simple way to scale up EfficientDet models for higher accuracy or scale them down for faster mobile deployment. This flexibility, combined with state-of-the-art performance, has made EfficientDet a popular choice for many object detection applications.
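As a sketch of how one coefficient can drive every dimension, the function below applies the scaling rules as recalled from the EfficientDet paper. Treat the constants as an assumption to verify against the paper: the published configurations round the BiFPN width to hardware-friendly values and deviate from the formulas for the largest models. The function and key names are hypothetical.

```python
def efficientdet_config(phi):
    """Approximate EfficientDet-D{phi} sizes from one compound coefficient."""
    return {
        "bifpn_width": 64 * 1.35 ** phi,      # channels, before rounding
        "bifpn_depth": 3 + phi,               # number of BiFPN layers
        "head_depth": 3 + phi // 3,           # layers in box/class networks
        "input_resolution": 512 + 128 * phi,  # input image side in pixels
    }


d0 = efficientdet_config(0)
d3 = efficientdet_config(3)
```

Raising `phi` grows the feature network, the prediction heads, and the input resolution together, which is what lets the same design be scaled down for mobile use or up for higher accuracy.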

3. RetinaNet:

Introduces the Focal Loss to handle class imbalance.

RetinaNet, proposed by researchers from Facebook AI Research in 2017, is a highly efficient and accurate one-stage object detection model. It addressed several shortcomings of previous one-stage detectors like YOLO and SSD.

The key innovations in RetinaNet include:

Focal Loss: RetinaNet introduced a novel loss function called Focal Loss to address the foreground-background class imbalance during training. This focuses training on the hard, misclassified examples and prevents easy negatives from overwhelming the loss.
Feature Pyramid Network (FPN): It utilizes a Feature Pyramid Network that combines low and high-level feature maps to detect objects across a wide range of scales efficiently.
Dense Anchor-Based Prediction: As a one-stage detector, RetinaNet has no separate region proposal step. It evaluates a dense set of anchor boxes at every level of the feature pyramid in a single pass, then keeps the confident predictions and merges overlapping ones with Non-Maximum Suppression.
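The Focal Loss from the list above can be written in a few lines. This is a minimal sketch for a single binary prediction, using the defaults reported in the RetinaNet paper (alpha = 0.25, gamma = 2); the function name is hypothetical, and real implementations operate on batches of logits.

```python
import math


def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss for one prediction.

    p: predicted probability of the positive class, strictly in (0, 1).
    y: ground-truth label, 1 (object) or 0 (background).
    """
    p_t = p if y == 1 else 1.0 - p            # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha
    # (1 - p_t) ** gamma shrinks the loss of well-classified examples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy = focal_loss(0.9, 1)  # confident and correct: tiny loss
hard = focal_loss(0.1, 1)  # confident and wrong: large loss
```

With gamma set to 0 the modulating factor disappears and the expression reduces to alpha-weighted cross-entropy, which makes the role of gamma easy to see.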

The RetinaNet architecture works as follows:

A backbone network like ResNet extracts feature maps from the input image.
A Feature Pyramid Network combines these multi-scale feature maps in a top-down and lateral fashion.
In parallel, two subnetworks predict object classification and bounding box regression across different scales.
Focal Loss is applied to the predicted classifications to focus on hard examples.

RetinaNet achieved state-of-the-art results on the COCO benchmark when introduced, outperforming previous one-stage and two-stage detectors in accuracy while being faster than two-stage models. Its ability to robustly detect small and large objects made it suitable for various real-world applications.

While more recent architectures have advanced further, RetinaNet's impact stems from its elegant solutions to key challenges in one-stage detection, like class imbalance and multi-scale detection. Its design principles of improving representation and supervision have influenced many subsequent object detectors.

4. Faster R-CNN:

A highly accurate model that uses Region Proposal Networks (RPN).

Faster R-CNN, proposed in 2015 by Shaoqing Ren et al., is a highly influential two-stage object detection model that significantly improved upon its predecessors like R-CNN and Fast R-CNN.

The key innovations in Faster R-CNN include:

Region Proposal Network (RPN): This neural network component efficiently proposes regions of interest (ROIs) that potentially contain objects, replacing slow selective search algorithms used earlier.
Region-based CNN: Like Fast R-CNN, Faster R-CNN uses convolutional features from the whole image to classify and regress bounding boxes for each proposed ROI.
End-to-end training: Both the RPN and the region-based CNN are trained jointly in an end-to-end fashion using a multi-task loss.
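To illustrate what the RPN scores, the sketch below generates the reference boxes (anchors) placed at one sliding-window position. The three scales and three aspect ratios match the defaults described in the Faster R-CNN paper, giving 9 anchors per position; the function name and box format are assumptions for illustration.

```python
import math


def anchors_at(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return (x1, y1, x2, y2) anchors centered at (cx, cy).

    ratio is height / width; every anchor of a given scale has area scale ** 2.
    """
    boxes = []
    for s in scales:
        for r in ratios:
            w = s / math.sqrt(r)
            h = s * math.sqrt(r)
            boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return boxes


anchors = anchors_at(300, 300)
```

For each anchor, the RPN predicts an objectness score and a set of offsets that adjust the anchor toward a nearby object.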

The Faster R-CNN architecture works as follows:

A base convolutional network (e.g., VGG-16, ResNet) extracts feature maps from the input image.
The Region Proposal Network (RPN) processes these feature maps to propose candidate object bounding boxes (ROIs).
The ROIs are pooled into fixed-size feature maps using RoIPool (later implementations often use RoIAlign, which was introduced with Mask R-CNN).
These pooled features are passed to separate fully connected networks to predict the class and bounding box offsets.
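The "bounding box offsets" in the last step are usually expressed relative to a reference box. The sketch below decodes them with the parameterization used in the R-CNN family, where center shifts are scaled by the reference size and width and height are scaled exponentially. The function name and box format are hypothetical.

```python
import math


def decode_box(ref, deltas):
    """Apply predicted offsets (tx, ty, tw, th) to a reference box.

    ref is (x1, y1, x2, y2); the result uses the same format.
    """
    wa, ha = ref[2] - ref[0], ref[3] - ref[1]
    xa, ya = ref[0] + wa / 2, ref[1] + ha / 2
    tx, ty, tw, th = deltas
    x, y = xa + tx * wa, ya + ty * ha            # shift the center
    w, h = wa * math.exp(tw), ha * math.exp(th)  # rescale width and height
    return (x - w / 2, y - h / 2, x + w / 2, y + h / 2)


unchanged = decode_box((0, 0, 10, 10), (0.0, 0.0, 0.0, 0.0))
shifted = decode_box((0, 0, 10, 10), (0.1, 0.0, 0.0, 0.0))
```

Zero offsets leave the reference box unchanged, and an x-offset of 0.1 moves a 10-pixel-wide box one pixel to the right.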

Faster R-CNN achieved state-of-the-art object detection accuracy on benchmarks like PASCAL VOC and MS COCO when it was introduced, while being much faster than its R-CNN predecessors. Its two-stage design allowed precise localization of objects.

Despite being superseded by newer one-stage models like YOLO and SSD in terms of speed, Faster R-CNN laid the groundwork for many subsequent region-based CNN detectors. Its impact was amplified by influential follow-ups like Mask R-CNN (for instance segmentation) and its extensions to other vision tasks.

Faster R-CNN's accuracy and architectural innovations cemented its status as a landmark model that advanced the field of object detection and visual recognition.

5. Mask R-CNN:

An extension of Faster R-CNN that adds a branch for predicting segmentation masks.

Mask R-CNN, proposed in 2017 by Kaiming He et al., is an extension of the highly successful Faster R-CNN model for the task of instance segmentation. It not only predicts the bounding boxes around objects like Faster R-CNN, but also generates pixel-wise masks for each instance.

The key innovations in Mask R-CNN include:

Instance Segmentation: In addition to bounding box recognition, Mask R-CNN adds a branch for predicting an object mask in parallel with the existing branches for classification and bounding box regression.
RoIAlign: It introduces RoIAlign, an improved version of RoIPool used in Faster R-CNN, to properly align extracted features with the input, improving mask quality.
Parallel Branches: The model has three parallel branches - one each for classification, bounding box regression, and mask prediction - making it a multi-task model.
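The core of RoIAlign is that it samples the feature map at fractional coordinates instead of rounding them to the nearest cell as RoIPool does. A minimal sketch of that bilinear sampling step follows; a full RoIAlign layer would average several such samples per output bin. The function name and the nested-list feature map are assumptions for illustration.

```python
import math


def bilinear_sample(fmap, x, y):
    """Sample a 2D feature map fmap[row][col] at fractional position (x, y)."""
    h, w = len(fmap), len(fmap[0])
    x = min(max(x, 0.0), w - 1.0)  # clamp to the valid range
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    dx, dy = x - x0, y - y0
    # Interpolate along x on the two neighboring rows, then along y.
    top = fmap[y0][x0] * (1 - dx) + fmap[y0][x1] * dx
    bottom = fmap[y1][x0] * (1 - dx) + fmap[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


fmap = [[0.0, 1.0], [2.0, 3.0]]
center = bilinear_sample(fmap, 0.5, 0.5)
```

Sampling at the exact center of the four cells returns their average, 1.5, whereas rounding the coordinates would have picked a single cell and shifted the feature by up to half a cell.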

The Mask R-CNN architecture works as follows:

A CNN backbone extracts feature maps from the input image.
A Region Proposal Network (RPN) proposes candidate object bounding boxes (regions of interest, or ROIs).
ROIs are pooled into fixed-size features using RoIAlign.
Parallel branches predict the class label, bounding box offsets, and a binary mask for each ROI.

Mask R-CNN achieved state-of-the-art results on the challenging COCO instance segmentation benchmark when introduced, significantly outperforming previous methods. Its ability to generate high-quality masks along with bounding boxes made it suitable for applications requiring precise instance segmentation.

Beyond instance segmentation, Mask R-CNN has been extended to other tasks such as human pose estimation, where the mask branch is adapted to predict keypoints (a variant often called Keypoint R-CNN), showing its versatility as a general framework for object detection and segmentation tasks.

Mask R-CNN's accuracy, robust design, and widespread adoption have solidified its status as one of the most influential models in the field of instance-level recognition and a key milestone in the development of advanced computer vision systems.

6. DETR (Detection Transformer):

Uses transformers for object detection, providing a new approach to the task.

Figure: DETR (Detection Transformer) model

DETR, short for DEtection TRansformer, is a pioneering object detection model proposed in 2020 by researchers from Facebook AI Research. It was among the first models to successfully apply the transformer architecture to the object detection task in a simple and effective manner.

The key ideas behind DETR include:

Transformer Encoder-Decoder: DETR adapts the transformer encoder-decoder design from neural machine translation, using it to attend to the input image and directly output final predictions in parallel.
Set Prediction: Instead of predicting bounding boxes independently, DETR reasons about the set of predictions/objects jointly using global attention.
Bipartite Matching Loss: It introduces a new loss function that performs optimal bipartite matching between predicted and ground truth objects.
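The bipartite matching step can be illustrated with a small example. DETR solves the assignment with the Hungarian algorithm; the sketch below swaps in a brute-force search over permutations, which gives the same optimal assignment but is only practical for a handful of objects. The cost values and function name are made up for illustration.

```python
from itertools import permutations


def match(cost):
    """Find the one-to-one assignment of predictions to ground-truth objects
    with the lowest total cost.

    cost[i][j] is the cost of assigning prediction i to ground-truth j, and
    there must be at least as many predictions as ground-truth objects.
    Returns (list of (prediction, ground_truth) pairs, total cost).
    """
    n_pred, n_gt = len(cost), len(cost[0])
    best, best_cost = None, float("inf")
    for preds in permutations(range(n_pred), n_gt):
        total = sum(cost[p][g] for g, p in enumerate(preds))
        if total < best_cost:
            best, best_cost = preds, total
    return [(p, g) for g, p in enumerate(best)], best_cost


# Three predictions, two ground-truth objects.
cost = [[0.9, 0.1],
        [0.2, 0.8],
        [0.7, 0.6]]
pairs, total = match(cost)
```

Prediction 1 is matched to object 0 and prediction 0 to object 1; the unmatched prediction 2 would be trained to output the "no object" class. Because each ground-truth object is matched to exactly one prediction, duplicate detections are penalized during training.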

The DETR architecture works as follows:

A CNN backbone extracts a compact feature map from the input image.
A transformer encoder processes this feature map, building a rich representation.
A transformer decoder then attends to the encoder outputs and generates the final set of predictions in parallel.
The predictions consist of class labels and bounding boxes; an optional mask head can be added on top for segmentation.

When introduced, DETR matched the performance of the well-established Faster R-CNN detector while being much simpler and more parallelizable. It showed transformers' potential for high-level computer vision tasks beyond image classification.

While DETR was slower than traditional detectors, it inspired a flurry of follow-up work improving its speed, accuracy, and extending it to tasks like panoptic segmentation. Deformable DETR, Efficient DETR, and Anchor DETR built upon its core transformer-based detection ideas.

DETR's powerful set-based global reasoning capability and seamless integration of auxiliary outputs like masks/keypoints enabled an elegant, unified vision transformer framework. Its impact goes beyond just object detection, sparking wider use of transformers for various vision tasks.

Summary

This article introduces several popular object detection models and compares them.

Figure: Comparison of popular object detection models

How to choose?

High real-time requirements: Choose the YOLO series.
Limited resources (such as mobile devices): Choose EfficientDet.
High precision requirements: Choose Faster R-CNN or Mask R-CNN.
Need to perform detection and segmentation simultaneously: Choose Mask R-CNN.
Complex scenes and global relationship modeling: Choose DETR.

According to specific application requirements and hardware configurations, choosing the most suitable model can achieve the best balance between performance and efficiency.

If you are interested in the latest research advancements, you can also follow important conferences in computer vision and pattern recognition such as CVPR (Conference on Computer Vision and Pattern Recognition) and ICCV (International Conference on Computer Vision), where the latest advancements and new applications of object detection models are frequently presented.