The YOLO (You Only Look Once) series of algorithms is one of the most influential and widely used deep learning models in the field of object detection. Since the introduction of YOLOv1, the YOLO series has gained rapid recognition and application in both academia and industry due to its efficient real-time detection capabilities and high accuracy. The core idea of the YOLO series is to transform the object detection problem into a single regression problem, directly predicting the bounding boxes and class probabilities in an image through a neural network, thereby achieving fast and accurate object detection.
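To make this single-regression formulation concrete, the short sketch below computes the size of a YOLO-style output tensor. The grid size, boxes per cell, and class count used here are the original YOLOv1 settings (7×7 grid, 2 boxes, 20 PASCAL VOC classes); later versions change these numbers but keep the same one-pass regression idea.

```python
# YOLO regresses, in a single forward pass, an S x S x (B * 5 + C) tensor:
# for each grid cell, B boxes (x, y, w, h, objectness) plus C class scores.
S, B, C = 7, 2, 20  # YOLOv1 settings: 7x7 grid, 2 boxes per cell, 20 classes

per_cell = B * 5 + C            # values regressed per grid cell
output_elements = S * S * per_cell  # values regressed for the whole image

print(per_cell)         # 30
print(output_elements)  # 1470
```

Every detection the network can produce is encoded in this one tensor, which is why a single network evaluation suffices per image.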
With the continuous development of the YOLO series, each generation has seen significant improvements and optimizations in terms of model architecture, detection performance, and application scenarios. YOLOv8, as one of the most widely used versions, has garnered substantial user favor due to its efficient detection performance and low computational resource requirements. The recently released YOLOv10 further enhances the model's detection accuracy and inference speed, while also making important improvements in model architecture and optimization strategies. Therefore, this paper will provide an in-depth comparison of YOLOv10 and YOLOv8, analyzing their differences in model size, performance metrics, and hardware requirements to help readers better understand and choose the most suitable YOLO version for their application scenarios.
YOLOv8 Model Architecture and Size
The architecture of YOLOv8 is mainly divided into three parts: Backbone, Neck, and Head.
Backbone: YOLOv8 uses CSPDarknet53 as its backbone network. This network improves information flow between different network stages through Cross-Stage Partial connections, enhancing gradient flow during training and thereby improving accuracy.
Neck: The Neck structure, also known as the feature extractor, is responsible for merging feature maps from different stages of the backbone network to capture multi-scale information. YOLOv8 employs the novel C2f module, which combines high-level semantic features with low-level spatial information, particularly improving accuracy in small object detection.
Head: The Head is responsible for making predictions. YOLOv8 uses multiple detection modules, which predict bounding boxes, objectness scores, and class probabilities for each grid cell in the feature map. These predictions are then aggregated to obtain the final detection results.
YOLOv8 also introduces several key innovations such as spatial attention mechanisms, feature fusion, bottlenecks, and the SPPF (Spatial Pyramid Pooling Fast) layer, as well as data augmentation and mixed precision training, all of which enhance the model's performance and efficiency.
YOLOv8 Architecture
| Model | Size (pixels) | mAP val 50-95 | Speed CPU ONNX (ms) | Speed A100 TensorRT (ms) | Params (M) | FLOPs (B) |
|---|---|---|---|---|---|---|
| YOLOv8n | 640 | 37.3 | 80.4 | 0.99 | 3.2 | 8.7 |
| YOLOv8s | 640 | 44.9 | 128.4 | 1.20 | 11.2 | 28.6 |
| YOLOv8m | 640 | 50.2 | 234.7 | 1.83 | 25.9 | 78.9 |
| YOLOv8l | 640 | 52.9 | 375.2 | 2.39 | 43.7 | 165.2 |
| YOLOv8x | 640 | 53.9 | 479.1 | 3.53 | 68.2 | 257.8 |
YOLOv10 Model Architecture and Size
YOLOv10 introduces three key innovations over earlier versions of the series:
1. NMS-Free Training: Utilizes consistent dual assignments to eliminate the need for NMS, reducing inference latency.
2. Holistic Model Design: Comprehensive optimization of various components from both efficiency and accuracy perspectives, including lightweight classification heads, spatial-channel decoupled downsampling, and rank-guided block design.
3. Enhanced Model Capabilities: Incorporates large-kernel convolutions and partial self-attention modules to improve performance without significant computational cost.
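To see what point 1 actually removes, the sketch below is a minimal implementation of the classical greedy NMS post-processing step that earlier YOLO versions run after inference and that YOLOv10's consistent dual assignments make unnecessary. The box format and threshold are illustrative assumptions, not YOLOv10's real pipeline.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression: keep the highest-scoring box,
    drop every remaining box that overlaps it above iou_thresh, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thresh]
    return keep

# Two heavily overlapping detections of one object, plus one distinct box:
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # -> [0, 2]: the duplicate (index 1) is suppressed
```

Because this filtering loop runs on the CPU after every inference, eliminating it removes a latency component that does not shrink as the GPU gets faster.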
YOLOv10 Model Architecture
| Model | Input Size | AP val (%) | FLOPs (G) | Latency (ms, TensorRT FP16, T4 GPU) |
|---|---|---|---|---|
| YOLOv10-N | 640 | 38.5 | 6.7 | 1.84 |
| YOLOv10-S | 640 | 46.3 | 21.6 | 2.49 |
| YOLOv10-M | 640 | 51.1 | 59.1 | 4.74 |
| YOLOv10-B | 640 | 52.5 | 92.0 | 5.74 |
| YOLOv10-L | 640 | 53.2 | 120.3 | 7.28 |
| YOLOv10-X | 640 | 54.4 | 160.4 | 10.70 |
Performance Metrics Comparison
On the COCO dataset, YOLOv10-S is 1.8 times faster than RT-DETR-R18 at comparable accuracy, while YOLOv10-B delivers 46% lower latency and 25% fewer parameters at the same level of performance. Taken together, these benchmark results indicate that YOLOv10 offers a better accuracy-efficiency trade-off than its predecessors, including YOLOv8.
On a test platform equipped with an Intel Core i7 processor and an NVIDIA GeForce RTX 3060 GPU, YOLOv10 demonstrated significant efficiency advantages.
(Images sourced from this YouTube video)
Real-Time Recognition Frame Rate Comparison
Based on the test results below, YOLOv10 did not show a clear inference-speed improvement over YOLOv8 on platforms such as the AMD CPU, LattePanda Mu, and Jetson Orin 64GB. This may be because YOLOv10's architecture is more complex and leans more heavily on floating-point computation, which reduces throughput when inference runs on CPUs. We look forward to further optimizations that let YOLOv10 run more efficiently on these platforms.
| Platform | YOLOv8n (fps) | YOLOv10n (fps) | Notes |
|---|---|---|---|
| ThinkPad (AMD CPU) | 7.49 | 4.30 | original model |
| LattePanda Mu | 5.72 | 1.98 | original model |
| LattePanda Mu | 4.37 | 2.00 | Int8, ONNX |
| LattePanda Mu | 10.00 | 7.40 | Int8, ONNX, OpenVINO |
| LattePanda Delta 3 | 4.00 | 4.00 | original model |
| LattePanda Delta 3 | 2.7-3.5 | 3.4-4.1 | Int8, ONNX |
| LattePanda Sigma | 49.60 | 43.90 | Int8, ONNX, OpenVINO |
In practice, we need to select a YOLO model that both achieves our detection goals and fits the constraints of the target hardware. YOLOv8 is offered in five sizes (n, s, m, l, and x), and YOLOv10 additionally provides a sixth variant, B, sitting between M and L.
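One simple way to frame this selection is as a constrained search: take the smallest variant that clears an accuracy floor while staying inside a parameter budget. The sketch below does exactly that using the YOLOv10 figures from the comparison table in this article; the thresholds in the example call are arbitrary illustrations, not recommendations.

```python
# (variant, params in M, AP val %) -- YOLOv10 values from the comparison table.
YOLOV10 = [
    ("N", 2.3, 39.5),
    ("S", 7.2, 46.8),
    ("M", 15.4, 51.3),
    ("L", 24.4, 53.4),
    ("X", 29.5, 54.4),
]

def pick_variant(models, min_ap, max_params):
    """Return the smallest variant meeting the accuracy floor within budget.

    Relies on `models` being ordered smallest to largest, so the first
    match is automatically the lightest option.
    """
    for name, params, ap in models:
        if ap >= min_ap and params <= max_params:
            return name
    return None  # no variant satisfies both constraints

# Example: at least 50% AP val within a 20 M-parameter budget -> "M"
print(pick_variant(YOLOV10, min_ap=50.0, max_params=20.0))
```

The same table-driven approach works for any other constraint pair, e.g. a FLOPs ceiling for an embedded board or a latency target for a real-time pipeline.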
First, in terms of model size, the parameter counts and floating-point operations of the YOLOv10 series are significantly lower than those of the YOLOv8 series at every corresponding size (N, S, M, L, X). For example, YOLOv10-X has less than half the parameters of YOLOv8-X, and its FLOPs are about 38% lower. This lightweight design makes YOLOv10 more efficient to store and transfer, which is especially attractive on resource-constrained hardware platforms.
In terms of accuracy, the average precision (AP val) of the YOLOv10 series on the validation set is consistently higher than that of the YOLOv8 series, showing that YOLOv10 maintains or improves detection accuracy even while reducing parameters and computational complexity.
From an inference-latency perspective, the YOLOv10 series exhibits lower end-to-end latency across all model sizes, while forward-pass latency is roughly equal; most of the end-to-end gap comes from removing the NMS post-processing step. For example, YOLOv10-N runs at 1.84 ms end to end versus 6.16 ms for YOLOv8-N, which makes YOLOv10 more efficient in real-time object detection applications.
| Model | Params (M) | FLOPs (G) | AP val (%) | Latency (ms) | Latency, forward (ms) |
|---|---|---|---|---|---|
| YOLOv8-N | 3.2 | 8.7 | 37.3 | 6.16 | 1.77 |
| YOLOv10-N | 2.3 | 6.7 | 39.5 | 1.84 | 1.79 |
| YOLOv8-S | 11.2 | 28.6 | 44.9 | 7.07 | 2.33 |
| YOLOv10-S | 7.2 | 21.6 | 46.8 | 2.49 | 2.39 |
| YOLOv8-M | 25.9 | 78.9 | 50.6 | 9.50 | 5.09 |
| YOLOv10-M | 15.4 | 59.1 | 51.3 | 4.74 | 4.63 |
| YOLOv8-L | 43.7 | 165.2 | 52.9 | 12.39 | 8.06 |
| YOLOv10-L | 24.4 | 120.3 | 53.4 | 7.28 | 7.21 |
| YOLOv8-X | 68.2 | 257.8 | 53.9 | 16.86 | 12.83 |
| YOLOv10-X | 29.5 | 160.4 | 54.4 | 10.70 | 10.60 |
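The latency gaps above can be summarized as per-size speedup factors. The short calculation below derives them directly from the table's latency column; it is pure arithmetic on the published values, not a new benchmark.

```python
# (YOLOv8 end-to-end latency, YOLOv10 end-to-end latency) in ms, from the table.
latency = {"N": (6.16, 1.84), "S": (7.07, 2.49), "M": (9.5, 4.74),
           "L": (12.39, 7.28), "X": (16.86, 10.7)}

# Speedup factor of YOLOv10 over YOLOv8 at each size, rounded to 2 decimals.
speedup = {size: round(v8 / v10, 2) for size, (v8, v10) in latency.items()}

for size, factor in speedup.items():
    print(f"{size}: YOLOv10 is {factor}x faster end-to-end")
# N: 3.35x ... X: 1.58x
```

Note how the advantage shrinks as the models grow: for the N size most of YOLOv8's latency is post-processing overhead, while for the X size the forward pass itself dominates.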
Model Size Comparison
In the tests above, real-time requirements led us to the N variants, which demand relatively little compute and can therefore be tested on a wide range of hardware platforms. With the ultra-lightweight YOLOv8n and YOLOv10n models we reached frame rates of up to 10 fps (Int8, ONNX, OpenVINO on the LattePanda Mu). YOLOv8-X and YOLOv10-X, run on an Intel Core i7 processor with an NVIDIA GeForce RTX 3060 GPU, achieved 30 and 36 fps respectively.
If inference is too slow, consider exporting the model to ONNX (Open Neural Network Exchange) format or to a hardware-specific format such as TensorRT for NVIDIA GPUs. You can also quantize the model during export, converting floating-point weights and activations to low-precision integers, which reduces model size and accelerates inference. The input size should likewise be adjusted to the actual task so that detection quality is preserved.
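The size reduction from quantization can be illustrated without any deep learning framework. The sketch below applies the standard symmetric int8 scheme (scale = max|w| / 127) to a toy weight list; real exporters such as ONNX Runtime or OpenVINO use calibrated, per-channel variants of this same idea, so treat this as a conceptual illustration only.

```python
def quantize_int8(weights):
    """Symmetric int8 quantization: map floats in [-max|w|, max|w|] to [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the int8 codes."""
    return [v * scale for v in q]

weights = [0.31, -0.92, 0.04, 1.27, -0.58]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Each weight shrinks from 4 bytes (float32) to 1 byte (int8): a 4x size
# reduction, at the cost of a small per-weight rounding error.
print(q)  # -> [31, -92, 4, 127, -58]
print(max(abs(w - r) for w, r in zip(weights, restored)))
```

This is why Int8 exports in the frame-rate table above run faster and load smaller models, at a usually negligible accuracy cost.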
This article provides an in-depth comparison of YOLOv10 and YOLOv8, analyzing their differences in model size, performance metrics, and hardware requirements. The results show that although YOLOv10 has improved in terms of model complexity, floating-point operations, and accuracy, it did not significantly surpass YOLOv8 in inference speed on certain platforms, such as AMD CPU, LattePanda MU, and Jetson Orin 64GB. This may be due to the higher complexity and precision requirements of YOLOv10, as well as the lack of extensive optimization for these specific platforms.
In practical applications, users should choose the YOLO version that fits their specific scenario and hardware resources. For hardware with limited resources, both YOLOv8 and YOLOv10 offer a range of model sizes to choose from. Additionally, to improve inference speed, consider exporting the model to ONNX format or another hardware-specific format, or quantizing the model during the export process.