As large language models continue to advance, more and more people are looking for hardware powerful enough to study them. Among the graphics cards on the market, especially those with large memory, NVIDIA is unmatched, but high prices and scarce supply make them hard to obtain. Similarly, in the field of intelligent driving, many are keen to harness the Jetson Orin to enhance assisted driving. But just how capable is Orin? Today we take the Orin 64GB developer kit, rated at an impressive 275 TOPS of compute, for a hands-on test.
Let's first acquaint ourselves with the NVIDIA Jetson AGX Orin.
Product Name: NVIDIA Jetson AGX Orin 64GB Developer Kit
SKU: DFR1079
AI Performance: Up to 275 TOPS
GPU Architecture: NVIDIA Ampere
Features: Deep learning and vision accelerators, high-speed I/O, and high memory bandwidth
AI Model Applications: Natural language understanding, 3D perception, and multi-sensor fusion
Software Support: Runs the entire Jetson software stack
Development Tools: NVIDIA Omniverse Replicator, NVIDIA TAO Toolkit
This test focuses on the NVIDIA Jetson AGX Orin 64GB version and primarily evaluates its AIGC large language model inference capabilities.
We start with a subjective assessment of how easy the kit is to set up and use, along with suggestions to help users get familiar with it more quickly.
We then run LLM-related models and present objective data on Orin's LLM inference performance.
We followed the official documentation for booting and installation.
To be honest, the official documentation is detailed and clear, making the entire boot process incredibly smooth.
If you're a novice, I recommend directly following the official documentation provided below to complete the main configurations.
https://developer.nvidia.com/embedded/learn/get-started-jetson-agx-orin-devkit
Please prepare a DP cable or a DP to HDMI adapter in advance. Operating without a screen can complicate the process significantly. It is highly recommended to prepare a monitor, mouse, and keyboard, as these are essential for a smooth startup and usage experience.
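One extra tip before benchmarking: it helps to confirm which L4T/JetPack release is installed and to switch the kit into its highest power mode, otherwise throughput numbers can vary. A minimal sketch of the commands involved; these are not part of the getting-started guide above, and the power-mode numbering can differ between JetPack releases, so check it with nvpmodel -q first:
Bash
# Show the L4T release installed by JetPack
cat /etc/nv_tegra_release
# Show the current power mode
sudo nvpmodel -q
# Switch to the maximum-performance mode (mode 0 is MAXN on AGX Orin; verify with the query above)
sudo nvpmodel -m 0
# Lock CPU/GPU clocks at their maximum for repeatable benchmark runs
sudo jetson_clocks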
We then tested large language model (LLM) inference.
For comparability, we focused on the Llama 2 model. By running several quantized variants of Llama and recording token throughput for each, we can better understand how they perform on Jetson.
Let's look at the final table:

Model                                  Q1 speed (tokens/s)   Q1 length (tokens)   Q2 speed (tokens/s)   Q2 length (tokens)
llama-2-7b-chat.ggmlv3.q2_K (2-bit)    1.84                  38                   1.78                  864
llama-2-7b-chat.ggmlv3.q4_1 (4-bit)    1.32                  20                   1.62                  588
Llama-2-13B-chat-GPTQ                  1.53                  24                   1.65                  297
From the table above, we can observe differences in processing speed and response length among different models and quantization methods.
The 2-bit quantized llama-2-7b model achieved the highest response speed of 1.84 tokens/s on the first question, but its speed slightly decreased to 1.78 tokens/s on the second question. The 4-bit quantized llama-2-7b model processed at speeds of 1.32 tokens/s and 1.62 tokens/s for the two questions, respectively, showing a decrease in speed compared to the 2-bit quantization.
The GPTQ model processed at speeds of 1.53 tokens/s and 1.65 tokens/s for the two questions, indicating a relatively stable performance, albeit lower than the 2-bit quantized llama-2-7b model.
In terms of response length, the 2-bit quantized llama-2-7b model gave by far the longest answer to the second question, at 864 tokens, while the 4-bit quantized llama-2-7b model and the GPTQ model produced comparatively shorter responses.
Overall, the 2-bit quantized llama-2-7b model exhibited good processing speed, but its longer responses could impact user experience. The 4-bit quantized llama-2-7b model and GPTQ model were slightly slower, but their response lengths were more reasonable. Particularly, the GPTQ model maintained stable processing speeds and response lengths for both questions, potentially offering users a more consistent and predictable experience. Choosing a model and quantization method may require a balance between specific application scenarios and user needs.
In conclusion, the NVIDIA Jetson AGX Orin is capable of running the 13-billion-parameter Llama model, with the GPTQ build delivering the most stable performance. The 2-bit quantized 7-billion-parameter model is faster, but the speedup is limited. Therefore, unless newer quantization methods become available, the 13-billion-parameter Llama model is worth considering, since larger models typically give better answers.
Let's now delve into the specifics of the testing process.
The GGML format is mainly used for CPU + GPU inference with llama.cpp and with the libraries and interfaces that support this format.
Source: TheBloke/Llama-2-7B-Chat-GGML
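The GGML runs below use llama.cpp. For readers who want to reproduce them, here is a rough sketch of one way to fetch and run the 2-bit model, assuming a GGML-era checkout of llama.cpp (newer revisions have moved to the GGUF format and different build options, so treat the exact flag names as version-dependent):
Bash
# Build llama.cpp with CUDA acceleration (cuBLAS flag as used in GGML-era revisions)
git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp
make LLAMA_CUBLAS=1
# Download the 2-bit GGML model from the Hugging Face repo above
wget https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML/resolve/main/llama-2-7b-chat.ggmlv3.q2_K.bin
# Run the prompt, offloading all layers to the GPU (-ngl); llama.cpp prints token counts and tokens/s at the end
./main -m llama-2-7b-chat.ggmlv3.q2_K.bin -ngl 100 -c 2048 \
  -p "[INST] <<SYS>> Answer the questions. <</SYS>> hello[/INST]"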
1.1 llama-2-7b-chat.ggmlv3.q2_K.bin
2-bit quantized model
Question-1:
Bash
[INST] <<SYS>>
Answer the questions.
<</SYS>>
hello[/INST]
Count tokens: 25
Response-1:
Bash
[INST] <<SYS>>
Answer the questions.
<</SYS>>
hello[/INST] Hello! How are you today? It's nice to meet you. Is there something I can help you with or would you like to chat and get to know each other better?
Count tokens: 38
tokens/s:1.84
Question-2:
Bash
[INST] <<SYS>>
Answer the questions.
<</SYS>>
Are Large Langauage Models like LLaMa supported on embedded hardware?[/INST]
Count tokens: 20
Response-2:
Bash
[INST] <<SYS>>
Answer the questions.
<</SYS>>
Are Large Langauage Models like LLaMa supported on embedded hardware?[/INST]
Yes, large language models like LLama can be supported on embedded hardware. Embedded systems are typically low-power and low-resource devices that are designed to perform specific tasks or functions within a limited power budget and memory footprint.
Large language models like LLama require significant computational resources, including memory, processing speed, and storage capacity. However, there are several strategies that can be used to enable these models on embedded hardware:
1. Model pruning: By removing redundant or unnecessary components from the model's architecture, it is possible to reduce its size and computational requirements without significantly impacting its performance. This approach involves identifying and eliminating redundant or unnecessary parts of the model while preserving its overall functionality.
2. Knowledge distillation: This technique involves compressing a large language model into a smaller, simpler model that can be trained on embedded hardware by transferring the knowledge from the larger model to the smaller one. The goal is to preserve as much of the original model's performance as possible while reducing its size and computational requirements.
3. Tensor trains: This technique involves representing large tensors (multidimensional arrays) more efficiently in memory, which can be particularly useful for models that require a lot of computation or storage capacity. By exploiting the structure of the tensor, it is possible to reduce the number of elements stored while maintaining their integrity and accuracy.
4. Quantization: This technique involves reducing the precision of numerical values used by the model during computations, which can significantly reduce memory usage and computational requirements without impacting performance too much. However, this approach may introduce some loss in accuracy depending on the specific application.
5. Hardware acceleration: Embedded devices often have specialized hardware accelerators like GPUs or TPUs (Tensor Processing Units) that are optimized for matrix multiplication and other linear algebra operations required by large language models. By offloading these computations to dedicated hardware, it is possible to significantly reduce the computational requirements of the model while maintaining its performance.
6. Distributed computing: This technique involves distributing computation across multiple devices or systems, which can help mitigate memory and processing limitations on individual devices by sharing workloads among them. By dividing the model's computations into smaller tasks that can be performed concurrently on different processors, it is possible to reduce its overall computational requirements while maintaining performance.
7. Model-specific optimizations: Some models are designed specifically for deployment on resource-constrained devices like embedded systems. These models may have been optimized from the ground up to run efficiently within these constraints and can provide competitive performance at a fraction of the size and processing speed required by their larger counterparts.
8. Data compression: This technique involves reducing the amount of data that needs to be stored or transmitted, which can significantly reduce memory usage and computational requirements without impacting model performance too much. By exploiting patterns in the data, it is possible to compress the information while preserving its integrity and accuracy.
9. Efficient algorithms: Some algorithms are designed specifically for resource-constrained devices like embedded systems. These algorithms may be optimized to run efficiently within these constraints and can provide competitive performance at a fraction of the size and processing speed required by their larger counterparts. Examples include sparse matrix libraries, linear algebra libraries, and specialized optimization techniques tailored to specific hardware platforms.
10. Hardware-specific optimizations: Some hardware platforms have specialized accelerators or coprocessors that can be exploited for efficient computation on resource-constrained devices like embedded systems. For example, some ARM Cortex processors include dedicated A75 and A55 cores optimized for matrix multiplication and other linear algebra operations required by large language models.
In summary, there are several strategies that can be used to enable large language models on embedded hardware: model pruning, knowledge distillation, tensor trains, quantization, hardware acceleration, distributed computing, model-specific optimizations, data compression, efficient algorithms, and hardware-specific optimizations. These techniques can help reduce the computational requirements of these models while maintaining their performance within resource constraints.
Count tokens: 864
tokens/s:1.78
1.2 llama-2-7b-chat.ggmlv3.q4_1.bin
4-bit quantized model
Question-1:
Bash
[INST] <<SYS>>
Answer the questions.
<</SYS>>
hello[/INST]
Count tokens: 25
Response-1:
Bash
[INST] <<SYS>>
Answer the questions.
<</SYS>>
hello[/INST] Hello! I'm here to help answer your questions. What would you like to know?
Count tokens: 20
tokens/s:1.32
Question-2:
Bash
[INST] <<SYS>>
Answer the questions.
<</SYS>>
Are Large Langauage Models like LLaMa supported on embedded hardware?[/INST]
Count tokens: 41
Response-2:
Bash
[INST] <<SYS>>
Answer the questions.
<</SYS>>
Are Large Langauage Models like LLaMa supported on embedded hardware?[/INST]
Yes, large language models like LLaMa (Large Language Model for AI) can be supported on embedded hardware. In fact, there are several efforts underway to port large language models to resource-constrained devices such as smartphones, smart home devices, and other embedded systems.
Some of the key challenges in deploying large language models on embedded hardware include:
1. Memory constraints: Large language models require significant amounts of memory to store their parameters and weights. Embedded devices often have limited memory resources, making it difficult to train and run these models.
2. Computational requirements: Training and inference tasks for large language models can be computationally intensive, requiring high-performance processors or specialized hardware accelerators. Embedded devices may not have the necessary processing power to handle these tasks efficiently.
3. Energy efficiency: Many embedded systems are designed to operate on battery power, so energy efficiency is a critical concern. Large language models can consume significant amounts of power during training and inference, which may be challenging for some embedded applications.
To overcome these challenges, researchers and developers are exploring various techniques, including:
1. Model pruning: This involves reducing the size of the large language model by pruning away redundant or unnecessary weights without significantly impacting its performance. This can help reduce memory requirements and computational demands on embedded hardware.
2. Quantization: This involves representing the model's weights and activations using fewer bits, which can result in significant reductions in memory usage and computational requirements. However, this may also affect the model's accuracy to some extent.
3. Knowledge distillation: This involves training a smaller, simpler model (the "student") to mimic the behavior of a larger, more complex model (the "teacher"). By doing so, researchers can transfer knowledge from the teacher to the student without requiring as much computational resources or memory.
4. Hardware acceleration: Specialized hardware accelerators such as Tensor Processing Units (TPUs) or Field-Programmable Gate Arrays (FPGAs) can significantly improve the performance and energy efficiency of large language models on embedded devices. These accelerators are designed to handle matrix multiplication and other linear algebra operations that are common in deep learning tasks, making them well-suited for natural language processing workloads.
5. Cloud-based inference: In some cases, it may be more practical or efficient to perform inference tasks (i.e., generate text) on cloud servers rather than on the embedded device itself. This can help alleviate memory and computational constraints while still providing a seamless user experience.
Overall, deploying large language models on embedded hardware is an active area of research and development, with various techniques being explored to overcome the challenges involved.
Count tokens: 588
tokens/s:1.62
Source: TheBloke/Llama-2-13B-chat-GPTQ
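The GPTQ checkpoint needs a GPU loader such as AutoGPTQ; a common front end is oobabooga's text-generation-webui. Below is a rough sketch of that route, with the caveats that its flag names change between releases (check python server.py --help) and that on Jetson's aarch64 platform you generally need NVIDIA's own PyTorch wheels rather than the default pip packages:
Bash
# text-generation-webui as one possible front end for GPTQ models (flags vary by release)
git clone https://github.com/oobabooga/text-generation-webui && cd text-generation-webui
pip install -r requirements.txt   # on Jetson, install NVIDIA's aarch64 PyTorch wheel first
# Fetch the quantized weights into the models/ directory
python download-model.py TheBloke/Llama-2-13B-chat-GPTQ
# Launch the web UI with the AutoGPTQ loader (loader selection flag exists in mid-2023 releases)
python server.py --model TheBloke_Llama-2-13B-chat-GPTQ --loader autogptq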
Question-1:
Bash
[INST] <<SYS>>
Answer the questions.
<</SYS>>
hello[/INST]
Count tokens: 25
Response-1:
Bash
[INST] <<SYS>>
Answer the questions.
<</SYS>>
hello [/INST] Hello! I'm here to help answer any questions you may have. What would you like to know?
Count tokens: 24
tokens/s:1.53
Question-2:
Bash
[INST] <<SYS>>
Answer the questions.
<</SYS>>
Are Large Langauage Models like LLaMa supported on embedded hardware?[/INST]
Count tokens: 41
Response-2:
Bash
[INST] <<SYS>>
Answer the questions.
<</SYS>>
Are Large Langauage Models like LLaMa supported on embedded hardware? [/INST] Large language models like LLaMA are typically designed to run on powerful servers and workstations, and are not directly supported on embedded hardware. Embedded hardware is a class of microcontrollers and other specialized computer hardware that is used in a wide range of applications, from consumer electronics to industrial control systems. These devices are typically resource-constrained, with limited processing power, memory, and storage capacity compared to servers and workstations.
While it is technically possible to run large language models like LLaMA on some types of embedded hardware, such as high-end smartphones or specialized AI boards, these models are not optimized for deployment on such platforms. In fact, attempting to run such models on resource-constrained devices can lead to significant performance degradation, increased energy consumption, and potentially even device failure.
Therefore, if you need to deploy LLaMA or similar large language models on an embedded system, you will likely need to use a customized version of the model that has been specifically designed and optimized for the constraints of your target hardware. This may involve techniques such as model pruning, quantization, and knowledge distillation to reduce the computational requirements and memory footprint of the model, while preserving its accuracy and functionality. Alternatively, you may be able to use a pre-trained model that has already been adapted for deployment on your specific hardware platform.
Count tokens: 297
tokens/s:1.65
Despite the official figure of up to 275 TOPS of compute, the current lack of optimization means that large language model performance on this platform is not yet remarkable. Whether it's the 2-bit or 4-bit quantized 7B model or the GPTQ quantized 13B model, speeds stay below 2 tokens per second, and the responses are fairly ordinary. However, as our understanding and optimization of large language models improve, Orin's performance should advance significantly.
We invite you to stay tuned for our future reports on Orin's performance in other aspects.