In the world of AI, large language models (LLMs) are leading the trend, endowing machines with unprecedented intelligence through their powerful abilities to understand and generate text. However, running these models typically requires significant computational resources, which is why they mostly run on large servers.
Yet with the advancement of technology and the rise of edge computing, we now have the potential to run these models on smaller, more portable devices. Single-board computers (SBCs) such as the Raspberry Pi and LattePanda are pioneers of this transformation. Despite their small size, these devices are powerful enough to run quantized versions of some models.
In this article, we will delve into how to run LLMs (LLaMA, Alpaca, LLaMA2, ChatGLM) on a Raspberry Pi 4B and how to build your own AI chatbot server on such devices. We will explain, in a friendly and approachable manner, what these models demand of the CPU and how to deploy them on a Raspberry Pi 4B.
The Raspberry Pi is a small computer that, although it can handle many desktop-class tasks, still has limited hardware resources.
The performance of the Raspberry Pi's processor depends on its model. The Raspberry Pi 4B, for instance, uses a quad-core Broadcom BCM2711 (Cortex-A72) processor running at 1.5 GHz. Compared to high-end personal computers and servers, this processing power is clearly limited, which means that tasks requiring substantial computational resources, such as training or running LLMs, may take much longer on the Raspberry Pi.
The memory capacity of the Raspberry Pi also depends on the model: the Raspberry Pi 4B is available with 2GB, 4GB, or 8GB of RAM. For running LLMs, this can become a bottleneck. With limited RAM, it may be impossible to load or run these models at all, or inference may be significantly slowed.
Because of these limits on computational power and memory, running LLMs on a Raspberry Pi can run into several problems. First, loading and running a model may be very slow, hurting the user experience. Second, memory constraints may make it impossible to load or run large models, or may trigger out-of-memory errors during operation. Although larger models usually perform better, for devices like the Raspberry Pi, models with smaller memory footprints are the better choice.
Large language model projects usually state their CPU/GPU requirements. Given that the Raspberry Pi only has a CPU, we need to prioritize models that can run on a CPU alone, and among those, models with smaller memory footprints. Many models also come in quantized variants, which require much less RAM. As a rule of thumb, a model needs roughly twice its file size in RAM to run comfortably. We therefore recommend an 8GB Raspberry Pi 4B and small quantized models for experiencing and testing LLM performance on the Raspberry Pi.
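To make that rule of thumb concrete, here is a small Python sketch. The numbers it prints are rough estimates derived from parameter count and bit width, not measured values:

def estimated_ram_gb(params_billion, bits_per_weight, overhead=2.0):
    # File size in GB: parameters x bits per weight, converted from bits to bytes.
    file_gb = params_billion * 1e9 * bits_per_weight / 8 / 1e9
    # Rule of thumb from this article: plan for roughly twice the file size in RAM.
    return file_gb * overhead

print(estimated_ram_gb(7, 4))   # 7B model at 4 bits: ~3.5GB file, ~7GB RAM -> just fits an 8GB Pi 4B
print(estimated_ram_gb(7, 16))  # same model at 16 bits: ~14GB file, ~28GB RAM -> far beyond the Pi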
As computational power has continued to grow, models such as OpenAI's GPT-1 through GPT-3, InstructGPT, and ChatGPT, as well as Anthropic's Claude, have become larger and larger, but none of them have been open-sourced; they have taken the path of "Closed AI". Against this backdrop, a batch of open-source models has emerged, with recent influential examples including Meta AI's LLaMA and Stanford's Alpaca, which is based on LLaMA. The following smaller models are selected from the Open LLM Leaderboard on Hugging Face (see References).
| Model | Average | ARC | HellaSwag | MMLU | TruthfulQA | License |
|---|---|---|---|---|---|---|
| LLaMA-7B | 49.7 | 51 | 77.8 | 35.7 | 34.3 | Non-commercial |
| Alpaca-7B | 31.9 | 28.1 | 25.8 | 25.3 | 48.5 | Non-commercial |
| LLaMA2-7B-chat-hf | 56.4 | 52.9 | 78.6 | 48.3 | 45.6 | Meta |
| LLaMA-13B | 56.1 | 56.2 | 80.9 | 47.7 | 39.5 | Non-commercial |
| ChatGLM-6B | 48.2 | 38.8 | 59 | 46.7 | 48.1 | Non-commercial |
Notes:
1. ARC (AI2 Reasoning Challenge): grade-school-level science questions that test reasoning ability.
2. HellaSwag: tests the model's common-sense reasoning abilities.
3. MMLU (Measuring Massive Multitask Language Understanding): tests knowledge across a broad range of subjects.
4. TruthfulQA: measures how prone models are to reproducing common human falsehoods.
Model quantization reduces hardware requirements by lowering the precision of the weight parameters of each neuron in a deep neural network. These weights are usually stored as floating-point numbers with 16, 32, or 64 bits of precision. Standard approaches to model quantization include GGML and GPTQ. GGML is a tensor library for machine learning: a C library that defines a binary format for distributing LLMs and allows you to run them on a CPU or on a CPU + GPU. It supports many different quantization strategies (such as 4-bit, 5-bit, and 8-bit quantization), each offering a different trade-off between efficiency and performance.
Figure: Comparison of AI model size after quantization
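To illustrate the principle, here is a minimal sketch of symmetric 4-bit quantization in Python with NumPy. Real GGML quantization works block-wise, with a separate scale per block of weights, so treat this as a simplified illustration rather than GGML's actual algorithm:

import numpy as np

def quantize_4bit(weights):
    # Pick a scale so the largest magnitude maps to the top of the 4-bit range [-7, 7].
    scale = np.abs(weights).max() / 7.0
    quantized = np.clip(np.round(weights / scale), -7, 7).astype(np.int8)
    return quantized, scale

def dequantize(quantized, scale):
    # Recover approximate floats; the difference from the originals is the quantization error.
    return quantized.astype(np.float32) * scale

weights = np.random.randn(8).astype(np.float32)
q, scale = quantize_4bit(weights)
print(weights)
print(dequantize(q, scale))  # close to the originals, at a fraction of the storage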
The following is the process of quantizing LLaMA 7B to 4 bits with GGML on a Linux PC.
The first part of the process is to set up llama.cpp on the Linux PC, download the LLaMA 7B model, convert it, and copy the result to a USB drive. We need the Linux PC's extra RAM to convert the model, as the 8GB in a Raspberry Pi is insufficient.
1. On your Linux PC, open a terminal and ensure that git is installed.
sudo apt update && sudo apt install git
2. Use git to clone the repository.
git clone https://github.com/ggerganov/llama.cpp
3. Install a series of Python modules. These modules will work with the model to create a chatbot.
python3 -m pip install torch numpy sentencepiece
4. Ensure that you have G++ and build-essential installed. These are needed to build C/C++ applications.
sudo apt install g++ build-essential
5. In the terminal change directory to llama.cpp.
cd llama.cpp
6. Build the project files. Press Enter to run.
make
7. Download the LLaMA 7B torrent using the magnet link below. I used qBittorrent to download the model.
magnet:?xt=urn:btih:ZXXDAUWYLRUXXBHUYEMS6Q5CE5WA3LVA&dn=LLaMA
8. Refine the download so that only the 7B and tokenizer files are downloaded. The other folders contain larger models that weigh in at hundreds of gigabytes.
(Image credit: Tom's Hardware)
9. Copy the 7B folder and the tokenizer files to llama.cpp/models/.
10. Open a terminal and go to the llama.cpp folder. This should be in your home directory.
cd llama.cpp
11. Convert the 7B model to GGML FP16 format. Depending on your PC, this can take a while. This step alone is why the Linux PC needs 16GB of RAM: it loads the entire 13GB models/7B/consolidated.00.pth file into RAM as a PyTorch model. Trying this step on an 8GB Raspberry Pi 4 will cause an illegal instruction error.
python3 convert-pth-to-ggml.py models/7B/ 1
12. Quantize the model to 4 bits. This will reduce the size of the model.
python3 quantize.py 7B
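Note: if your llama.cpp checkout does not include the quantize.py helper, the same step can be performed with the compiled quantize binary that make produces. The file names and the trailing type argument (2 selected 4-bit q4_0 at the time) match llama.cpp builds from that period and may differ in newer versions:
./quantize ./models/7B/ggml-model-f16.bin ./models/7B/ggml-model-q4_0.bin 2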
13. Copy the contents of llama.cpp/models/ to the USB drive.
(Image credit: Tom's Hardware)
In this final section, I repeat the llama.cpp setup on the Raspberry Pi 4, copy the model across using a USB drive, load an interactive chat session, and ask “Bob” a series of questions. Just don’t ask it to write any Python code. Step 9 of this process can be run on the Raspberry Pi 4 or on the Linux PC.
1. Boot your Raspberry Pi 4 to the desktop.
2. Open a terminal and ensure that git is installed.
sudo apt update && sudo apt install git
3. Use git to clone the repository.
git clone https://github.com/ggerganov/llama.cpp
4. Install a series of Python modules. These modules will work with the model to create a chatbot.
python3 -m pip install torch numpy sentencepiece
5. Ensure that you have G++ and build-essential installed. These are needed to build C/C++ applications.
sudo apt install g++ build-essential
6. In the terminal, change the directory to llama.cpp.
cd llama.cpp
7. Build the project files. Press Enter to run.
make
8. Insert the USB drive and copy the files to llama.cpp/models/. This will overwrite any files already in the models directory.
9. Start an interactive chat session with “Bob”. Here is where a little patience is required: even though the 7B model is lighter than the others, it is still a rather weighty model for the Raspberry Pi to digest, and loading it can take a few minutes.
./chat.sh
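Note: if chat.sh is missing from your build, an equivalent session can be started by invoking the main binary directly. The flags and the bundled chat-with-bob.txt prompt shown here match llama.cpp at the time of writing and may change in later versions:
./main -m ./models/7B/ggml-model-q4_0.bin --color -i -r "User:" -f prompts/chat-with-bob.txt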
10. Ask Bob a question and press Enter. I asked it to tell me about Jean-Luc Picard from Star Trek: The Next Generation. To exit, press CTRL + C.
(Image credit: Tom's Hardware)
Test of LLMs on a Raspberry Pi 4B (8GB)
| Model | File Size | Compatibility | Out Of Memory | Token Speed |
|---|---|---|---|---|
| LLaMA-7B-Q4 | < 4GB | √ | | ~0.1 token/s |
| Alpaca-7B-Q4 | < 4GB | √ | | |
| LLaMA2-7B-chat-hf-Q4 | < 7GB | √ | | |
| LLaMA-13B-Q4 | < 8GB | √ | | |
| ChatGLM-6B-Q4 | 13GB | √ | | |
This article has explored the possibilities and challenges of running LLMs on hardware-limited devices like the Raspberry Pi 4B. We proposed choosing models with smaller memory footprints that can run on a CPU alone and applying model quantization to lower hardware requirements. This opens up new possibilities for implementing AI chatbot servers on edge devices.
References:
1. Open LLM Leaderboard: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard
2. Project: Running LLaMA-7B on Raspberry Pi: https://www.tomshardware.com/how-to/create-ai-chatbot-server-on-raspberry-pi#Managing%20Expectations
3. Project: Running LLaMA2-7B on Raspberry Pi: https://scrapbox.io/yuiseki/Raspberry_Pi_4_Model_B_8GB%E3%81%A7LLaMA_2%E3%81%AF%E5%8B%95%E3%81%8