0 - 概要

本文主要描述如何在 Radxa Orion O6 上使用 llama.cpp 部署 Deepseek R1 模型，包含以下内容：

在 Orion O6 上编译 llama.cpp 最新版本；
对比 Orion O6 Debian 系统预装的 cix-llama-cpp 与本人编译的 llama.cpp 的性能差别；
对比 llama.cpp CLI 模式原生 WebUI。

1 - 编译 llama.cpp

简介

llama.cpp 是一款轻量级 C/C++ 开源工具，核心作用是在普通设备上高效运行 LLaMA 系列大模型：

无依赖部署，无需复杂环境配置，支持 Windows、Linux、macOS 等多系统。
优化模型推理效率，支持 CPU、GPU（部分）加速，低配置设备也能运行。
兼容 LLaMA、LLaMA 2、GPT-2 等多种主流大模型格式。
开发者快速测试大模型原型，无需依赖高性能服务器。
个人用户在本地运行大模型，保障数据隐私，无需联网。

下载代码

克隆 llama.cpp 官方代码：

git clone https://github.com/ggml-org/llama.cpp.git

topgear@radxa-orion-o6:~$ git clone https://github.com/ggml-org/llama.cpp.git
Cloning into 'llama.cpp'...
remote: Enumerating objects: 67002, done.
remote: Total 67002 (delta 0), reused 0 (delta 0), pack-reused 67002 (from 1)
Receiving objects: 100% (67002/67002), 191.69 MiB | 10.29 MiB/s, done.
Resolving deltas: 100% (48724/48724), done.

安装依赖

llama.cpp 支持很多特性：

如果要使用 GPU 作为 backend，需要打开 GGML_VULKAN 编译选项，并需要安装 glslc 和 libvulkan1；
如果要支持联网下载模型，则需要安装 libcurl4。

Orion O6 Debian 系统已经预装 libvulkan1 包，所以额外安装 glslc 和 libcurl4 即可，libcurl4 我选择的是使用 Openssl 的版本：

sudo apt install glslc libcurl4-openssl-dev

安装 glslc

glslc 是 Google 开发的 GLSL 编译器（GLSL Compiler），主要用于将 OpenGL Shading Language（GLSL）编写的着色器代码编译为二进制格式（如 SPIR-V），供 GPU 驱动程序直接使用。
在 llama.cpp 中，glslc 主要用于编译项目中与 GPU 加速相关的着色器代码（例如针对 Vulkan 图形 API 的计算着色器），以实现 GPU 对模型推理的加速支持。当 llama.cpp 启用 Vulkan 后端时，会依赖 glslc 完成着色器的预处理和编译，确保 GPU 能正确执行相关计算任务。

没有安装 glslc 时，报错如下：

topgear@radxa-orion-o6:~/llama.cpp$ cmake -B build -DGGML_VULKAN=ON
CMAKE_BUILD_TYPE=Release
-- Warning: ccache not found - consider installing it for faster compilation or disable this warning with GGML_CCACHE=OFF
-- CMAKE_SYSTEM_PROCESSOR: aarch64
-- GGML_SYSTEM_ARCH: ARM
-- Including CPU backend
-- ARM detected
-- ARM -mcpu not found, -mcpu=native will be used
-- ARM feature FMA enabled
-- Adding CPU backend variant ggml-cpu: -mcpu=native
CMake Error at /usr/share/cmake-3.25/Modules/FindPackageHandleStandardArgs.cmake:230 (message):
  Could NOT find Vulkan (missing: glslc) (found version "1.3.239")
Call Stack (most recent call first):
  /usr/share/cmake-3.25/Modules/FindPackageHandleStandardArgs.cmake:600 (_FPHSA_FAILURE_MESSAGE)
  /usr/share/cmake-3.25/Modules/FindVulkan.cmake:597 (find_package_handle_standard_args)
  ggml/src/ggml-vulkan/CMakeLists.txt:9 (find_package)


-- Configuring incomplete, errors occurred!
See also "/home/topgear/llama.cpp/build/CMakeFiles/CMakeOutput.log".
See also "/home/topgear/llama.cpp/build/CMakeFiles/CMakeError.log".

执行 sudo apt install glslc 安装 glslc：

topgear@radxa-orion-o6:~/llama.cpp$ sudo apt install glslc
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following additional packages will be installed:
  libshaderc1
The following NEW packages will be installed:
  glslc libshaderc1
0 upgraded, 2 newly installed, 0 to remove and 295 not upgraded.
Need to get 2,015 kB of archives.
After this operation, 7,666 kB of additional disk space will be used.
Do you want to continue? [Y/n] y
Get:1 https://mirrors.aliyun.com/debian bookworm/main arm64 libshaderc1 arm64 2023.2-1 [1,314 kB]
Get:2 https://mirrors.aliyun.com/debian bookworm/main arm64 glslc arm64 2023.2-1 [700 kB]
Fetched 2,015 kB in 2s (1,260 kB/s)
debconf: unable to initialize frontend: Dialog
debconf: (No usable dialog-like program is installed, so the dialog based frontend cannot be used. at /usr/share/perl5/Debconf/FrontEnd/Dialog.pm line 78, <> line 2.)
debconf: falling back to frontend: Readline
Selecting previously unselected package libshaderc1:arm64.
(Reading database ... 183355 files and directories currently installed.)
Preparing to unpack .../libshaderc1_2023.2-1_arm64.deb ...
Unpacking libshaderc1:arm64 (2023.2-1) ...
Selecting previously unselected package glslc.
Preparing to unpack .../glslc_2023.2-1_arm64.deb ...
Unpacking glslc (2023.2-1) ...
Setting up libshaderc1:arm64 (2023.2-1) ...
Setting up glslc (2023.2-1) ...
Processing triggers for man-db (2.11.2-2) ...
Processing triggers for libc-bin (2.36-9+deb12u8) ...

安装 libcurl4-openssl-dev

llama.cpp 支持联网下载大模型，底层依赖 curl。没有安装的话，报错如下。如果不想使用此功能，也可以在 cmake 命令中加上 -DLLAMA_CURL=OFF 禁用该特性。

topgear@radxa-orion-o6:~/llama.cpp$ cmake -B build -DGGML_VULKAN=ON
CMAKE_BUILD_TYPE=Release
-- Warning: ccache not found - consider installing it for faster compilation or disable this warning with GGML_CCACHE=OFF
-- CMAKE_SYSTEM_PROCESSOR: aarch64
-- GGML_SYSTEM_ARCH: ARM
-- Including CPU backend
-- ARM detected
-- ARM -mcpu not found, -mcpu=native will be used
-- ARM feature FMA enabled
-- Adding CPU backend variant ggml-cpu: -mcpu=native
-- Found Vulkan: /usr/lib/aarch64-linux-gnu/libvulkan.so (found version "1.3.239") found components: glslc missing components: glslangValidator
-- Vulkan found
-- GL_KHR_cooperative_matrix not supported by glslc
-- GL_NV_cooperative_matrix2 not supported by glslc
-- GL_EXT_integer_dot_product not supported by glslc
-- GL_EXT_bfloat16 not supported by glslc
-- Including Vulkan backend
-- ggml version: 0.9.4
-- ggml commit:  230d1169e
-- Could NOT find CURL (missing: CURL_LIBRARY CURL_INCLUDE_DIR)
CMake Error at common/CMakeLists.txt:86 (message):
  Could NOT find CURL.  Hint: to disable this feature, set -DLLAMA_CURL=OFF

执行 sudo apt install libcurl4-openssl-dev 进行安装：

topgear@radxa-orion-o6:~/llama.cpp$ sudo apt install libcurl4-openssl-dev
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
Suggested packages:
  libcurl4-doc libidn-dev libkrb5-dev libldap2-dev librtmp-dev libssh2-1-dev pkg-config
The following NEW packages will be installed:
  libcurl4-openssl-dev
0 upgraded, 1 newly installed, 0 to remove and 295 not upgraded.
Need to get 476 kB of archives.
After this operation, 1,806 kB of additional disk space will be used.
Get:1 https://mirrors.aliyun.com/debian bookworm/main arm64 libcurl4-openssl-dev arm64 7.88.1-10+deb12u14 [476 kB]
Fetched 476 kB in 0s (1,799 kB/s)
debconf: unable to initialize frontend: Dialog
debconf: (No usable dialog-like program is installed, so the dialog based frontend cannot be used. at /usr/share/perl5/Debconf/FrontEnd/Dialog.pm line 78, <> line 1.)
debconf: falling back to frontend: Readline
Selecting previously unselected package libcurl4-openssl-dev:arm64.
(Reading database ... 183367 files and directories currently installed.)
Preparing to unpack .../libcurl4-openssl-dev_7.88.1-10+deb12u14_arm64.deb ...
Unpacking libcurl4-openssl-dev:arm64 (7.88.1-10+deb12u14) ...
Setting up libcurl4-openssl-dev:arm64 (7.88.1-10+deb12u14) ...
Processing triggers for man-db (2.11.2-2) ...

编译 llama.cpp (Vulkan support)

cmake 配置

在 llama.cpp 源码根目录执行 cmake 配置命令：

cmake -B build -DGGML_VULKAN=ON

topgear@radxa-orion-o6:~/llama.cpp$ cmake -B build -DGGML_VULKAN=ON
CMAKE_BUILD_TYPE=Release
-- Warning: ccache not found - consider installing it for faster compilation or disable this warning with GGML_CCACHE=OFF
-- CMAKE_SYSTEM_PROCESSOR: aarch64
-- GGML_SYSTEM_ARCH: ARM
-- Including CPU backend
-- ARM detected
-- ARM -mcpu not found, -mcpu=native will be used
-- ARM feature FMA enabled
-- Adding CPU backend variant ggml-cpu: -mcpu=native
-- Vulkan found
-- GL_KHR_cooperative_matrix not supported by glslc
-- GL_NV_cooperative_matrix2 not supported by glslc
-- GL_EXT_integer_dot_product not supported by glslc
-- GL_EXT_bfloat16 not supported by glslc
-- Including Vulkan backend
-- ggml version: 0.9.4
-- ggml commit:  230d1169e
-- Found CURL: /usr/lib/aarch64-linux-gnu/libcurl.so (found version "7.88.1")
-- Configuring done
-- Generating done
-- Build files have been written to: /home/topgear/llama.cpp/build

cmake 构建

在 llama.cpp 源码根目录执行 cmake 构建命令：

cmake --build build --config Release -j 8

topgear@radxa-orion-o6:~/llama.cpp$ cmake --build build --config Release -j 8
[  0%] Creating directories for 'vulkan-shaders-gen'
[  1%] Building CXX object tools/mtmd/CMakeFiles/llama-llava-cli.dir/deprecation-warning.cpp.o
[  2%] Building C object ggml/src/CMakeFiles/ggml-base.dir/ggml.c.o
[  2%] Building C object examples/gguf-hash/CMakeFiles/sha1.dir/deps/sha1/sha1.c.o
[  3%] Building CXX object tools/mtmd/CMakeFiles/llama-gemma3-cli.dir/deprecation-warning.cpp.o
[  4%] No download step for 'vulkan-shaders-gen'
[  4%] Building CXX object common/CMakeFiles/build_info.dir/build-info.cpp.o
[  5%] Building C object examples/gguf-hash/CMakeFiles/xxhash.dir/deps/xxhash/xxhash.c.o
[  5%] Building C object examples/gguf-hash/CMakeFiles/sha256.dir/deps/sha256/sha256.c.o
[  5%] No update step for 'vulkan-shaders-gen'
[  5%] Built target build_info
[  5%] Building CXX object tools/mtmd/CMakeFiles/llama-minicpmv-cli.dir/deprecation-warning.cpp.o
[  5%] No patch step for 'vulkan-shaders-gen'
[  5%] Built target sha1
[  5%] Performing configure step for 'vulkan-shaders-gen'
[  5%] Built target sha256
[  5%] Building CXX object tools/mtmd/CMakeFiles/llama-qwen2vl-cli.dir/deprecation-warning.cpp.o
[  5%] Building CXX object ggml/src/CMakeFiles/ggml-base.dir/ggml.cpp.o
[  5%] Linking CXX executable ../../bin/llama-gemma3-cli
[  5%] Linking CXX executable ../../bin/llama-llava-cli
-- The C compiler identification is GNU 12.2.0
[  5%] Built target llama-llava-cli
[  5%] Built target llama-gemma3-cli
[  5%] Building CXX object ggml/src/CMakeFiles/ggml-base.dir/ggml-backend.cpp.o
[  5%] Building C object ggml/src/CMakeFiles/ggml-base.dir/ggml-alloc.c.o
[  5%] Linking CXX executable ../../bin/llama-minicpmv-cli
-- The CXX compiler identification is GNU 12.2.0
-- Detecting C compiler ABI info
[  6%] Linking CXX executable ../../bin/llama-qwen2vl-cli
...

llama 工具

在编译结束之后，可以在 build/bin/ 目录下看到很多命令，其中常用的命令如下：

topgear@radxa-orion-o6:~/llama.cpp$ ls -l build/bin/llama-bench build/bin/llama-cli build/bin/llama-server
-rwxr-xr-x 1 topgear topgear  488112 Nov  6 14:13 build/bin/llama-bench
-rwxr-xr-x 1 topgear topgear 2354416 Nov  6 14:13 build/bin/llama-cli
-rwxr-xr-x 1 topgear topgear 4385320 Nov  6 14:14 build/bin/llama-server

2 - 下载 DeepSeek 模型

直接使用现成的 GGUF 格式模型

目前 GGUF 是 llama.cpp 官方主推的格式，对新功能和优化支持更完善，其他格式可能在兼容性或性能上存在限制。
从 Hugging Face Hub 等平台直接下载社区转换好的 GGUF 格式模型文件（例如 DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_M.gguf 这类文件），下载后可直接通过 llama-cli 加载运行，无需额外转换步骤。
本次测试中我是 https://huggingface.co/bartow... 页面下载了 DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_M.gguf，模型精度还可以，4-bit量化，文件大小也适中。可以直接使用 wget 命令下载：

wget https://huggingface.co/bartowski/DeepSeek-R1-Distill-Qwen-1.5B-GGUF/resolve/main/DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_M.gguf

下载原生格式模型，手动转换为 GGUF

也可以先下载模型的原生格式文件（通常是 Hugging Face 上的 PyTorch 格式，如 .bin 权重文件 + 配置文件），再通过 llama.cpp 提供的转换工具转为 GGUF 格式。

# 确保安装了 git-lfs (https://git-lfs.com)
git lfs install

git clone https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B

# 在 llama.cpp 目录, 安装 convert_hf_to_gguf 脚本依赖的模块
pip install -r requirements/requirements-convert_hf_to_gguf.txt

# 转换原生格式模型为 gguf 格式，这里转换为了 16 位浮点精度
python convert_hf_to_gguf.py ./DeepSeek-R1-Distill-Qwen-1.5B --outfile ./DeepSeek-R1-Distill-Qwen-1.5B-f16.gguf --outtype f16

# 将高精度的 GGUF 模型量化为低精度的 GGUF 模型，llama-quantize 是 llama.cpp 提供的模型量化工具，用于降低模型权重的精度，以减小文件体积、降低运行时的内存 / 显存占用，并提升推理速度（牺牲部分精度换取效率）
./build/bin/llama-quantize ~/DeepSeek-R1-Distill-Qwen-1.5B-f16.gguf ~/DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_M.gguf Q4_K_M

使用上述方法，同样可以得到 DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_M.gguf。

3 - 运行模型

使用 llama-cli 运行模型

./build/bin/llama-cli --model ~/DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_M.gguf

执行上述命令后可以看到命令行对话界面，如下图：

对比使用 CPU 和 GPU 的运行速度

为了对比 llama-cli 分别使用 CPU 和 GPU 作为 backend 时模型的运行速度，于是使用下方几条命令进行对比测试，测试过程中问了同一个问题：

# 使用 CPU 4~11 共8个核心，--device 设置位 none 表示不卸载到硬件进行加速
./build/bin/llama-cli --model ~/DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_M.gguf --cpu-mask 0xFF0 --device none

# 或者使用 taskset 设置 CPU list
taskset --cpu-list 4-11 ./build/bin/llama-cli --model ~/DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_M.gguf --device none

# 使用 GPU 进行加速，因为编译时默认开启
./build/bin/llama-cli --model ~/DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_M.gguf --device Vulkan0

经过对比发现，只使用 CPU 时，8个CPU使用率100%，且模型响应速度很慢。而使用 GPU 硬件加速时速度快了很多，下方是CPU使用率的截图和模型的使用信息对比。

对比测试期间，我发现 Orion O6 Debian 12 系统已经集成了一个 cix-llama-cpp 包。使用其预置的 llama-cli 运行同一个模型时，发现程序无法使用 GPU 进行加速，12核CPU资源几乎都用满。不过奇怪的是，与我编译的llama-cli仅使用CPU时对比，模型的响应速度却快了很多。这个会在后面进一步分析。

使用原生 SvelteKit-based WebUI

幸运的是，在进行本次测评活动开始之前一个月，llama.cpp master 合并了 https://github.com/ggml-org/l... ，使用下方命令即可提供原生 WebUI 服务：

./build/bin/llama-server -m ~/DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_M.gguf --jinja -c 0 --host 0.0.0.0 --port 8033

使用浏览器访问 http://127.0.0.1:8033 即可开始聊天：

4 - 测试数据对比

对比 cix-llama-cpp

在使用自己编译的 llama-cli 运行大模型时，如果只使用 CPU，速度很慢。而使用系统预编译的 cix-llama-cpp 时，发现它只支持使用 CPU 作为 backend，但是速度却快了很多，于是做了下列对比测试。

下图是对比测试的结果，我自己编译的 llama.cpp 只使用 CPU 时性能比 cix-llama-cpp 差了不少。

在搜索了不少技术文章之后，使用下方编译选项重新生成了 llama 工具集：

关闭原生指令优化
-DGGML_NATIVE=OFF
GGML_NATIVE 是 llama.cpp 中控制是否启用当前 CPU 原生指令集优化的选项（默认 ON，会自动检测并启用 CPU 支持的最高级指令集，如 x86 的 AVX2、ARM 的 NEON 等）。
设为 OFF 表示不自动启用原生指令集，而是通过后续选项手动指定 CPU 架构和指令集，适合需要精确控制硬件适配的场景（如针对特定 ARM 芯片优化）。
指定 ARM 架构与指令集
-DGGML_CPU_ARM_ARCH=armv9-a+i8mm+dotprod
手动指定目标 ARM 架构及支持的指令集：
- armv9-a：基础架构为 ARMv9-A（较新的 ARM 架构，支持 64 位，性能更强）。
- i8mm：启用 8 位整数矩阵乘法指令（Int8 Matrix Multiply，针对低精度计算优化，适合大模型量化推理）。
- dotprod：启用 dot product（点积）指令（加速向量运算，提升神经网络计算效率）。
启用 KleidiAI 加速
-DGGML_CPU_KLEIDIAI=ON
KleidiAI 是一款针对 ARM 架构的开源 AI 加速库，提供优化的矩阵运算实现（类似 x86 的 Intel MKL 或 AMD BLAS）。
启用后，llama.cpp 会调用 KleidiAI 库加速 CPU 上的张量计算，进一步提升 ARM 设备上的推理性能（尤其适合量化模型）。

cmake -B build_cpu_kleidiai_with_vulkan -DGGML_NATIVE=OFF -DGGML_CPU_ARM_ARCH=armv9-a+i8mm+dotprod -DGGML_CPU_KLEIDIAI=ON -DGGML_VULKAN=ON

(venv) topgear@radxa-orion-o6:~/llama.cpp$ cmake -B build_cpu_kleidiai_with_vulkan -DGGML_NATIVE=OFF -DGGML_CPU_ARM_ARCH=armv9-a+i8mm+dotprod -DGGML_CPU_KLEIDIAI=ON -DGGML_VULKAN=ON
CMAKE_BUILD_TYPE=Release
-- Warning: ccache not found - consider installing it for faster compilation or disable this warning with GGML_CCACHE=OFF
-- CMAKE_SYSTEM_PROCESSOR: aarch64
-- GGML_SYSTEM_ARCH: ARM
-- Including CPU backend
-- ARM detected
-- ARM feature DOTPROD enabled
-- ARM feature SVE enabled
-- ARM feature MATMUL_INT8 enabled
-- ARM feature FMA enabled
-- ARM feature FP16_VECTOR_ARITHMETIC enabled
-- Using KleidiAI optimized kernels if applicable
-- Adding CPU backend variant ggml-cpu: -march=armv9-a+i8mm+dotprod
-- Vulkan found
-- GL_KHR_cooperative_matrix not supported by glslc
-- GL_NV_cooperative_matrix2 not supported by glslc
-- GL_EXT_integer_dot_product not supported by glslc
-- GL_EXT_bfloat16 not supported by glslc
-- Including Vulkan backend
-- ggml version: 0.9.4
-- ggml commit:  230d1169e
-- Configuring done
-- Generating done
-- Build files have been written to: /home/topgear/llama.cpp/build_cpu_kleidiai_with_vulkan

使用重新编译的 llama.cpp 进行测试，结果如下：

多核编译 llama.cpp 的速率

测试过程中，使用 Orion O6 开启多线程编译 llama.cpp，大概只需要三分钟多一点，多核心并行让实际耗时大幅缩短，个人认为当作开发主机使用也毫无压力。

[100%] Built target llama-server

real    3m34.949s
user    19m22.985s
sys     1m29.087s

“星睿O6”AI PC开发套件评测-使用llama.cpp部署DeepSeek并使用原生WebUI