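1 - Building whisper.cpp

Introduction

whisper.cpp is a lightweight open-source speech recognition library implemented in C/C++. Its goal is to let OpenAI's Whisper models run efficiently on a wide range of devices, including embedded systems and edge devices. Its key features are:

Lightweight and cross-platform:
Model inference and memory usage are optimized so that it runs efficiently on the CPU without requiring a GPU. It works on Windows, Linux, and macOS, as well as on embedded platforms such as the Raspberry Pi.
Core functionality:
It implements the core Whisper features: speech recognition (audio to text) and speech translation (for example, translating audio in another language directly into English text). Several model sizes are supported (tiny, base, small, medium, large, and so on), so you can pick one that matches the device's performance.
Ease of use:
It provides simple command-line tools for processing audio files (WAV and other formats) directly, and also a C API that makes it easy to integrate into other languages or projects.

Download the code

Clone the official whisper.cpp repository: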

git clone https://github.com/ggerganov/whisper.cpp.git

Install dependencies

sudo apt install libvulkan1 libcurl4-openssl-dev libsdl2-dev

whisper.cpp supports many optional features:

To use the GPU as a backend, enable the GGML_VULKAN build option and install libvulkan1.
To build the whisper-stream command, install libsdl2-dev.

Build whisper.cpp

cmake configuration

Vulkan backend

To use the GPU as a backend, enable the Vulkan build option. From the whisper.cpp source root, run the cmake configure command:

cmake -B build -DGGML_NATIVE=OFF -DGGML_CPU_ARM_ARCH=armv9-a+i8mm+dotprod -DGGML_CPU_KLEIDIAI=ON -DGGML_VULKAN=ON -DWHISPER_SDL2=ON

CPU backend only

To use only the CPU as a backend, turn the Vulkan build option off. From the whisper.cpp source root, run the cmake configure command:

cmake -B build -DGGML_NATIVE=OFF -DGGML_CPU_ARM_ARCH=armv9-a+i8mm+dotprod -DGGML_CPU_KLEIDIAI=ON -DGGML_VULKAN=OFF -DWHISPER_SDL2=ON
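
The two configure commands above differ only in the value of GGML_VULKAN. As a minimal sketch, a small helper can assemble the argument list so the shared flags stay in one place. The function name `cmake_configure_args` and its parameters are hypothetical; the flags themselves are the ones used above.

```python
def cmake_configure_args(vulkan: bool, sdl2: bool = True, build_dir: str = "build") -> list[str]:
    """Return the cmake configure command as an argument list."""

    def on_off(flag: bool) -> str:
        return "ON" if flag else "OFF"

    return [
        "cmake", "-B", build_dir,
        "-DGGML_NATIVE=OFF",
        "-DGGML_CPU_ARM_ARCH=armv9-a+i8mm+dotprod",
        "-DGGML_CPU_KLEIDIAI=ON",
        f"-DGGML_VULKAN={on_off(vulkan)}",
        f"-DWHISPER_SDL2={on_off(sdl2)}",
    ]


if __name__ == "__main__":
    # Print both variants; pass the list to subprocess.run() to execute one
    print(" ".join(cmake_configure_args(vulkan=True)))
    print(" ".join(cmake_configure_args(vulkan=False)))
```

Keeping the build directory as a parameter also makes it harder for the configure and build steps to drift apart.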

cmake build

From the whisper.cpp source root, run the cmake build command (the build directory must match the one passed to -B above):

cmake --build build --config Release -j 12

whisper tools

After the build finishes, the whisper.cpp binaries are in build/bin/. The commonly used ones are:

(venv) topgear@radxa-orion-o6:~/whisper.cpp$ ls -l build/bin/
total 8756
-rwxr-xr-x 1 topgear topgear   72768 Nov  9 20:57 whisper-bench
-rwxr-xr-x 1 topgear topgear  929152 Nov  9 20:58 whisper-cli
-rwxr-xr-x 1 topgear topgear 1226128 Nov  9 20:58 whisper-server
-rwxr-xr-x 1 topgear topgear  856448 Nov  9 20:58 whisper-stream
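
A quick way to confirm that the build produced the tools listed above is to check for them programmatically. This is only a sketch: `missing_tools` is a hypothetical helper, and the expected names are simply the four binaries from the listing (whisper-stream is only built when SDL2 support is enabled).

```python
from pathlib import Path

EXPECTED_TOOLS = ("whisper-bench", "whisper-cli", "whisper-server", "whisper-stream")


def missing_tools(bin_dir: str, expected=EXPECTED_TOOLS) -> list[str]:
    """Return the expected tool names that are not regular files in bin_dir."""
    root = Path(bin_dir)
    return [name for name in expected if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_tools("build/bin")
    print("missing:", missing if missing else "none")
```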

2 - Downloading speech recognition models and preparing test audio

Speech recognition models

wget https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-base.bin
wget https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-medium.bin

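I downloaded two models for a side-by-side comparison, ggml-base.bin and ggml-medium.bin:

ggml-base.bin
- Model size: about 74M parameters; the file is about 142 MB.
- Performance: a good balance between speed and accuracy. It is a baseline model and a reasonable starting point for speech recognition, with a word error rate (WER) of roughly 8%-12%.
- Use cases: situations where resources are limited but reasonably accurate recognition is still wanted, such as everyday speech-to-text on a modestly equipped PC.
ggml-medium.bin
- Model size: about 769M parameters; the file is about 1.5 GB.
- Performance: higher recognition accuracy and a lower word error rate, but because the model is larger, inference is slower and uses considerably more GPU resources at run time.
- Use cases: situations that demand high accuracy on devices with a discrete GPU and strong compute, such as professional transcription or high-accuracy transcription of long audio or video.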
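
The file sizes follow roughly from the parameter counts. As a back-of-the-envelope check, assuming the weights are stored mostly as 16-bit floats (an assumption about these particular files, not something stated above), each parameter takes about 2 bytes:

```python
def approx_size_mib(params_millions: float, bytes_per_param: float = 2.0) -> float:
    """Estimate model file size in MiB from the parameter count."""
    return params_millions * 1e6 * bytes_per_param / (1024 ** 2)


if __name__ == "__main__":
    for name, params in (("ggml-base.bin", 74), ("ggml-medium.bin", 769)):
        print(f"{name}: ~{approx_size_mib(params):.0f} MiB")
```

The estimates come out at roughly 141 MiB and 1.43 GiB, in line with the sizes listed above.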

Test audio

I downloaded three Chinese test audio files from CIX's model repository on ModelScope:

wget https://www.modelscope.cn/models/cix/ai_model_hub_25_Q3/resolve/master/models/Audio/Speech_Recognotion/onnx_whisper_tiny_multi_language/test_data/1.wav
wget https://www.modelscope.cn/models/cix/ai_model_hub_25_Q3/resolve/master/models/Audio/Speech_Recognotion/onnx_whisper_tiny_multi_language/test_data/2.wav
wget https://www.modelscope.cn/models/cix/ai_model_hub_25_Q3/resolve/master/models/Audio/Speech_Recognotion/onnx_whisper_tiny_multi_language/test_data/3.wav
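
Whisper models operate on 16 kHz mono audio, so it can be useful to inspect the downloaded files before transcribing them. The sketch below reads the WAV header with Python's standard `wave` module; `wav_info` is a hypothetical helper name, and it only handles uncompressed PCM WAV files.

```python
import sys
import wave


def wav_info(path: str) -> dict:
    """Read basic format information from a PCM WAV file."""
    with wave.open(path, "rb") as wf:
        rate = wf.getframerate()
        return {
            "channels": wf.getnchannels(),
            "sample_rate": rate,
            "bits": wf.getsampwidth() * 8,
            "seconds": wf.getnframes() / rate,
        }


if __name__ == "__main__":
    # Pass one or more WAV paths on the command line
    for path in sys.argv[1:]:
        print(path, wav_info(path))
```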

3 - Running the models

Running a model with whisper-cli

I compared the accuracy of ggml-base.bin and ggml-medium.bin on these three audio files.

# Use ggml-medium.bin, with the language set to Chinese
~/whisper.cpp/build/bin/whisper-cli -m ~/ggml-medium.bin -t 12 -f ~/1.wav ~/2.wav ~/3.wav -l zh

# Use ggml-medium.bin, with the language set to Chinese, and translate to English
~/whisper.cpp/build/bin/whisper-cli -m ~/ggml-medium.bin -t 12 -f ~/1.wav ~/2.wav ~/3.wav -l zh --translate

# Use ggml-base.bin, with the language set to Chinese
~/whisper.cpp/build/bin/whisper-cli -m ~/ggml-base.bin -t 12 -f ~/1.wav ~/2.wav ~/3.wav -l zh

# Use ggml-base.bin, with the language set to Chinese, and translate to English
~/whisper.cpp/build/bin/whisper-cli -m ~/ggml-base.bin -t 12 -f ~/1.wav ~/2.wav ~/3.wav -l zh --translate
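
The four commands above are the combinations of two models with and without translation. As a sketch, they can be generated from a single helper; `whisper_cli_cmd` is a hypothetical name, and it uses only the flags that appear in the commands above.

```python
import shlex
from pathlib import Path


def whisper_cli_cmd(cli, model, files, language="zh", translate=False, threads=12):
    """Build a whisper-cli argument list from the flags shown above."""
    cmd = [str(cli), "-m", str(model), "-t", str(threads), "-l", language]
    if translate:
        cmd.append("--translate")
    cmd += ["-f", *map(str, files)]
    return cmd


if __name__ == "__main__":
    home = Path.home()
    cli = home / "whisper.cpp/build/bin/whisper-cli"
    wavs = [home / f"{i}.wav" for i in (1, 2, 3)]
    for model in ("ggml-medium.bin", "ggml-base.bin"):
        for translate in (False, True):
            cmd = whisper_cli_cmd(cli, home / model, wavs, translate=translate)
            # Printed only; use subprocess.run(cmd, check=True) to execute
            print(shlex.join(cmd))
```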

The results with ggml-medium.bin are as follows:
[Figure: whisper-cli output with ggml-medium.bin]

The results with ggml-base.bin are as follows; accuracy is much lower:
[Figure: whisper-cli output with ggml-base.bin]
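
To put a number on the accuracy difference, one option is the character error rate (CER): the Levenshtein edit distance between the recognized text and a reference transcript, divided by the reference length. The following is a minimal sketch; the reference and hypothesis strings are made-up placeholders, not the contents of the test audio.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance (insert, delete, substitute each cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # delete a reference character
                curr[j - 1] + 1,         # insert a hypothesis character
                prev[j - 1] + (r != h),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate relative to the reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    reference = "今天天气很好"   # placeholder reference transcript
    hypothesis = "今天天汽很好"  # placeholder recognizer output
    print(f"CER = {cer(reference, hypothesis):.3f}")
```

For Chinese, character-level comparison avoids the need for word segmentation; for English output a word-level variant would be the usual choice.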

Running a model with whisper-server

The following command starts whisper-server, the backend program shipped with whisper.cpp, which exposes speech-to-text as an API service:

~/whisper.cpp/build/bin/whisper-server -m ~/ggml-medium.bin -t 12 --host 0.0.0.0 --port 8080

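Open http://<server-ip>:8080 in a browser, select an audio file, and submit it. After a short wait the transcribed text is displayed:
[Figure: whisper-server web page showing the transcription result]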
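
Besides the web page, the server can also be called from a script. The sketch below assumes the server exposes the `/inference` endpoint described in the whisper.cpp server example, accepting a multipart form with a `file` field and an optional `response_format` field; verify this against the README of the version you built. `encode_multipart` and `transcribe` are hypothetical helper names, and only the Python standard library is used.

```python
import urllib.request
import uuid


def encode_multipart(fields: dict, file_field: str, filename: str, data: bytes):
    """Encode form fields plus one file as multipart/form-data."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"'
            f"\r\n\r\n{value}\r\n".encode()
        )
    parts.append(
        f'--{boundary}\r\nContent-Disposition: form-data; name="{file_field}"; '
        f'filename="{filename}"\r\nContent-Type: application/octet-stream'
        "\r\n\r\n".encode() + data + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def transcribe(server: str, wav_path: str) -> str:
    """POST a WAV file to the (assumed) /inference endpoint; return the raw body."""
    with open(wav_path, "rb") as f:
        body, content_type = encode_multipart(
            {"response_format": "json"}, "file", "audio.wav", f.read()
        )
    req = urllib.request.Request(
        f"{server}/inference", data=body, headers={"Content-Type": content_type}
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")
```

A call such as `print(transcribe("http://<server-ip>:8080", "1.wav"))` would then print whatever the server returns; the response is left unparsed here because its exact structure is not confirmed by this article.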

Running a model with whisper-stream

With this approach, connect a microphone to the Orion O6 and run the following command:

./build/bin/whisper-stream -m ~/ggml-medium.bin -t 12 --step 500 --length 5000 -l zh

In my test the results were poor, and I did not spend much time tuning it.
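
For reference, `--step` and `--length` are given in milliseconds. Assuming the 16 kHz sample rate that Whisper works with, the values used above translate into sample counts as sketched below. If the tool runs one inference per step over a window of up to `--length` of audio (an assumption about its behavior, not verified here), each inference has to finish in roughly the step time to keep up, which is a tight budget for a medium-sized model.

```python
SAMPLE_RATE = 16000  # Hz, assumed input rate


def ms_to_samples(ms: int, sample_rate: int = SAMPLE_RATE) -> int:
    """Convert a duration in milliseconds to a number of samples."""
    return ms * sample_rate // 1000


if __name__ == "__main__":
    step = ms_to_samples(500)     # new audio consumed per step
    length = ms_to_samples(5000)  # size of the audio window
    print(f"step = {step} samples, window = {length} samples")
    print(f"steps per window = {length // step}")
```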

4 - Other tests

Generating audio with llama-tts for whisper.cpp to transcribe

The test commands are:

# Download the models; in my tests OuteTTS-0.3-500M-FP16.gguf works much better
wget https://huggingface.co/OuteAI/OuteTTS-0.2-500M-GGUF/resolve/main/OuteTTS-0.2-500M-Q4_K_M.gguf
wget https://huggingface.co/OuteAI/OuteTTS-0.3-500M-GGUF/resolve/main/OuteTTS-0.3-500M-FP16.gguf
wget https://huggingface.co/ggml-org/WavTokenizer/resolve/main/WavTokenizer-Large-75-F16.gguf


# Generate audio. A Chinese prompt failed to convert and I did not spend time debugging it, so I switched to English
~/llama.cpp/build_cpu_kleidiai_with_vulkan/bin/llama-tts -m ~/OuteTTS-0.3-500M-FP16.gguf -mv ~/WavTokenizer-Large-75-F16.gguf -o ~/output.wav -p "Hello! This is a sample text for generating English audio. It includes different sentence structures and common words to test the TTS system's pronunciation and rhythm. Let's see how natural the generated voice sounds."

# Transcribe the audio to text
~/whisper.cpp/build/bin/whisper-cli -f ~/output.wav -m ~/ggml-base.bin -t 12 -l en
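
When downloading from Hugging Face with wget, a URL containing `/blob/` returns the HTML page for the file, while `/resolve/` returns the file itself. A quick sanity check, sketched here, is to look at the first four bytes: GGUF files start with the ASCII magic `GGUF`. The helper name `is_gguf` is hypothetical.

```python
import sys


def is_gguf(path: str) -> bool:
    """Return True if the file starts with the GGUF magic bytes."""
    with open(path, "rb") as f:
        return f.read(4) == b"GGUF"


if __name__ == "__main__":
    # Pass one or more downloaded .gguf paths on the command line
    for path in sys.argv[1:]:
        print(path, "looks like GGUF" if is_gguf(path) else "is NOT a GGUF file")
```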

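Trying continuous audio transcription

Using code suggested by an AI assistant, I briefly tried continuous audio transcription. The results were not good; the code is recorded here:

Server program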

# server.py
import asyncio
import websockets
import subprocess
import tempfile
import os
import json

WHISPER_BIN = "./whisper.cpp/build/bin/whisper-cli"  # path to the whisper.cpp executable
MODEL_PATH = "./ggml-medium.bin"

CHUNK_DURATION = 2.0   # run recognition on every 2 seconds of accumulated audio
SAMPLE_RATE = 16000
SAMPLE_SIZE = 2  # 16-bit

CHUNK_SIZE = int(SAMPLE_RATE * SAMPLE_SIZE * CHUNK_DURATION)  # bytes per chunk

async def handle_client(websocket):
    print("[Server] Client connected.")
    audio_buffer = bytearray()
    chunk_index = 0

    async for message in websocket:
        if isinstance(message, bytes):
            audio_buffer.extend(message)

            # Process every complete chunk; a single message may hold several
            while len(audio_buffer) >= CHUNK_SIZE:
                chunk_index += 1
                await process_audio_chunk(websocket, bytes(audio_buffer[:CHUNK_SIZE]), chunk_index)
                del audio_buffer[:CHUNK_SIZE]

        elif isinstance(message, str):
            data = json.loads(message)
            if "eof" in data:
                # Process whatever audio is left
                if len(audio_buffer) > 0:
                    chunk_index += 1
                    await process_audio_chunk(websocket, bytes(audio_buffer), chunk_index)

                await websocket.send(json.dumps({"type": "done"}))
                break

async def process_audio_chunk(websocket, audio_data, chunk_index):
    """Convert raw PCM data via sox and transcribe with whisper.cpp."""
    tmp_raw = tempfile.NamedTemporaryFile(delete=False, suffix=".raw")
    tmp_wav = tempfile.NamedTemporaryFile(delete=False, suffix=".wav")

    try:
        tmp_raw.write(audio_data)
        tmp_raw.flush()
        tmp_raw.close()

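        # Convert the raw PCM to WAV with sox. Run it in a worker thread
        # (Python 3.9+) so the event loop keeps servicing the WebSocket.
        sox_cmd = [
            "sox",
            "-t", "raw",
            "-r", str(SAMPLE_RATE),
            "-e", "signed",
            "-b", "16",
            "-c", "1",
            tmp_raw.name,
            tmp_wav.name
        ]
        await asyncio.to_thread(subprocess.run, sox_cmd, check=True)

        # Run whisper.cpp on the converted file; a non-zero exit status
        # raises and is reported to the client by the handler below
        whisper_cmd = [
            WHISPER_BIN,
            "-m", MODEL_PATH,
            "-f", tmp_wav.name,
            "--no-timestamps",
            "--language", "zh"
        ]
        result = await asyncio.to_thread(
            subprocess.run, whisper_cmd,
            stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True, check=True
        )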

        text = result.stdout.strip()
        print(f"[Chunk {chunk_index}] Recognized: {text}")

        await websocket.send(json.dumps({
            "type": "partial",
            "chunk": chunk_index,
            "text": text
        }))

    except Exception as e:
        await websocket.send(json.dumps({
            "type": "error",
            "error": str(e)
        }))
        print("[Error]", e)

    finally:
        tmp_raw.close()
        tmp_wav.close()
        os.remove(tmp_raw.name)
        os.remove(tmp_wav.name)

async def main():
    async with websockets.serve(handle_client, "0.0.0.0", 8080, max_size=None):
        print("[Server] Realtime Whisper WebSocket Server running at ws://0.0.0.0:8080")
        await asyncio.Future()

if __name__ == "__main__":
    asyncio.run(main())
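
The sox step only wraps raw PCM in a WAV header, so the same result can be obtained with Python's standard `wave` module, which avoids the external dependency. This is a minimal sketch assuming the same 16 kHz, 16-bit, mono format as the server above; `pcm16_to_wav` is a hypothetical helper, not part of the original program.

```python
import wave


def pcm16_to_wav(pcm: bytes, wav_path: str, sample_rate: int = 16000, channels: int = 1) -> None:
    """Write raw signed 16-bit little-endian PCM to a WAV file."""
    with wave.open(wav_path, "wb") as wf:
        wf.setnchannels(channels)
        wf.setsampwidth(2)  # 2 bytes per sample = 16-bit
        wf.setframerate(sample_rate)
        wf.writeframes(pcm)


if __name__ == "__main__":
    # 2 seconds of silence: 16000 samples/s * 2 bytes * 2 s = 64000 bytes,
    # the same size as one CHUNK_SIZE chunk in the server
    pcm16_to_wav(b"\x00\x00" * 32000, "silence.wav")
```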

Client program

import asyncio
import sounddevice as sd
import websockets
import numpy as np
import json

SAMPLE_RATE = 16000
BLOCK_DUR = 5  # seconds

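async def stream_audio():
    uri = "ws://192.168.6.47:8080"
    q = asyncio.Queue()
    loop = asyncio.get_running_loop()

    def callback(indata, frames, time, status):
        # Convert float32 samples to 16-bit PCM, clipping to avoid overflow
        pcm16 = (np.clip(indata, -1.0, 1.0) * 32767).astype(np.int16)
        # This callback runs on the audio thread and asyncio.Queue is not
        # thread-safe, so hand the data over through the event loop
        loop.call_soon_threadsafe(q.put_nowait, pcm16.tobytes())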

    async with websockets.connect(uri) as ws:
        async def sender():
            while True:
                data = await q.get()
                await ws.send(data)

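        async def receiver():
            async for msg in ws:
                try:
                    print(">>", json.loads(msg)["text"])
                except (json.JSONDecodeError, KeyError):
                    # Not a transcription message (e.g. "done" or "error")
                    print(msg)

        # Keep a reference so the task is not garbage-collected
        sender_task = asyncio.create_task(sender())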

        with sd.InputStream(samplerate=SAMPLE_RATE, channels=1, dtype="float32", callback=callback, blocksize=int(SAMPLE_RATE * BLOCK_DUR)):
            print("🎙️ Start talking...")
            await receiver()

asyncio.run(stream_audio())
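
The sounddevice callback runs on a separate audio thread, while `asyncio.Queue` is not thread-safe. A common pattern for passing data from such a thread into a coroutine is `loop.call_soon_threadsafe`. The sketch below demonstrates the pattern in isolation, with a plain thread standing in for the audio callback; the names `demo` and `producer` are hypothetical.

```python
import asyncio
import threading


async def demo() -> list:
    loop = asyncio.get_running_loop()
    q: asyncio.Queue = asyncio.Queue()

    def producer():
        # Stands in for the audio callback, which runs outside the event loop
        for i in range(3):
            loop.call_soon_threadsafe(q.put_nowait, bytes([i]))

    thread = threading.Thread(target=producer)
    thread.start()
    # Each get() is woken by the event loop once an item has been queued
    received = [await q.get() for _ in range(3)]
    thread.join()
    return received


if __name__ == "__main__":
    print(asyncio.run(demo()))
```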

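Performance issues

Whether the CPU or the GPU is used as the backend, whisper.cpp does not perform well on the Orion O6. I also tried the onnx_whisper_medium_multilingual model from CIX's ai_model_hub model library; it is also fairly slow with both CPU and NPU inference.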
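
One way to quantify how slow a backend is uses the real-time factor (RTF): processing time divided by audio duration, where a value above 1 means transcription cannot keep up with real time. The following is a minimal sketch; `real_time_factor` and `timed` are hypothetical helpers, and no measured numbers are implied.

```python
import time


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time divided by audio duration."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def timed(func, *args, **kwargs):
    """Call func and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start
```

For example, wrapping a `subprocess.run` call to whisper-cli in `timed` and dividing the elapsed time by the WAV duration gives one RTF value per model and backend, which makes the CPU, GPU, and NPU runs directly comparable.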

"星睿O6" AI PC development kit review: deploying a speech recognition model with whisper.cpp