FunASR is an industrial-grade speech recognition toolkit open-sourced by Alibaba DAMO Academy, and its SenseVoiceSmall model performs particularly well on Chinese speech recognition. The system integrates three core modules, voice activity detection (VAD), automatic speech recognition (ASR), and punctuation restoration, into a complete speech-to-text pipeline.
In practical use, we have found the system to offer several notable advantages.
Tip: although the SenseVoiceSmall model is compact, it still recognizes speech well in noisy environments, thanks to its FSMN (Feedforward Sequential Memory Network) structure, which models long-range temporal features effectively.
Python 3.8-3.10 is recommended; newer Python versions may cause dependency conflicts. The full installation steps:
```bash
# Create a virtual environment (recommended)
python -m venv funasr_env
source funasr_env/bin/activate   # Linux/Mac
funasr_env\Scripts\activate      # Windows

# Install core dependencies
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu118  # CUDA 11.8
pip install funasr_onnx soundfile fastapi uvicorn
```
For GPU acceleration, make sure the matching CUDA driver is installed; running `nvidia-smi` verifies that CUDA is available.
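As a quick sanity check from Python, you can confirm the driver tooling is present before installing the CUDA wheel (a minimal sketch of ours; it only checks for `nvidia-smi` on PATH, not that PyTorch can actually see the GPU — for that, run `torch.cuda.is_available()` after installing torch):

```python
import shutil

def cuda_driver_visible() -> bool:
    """Rough check: is the NVIDIA driver toolchain on PATH?

    Verifies only that `nvidia-smi` is installed; it does not
    prove that PyTorch can use the GPU.
    """
    return shutil.which("nvidia-smi") is not None
```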
FunASR supports two ways of loading a model: by model ID with automatic download, or from a manually downloaded local directory.
```python
from funasr_onnx import SenseVoiceSmall

# Load by model ID; the files are downloaded automatically on first use
model = SenseVoiceSmall("iic/SenseVoiceSmall")
```
```bash
# Download the model files manually
git clone https://www.modelscope.cn/datasets/modelscope/FunASR.git
mv FunASR/models/iic/SenseVoiceSmall /path/to/model_dir
```
Key parameters can be configured at initialization:
```python
model = SenseVoiceSmall(
    model_dir="path/to/model",
    batch_size=10,   # batch size; affects memory usage
    quantize=True,   # enable quantization to reduce memory consumption
    device="cuda",   # use GPU acceleration
)
```
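To illustrate what `batch_size` controls, here is a sketch of how a list of audio files could be grouped before being fed to the model in batches (the helper `make_batches` is ours, not part of FunASR):

```python
def make_batches(items, batch_size):
    """Split a list of audio paths into batches of at most batch_size.

    Larger batches raise throughput at the cost of peak memory; the
    final batch may be smaller than batch_size.
    """
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]
```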
The FSMN-VAD model uses a frame-based speech detection algorithm: each frame of the signal is classified as speech or non-speech, and consecutive speech frames are merged into timestamped segments.
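The frame-based idea can be illustrated with a toy energy-threshold detector (illustrative only; FSMN-VAD uses a learned neural classifier per frame, not a fixed threshold):

```python
def energy_vad(samples, frame_len=160, threshold=0.02):
    """Toy frame-based VAD: mark each frame as speech if its mean
    absolute amplitude exceeds a fixed threshold."""
    flags = []
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[i:i + frame_len]
        energy = sum(abs(x) for x in frame) / frame_len
        flags.append(energy > threshold)
    return flags
```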
Typical usage:
```python
from funasr_onnx import Fsmn_vad

vad_model = Fsmn_vad("damo/speech_fsmn_vad_zh-cn-16k-common-pytorch")
vad_results = vad_model("audio.wav")  # returns [[[start_ms, end_ms], ...]]
```
Note: the VAD is sensitive to low signal-to-noise ratios (SNR < 15 dB); use it on clean recordings or combine it with denoising preprocessing.
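The millisecond timestamps the VAD returns map to sample indices as follows (the helper name is ours):

```python
def ms_segments_to_samples(segments, sr):
    """Convert VAD output of the form [[start_ms, end_ms], ...] into
    (start_sample, end_sample) slices for a signal sampled at sr Hz."""
    return [(int(s * sr / 1000), int(e * sr / 1000)) for s, e in segments]
```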
Key parameters when invoking the SenseVoiceSmall model:
```python
result = model(
    audio_input,            # a file path, or a list of paths/URLs
    language="auto",        # detect the language automatically
    use_itn=True,           # normalize numbers, dates, etc.
    hotwords=["特定术语"],   # boost recognition of domain terms
)
```
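To give a feel for what inverse text normalization (`use_itn`) does, here is a toy version of ours that rewrites runs of spoken Chinese digits as Arabic numerals; the model's built-in ITN covers far more patterns (dates, amounts, units):

```python
import re

_DIGITS = {"零": "0", "一": "1", "二": "2", "三": "3", "四": "4",
           "五": "5", "六": "6", "七": "7", "八": "8", "九": "9"}

def toy_itn(text):
    """Illustrative ITN: rewrite runs of two or more spoken Chinese
    digits (e.g. a year read digit by digit) as Arabic numerals."""
    def repl(m):
        return "".join(_DIGITS[c] for c in m.group(0))
    return re.sub("[" + "".join(_DIGITS) + "]{2,}", repl, text)
```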
The CT-Transformer model performs context-aware punctuation prediction.
Usage example:
```python
from funasr_onnx import CT_Transformer_VadRealtime

punc_model = CT_Transformer_VadRealtime(
    "damo/punc_ct-transformer_zh-cn-common-vad_realtime-vocab272727"
)

param_dict = {"cache": []}  # streaming cache carried across calls
text = "今天天气真好"
punctuated = punc_model(text, param_dict=param_dict)[0]  # "今天天气真好。"
```
Below is a file-processing class that ties the three models together:
```python
import soundfile as sf
from pathlib import Path
from tempfile import NamedTemporaryFile
from typing import Optional

from funasr_onnx import Fsmn_vad, SenseVoiceSmall, CT_Transformer_VadRealtime


class FileASR:
    def __init__(self,
                 vad_model: Optional[str] = None,
                 asr_model: Optional[str] = None,
                 punc_model: Optional[str] = None):
        """
        Initialize the ASR pipeline.

        Args:
            vad_model: VAD model path; defaults to the official model
            asr_model: ASR model path; defaults to SenseVoiceSmall
            punc_model: punctuation model path
        """
        self.vad_model = Fsmn_vad(vad_model or "damo/speech_fsmn_vad_zh-cn-16k-common-pytorch")
        self.asr_model = SenseVoiceSmall(
            asr_model or "iic/SenseVoiceSmall",
            quantize=True
        )
        self.punc_model = CT_Transformer_VadRealtime(
            punc_model or "damo/punc_ct-transformer_zh-cn-common-vad_realtime-vocab272727"
        )

    def process_file(self,
                     audio_path: str,
                     output_dir: Optional[str] = None,
                     language: str = "auto",
                     use_itn: bool = True) -> str:
        """
        Process a single audio file.

        Args:
            audio_path: path to the audio file
            output_dir: directory for saving segment audio (optional)
            language: language hint forwarded to the ASR model
            use_itn: enable inverse text normalization

        Returns:
            The full recognized text.
        """
        audio, sr = sf.read(audio_path)
        vad_segments = self.vad_model(audio_path)[0]

        texts = []
        for i, (start, end) in enumerate(vad_segments):
            segment = audio[int(start * sr / 1000):int(end * sr / 1000)]
            if output_dir:
                seg_path = Path(output_dir) / f"seg_{i}.wav"
                sf.write(seg_path, segment, sr)
                text = self.asr_model([str(seg_path)], language=language, use_itn=use_itn)[0]
            else:
                # In-memory mode: write the segment to a temporary file
                with NamedTemporaryFile(suffix=".wav") as tmp:
                    sf.write(tmp.name, segment, sr)
                    text = self.asr_model([tmp.name], language=language, use_itn=use_itn)[0]
            texts.append(text)

        # Punctuation restoration, carrying the streaming cache across segments
        final_text = ""
        cache = {"cache": []}
        for t in texts:
            final_text += self.punc_model(t, param_dict=cache)[0]
        return final_text
```
For production deployment, wrap the pipeline in a web service so the models are loaded once at startup. The API service code:
```python
import tempfile
from pathlib import Path

from fastapi import FastAPI, UploadFile, HTTPException
from fastapi.middleware.cors import CORSMiddleware

app = FastAPI(title="ASR API")

# Allow cross-origin requests
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_methods=["*"],
    allow_headers=["*"],
)

# The FileASR pipeline defined above, loaded once at startup
asr_service = FileASR()


@app.post("/transcribe")
async def transcribe_audio(
    file: UploadFile,
    language: str = "auto",
    use_itn: bool = True
):
    if not file.content_type.startswith("audio/"):
        raise HTTPException(400, "Only audio files are supported")
    tmp_path = None
    try:
        # Save the upload to a temporary file
        with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
            content = await file.read()
            tmp.write(content)
            tmp_path = tmp.name
        # Run recognition
        result = asr_service.process_file(
            tmp_path,
            language=language,
            use_itn=use_itn
        )
        return {"text": result}
    finally:
        if tmp_path:
            Path(tmp_path).unlink(missing_ok=True)
```
Launch command:
```bash
gunicorn -w 4 -k uvicorn.workers.UvicornWorker asr_api:app --bind 0.0.0.0:8000
```
Test environment: Intel Xeon 2.4 GHz, RTX 3090, 32 GB RAM

| Operation | Time (s) | Memory (MB) |
|---|---|---|
| VAD (1 h of audio) | 12.3 | 520 |
| ASR (1 h of audio) | 86.7 | 1800 |
| Punctuation (10k characters) | 1.2 | 350 |
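From these numbers, recognition runs far faster than real time; the real-time factor of the ASR pass works out to roughly 0.024:

```python
def real_time_factor(process_seconds, audio_seconds):
    """Real-time factor: processing time divided by audio duration.
    Values below 1.0 mean faster than real time."""
    return process_seconds / audio_seconds

# ASR pass from the table: 86.7 s to process 1 hour of audio
rtf = real_time_factor(86.7, 3600)
```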
Problem 1: garbled recognition output. This usually indicates a sample-rate mismatch with the model's expected 16 kHz; resample first:

```python
import librosa

audio, sr = librosa.load("audio.wav", sr=16000)  # force-resample to 16 kHz
```
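Before resampling, you can confirm a file's actual sample rate with the standard library (a small helper of ours; it only reads the WAV header):

```python
import wave

def wav_sample_rate(path):
    """Return the sample rate recorded in a WAV file's header,
    useful to check that audio is 16 kHz before inference."""
    with wave.open(path, "rb") as w:
        return w.getframerate()
```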
Problem 2: out-of-memory errors on long audio. Reduce the batch size or enable quantization:

```python
# Reduce the batch size
model = SenseVoiceSmall("iic/SenseVoiceSmall", batch_size=2)
# Enable quantization
model = SenseVoiceSmall("iic/SenseVoiceSmall", quantize=True)
```
Problem 3: poor recognition of domain-specific terms. Pass hotwords:

```python
result = model(
    audio_files,
    hotwords=["神经网络", "Transformer", "激活函数"]
)
```
For noisy recordings, a denoising pass before recognition can help. A sketch using the `denoiser` package (check its documentation for the exact `enhance` API; `model` is an initialized SenseVoiceSmall instance):

```python
import tempfile

import torchaudio
from denoiser import enhance  # verify against the denoiser package docs

def process_noisy_audio(path):
    audio, sr = torchaudio.load(path)
    enhanced = enhance(audio, sr)
    with tempfile.NamedTemporaryFile(suffix=".wav") as tmp:
        torchaudio.save(tmp.name, enhanced, sr)
        return model([tmp.name])
```
Rule-based postprocessing can patch recurring recognition errors:

```python
def postprocess(text):
    # Map recurring mis-recognitions to the intended terms
    # (the entries below are illustrative placeholders)
    corrections = {
        "鱼音识别": "语音识别",
        "ASR 系统": "ASR系统",
    }
    for wrong, right in corrections.items():
        text = text.replace(wrong, right)
    return text
```
For real-time scenarios, streaming recognition processes audio chunk by chunk (`audio_stream` stands for any iterable source of audio chunks, e.g. a microphone callback):

```python
from funasr_onnx import SpeechStreamingASR

stream_asr = SpeechStreamingASR(
    "iic/SenseVoiceSmall-streaming",
    chunk_size=1600  # samples per chunk; 1600 samples ≈ 100 ms at 16 kHz
)

for audio_chunk in audio_stream:
    text = stream_asr(audio_chunk)
    print(text, end="", flush=True)
```
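For testing without a microphone, a recorded signal can be replayed as a stream (the `chunk_stream` helper is ours):

```python
def chunk_stream(samples, chunk_size):
    """Yield fixed-size chunks from an in-memory signal, simulating a
    live audio stream; the final partial chunk is zero-padded."""
    for i in range(0, len(samples), chunk_size):
        chunk = samples[i:i + chunk_size]
        if len(chunk) < chunk_size:
            chunk = chunk + [0.0] * (chunk_size - len(chunk))
        yield chunk
```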