WenetSpeech-Wu:

Datasets, Benchmarks, and Models for a Unified

Chinese Wu Dialect Speech Processing Ecosystem

Chengyou Wang1*, Mingchen Shao1*, Jingbin Hu1*, Zeyu Zhu1*, Hongfei Xue1,
Bingshen Mu1, Xin Xu2, Xingyi Duan6, Binbin Zhang3,
Pengcheng Zhu3, Chuang Ding4, Xiaojun Zhang5, Hui Bu2, Lei Xie1†

1 Audio, Speech and Language Processing Group (ASLP@NPU),
Northwestern Polytechnical University
2 Beijing AISHELL Technology Co., Ltd.
3 WeNet Open Source Community
4 Moonstep AI
5 Xi’an Jiaotong-Liverpool University
6 YK Pao School

Abstract
Speech processing for low-resource dialects remains a fundamental challenge in developing inclusive and robust speech technologies. Despite its linguistic significance and large speaker population, the Wu dialect of Chinese has long been hindered by the lack of large-scale speech data, standardized evaluation benchmarks, and publicly available models. In this work, we present WenetSpeech-Wu, the first large-scale, multi-dimensionally annotated open-source speech corpus for the Wu dialect, comprising approximately 8,000 hours of diverse speech data. Building upon this dataset, we introduce WenetSpeech-Wu-Bench, the first standardized and publicly accessible benchmark for systematic evaluation of Wu dialect speech processing, covering automatic speech recognition (ASR), Wu-to-Mandarin translation, speaker attribute prediction, speech emotion recognition, text-to-speech (TTS) synthesis, and instruction-following TTS (instruct TTS). Furthermore, we release a suite of strong open-source models trained on WenetSpeech-Wu, establishing competitive performance across multiple tasks and empirically validating the effectiveness of the proposed dataset. Together, these contributions lay the foundation for a comprehensive Wu dialect speech processing ecosystem, and we open-source the proposed datasets, benchmarks, and models to support future research on dialectal speech intelligence.

Demo Video

Data Overview

Figure 1. Data construction pipeline for WenetSpeech-Wu.

Data Samples

呃大灰狼就跟山羊奶奶讲山羊奶奶侬一家头蹲阿拉决定拿这点物事侪送拨侬
Mandarin Translation:呃,大灰狼就帮山羊奶奶说,山羊奶奶,你们一家子都蹲着,我们决定拿这些物件一起送给你们。
Confidence: 0.842 | Gender: Female
DNSMOS: 0.81 | SNR: 6.90

胖胖又得意了啥人会得想到玩具汽车里头还囥了物事呢
Mandarin Translation:胖胖又得意了,谁会想到玩具汽车里头还藏了东西呢
Confidence: 0.800 | Gender: Female
DNSMOS: 2.89 | SNR: 32.48

这物事里头是有利益分配的讲好个埃种大生意难做一趟做两三年也做不出的
Mandarin Translation:这件事里头是有利益分配的,讲好这种大生意难做,一趟做两三年也做不出的。
Confidence: 0.800 | Gender: Male
DNSMOS: 2.71 | SNR: 44.49

这个新生儿啊相对来讲偏少大家侪不愿意生嘛
Mandarin Translation:这个新生儿的数量相对较少,大家都不愿意生。
Confidence: 0.876 | Gender: Male
DNSMOS: 3.00 | SNR: 47.81

这自然应该是像上海大都市这能介告诉伊虽然伊同样是外来的闲话
Mandarin Translation:这自然应该是像上海这样的大都市,可以告诉对方,虽然对方同样是外来闲话。
Confidence: 0.818 | Gender: Male
DNSMOS: 2.95 | SNR: 34.59

已经有西南亚洲的外国人居住辣辣埃及从事贸易活动
Mandarin Translation:已经有西南亚的外国人居住在埃及从事贸易活动
Confidence: 0.846 | Gender: Male
DNSMOS: 2.63 | SNR: 38.10

青春的舞龙唱出短暂的曲子的清风里后世
Mandarin Translation:青春的舞龙在清风中唱出短暂的曲子,流传后世
Confidence: 0.825 | Gender: Female
DNSMOS: 1.37 | SNR: 23.99

肠道菌群也就是阿拉肠道当中不同种类的细菌等微生物会的影响大脑的健康
Mandarin Translation:肠道菌群也就是我们肠道中不同种类的细菌等微生物会对大脑健康产生影响。
Confidence: 0.871 | Gender: Male
DNSMOS: 3.25 | SNR: 46.22

呃对伐现在实际上是新上海人越来越多了外加未来我觉着这群新上海人会得取代脱阿拉
Mandarin Translation:嗯,对吧?现在实际上新上海人越来越多了,再加上未来我觉得这群新上海人会取代我们。
Confidence: 0.814 | Gender: Male
DNSMOS: 2.84 | SNR: 11.34

有搿种爷娘对伐但是我觉着现在好像就讲上海哦现在勿是侪讲房子也没人住嘛外国人跑得一批还有就是叫低生育率帮低结婚率嗯
Mandarin Translation:有这样的父母对吧?但我感觉现在好像就讲上海了,现在不是都讲房子也没人住嘛,外国人跑得一批,还有就是低生育率和低结婚率嗯。
Confidence: 0.900 | Gender: Female
DNSMOS: 3.14 | SNR: 33.88

当侬老了一个人头发花白坐辣盖落花旁边轻轻的从书架上面取下一本书来慢慢叫的阅读
Mandarin Translation:当你老了,头发花白,坐在落花旁边,轻轻地从书架上取下一本书来慢慢阅读。
Confidence: 0.822 | Gender: Female
DNSMOS: 2.57 | SNR: 50.29

伴着夕阳的余晖一切侪是最美好的样子
Mandarin Translation:伴着夕阳的余晖,一切都在最美的样子
Confidence: 0.858 | Gender: Female
DNSMOS: 3.32 | SNR: 44.57

勿晓得个呀老早勿是讲旧社会个辰光嘛搿种流氓阿了
Mandarin Translation:不知道呀,早就不是讲旧社会那种流氓了。
Confidence: 0.804 | Gender: Male
DNSMOS: 3.40 | SNR: 31.13

观众朋友们就是教个小诀窍就是屋里向大家一直拌馄饨芯子啊
Mandarin Translation:观众朋友们,这里教一个小诀窍,就是屋里往大家手里一直拌馄饨芯子啊。
Confidence: 0.943 | Gender: Male
DNSMOS: 3.44 | SNR: 37.91

哦对的对的侬讲了对的哎哟这小米侬还是侬脑子好
Mandarin Translation:哦,对的对的,你讲的是对的。哎哟,这小米你还是你脑子好。
Confidence: 0.909 | Gender: Female
DNSMOS: 3.60 | SNR: 42.27

嗯沿海各地包括㑚南翔连是日本海的前头一个费城
Mandarin Translation:嗯,沿海各地包括上海南翔一带,是日本海前方一个费城。
Confidence: 0.844 | Gender: Male
DNSMOS: 3.27 | SNR: 25.60

侬就没命了为了不叫类似的事体再发生张晨
Mandarin Translation:你没命了,为了不让类似的事情再发生,张晨。
Confidence: 0.843 | Gender: Male
DNSMOS: 3.25 | SNR: 48.05

其实这两年我也就是行尸走肉因为老婆没了
Mandarin Translation:其实这两年我也就是行尸走肉,因为老婆没了。
Confidence: 0.821 | Gender: Male
DNSMOS: 3.43 | SNR: 26.43

对的呀末伊拉这评论里向有种侬要讲一个人真个红了对勿啦就讲侬粉丝超过一万了嘛侬这种黑粉丝多
Mandarin Translation:对的,他们这评论里有种说法,说一个人真的火了,对吧?就讲你的粉丝超过一万了,你这种黑粉丝还挺多的。
Confidence: 0.844 | Gender: Male
DNSMOS: 2.42 | SNR: 17.01

正常保养电池呃电瓶啊搿种轮胎啊还有
Mandarin Translation:正常保养电池、电瓶以及这种轮胎等
Confidence: 0.841 | Gender: Male
DNSMOS: 3.68 | SNR: 26.10
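
Each sample above carries a transcript, a Mandarin translation, and four metadata fields (Confidence, Gender, DNSMOS, SNR). The following is a minimal sketch of how such records could be parsed and filtered by audio quality. It assumes the four-line text layout shown on this page, not the released corpus format; the `Sample` class, the function names, and the thresholds are hypothetical and chosen only for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class Sample:
    text: str          # Wu transcript
    translation: str   # Mandarin translation
    confidence: float  # transcript confidence
    gender: str
    dnsmos: float      # DNSMOS quality score
    snr: float         # signal-to-noise ratio


# Matches "Key: value" pairs separated by "|" on a metadata line.
_FIELD = re.compile(r"(\w+):\s*([^|]+)")
# Accepts either an ASCII or a full-width colon after the label.
_PREFIX = re.compile(r"^Mandarin Translation\s*[::]\s*")


def parse_sample(block: str) -> Sample:
    """Parse one four-line sample block as displayed on this page."""
    lines = [ln.strip() for ln in block.strip().splitlines() if ln.strip()]
    if len(lines) < 4:
        raise ValueError("expected transcript, translation, and two metadata lines")
    fields = {}
    for ln in lines[2:]:
        for key, value in _FIELD.findall(ln):
            fields[key] = value.strip()
    return Sample(
        text=lines[0],
        translation=_PREFIX.sub("", lines[1]),
        confidence=float(fields["Confidence"]),
        gender=fields["Gender"],
        dnsmos=float(fields["DNSMOS"]),
        snr=float(fields["SNR"]),
    )


def filter_samples(samples, min_dnsmos, min_snr):
    """Keep samples whose DNSMOS and SNR both meet the given thresholds."""
    return [s for s in samples if s.dnsmos >= min_dnsmos and s.snr >= min_snr]


# Two blocks copied from the samples above.
BLOCKS = [
    """呃大灰狼就跟山羊奶奶讲山羊奶奶侬一家头蹲阿拉决定拿这点物事侪送拨侬
    Mandarin Translation:呃,大灰狼就帮山羊奶奶说,山羊奶奶,你们一家子都蹲着,我们决定拿这些物件一起送给你们。
    Confidence: 0.842 | Gender: Female
    DNSMOS: 0.81 | SNR: 6.90""",
    """胖胖又得意了啥人会得想到玩具汽车里头还囥了物事呢
    Mandarin Translation:胖胖又得意了,谁会想到玩具汽车里头还藏了东西呢
    Confidence: 0.800 | Gender: Female
    DNSMOS: 2.89 | SNR: 32.48""",
]

samples = [parse_sample(b) for b in BLOCKS]
# Hypothetical thresholds: keep only the cleaner recordings.
kept = filter_samples(samples, min_dnsmos=2.5, min_snr=20.0)
```

With these illustrative thresholds, only the second block passes the filter, since the first has both a low DNSMOS and a low SNR.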

ASR Leaderboard

ASR results (CER %) on various test sets. Gray, red, light green, and dark green rows denote open-source baselines, commercial models, ASR models trained on WenetSpeech-Wu, and annotation models trained on in-house data, respectively. Bold numbers indicate best results; underlined numbers indicate second-best results.

| Category | Model | In-House Dialogue CER (%) ↓ | In-House Reading CER (%) ↓ | WS-Wu-Bench ASR CER (%) ↓ |
|---|---|---|---|---|
| ASR Models | Paraformer | 63.13 | 66.85 | 64.92 |
| ASR Models | SenseVoice-small | 29.20 | 31.00 | 46.85 |
| ASR Models | Whisper-medium | 79.31 | 83.94 | 78.24 |
| ASR Models | FireRedASR-AED-L | 51.34 | 59.92 | 56.69 |
| ASR Models | Step-Audio2-mini | 24.27 | 24.01 | 26.72 |
| ASR Models | Qwen3-ASR | 23.96 | 24.13 | 29.31 |
| ASR Models | Tencent-Cloud-ASR | 23.25 | 25.26 | 29.48 |
| ASR Models | Gemini-2.5-pro | 85.50 | 84.67 | 89.99 |
| ASR Models | Conformer-U2pp-Wu | 15.20 | 12.24 | 15.14 |
| ASR Models | Whisper-medium-Wu | 14.19 | 11.09 | 14.33 |
| ASR Models | Step-Audio2-Wu-ASR | 8.68 | 7.86 | 12.85 |
| Annotation Models | Dolphin-small | 24.78 | 27.29 | 26.93 |
| Annotation Models | TeleASR | 29.07 | 21.18 | 30.81 |
| Annotation Models | Step-Audio2-FT | 8.02 | 6.14 | 15.64 |
| Annotation Models | Tele-CTC-FT | 11.90 | 7.23 | 23.85 |
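
The leaderboard reports character error rate (CER). As a reference for how this metric is commonly computed, here is a minimal sketch using character-level Levenshtein distance divided by the reference length. It is not the benchmark's official scoring script: any text normalization (punctuation removal, variant-character mapping) is omitted, and the function names are illustrative.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """CER in percent for a single utterance."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


def corpus_cer(pairs) -> float:
    """Corpus-level CER in percent: total edits over total reference characters."""
    pairs = list(pairs)
    total_ref = sum(len(ref) for ref, _ in pairs)
    if total_ref == 0:
        raise ValueError("references must contain at least one character")
    total_edits = sum(edit_distance(ref, hyp) for ref, hyp in pairs)
    return 100.0 * total_edits / total_ref
```

For example, a hypothesis that gets one character wrong in a six-character reference scores 100 × 1/6 ≈ 16.67% CER. Corpus-level scores pool edits over all utterances, so long utterances weigh more than short ones.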

Understanding Leaderboard

Speech understanding results on WenetSpeech-Wu-Bench. Gray and light green rows denote baseline and in-house models, respectively. Bold numbers indicate best results; underlined numbers indicate second-best results.

| Model | ASR CER (%) ↓ | AST BLEU (%) ↑ | Gender ACC ↑ | Age ACC ↑ | Emotion ACC ↑ |
|---|---|---|---|---|---|
| Qwen3-Omni | 44.27 | 33.31 | 0.977 | 0.541 | 0.667 |
| Step-Audio2-mini | 26.72 | 37.81 | 0.855 | 0.370 | 0.460 |
| Step-Audio2-Wu-Und | 13.23 | 53.13 | 0.956 | 0.729 | 0.712 |

TTS Leaderboard

TTS results on WenetSpeech-Wu-Bench. Bold and underlined values denote the best and second-best results, respectively; light green rows indicate models trained on WenetSpeech-Wu or further fine-tuned on an internal high-quality dataset. Columns prefixed "easy" and "hard" refer to the WS-Wu-Eval-TTS-easy and WS-Wu-Eval-TTS-hard test sets.

| Model | easy CER (%) ↓ | easy SIM ↑ | easy IMOS ↑ | easy SMOS ↑ | easy AMOS ↑ | hard CER (%) ↓ | hard SIM ↑ | hard IMOS ↑ | hard SMOS ↑ | hard AMOS ↑ |
|---|---|---|---|---|---|---|---|---|---|---|
| Qwen3-TTS | 5.95 | -- | 4.35 | -- | 4.19 | 16.45 | -- | 4.03 | -- | 3.91 |
| DiaMoE-TTS | 57.05 | 0.702 | 3.11 | 3.43 | 3.52 | 82.52 | 0.587 | 2.83 | 3.14 | 3.22 |
| CosyVoice2 | 10.33 | 0.713 | 3.83 | 3.71 | 3.84 | 82.49 | 0.618 | 3.24 | 3.42 | 3.37 |
| CosyVoice2-Wu-CPT | 6.35 | 0.727 | 4.01 | 3.84 | 3.92 | 32.97 | 0.620 | 3.72 | 3.55 | 3.63 |
| CosyVoice2-Wu-SFT | 6.19 | 0.726 | 4.32 | 3.78 | 4.11 | 25.00 | 0.601 | 3.96 | 3.48 | 3.76 |
| CosyVoice2-Wu-SS* | 5.42 | -- | 4.37 | -- | 4.21 | 15.45 | -- | 4.04 | -- | 3.88 |

Instruct TTS Leaderboard


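Performance of the instruct TTS model. Bold numbers indicate best results.

| Type | Metric ↑ | CosyVoice2-Wu-SFT | CosyVoice2-Wu-instruct |
|---|---|---|---|
| Emotion | Happy | 0.87 | 0.94 |
| Emotion | Angry | 0.83 | 0.87 |
| Emotion | Sad | 0.84 | 0.88 |
| Emotion | Surprised | 0.67 | 0.73 |
| Emotion | EMOS | 3.66 | 3.83 |
| Prosody | Pitch | 0.24 | 0.74 |
| Prosody | Speech Rate | 0.26 | 0.82 |
| Prosody | PMOS | 2.13 | 3.68 |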

TTS

Note: CosyVoice2-Wu-SS uses a reference audio clip (Prompt Wav) for voice style conditioning. The target speaker is Blizzard Challenge 2020-SS1 (paper).
Text CosyVoice2-Wu-SS Qwen3-TTS Prompt Wav for Voice Cloning DiaMoE-TTS CosyVoice2 CosyVoice2-Wu-CPT CosyVoice2-Wu-SFT
昨日夜里向落了一夜雨,今早起来空气特别清爽,马路浪向也干净交关。
English Translation: It rained all night last night. When I got up this morning, the air was especially fresh, and the roads were much cleaner too.
今朝是礼拜六,我准备辣屋里向打扫卫生,拿房间理理清爽。
English Translation: Today is Saturday. I plan to do some cleaning at home and tidy up the rooms.
最近工作浪向有个项目快要结束哉,大家侪辣抓紧辰光做。
English Translation: A project at work is about to wrap up, and everyone is working against the clock to finish it.
我小辰光最怕打针,看到医生就吓得来要命。
English Translation: When I was little, I was most afraid of getting injections. Just seeing a doctor would scare me to death.
我小辰光常庄到外婆屋里去孛相,外婆总会得拨我吃交关好吃的点心。
English Translation: When I was little, I often went to my grandmother's house to play, and she would always give me lots of delicious snacks.
我老早读书的辰光,每日放学总归要搭同学一道辣路浪向白相一歇再回去。
English Translation: Back when I was in school, every day after class I would play in the street with my classmates for a while before going home.
昨日我搭老朋友碰头,一道吃了顿饭,讲了交关闲话。
English Translation: Yesterday I met up with an old friend. We had a meal together and chatted a lot.
我欢喜辣休息日的下半天困个午觉,起来以后精神好交关。
English Translation: I like to take a nap in the afternoon on my days off. I feel much more energetic after getting up.
我欢喜辣落雪的辰光搭小囡一道堆雪人,虽然手冷,但是开心。
English Translation: I like building snowmen with the children when it snows. My hands get cold, but I am happy.
早浪向买菜的辰光,摊主多送了我两根葱,蛮客气的。
English Translation: When I was buying groceries in the morning, the vendor gave me two extra scallions, which was very kind.