Low Frame Rate Universal Audio Codec with Semantic–Acoustic Disentanglement
Abstract
Large Language Models (LLMs) have advanced audio generation through discrete representation learning.
However, most existing neural codecs focus on speech and emphasize reconstruction fidelity, overlooking
unified low frame rate modeling across diverse audio domains, including speech, music, and general sound.
Moreover, high reconstruction quality does not necessarily yield semantically informative representations,
limiting their effectiveness in downstream generation tasks. We propose OmniCodec, a universal neural audio codec
tailored for low frame rates. It adopts a hierarchical multi-codebook design with semantic–acoustic decoupling,
leveraging the audio encoder of a pre-trained understanding model, together with a self-guidance strategy
that improves codebook utilization and reconstruction. Experiments show that OmniCodec achieves outstanding
performance at matched bitrates, delivering superior reconstruction quality while also providing more
semantically informative representations that benefit downstream generation tasks.
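The hierarchical multi-codebook design can be pictured as a residual vector quantizer: each layer encodes what the previous layers left behind, so early layers can be steered toward semantic content and later layers toward acoustic detail. The sketch below is illustrative only; the layer counts, dimensions, and codebooks are assumptions, not OmniCodec's actual implementation.

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Greedy residual vector quantization: each layer quantizes the
    residual left by the previous layers. In a hierarchical codec the
    first codebook can be supervised (e.g. by distillation) to carry
    semantic content, while later layers capture acoustic detail."""
    residual = x.copy()
    quantized = np.zeros_like(x)
    codes = []
    for cb in codebooks:                              # cb: (K, D) entries
        dist = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(-1)
        idx = dist.argmin(axis=1)                     # nearest entry per frame
        q = cb[idx]
        codes.append(idx)
        quantized += q                                # accumulate reconstruction
        residual -= q                                 # pass leftover to next layer
    return np.stack(codes), quantized                 # (L, T), (T, D)

rng = np.random.default_rng(0)
frames = rng.normal(size=(4, 8))                      # 4 frames, 8-dim latents
books = [rng.normal(size=(16, 8)) for _ in range(3)]  # 3 layers, 16 entries each
codes, recon = rvq_encode(frames, books)
```

Transmitting only the per-layer indices in `codes` is what makes the token stream compact; the decoder rebuilds its input by summing the selected codebook entries.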
Comparison of OmniCodec, OmniCodec-F, XCodec, Mimi codec, UniCodec, AUV, and WavTokenizer on samples from different audio domains.
Model               Bitrate (bps)   TPS (tokens/s)
Ground Truth        ------          ------
OmniCodec-32L       4400            12.5*32
XCodec-8L           4000            50*8
OmniCodec-16L       2200            12.5*16
OmniCodec-F-32L     2200            6.25*32
Mimi codec-16L      2200            12.5*16
OmniCodec-8L        1100            12.5*8
OmniCodec-F-16L     1100            6.25*16
UniCodec            1050            75
AUV                 716             50
WavTokenizer        480             40

(Audio samples for each model: Speech 1–5, Music 1–5, Sound 1–5.)
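The bitrate and TPS columns are linked by simple arithmetic: bitrate = tokens per second × bits per token. The consistency check below assumes 2048-entry codebooks for OmniCodec (11 bits per token) and a 1024-entry codebook for XCodec (10 bits); these sizes are inferred from the numbers, not stated in the table.

```python
import math

def bitrate_bps(frame_rate_hz, num_codebooks, codebook_size):
    """bits/s = (frames/s) * (codebooks per frame) * (bits per code)."""
    return frame_rate_hz * num_codebooks * math.log2(codebook_size)

# Consistency check against the table (codebook sizes are assumptions):
assert bitrate_bps(12.5, 32, 2048) == 4400   # OmniCodec-32L
assert bitrate_bps(12.5, 16, 2048) == 2200   # OmniCodec-16L
assert bitrate_bps(6.25, 32, 2048) == 2200   # OmniCodec-F-32L
assert bitrate_bps(50, 8, 1024) == 4000      # XCodec-8L
```

Note how OmniCodec-F halves the frame rate (6.25 Hz) while doubling the codebook count, keeping the bitrate unchanged relative to the 12.5 Hz variant.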
Semantic–Acoustic Disentanglement Study
Comparing the audio reconstructed from the first decoder layer across models directly shows whether each model preserves stable semantic information in the early stage while reducing entanglement with timbre and acoustic detail, reflecting its semantic disentanglement capability.
We therefore compare the audio reconstructed from only the first layer and from the first two layers of OmniCodec and Mimi codec, to examine how semantic and acoustic information is progressively preserved and disentangled across layers.
Models: Ground Truth | OmniCodec-1L | Mimi codec-1L | OmniCodec-2L | Mimi codec-2L

(Audio samples for each model: Speech 1–10, Music 1–5, Sound 1–5.)
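Reconstructing from only the first one or two layers, as in this study, amounts to summing just those codebook embeddings before running the decoder. A minimal sketch of this coarse-to-fine readout (the RVQ layout here is an assumption, not either model's exact implementation):

```python
import numpy as np

def decode_first_k(codes, codebooks, k):
    """Build the decoder input from only the first k RVQ layers:
    k=1 keeps just the (semantic) first codebook, while larger k
    adds acoustic detail from the residual layers."""
    return sum(codebooks[l][codes[l]] for l in range(k))

rng = np.random.default_rng(1)
books = [rng.normal(size=(16, 8)) for _ in range(4)]   # 4 layers, 16 entries
codes = rng.integers(0, 16, size=(4, 10))              # codes[layer, frame]
z1 = decode_first_k(codes, books, 1)                   # the "-1L" setting
z2 = decode_first_k(codes, books, 2)                   # the "-2L" setting
```

If the first layer is well disentangled, `z1` should already yield intelligible content, with `z2` mainly restoring timbre and acoustic detail.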
Ablation Study
The ablation study evaluates the effect of removing the semantic branch, the self-guidance loss, and Adapter-1, as well as training with speech-only data.
Models: Ground Truth | OmniCodec-16L | w/o Semantic branch | w/o Self-guidance loss | w/o Adapter-1 | w/o music and general sound dataset

(Audio samples for each model: Speech 1–10.)
Perplexity Study
This figure shows the PPL0 curves, recorded in TensorBoard, on the validation set during training on the Emilia dataset. The yellow curve is the Layer-1 PPL of OmniCodec; the blue curve is the Layer-1 PPL of Mimi codec. Since lower PPL indicates better predictability, OmniCodec's consistently lower PPL0 suggests that its Layer-1 representations are easier for the model to predict. Notably, we further optimized the proposed model by introducing additional semantic supervision through WavLM distillation, which yields representations with stronger semantic characteristics. The lower PPL0 can thus be read as evidence of better semantic disentanglement: it indicates that the learned representation is more semantically focused and carries less interference from non-semantic acoustic detail.
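For reference, PPL is the exponentiated mean cross-entropy of the model's predictions over codebook indices: a uniform guess over a K-entry codebook gives PPL = K, and PPL = 1 means the tokens are fully predictable. A minimal sketch (illustrative only; names and shapes are ours, not the training code's):

```python
import numpy as np

def perplexity(logits, targets):
    """PPL = exp(mean negative log-likelihood) of the predicted
    distribution over codebook indices; lower means more predictable."""
    logits = logits - logits.max(axis=-1, keepdims=True)        # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(-1, keepdims=True))
    nll = -log_probs[np.arange(len(targets)), targets].mean()
    return float(np.exp(nll))

targets = np.arange(8) % 16
uniform = np.zeros((8, 16))            # no preference -> PPL equals codebook size
confident = np.full((8, 16), -50.0)    # near-certain predictions -> PPL near 1
confident[np.arange(8), targets] = 50.0
```

Under this reading, a codec whose first-layer tokens track phonetic content (rather than speaker- or channel-dependent detail) gives the language model more predictable targets, hence the lower PPL0.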