OmniCodec
Low Frame Rate Universal Audio Codec with Semantic–Acoustic Disentanglement

Abstract

Large Language Models (LLMs) have advanced audio generation through discrete representation learning. However, most existing neural codecs focus on speech and emphasize reconstruction fidelity, overlooking unified low-frame-rate modeling across diverse audio domains, including speech, music, and general sound. Moreover, high reconstruction quality does not necessarily yield semantically informative representations, which limits effectiveness in downstream generation tasks. We propose OmniCodec, a universal neural audio codec designed for low frame rates. It adopts a hierarchical multi-codebook design with semantic–acoustic decoupling, leveraging the audio encoder of a pre-trained understanding model, along with a self-guidance strategy that improves codebook utilization and reconstruction quality. Experiments show that, at the same bitrate, OmniCodec delivers superior reconstruction quality while also providing more semantically informative representations that benefit downstream generation tasks.
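The following is a minimal PyTorch sketch of how the hierarchical multi-codebook, semantic–acoustic split described above might be wired. Everything here is an illustrative assumption rather than the released OmniCodec implementation: the class names (`ResidualVQ`, `SemanticAcousticCodec`), the dimensions and layer counts, and the wiring in which layer 1 quantizes features from the frozen pre-trained understanding encoder while the remaining layers quantize the acoustic residual.

```python
import torch
import torch.nn as nn

class ResidualVQ(nn.Module):
    """Hierarchical multi-codebook quantizer: each layer quantizes the residual
    left by the layers before it (nearest-neighbor lookup only; no training losses)."""
    def __init__(self, num_layers: int, dim: int = 512, codebook_size: int = 2048):
        super().__init__()
        self.codebooks = nn.ModuleList(
            nn.Embedding(codebook_size, dim) for _ in range(num_layers)
        )

    def forward(self, x: torch.Tensor):                # x: (batch, frames, dim)
        residual, quantized, codes = x, torch.zeros_like(x), []
        for cb in self.codebooks:
            # Squared L2 distance from each frame to every codebook entry.
            dists = (residual.unsqueeze(-2) - cb.weight).pow(2).sum(-1)
            idx = dists.argmin(dim=-1)                 # (batch, frames)
            q = cb(idx)
            quantized, residual = quantized + q, residual - q
            codes.append(idx)
        return quantized, torch.stack(codes, dim=1)    # codes: (batch, layers, frames)

class SemanticAcousticCodec(nn.Module):
    """Hypothetical split: layer 1 quantizes features from a frozen pre-trained
    understanding encoder (semantics); the remaining layers quantize only the
    acoustic residual, which is what decouples the two token streams."""
    def __init__(self, dim: int = 512, acoustic_layers: int = 15):
        super().__init__()
        self.semantic_vq = ResidualVQ(num_layers=1, dim=dim)
        self.acoustic_vq = ResidualVQ(num_layers=acoustic_layers, dim=dim)

    def forward(self, semantic_feats, acoustic_feats):
        q_sem, sem_codes = self.semantic_vq(semantic_feats)
        # Acoustic codebooks only model what the semantic layer left behind.
        q_ac, ac_codes = self.acoustic_vq(acoustic_feats - q_sem)
        return q_sem + q_ac, sem_codes, ac_codes
```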

Model Structure

Overview of OmniCodec framework.

Audio Reconstruction Comparison Study

Comparison among OmniCodec, OmniCodec-F, XCodec, Mimi codec, UniCodec, AUV, and WavTokenizer on samples from different audio domains. Each model's bitrate and token rate (TPS = frame rate × number of codebook layers) are listed below; a short sketch after the sample rows shows how the bitrates follow from these token rates.

Model             Bitrate (bps)   TPS (frame rate × layers)
Ground Truth      --              --
OmniCodec-32L     4400            12.5 × 32
XCodec-8L         4000            50 × 8
OmniCodec-16L     2200            12.5 × 16
OmniCodec-F-32L   2200            6.25 × 32
Mimi codec-16L    2200            12.5 × 16
OmniCodec-8L      1100            12.5 × 8
OmniCodec-F-16L   1100            6.25 × 16
UniCodec          1050            75
AUV               716             50
WavTokenizer      480             40
Speech 1
Speech 2
Speech 3
Speech 4
Speech 5
Music 1
Music 2
Music 3
Music 4
Music 5
Sound 1
Sound 2
Sound 3
Sound 4
Sound 5
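The bitrates above follow directly from token rate × bits per token. A quick check, under the assumption (inferred from the numbers themselves, not taken from the respective papers) of 11-bit (2048-entry) codebooks for OmniCodec, OmniCodec-F, and Mimi, and 10-bit (1024-entry) codebooks for XCodec:

```python
def bitrate_bps(frame_rate_hz: float, num_layers: int, bits_per_token: int) -> float:
    """Bitrate = frames/s x codebooks per frame x bits per codebook index."""
    return frame_rate_hz * num_layers * bits_per_token

# Assumed codebook widths: 11 bits (2048 entries) unless noted otherwise.
assert bitrate_bps(12.5, 32, 11) == 4400   # OmniCodec-32L
assert bitrate_bps(50.0,  8, 10) == 4000   # XCodec-8L (10-bit codebooks assumed)
assert bitrate_bps(6.25, 32, 11) == 2200   # OmniCodec-F-32L
assert bitrate_bps(12.5,  8, 11) == 1100   # OmniCodec-8L
```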

Semantic–Acoustic Disentanglement Study

By comparing audio decoded from only the first codebook layer across different models, we can directly observe whether a model preserves stable semantic information in its early layers while reducing entanglement with timbre and other acoustic details, thereby reflecting its semantic disentanglement capability. We compare audio reconstructed from only the first layer and from the first two layers of OmniCodec and Mimi codec to examine how semantic and acoustic information is progressively preserved and disentangled across layers.
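A minimal sketch of this probing procedure, assuming a hypothetical RVQ-style codec interface with `encode`/`decode` methods (this is not the actual OmniCodec or Mimi API):

```python
import torch

@torch.no_grad()
def decode_first_k_layers(codec, wav: torch.Tensor, k: int) -> torch.Tensor:
    """Reconstruct audio using only the first k codebook layers.

    Assumed interface:
      codec.encode(wav)   -> codes of shape (batch, num_layers, frames)
      codec.decode(codes) -> waveform
    Dropping the trailing layers keeps the coarse (semantic) tokens and
    discards the residual (acoustic-detail) tokens.
    """
    codes = codec.encode(wav)          # (batch, num_layers, frames)
    return codec.decode(codes[:, :k])  # keep layers 0..k-1 only

# For the study below, one would compare e.g.:
# audio_1l = decode_first_k_layers(codec, wav, k=1)
# audio_2l = decode_first_k_layers(codec, wav, k=2)
```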

Models: Ground Truth | OmniCodec-1L | Mimi codec-1L | OmniCodec-2L | Mimi codec-2L
Speech 1
Speech 2
Speech 3
Speech 4
Speech 5
Speech 6
Speech 7
Speech 8
Speech 9
Speech 10
Music 1
Music 2
Music 3
Music 4
Music 5
Sound 1
Sound 2
Sound 3
Sound 4
Sound 5

Ablation Study

The ablation study analyzes the effects of removing the semantic branch, removing the self-guidance loss, removing Adapter-1, and training with speech-only data (i.e., without the music and general sound datasets).

Models: Ground Truth | OmniCodec-16L | w/o Semantic branch | w/o Self-guidance loss | w/o Adapter-1 | w/o music and general sound data
Speech 1
Speech 2
Speech 3
Speech 4
Speech 5
Speech 6
Speech 7
Speech 8
Speech 9
Speech 10

Perplexity Study

Figure: TensorBoard PPL0 curves (yellow: OmniCodec Layer 1; blue: Mimi codec Layer 1).
This figure shows the PPL0 curves recorded in TensorBoard on the validation set during training on the Emilia dataset. The yellow curve is the Layer 1 perplexity (PPL) of OmniCodec, and the blue curve is that of Mimi codec. Since lower perplexity means the tokens are more predictable, OmniCodec's consistently lower PPL0 indicates that its Layer 1 representations are easier for the model to predict. Notably, we further optimized the proposed model by introducing additional semantic supervision through WavLM distillation, which yields representations with stronger semantic characteristics. The lower PPL0 can thus be read as evidence of better semantic disentanglement: the learned representation is more semantically focused and contains less interference from non-semantic acoustic details.
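For reference, a hedged sketch of how a PPL0-style number can be computed: perplexity is the exponential of the mean cross-entropy of a token predictor over the Layer-1 codec tokens. The function below is illustrative, not the training code behind the figure.

```python
import torch
import torch.nn.functional as F

def token_perplexity(logits: torch.Tensor, targets: torch.Tensor) -> float:
    """Perplexity of a token predictor over Layer-1 codec tokens.

    logits:  (batch, frames, codebook_size) predicted distributions
    targets: (batch, frames) ground-truth Layer-1 code indices
    """
    ce = F.cross_entropy(logits.flatten(0, 1), targets.flatten())
    return torch.exp(ce).item()  # PPL = exp(mean cross-entropy)
```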