Low Frame Rate Universal Audio Codec with Semantic–Acoustic Disentanglement
Abstract
Large Language Models (LLMs) have advanced audio generation through discrete representation learning.
However, most existing neural codecs focus on speech and emphasize reconstruction fidelity, overlooking
unified low frame rate modeling across diverse audio domains, including speech, music, and general sound.
Moreover, high reconstruction quality does not necessarily yield semantically informative representations,
limiting their effectiveness in downstream generation tasks. We propose OmniCodec, a universal neural audio codec
tailored for low frame rates. It adopts a hierarchical multi-codebook design with semantic–acoustic decoupling,
leveraging the audio encoder of a pre-trained understanding model, together with a self-guidance strategy
that improves codebook utilization and reconstruction. Experiments show that OmniCodec achieves outstanding
performance at matched bitrates, delivering superior reconstruction quality while also providing more
semantically informative representations that benefit downstream generation tasks.
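The hierarchical multi-codebook design can be pictured as a residual vector quantizer: each layer encodes what the previous layers left behind, so early layers can be steered toward semantic content and later layers toward acoustic detail. The sketch below is illustrative only; the layer counts, dimensions, and codebooks are assumptions, not OmniCodec's actual implementation.

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Greedy residual vector quantization: each layer quantizes the
    residual left by the previous layers. In a hierarchical codec the
    first codebook can be supervised (e.g. by distillation) to carry
    semantic content, while later layers capture acoustic detail."""
    residual = x.copy()
    quantized = np.zeros_like(x)
    codes = []
    for cb in codebooks:                              # cb: (K, D) entries
        dist = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(-1)
        idx = dist.argmin(axis=1)                     # nearest entry per frame
        q = cb[idx]
        codes.append(idx)
        quantized += q                                # accumulate reconstruction
        residual -= q                                 # pass leftover to next layer
    return np.stack(codes), quantized                 # (L, T), (T, D)

rng = np.random.default_rng(0)
frames = rng.normal(size=(4, 8))                      # 4 frames, 8-dim latents
books = [rng.normal(size=(16, 8)) for _ in range(3)]  # 3 layers, 16 entries each
codes, recon = rvq_encode(frames, books)
```

Transmitting only the per-layer indices in `codes` is what makes the token stream compact; the decoder rebuilds its input by summing the selected codebook entries.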
Comparison of OmniCodec, OmniCodec-F, XCodec, Mimi codec, UniCodec, AUV, and WavTokenizer on samples from different audio domains.
Model               Bitrate (bps)   TPS (tokens/s)
Ground Truth        ------          ------
OmniCodec-32L       4400            12.5*32
XCodec-8L           4000            50*8
OmniCodec-16L       2200            12.5*16
OmniCodec-F-32L     2200            6.25*32
Mimi codec-16L      2200            12.5*16
OmniCodec-8L        1100            12.5*8
OmniCodec-F-16L     1100            6.25*16
UniCodec            1050            75
AUV                 716             50
WavTokenizer        480             40

(Audio samples for each model: Speech 1–5, Music 1–5, Sound 1–5.)
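The bitrate and TPS columns are linked by simple arithmetic: bitrate = tokens per second × bits per token. The consistency check below assumes 2048-entry codebooks for OmniCodec (11 bits per token) and a 1024-entry codebook for XCodec (10 bits); these sizes are inferred from the numbers, not stated in the table.

```python
import math

def bitrate_bps(frame_rate_hz, num_codebooks, codebook_size):
    """bits/s = (frames/s) * (codebooks per frame) * (bits per code)."""
    return frame_rate_hz * num_codebooks * math.log2(codebook_size)

# Consistency check against the table (codebook sizes are assumptions):
assert bitrate_bps(12.5, 32, 2048) == 4400   # OmniCodec-32L
assert bitrate_bps(12.5, 16, 2048) == 2200   # OmniCodec-16L
assert bitrate_bps(6.25, 32, 2048) == 2200   # OmniCodec-F-32L
assert bitrate_bps(50, 8, 1024) == 4000      # XCodec-8L
```

Note how OmniCodec-F halves the frame rate (6.25 Hz) while doubling the codebook count, keeping the bitrate unchanged relative to the 12.5 Hz variant.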
Semantic–Acoustic Disentanglement Study
Comparing the audio reconstructed from the first decoder layer across models directly shows whether each model preserves stable semantic information in the early stage while reducing entanglement with timbre and acoustic detail, reflecting its semantic disentanglement capability.
We therefore compare the audio reconstructed from only the first layer and from the first two layers of OmniCodec and Mimi codec, to examine how semantic and acoustic information is progressively preserved and disentangled across layers.
Models: Ground Truth | OmniCodec-1L | Mimi codec-1L | OmniCodec-2L | Mimi codec-2L

(Audio samples for each model: Speech 1–10, Music 1–5, Sound 1–5.)
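Reconstructing from only the first one or two layers, as in this study, amounts to summing just those codebook embeddings before running the decoder. A minimal sketch of this coarse-to-fine readout (the RVQ layout here is an assumption, not either model's exact implementation):

```python
import numpy as np

def decode_first_k(codes, codebooks, k):
    """Build the decoder input from only the first k RVQ layers:
    k=1 keeps just the (semantic) first codebook, while larger k
    adds acoustic detail from the residual layers."""
    return sum(codebooks[l][codes[l]] for l in range(k))

rng = np.random.default_rng(1)
books = [rng.normal(size=(16, 8)) for _ in range(4)]   # 4 layers, 16 entries
codes = rng.integers(0, 16, size=(4, 10))              # codes[layer, frame]
z1 = decode_first_k(codes, books, 1)                   # the "-1L" setting
z2 = decode_first_k(codes, books, 2)                   # the "-2L" setting
```

If the first layer is well disentangled, `z1` should already yield intelligible content, with `z2` mainly restoring timbre and acoustic detail.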
Ablation Study
The ablation study evaluates the effect of removing the semantic branch, the self-guidance loss, and Adapter-1, as well as training with speech-only data.
Models: Ground Truth | OmniCodec-16L | w/o Semantic branch | w/o Self-guidance loss | w/o Adapter-1 | w/o music and general sound dataset

(Audio samples for each model: Speech 1–10.)
Perplexity Study
This figure shows the PPL0 curves, recorded in TensorBoard, on the validation set during training on the Emilia dataset. The yellow curve is the Layer-1 PPL of OmniCodec; the blue curve is the Layer-1 PPL of Mimi codec. Since lower PPL indicates better predictability, OmniCodec's consistently lower PPL0 suggests that its Layer-1 representations are easier for the model to predict. Notably, we further optimized the proposed model by introducing additional semantic supervision through WavLM distillation, which yields representations with stronger semantic characteristics. The lower PPL0 can thus be read as evidence of better semantic disentanglement: it indicates that the learned representation is more semantically focused and carries less interference from non-semantic acoustic detail.
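For reference, PPL is the exponentiated mean cross-entropy of the model's predictions over codebook indices: a uniform guess over a K-entry codebook gives PPL = K, and PPL = 1 means the tokens are fully predictable. A minimal sketch (illustrative only; names and shapes are ours, not the training code's):

```python
import numpy as np

def perplexity(logits, targets):
    """PPL = exp(mean negative log-likelihood) of the predicted
    distribution over codebook indices; lower means more predictable."""
    logits = logits - logits.max(axis=-1, keepdims=True)        # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(-1, keepdims=True))
    nll = -log_probs[np.arange(len(targets)), targets].mean()
    return float(np.exp(nll))

targets = np.arange(8) % 16
uniform = np.zeros((8, 16))            # no preference -> PPL equals codebook size
confident = np.full((8, 16), -50.0)    # near-certain predictions -> PPL near 1
confident[np.arange(8), targets] = 50.0
```

Under this reading, a codec whose first-layer tokens track phonetic content (rather than speaker- or channel-dependent detail) gives the language model more predictable targets, hence the lower PPL0.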