A Semantically Consistent Dataset for Data-Efficient Query-Based Universal Sound Separation

Kai Li^*, Jintao Cheng^*, Chang Zeng, Zijun Yan, Helin Wang, Zixiong Su, Bo Zheng, Xiaolin Hu
Tsinghua University, Shanda AI, Johns Hopkins University
^*Equal contribution
This work was completed during Kai Li's internship at Shanda AI.
📜 Arxiv 2026 | ⚙️ Code | 🤗 Dataset

We propose an automated pipeline that eliminates co-occurrence noise by mining high-purity single-event segments from unconstrained recordings and synthesizing semantically consistent mixtures. Utilizing this pipeline, we constructed Hive, a high-quality synthetic dataset comprising 2k hours of audio.

Abstract

Query-based universal sound separation is fundamental to intelligent auditory systems, aiming to isolate specific sources from unconstrained mixtures. Despite recent advances, existing methods continue to suffer from residual interference in complex acoustic scenes. This performance limitation stems largely from a data bottleneck: ubiquitous in-the-wild datasets contain weak labels and severe event co-occurrence. These flaws induce models to learn spurious correlations between background noise and target categories instead of robust acoustic features. To address this, we propose an automated pipeline that eliminates co-occurrence noise by mining high-purity single-event segments from unconstrained recordings and synthesizing mixtures via semantically consistent strategies. Utilizing this pipeline, we constructed Hive, a high-quality synthetic dataset comprising 2k hours of audio. Experimental results demonstrate that, despite using only ~0.2% of the data scale of million-hour baselines, models trained on Hive achieve competitive separation accuracy and perceptual quality. Moreover, these models exhibit remarkable zero-shot generalization on out-of-distribution evaluation benchmarks such as MUSDB18-HQ and USS-Bench. These findings highlight that prioritizing supervision purity enables significant data efficiency, offering a new paradigm for training robust auditory foundation models with reduced computational costs.
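As a rough illustration of what synthesizing a mixture from single-event segments can look like, the following is a minimal sketch and not the Hive pipeline itself. It assumes that "semantically consistent" can be approximated by drawing all sources of a mixture from one scene group. The scene names, labels, the `synthesize_mixture` helper, and the relative level range of -5 to 5 dB are all hypothetical choices made for this example; the abstract does not specify them.

```python
import math
import random


def rms(x):
    """Root-mean-square level of a waveform given as a list of floats."""
    return math.sqrt(sum(v * v for v in x) / len(x))


def scale_to_relative_level(source, reference, level_db):
    """Scale `source` so its RMS sits `level_db` dB relative to `reference`."""
    src_rms = rms(source)
    if src_rms == 0.0:
        return list(source)  # silent segment: nothing to scale
    target_rms = rms(reference) * 10 ** (level_db / 20.0)
    gain = target_rms / src_rms
    return [v * gain for v in source]


def synthesize_mixture(pool, scene, n_sources, rng, level_range_db=(-5.0, 5.0)):
    """Mix `n_sources` single-event segments drawn from one scene group.

    `pool` maps scene -> {label: waveform}; waveforms share one length.
    Restricting the draw to a single scene is this sketch's stand-in for
    a semantic-consistency rule (an assumption, not the authors' method).
    """
    candidates = pool[scene]
    labels = rng.sample(sorted(candidates), n_sources)
    reference = candidates[labels[0]]
    sources = {labels[0]: list(reference)}
    for label in labels[1:]:
        level_db = rng.uniform(*level_range_db)  # hypothetical level range
        sources[label] = scale_to_relative_level(
            candidates[label], reference, level_db
        )
    # The mixture is the sample-wise sum of the (scaled) clean sources,
    # so each source remains available as a separation target.
    mixture = [sum(vals) for vals in zip(*sources.values())]
    return mixture, sources


def sine(freq_hz, n=1600, sr=16000):
    """Toy stand-in for a mined single-event segment."""
    return [math.sin(2 * math.pi * freq_hz * t / sr) for t in range(n)]


# Toy pool with made-up scenes and labels, for illustration only.
toy_pool = {
    "street": {"car_horn": sine(440), "siren": sine(880), "engine": sine(110)},
    "kitchen": {"kettle": sine(1200), "chopping": sine(300)},
}
rng = random.Random(0)
mixture, sources = synthesize_mixture(toy_pool, "street", 2, rng)
```

Because the clean sources are kept alongside the mixture, every synthesized example carries exact supervision for each target, which is the property the abstract refers to as supervision purity.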

Overview of the proposed automated pipeline. The framework consists of three coupled stages: (1) Ontology Reconstruction & Data Preprocessing; (2) Single-source Semantic-acoustic Alignment; and (3) Super-resolution-based Standardization.

Hive Dataset

Dataset composition across sources.

Mixture type distribution (2-5 mix).

Label frequency statistics.

Performance & Efficiency

Separation performance results.

Efficiency comparison across models.

Demo Samples

Below are inference results from different models on four mixture types (2mix, 3mix, 4mix, and 5mix).

For each mixture type, we present five test samples. Hive-trained versions of AudioSep and FlowSep are also available, selectable via the Model Weights dropdown next to each model.


Acknowledgements

The website template was borrowed from Colorful Image Colorization and Nerfies.