Kai Li*, Jintao Cheng*, Chang Zeng, Zijun Yan, Helin Wang, Zixiong Su, Bo Zheng, Xiaolin Hu
Tsinghua University, Shanda AI, Johns Hopkins University
*Equal contribution
Completed during Kai Li's internship at Shanda AI.
📜 arXiv 2026 | ⚙️ Code | 🤗 Dataset
We propose an automated pipeline that eliminates co-occurrence noise by mining high-purity single-event segments from unconstrained recordings and synthesizing semantically consistent mixtures. Utilizing this pipeline, we constructed Hive, a high-quality synthetic dataset comprising 2k hours of audio.
Query-based universal sound separation is fundamental to intelligent auditory systems, aiming to isolate specific sources from unconstrained mixtures. Despite recent advances, existing methods continue to suffer from residual interference in complex acoustic scenes. This performance limitation stems largely from a data bottleneck: ubiquitous in-the-wild datasets contain weak labels and severe event co-occurrence. These flaws induce models to learn spurious correlations between background noise and target categories instead of robust acoustic features. To address this, we propose an automated pipeline that eliminates co-occurrence noise by mining high-purity single-event segments from unconstrained recordings and synthesizing mixtures via semantically consistent strategies. Utilizing this pipeline, we constructed Hive, a high-quality synthetic dataset comprising 2k hours of audio. Experimental results demonstrate that, despite using only ~0.2% of the data scale of million-hour baselines, models trained on Hive achieve competitive separation accuracy and perceptual quality. Moreover, these models exhibit remarkable zero-shot generalization on out-of-distribution evaluation benchmarks such as MUSDB18-HQ and USS-Bench. These findings highlight that prioritizing supervision purity enables significant data efficiency, offering a new paradigm for training robust auditory foundation models with reduced computational costs.
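To make the idea of synthesizing mixtures from single-event segments concrete, here is a minimal sketch in plain Python. It is not the Hive pipeline: the function names (`scale_to_snr`, `synthesize_mixture`), the SNR range, and the sine tones standing in for mined segments are all assumptions for illustration, and the sketch omits the semantic-consistency step that decides which events are allowed to co-occur.

```python
import math
import random


def rms(x):
    """Root-mean-square level of a non-empty list of samples."""
    return math.sqrt(sum(v * v for v in x) / len(x))


def scale_to_snr(reference, source, snr_db):
    """Scale `source` so that the reference-to-source level ratio is `snr_db` dB."""
    gain = rms(reference) / (rms(source) * 10 ** (snr_db / 20))
    return [gain * v for v in source]


def synthesize_mixture(segments, snr_range_db=(-5.0, 5.0), rng=None):
    """Mix equal-length single-event segments (hypothetical helper).

    The first segment is the level reference; every other segment is scaled
    to a random SNR relative to it. The SNR range is an assumed value.
    Returns (mixture, scaled_sources), where mixture is the sum of the sources.
    """
    rng = rng or random.Random(0)
    reference = segments[0]
    sources = [list(reference)]
    for seg in segments[1:]:
        snr_db = rng.uniform(*snr_range_db)
        sources.append(scale_to_snr(reference, seg, snr_db))
    mixture = [sum(vals) for vals in zip(*sources)]
    # Rescale mixture and sources together so the mixture does not clip
    # while still equalling the sum of its sources.
    peak = max(abs(v) for v in mixture)
    if peak > 1.0:
        sources = [[v / peak for v in s] for s in sources]
        mixture = [v / peak for v in mixture]
    return mixture, sources


def tone(freq_hz, n=1600, sr=16000):
    """Sine tone used here as a stand-in for a mined single-event segment."""
    return [math.sin(2 * math.pi * freq_hz * t / sr) for t in range(n)]


# A "3mix" built from three placeholder segments.
segments = [tone(f) for f in (220.0, 440.0, 880.0)]
mixture, sources = synthesize_mixture(segments)
```

Because each scaled source is kept alongside the mixture, every synthetic example comes with exact ground-truth targets, which is the kind of clean supervision that weakly labeled in-the-wild recordings cannot provide.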
Below we show inference results for different models on four types of mixture (2mix, 3mix, 4mix, 5mix).
For each mixture type, we present five test samples. Hive-trained versions of AudioSep and FlowSep are available and can be selected via the Model Weights dropdown next to each model.
Acknowledgements
The website template was borrowed from Colorful Image Colorization and Nerfies.