This demo presents results from our framework on three tasks: Universal Sound Separation, Speech-Music Separation, and Target Sound Extraction (TSE). The model separates complex acoustic mixtures without being told the number of sources in advance.
Mix // Mixture
p[n] // Predicted
s[n] // Ground Truth
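As a reading aid for the legend above, here is a minimal sketch of how a mixture, a prediction p[n], and a ground-truth source s[n] might be related and compared. This page does not state how mixtures were built or which metric was used, so both are assumptions: the mixture is a plain sample-wise sum of sources, and the score is scale-invariant SDR (SI-SDR), a common separation metric. The helper names `mix_sources` and `si_sdr` are hypothetical and not part of the framework.

```python
import math


def mix_sources(sources):
    """Form a mixture by summing equal-length sources sample by sample.

    Assumption: additive mixing with no gain normalization.
    """
    return [sum(samples) for samples in zip(*sources)]


def si_sdr(p, s):
    """Scale-invariant SDR in dB of a prediction p[n] against ground truth s[n].

    Assumption: SI-SDR is used here only as an illustrative metric.
    """
    energy_s = sum(b * b for b in s)
    if energy_s == 0:
        raise ValueError("ground-truth signal has zero energy")
    # Project the prediction onto the ground truth to get the scaled target.
    scale = sum(a * b for a, b in zip(p, s)) / energy_s
    target = [scale * b for b in s]
    # Everything in the prediction that is not the scaled target is error.
    error = [a - t for a, t in zip(p, target)]
    energy_target = sum(t * t for t in target)
    energy_error = sum(e * e for e in error)
    if energy_error == 0:
        return math.inf  # prediction is an exact scaled copy of the target
    return 10 * math.log10(energy_target / energy_error)


if __name__ == "__main__":
    s1 = [1.0, -1.0, 1.0, -1.0]   # toy ground-truth source
    s2 = [0.5, 0.5, -0.5, -0.5]   # toy interfering source
    mix = mix_sources([s1, s2])
    print("SI-SDR of the unprocessed mixture vs s1:", si_sdr(mix, s1))
```

A higher SI-SDR means the prediction is closer to the ground truth up to a gain factor; scoring the unprocessed mixture against the same source gives a baseline to compare a prediction against.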
Case 01: 2-Source Mixture
Case 02: 3-Source Mixture
Case 03: 4-Source Mixture
Case 04: 5-Source Mixture
Case 05: 6-Source Mixture
Case 06: Speech & Music 2Mix
Case 07: Speech & Music 3Mix
Case 08: Speech & Music 4Mix
Condition: Video Only
Case: 2Mix Extraction
Condition: Video Only
Case: 3Mix Extraction
Condition: Text Only
"a man is speaking"
A complex mixture of rain, wind, and horns. The model extracts the clean human voice.
Condition: Video + Text + Tag
"a person is whistling"
Condition: Video + Text + Tag
"a frog croaks several times"