USE Demos

A Unified Model for Universal Sound Separation and Extraction

This demo showcases our framework's performance on three tasks: Universal Sound Separation, Speech-Music Separation, and Target Sound Extraction (TSE). The model separates complex acoustic mixtures without being told the number of sources in advance.

Legend: Mix = Mixture, p[n] = Predicted, s[n] = Ground Truth
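When comparing the predicted clips with the ground-truth clips in the cases below, it may help to keep in mind that separation models often emit sources in an arbitrary order, so "Predicted 1" need not correspond to "GT Source 1". The following is a minimal sketch of how predicted and ground-truth signals could be paired and scored, assuming scale-invariant SDR as the metric and a brute-force search over orderings. Neither the metric nor the helper names `si_sdr` and `best_permutation` come from this page; they are illustrative assumptions, not the authors' evaluation code.

```python
import itertools

import numpy as np


def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant SDR in dB between two 1-D signals of equal length."""
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference to get the scaled target part.
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference
    noise = estimate - target
    return 10 * np.log10((np.dot(target, target) + eps) / (np.dot(noise, noise) + eps))


def best_permutation(predicted, ground_truth):
    """Assign one predicted signal to each ground-truth signal.

    Tries every ordering (hypothetical helper; fine for the small source
    counts in this demo, but the cost grows factorially).
    Returns (perm, mean_si_sdr), where perm[g] is the index of the
    predicted signal matched to ground-truth source g.
    """
    best_score, best_perm = -np.inf, None
    for perm in itertools.permutations(range(len(predicted)), len(ground_truth)):
        score = np.mean(
            [si_sdr(predicted[p], ground_truth[g]) for g, p in enumerate(perm)]
        )
        if score > best_score:
            best_score, best_perm = score, perm
    return best_perm, best_score
```

Called with lists of equal-length NumPy arrays, `best_permutation(predicted, ground_truth)` returns the pairing with the highest mean SI-SDR. Because the search allows more predictions than references, it also works if a model outputs extra channels.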

Universal Sound Separation

Case 01: 2-Source Mixture

Mixture

Predicted 1

Predicted 2

GT Source 1

GT Source 2

Case 02: 3-Source Mixture

Mixture

Predicted 1

Predicted 2

Predicted 3

GT Source 1

GT Source 2

GT Source 3

Case 03: 4-Source Mixture

Mixture

Predicted 1

Predicted 2

Predicted 3

Predicted 4

Case 04: 5-Source Mixture

Mixture

Predicted 1

Predicted 2

Predicted 3

Predicted 4

Predicted 5

Case 05: 6-Source Mixture

Mixture

Predicted 1

Predicted 2

Predicted 3

Predicted 4

Predicted 5

Predicted 6

Speech-Music Separation

Case 06: Speech & Music 2Mix

Mixture

Predicted 1

Predicted 2

GT Source 1

GT Source 2

Case 07: Speech & Music 3Mix

Mixture

Predicted 1

Predicted 2

Predicted 3

Case 08: Speech & Music 4Mix

Mixture

Predicted 1

Predicted 2

Predicted 3

Predicted 4

Target Sound Extraction

Condition: Video Only
Case: 2mix Extraction

Mixture

Extracted

Ground Truth

Condition: Video Only
Case: 3mix Extraction

Mixture

Extracted

Ground Truth

Condition: Text Only
"a man is speaking"

A complex mixture containing rain, wind, and horns. The model extracts the clean human voice.

4mix Mixture

Extracted Voice

Condition: Video + Text + Tag
"a person is whistling"

Mixture

Extracted

Ground Truth

Condition: Video + Text + Tag
"a frog croaks several times"

Mixture

Extracted

Ground Truth