Unsupervised Blind Speech Separation with a Diffusion Prior

[code] [pdf]

Zhongweiyang Xu*, Xulin Fan*, Zhong-Qiu Wang†, Xilin Jiang♯, Romit Roy Choudhury*

* University of Illinois Urbana-Champaign, † Southern University of Science and Technology, ♯ Columbia University

Table of Contents

Brief Overview

Blind Speech Separation (BSS) aims to separate multiple speech sources from mixtures recorded by any microphone array. We propose ArrayDPS to solve the BSS problem in an unsupervised, array-agnostic, and generative manner. ArrayDPS uses a speech diffusion model as a prior and samples from the conditional distribution of speech sources given the multi-channel mixture. This is a challenging task because it is a blind inverse problem: both the relative room impulse responses (relative RIRs) and the sources are unknown, so the likelihood is intractable. ArrayDPS addresses this by estimating the relative RIRs during diffusion posterior sampling (DPS), initialized by independent vector analysis (IVA). After the ArrayDPS sampling process, we obtain all the sources along with all the relative RIRs (which encode room acoustics and source locations), without knowing any microphone-array information. Moreover, this method needs only a simple single-speaker speech diffusion model as a prior, regardless of the microphone array. We present extensive experiments and results validating our method.
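To make the sampling loop above concrete, here is a deliberately simplified, single-source toy sketch of the alternating structure: a denoiser step under the prior, a least-squares re-estimate of the unknown relative filter, and a likelihood-gradient step with decaying noise. Everything here is a stand-in assumption, not the paper's implementation: the real ArrayDPS uses a trained single-speaker speech diffusion model as the denoiser, handles multiple sources and channels, and initializes the virtual sources from IVA.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy blind inverse problem: a single source s, a short unknown "relative RIR"
# h, and an observation y = conv(s, h). This is only a structural illustration.
n, L = 200, 3
s_true = rng.standard_normal(n)
h_true = np.array([1.0, 0.5, 0.25])
y = np.convolve(s_true, h_true)[:n]

def conv_mat(s, L):
    """Toeplitz matrix M with M @ h == np.convolve(s, h)[:len(s)]."""
    M = np.zeros((len(s), L))
    for k in range(L):
        M[k:, k] = s[:len(s) - k]
    return M

x = rng.standard_normal(n)            # diffusion sample, initialized from noise
res_hist = []
for t in np.linspace(1.0, 0.01, 100):
    s0 = x / (1.0 + t)                # placeholder denoiser standing in for a
                                      # trained score model's E[s | x_t]
    # Re-estimate the relative filter by least squares given the current source
    M = conv_mat(s0, L)
    h, *_ = np.linalg.lstsq(M, y, rcond=None)
    resid = y - M @ h
    res_hist.append(np.linalg.norm(resid))
    # Likelihood-gradient step on ||y - conv(s0, h)||^2 (correlation of the
    # residual with h), plus decaying exploration noise, as in posterior sampling
    grad = -2.0 * np.convolve(resid, h[::-1])[L - 1:L - 1 + n]
    x = s0 - 0.05 * grad + 0.1 * t * rng.standard_normal(n)
```

Constraining the filter to a short FIR (here length 3) is what keeps the likelihood informative; with an unconstrained per-frequency filter, any source estimate could explain the mixture exactly.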


Audio Demo for Synthetic 2-speaker Mixtures

Here is the main result of our paper for 2-speaker separation on the 3-channel SMS-WSJ dataset.

In the following demos, we evaluate a subset of methods from Table 1 (shown above):

  • CLEAN - Ground-truth source at the reference channel
  • Spatial Cluster - row 1a in the Table
  • IVA - row 1c in the Table
  • UNSSOR - row 1d in the Table
  • ArrayDPS-ML - row 4a in the Table
  • ArrayDPS-A1 - row 2a in the Table, first sampled result
  • ArrayDPS-A2 - row 2a in the Table, second sampled result
  • ArrayDPS-A3 - row 2a in the Table, third sampled result
  • ArrayDPS-A4 - row 2a in the Table, fourth sampled result
  • ArrayDPS-A5 - row 2a in the Table, fifth sampled result
  • TF-GridNet-SMS - row 5b in the Table
  • TF-GridNet-Spatial - row 5c in the Table



Audio samples for SMS-WSJ dataset (aligned with UNSSOR's demo), 2 speakers and 3 microphones.

Note that here we also show the virtual sources sampled by ArrayDPS. For the clean signal, the original anechoic clean speech is displayed in the virtual-source column.

Sample 1: 1015_446c0415_442c040c

Mixture
Methods Source 1 Source 2 Virtual Source 1 Virtual Source 2
Clean
Spatial Clustering
IVA
UNSSOR
ArrayDPS-ML
ArrayDPS-A1
ArrayDPS-A2
ArrayDPS-A3
ArrayDPS-A4
ArrayDPS-A5
TF-GridNet-SMS

Sample 2: 15_443c040a_444c0415

Mixture
Methods Source 1 Source 2 Virtual Source 1 Virtual Source 2
Clean
Spatial Clustering
IVA
UNSSOR
ArrayDPS-ML
ArrayDPS-A1
ArrayDPS-A2
ArrayDPS-A3
ArrayDPS-A4
ArrayDPS-A5
TF-GridNet-SMS

Sample 3: 1120_445c040c_441c040m

Mixture
Methods Source 1 Source 2 Virtual Source 1 Virtual Source 2
Clean
Spatial Clustering
IVA
UNSSOR
ArrayDPS-ML
TF-GridNet-SMS

Sample 4: 0_442c040o_443c040g

Mixture
Methods Source 1 Source 2 Virtual Source 1 Virtual Source 2
Clean
Spatial Clustering
IVA
UNSSOR
ArrayDPS-ML
TF-GridNet-SMS

Sample 5: 999_441c040c_447c040k

Mixture
Methods Source 1 Source 2 Virtual Source 1 Virtual Source 2
Clean
Spatial Clustering
IVA
UNSSOR
ArrayDPS-ML
TF-GridNet-SMS


Audio samples for Spatialized WSJ0-2Mix, 2 speakers and 4 microphones.

Sample 1: 050a0504_2.4414_443o0313_-2.4414

Mixture
Methods Source 1 Source 2
Clean
Spatial Clustering
IVA
UNSSOR
ArrayDPS-ML
ArrayDPS-A1
ArrayDPS-A2
ArrayDPS-A3
ArrayDPS-A4
ArrayDPS-A5
TF-GridNet-Spatial

Sample 2: 050a0502_1.463_420a010o_-1.463

Mixture
Methods Source 1 Source 2
Clean
Spatial Clustering
IVA
UNSSOR
ArrayDPS-ML
TF-GridNet-Spatial

Sample 3: 050a0501_1.7783_442o030z_-1.7783

Mixture
Methods Source 1 Source 2
Clean
Spatial Clustering
IVA
UNSSOR
ArrayDPS-ML
TF-GridNet-Spatial

Audio Demo for Real-World Recorded 2-speaker Mixtures

The mixtures are recorded in an office room (about 7m x 4m x 2.7m) with 3 microphones, where two speakers speak simultaneously inside the room. The three microphones are set up as shown below, and the speakers are at least 1 meter away from the mics. Some background noise is present, including air-conditioning noise, noise from the headlights, and clock ticking. Also note that some speakers occasionally misread the sentence slightly (stuttering on some words).

3-channel microphone setup for real-world mixture recording.
Mixture
Methods Source 1 Source 2
IVA
ArrayDPS-ML
ArrayDPS-A1
ArrayDPS-A2
ArrayDPS-A3
ArrayDPS-A4
ArrayDPS-A5
TF-GridNet-Spatial
Mixture
Methods Source 1 Source 2
IVA
ArrayDPS-ML
ArrayDPS-A1
ArrayDPS-A2
ArrayDPS-A3
ArrayDPS-A4
ArrayDPS-A5
TF-GridNet-Spatial
Mixture
Methods Source 1 Source 2
IVA
ArrayDPS-ML
ArrayDPS-A1
ArrayDPS-A2
ArrayDPS-A3
ArrayDPS-A4
ArrayDPS-A5
TF-GridNet-Spatial
Mixture
Methods Source 1 Source 2
IVA
ArrayDPS-ML
ArrayDPS-A1
ArrayDPS-A2
ArrayDPS-A3
ArrayDPS-A4
ArrayDPS-A5
TF-GridNet-Spatial

Audio Demo for Synthetic 3-speaker Mixtures

Here is the main result of our paper for 3-speaker separation on the 4-channel Spatialized WSJ0-3Mix dataset.

In the following demos, we evaluate a subset of methods from Table 6 (shown above):

  • CLEAN - Ground-truth source at the reference channel
  • Spatial Cluster - row 1a in the Table
  • IVA - row 1c in the Table
  • ArrayDPS-ML - row 4a in the Table
  • ArrayDPS-A1 - row 2a in the Table, first sampled result
  • ArrayDPS-A2 - row 2a in the Table, second sampled result
  • ArrayDPS-A3 - row 2a in the Table, third sampled result
  • ArrayDPS-A4 - row 2a in the Table, fourth sampled result
  • ArrayDPS-A5 - row 2a in the Table, fifth sampled result
  • TF-GridNet-Spatial - row 5a in the Table



Audio samples for Spatialized WSJ0-3Mix, 3 speakers and 4 microphones.

Sample 1: 050a0504_0.34233_447c0210_-0.34233_22ga010q_0

Mixture
Methods Source 1 Source 2 Source 3
Clean
Spatial Clustering
IVA
ArrayDPS-ML
ArrayDPS-A1
ArrayDPS-A2
ArrayDPS-A3
ArrayDPS-A4
ArrayDPS-A5
TF-GridNet-Spatial

Sample 2: 050a0504_0.71133_446c020g_-0.71133_421c020s_0

Mixture
Methods Source 1 Source 2 Source 3
Clean
Spatial Clustering
IVA
ArrayDPS-ML
TF-GridNet-Spatial

Sample 3: 050a0504_2.2121_446o0306_-2.2121_052a050b_0

Mixture
Methods Source 1 Source 2 Source 3
Clean
Spatial Clustering
IVA
ArrayDPS-ML
TF-GridNet-Spatial

Sample 4: 050a0505_0.8216_22ho010j_-0.8216_445o030v_0

Mixture
Methods Source 1 Source 2 Source 3
Clean
Spatial Clustering
IVA
ArrayDPS-ML
TF-GridNet-Spatial

Filter and Source Visualization

Each figure corresponds to one mixture separation. The first row shows, from left to right, the anechoic source 1, source 1's RIR, and the reverberant source 1 (reference channel). The second row shows ArrayDPS's separated virtual source 1, the FCP-estimated filter, and the ArrayDPS-separated reverberant source 1 (reference channel). The third and fourth rows show the same for source 2.

From these visualizations, we can see that the virtual sources separated by ArrayDPS sometimes exhibit a dereverberation effect, because the FCP-estimated filter is sometimes close to the actual RIR. However, we do not claim that we perform dereverberation: the virtual source is not aligned with the ground-truth anechoic source, which makes it difficult to evaluate dereverberation performance. To perceive the dereverberation effect, listen to the virtual-source demos in the SMS-WSJ 2-speaker 3-channel section above.
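The FCP-estimated filter mentioned above can be sketched as a per-frequency multi-tap least-squares fit in the STFT domain: given an estimated (virtual) source and the reference-channel mixture, solve for the filter taps that best predict the mixture from delayed copies of the source. This is a simplified, unweighted illustration under our own assumptions; published FCP formulations typically include a magnitude-based weighting term, which is omitted here, and `fcp_filter` is a hypothetical helper name.

```python
import numpy as np

def fcp_filter(S_hat, Y, taps=4):
    """Per-frequency K-tap convolutive prediction filter (unweighted sketch).

    S_hat, Y: complex STFT matrices of shape (frames, freqs).
    Returns H of shape (taps, freqs) minimizing, for each frequency f,
        sum_t | Y[t, f] - sum_k H[k, f] * S_hat[t - k, f] |^2
    """
    T, F = S_hat.shape
    H = np.zeros((taps, F), dtype=complex)
    for f in range(F):
        # Regression columns: delayed copies of the estimated source at freq f
        A = np.zeros((T, taps), dtype=complex)
        for k in range(taps):
            A[k:, k] = S_hat[:T - k, f]
        H[:, f] = np.linalg.lstsq(A, Y[:, f], rcond=None)[0]
    return H

# Sanity check on synthetic data: apply a known filter, then recover it.
rng = np.random.default_rng(1)
T, F, K = 64, 5, 3
S_hat = rng.standard_normal((T, F)) + 1j * rng.standard_normal((T, F))
H_true = rng.standard_normal((K, F)) + 1j * rng.standard_normal((K, F))
Y = np.zeros((T, F), dtype=complex)
for k in range(K):
    Y[k:] += H_true[k] * S_hat[:T - k]
H_est = fcp_filter(S_hat, Y, taps=K)
```

On noiseless synthetic data like this, the least-squares fit recovers the filter exactly; on real mixtures, the fitted filter only approximates the relative RIR, which is why the dereverberation effect in the figures is inconsistent.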

Figure 1: SMS-WSJ: sample 1015_446c0415_442c040c
Figure 2: SMS-WSJ: sample 15_443c040a_444c0415

Figure 3: SMS-WSJ: sample 725_444c040l_442c040p
Figure 4: SMS-WSJ: sample 1120_445c040c_441c040m

Figure 5: SMS-WSJ: sample 0_442c040o_443c040g
Figure 6: SMS-WSJ: sample 999_441c040c_447c040k