Blind Speech Separation (BSS) aims to separate multiple speech sources from mixtures recorded by any microphone array. We propose ArrayDPS to solve the BSS problem in an unsupervised, array-agnostic, and generative manner. ArrayDPS uses a speech diffusion model as a prior and samples from the conditional distribution of speech sources given the multi-channel mixture. This is a challenging task because it is a blind inverse problem: both the relative room impulse responses (relative RIRs) and the sources are unknown, so the likelihood is intractable. ArrayDPS solves this problem by estimating the relative RIRs during diffusion posterior sampling (DPS), initialized by independent vector analysis (IVA). After the ArrayDPS sampling process, we obtain all the separated sources along with all the relative RIRs (inferring room acoustics and source locations), without knowing any microphone-array information. Moreover, the method only needs a simple single-speaker speech diffusion model as a prior, regardless of the microphone array. We present extensive experiments and results to validate our method.
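The core idea above, sampling sources while jointly fitting relative filters so that the filtered sources explain the multi-channel mixture, can be sketched in a simplified narrowband form. This is an illustrative toy, not the paper's exact FCP-based DPS update: it fits a per-frequency scalar transfer function by least squares and then takes one data-fidelity gradient step; all names here are hypothetical.

```python
import numpy as np

def likelihood_grad_step(y_stft, s_est, step=0.5):
    """One toy guidance step of a posterior-sampling loop.

    y_stft: (M, F, T) multi-channel mixture STFT
    s_est : (K, F, T) current source estimates at the reference channel

    Per frequency bin, a scalar relative transfer function (a narrowband
    stand-in for the relative RIR) is fit by least squares, and the
    data-fidelity gradient nudges the sources toward explaining the
    mixture. A sketch of the idea only; the paper's relative-RIR
    estimation uses FCP with multi-tap filters.
    """
    M, F, T = y_stft.shape
    grad = np.zeros_like(s_est)
    for f in range(F):
        S = s_est[:, f, :]                                    # (K, T)
        Y = y_stft[:, f, :]                                   # (M, T)
        # least-squares relative transfer functions: Y ≈ H @ S
        H = Y @ S.conj().T @ np.linalg.pinv(S @ S.conj().T)   # (M, K)
        R = Y - H @ S                                         # mixture residual
        grad[:, f, :] = H.conj().T @ R                        # data-fidelity gradient
    return s_est + step * grad
```

If the current source estimates already explain the mixture exactly, the fitted filters leave no residual and the step is a no-op, which is the desired fixed-point behavior of such a guidance term.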
Here is the main result of our paper for 2-speaker separation on the 3-channel SMS-WSJ dataset.
In the following demos, we evaluate a subset of methods from Table 1 (shown above):
CLEAN - Ground-truth source at the reference channel
Spatial Cluster - row 1a in the Table
IVA - row 1c in the Table
UNSSOR - row 1d in the Table
ArrayDPS-ML - row 4a in the Table
ArrayDPS-A1 - row 2a in the Table, first sampled result
ArrayDPS-A2 - row 2a in the Table, second sampled result
ArrayDPS-A3 - row 2a in the Table, third sampled result
ArrayDPS-A4 - row 2a in the Table, fourth sampled result
ArrayDPS-A5 - row 2a in the Table, fifth sampled result
TF-GridNet-SMS - row 5b in the Table
TF-GridNet-Spatial - row 5c in the Table
Audio samples for SMS-WSJ dataset (aligned with UNSSOR's demo), 2 speakers and 3 microphones.
Note that here we also show the virtual sources sampled by ArrayDPS. For the clean signal, the original anechoic clean speech is displayed in the virtual source column.
Sample 1: 1015_446c0415_442c040c
Mixture
Methods
Source 1
Source 2
Virtual Source 1
Virtual Source 2
Clean
Spatial Clustering
IVA
UNSSOR
ArrayDPS-ML
ArrayDPS-A1
ArrayDPS-A2
ArrayDPS-A3
ArrayDPS-A4
ArrayDPS-A5
TF-GridNet-SMS
Sample 2: 15_443c040a_444c0415
Mixture
Methods
Source 1
Source 2
Virtual Source 1
Virtual Source 2
Clean
Spatial Clustering
IVA
UNSSOR
ArrayDPS-ML
ArrayDPS-A1
ArrayDPS-A2
ArrayDPS-A3
ArrayDPS-A4
ArrayDPS-A5
TF-GridNet-SMS
Sample 3: 1120_445c040c_441c040m
Mixture
Methods
Source 1
Source 2
Virtual Source 1
Virtual Source 2
Clean
Spatial Clustering
IVA
UNSSOR
ArrayDPS-ML
TF-GridNet-SMS
Sample 4: 0_442c040o_443c040g
Mixture
Methods
Source 1
Source 2
Virtual Source 1
Virtual Source 2
Clean
Spatial Clustering
IVA
UNSSOR
ArrayDPS-ML
TF-GridNet-SMS
Sample 5: 999_441c040c_447c040k
Mixture
Methods
Source 1
Source 2
Virtual Source 1
Virtual Source 2
Clean
Spatial Clustering
IVA
UNSSOR
ArrayDPS-ML
TF-GridNet-SMS
Audio samples for Spatialized WSJ0-2Mix, 2 speakers and 4 microphones.
Real-world recorded mixtures, 2 speakers and 3 microphones.
The mixtures are recorded in an office room (about 7m x 4m x 2.7m) with 3 microphones, while two speakers speak simultaneously inside the room. The three microphones are set up as shown below, and the speakers are at least 1 meter away from the microphones. Some background noise is present, including air-conditioning noise, noise from headlights, and clock ticking. Also note that some speakers occasionally misspeak a sentence (stuttering on some words).
3-channel microphone setup for real-world mixture recording.
Each figure corresponds to one mixture separation. The first row shows, from left to right, the anechoic source 1, source 1's RIR, and the reverberant source 1 (reference channel). The second row shows ArrayDPS's separated virtual source 1, the FCP-estimated filter, and the ArrayDPS-separated reverberant source 1 (reference channel). The third and fourth rows show the same for source 2.
From these figures, we can see that the ArrayDPS-separated virtual sources sometimes exhibit a dereverberation effect, because the FCP-estimated filter is sometimes close to the actual RIR. However, in this paper we do not claim to perform dereverberation: the virtual source is not aligned with the ground-truth anechoic source, which makes it difficult to evaluate dereverberation performance. To perceive the dereverberation effect, listen to the virtual sources in the SMS-WSJ 2-speaker 3-channel demos shown above.
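The FCP-estimated filter shown in these figures can be illustrated with a toy time-domain least-squares analogue: fit an FIR filter that maps the (roughly anechoic) virtual source to the reverberant source at the reference channel. The paper applies FCP per frequency in the STFT domain, so this sketch, with its hypothetical function name, only conveys the idea of why the fitted filter can resemble the RIR.

```python
import numpy as np

def fit_relative_filter(virtual, reverberant, taps=8):
    """Least-squares FIR filter h minimizing ||reverberant - virtual * h||^2.

    A time-domain stand-in for FCP filter estimation: if the virtual
    source is close to the anechoic source, the fitted filter h is
    close to the true RIR, which explains the dereverberation effect
    observed in the figures.
    """
    N = len(virtual)
    # convolution (Toeplitz) matrix: column m is the virtual source delayed by m
    V = np.zeros((N + taps - 1, taps))
    for m in range(taps):
        V[m:m + N, m] = virtual
    x = np.zeros(N + taps - 1)
    L = min(len(reverberant), N + taps - 1)
    x[:L] = reverberant[:L]
    h, *_ = np.linalg.lstsq(V, x, rcond=None)
    return h
```

When the reverberant signal really is the virtual source convolved with some short filter, this least-squares fit recovers that filter exactly, since the convolution matrix has full column rank for a generic input.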