Blind Source Separation of Recorded Speech and Music Signals

We present methods for separating blindly mixed signals recorded in a room. The learning algorithm is based on information maximization (infomax) in a single-layer feedforward neural network, implemented in the frequency domain using the polynomial filter matrix algebra technique. We focus on the implementation of the learning algorithm and on issues that arise when separating speakers in room recordings. Fast convergence was achieved by using a time-delayed decorrelation method as a preprocessing step. Under minimum-phase mixing conditions, this preprocessing step alone was sufficient to separate the signals. These methods successfully separated a recorded voice from background music (the cocktail party problem).
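The time-delayed decorrelation step can be illustrated with a minimal sketch. It assumes instantaneous (non-convolutive) mixing, which is a simplification of the room recordings discussed here, and it uses a single lag. The function name `tdd_separate` and its parameters are hypothetical, chosen only for this illustration. The idea is to whiten the mixtures and then rotate them so that their covariance at a time lag is also diagonal.

```python
import numpy as np

def tdd_separate(x, lag=1):
    """Time-delayed decorrelation for instantaneous mixtures (illustrative sketch).

    x   : array of shape (n_channels, n_samples), the mixed signals
    lag : time delay in samples at which the outputs are also decorrelated
    Returns (w, y): the unmixing matrix and the separated signals y = w @ x.
    """
    x = x - x.mean(axis=1, keepdims=True)          # remove the mean of each channel
    n = x.shape[1]

    # Step 1: whitening, so that the zero-lag covariance becomes the identity.
    c0 = x @ x.T / n
    d, e = np.linalg.eigh(c0)
    v = e @ np.diag(d ** -0.5) @ e.T
    z = v @ x

    # Step 2: diagonalize the symmetrized covariance at the chosen lag.
    c_tau = z[:, lag:] @ z[:, :-lag].T / (n - lag)
    c_tau = 0.5 * (c_tau + c_tau.T)
    _, u = np.linalg.eigh(c_tau)

    w = u.T @ v                                    # rotation applied after whitening
    return w, w @ x
```

This works only when the sources have different autocorrelations at the chosen lag; the outputs are recovered up to scaling and ordering.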


Blind Source Separation: Audio Examples

The audio files have been updated using the newly proposed algorithm, which combines the time-delayed decorrelation (TDD) algorithm with ICA (see the ICASSP'98 paper).

1. Speech - Music Separation

A speaker was recorded with two distant-talking microphones (sampling rate 16 kHz) in a normal office room with loud music playing in the background. The speaker, the cassette player, and the microphones were arranged in a square, with about 60 cm between them. (All files are in 16 kHz WAV format.)

2. Speech - Speech Separation


A real Cocktail Party Effect. Two speakers were recorded speaking simultaneously: speaker 1 counts from one to ten in English while speaker 2 counts in Spanish (uno dos ...) at the same time. The recording was made in a normal office room. The speakers and the microphones were arranged in a square, with about 60 cm between them (sampling rate 16 kHz). (All files are in 16 kHz WAV format.)


3. Speech - Speech Separation in Difficult Environments


A real Cocktail Party Effect II. Two speakers were recorded speaking simultaneously. This time the recording was made in a conference room (5.5 m by 8 m) with some air-conditioning noise. Both speakers read a section from a newspaper for 16 s. The microphones were placed 120 cm from the speakers (sampling rate 16 kHz). (All files are in 16 kHz WAV format.) The unmixing filters need to be sufficiently long; we used 2048 taps for each filter.
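To give a sense of what a matrix of long unmixing filters involves, here is a minimal sketch of applying already-learned FIR filters to a multi-microphone recording by multiplication in the frequency domain. It does not show the learning itself. The function name `apply_unmixing_filters` and the array layout are assumptions made for this illustration, and the filter coefficients would have to come from the separation algorithm.

```python
import numpy as np

SAMPLE_RATE_HZ = 16000   # sampling rate of the recordings
FILTER_TAPS = 2048       # filter length used for the conference-room recording

# Time span covered by one unmixing filter: 2048 / 16000 s = 128 ms.
filter_ms = 1000.0 * FILTER_TAPS / SAMPLE_RATE_HZ

def apply_unmixing_filters(x, w):
    """Apply a matrix of FIR unmixing filters via FFT-based convolution.

    x : array of shape (n_mics, n_samples), the microphone signals
    w : array of shape (n_out, n_mics, n_taps); w[i, m] filters mic m for output i
    Returns y of shape (n_out, n_samples + n_taps - 1), where
    y[i] = sum over m of (w[i, m] convolved with x[m]).
    """
    n_taps = w.shape[2]
    n_full = x.shape[1] + n_taps - 1               # length of the full linear convolution
    n_fft = 1 << (n_full - 1).bit_length()         # next power of two >= n_full

    x_f = np.fft.rfft(x, n_fft, axis=1)            # (n_mics, n_bins)
    w_f = np.fft.rfft(w, n_fft, axis=2)            # (n_out, n_mics, n_bins)

    # In each frequency bin the filter matrix reduces to a matrix-vector product.
    y_f = np.einsum('imf,mf->if', w_f, x_f)
    return np.fft.irfft(y_f, n_fft, axis=1)[:, :n_full]
```

Zero-padding to at least `n_samples + n_taps - 1` points avoids circular wrap-around, so the result matches direct time-domain convolution.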