Blind Source Separation of recorded speech and music signals.
We present methods for the blind separation of mixed signals recorded in a
room. The learning algorithm is based on information maximization
(infomax) in a single-layer neural network. We focus on the implementation of
the learning algorithm and on issues that arise when separating speakers
in room recordings. The infomax approach is applied in a feedforward
neural network implemented in the frequency domain using
the polynomial filter matrix algebra technique. Fast
convergence was achieved by using a time-delayed decorrelation
method as a preprocessing step. Under minimum-phase mixing conditions
this preprocessing step alone was sufficient to separate the
signals. These methods successfully separated a recorded voice from
music in the background (the cocktail party problem).
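To illustrate the idea behind time-delayed decorrelation, here is a minimal sketch for the much simpler case of an instantaneous (non-convolutive) mixture. It is not the algorithm from the paper, which handles room recordings with filters; it is the textbook variant that whitens the mixtures and then diagonalizes one symmetrized time-lagged covariance matrix. The function name `tdd_unmix`, the lag value, and the synthetic sine sources are assumptions made for this example.

```python
import numpy as np

def tdd_unmix(x, lag=1):
    """Estimate an unmixing matrix for an instantaneous mixture.

    x has shape (n_channels, n_samples). The sources are assumed to be
    uncorrelated and to have different autocorrelations at the given lag.
    """
    x = x - x.mean(axis=1, keepdims=True)
    n = x.shape[1]

    # Step 1: whiten the mixtures so their zero-lag covariance is the identity.
    c0 = x @ x.T / n
    d, e = np.linalg.eigh(c0)
    v = e @ np.diag(d ** -0.5) @ e.T
    z = v @ x

    # Step 2: symmetrized covariance between the signal and its delayed copy.
    c_tau = z[:, lag:] @ z[:, :-lag].T / (n - lag)
    c_tau = 0.5 * (c_tau + c_tau.T)

    # Step 3: the eigenvectors of the lagged covariance give the rotation
    # that decorrelates the whitened signals at this lag as well.
    _, u = np.linalg.eigh(c_tau)
    return u.T @ v

# Synthetic demo (hypothetical data): two sines mixed by a fixed 2x2 matrix.
t = np.arange(20000)
s = np.vstack([np.sin(2 * np.pi * 0.01 * t), np.sin(2 * np.pi * 0.2 * t)])
a = np.array([[1.0, 0.6], [0.5, 1.0]])
w = tdd_unmix(a @ s, lag=1)
# Each row of w @ a should have one dominant entry (scaled permutation).
print(np.round(w @ a, 3))
```

The sources are recovered only up to order, sign, and scale, which is the usual ambiguity in blind source separation.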
Blind Source Separation: Audio Examples
The audio files have been updated using the newly proposed algorithm,
which combines the time-delayed decorrelation (TDD) algorithm with ICA
(see the ICASSP'98 paper).
1. Speech - Music Separation
A speaker was recorded with two distant-talking microphones
(sampling rate 16 kHz)
in a normal office room with loud
music in the background. The distance between the speaker, the cassette player,
and the microphones is about 60 cm, in a square arrangement.
(All files are in 16 kHz WAV format.)
2. Speech - Speech Separation
A real cocktail party effect.
Two speakers were recorded speaking simultaneously.
Speaker 1 says the digits from one to ten in English while speaker 2 counts
the digits in Spanish at the same time (uno, dos, ...).
The recording was made in a normal office room. The distance between the
speakers and the microphones is about 60 cm, in a square arrangement
(sampling rate 16 kHz).
(All files are in 16 kHz WAV format.)
3. Speech - Speech Separation in difficult environments
A real cocktail party effect II.
Two speakers were recorded speaking simultaneously.
This time the recording was made in a conference room (5.5 m by 8 m)
with some air-conditioning noise.
Both speakers read a section from a newspaper for 16 s.
The microphones were placed 120 cm away from the speakers
(sampling rate 16 kHz).
(All files are in 16 kHz WAV format.)
The unmixing filters need to be sufficiently long. We used a
filter length of 2048 taps for each filter, which spans 128 ms at the
16 kHz sampling rate.
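As a rough sketch of what applying such filters involves, the example below runs a two-channel recording through a 2x2 matrix of FIR unmixing filters, each 2048 taps long, using FFT-based convolution. The array layout `w[i, j]` (filter from microphone `j` to output `i`) and the names `fft_convolve` and `apply_unmixing` are assumptions for illustration; the page states only the filter length and the sampling rate.

```python
import numpy as np

FS = 16000    # sampling rate in Hz (from the recordings)
TAPS = 2048   # taps per unmixing filter (from the text)

# Time span covered by one filter, in milliseconds.
filter_span_ms = 1000 * TAPS / FS

def fft_convolve(x, h):
    """Full linear convolution of two 1-D signals computed via the FFT."""
    n = len(x) + len(h) - 1
    nfft = 1 << (n - 1).bit_length()   # next power of two >= n
    y = np.fft.irfft(np.fft.rfft(x, nfft) * np.fft.rfft(h, nfft), nfft)
    return y[:n]

def apply_unmixing(x, w):
    """Apply a matrix of FIR filters to a multichannel signal.

    x has shape (n_in, n_samples); w has shape (n_out, n_in, taps).
    Output channel i is the sum over j of x[j] convolved with w[i, j].
    """
    n_out, n_in, taps = w.shape
    y = np.zeros((n_out, x.shape[1] + taps - 1))
    for i in range(n_out):
        for j in range(n_in):
            y[i] += fft_convolve(x[j], w[i, j])
    return y

# Usage with placeholder data: identity filters pass the input through.
x = np.random.default_rng(0).standard_normal((2, FS))   # 1 s of noise
w = np.zeros((2, 2, TAPS))
w[0, 0, 0] = 1.0
w[1, 1, 0] = 1.0
y = apply_unmixing(x, w)
print(filter_span_ms, y.shape)
```

In practice the filter coefficients would come from the learning algorithm; the identity filters here only check that the plumbing works.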