Introduction to Independent Component Analysis

Recently, blind source separation by Independent Component Analysis (ICA) has received attention because of its potential applications in signal processing, such as speech recognition, telecommunications, and medical signal processing. The goal of ICA is to recover independent sources given only sensor observations that are unknown linear mixtures of the unobserved independent source signals. In contrast to correlation-based transformations such as Principal Component Analysis (PCA), ICA not only decorrelates the signals (2nd-order statistics) but also reduces higher-order statistical dependencies, attempting to make the signals as statistically independent as possible. In other words, "ICA is a way of finding a linear non-orthogonal co-ordinate system in any multivariate data. The directions of the axes of this co-ordinate system are determined by both the second and higher order statistics of the original data. The goal is to perform a linear transform which makes the resulting variables as statistically independent from each other as possible."
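The distinction between decorrelation and full independence can be illustrated numerically. The following is a minimal sketch assuming two hypothetical non-Gaussian sources and an arbitrary $2 \times 2$ mixing matrix; it shows that whitening (a PCA-style decorrelating transform) removes all second-order dependence while leaving the rotation that ICA must still recover:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000
# two independent non-Gaussian sources (hypothetical choice)
s = np.vstack([rng.laplace(size=n), rng.uniform(-1.0, 1.0, size=n)])
A = np.array([[1.0, 0.5],
              [0.5, 1.0]])      # "unknown" mixing matrix
x = A @ s                       # sensor observations

# whitening (PCA-style decorrelation) removes 2nd-order dependence only
d, E = np.linalg.eigh(np.cov(x))
z = E @ np.diag(d ** -0.5) @ E.T @ x
print(np.allclose(np.cov(z), np.eye(2)))   # True: decorrelated,
# but the sources are still mixed by an unknown rotation, which only
# higher-order statistics can resolve
```

Any orthogonal rotation of `z` is equally decorrelated, which is why second-order statistics alone cannot identify the sources.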

Two different research communities have considered the analysis of independent components. On one hand, the study of separating mixed sources observed in an array of sensors has been a classical and difficult signal processing problem. The seminal work on blind source separation was by Herault and Jutten (1986), who introduced an adaptive algorithm in a simple feedback architecture that was able to separate several unknown independent sources. Their approach was further developed by Jutten and Herault (1991), Karhunen and Joutsensalo (1994), and Cichocki, Unbehauen and Rummert (1994). Comon (1994) elaborated the concept of independent component analysis and proposed cost functions related to the approximate minimization of mutual information between the output components.

In parallel to blind source separation studies, unsupervised learning rules based on information theory were proposed by Linsker (1992). The goal was to maximize the mutual information between the inputs and outputs of a neural network. This approach is related to the principle of redundancy reduction suggested by Barlow (1961) as a coding strategy in neurons. Each neuron should encode features that are as statistically independent as possible from other neurons over a natural ensemble of inputs; decorrelation as a strategy for visual processing was explored by Atick (1992). Nadal and Parga (1994) showed that in the low-noise case, the maximum of the mutual information between the input and output of a neural network implied that the output distribution was factorial; that is, the multivariate probability density function (p.d.f.) can be factorized as a product of marginal p.d.f.s. Roth and Baram (1996) and Bell and Sejnowski (1995) independently derived stochastic gradient learning rules for this maximization and applied them, respectively, to forecasting and time series analysis, and to the blind separation of sources. Bell and Sejnowski (1995) put the blind source separation problem into an information-theoretic framework and demonstrated the separation and deconvolution of mixed sources. Their adaptive methods are more plausible from a neural processing perspective than the cumulant-based cost functions proposed by Comon (1994). A similar adaptive method for source separation was proposed by Cardoso and Laheld (1996).
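The infomax stochastic gradient rule of Bell and Sejnowski (1995), $\Delta W \propto (W^T)^{-1} + (1-2y)x^T$ with logistic outputs $y = g(Wx)$, can be sketched as follows. The pre-whitening step and all learning parameters below are implementation choices for this illustration, not part of the original derivation:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20000
s = rng.laplace(size=(2, n))            # two super-Gaussian sources
A = np.array([[1.0, 0.6],
              [0.4, 1.0]])              # hypothetical mixing matrix
x = A @ s

# pre-whiten the mixtures (a stability aid, assumed here for brevity)
d, E = np.linalg.eigh(np.cov(x))
z = E @ np.diag(d ** -0.5) @ E.T @ x

W = np.eye(2)
lr = 0.05
for _ in range(3000):
    idx = rng.integers(0, n, size=256)          # stochastic mini-batch
    zb = z[:, idx]
    y = 1.0 / (1.0 + np.exp(-W @ zb))           # logistic nonlinearity
    # infomax gradient of the joint output entropy, batch-averaged:
    # (W^T)^{-1} + (1 - 2y) z^T
    W += lr * (np.linalg.inv(W.T) + (1.0 - 2.0 * y) @ zb.T / zb.shape[1])

u = W @ z    # recovered sources, up to permutation and scaling
```

As always in blind separation, the recovered components match the true sources only up to an arbitrary permutation and scaling.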

Other algorithms for performing ICA have been proposed from different viewpoints. Maximum Likelihood Estimation (MLE) approaches to ICA were first proposed by Gaeta and Lacoume (1990) and elaborated by Pham (1992). Pearlmutter and Parra (1996), MacKay (1996) and Cardoso (1997) showed that the infomax approach of Bell and Sejnowski (1995) and the maximum likelihood estimation approach are equivalent. Girolami and Fyfe (1997b,c), motivated by information-theoretic indices for Exploratory Projection Pursuit (EPP), used marginal negentropy~\footnote{Negentropy is a special case of relative entropy (Cover and Thomas, 1991).} as a projection index and showed that kurtosis-seeking projection pursuit will extract one of the underlying sources from a linear mixture. A multiple output EPP network was developed to allow full separation of all the underlying sources (Girolami and Fyfe, 1997c). Nonlinear PCA algorithms for ICA, developed by Karhunen and Joutsensalo (1994), Xu (1993) and Oja (1997), can also be viewed from the infomax principle since they approximately minimize the sum of squares of the fourth-order marginal cumulants (Comon, 1994) and therefore approximately minimize the mutual information of the network outputs (Girolami and Fyfe, 1997a). Bell and Sejnowski (1995) have pointed out a similarity between their infomax algorithm and the Bussgang algorithm in signal processing, and Lambert (1996) elucidated the connection between three different Bussgang cost functions. Lee et al. (1998) show how the Bussgang property relates to the infomax principle and how all of these seemingly different approaches can be put into a unifying framework for the source separation problem based on an information-theoretic approach.
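The claim that a kurtosis-seeking projection extracts a single source can be demonstrated with a one-unit sketch on whitened data. The fixed-point update below, $w \leftarrow E[z(w^T z)^3] - 3w$ followed by renormalization, is one standard way to maximize the absolute kurtosis of a projection; it is used here purely as an illustration, not as Girolami and Fyfe's exact algorithm:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20000
s = np.vstack([rng.laplace(size=n),        # highly kurtotic source
               rng.normal(size=n)])        # Gaussian background
A = np.array([[0.8, 0.3],
              [0.2, 0.9]])                 # hypothetical mixing matrix
x = A @ s

# whiten the mixtures so that the unmixing direction is a unit vector
d, E = np.linalg.eigh(np.cov(x))
z = E @ np.diag(d ** -0.5) @ E.T @ x

# fixed-point iteration maximizing |kurtosis| of the projection w^T z
w = rng.normal(size=2)
w /= np.linalg.norm(w)
for _ in range(50):
    u = w @ z
    w = (z * u ** 3).mean(axis=1) - 3.0 * w   # E[z u^3] - 3w on unit sphere
    w /= np.linalg.norm(w)

extracted = w @ z   # aligns with the kurtotic (Laplacian) source
```

Because the Gaussian component has zero excess kurtosis, the kurtosis-maximizing projection locks onto the non-Gaussian source, which is exactly the projection-pursuit intuition.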

The original infomax learning rule for blind separation by Bell and Sejnowski (1995) was suitable for super-Gaussian sources. Girolami and Fyfe (1997b) derive, by choosing negentropy as a projection pursuit index, a learning rule that is able to blindly separate mixed sub- and super-Gaussian source distributions. Lee, Girolami and Sejnowski (1997) show that the learning rule is an extension of the infomax principle satisfying a general stability criterion and preserving the simple architecture of Bell and Sejnowski (1995). When optimized using the natural gradient (Amari, 1997), or equivalently the relative gradient (Cardoso and Laheld, 1996), the learning rule gives superior convergence. Simulations and results on real-world physiological data show the power of the proposed methods (Lee, Girolami and Sejnowski, 1997).
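A sketch of the extended rule with the natural gradient is $\Delta W \propto (I - K\,\tanh(u)u^T - uu^T)W$, where $K$ is a diagonal matrix of $\pm 1$ entries selecting the sub- or super-Gaussian branch for each output. In the sketch below the signs are estimated from the batch kurtosis, a simplification of the published stability-based switching criterion; sources and learning parameters are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20000
s = np.vstack([rng.laplace(size=n),                            # super-Gaussian
               rng.uniform(-np.sqrt(3), np.sqrt(3), size=n)])  # sub-Gaussian
A = np.array([[1.0, 0.7],
              [0.3, 1.0]])          # hypothetical mixing matrix
x = A @ s

W = np.eye(2)
lr = 0.01
for _ in range(2000):
    idx = rng.integers(0, n, size=256)
    u = W @ x[:, idx]
    m = u.shape[1]
    # sign of each output's excess kurtosis selects the sub/super branch
    # (a simplified stand-in for the published switching criterion)
    k = np.sign(np.mean(u ** 4, axis=1) - 3 * np.mean(u ** 2, axis=1) ** 2)
    K = np.diag(k)
    # natural-gradient extended infomax update
    W += lr * (np.eye(2) - K @ np.tanh(u) @ u.T / m - u @ u.T / m) @ W

u_all = W @ x    # recovered sources, up to permutation and scaling
```

Note that multiplying the gradient by $W^T W$ (the natural gradient) removes the matrix inversion required by the original rule, which is the source of the improved convergence mentioned above.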

Extensive simulations have been performed to demonstrate the power of the learning algorithm. However, instantaneous mixing and unmixing simulations are {\em toy} problems and the challenge lies in dealing with real world data. Makeig et al. (1996) applied the original infomax algorithm to EEG and ERP data, showing that the algorithm can extract EEG activations and isolate artifacts. Jung et al. (1997) show that the extended infomax algorithm is able to linearly decompose EEG artifacts such as line noise, eye blinks, and cardiac noise into independent components with sub- and super-Gaussian distributions. McKeown et al. (1997) have used the extended ICA algorithm to investigate task-related human brain activity in fMRI data. By determining the brain regions that contained significant amounts of specific temporally independent components, they were able to specify the spatial distribution of transiently task-related brain activations. Other potential applications may result from exploring independent features in natural images. Bell and Sejnowski (1997) suggest that independent components of natural scenes are edge filters. The filters are localized, mostly oriented, and similar to Gabor-like filters. The outputs of the ICA filters are sparsely distributed. Bartlett and Sejnowski (1997) and Gray, Movellan and Sejnowski (1997) demonstrate the successful use of the ICA filters as features in face recognition and lipreading tasks, respectively.

For these applications, the instantaneous mixing model may be appropriate because the propagation delays are negligible. However, in real environments substantial time-delays may occur, and an architecture and algorithm are needed to account for the mixing of time-delayed sources and convolved sources. The multichannel blind source separation problem has been addressed by Yellin and Weinstein (1994), Nguyen and Jutten (1995) and others based on $4^{th}$-order cumulant criteria. An extension to time-delays and convolved sources from the infomax viewpoint using a feedback architecture has been developed by Torkkola (1996). Lee, Bell and Lambert (1997) have extended the blind source separation problem to a full feedback system and a full feedforward system. The feedforward architecture allows the inversion of non-minimum phase systems. In addition, the rules are extended using polynomial filter matrix algebra in the frequency domain (Lambert, 1996). The proposed method can successfully separate voices and music recorded in a real environment. Lee, Bell and Orglmeister (1997) show that the recognition rate of an automatic speech recognition system is increased after separating the speech signals.
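The appeal of the frequency-domain formulation can be seen directly: with adequate zero-padding, time-domain convolution becomes multiplication at each frequency, $X(\omega) = A(\omega)S(\omega)$, so each frequency bin looks like an ordinary instantaneous mixing problem. A minimal sketch with one hypothetical FIR mixing filter:

```python
import numpy as np

rng = np.random.default_rng(4)
s = rng.normal(size=256)            # one source channel
a = np.array([1.0, 0.5, 0.25])      # hypothetical FIR mixing filter

x = np.convolve(a, s)               # time-domain convolutive "mixing"

# zero-pad both signals to the full output length, then compare spectra
N = len(x)                          # 256 + 3 - 1 samples
X = np.fft.fft(x)
AS = np.fft.fft(a, N) * np.fft.fft(s, N)
print(np.allclose(X, AS))           # True: convolution = per-frequency product
```

This is why polynomial filter matrix algebra in the frequency domain reduces the convolutive problem to a family of instantaneous ones, at the cost of handling permutation and scaling consistently across frequency bins.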

Since ICA relies on several restrictive assumptions, researchers have started to tackle some of its limitations. One obvious but non-trivial extension is the nonlinear mixing model. In (Hermann and Yang, 1996; Lin and Cowan, 1997; Pajunen, 1997), nonlinear components are extracted using self-organizing feature maps (SOFMs). Other researchers (Burel, 1992; Lee, Koehler and Orglmeister, 1997; Taleb and Jutten, 1997; Yang, Amari and Cichocki, 1997) have used a more direct extension of the previously presented ICA models: they include certain flexible nonlinearities in the mixing model, and the goal is to invert the nonlinear mixing as well as the linear mixing matrix. More recently, Hochreiter and Schmidhuber (1998) have proposed low-complexity coding and decoding approaches for nonlinear ICA. Other limitations, such as the under-determined problem in ICA (i.e., having fewer sensors than sources) and noise models in the ICA formulation, are the subject of current research efforts.

ICA is a fairly new and generally applicable method for several challenges in signal processing. It raises a diversity of theoretical questions and opens up a variety of potential applications. Successful results in EEG, fMRI, speech recognition and face recognition systems indicate the power of the new paradigm and give grounds for optimism.

Te-Won Lee, March 98