PERONA GROUP 
 
 
TitleIndependent Component Analysis 

Author: Max Welling and Markus Weber 

Abstract 
A method is developed that linearly extracts independent features from a data-set. Unlike PCA, which merely decorrelates the data, ICA searches for directions in data-space, which are independent across all statistical orders. Our approach has the advantage of learning a probabilistic model (as opposed to projections in data-space), while remaining computationally efficient in high dimensions. 

What is ICA?
Let  denote  a vector of  D independent source random variables, and  X  denote a vector of D sensor outputs.  Furthermore, assume a linear relationship between  S and X  (see figure 1),  possibly "contaminated" with gaussian noise, nu. The matrix A, which mixes the sources, is called mixing matrix. The task of ICA may be formulated as follows:

Given a sequence of (independent) observations, X1, ..., Xn, estimate the mixing matrix and the original source sequence S1, ..., Sn.
   

  
  

Why is ICA possible?
Let's  plot each sample of the D sensors in a D-dimensional data space,  as is done in figure 2. If the data were  normally distributed, this cloud would look elliptical. Next we shall center and sphere the data. This implies that we subtract the mean and scale all eigen-directions such that the covariance matrix is the identity matrix I .  If the data were normally distributed, the cloud would look like a sphere and no further information is present to retrieve the original sources.  However, as can be seen from the figure, in the case that the sources are not normally distrubuted, the cloud displays some features (like the peaks in the plot) which can be used to rotate the data into the most  independent basis.  In this basis the sources lie along the axes.  The inverse of this transformation is then the mixing matrix. The original source sequence can furthermore be retrieved by acting with the inverse mixing matrix on the sensor data.
 

  

The Model
Our approach is to fit a probabilistic model to the data maximizing the likelihood of the data under the model. The model is given in figure 3. Each source is described by a mixture of Gaussians while all sources are assumed independent. Then the sources are linearly mixed and convolved with isotropic Gaussian noise. In this framework, the sources are considered as hidden variables. The expectation maximization (EM) algorithm is employed to iteratively fit this model to the data. The estimated parameters  are the noise variance, the parameters of the source densities (i.e. mixture coefficients, means, variances), and the mixing matrix.  Because we first sphered the data in a preprocessing step, it can be proven that the mixing matrix is a rotation. This, and the fact that we use simple isotropic Gaussian noise, avoids an exponential increase of the complexity with the number of sources. For this purpose, the EM algorithm had to be adjusted in order to optimize over orthogonal matrices.
 

 
 
 
 
 
 
Soft-Switching
If one is not interested in an accurate probabilistic model, but only in the mixing matrix, a simpler  algorithm is possible. It has been observed in the literature that the estimation of the mixing matrix is relatively insensitive to the details of the source densities. We will therefore use a mixture of a super-Gaussian model and a sub-Gaussian model for every source density (sub-Gaussian is more uniform than Gaussian, super-Gaussian is more peaked  than Gaussian with longer tails, see figure 4 red line, yellow line  is Gaussian).  The models for the super- and sub-Gaussian densities are in turn build from a preset mixture of Gaussians. For every source only one parameter, namely the mixing coefficient between the super- and sub-Gaussian component, is estimated. This reduces the total number of parameters of the model (and thus the complexity), which helps to avoid overfitting in cases where few data are available.
 
 

Discussion of the Algorithm
In the following we will discuss some important features of our algorithm.

  • The algorithm estimates a square mixing matrix. This implies that the number of sources need to match the number of sensors.
  • Simple noise is modelled.
  • Due to the simple noise assumption and the invertibility of the mixing matrix, an exponential increase of the complexity  with the number of sources is avoided.
  • The algorithm estimates a probability density of the data. This is important in applications such as pattern recognition. It also allows data synthesis. 
  • EM has stable convergence and does not need parameter tuning (like stepsizes in gradient descent).
  • An extension to ICA of incomplete data has been worked out and tested on sound data. This problem is important   in the case of sensor occlusions.
  • An extension to mixtures of ICAs is straightforward due to the nature of the EM algorithm.
 

Experiments
To test the viability of the algorithm we performed computer experiments on real sound data and synthetic data from a Laplace density. The Amari distance between the true mixing matrix and the estimated one was monitored. In all cases we found excellent  convergence to good solutions. We found that the method performed comparable to 2 other widely used ICA methods, while enjoying the advantages mentioned above. In the figure we plot the relative Amari distance (normalized to start at 1) when approximately half of the data was deleted by hand. The blue line follows the Amari distance when only complete data vectors are used. The green line is the improved result after the incomplete data were taken into account.

 

Conclusion
In the figure we  show the relationship between 6 linear models. The top row deals with PCA-like models which look for high variance directions (FA is factor analysis). All these models decorrelate the data using only second order statistics. From left to right the noise model becomes increasingly sophisticated. While PCA searches for data projections, the other two models estimate probability densities. In the bottom row we find three ICA related techniques. Again, from left to right the noise model becomes more sophisticated. Also, the leftmost model (ICA) searches for projections while the other two estimate probability densities. The difference between the top row and the bottom row is that all ICA related methods try to find a basis on which the data are independent across all statistical orders.  The most general method of all (IFA is independent factor analysis) suffers from an exponential increase in complexity with increasing  number of sources. Precisely this drawback is avoided in EM-ICA (method discussed here). The price paid for this is a square mixing matrix and a simple noise model.
 

 
 

Publications
M.Welling and  M. Weber  (1999)  A constrained E.M. Algorithm for Independent Component Analysis, submitted for publication to Neural Computation.

M.Welling and Markus Weber (1999) Independent Component Analysis of Incomplete Data, Proceedings of the 6th joint Symposium on Neural Computation, May 22 1999, UCSD.
 

Max Welling - welling@vision.caltech.edu