PERONA GROUP
| Title:
Independent Component Analysis
Author:
Max Welling and Markus Weber
Abstract
A method
is developed that linearly extracts independent features from a data-set.
Unlike PCA, which merely decorrelates the data, ICA searches for directions
in data-space, which are independent across all statistical orders. Our
approach has the advantage of learning a probabilistic model (as opposed
to projections in data-space), while remaining computationally efficient
in high dimensions. |
What is ICA?
Let S denote a vector of D independent
source random variables, and X denote a vector of D
sensor outputs. Furthermore, assume a linear relationship between
S and X (see figure 1), possibly "contaminated"
with gaussian noise, nu. The matrix A, which mixes the sources,
is called mixing matrix. The task of ICA may be formulated as follows:
Given a sequence of (independent) observations, X1, ..., Xn,
estimate the mixing matrix and the original source sequence S1, ...,
Sn.
Why is ICA possible?
Let's
plot each sample of the D sensors in a D-dimensional data space,
as is done in figure 2. If the data were normally distributed, this
cloud would look elliptical. Next we shall center and sphere the data.
This implies that we subtract the mean and scale all eigen-directions such
that the covariance matrix is the identity matrix I . If the
data were normally distributed, the cloud would look like a sphere and
no further information is present to retrieve the original sources.
However, as can be seen from the figure, in the case that the sources are
not normally distrubuted, the cloud displays some features (like the peaks
in the plot) which can be used to rotate the data into the most independent
basis. In this basis the sources lie along the axes.
The inverse of this transformation is then the mixing matrix. The original
source sequence can furthermore be retrieved by acting with the inverse
mixing matrix on the sensor data.
The Model
Our approach
is to fit a probabilistic model to the data maximizing the likelihood of
the data under the model. The model is given in figure 3. Each source is
described by a mixture of Gaussians while all sources are assumed independent.
Then the sources are linearly mixed and convolved with isotropic Gaussian
noise. In this framework, the sources are considered as hidden variables.
The expectation maximization (EM) algorithm is employed to iteratively
fit this model to the data. The estimated parameters are the noise
variance, the parameters of the source densities (i.e. mixture coefficients,
means, variances), and the mixing matrix. Because we first sphered
the data in a preprocessing step, it can be proven that the mixing matrix
is a rotation. This, and the fact that we use simple isotropic Gaussian
noise, avoids an exponential increase of the complexity with the number
of sources. For this purpose, the EM algorithm had to be adjusted in order
to optimize over orthogonal matrices.
Soft-Switching
If one is
not interested in an accurate probabilistic model, but only in the mixing
matrix, a simpler algorithm is possible. It has been observed in
the literature that the estimation of the mixing matrix is relatively insensitive
to the details of the source densities. We will therefore use a mixture
of a super-Gaussian model and a sub-Gaussian model for every source density
(sub-Gaussian is more uniform than Gaussian, super-Gaussian is more peaked
than Gaussian with longer tails, see figure 4 red line, yellow line
is Gaussian). The models for the super- and sub-Gaussian densities
are in turn build from a preset mixture of Gaussians. For every source
only one parameter, namely the mixing coefficient between the super- and
sub-Gaussian component, is estimated. This reduces the total number of
parameters of the model (and thus the complexity), which helps to avoid
overfitting in cases where few data are available.
Discussion of the Algorithm
In the following
we will discuss some important features of our algorithm.
-
The algorithm
estimates a square mixing matrix. This implies that the number of sources
need to match the number of sensors.
-
Simple noise is
modelled.
-
Due to the simple
noise assumption and the invertibility of the mixing matrix, an exponential
increase of the complexity with the number of sources is avoided.
-
The algorithm
estimates a probability density of the data. This is important in applications
such as pattern recognition. It also allows data synthesis.
-
EM has stable
convergence and does not need parameter tuning (like stepsizes in gradient
descent).
-
An extension to
ICA of incomplete data has been worked out and tested on sound data. This
problem is important in the case of sensor occlusions.
-
An extension to
mixtures of ICAs is straightforward due to the nature of the EM algorithm.
Experiments
To test the
viability of the algorithm we performed computer experiments on real sound
data and synthetic data from a Laplace density. The Amari distance between
the true mixing matrix and the estimated one was monitored. In all cases
we found excellent convergence to good solutions. We found that the
method performed comparable to 2 other widely used ICA methods, while enjoying
the advantages mentioned above. In the figure we plot the relative Amari
distance (normalized to start at 1) when approximately half of the data
was deleted by hand. The blue line follows the Amari distance when only
complete data vectors are used. The green line is the improved result after
the incomplete data were taken into account.
Conclusion
In the figure we show the
relationship between 6 linear models. The top row deals with PCA-like models
which look for high variance directions (FA is factor analysis). All these
models decorrelate the data using only second order statistics. From left
to right the noise model becomes increasingly sophisticated. While PCA
searches for data projections, the other two models estimate probability
densities. In the bottom row we find three ICA related techniques. Again,
from left to right the noise model becomes more sophisticated. Also, the
leftmost model (ICA) searches for projections while the other two estimate
probability densities. The difference between the top row and the bottom
row is that all ICA related methods try to find a basis on which the data
are independent across all statistical orders. The most general method
of all (IFA is independent factor analysis) suffers from an exponential
increase in complexity with increasing number of sources. Precisely
this drawback is avoided in EM-ICA (method discussed here). The price paid
for this is a square mixing matrix and a simple noise model.
|