Rao, Bhaskar (NOTE: Ken Kreutz-Delgado will likely present the paper)
Coauthors(s): K. Kreutz-Delgado, B.D. Rao, K. Engan Electrical and Computer Engineering Jacobs School of Engineering University of California, San Diego and T.-W. Lee, T. Sejnowski Computational Neurobiology Laboratory The Salk Institute
UC San Diego
Electrical and Computer Engineering
Mail Code 0407 Room 2407 Bldg EB-U1 9500 Gilman Drive La Jolla, CA 92093-0407



Learning Overcomplete Dictionaries: Convex/Schur-Convex (SCS) Log-Priors, Factorial Codes, and Independent/Dependent Component Analysis (I/DCA).

As described in the companion paper, "Convex/Schur-Convex (SCS) Log-Priors and Sparse Coding," the generative model y = A x + n (1) can be used to develop algorithms enabling sparse coding of y. In that paper it it shown that given an overcomplete dictionary, A, (with the columns of A comprising the dictionary vectors) a MAP estimate of the source vector, x, will yield a sparse coding of y in the low-noise limit if the prior, p(x), is convex/Schur-convex (CSC). For p(x) factorizable into a product of marginal probabilities, the resulting code is also an ICA representation of y. More generally, a CSC prior results in a sparse representation even in the non-factorizable case (with x then forming a "dependent component analysis," or DCA, representation). As noted in recently published work by Olshausen, Fields, Lewicki, and Sejnowski, given iid data Y = (y_1, ... , y_N) generated by the model (1), a maximum likelihood estimate, A_ML, of the dictionary A can be determined as A_ML = argmax_A p(Y; A). This requires integrating out the unobservable iid source vectors, X = (x_1, ... , x_N), in order to compute p(Y;A) from the (assumed) known probabilities p(x) and p(n). In essence X is formally treated as a vector of nuisance parameters which, in principle, can be removed via integration. However, because the prior p(x) is generally taken to be supergaussian, this integration is intractable or computationally unreasonable. Thus approximations to this integration are performed which results in an approximation to p(Y;A) which is then maximized with respect to Y. A new, better, approximation to the integration can then be made and this process is iterated until the estimate of A has converged. No formal proof of the convergence of these algorithms to the true maximum likelihood estimate, A_ML, are given, but they appear to perform well in various test cases. We examine the problem of dictionary learning within the framework of our recently developed CSC log-prior model which can be used to obtain sparse codes (for a known dictionary, A) both for the ICA (factorial code) and DCA (nonfactorial code) cases. Such sparse codes are found using an affine scaling transformation (AST) like interative algorithm recently described by Rao and Kreutz-Delgado which finds a locally optimally MAP estimate of the source x for an associated observation y. This framework enables the development of a variety of novel algorithms using a missing data formulation where the missing data, X, is assumed available. This is the standard framework used to develop the expectation-maximization (EM) algorithm. However, the proposed algorithms are generally distinct from the EM algorithm and the asymptotic relationship of the resulting dictionary estimates to the true maximum likelihood estimate is currently under theoretical investigation. Under certain conditions, convergence to a local minimum of a loss function which measures a combination of functions of the discrepancy ||y - A x|| and the degree of sparsity in x *can* be shown, but the asymptotic statistical properties of this estimate remain unclear (at least at the time of the writing of this abstract). The complexity of the proposed dictionary algorithms, their similarity/dissimilarity to previously proposed algorithms, and their performance on some test cases are also discussed.