Tuesday, October 21, 2008

Learning to Represent Patterns

For the past several months I have been revisiting the issue of sensory and motor representation. I had implemented some initial ideas at the end of 2006, but I hadn't taken the time to study things in depth. My goal here is to represent real-valued sensory and motor spaces as efficiently as possible with limited resources. For example, say we're talking about visual sensory data (i.e. pixel arrays) involving 100 pixels (10x10 image), and we only have the resources to represent the 28 most common visual patterns. If we want to represent that visual space efficiently, we have to move our 28 basis vectors around the 100-dimensional space so that the resulting vectors represent the 28 most common visual patterns. This all must be learned online (in real time) as the system is experiencing visual data. Then, after learning, each incoming image will be classified as one of those 28 "categories."
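To make the setup concrete, here is a minimal sketch of the representation and classification step (the names `centers` and `classify` and the constants are just illustrative, and the learning step itself is left out):

```python
import numpy as np

# Illustrative constants: a 10x10 image flattened to 100 dimensions,
# and resources for only 28 kernels ("categories").
N_DIM, N_KERNELS = 100, 28

rng = np.random.default_rng(0)
centers = rng.random((N_KERNELS, N_DIM))  # the 28 basis vectors / kernel centers

def classify(v):
    """Assign an incoming image v (flattened 10x10 array) to the nearest center."""
    distances = np.linalg.norm(centers - v, axis=1)
    return int(np.argmin(distances))

# Online use: each incoming image gets a category label; in the full system
# the centers would also be adjusted after every sample.
image = rng.random(N_DIM)
category = classify(image)
```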

One approach is based on the standard statistical technique of maximum likelihood learning. We assume the basis vectors are the center points of Gaussian kernels, each with a corresponding variance. For each data sample, we compute the likelihood of seeing that sample. (Given our current model, i.e. the 28 Gaussian centers and variances, what's the likelihood that the data sample was "generated" by our model?) Maximum likelihood learning attempts to adjust the Gaussian kernel positions and sizes within the data space to maximize this likelihood value over all data samples. The end result should be the model that best represents the actual data distribution.
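As a rough sketch of that idea (assuming isotropic Gaussians; the function names and the stochastic, EM-style update below are my own illustration, not necessarily the exact procedure):

```python
import numpy as np

def gaussian_likelihoods(v, centers, variances):
    """p(v | c) for each isotropic Gaussian kernel c (one variance per kernel)."""
    d = centers.shape[1]
    sq_dist = np.sum((centers - v) ** 2, axis=1)
    norm = (2.0 * np.pi * variances) ** (d / 2.0)
    return np.exp(-sq_dist / (2.0 * variances)) / norm

def online_ml_step(v, centers, variances, priors, lr=0.01):
    """One stochastic EM-style update nudging the model toward higher likelihood of v.
    (A real implementation would work in log space to avoid underflow in high dimensions.)"""
    lik = gaussian_likelihoods(v, centers, variances)
    post = priors * lik
    post /= post.sum()                      # responsibilities p(c | v)
    # Move each center toward the sample in proportion to its responsibility.
    centers += lr * post[:, None] * (v - centers)
    # Adapt each variance toward the per-dimension squared distance to the sample.
    sq_dist = np.sum((centers - v) ** 2, axis=1) / centers.shape[1]
    variances += lr * post * (sq_dist - variances)
    return post
```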

Another approach is based on information theory, specifically the mutual information between the "input" and "output" variables. (This idea is usually attributed to Ralph Linsker at IBM Research, who called it "infomax.") I like to think of it this way: the data samples are coming from a real-valued input variable, V. We want to classify those samples into a finite number of classes, which together make up the discrete class variable C. Each Gaussian kernel represents one class in C. Now, for each sample v, we want to transmit the maximum amount of information to the output class variable. We can do this by maximizing the mutual info between V and C, given the constraint of limited resources (i.e. a finite number of Gaussian kernels).
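In these terms, the quantity to maximize is the standard mutual information between V and C (here P(c) is the prior over classes, i.e. P(c) = ∫ p(v) P(c|v) dv):

```latex
I(V;C) \;=\; H(C) - H(C \mid V)
       \;=\; -\sum_{c} P(c)\log P(c)
       \;+\; \int p(v) \sum_{c} P(c \mid v)\log P(c \mid v)\,dv
```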

How do we do this? There are several ways. The simplest way to get started is to do gradient ascent on the mutual info. (Take the derivative of the expression for mutual info between V and C with respect to the Gaussian kernels' parameters, then continually adjust those parameters to increase the mutual info.) However, this direct gradient-based approach is hard to derive for mutual info because it depends on terms that are difficult to estimate; also, the resulting learning rules are (in my experience) unstable. But in general, any learning rule can be used as long as it generates a model with two properties: maximal prior entropy and minimal posterior entropy. Before seeing each data sample, the prior distribution over C should be uniform (maximum entropy/uncertainty), ensuring that each kernel is utilized equally (i.e. we're not wasting resources). After seeing each sample, the posterior distribution over C should be totally peaked on one class/kernel, representing minimal entropy/uncertainty. This is true when the Gaussian kernels are distinct, not overlapping. Thus, for each data sample received, we're reducing uncertainty as much as possible, which is equivalent to transmitting the greatest amount of information.
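One simple, off-the-shelf rule that leans toward both properties is frequency-sensitive ("conscience"-style) competitive learning: penalize kernels that win too often, so usage stays roughly uniform, and let only the winner move, so kernels stay distinct. The sketch below is just that textbook rule for illustration, not the learning rule I ended up with:

```python
import numpy as np

def conscience_step(v, centers, win_counts, lr=0.05, bias_strength=1.0):
    """Frequency-sensitive competitive learning: one online update per sample."""
    n = len(centers)
    # Bias the competition against kernels that have won more than their share,
    # pushing kernel usage toward uniform (maximal prior entropy over C).
    usage = win_counts / max(win_counts.sum(), 1)
    bias = bias_strength * (usage - 1.0 / n)
    scores = np.linalg.norm(centers - v, axis=1) + bias
    winner = int(np.argmin(scores))
    # Winner-take-all update: only the winning kernel moves toward the sample,
    # keeping kernels distinct rather than overlapping (minimal posterior entropy).
    centers[winner] += lr * (v - centers[winner])
    win_counts[winner] += 1
    return winner
```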

So I've been working on an infomax algorithm to represent real-valued data vectors optimally with limited resources. The tricky thing is that the posterior probability calculations require an accurate estimation of the probability density at each data sample. But I think I have a good solution to all these issues. The resulting algorithm appears to be great at pattern classification (< 10% error on the classic Iris data set and < 4% error on a handwritten digits set). More importantly, it should be just what I need for the core of my sensory and motor cortex systems.
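For concreteness, the posterior in question is just Bayes' rule over the Gaussian kernels, and its denominator is exactly the density that has to be estimated accurately at each sample:

```latex
P(c \mid v) \;=\; \frac{P(c)\, p(v \mid c)}{p(v)},
\qquad
p(v) \;=\; \sum_{c'} P(c')\, p(v \mid c').
```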