The Verve Project: Self-Referential Reinforcement Learning

Lately I've been researching new reinforcement learning methods. Genetic algorithms have worked well so far, but I thought it might be good to see what else is out there.

I came across the website of Juergen Schmidhuber, a researcher at the IDSIA lab in Switzerland (www.idsia.ch). A lot of his work deals with "self-referential" learning systems. Instead of giving an agent a single hard-coded learning algorithm, you start with an initial learning "germ" that can modify any part of the agent's policy, including the learning system itself. This allows an agent to learn better strategies, to learn better learning strategies, to learn how to learn better learning strategies, and so on. An important feature of such a system is that there must be a closed loop somewhere (something like the "strange loops" Hofstadter describes in Gödel, Escher, Bach). You can't simply have a meta-learning algorithm that only modifies the level below it; the system must be totally self-referential so that it can change every part of its learning strategy.

Obviously there must be some hard-coded aspects of the system. For example, if reinforcement is provided from the agent's environment, the agent shouldn't learn to misinterpret what's good and what's bad reinforcement; the agent should continually try to improve itself to meet a fixed goal.
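
As a rough illustration of the split between what is modifiable and what is hard-coded, here is a minimal Python sketch. It is not Schmidhuber's method and not the code used in this project; it swaps in a much simpler technique, a self-adaptive hill climber, in which the mutation step size is stored inside the policy and gets mutated along with the weights. The reward function, the target value of 1.0, and all names are assumptions made up for the example. The reward sits outside the policy, so the learner cannot rewrite its own goal.

```python
import math
import random

def reward(weights):
    """Hard-coded goal (hypothetical): it lives outside the modifiable policy."""
    return -sum((w - 1.0) ** 2 for w in weights)

def self_adaptive_search(n_weights=3, steps=200, seed=0):
    rng = random.Random(seed)
    # The last entry is the step size used to mutate the policy,
    # and it is itself part of what gets mutated.
    policy = [0.0] * n_weights + [0.5]
    best = reward(policy[:-1])
    for _ in range(steps):
        step = policy[-1]
        # Mutate the behavioural weights using the current step size.
        candidate = [w + rng.gauss(0.0, step) for w in policy[:-1]]
        # Mutate the step size too (log-normally, so it stays positive).
        candidate.append(step * math.exp(rng.gauss(0.0, 0.3)))
        score = reward(candidate[:-1])
        # The fixed reward alone decides whether a change is kept.
        if score >= best:
            policy, best = candidate, score
    return policy, best
```

This is only one level of "learning how to learn": the acceptance rule itself never changes, so it falls short of the closed loop described above.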

My main concern is the problem of getting stuck in local optima. GAs can search through different chunks of the solution space in parallel, but this new method might not allow that. How do humans search through a solution space? We don't have multiple bodies that can try a ton of possible solutions in parallel, though something like this probably occurs in our minds. We create lots of hypotheses and test them against our mental model of the world, then test the best hypothesis against the real world. Predicting outcomes and measuring the actual outcomes against our predictions probably comes into play somewhere.
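
The hypothesize-then-test loop can be sketched in a few lines of Python. Everything here is an assumption for illustration: the one-dimensional "world", the quadratic reward, and the names `real_world`, `mental_model`, and `plan_and_test` are hypothetical. Many candidate actions are scored cheaply against an internal model, only the best one is tried for real, and the gap between predicted and actual outcome is reported.

```python
import random

def real_world(x):
    """Stand-in environment: reward peaks at x = 3."""
    return -(x - 3.0) ** 2

def mental_model(x, believed_peak):
    """The agent's (possibly wrong) prediction of the reward."""
    return -(x - believed_peak) ** 2

def plan_and_test(believed_peak, n_hypotheses=20, rng=None):
    rng = rng or random.Random(0)
    # Generate many hypotheses and score them against the mental model only.
    hypotheses = [rng.uniform(-10.0, 10.0) for _ in range(n_hypotheses)]
    best = max(hypotheses, key=lambda x: mental_model(x, believed_peak))
    predicted = mental_model(best, believed_peak)
    # Only the single best hypothesis is tested against the real world.
    actual = real_world(best)
    # The prediction error is the signal that could be used to fix the model.
    return best, predicted, actual, actual - predicted
```

When `believed_peak` matches the real peak the prediction error is zero; when it does not, the error shows how far off the mental model was for the action that was actually taken.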

I've implemented two such self-referential learning systems so far, both modifying a character's neural net. One is sort of a dynamic programming approach; the other is entirely neural-net-based. The first didn't work so well, probably because it had to learn a sequential program to adjust a parallel architecture (the character's neural net). The second is a regular neural net with special output nodes that can address, read, and modify any parameter in the net. I'm still experimenting with this. One problem I'm having is that the network usually reaches a stable attractor and just stops changing: no character movement and no changes due to learning. I might add some probabilistic features to keep this from happening. For example, maybe there should always be a non-zero probability that the learning system modifies something at any given time.
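
To make the second design concrete, here is a minimal Python sketch of a net with special outputs that address and modify its own parameters. It is not the actual implementation: the single tanh layer, the three special outputs (address, value, write gate), the 0.1 blending rate, and the `epsilon` floor are all assumptions chosen for the example.

```python
import math
import random

class SelfModifyingNet:
    """Single-layer tanh net whose last three outputs act on its own weights."""

    def __init__(self, n_inputs, n_actions, epsilon=0.05, seed=0):
        self.rng = random.Random(seed)
        self.n_inputs = n_inputs
        self.n_actions = n_actions
        self.n_outputs = n_actions + 3  # actions + address, value, write gate
        self.epsilon = epsilon          # chance of a forced change when the gate is closed
        # Flat parameter list, so every weight has an integer address.
        self.params = [self.rng.uniform(-1.0, 1.0)
                       for _ in range(self.n_outputs * n_inputs)]

    def forward(self, inputs):
        outputs = []
        for o in range(self.n_outputs):
            row = self.params[o * self.n_inputs:(o + 1) * self.n_inputs]
            outputs.append(math.tanh(sum(w * x for w, x in zip(row, inputs))))
        return outputs

    def step(self, inputs):
        """Run the net once, apply any self-modification, return the actions."""
        out = self.forward(inputs)
        actions = out[:self.n_actions]
        addr_out, value_out, gate_out = out[self.n_actions:]
        # Map the address output from [-1, 1] onto a valid parameter index.
        address = min(int((addr_out + 1.0) / 2.0 * len(self.params)),
                      len(self.params) - 1)
        modified = None
        if gate_out > 0.0:
            # Read the addressed parameter and nudge it toward value_out.
            current = self.params[address]
            self.params[address] = current + 0.1 * (value_out - current)
            modified = address
        elif self.rng.random() < self.epsilon:
            # Forced exploration: perturb a random parameter.
            address = self.rng.randrange(len(self.params))
            self.params[address] += self.rng.gauss(0.0, 0.1)
            modified = address
        return actions, modified
```

With `epsilon=0.0` and an all-zero input, every output is zero, the write gate stays closed, and nothing ever changes, which is a toy version of the stuck-attractor problem. Any `epsilon` above zero gives each step some chance of changing a parameter even when the gate is closed.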

Saturday, August 28, 2004
