Two new things...

1. I've been experimenting with a layer of radial basis functions (RBFs) for the input layer to my neural nets. Before, I would just use a single real value for each input; for example, the pendulum angle was a single real value fed to a single input neuron. Now I spread each input across an array of RBFs. It seems that a lot of people use RBFs when applying neural nets to motor control. They increase the "locality" of learning: only the connection weights associated with a particular part of the state space are affected, which helps the neural net retain old knowledge better. It's also neat to think about RBFs in terms of biology (think of the arrays of sensory nerves in the skin, the retina, the vestibular system, etc.)

2. I have always had a hard time visualizing what the neural networks are doing while they learn. The combination of learning rates, eligibility traces for connections, neuronal firing rates, etc. makes them hard to follow. I had already been plotting various parameters on graphs after the simulations end, but I really wanted to watch the parameters change in real time. So I wrote some code to render the neural nets floating in 3D space beside the agent's physical body. It shows me the neuron firing rates, connection weight signs/magnitudes, and eligibility traces, making it much easier to see what's going on.
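The RBF input layer from item 1 can be sketched as a simple Gaussian encoding. This is just a minimal illustration of the idea, not my actual implementation; the function name, the number of centers, and the width are all assumptions.

```python
import numpy as np

def rbf_encode(x, centers, width):
    """Encode a scalar input as an array of Gaussian RBF activations.

    Each RBF "turns on" only near its own center, so learning updates
    mostly affect weights tied to the part of the state space currently
    being visited (the "locality" property mentioned above).
    """
    return np.exp(-((x - centers) ** 2) / (2.0 * width ** 2))

# Example: encode a pendulum angle (radians) with 8 RBFs spaced around the circle.
# (A real angle encoding would also want circular distance; omitted for brevity.)
centers = np.linspace(0.0, 2.0 * np.pi, 8, endpoint=False)
activations = rbf_encode(np.pi, centers, width=0.5)
```

Here an angle of pi lands exactly on the fifth center, so that RBF fires at 1.0 while the far-away units stay near zero.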

## Friday, May 27, 2005

## Friday, May 20, 2005

### Fixed Problems with Multilayer Neural Network

I discovered yesterday that it really helps to connect the input neurons directly to the output neurons in the multilayer neural net. Now the value function neural net learns really well; the temporal difference error almost always falls to within +/-0.001 eventually.
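The direct input-to-output connections might look something like the sketch below (layer sizes, names, and the use of a single sigmoid hidden layer are my assumptions). The direct path gives the output a purely linear component that can be learned quickly, while the hidden layer adds the nonlinear refinement on top.

```python
import numpy as np

def value(x, W_in_hid, W_hid_out, W_in_out):
    """Value estimate from a net with one sigmoid hidden layer plus
    direct input-to-output connections (the extra W_in_out term)."""
    h = 1.0 / (1.0 + np.exp(-(W_in_hid @ x)))   # hidden layer (sigmoid)
    return (W_hid_out @ h + W_in_out @ x).item()  # hidden path + direct linear path

# Tiny example: 3 inputs, 4 hidden neurons, 1 output.
rng = np.random.default_rng(0)
v = value(rng.standard_normal(3),
          rng.standard_normal((4, 3)),
          rng.standard_normal((1, 4)),
          rng.standard_normal((1, 3)))
```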

Now I'm having trouble with the policy/actor neural net. It doesn't seem to improve its performance very much. It'll learn to swing the pendulum a little higher over time, but it never gets high enough to swing straight up. It definitely has enough torque, so that's not the problem. I wonder if it needs more random exploration. I'm currently using a low-pass filtered noise channel that changes slowly over time to encourage long random actions. I'll keep trying stuff...
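The low-pass filtered noise channel could be sketched like this (a first-order filter driven by Gaussian white noise; the time constant, step size, and class name are my assumptions, not the actual values I'm using):

```python
import random

class FilteredNoise:
    """Low-pass filtered exploration noise: white noise smoothed so the
    perturbation drifts slowly, encouraging longer-lasting random actions
    instead of rapid jitter."""

    def __init__(self, tau=0.5, dt=0.01, scale=1.0, seed=None):
        self.alpha = dt / tau        # smoothing factor per timestep
        self.scale = scale           # std. dev. of the driving white noise
        self.value = 0.0
        self.rng = random.Random(seed)

    def step(self):
        # Move a small fraction of the way toward a fresh white-noise sample.
        target = self.rng.gauss(0.0, self.scale)
        self.value += self.alpha * (target - self.value)
        return self.value

noise = FilteredNoise(seed=0)
samples = [noise.step() for _ in range(200)]
```

With a small `alpha`, consecutive samples differ only slightly, so the exploration torque applied to the pendulum changes slowly over time.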

## Thursday, May 19, 2005

### Problems Learning the Value Function with a Multilayer Neural Network

I've noticed that my agents can learn the value function really well using a linear function approximator - a neural network with a single output layer using linear activation functions. However, when I add a hidden layer of neurons, the agent cannot learn the value function very closely (i.e. the temporal difference error fluctuates between -0.1 and +0.1; the max possible error is +/-1.0). I usually use sigmoid activation functions for the hidden neurons, so I tried linear functions for them as well, but it didn't help. So I'm trying to figure out why it can't converge to something closer to the real value function using a hidden and output layer of neurons. I know that there are no good convergence guarantees for nonlinear function approximators using temporal difference, but a multilayer network with linear activation functions in every neuron still represents a linear function - the layers compose into a single linear map - even though the learning dynamics differ, since the gradient updates now involve products of weights. So maybe something else is wrong.
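The intuition about all-linear layers can be checked directly: two stacked linear layers (no biases, for simplicity) compute exactly the same function as one linear layer whose weight matrix is their product, so the representable function class doesn't change.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((5, 3))   # input -> hidden (linear activations)
W2 = rng.standard_normal((1, 5))   # hidden -> output (linear activations)

x = rng.standard_normal(3)
two_layer = W2 @ (W1 @ x)          # forward pass through both layers
one_layer = (W2 @ W1) @ x          # the equivalent single linear map

assert np.allclose(two_layer, one_layer)
```

So any convergence trouble with the all-linear multilayer net would have to come from the learning dynamics (the TD updates are no longer linear in the parameters, because the weight matrices multiply), not from what the network can represent.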

On the other hand, maybe I won't really need a nonlinear function approximator. It seems like a lot of researchers do pretty well with linear only, but then they only attempt fairly simple control problems.

I've learned a good way to represent the state for the pendulum swing up task. I was inputting the sine and cosine of the pendulum's angle (plus the angular velocity), similar to what Remi Coulom did in his thesis, but I had trouble learning the value function with this method, even with a single layer neural network. Instead I tried representing the angle as two inputs: one "turns on" when the pendulum is between 0 and 180 degrees (the input value ranging from 0 to 1) while the other is "off" (i.e. value of 0). When the pendulum is between 180 and 360 degrees, the first input is off and the other is on. This seemed to work really well - the temporal difference error usually falls to around +/-0.001 within a few minutes of real-time training.
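The two-input angle representation could look something like this sketch (I'm assuming a linear ramp from 0 to 1 across each half-circle; the function name and exact ramp shape are my own illustration):

```python
def encode_angle(theta):
    """Split a pendulum angle (degrees) into two half-circle inputs.

    Input 1 ramps 0 -> 1 as theta goes 0 -> 180 degrees (input 2 is off);
    input 2 ramps 0 -> 1 as theta goes 180 -> 360 degrees (input 1 is off).
    """
    theta = theta % 360.0
    if theta < 180.0:
        return (theta / 180.0, 0.0)
    return (0.0, (theta - 180.0) / 180.0)

# Example: 90 degrees activates only the first input, at half strength.
lower, upper = encode_angle(90.0)
```

Because each input is only active over half the circle, weight updates for one half of the state space don't disturb what was learned for the other half, similar to the locality argument for RBFs.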

## Thursday, May 12, 2005

### More Time for Research

Summer's here, so I finally have time for research again. I'm going to (try to) graduate with a master's degree this summer, so I'll be working on this project a lot over the next few months.

I didn't find any definitive answer concerning time representation in Daw's dissertation, but it was still a great (long) read. I'm still using it as a reference. I implemented some of his ideas about average reward rates (as opposed to discounted rewards, the standard method in temporal difference learning) and opponent dopamine/serotonin channels. The dopamine channel represents phasic (short-term) rewards and tonic (long-term) punishment, and the serotonin channel represents phasic punishment and tonic rewards. Eventually I'd like to do some experiments to see if this model more closely mimics animal behavior.
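One common form of the average-reward idea replaces discounting with an estimate of the long-run reward rate that gets subtracted from each reward. This is only a sketch of that general formulation (variable names and the learning rate are my assumptions, not Daw's exact equations or my implementation):

```python
def avg_reward_td_error(r, v, v_next, r_bar):
    """TD error in the average-reward formulation: instead of discounting
    future values, subtract the estimated average reward rate r_bar."""
    return r - r_bar + v_next - v

def update_avg_reward(r_bar, delta, eta=0.01):
    """Nudge the average-reward estimate using the TD error itself."""
    return r_bar + eta * delta

# Example step: reward 1.0 arrives while the average rate estimate is 0.5.
delta = avg_reward_td_error(r=1.0, v=0.2, v_next=0.3, r_bar=0.5)
r_bar = update_avg_reward(0.5, delta)
```

In the opponent-channel reading, the phasic TD error and the slowly varying `r_bar` term play the short-term and long-term (tonic) roles, respectively.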

I've pretty much decided what I'm going to cover in my thesis. The two main topics are 1) temporal difference learning for motor control in continuous time and space, and 2) artificial neural networks for function approximation. My focus is on biologically realistic algorithms, so I'll spend some time talking about how my implementation relates to the brain. I'd like to include at least three experiments with solid results. I'm thinking maybe the pendulum swing up task (a pendulum hanging in midair has to swing itself upright and stay there using a limited amount of torque), the cart-pole task (a cart resting on a plane must force itself back and forth to keep an attached pole balanced), and maybe a legged creature that learns to walk.

Another possible addition is the use of a learned model of the environment's dynamics, also using an artificial neural network. We'll see if I have time for that. I'd really like to try it, though, because others (e.g. Doya in his 2000 paper on continuous reinforcement learning) have gotten better results in motor control tasks using a learned dynamics model.

So far I've been working on the pendulum task. I have my value function (critic) working pretty well, but I can't get the policy (actor) to learn very well. I'm thinking my problem is either with my exploration method or how I'm representing the state as inputs to the neural nets.
