Progress has been pretty steady, but slower than I'd like. At least I'm getting close to finishing the pendulum swing-up task. Then I can move on to the cart-pole/inverted pendulum task.
My current problem (and probably the last main problem before the pendulum task is finished) is how motor output signals are represented. Here are two possibilities:
1. Use one output neuron for each motor output signal. The neuron's firing rate encodes the motor signal to be used (e.g. a torque value). This allows continuous signals for motor control, a desirable feature. The problem is that any reinforcement (via the temporal difference error) pushes all of the output firing rates in the same direction (i.e. they all increase or they all decrease), when we might actually want some to increase and some to decrease. It might be possible to use some kind of differential Hebbian update in conjunction with the temporal difference error (i.e. weight changes are proportional to TD error * pre-synaptic firing rate * change in post-synaptic firing rate; see the first sketch after this list), though I haven't read about anyone else doing this.
2. Use a "winner-take-all" scheme to select among a finite set of action choices. Say we have three actions for the pendulum task: apply a constant clockwise torque, apply a constant counter-clockwise torque, or do nothing (letting the pendulum swing freely). Using three output neurons, we choose the one with the highest firing rate as the "winner," and the winning neuron's associated action is applied. An alternative is to have the firing rates represent each action's probability of being chosen as the winner (see the second sketch after the list). The problem here is that only a finite number of actions is possible. This method currently gives me better results on the pendulum swing-up task (it learns to swing the pendulum up and hold it for a second or two), but it can't keep the pendulum up indefinitely.
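To make the differential Hebbian idea in possibility 1 concrete, here is a minimal sketch of the update rule as I imagine it. Everything here is an assumption rather than tested code from my agent: the sigmoid rate function, the learning rate value, and all the names are mine.

```python
import numpy as np

# Illustrative sizes: n_in presynaptic (state-coding) neurons driving
# n_out motor output neurons.
n_in, n_out = 8, 2
rng = np.random.default_rng(0)
w = rng.normal(scale=0.1, size=(n_out, n_in))  # synaptic weights
alpha = 0.01                                   # learning rate (assumed value)

def output_rates(x):
    """Post-synaptic firing rates: a sigmoid of the weighted input
    (the sigmoid is an assumption; any smooth rate function would do)."""
    return 1.0 / (1.0 + np.exp(-(w @ x)))

def differential_hebbian_td_update(x, prev_post, td_error):
    """One update step: each weight change is proportional to
    TD error * pre-synaptic rate * change in post-synaptic rate,
    i.e. dw[i, j] = alpha * td_error * x[j] * (post[i] - prev_post[i])."""
    global w
    post = output_rates(x)
    w += alpha * td_error * np.outer(post - prev_post, x)
    return post  # the caller keeps this as prev_post for the next step
```

Because the update is gated by the *change* in each output's rate, outputs whose rates moved in different directions receive oppositely signed weight changes from the same TD error, which is exactly the property the plain rate-coded scheme lacks.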
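And here is a sketch of the winner-take-all scheme from possibility 2, covering both the deterministic variant (highest rate wins) and the probabilistic one. Normalizing the rates into probabilities is just one assumed way to do the stochastic version.

```python
import numpy as np

rng = np.random.default_rng(0)

# The three pendulum actions described above.
ACTIONS = ["clockwise_torque", "counterclockwise_torque", "do_nothing"]

def select_action(firing_rates, stochastic=False):
    """Winner-take-all action selection.

    Deterministic: the output neuron with the highest firing rate wins.
    Stochastic: rates are normalized into probabilities of being chosen
    as the winner (assumes non-negative rates, not all zero).
    """
    r = np.asarray(firing_rates, dtype=float)
    if stochastic:
        idx = rng.choice(len(ACTIONS), p=r / r.sum())
    else:
        idx = int(np.argmax(r))
    return ACTIONS[idx]

# Example: rates (0.2, 0.7, 0.1) -> "counterclockwise_torque" wins.
print(select_action([0.2, 0.7, 0.1]))
```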
It might be possible (and probably is biologically plausible) to combine the two approaches: fine motor control is learned by one mechanism and encapsulated in a motor program, and another mechanism learns to switch among a finite set of motor programs. It would be nice to have an agent automatically construct such motor programs, possibly even building hierarchies of them. Richard Sutton has done some work recently with what he calls "options," which are basically control policies with their own initiation and termination conditions. I think these might be a pretty good theoretical foundation for building hierarchies of motor programs; a minimal sketch of the idea follows.
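For concreteness, here is one way an option could be represented in code. The field names (`can_start`, `policy`, `should_stop`) and the `step` environment function are my own illustrative choices, not Sutton's notation.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Option:
    """A control policy with its own initiation and termination conditions."""
    can_start: Callable[[Any], bool]    # initiation set: where the option may begin
    policy: Callable[[Any], Any]        # the motor program: state -> action
    should_stop: Callable[[Any], bool]  # termination condition

def run_option(option: Option, state: Any,
               step: Callable[[Any, Any], Any]) -> Any:
    """Run an option until its termination condition fires. `step` is an
    assumed environment function mapping (state, action) -> next state."""
    assert option.can_start(state), "option invoked outside its initiation set"
    while not option.should_stop(state):
        state = step(state, option.policy(state))
    return state
```

A higher-level mechanism could then treat each option as a single discrete choice and reuse the same winner-take-all machinery to switch among motor programs, while each program's fine control is learned separately.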