Summer's here, so I finally have time for research again. I'm going to (try to) graduate with a master's degree this summer, so I'll be working on this project a lot over the next few months.
I didn't find any definitive answer concerning time representation in Daw's dissertation, but it was still a great (long) read. I'm still using it as a reference. I implemented some of his ideas about average reward rates (as opposed to discounted rewards, the standard method in temporal difference learning) and opponent dopamine/serotonin channels. The dopamine channel represents phasic (short-term) rewards and tonic (long-term) punishment, and the serotonin channel represents phasic punishment and tonic rewards. Eventually I'd like to do some experiments to see if this model more closely mimics animal behavior.
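To make the average-reward idea concrete, here is a minimal sketch of how the two TD errors differ. It is not the code from this project: the function names, the discount factor, and the averaging rate are all illustrative assumptions, and the opponent dopamine/serotonin channels are not modeled here.

```python
# Minimal sketch (hypothetical names and constants, not the project's code).

def discounted_td_error(reward, v_s, v_next, gamma=0.95):
    """Standard TD error: future value is shrunk by a discount factor gamma."""
    return reward + gamma * v_next - v_s


def average_reward_td_error(reward, v_s, v_next, avg_reward):
    """Average-reward TD error: no discounting; instead the long-run
    average reward is subtracted from each immediate reward."""
    return reward - avg_reward + v_next - v_s


def update_average_reward(avg_reward, reward, rate=0.01):
    """Slowly track the long-run average reward with a running average.
    The rate of 0.01 is an arbitrary illustrative choice."""
    return avg_reward + rate * (reward - avg_reward)
```

In the average-reward form, a reward only produces a positive error if it beats what the agent has been getting on average, which is what makes a slowly changing (tonic) signal a natural place to carry that average.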
I've pretty much decided what I'm going to cover in my thesis. The two main topics are 1) temporal difference learning for motor control in continuous time and space, and 2) artificial neural networks for function approximation. My focus is on biologically realistic algorithms, so I'll spend some time talking about how my implementation relates to the brain. I'd like to include at least three experiments with solid results. I'm thinking maybe the pendulum swing-up task (a pendulum hanging in midair has to swing itself upright and stay there using a limited amount of torque), the cart-pole task (a cart with an attached pole, resting on a plane, has to move back and forth to keep the pole balanced), and maybe a legged creature that learns to walk.
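For readers who haven't seen the pendulum swing-up task, here is a rough sketch of one simulation step. Everything specific in it is an assumption made for illustration: the parameter values, the simple Euler integration, the convention that an angle of 0 means upright, and the cosine reward. The real experiments may use different values.

```python
import math

# Rough sketch of the swing-up task (all constants are illustrative assumptions).

def pendulum_step(theta, omega, torque, dt=0.02, max_torque=5.0,
                  mass=1.0, length=1.0, gravity=9.8, friction=0.01):
    """Advance the pendulum one Euler step.

    theta is the angle in radians (0 = upright, +/-pi = hanging down),
    omega is the angular velocity, and torque is the requested control.
    """
    # The controller only has a limited amount of torque available.
    u = max(-max_torque, min(max_torque, torque))
    # Angular acceleration from friction, gravity, and the applied torque.
    alpha = (-friction * omega + mass * gravity * length * math.sin(theta) + u) \
        / (mass * length ** 2)
    omega = omega + alpha * dt
    theta = theta + omega * dt
    # Wrap the angle into [-pi, pi).
    theta = (theta + math.pi) % (2 * math.pi) - math.pi
    return theta, omega


def reward(theta):
    """Assumed reward: highest (1) when upright, lowest (-1) when hanging down."""
    return math.cos(theta)
```

With these assumed numbers the maximum torque (5) is smaller than the gravity torque on a horizontal pendulum (9.8), so the controller can't just push the pendulum straight up. It has to swing back and forth to build momentum, which is what makes the task interesting.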
Another possible addition is the use of a learned model of the environment's dynamics, also using an artificial neural network. We'll see if I have time for that. I'd really like to try it, though, because others (e.g. Doya in his 2000 paper on continuous reinforcement learning) have gotten better results in motor control tasks using a learned dynamics model.
So far I've been working on the pendulum task. I have my value function (critic) working pretty well, but I can't get the policy (actor) to learn very well. I'm thinking my problem is either my exploration method or how I'm representing the state as inputs to the neural nets.
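On the state representation question, one common option (not necessarily what this project uses) is to feed the networks the cosine and sine of the angle instead of the raw angle, so that the inputs don't jump when the angle wraps around. A minimal sketch, where the function name and the velocity scale are hypothetical:

```python
import math

# Sketch of one possible input encoding (an assumption, not the project's code).

def encode_state(theta, omega, max_omega=10.0):
    """Encode a pendulum state as three network inputs in [-1, 1].

    The angle becomes (cos, sin), so -pi and +pi map to nearly the same
    inputs. The velocity is scaled by an assumed maximum and clipped.
    """
    scaled_omega = max(-1.0, min(1.0, omega / max_omega))
    return [math.cos(theta), math.sin(theta), scaled_omega]
```

With a raw angle input, the states just on either side of the wrap-around point look maximally different to the network even though they are physically almost identical. The encoding above removes that discontinuity.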