I've noticed that my agents can learn the value function really well using a linear function approximator - a neural network with a single output layer using linear activation functions. However, when I add a hidden layer of neurons, the agent cannot learn the value function nearly as closely (i.e. the temporal difference error fluctuates between -0.1 and +0.1; the max possible error is +/-1.0). I usually use sigmoid activation functions for the hidden neurons, so I tried linear activation functions for them as well, but it didn't help. So I'm trying to figure out why the network can't converge to something closer to the real value function when it has both a hidden and an output layer. I know that there are no good convergence guarantees for nonlinear function approximators with temporal difference learning, but I thought that a network using linear activation functions in all neurons, even a multilayer one, would still be a linear function approximator. So maybe something else is wrong.
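One thing that may be worth checking: a network whose neurons are all linear does compute a linear function of its inputs, but its output is not linear in the weights, because the hidden and output weights multiply each other. As far as I know, the convergence results for temporal difference learning with linear function approximation assume the estimate is linear in the parameters being learned, so they may not carry over to the multilayer case. Here is a rough sketch in plain Python. The layer sizes, weights, and helper names are made up for illustration, and biases are left out.

```python
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrix A (m x k) by matrix B (k x n)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

random.seed(0)
n_in, n_hidden = 3, 4  # hypothetical sizes, not taken from the real agent

# Hidden-layer and output-layer weights (no biases, for simplicity).
W1 = [[random.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)]
W2 = [[random.uniform(-1, 1) for _ in range(n_hidden)]]

x = [0.5, -0.2, 0.8]  # an arbitrary state vector

# Two-layer network with linear activations in both layers.
v_two_layer = matvec(W2, matvec(W1, x))[0]

# The same function written as a single linear layer with W = W2 * W1.
W = matmul(W2, W1)
v_one_layer = matvec(W, x)[0]

# Doubling every weight in both layers multiplies the output by four,
# so the output is not a linear function of the weights.
W1_doubled = [[2 * w for w in row] for row in W1]
W2_doubled = [[2 * w for w in row] for row in W2]
v_doubled = matvec(W2_doubled, matvec(W1_doubled, x))[0]

print(v_two_layer, v_one_layer, v_doubled)
```

The first two printed values agree up to rounding, which shows the two-layer network represents nothing a single layer couldn't. The third is four times the first, which is the part that breaks linearity in the parameters.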
On the other hand, maybe I won't really need a nonlinear function approximator. A lot of researchers seem to do pretty well with linear function approximators alone, but then they only attempt fairly simple control problems.
I've found a good way to represent the state for the pendulum swing-up task. I was inputting the sine and cosine of the pendulum's angle (plus the angular velocity), similar to what Remi Coulom did in his thesis, but I had trouble learning the value function with this representation, even with a single-layer neural network. Instead I tried representing the angle as two inputs: when the pendulum is between 0 and 180 degrees, the first input "turns on" (its value ranging from 0 to 1) while the second is "off" (i.e. value of 0). When the pendulum is between 180 and 360 degrees, the first input is off and the second is on. This seemed to work really well - the temporal difference error usually falls to around +/-0.001 within a few minutes of real-time training.
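The two angle encodings can be sketched roughly like this in Python. The function names are made up. I'm also assuming the "on" input grows linearly from 0 to 1 across its half of the circle, which is only one possible reading of the description above.

```python
import math

def encode_angle_sin_cos(angle_deg):
    """Sine/cosine encoding: two inputs, continuous across the 0/360 wrap."""
    a = math.radians(angle_deg)
    return (math.sin(a), math.cos(a))

def encode_angle_two_halves(angle_deg):
    """Split encoding: one input per half of the circle.

    Assumption: the active input rises linearly from 0 to 1 over its
    180-degree half, and the inactive input stays at 0.
    """
    angle = angle_deg % 360.0  # wrap into [0, 360)
    if angle < 180.0:
        return (angle / 180.0, 0.0)
    return (0.0, (angle - 180.0) / 180.0)

print(encode_angle_two_halves(90.0))   # (0.5, 0.0)
print(encode_angle_two_halves(270.0))  # (0.0, 0.5)
```

Under this reading, the first input drops from nearly 1 to 0 as the angle crosses 180 degrees, and the second does the same at the 360/0 wrap, so the split encoding is discontinuous at those two points where the sine/cosine encoding is not. The angular velocity would still be passed in as a separate input either way.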