Recently I was looking for a way to train a simple linear neural network using the
delta rule in such a way that I could specify exactly how much error is reduced on each weight update. The usual delta learning rule for a single linear neuron is:
delta_w_i = learning_rate * error * x_i
In my opinion, the learning rate parameter lacks an intuitive interpretation. Ideally it would represent exactly how much error is reduced per update, but it doesn't. It is usually just set to some small constant, like 0.01, which is hopefully not too small (slow convergence) or too large (instability). My goal was to find a modified learning rule with a learning rate-like parameter which could reduce the error by a specific amount on each weight update (e.g., a value of 0.2 would reduce the error by 20% per update).
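To make the standard rule concrete, here's a minimal Python sketch of a single linear neuron trained this way (the helper name delta_rule_step, the training pair, and the fixed learning rate of 0.01 are all my own illustrative choices):

```python
# Standard delta rule for a single linear neuron.
# The training pair and the fixed learning rate are arbitrary.

def delta_rule_step(w, x, target, learning_rate=0.01):
    y = sum(wi * xi for wi, xi in zip(w, x))  # linear output: y = w . x
    error = target - y
    # delta_w_i = learning_rate * error * x_i
    w = [wi + learning_rate * error * xi for wi, xi in zip(w, x)]
    return w, error

w = [0.0, 0.0]
for step in range(5):
    w, error = delta_rule_step(w, x=[1.0, 2.0], target=3.0)
    print(step, error)  # the error shrinks, but how fast depends on x
```

Notice that the per-update error reduction depends on the input magnitudes, not just on the learning rate, which is exactly the opacity described above.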
Temporarily assume a network with only one input/weight. Also assume that the goal is to reduce the error by 100% within a single step. (These assumptions will be relaxed later.) The output function is:
y = w * x
The error is:
error = target - y
In order to achieve zero error in one step, we need the following to be true:
y = target = (w + delta_w) * x
So we need to solve for the learning rate which yields zero error after one step:
delta_w = target / x - w
learning_rate * error * x = target / x - w
learning_rate = (target / x - w) / (error * x)
= [(target / x - w) / (error * x)] * (x / x)
= (target - w * x) / (error * x^2)
= (target - w * x) / ((target - w * x) * x^2)    (substituting error = target - w * x)
= 1 / (x^2)
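As a quick sanity check (a throwaway numeric example, not part of the original derivation), arbitrary values for w, x, and target confirm that this learning rate removes all of the error in a single update:

```python
# Verify that learning_rate = 1 / x^2 zeroes the error in one step
# for a single-input linear neuron. The values are arbitrary.
w, x, target = 0.5, 3.0, 2.0
error = target - w * x            # error before the update: 0.5
learning_rate = 1.0 / (x * x)
w += learning_rate * error * x    # delta rule update
print(target - w * x)             # 0.0: the error is gone in one step
```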
So a learning rate of 1 / (x * x) will reduce the error in the system to zero after one step. We can easily make this work for multiple inputs/weights by simply dividing by the size of the input/weight space n:
learning_rate = 1 / (n * x^2)
Of course, we don't usually want to reduce the error completely in one step, so we replace the 1 with an "error reduction rate." This quantity, a value within [0, 1], determines the fraction of the error removed on each weight update. Unlike the traditional learning rate, it has meaningful units: each 0.01 corresponds to a 1% error reduction per weight update. For example, an error reduction rate of 0.4 will reduce the error by 40% on each update relative to the error on the previous step; it is equivalent to 1 - error(t+1) / error(t). Now the desired learning rate becomes the following:
learning_rate = error_reduction_rate / (n * x^2)
Now we can modify the delta learning rule by replacing the traditional learning rate:
delta_w_i = error_reduction_rate * error * x_i / (n * x_i * x_i)
= error_reduction_rate * error / (n * x_i)
One problem remains: the rule divides by x_i, which is undefined when an input value is zero. To remedy this, we only update weights associated with active (non-zero) inputs. Also, instead of dividing by n, we divide by the number of active inputs m:
only for non-zero x_i:
delta_w_i = error_reduction_rate * error / (m * x_i)
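In code, this intermediate version of the rule might look like the sketch below (my own Python illustration, not code from the post; the update at the end of the post revises this formula):

```python
# Modified delta rule: update only weights with active (non-zero) inputs,
# dividing by m, the number of active inputs. Illustrative sketch only.

def modified_delta_rule_step(w, x, target, error_reduction_rate):
    y = sum(wi * xi for wi, xi in zip(w, x))
    error = target - y
    m = sum(1 for xi in x if xi != 0.0)  # count of active inputs
    if m == 0:
        return w  # all inputs are zero, so no weights to update
    # delta_w_i = error_reduction_rate * error / (m * x_i), active inputs only
    return [wi + error_reduction_rate * error / (m * xi) if xi != 0.0 else wi
            for wi, xi in zip(w, x)]
```

One visible weakness of this form: because it divides by x_i, the delta for an active input grows without bound as that input approaches zero.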
This closely resembles the original delta rule, with three major differences:
- The learning rate has been replaced with the error reduction rate, a quantity with a much more meaningful interpretation.
- The weight deltas are divided by m, the number of active inputs. This splits the error correction evenly among the weights that contribute to the error.
- The weight deltas are divided by their associated input (x_i) values. This is an interesting deviation from the original delta rule which instead multiplies them by their input values.
UPDATE (5/23/07): It turns out that there's a problem with this modified delta rule. My assumptions concerning how the single input/weight case generalizes to multiple inputs/weights were incorrect. Fortunately, only a small change is needed. Instead of dividing by the squared input value, we need to divide by the squared length of the input vector, which normalizes the weight deltas by the energy of the input vector:
delta_w_i = error_reduction_rate * error * x_i / length(x)^2
length(x) here is the Euclidean norm of the input vector. If length(x) is zero, we simply skip the update, which is valid because all x_i values are zero and would produce weight deltas of zero anyway. We are still able to specify the error reduction rate in an intuitive way, which was the motivation behind my search for a better delta rule.
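Here is a minimal sketch of the corrected rule (again my own Python with arbitrary data; nlms_step is a hypothetical helper name), which also demonstrates the promised behavior: each update leaves exactly 1 - error_reduction_rate of the previous error:

```python
# Normalized-LMS-style update with an explicit error reduction rate.
# The input vector and target are arbitrary illustrations.

def nlms_step(w, x, target, error_reduction_rate):
    y = sum(wi * xi for wi, xi in zip(w, x))
    error = target - y
    energy = sum(xi * xi for xi in x)  # length(x)^2, the input energy
    if energy == 0.0:
        return w, error                # all inputs zero: skip the update
    # delta_w_i = error_reduction_rate * error * x_i / length(x)^2
    w = [wi + error_reduction_rate * error * xi / energy
         for wi, xi in zip(w, x)]
    return w, error

w = [0.0, 0.0, 0.0]
for step in range(4):
    w, error = nlms_step(w, x=[1.0, -2.0, 0.5], target=4.0,
                         error_reduction_rate=0.4)
    print(step, error)  # each printed error is 60% of the one before it
```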
Interestingly, this turns out to be equivalent to the normalized LMS rule. (LMS, or least-mean-square, is another name for the delta rule.) Normalized LMS is described in Haykin's Neural Networks book, 2nd edition, p. 152.