Ahh, Part IV. It’s been long overdue, mostly because I’ve been changing directions with my startup. What I was doing wasn’t working, so I decided to drop it and head somewhere new. Lately, I’ve been messing around with mobile platforms as well as zoomable interfaces. I’ll talk more about that another time! But you came for neural networks. Last time, in part III, we connected the perceptrons to each other. This time, we’re going to look at how the network actually learns.
The ways of learning
There are many types of neural networks, but the one we’re building is a classic feed-forward neural network. Each perceptron in it is a linear classifier: it draws a hyperplane that separates different classes in multi-dimensional space. The network learns by adjusting those hyperplanes to minimize classification error, according to what it has seen before. The way to adjust a hyperplane is to change the values of the weights in the neural network. But how much should you adjust them?
The classic way is to use back propagation, which we’ll explore here. People since then have used other methods to calculate the weights, such as genetic algorithms and particle swarm optimization. You can basically use any type of optimization algorithm to adjust the weights.
Carrying the Error Backwards
Figuring out the error at the output node is easy. You simply subtract the output from what the output was supposed to be, and that’s your error (not exactly, but that’s the idea). The problem was, how do you adjust the weights of the hidden layers when you can’t directly see what their outputs should have been? Even if you could, how would you know which way to adjust a weight, since it would affect other nodes?
The basic idea of back propagation is to get the output of the network and compare its decision with the decision it should have made, and more importantly, how far off it was. That difference is the error. We’ll take that error and propagate it backwards towards the input so we will know how to adjust the weights, layer by layer.
I’m not going to go too much into the hows and whys of back propagation, since I feel like there are a lot of tutorials out there that do it justice. And I won’t go into the proof either. It’s mainly just multi-dimensional calculus. It’s not too hard to follow, actually. It’s really just a matter of keeping the variables straight, since there are so many. I’ll skip all that. But I will show and explain the result, since it makes understanding the code a lot easier.
I’m going to assume that most of my audience are programmers that didn’t much like math. If they did, they probably wouldn’t be reading this, and would have read the proof themselves from a textbook. Therefore, I’ll explain some math things that I otherwise would not. Math people, bear with me…or correct me.
Starting from the back of the bus
Calculating the change in weights for the output node isn’t too bad. Using my “awesome” GIMP skillz…it looks like this:
We’ll start from the back. I color coded it to make it easier to figure out what the equations are saying. (If a variable is bolded, that means it’s a vector.) The error of the output for a training input is:
(1) $J(\mathbf{w}) = \frac{1}{2} \sum_{k=1}^{c} (t_k - z_k)^2 = \frac{1}{2} \lVert \mathbf{t} - \mathbf{z} \rVert^2$
where t is what the output should have been, and z is what we actually got from the neural network. J(w) is basically a sum of all the errors across all output nodes. You’d want a square of the differences because you want to make all differences positive before you sum them, so the errors don’t cancel each other out. The double lines stand for norm. You can think of norm as “length of vector”. Norm is just a convenient way to write it.
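Here’s a rough sketch of equation (1) in Python (just for illustration, this isn’t the Erlang code from the series). The function names and the numbers are made up. It shows the sum form and the norm form giving the same value, and why you square: the raw differences below are +0.5 and -0.5, which would cancel to zero if you just added them.

```python
import math

def half_squared_error(t, z):
    """J = 1/2 * sum over k of (t_k - z_k)^2."""
    if len(t) != len(z):
        raise ValueError("target and output must have the same length")
    return 0.5 * sum((tk - zk) ** 2 for tk, zk in zip(t, z))

def norm(v):
    """Euclidean norm: the 'length' of the vector v."""
    return math.sqrt(sum(x * x for x in v))

t = [1.0, 0.0]   # made-up targets
z = [0.5, 0.5]   # made-up network outputs
diff = [tk - zk for tk, zk in zip(t, z)]

print(sum(diff))                  # raw differences cancel out
print(half_squared_error(t, z))   # sum form
print(0.5 * norm(diff) ** 2)      # norm form, same value up to rounding
```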
If you wanted to derive back propagation, you’d take the derivative of J(w) with respect to w, and use it to move w in the direction that makes J smaller. Remember what I said about going in the direction of steepest change in error? Well, to calculate change, you calculate the derivative (since derivative means change), and that’s why you’d do it in the proof. If you want to follow the proof, check out pages 290-293 of Pattern Classification by Duda, Hart, and Stork.
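To make “derivative means change” a bit more concrete, here’s a toy sketch in Python. It assumes the simplest case I could think of: one weight, one input, and no sigmoid, so J(w) = ½(t – w·x)². All the names and numbers are invented for the example. Stepping against the derivative over and over walks w downhill until the error is basically zero.

```python
def error(w, x, t):
    """J(w) = 1/2 * (t - w*x)^2 for a single weight and no sigmoid."""
    return 0.5 * (t - w * x) ** 2

def d_error_dw(w, x, t):
    """Derivative of J with respect to w: -(t - w*x) * x."""
    return -(t - w * x) * x

def descend(w, x, t, eta=0.1, steps=50):
    """Repeatedly step against the derivative, i.e. downhill on J."""
    for _ in range(steps):
        w = w - eta * d_error_dw(w, x, t)
    return w

w_final = descend(w=0.0, x=2.0, t=1.0)
print(w_final, error(w_final, 2.0, 1.0))   # w ends up very close to t / x
```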
So skipping all the proof, you’d get two equations. One for calculating the adjustment of weights in the output layer (red layer), and the other for calculating the adjustment in weights of all other layers before that (yellow and green layers).
(2) $\Delta w_{kj} = \eta \, (t_k - z_k) \, f'(net_k) \, y_j$
This is the equation to adjust the purple weights. It’s not too bad, and I’ll go through each part.
- ɳ – The eta (funny looking ‘n’) in the beginning is the learning rate. This is a variable you tweak to adjust how fast the neural network learns. I’ll talk more about that some other time, but don’t assume that you’d want to set this as high as possible.
- (tk – zk) – Next, note that tk and zk aren’t bolded, so they’re single numbers rather than vectors: tk is what the kth output node was supposed to output, and zk is what it actually output. For us, we only have one output node.
- f'(netk) – Remember back in part II, where we were talking about the sigmoid function? f'(x) is the derivative of the sigmoid function. If I haven’t forgotten my calculus, it should be:
(3) $f'(x) = \frac{e^{-x}}{(1 + e^{-x})^2} = f(x) \, (1 - f(x))$
- netk is the dot product of the output node’s weights with its inputs (yj). Note that the yj are also the outputs of the hidden layer, and each one is calculated by f(netj). That f is the regular sigmoid, not its derivative.
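The pieces in the list above can be put together in a few lines. This is a sketch in Python rather than the series’ Erlang, and it assumes the sigmoid from part II is the standard logistic function 1 / (1 + e^-x). The function names and the sample numbers are mine, not from the code base.

```python
import math

def sigmoid(x):
    """The regular sigmoid f(x), assumed to be the logistic function."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_prime(x):
    """Derivative of the logistic sigmoid: f(x) * (1 - f(x)),
    which is the same thing as e^-x / (1 + e^-x)^2."""
    fx = sigmoid(x)
    return fx * (1.0 - fx)

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def output_weight_deltas(eta, t_k, weights_k, y):
    """Equation (2) for one output node k: one delta per incoming weight."""
    net_k = dot(weights_k, y)      # weighted sum of the hidden outputs
    z_k = sigmoid(net_k)           # what the output node actually produced
    return [eta * (t_k - z_k) * sigmoid_prime(net_k) * y_j for y_j in y]

# Made-up numbers: two hidden outputs feeding a single output node.
deltas = output_weight_deltas(eta=0.5, t_k=1.0, weights_k=[0.3, -0.2], y=[0.6, 0.9])
print(deltas)
```

If you don’t trust the derivative, you can compare sigmoid_prime(x) against (sigmoid(x + h) - sigmoid(x - h)) / (2 * h) for a small h. They should agree to several decimal places.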
In equation (2) above, we’ll need a part of it to send back to the hidden layers. We’ll represent it by a lower case delta (looks like an ‘o’ with a squiggly on top). It is called the sensitivity. This is what we propagate back to the other layers, and where the technique gets its name.
(4) $\delta_k = (t_k - z_k) \, f'(net_k)$
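As a quick sketch (Python, with names I made up, and assuming the logistic sigmoid), here’s the sensitivity on its own. Once you have it, equation (2) collapses to the learning rate times the sensitivity times the input on that weight.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_prime(x):
    fx = sigmoid(x)
    return fx * (1.0 - fx)

def output_sensitivity(t_k, net_k):
    """Equation (4): delta_k = (t_k - z_k) * f'(net_k), with z_k = f(net_k)."""
    z_k = sigmoid(net_k)
    return (t_k - z_k) * sigmoid_prime(net_k)

def output_weight_delta(eta, delta_k, y_j):
    """Equation (2) rewritten with the sensitivity: eta * delta_k * y_j."""
    return eta * delta_k * y_j

delta_k = output_sensitivity(t_k=1.0, net_k=0.0)   # made-up values
print(delta_k)
print(output_weight_delta(eta=0.5, delta_k=delta_k, y_j=0.6))
```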
The second equation dictates how to adjust all hidden layers. Note that it uses the sensitivity variable:
(5) $\Delta w_{ji} = \eta \left[ \sum_{k=1}^{c} w_{kj} \, \delta_k \right] f'(net_j) \, x_i$
- As you can see, this is more of the same. The only difference is the second term, which is the dot product of all the output node input weights (wkj) from a hidden node and the sensitivities (δk) across all output nodes the hidden node is connected to.
- netj is the same as before: it’s the dot product of the inputs xi with the input weights of the hidden node.
You’ll note that from the perspective of a single hidden node, the adjustment of its input weights depends on the set of inputs from the previous layer that is connected to it, and the set of sensitivities and the associated weights of the output layer from the next layer that the hidden node is connected to. netj is no exception since it is the dot product of xi and wji for all i. You can better see this in a picture. GIMP again.
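Equation (5) for a single hidden node can be sketched like this. It’s Python rather than Erlang, it assumes the logistic sigmoid, and the function name, the argument layout, and the numbers are all invented for the example.

```python
import math

def sigmoid_prime(x):
    fx = 1.0 / (1.0 + math.exp(-x))
    return fx * (1.0 - fx)

def hidden_weight_deltas(eta, x, weights_j, downstream):
    """Equation (5) for one hidden node j.

    x          -- the inputs x_i to the hidden node
    weights_j  -- the hidden node's input weights w_ji
    downstream -- list of (w_kj, delta_k) pairs, one per output node k
    """
    net_j = sum(w * xi for w, xi in zip(weights_j, x))
    # The bracketed sum: each output node's sensitivity, scaled by the
    # weight on the connection from this hidden node to that output node.
    back = sum(w_kj * delta_k for w_kj, delta_k in downstream)
    return [eta * back * sigmoid_prime(net_j) * xi for xi in x]

deltas = hidden_weight_deltas(
    eta=0.5,
    x=[1.0, 0.0],
    weights_j=[0.0, 0.4],
    downstream=[(0.8, 0.125), (-0.4, 0.05)],
)
print(deltas)   # the weight on the zero input doesn't move at all
```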
I know we don’t have 3 output nodes and 4 input nodes. It’s just to illustrate that from the perspective of the hidden node, this would be the information it needs from the layers surrounding it. In the code base we’ve written so far, the weights are contained in the node it’s connected to. So wji would belong to the hidden layer, and wkj would belong to the output layer. Therefore, the output layer would need to send both the sensitivity and the output layer input weights back to the hidden node.
This perspective is important, because Erlang follows an Actor model, where you model the problem as individual agents that pass messages back and forth to each other. We have now written how each individual node adjusts its weights, and that will help us in our coding.
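To give a feel for that message-passing view, here’s a toy stand-in for one hidden node, written as a plain Python class instead of an Erlang process. Everything here is hypothetical: the message names ("stimulate", "learn"), the tuple shapes, and the class itself are made up for the sketch and aren’t from the code base. It’s also simplified to wait until every downstream node has reported before adjusting, which isn’t necessarily how the real implementation will behave.

```python
import math

def sigmoid_prime(x):
    fx = 1.0 / (1.0 + math.exp(-x))
    return fx * (1.0 - fx)

class HiddenNode:
    """Toy stand-in for one perceptron process; messages are plain tuples."""

    def __init__(self, weights, eta):
        self.weights = list(weights)   # w_ji, owned by this node
        self.eta = eta
        self.inputs = None             # last x_i seen from the previous layer
        self.backprop = {}             # output node id -> (w_kj, delta_k)

    def receive(self, message):
        kind = message[0]
        if kind == "stimulate":        # hypothetical forward message
            _, inputs = message
            self.inputs = list(inputs)
        elif kind == "learn":          # hypothetical backward message
            _, sender, w_kj, delta_k = message
            self.backprop[sender] = (w_kj, delta_k)
        else:
            raise ValueError(f"unknown message: {kind!r}")

    def adjust_weights(self, expected_senders):
        """Apply equation (5) once every downstream node has reported."""
        if self.inputs is None or set(self.backprop) != set(expected_senders):
            return False
        net_j = sum(w * x for w, x in zip(self.weights, self.inputs))
        back = sum(w_kj * d_k for w_kj, d_k in self.backprop.values())
        step = self.eta * back * sigmoid_prime(net_j)
        self.weights = [w + step * x for w, x in zip(self.weights, self.inputs)]
        self.backprop.clear()
        return True

node = HiddenNode(weights=[0.0, 0.4], eta=0.5)
node.receive(("stimulate", [1.0, 0.0]))
node.receive(("learn", "out1", 0.8, 0.125))
print(node.adjust_weights(["out1", "out2"]))   # still waiting on out2
node.receive(("learn", "out2", -0.4, 0.05))
print(node.adjust_weights(["out1", "out2"]), node.weights)
```

Notice that each "learn" message carries both the sensitivity and the weight on the connection, since the hidden node doesn’t own that weight itself.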
This also means that as the current implementation is headed, I am assuming an asynchronous model of the neural network. Each perceptron will update when any of its inputs change. That means, like a digital circuit, there will be a minimum time that it takes for the output to reach a correct steady state and for the weight adjustments to propagate back. What this minimum time will be will probably depend on the number of hidden layers. We’ll see if it’ll work. I have a hunch it should be ok, as long as the inputs are throttled to wait until the minimum time passes before feeding it a new set of inputs. It might result in a lot of unnecessary messages, but if we can get away with it while keeping the code simple, I think it’s probably worth it.
Whew. That all took a long time. Probably a good four or five hours. Well, I was hoping to be done by part IV when I started this, but it looks like there’ll still probably be one or two more installments to this series. Next time, we’ll get to the code. I had intended to get to it this installment, but the code will make a lot more sense if you know what the math is saying about it.
In the meantime, I’ve gotta get to bed. It’s like 2am.