Algorithm always matters, even when parallel

I’ve finally had some time to start looking at things that interest me. One of those things was parallel algorithms.

I was reading Hacker News, and I saw some link that led me to discover Cilk, a C extension for easily writing parallel programs.

thread int fib(int n) {
  if (n < 2)
    return n;
  else {
    cont int x, y;
    x = spawn fib (n-2);
    y = spawn fib (n-1);
    sync;
    return x+y;
  }
}

This was the example given for calculating the Fibonacci sequence in parallel. It's the standard mathematical way to define it, and it looks clean enough. So instead of trying it out in Cilk, I fired up Erlang to try my hand at a port. I found it a little difficult because while you can easily spawn processes in Erlang, there was no quick way for the parent process to wait/sync/join child processes and get their results. Since that was beside the point of the exercise, I fired up Ruby instead, even though it has a slow threading library (which is supposed to be fixed in 1.9, plus Fibers!). I'll do it in Erlang some other time.
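(In fairness, you can do the join by hand with receive. Here's a sketch of what I mean, my own illustration rather than code from anywhere, with made-up module and tag names:)

-module(pfib).
-export([fib/1]).

% Parallel fib sketch: spawn a child process per subproblem, then block
% in receive until both results come back. The n1/n2 tags keep the two
% answers straight.
fib(N) when N < 2 -> 1;
fib(N) ->
    Parent = self(),
    spawn(fun() -> Parent ! {n1, fib(N - 1)} end),
    spawn(fun() -> Parent ! {n2, fib(N - 2)} end),
    X = receive {n1, A} -> A end,
    Y = receive {n2, B} -> B end,
    X + Y.

Each process only ever waits on its own two children, so a plain receive per child acts as the sync.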

First I wrote the threaded version to mimic the Cilk code:

def fib_threaded(n)
  if n < 2
    return 1
  else
    threads = []
    x = 0
    y = 0
    threads << Thread.new(n - 2) { |i| x = fib_threaded(i) }
    threads << Thread.new(n - 1) { |i| y = fib_threaded(i) }
    threads.each { |thread| thread.join }
    return x + y
  end
end

I don't have a multi-core processor. I'm on a three-year-old 1GHz laptop. At a mere fib(18), the threaded version was taking about 21 seconds to run. To see if there was a difference, I wrote a serial version.


def fib_serial(n)
  n < 2 ? 1 : fib_serial(n - 1) + fib_serial(n - 2)
end

This one ran much, much faster. It took about 0.02594 seconds. At this point, it's probably the overhead of thread creation that's making the threaded version take so long to run. Maybe with green threads or lightweight threads, the threaded version would run much faster. That makes me want to try it in Erlang just to compare. But wtf, adding shouldn't take that long, even if it is 0.025 seconds.

When I thought about it, though, it wasn't an efficient algorithm: there are a lot of wasted cycles. In order to compute f(n), you have to calculate f(n - 1) and f(n - 2) in separate threads.

  • The f(n - 1) thread has to spawn two more threads to compute f(n - 2) and f(n - 3).
  • The f(n - 2) thread has to spawn two more threads to compute f(n - 3) and f(n - 4).

Notice that the threads for f(n - 1) and f(n - 2) each spawn a thread to calculate f(n - 3). And since this algorithm gives threads no way to share their results, they have to recompute values all over again. The higher the n, the worse the problem gets, exponentially.
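To make the waste concrete: if the results were shared, the recomputation disappears. Here's a memoized sketch, in Erlang since I keep threatening to use it. It's my own illustration, and it assumes an Erlang with maps (dict would do the same job):

% Thread a cache of already-computed values through the recursion,
% so each f(n) is computed exactly once.
fib_memo(N) ->
    {Result, _Cache} = fib_memo(N, #{0 => 1, 1 => 1}),
    Result.

fib_memo(N, Cache) ->
    case maps:find(N, Cache) of
        {ok, V} -> {V, Cache};
        error ->
            {A, C1} = fib_memo(N - 1, Cache),
            {B, C2} = fib_memo(N - 2, C1),
            V = A + B,
            {V, C2#{N => V}}
    end.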

To calculate the speedup that adding more processors gives an algorithm, you take the total amount of work required and divide it by the span of the parallel graph. If that didn't make sense, read lecture one for Cilk, which is where the diagram comes from. So for fib(n):

Twork = O(2^n)

The total amount of work is the total number of calls spawned. Since every f(n) spawns two more calls, the call tree has on the order of 2^n nodes (more precisely Θ(φ^n), where φ ≈ 1.618, but exponential either way).

Tspan = O(n)

The span is the longest chain of dependent calculations, i.e. the height of the tree. The deepest path follows f(n) → f(n - 1) → ... → f(1), so that's about n nodes.

Therefore, for fib(n), the processor speedup is at most:

Twork / Tspan = O(2^n) / O(n) = O(2^n / n)

That's an enormous amount of parallelism: in principle you could keep on the order of 2^n / n processors busy. But it's nothing to brag about, because the parallelism is huge only because the algorithm does exponentially more work than it needs to; all those processors spend their time recomputing values that some other processor has already computed. Not so good for a parallel program that's just doing addition.

As a last version, I wrote one that computes the Fibonacci sequence from 0 up to n, keeping a running total as it goes, instead of the recursive version that has to work its way from n back down to zero.


def fib_loop(n)
  sequence = [1, 1]
  # the first two values are already there
  return sequence[n] if n < 2
  sequence = (1..n-1).inject(sequence) do |t, e|
    t << (t[-2] + t[-1])
  end
  sequence[-1]
end

It's not space-efficient since I wrote it quickly, but it beat the pants off the other two, running at 0.00014 seconds. As you can see, you're not recalculating any f(n) more times than you need to.
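And for completeness, the constant-space version is just as short. A tail-recursive sketch of my own, again in Erlang, that keeps only the last two values:

% fib(0) = fib(1) = 1, matching the Ruby versions above. The whole
% sequence is never stored; just the two most recent values.
fib_iter(N) -> fib_iter(N, 1, 1).

fib_iter(0, A, _B) -> A;
fib_iter(N, A, B) -> fib_iter(N - 1, B, A + B).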

I wish Cilk had picked a better first example for parallel programs. Given that the guy behind Cilk is the same guy that co-wrote the famous algorithms textbook, I guess I was surprised. However, it was a fun exercise, and it made me think about algorithms again.

I'll find some other little project that requires me to write in Erlang, rather than falling back on the comfortable Ruby. Snippet below if you want to run it yourself.

#!/usr/bin/ruby
# rubinacci.rb by Wilhelm

def fib_threaded(n)
  if n < 2
    return 1
  else
    threads = []
    x = 0
    y = 0
    threads << Thread.new(n - 2) { |i| x = fib_threaded(i) }
    threads << Thread.new(n - 1) { |i| y = fib_threaded(i) }
    threads.each { |thread| thread.join }
    return x + y
  end
end

def fib_serial(n)
  n < 2 ? 1 : fib_serial(n - 1) + fib_serial(n - 2)
end

def fib_loop(n)
  sequence = [1, 1]
  return sequence[n] if n < 2
  sequence = (1..n-1).inject(sequence) do |t, e|
    t << (t[-2] + t[-1])
  end
  sequence[-1]
end

def run(&algorithm)
  start_time = Time.now
  integer = 18
  answer = algorithm.call(integer)
  elapsed = Time.now - start_time

  puts "Fib(#{integer}) = #{answer}"
  puts "Total elapsed time: #{elapsed} seconds"
end

puts "threaded:"
run do |integer|
  fib_threaded(integer)
end

puts "serial:"
run do |integer|
  fib_serial(integer)
end

puts "loop:"
run do |integer|
  fib_loop(integer)
end


Erlang Advocacy and the class of problems it solves

I spend more time than I should reading Hacker News. Granted, I sometimes feel like it's the US Weekly or Maxim of the tech world, what with stories like "5 ways you know you're failing at your startup," the general hubbub over Facebook back in June, and the recent Microsoft and Yahoo buyout. However, the quality of the comments there is generally high, and on occasion I'll start replying to a comment, and before I know it, two hours have passed and it's better off as a blog post.

The challenge of the Erlang advocate is not to convince me, over and over, that Erlang wins its class; the challenge is to convince me that Erlang’s class of problem is so important to my life that I should study Erlang rather than vascular surgery, or television repair, or other obscure technical skills that I don’t know that much about.

The follow-up question, then, would be: how many existing problems can be converted into the type that Erlang solves well? And how many problems that were previously impractical are now practical to solve? If it ends up being a bigger class than you thought, you may well be limiting the number of problems you can solve easily and practically, out of the ones that will be important in the future.

I like Erlang (aside from some syntax ugliness), so I'll give it a shot. I think it's important because it makes it easier for a program to scale out rather than scale up. If it's running a parallelizable algorithm, then you can just add more cheap processors for a speedup, rather than designing a faster processor. We'd want speedups this way because CPUs are becoming multi-core, and to take advantage of them, one will have to write some type of parallel program, since it's proven difficult to fully automate the parallelization of serial programs. In addition, with bandwidth pipes getting bigger and the internet more and more pervasive, in the future it may be possible to leverage other computers you don't own (but have permission to use) for computational or storage resources, most of which won't belong to a single entity (think more SETI@home than Amazon).

In addition, parallelized systems (not just Erlang) can be more fault tolerant and can fail more gracefully (or hobble along, if you'd like to call it that). Sensor networks are one example. Instead of a single radar to survey the environment, you throw a bunch of sensors out there, they network themselves, and they report what they see/hear back to you. If some of them get destroyed, that's OK, because the system still functions with fewer sensors.

A swarm of UAVs doing surveillance over an area is another example. If you have a single computer commanding all the UAVs, it's actually quite hard, because you don't want them to crash into each other, and that's N^2 comparisons (less if you use oct-trees, probably). And if a target comes into the area, it becomes a non-trivial allocation problem: how do you decide which UAV to assign to monitor it, and when do you switch them out when their fuel runs low? It ends up that doing it with an actor model, where each UAV decides what to do at any given moment (local interactions), might not be optimal, but it's redundant and fault-tolerant.

Biological systems work this way as well. Gene expression is actually a network of genes being turned on and off by proteins expressed by other genes being turned on and off, recursively. If one gene can't activate another, it might not always be detrimental, because there might be other pathways to express that gene. Since I'm not as familiar with the intricacies of biological systems, I'll refrain from saying any more.

I was surprised, watching a video of Alan Kay talking about OOP, that the C++/Java-style OOP I learned in college wasn't what he had in mind. Rather, he meant OOP to be more like biological systems and more process-oriented. Encapsulation was only meant so that objects (analogous to actors in Erlang) would have to pass messages to each other (method calls).

So what concrete examples of problems fall into this class that Erlang is good at? The obvious ones are the embarrassingly parallel algorithms, like genetic algorithms, neural networks, 3D rendering, and, if I'm not mistaken, FFTs. Indexing web pages is another. But then again, there are other algorithms that are inherently serial, like protocol handshaking or Newton's method.

I don't know what the ratio between embarrassingly parallel and serial problems is, but my gut says that with the advent of multi-core and the availability of the internet, there will be plenty of parallel problems to go around.

The Ruby and Haxe language writers are both implementing the actor model like Erlang's, if that's any indication of how important they think it is. While I don't think Erlang will be the hundred-year language, the ideas it's a poster child for will reverberate in its descendant languages for a long time to come.

Erlang and Neural Networks Part IV

Ahh, Part IV. It’s been long overdue, mostly because I’ve been changing directions with my startup. I decided to drop everything I was doing, since it wasn’t working, and head in another direction with the startup. And lately, I’ve been messing around with mobile platforms as well as zoomable interfaces. I’ll talk more about that another time! But you came for neural networks. Last time in part III, we were able to connect the perceptrons to each other. This time, we’re going to look at how you’d actually learn.

The ways of learning

There are many types of neural networks, but the one we're building is a classic feed-forward neural network. It learns by adjusting the decision boundary that separates different classes in multi-dimensional space (for a single perceptron, that boundary is a hyperplane), so as to minimize classification error on what it has seen before. The way to adjust the boundary is to change the values of the weights in the neural network. But how much should you adjust them by?

The classic way is to use back propagation, which we’ll explore here. People since then have used other methods to calculate the weights, such as genetic algorithms and particle swarm optimization. You can basically use any type of optimization algorithm to adjust the weights.

Carrying the Error Backwards

Figuring out the error at the output node is easy. You simply subtract the output from what the output was supposed to be, and that's your error (not exactly, but that's the idea). The problem is, how do you adjust the weights of the hidden layers when you don't know what their outputs should have been? And even if you did, how would you know which way to adjust them, since each node affects other nodes?

The basic idea of back propagation is to take the output of the network, compare its decision with the decision it should have made, and, more importantly, measure how far off it was. That is the error of the decision. We'll take that error and propagate it backwards towards the input, so we'll know how to adjust the weights, layer by layer.

I'm not going to go too much into the hows and whys of back propagation, since I feel there are a lot of tutorials out there that do it justice. And I won't go into the proof either. It's mainly just multi-dimensional calculus, and it's not too hard to follow, actually; it's really just a matter of keeping the variables straight, since there are so many. I'll skip all that. But I will show and explain the result, since it makes understanding the code a lot easier.

I’m going to assume that most of my audience are programmers that didn’t much like math. If they did, they probably wouldn’t be reading this, and would have read the proof themselves from a textbook. Therefore, I’ll explain some math things that I otherwise would not. Math people, bear with me…or correct me.

Starting from the back of the bus

Calculating the change in weights for the output node isn’t too bad. Using my “awesome” GIMP skillz…it looks like this:

We’ll start from the back. I color coded it to make it easier to figure out what the equations are saying. (If a variable is bolded, that means it’s a vector) The error of output of the training input is:

(1) J(w) = ½ ∑k (tk - zk)² = ½ * ||t - z||²

where t is what the output should have been, and z is what we actually got from the neural network. J(w) is basically the sum of the errors across all output nodes. You want the square of the differences because you want to make all the differences positive before you sum them, so the errors don't cancel each other out. The double lines stand for the norm. You can think of the norm as the "length of the vector". It's just a convenient way to write it.
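As a quick worked example (numbers mine): if the targets are t = (1, 0) and the network outputs z = (0.8, 0.3), then J(w) = ½ * ((1 - 0.8)² + (0 - 0.3)²) = ½ * (0.04 + 0.09) = 0.065.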

If you wanted to derive back propagation, you'd take the derivative of J(w) with respect to w and try to minimize J. Remember what I said about going in the direction of steepest change in error? Well, to calculate change, you calculate the derivative (since a derivative is a rate of change), and that's why you'd do it in the proof. If you want to follow the proof, check out pages 290-293 of Pattern Classification by Duda, Hart, and Stork.

The hyperplane

So skipping all the proof, you’d get two equations. One for calculating the adjustment of weights in the output layer (red layer), and the other for calculating the adjustment in weights of all other layers before that (yellow and green layers).

(2) Δwkj = η * (tk - zk) * f'(netk) * yj

This is the equation to adjust the purple weights. It’s not too bad, and I’ll go through each part.

  • η – The eta (the funny-looking 'n') in the beginning is the learning rate. This is a variable you tweak to adjust how fast the neural network learns. I'll talk more about that some other time, but don't think that you'd want to set this as high as possible.
  • (tk - zk) – Next, note that tk and zk aren't bolded, so they're scalars: what the output was supposed to be, and the actual output of the neural network, at the kth output node. For us, we only have one output node.
  • f'(netk) – Remember back in part II, where we were talking about the sigmoid function? f'(x) is the derivative of the sigmoid function. If I haven't forgotten my calculus, it should be:

    (3) f'(x) = e^(-x) / (1 + e^(-x))²

    Equivalently, f'(x) = f(x) * (1 - f(x)), which is handy in code because you can reuse the output you've already computed.

  • netk is the dot product of the output node's weights with its inputs (yj). Note that yj is also the output of the hidden layer, calculated by f(netj); this is the regular sigmoid, not its derivative.

In equation (2) above, we'll need a part of it to send back to the hidden layers. We'll represent it with a lowercase delta (it looks like an 'o' with a squiggle on top). It is called the sensitivity. This is what we propagate back to the other layers, and it's where the technique gets its name.

(4) δk = (tk - zk) * f'(netk)

The second equation dictates how to adjust all hidden layers. Note that it uses the sensitivity variable:

(5) Δwji = η * [∑k=1 to c wkj * δk] * f'(netj) * xi

  • As you can see, this is more of the same. The only difference is the second term, which is the dot product of the input weights of the output nodes (wkj) coming from a single hidden node and the sensitivities (δk) across all the output nodes that the hidden node is connected to.
  • netj is as before: the dot product of the inputs xi with the input weights of the hidden node.

You'll note that from the perspective of a single hidden node, the adjustment of its input weights depends on the set of inputs from the previous layer that connect to it, and on the set of sensitivities and associated weights of the output layer in the next layer that the hidden node connects to. netj is no exception, since it's the dot product of xi and wji for all i. You can see this better in a picture. GIMP again.

I know we don't have 3 output nodes and 4 input nodes. It's just to illustrate that from the perspective of the hidden node, this is the information it needs from the layers surrounding it. In the code base we've written so far, the weights are stored in the node they feed into. So wji belongs to the hidden layer, and wkj belongs to the output layer. Therefore, the output layer needs to send both the sensitivity and its input weights back to the hidden node.

This perspective is important, because Erlang follows an Actor model, where you model the problem as individual agents that pass messages back and forth to each other. We have now written how each individual node adjusts its weights, and that will help us in our coding.
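To make that concrete, here's a sketch of equation (5) for a single hidden node, written as a plain function. This is just an illustration of the math, not the final code of the series; it assumes dot_prod/2 from part II, and that the back-propagated error term has already been collected from the output layer:

% Eta:        learning rate
% Inputs:     [x_i], the inputs to this hidden node
% Weights:    [w_ji], this node's input weights
% Back_error: sum over k of w_kj * delta_k, sent back by the output layer
adjust_weights(Eta, Inputs, Weights, Back_error) ->
    Net_j = dot_prod(Weights, Inputs),
    F_prime = math:exp(-Net_j) / math:pow(1 + math:exp(-Net_j), 2),
    % w_ji <- w_ji + eta * [sum_k w_kj * delta_k] * f'(net_j) * x_i
    lists:zipwith(fun(W_ji, X_i) ->
                    W_ji + Eta * Back_error * F_prime * X_i
                  end,
                  Weights, Inputs).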

This also means that, as the current implementation is headed, I am assuming an asynchronous model of the neural network. Each perceptron will update when any of its inputs changes. That means, like a digital circuit, there will be a minimum time it takes for the output to reach a correct steady state and for the weight adjustments to propagate back. What this minimum time is will probably depend on the number of hidden layers. We'll see if it works. I have a hunch it should be OK, as long as the inputs are throttled to wait for that minimal time before feeding in a new set of inputs. It might result in a lot of unnecessary messages, but if we can get away with it while keeping the code simple, I think it's probably worth it.

Whew. That all took a long time. Probably a good four or five hours. Well, I was hoping to be done by part IV when I started this, but it looks like there'll still probably be one or two more installments in this series. Next time, we'll get to the code. I had intended to get to it this installment, but the code will make a lot more sense if you know what the math is saying about it.

In the meantime, I’ve gotta get to bed. It’s like 2am.

Erlang and Neural Networks Part I
Erlang and Neural Networks Part II
Erlang and Neural Networks Part III

The 3 body problem in Erlang

The three body problem is a simulation of the paths that objects with mass would take in space, if all three had gravitational effects on each other. With two bodies, the problem can be solved analytically, as done by Kepler. But with three bodies, the paths are chaotic. If you just hit play on the last link and watch for a minute, you'll see what I mean. And that's the easy version of the problem, since the two suns are fixed. If they were three bodies of comparable mass, it'd be even harder.

From Ezra: http://ezrakilty.net/research/2006/02/3body_problem_in_erlang.html

The first conceptual problem I hit was the question of synchronization. In order for the sim to be a decent finite approximation to the continuous world of physics, we need to break it into discrete time steps, each of which depends on the previous time step (at least, that's the only way I know of to get a fair approximation). This means that each particle can't just crunch away at its own speed, working as fast as it can to calculate its position at various times. Easily the particles could grow out of sync with one another, making an inaccurate physical model.

I hadn’t thought about this, but I think Ezra is right. In terms of simulation of the 3 body problem, if the correct calculation in the future depends on current calculations, and the current calculations depend on each other, you need to make sure that the calculations are ‘in step’.

This calls into question my earlier thought that asynchronous simulations would just work: messages arrive whenever they arrive, and you process them then. In a decentralized simulation of termites gathering wood chips, I imagine an asynchronous simulation would suffice. It doesn't really matter what exact paths the termites take, but rather the end result of that chaos. But in a gravity simulation, asynchrony doesn't seem to work, because what you're interested in is the actual paths.

If the calculations of all the threads must be synchronous, or in lockstep, that would seem to put an upper bound on how fast the simulation can go, even in a multi-threaded environment. The calculations become increasingly wrong the further into the future you compute with slightly incorrect values, so what kind of useful computation can you do if you don't yet have all the initial conditions for your formula?
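To sketch what lockstep might look like in Erlang (a toy of my own, not Ezra's code): a coordinator only advances the clock once every body has reported in for the current step.

% Each body process is assumed to answer {step, From} by computing its
% next position and replying From ! {done, self()}.
run(_Bodies, 0) -> done;
run(Bodies, Steps) ->
    [Pid ! {step, self()} || Pid <- Bodies],          % kick off the step
    [receive {done, Pid} -> ok end || Pid <- Bodies], % barrier: wait for all
    run(Bodies, Steps - 1).

The second comprehension is the barrier: because each receive pattern names a specific Pid, the coordinator can't start step t + 1 until every body has finished step t.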

The only thing I can think of is this: if you had different sets of three threads (one thread per mass) processing the simulation at different simulation times, you could reduce the processing load for the trailing set of threads. Say you had a leading set of threads that always operated at simulation time t + n. That leading set can narrow the scope of possible answers: since it knows it's operating on a chaotic system, it knows roughly what the error is for a given lead time of n, so it should be able to put upper and lower bounds on the possible right answers. Then the trailing set of threads, operating at simulation time t, only has to correct the error, which hopefully is less computationally intensive.

Erlang and Neural Networks Part III

I had meant to do more on Erlang more quickly, but I got sidetracked by meta-programming. Here’s Part III of Erlang and Neural Networks!

Last time, I did Erlang and Neural Networks Part II. And we saw that a neural network is basically made up of interconnected perceptrons (or neurons), which are modeled as a linear combination of inputs and weights, with a non-linear function that modifies the output.

Drawing a line in the sand

Classifiers can often do very well working strictly from probabilities. But often we don't know the underlying probabilities for the data, and on top of that, we don't have lots of training data to build accurate probability densities. One way around that is to draw a line in the data space that acts as the decision boundary between two classes. That way, you only have to find the parameters (i.e. weights) of the line, which are often fewer in number than the entire probability space.

This is exactly what a perceptron does. It creates a decision boundary in data space. If the data space is a plane (2D, i.e. two inputs), it draws a line. In 3D it draws a plane, and in higher dimensions, a hyperplane.

So Why Not Go Linear?

The problem with just using a perceptron is that it can only classify data that is linearly separable, meaning data you can separate with a line. The XOR problem is a simple illustration: you can't draw a single line that separates the on and off outputs of an XOR. Minsky and Papert wrote a famous book, Perceptrons, that kinda killed off research in this field for about a decade, because they pointed this out.

So to get around this linearity, smart people eventually figured out that they can chain perceptrons together in layers, and that gives the network the ability to approximate just about ANY function, non-linear ones included, given enough hidden units.
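To make that concrete, here's a hand-weighted sketch of a two-layer network computing XOR. The weights and biases are picked by hand rather than learned, and I'm using a step threshold instead of a sigmoid to keep the arithmetic obvious:

step(X) when X > 0 -> 1;
step(_) -> 0.

% weighted sum of the inputs plus a bias, thresholded
unit(Weights, Bias, Inputs) ->
    step(lists:sum(lists:zipwith(fun(W, I) -> W * I end, Weights, Inputs)) + Bias).

% the hidden layer computes OR and NAND; the output ANDs them together
xor_net(A, B) ->
    Or = unit([1, 1], -0.5, [A, B]),
    Nand = unit([-1, -1], 1.5, [A, B]),
    unit([1, 1], -1.5, [Or, Nand]).

No single line through the four XOR points can separate them, but OR and NAND each draw one line, and the output unit combines the two half-answers.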

Shake my hand and link up to form Voltron

Let’s try linking our perceptrons together. We’re going to add two more messages to our perceptrons:

perceptron(Weights, Inputs, Output_PIDs) ->
  receive
    % The other messages from part II

    {connect_to_output, Receiver_PID} ->
      Combined_output = [Receiver_PID | Output_PIDs],
      io:format("~w output connected to ~w: ~w~n",
                [self(), Receiver_PID, Combined_output]),
      perceptron(Weights, Inputs, Combined_output);
    {connect_to_input, Sender_PID} ->
      Combined_input = [{Sender_PID, 0.5} | Inputs],
      io:format("~w inputs connected to ~w: ~w~n",
                [self(), Sender_PID, Combined_input]),
      perceptron([0.5 | Weights], Combined_input, Output_PIDs)
  end.

connect(Sender_PID, Receiver_PID) ->
  Sender_PID ! {connect_to_output, Receiver_PID},
  Receiver_PID ! {connect_to_input, Sender_PID}.

We would never call connect_to_output or connect_to_input directly [1]. We'd just use connect(). It basically adds each perceptron's process ID to the other's list, so they know who to send messages to when they have an output.

We can now connect up our perceptrons, but the way it currently stands, we'd have to send a separate message to each perceptron that serves as an input to the network. This is tedious. We are programmers and we are lazy. Let's make a perceptron also double as a source node. A source node simply passes its input along to its outputs.

perceptron(Weights, Inputs, Output_PIDs) ->
  receive
    % previous messages above and in part II

    {pass, Input_value} ->
      lists:foreach(fun(Output_PID) ->
                      io:format("Stimulating ~w with ~w~n",
                                [Output_PID, Input_value]),
                      Output_PID ! {stimulate, {self(), Input_value}}
                    end,
                    Output_PIDs),
      % loop so the process is ready for the next message
      perceptron(Weights, Inputs, Output_PIDs)
  end.

Now we can start creating perceptrons.

64> N1_pid = spawn(ann, perceptron, [[],[],[]]).
<0.61.0>
65> N2_pid = spawn(ann, perceptron, [[],[],[]]).
<0.62.0>
66> N3_pid = spawn(ann, perceptron, [[],[],[]]).
<0.63.0>

Note that we get back three process IDs of the three perceptrons we created. Then we start connecting them.

67> ann:connect(N1_pid, N2_pid).
<0.61.0> output connected to <0.62.0>: [<0.62.0>]
<0.62.0> inputs connected to <0.61.0>: [{<0.61.0>,0.500000}]
{connect_to_input,<0.61.0>}
68> ann:connect(N1_pid, N3_pid).
<0.61.0> output connected to <0.63.0>: [<0.63.0>,<0.62.0>]
<0.63.0> inputs connected to <0.61.0>: [{<0.61.0>,0.500000}]
{connect_to_input,<0.61.0>}

We used N1 as an input node connected to perceptrons 2 and 3. So if N1 is passed a value, N2 and N3 should be stimulated with that value.

69> N1_pid ! {pass, 0.5}.
Stimulating <0.63.0> with 0.500000
{pass,0.500000}
Stimulating <0.62.0> with 0.500000

<0.63.0> outputs: 0.562177

<0.62.0> outputs: 0.562177

Hurray! So now the network's got tentacles that we can connect all over the place, writhing and wiggling with all their glee. However, this is currently a DUMB network. It can't classify anything, because we haven't told it how to learn anything yet. How does it learn to classify things? It does so by adjusting the weights of the inputs of each perceptron in the network. And this is the crux of neural networks in all their glory. But you'll have to wait till next time!

[1] Note that the last message, connect_to_input, isn't followed by a semicolon. That means every message clause before it in perceptron needs to end with one. So if you've been following along, the stimulate message from part II needs a semicolon at the end of it now.

Erlang and Neural Networks Part I
Erlang and Neural Networks Part II
Erlang and Neural Networks Part III

Comments on the death of computing

This article starts off as a complaint, or a lament, about the state of edge CS, and probably serves as a warning, though the conclusion is not as hopeful or optimistic as it could be. Or that could possibly be a lack of imagination. To start:

There was excitement at making the computer do anything at all. Manipulating the code of information technology was the realm of experts: the complexities of hardware, the construction of compilers and the logic of programming were the basis of university degrees.

However, the basics of programming have not changed. The elements of computing are the same as fifty years ago, however we dress them up as object-oriented computing or service-oriented architecture. What has changed is the need to know low-level programming, or any programming at all. Who needs C when there's Ruby on Rails?

Well, part of it is probably a lament by the author–presumably a scholar–on the loss of status and the general dilution in the quality of people in the field. And the other part is about how there’s nowhere interesting left to explore in the field.

To address the first part: it's well known that engineers and programmers (like any other profession) like to work with great, smart people. Usually, when a leading field explodes, it attracts these great, smart people. However, the nature of the field of technology is to make doing something cheaper, faster, or easier, and as a technology matures, the barriers to entry in the field get lower. As a result, people who couldn't make it into the field before now can, and the average quality of people in the field dilutes. People used to do all sorts of research on file access. But now, any joe programmer doesn't think about any of that and just uses the 'open' method to access files on disk. But that's the nature of technology, and it's as it should be.

The environment within which computing operates in the 21 century is dramatically different to that of the 60s, 70s, 80s and even early 90s. Computers are an accepted part of the furniture of life, ubiquitous and commoditised.

And again, this is the expected effect of technology. Unlike other professions, in engineering one is able to build technology that gives people leverage over those who don't use it. This gives an acceleration and a scalability of productivity that you won't find in other professions. If you're a dentist, there is an upper limit to the number of patients you can see. In order to be even more productive, you'd need to create a clinic (a dentist farm) to parallelize patient treatment, and you need other dentists to do that. If you're an engineer, the technology you build is a multiplier, and you don't even need other people to use the multiplier.

But at a certain point, the mass adoption of a technology makes it cheaper, and hence your leverage over other people isn't that great anymore, and you begin to look for other technologies that make your life easier or give you an edge over your competition. But these are all applications arguments for CS; while important in attracting new talent, they don't address what the field has left to explore at its edge.

As for whether CS is really dead or not, I think there's still quite a bit of work to be done at the edges. Physicists in the late 1800s claimed there wasn't much interesting going on, until General Relativity blew up in their faces. Biology had its big paradigm shift with Darwin, but there's still a host of interesting unknown animals being discovered (like the giant squid), and I'm sure alien biology or a revival of Darwin's sexual selection would help open up another shift. Engineering suffered the same thing in the early 1900s, when people with only a background in electromechanical and steam-powered devices thought there wasn't much left to invent or explore, until the advent of computing, spurred on by the Second World War.

In terms of near-term computing problems, there's still a lot of work to be done in AI and all its offshoot children, such as data mining, information retrieval, and information extraction. We still can't build software systems reliably, so better programming constructs are ever being explored. Also, since multi-core processors are starting to emerge, better concurrent programming constructs are being developed (or rather, taken up again… Seymour Cray was doing vector processors a long while back).

But I’m guessing the author of the article is looking for something like a paradigm shift, something so grand that it’ll be prestigious again, and attract some bright minds again.

In the end, he is somewhat hopeful:

The new computing discipline will really be an inter-discipline, connecting with other spheres, working with diverse scientific and artistic departments to create new ideas. Its strength and value will be in its relationships.

There is a need for innovation, for creativity, for divergent thinking which pulls in ideas from many sources and connects them in different ways.

This, I don't disagree with. I think far-term computing can draw from other disciplines as well as be applied to them. With physics, there's currently work on quantum computers. Computing has contributed to biology through bioinformatics and the sequencing of genes, and drawn from it ideas like ant colony optimization algorithms and DNA computers. In the social sciences, it has contributed concurrent, decentralized simulations of social phenomena, and drawn from them ideas like particle swarm optimization.

One day, maybe it will be feasible to hack your own bacteria and program them just as you would a computer. And then a professor might lament that any 14-year-old kid can hack his own lifeform, when it used to be in the realm of professors. But rest assured, there will always be other horizons in the field to pursue.

Erlang and neural networks, part II

Two weeks ago, I did a post about Erlang (Part I), and how a simple feed-forward neural network might be a nice little project to do on the side, just to learn about Erlang. Here’s what came next.

State of the Purely Functional

In the transition from imperative/procedural programming to functional programming, there are obviously things that you have to get over. You’ll hear this from a lot of people just learning functional programming for the first time (myself included). The hardest thing for me to get over in a pure functional language is the absence of state. My first reaction was, “Well, how do you get anything done?”

Not having state has its advantages, and you'll hear stuff about side effects and referential transparency. But I'd like to think of it as: things that don't have state can't be broken; they just exist. However, state is useful in computation, and different languages have different ways of dealing with it. With Haskell, you use monads. At first, I figured it was the same with Erlang. But this short tutorial on Erlang simply states that Erlang uses threads (processes, in Erlang-speak) to keep state.

This maps pretty well onto what I'm trying to do. Each perceptron will be a thread, and they'll send messages back and forth to each other as they fire and stimulate one another.
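A tiny sketch of the idiom, before we apply it to perceptrons: a process keeps its "state" in its function arguments, and "changes" it by looping with new values.

% A counter process: the state is the argument N.
counter(N) ->
    receive
        increment ->
            counter(N + 1);
        {get, From} ->
            From ! {count, N},
            counter(N)
    end.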

The essence of a perceptron

So once again, this is a perceptron. It’s a weighted sum (a dot product) of the inputs, which is then thresholded by f(e). So we’ll write a thresholding function and a weighted sum in Erlang.

We start by declaring the name of the module, and the functions to export from the module.

-module(ann).
-export([perceptron/3, sigmoid/1, dot_prod/2, feed_forward/2,
         replace_input/2, convert_to_list/1]).

I exported most of the functions, so I can run them from the command line. I’ll remove them later on.

First we write our thresholding function. We will use the sigmoid function as our thresholding function. It’s pretty easy to explain. A value, X goes in, another value comes out. It’s a math function.

sigmoid(X) ->
  1 / (1 + math:exp(-X)).

Since I wasn't familiar with all the libraries in Erlang yet, I wrote a dot product function myself, and it wasn't too bad. Erlang, for the most part, doesn't use loops, just as Ruby doesn't. Both can, if you want to write a FOR control function, but the common way is to use library functions for list processing, list comprehensions, or recursion. The first clause below is the base case, and the second is what you'd do if the "recursion fairy" took care of the rest.

dot_prod([], []) ->
  0;
dot_prod([X_head | X_tail], [Y_head | Y_tail]) ->
  X_head * Y_head + dot_prod(X_tail, Y_tail).
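(It turns out the standard library can get you most of the way there, too. An equivalent one-liner, just for comparison:)

dot_prod2(Xs, Ys) ->
  lists:sum(lists:zipwith(fun(X, Y) -> X * Y end, Xs, Ys)).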

Simple, so far, right? So to calculate the feed forward output of a perceptron, we’ll do this:

feed_forward(Weights, Inputs) ->
  sigmoid(dot_prod(Weights, Inputs)).

The body of a nerve

So far, so good. But we still need to create the actual perceptron! This is where the threads and state-keeping comes up.

perceptron(Weights, Inputs, Output_PIDs) ->
  receive
    {stimulate, Input} ->
      % add Input to Inputs to get New_inputs...
      % calculate output of perceptron...
      perceptron(Weights, New_inputs, Output_PIDs)
  end.

This is a thread, and it receives messages from other threads. Currently, it only accepts one message, stimulate, from other threads. This is the message other perceptrons will use to send their output to this perceptron's inputs. Notice that at the end of handling the message, we call perceptron again, with New_inputs. That's how we maintain and change state.

Note that this won't result in a stack overflow: the recursive call is in tail position, so Erlang does tail-call optimization and throws away the previous stack frame. It can do so because no state is kept between messages; everything the next iteration needs to know is passed into the call to perceptron, so the previous instance of the call can be discarded.

We do hit a snag, though. How do we know which perceptron the incoming input is from? We need to know this because we need to weight it correctly. My solution is to make Input a tuple, consisting of {Process_ID_of_sender, Input_value}. Then I keep a list of these tuples, like a hash of PIDs to input values, and convert them to a plain list of input values when I need to calculate the output. Therefore, we end up with:

perceptron(Weights, Inputs, Output_PIDs) ->
  receive
    {stimulate, Input} ->
      % add Input to Inputs to get New_inputs...
      New_inputs = replace_input(Inputs, Input),

      % calculate output of perceptron...
      Output = feed_forward(Weights, convert_to_list(New_inputs)),

      perceptron(Weights, New_inputs, Output_PIDs)
  end.

replace_input(Inputs, Input) ->
  {Input_PID, _} = Input,
  lists:keyreplace(Input_PID, 1, Inputs, Input).

convert_to_list(Inputs) ->
  lists:map(fun(Tup) ->
              {_, Val} = Tup,
              Val
            end,
            Inputs).

The map function you see in convert_to_list() is the same as the map function in Ruby, which would go:

def convert_to_list(inputs)
  inputs.map { |tup| tup.last }
end

Now, there's one last thing that needs to be done. Once we calculate an output, we need to fire that output off to the perceptrons that accept it as their input. And if this perceptron isn't connected to any others, then it should just print out its value. So then we end up with:

perceptron(Weights, Inputs, Output_PIDs) ->
  receive
    {stimulate, Input} ->
      New_inputs = replace_input(Inputs, Input),
      Output = feed_forward(Weights, convert_to_list(New_inputs)),
      if Output_PIDs =/= [] ->
           lists:foreach(fun(Output_PID) ->
                           Output_PID ! {stimulate, {self(), Output}}
                         end,
                         Output_PIDs);
         Output_PIDs =:= [] ->
           io:format("~n~w outputs: ~w", [self(), Output])
      end,
      perceptron(Weights, New_inputs, Output_PIDs)
  end.

We know which perceptrons to output to, because we keep a list of perceptron PIDs that registered with us. So if the list of Output_PIDs is not empty, then for each PID, send them a message with a tuple that contains this perceptron’s PID as well as the calculated Output value. Let’s try it out:

Test Drive

1> c(ann).
{ok,ann}
2> Pid = spawn(ann, perceptron, [[0.5, 0.2], [{1,0.6}, {2,0.9}], []]).
<0.39.0>
3> Pid ! {stimulate, {1,0.3}}.

<0.39.0> outputs: 0.581759
{stimulate,{1,0.300000}}
4>

So you can see, we got an output of 0.581759. We can verify this by doing this on our TI-85 calculator:

x = 0.5 * 0.3 + 0.2 * 0.9
Done
1 / (1 + e^-x)
.581759376842

And so we know our perceptron is feeding forward correctly! Next time, we'll figure out how to propagate its error back to adjust the weights, and how to connect perceptrons up to each other.

Erlang and Neural Networks Part I
Erlang and Neural Networks Part II
Erlang and Neural Networks Part III