What the heck is a Y combinator

I woke up at 4am this morning for no reason and decided to work through Y Combinator in Ruby. Given that I read Hacker News daily, it’s a shame I didn’t know what a Y combinator was. I thought the article was pretty clear, at least compared to Raganwald’s article. As an aside, purely as a learning experience, it was kinda fun and eye-opening. At least I can see why geeks think it’s neat.

I had written a reply to a comment on Hacker News about it, and it turned out to be really long, so I extracted it out here. I was supposed to start some coding, but I got wrapped up in writing the comment. It ended up pretty long, though shorter than it could have been. You guys get to suffer through the long version. Haha. But I’m not going to go into details. You can read those in the linked articles. Instead, I’ll paint some broad strokes.

Y = λf·(λx·f (x x)) (λx·f (x x))

or in Ruby, copied from the aforementioned article:

y = proc { |generator|
  proc { |x|
    proc { |*args|
      generator.call(x.call(x)).call(*args)
    }
  }.call(proc { |x|
    proc { |*args|
      generator.call(x.call(x)).call(*args)
    }
  })
}
or, if you prefer elegance, Tom’s solution in response to Raganwald:

def y(&f)
  lambda { |x| x[x] } [
    lambda { |yf| lambda { |*args| f[yf[yf]][*args] } } ]
end

I found the lambda calculus above hard to read. However, if you go through the code in Y Combinator in Ruby, you’ll find it’s not too bad. I find that this lecture is also pretty good, as it takes you step by step, with a little bit of humor as well.
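To see the combinator actually drive a recursion, here’s a minimal sketch (my own example, not from either article) that uses the applicative-order version above to build factorial without ever naming the function recursively:

```ruby
# Applicative-order Y combinator (sometimes called the Z combinator),
# written as a Ruby lambda.
y = lambda do |generator|
  lambda { |x| x.call(x) }.call(
    lambda do |x|
      # Wrapping in |*args| delays evaluation, so Ruby's eager
      # argument evaluation doesn't loop forever.
      lambda { |*args| generator.call(x.call(x)).call(*args) }
    end
  )
end

# A "generator": given a function that computes factorial, it returns
# a factorial function. Note it never refers to itself by name.
fact_gen = lambda do |fact|
  lambda { |n| n.zero? ? 1 : n * fact.call(n - 1) }
end

fact = y.call(fact_gen)
fact.call(5)  # => 120
```

The only names used are ordinary local variables bound to anonymous functions; the recursion itself comes entirely from the combinator feeding the generator back into itself.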

If I had to take a stab at it: a Y combinator is a way to implement a recursion mechanism when the language doesn’t provide named recursion, loops, or iterators, and all you get are first-class functions and a few substitution rules.

Functions are just mappings from one set of things to another set of things, i.e. give it a number, and it’ll point to another number. The Y combinator relies on a property of functions: sometimes, when you put something into a function, you get the exact same thing out, i.e. something gets mapped to itself. For f(x) = x^2, f(1) = 1 is an example of this property. They call this a fixed point of the function.

The thing about functions is that they don’t just have to map numbers to numbers. They can map functions to other functions. A derivative is an example of a function that takes one function as input and spits another function back out, like d(x^2) = 2x. Where is the fixed point of the derivative? One of them is e^x, since d(e^x) = e^x. I’m sure there are more.

This is important because if you can find the point at which a function returns a function unchanged, you can use that to call the function again, which is what we call recursion. All the trickiness you see in the Y combinator is mainly a result of functional programming not keeping state, so you have to pass everything you need into a function. So if you want recursion, you need a mechanism to pass the function itself along, so you can call it again. That mechanism kinda bends back onto itself, and that’s why you see part of the Y combinator repeated twice in the code above, and in the lambda calculus.

It seems pretty inane to use a Y combinator given that modern high-level languages provide named recursion, and if anything, for loops and iterators with closures. But what if you don’t have those? How do you process lists or arrays without loops or named recursion? Generally, you’d have to build your own recursion mechanism.
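As a concrete sketch of what that looks like (my own illustration, assuming the same combinator as above), here is a list summed with no loops, no iterators, and no named recursion, just first-class functions:

```ruby
# Same applicative-order Y combinator as above.
y = lambda do |gen|
  lambda { |x| x.call(x) }.call(
    lambda { |x| lambda { |*args| gen.call(x.call(x)).call(*args) } }
  )
end

# Sum a list using only function application. Array#empty?, #first,
# and #drop are used purely as list primitives here, not as iteration.
sum = y.call(lambda do |rec|
  lambda { |list| list.empty? ? 0 : list.first + rec.call(list.drop(1)) }
end)

sum.call([1, 2, 3, 4])  # => 10
```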

When would you ever have a language without those things? Probably not often. But Forth comes to mind. I’ve never used Forth, but from what little I know about it, the language starts off with some basic primitives and that’s it. No loops, no if statements. How do you get anything done? Well, you build your own control structures. People use Forth because it’s possible to build your own very small compiler from the ground up, written in Forth itself, and still understand the entire thing. I suspect it’s used in embedded programming for that reason.

So you’d pretty much use it when you want a powerful system built from first principles. I’m guessing it’s possible to build your own computer by implementing only functions and substitution rules in hardware, and then deriving everything else, numbers, pairings, even recursion, in software. That way, you keep your hardware costs down while retaining the power of a Turing machine.

Speculating here…but another thing that might be interesting is that the Y combinator might be a way to compress code. Software size should be reducible if the code is compressible, and recursion can be thought of as compressing code, expanding it when the recursion runs. I wonder if there are other ways to reduce software bloat with a Y combinator besides recursion?


Naive Bayesian Classifier Hashed and Slashed

I know, I’ve been bad at posting over the course of the week, when readers would be more likely to be reading this stuff at work. But I’ve been busy myself too…mostly thinking about non-technical stuff.

Well, some weeks ago, I spent some time looking at Bayesian probabilities. I learned them way back in high school, though I never gave them much thought. Then I stumbled on them again in college in my ECE courses, and nearly failed the class. And then I took probability again in grad school, and though I did well enough in the class, I still felt weak in it.

This time, when implementing a Bayesian classifier, I learned the ins and outs of a seemingly simple naive Bayesian classifier, and I learned how to spell ‘bayesian’. That ‘e’ after the ‘y’ gets me every time.

Anyway, I was looking at Paul Graham’s famous A Plan for Spam, and I couldn’t figure out where he got that last formula. Kinda mad at myself for not seeing it earlier, cuz it really is pretty simple. Anyway, it took me a while, but I worked it out. Turns out I learned more when I implemented it…or rather, I would have learned it had I concentrated the first two times.

We know Bayes’ theorem is as follows:

(1) P(a,b) = P(a|b) * P(b) = P(b|a) * P(a)

With some algebra on the above, we can also derive

(2) P(a|b) = P(b|a) * P(a) / P(b)

But also note that if we take (1) and condition everything on a given ‘d’, it still holds true, as long as the probabilities on the other side are also conditioned on ‘d’. If you draw out the Venn diagrams, you’ll see this is true.

(3) P(a,b|d) = P(a|b,d) * P(b|d) = P(b|a,d) * P(a|d)

We also have the Total Probability Rule, which says that the total probability is made up of its parts. If you apply Bayes’ rule from (1) to each term, you’ll see that it’s true.

(4) P(b) = P(b|a) * P(a) + P(b|a’) * P(a’)

So this means that Bayes’ rule in (2) can be rewritten with (4) as:

(5) P(a|b) = P(b|a) * P(a) / (P(b|a) * P(a) + P(b|a’) * P(a’))
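To convince myself of (5), here’s a quick numeric sanity check (the numbers are made up purely for illustration), computing P(a|b) both via (2) and via (5) and confirming they agree:

```ruby
# Made-up numbers: P(a) = 0.4, P(b|a) = 0.8, P(b|a') = 0.1.
p_a      = 0.4   # P(a)
p_b_a    = 0.8   # P(b|a)
p_b_nota = 0.1   # P(b|a')

# Equation (4): total probability gives P(b).
p_b = p_b_a * p_a + p_b_nota * (1 - p_a)   # 0.32 + 0.06 = 0.38

# Equation (2) and equation (5) must give the same answer,
# since (5) just expands P(b) using (4).
via_2 = p_b_a * p_a / p_b
via_5 = p_b_a * p_a / (p_b_a * p_a + p_b_nota * (1 - p_a))
```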

We also need the Probability Chain Rule. It says that the joint probability of a, b, and c can be rewritten as the following, by applying equation (1) over and over again:

(6) P(a,b,c) = P(a,b|c) * P(c) = P(a|b,c) * P(b|c) * P(c)

And lastly, the Independence Rule, which is what makes the Bayesian classifier naive. In the conditional form we’ll actually use:

(7) P(a,b|c) = P(a|b,c) * P(b|c) => P(a|c) * P(b|c) iff “a” indep. of “b” given “c”

Now, we can solve for the probability of spam given the joint probability of words, where each word is considered an orthogonal, independent dimension. The label on the left marks which rule each step applies:

    P(s|f0,f1) = P(f0,f1|s) * P(s) /
(1)                P(f0,f1)
               = P(f0,f1|s) * P(s) /
(4)                (P(f0,f1|s) * P(s) + P(f0,f1|s') * P(s'))
               = P(f0|f1,s) * P(f1|s) * P(s) /
(6)                (P(f0|f1,s) * P(f1|s) * P(s) + P(f0|f1,s') * P(f1|s') * P(s'))
               = P(f0|s) * P(f1|s) * P(s) /
(7)                (P(f0|s) * P(f1|s) * P(s) + P(f0|s') * P(f1|s') * P(s'))
              ~= P(f0|s)*..*P(fn|s) * P(s) /
                   (P(f0|s)*..*P(fn|s) * P(s) + P(f0|s')*..*P(fn|s') * P(s'))
              ~= P(f0|s)*..*P(fn|s) /
                   (P(f0|s)*..*P(fn|s) + P(f0|s')*..*P(fn|s'))

The last step needs a little explaining. All the P(s) and P(s’) terms drop out of the equation when we’re building a classifier, since for any piece of evidence f0…fn, P(s), the probability of spam occurring, is always the same across any classification. Since P(s) is constant, and P(s’) is (1 – P(s)), it is also constant. Therefore, when we compare values to decide whether a message belongs in the spam or ham category, we can drop the constants.
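That final formula translates almost directly into code. A minimal sketch, where `spam_probs` and `ham_probs` are hypothetical hashes of per-word conditional probabilities P(f|s) and P(f|s’) you’d get from training counts:

```ruby
# Hypothetical per-word conditional probabilities from training.
spam_probs = { "viagra" => 0.9,  "meeting" => 0.2 }   # P(f|s)
ham_probs  = { "viagra" => 0.05, "meeting" => 0.6 }   # P(f|s')

# P(f0|s)*..*P(fn|s) / (P(f0|s)*..*P(fn|s) + P(f0|s')*..*P(fn|s'))
def spam_score(words, spam_probs, ham_probs)
  ps = words.map { |w| spam_probs[w] }.inject(:*)
  ph = words.map { |w| ham_probs[w] }.inject(:*)
  ps / (ps + ph)
end

spam_score(["viagra", "meeting"], spam_probs, ham_probs)
# 0.9*0.2 / (0.9*0.2 + 0.05*0.6) ≈ 0.857
```

In practice you’d multiply these in log space to avoid underflow on long messages, but the shape of the formula is the same.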

The actual hard part of Bayesian classifiers is estimating the underlying probability distribution. If you’ve never seen a piece of evidence in training, you’ll say the probability of it occurring is zero, which isn’t correct when that evidence shows up during classification. There are various techniques for dealing with this, mostly under the term ‘smoothing’. I won’t describe the various techniques here, but that should be enough to get you started.
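The simplest of those techniques is Laplace (add-one) smoothing: pretend every word in the vocabulary was seen once more than it actually was, so nothing gets probability zero. A minimal sketch, assuming you just have raw word counts from one class’s training corpus:

```ruby
# Laplace (add-one) smoothing over a hash of word counts.
# Unseen words get (0 + 1) / (total + vocab_size) instead of zero.
def smoothed_prob(word, counts, vocab_size)
  total = counts.values.inject(0, :+)
  (counts.fetch(word, 0) + 1.0) / (total + vocab_size)
end

counts = { "viagra" => 3, "meeting" => 1 }  # word counts in the spam corpus
vocab  = 5                                  # assumed total vocabulary size

smoothed_prob("viagra", counts, vocab)  # (3 + 1) / (4 + 5) = 4/9
smoothed_prob("unseen", counts, vocab)  # (0 + 1) / (4 + 5) = 1/9, not zero
```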