Painite teaser

It seemed like “What the heck is a Ycombinator” struck a chord with people. In retrospect, I probably should have done less rambling. So after a silence of a week, I’ll get to the point. Here’s a little teaser of what I’ve been working on over the holidays.


require 'painite'

p_es = Prob::EventSpace.new do |es|
es << Prob::RandVar.new("cancer") do |pdf|
pdf["cancer = :sick"] = 0.01
pdf["cancer = :healthy"] = 0.99
end
es << Prob::RandVar.new("test | cancer") do |pdf|
pdf["test = :pos | cancer = :sick"] = 0.8
pdf["test = :neg | cancer = :sick"] = 0.2
pdf["test = :pos | cancer = :healthy"] = 0.096
pdf["test = :neg | cancer = :healthy"] = 0.904
end
end

p_es["cancer = :sick | test = :pos"] # => 0.8 * 0.01 / (0.8 * 0.01 + 0.096 * 0.99)

It’s still not quite complete, as it’s harder than I expected to generalize it. The interface is still a little bit in flux, but I hope to nail it down soon.

Advertisements

What the heck is a Ycombinator

I woke up at 4am this morning for no reason and decided to go through Ycombinator in Ruby. Given that I read Hacker News daily, it’s a shame I didn’t know what a Ycombinator was. I thought the article was pretty clear, at least compared to Raganwald’s article. As an aside, from purely a learning experience, it was kinda fun and eye opening. At least I can see why geeks think it’s neat.

I had written a reply to a comment on hacker news about it, and it turned out to be really long, so I just extracted it out here. I was suppose to start some coding, but I got wrapped up in writing up the comment. My comment ended up to be pretty long, but it was shorter than it could have been. You guys get to suffer through the long version. Haha. But I’m not going to go into details. You can read that in the linked articles. Instead, I’ll paint some broad strokes.


Y = λf·(λx·f (x x)) (λx·f (x x))

or in Ruby, copied from the aforementioned article:


y = proc { |generator|
proc { |x|
proc { |*args|
generator.call(x.call(x)).call(*args)
}
}.call(proc { |x|
proc { |*args|
generator.call(x.call(x)).call(*args)
}
})
}

or if you prefer elegance, Tom’s solution in response to Raganwald


def y(&f)
lambda { |x| x[x] } [
lambda { |yf| lambda { |*args| f[yf[yf]][*args] } } ]
end

I found lambda calculus above hard to read. However, if you go through the code in Y Combinator in Ruby, you’ll find it’s not too bad. I find that this lecture is also pretty good, as it takes you step by step, with a little bit of humor as well.

If I had to take a stab at it, A Ycombo is a way to implement recursion mechanism when the language doesn’t provide named recursion, loops, or iterators, and all you get are first-class functions and a few substitution rules.

Functions are just mappings from one set of things to another set of things–ie. give it a number, and it’ll point to another number. Ycombo relies on a property of functions that sometimes, when you put something into a function, you get the exact same thing out, i.e. something gets mapped to itself, like f(x) = x^2, where f(1) = 1 is an example of this property. They call this the fixed point of a function.

The thing about functions is that they don’t just have to map numbers to numbers. They can map functions to other functions. A derivative is an example of a function that takes one function as an input, and spits another function back out. like d(x^2) = 2x. Where is the fixed-point of a derivative? One of them is when you put e^x into it. d(e^x) = e^x. I’m sure there are more.

This is important because if you can find the point in which a function can return a function unchanged, you can use that to call the function again, which is what we call recursion. And all the trickiness you see in ycombinator that you see is mainly a result of functional programming not keeping state, so you have to pass everything you need into a function. So if you have to do recursion, you have to have a mechanism pass the function itself, so you can call it again. And this mechanism kinda bends back onto itself, and that’s why you see a part of the ycombinator repeated twice in the code above, and in the lambda calculus.

It seems pretty inane to use ycombo given that modern high level languages provide named recursions, and if anything, for loops and iterators with closures. But what if you don’t? How do you process lists/arrays without loops or named recursion? Generally, you’d have to make your own recursion mechanism.

When would you ever have languages that don’t have those things? Probably not often. But Forth comes to mind. I’ve never used Forth, but from what little I know about it, the language starts off with some basic primitives and that’s it. No loops, no if statements. How do you get anything done? Well, you build your own control structures. People use Forth because it’s possible to build your own very small compiler from the ground up written in Forth itself, and still understand the entire thing. I suspect it’s used in embedded programming because of that reason.

So you’d pretty much use it when you want to have a powerful system from first principles. I’m guessing it’s possible to build your own computer by implementing only functions and substitution rules in hardware, and then you can derive everything else, numbers, pairings, and even recursions in software. That way, you keep your hardware costs down, while retaining the power of a Turing machine.

Speculating here…but another thing that might be interesting is that ycombinator might be a way to compress code. Software size should be reducible if the code is compressible. And recursion can be thought of as a compressing code, and expanding it when the recursion is run. I wonder if there’s other ways to reduce software bloat with the use of a ycombinator besides recursion?

(Aha!) Part of the reason why great hackers are 10 times as productive

I knew in college that some dudes were faster than I was in terms of programming. Since peer programming wasn’t exactly encouraged in college, and at work I did mostly prototyping work, I never really knew how fast other programmers worked.

So when I read Paul Graham (and Joel’s) claim that great hackers are at least ten times as productive as average programmers (too lazy to cite right now), I was kinda shocked. Surely, ten times is an order of magnitude! Something that takes an average programmer a 40 hour week to do the great hacker can do in a 4 hour afternoon?

I wondered about that, since there are times when I get stuck on something, then I just start stabbing around randomly out of frustration. I had assumed that great hackers were faster only because they had either the experience or insight to side-step whatever I was doing wrong.

But lately, I’ve been re-reading everyone’s essays that write about programming productivity. And one thing that caught my eye the second time around was when Paul Graham was talking about bottom up programming and how he didn’t really believe in objects, but rather, he bent the language to his will. He was building new blocks for himself so he could think about the problem at a higher level of abstraction.

This is basic stuff! I mean, that’s the whole point of higher-level programming. When you refactor methods out, you’re creating a vernacular so that you can express the problem in terms of the problem domain, not in terms of general computing. This is much like if you want to drive a car, you’d want to be able to step on the gas, rather than time the firings of the pistons. And if you want to control traffic in a city, you’d rather tell all cars to go to a destination, rather than stepping on the gas and turning the steering wheel for each one.

But taken into the light of bending a language to your will, it makes it more clear for me as to how great hackers are ten times as productive. Great hackers are productive not only because they know what problems to sidestep and can problem solve systematically and quickly, but they also build a set of tools for the problem domain as they go along. They are very good pattern recognizers and will be able to generalize a particular pattern of code, so that they can use it again. But not only that, great hackers will create an implicit understanding attached to the abstraction, ie. what we might call common sense.

A case in point. Before Ruby, I’d used for loops over and over again, never really thinking that I could abstract a for loop. It wasn’t until they were taken away in Ruby did I realize that map, inject, and their cousins are all abstractions of the for loop. When I see “map” I know that it performs a transformation on every element. But I also know that the array I get back will be the same size, that each element operation doesn’t affect other elements, among other things. These are implicitly stated, and they allow for shorter code.

When that happens, you can simply read “map”, and get all the connotations it comes with, and hence it comes with meaning. It also becomes easier to remember, since it’s a generalized concept that you can apply in different places in the code. The more times you use it, the easier it is to remember, instead of having specialized cases of the same kind of code where the behavior is different in different parts of the code.

A great hacker will take the initial time upfront to create this generalized code, and will save in the long run being able to use it. Done over and over again, it all adds up. So it’s not that for any given problem, a great hacker will be done in 4 hours what it takes an average programmer 40 hours, but that over time, a great hacker will invest the time to create a tools and vocabulary that lets him express things easier. That leads to substantial savings in time over the long haul.

I hesitated writing about it, as it’s nothing I (nor you) haven’t heard before. But I noticed that until recently, I almost never lifted my level abstraction beyond what the library gave me. I would always be programming at the level of the framework, not at the level of the domain. It wasn’t until I started writing plugins for rails extracted from my own work and reading the Paul Graham article that a light went off for me. It was easier to plug things like act_as_votable together, rather than to still mess around with associations (at the level of the framework). I still believe you should know how things work underneath the hood, but afterwards, but all means, go up to the level of abstraction appropriate for the problem domain.

DSLs (Domain specific languages) are really just tool-making and vernacular creation taken to the level of bending the language itself. It’s especially powerful if you can add implicit meaning to the vernacular in your DSL. It’s not only a way of giving your client power in their expression, but it’s also a refactoring tool so that you can better express your problem in the language of the problem domain. Instead of only adding methods to your vernacular, you can change how the language works. It was with this in mind that I did a talk on DSLs this past weekend at the local Ruby meetup. First part is on Dwemthy’s Array, and the second is using pattern matching to parse logo. Both seemed pretty neat when I first read about it. Enjoy!


DSL in Ruby through metaprogramming and pattern matching

Getting rid of emacs 23 splash screen

Pretty Emacs Reloaded

The latest emacs 23 is pretty neat, but it noticeably has that splash screen in the beginning. It’s generally good practice for an app not to present new users with just a blank screen. There’s should be something there to say, ‘hey, these are the next steps’.

That said, I’ve found it annoying, probably because I’m use to it not being there. So I’ve found that you can turn it off through this comment.


(setq inhibit-splash-screen t)

Just put that in your .emacs file in your home directory, and you’re all set. tip~

Generating rdoc for gems that refuse to generate docs

I recently upgraded to capistrano 2.1, and it’s woefully lacking in documentation. Jamis Buck had already picked a documentation coordinator about a month ago, but nothing seemed to be happening since.

So it was time to go dumpster diving in cappy code. I at least wanted to see what the standard variables were. To my surprise, there were some docs in code, but I couldn’t generate it with


gem rdoc capistrano

For those of us that had never made a gem, all you have to do to force it to is to edit the associated specification for the gem (/usr/lib/ruby/gems/1.8/specifications/), and add


s.has_rdoc = true

Maybe if I dig enough stuff out of it, I’ll pose some prelim documentation for cappy 2.1 and donate it to the documentation effort.

As an aside and musing, ideally, the code itself would be documentation. However, just because you’re reading code, doesn’t mean that you can get the overall picture of how to use it. Even if you filtered out the details, and saw just the class and method declarations, that still wouldn’t be enough since you can’t see how things fit together. I don’t know what a good solution would be. The simplest API is often complex enough to be nearly useless without good documentation.

Small tip!

The user owns this and that

In most any community-based website, your users are creating objects. And sometimes, you’d only like to display them when they’re owned by the user. Usually, it’s easy enough to do the check based on the association in the template views.



"story"

However, for the sake of perhaps over-reaching, but something more readable, we try:


class Story < ActiveRecord::Base
belongs_to :account

def owned_by?(user)
self.account == user
end
end



"story"

But this gets to suck, because you have duplicate code in different classes that are just slightly different due to association names. One way to solve it is to put it in a module, and include it in different classes. After all, that’s what mixins are for.


module Ownership
def owned_by?(user)
self.account == user
end
end

class Story < ActiveRecord::Base
include Ownership
belongs_to :account
# blah blah blah
end

class Post < ActiveRecord::Base
include Ownership
belongs_to :account
end

Or alternatively, you can put it in the account, but in reverse. But now, you have to search through the associations of the owned ActiveRecord object.


class Account < ActiveRecord::Base
def owns?(ar_object)
ar_object.class.reflect_on_all_associations(:belongs_to).detect { |assoc|
ar_object.send(assoc.name) == self
} ? true : false
end
end

I find the tertiary operator to be kinda ugly there, but it doesn’t make sense to return the reflection object. Regardless, this lets you do:



"story"

However, this doesn’t work for self-associated AR objects, or objects that have two or more belongs_to to the same account. It relies on a unique belongs_to association for every object belonging to account. I’m not sure yet, which way’s the better way to go, and in the end it probably doesn’t matter as much, but I do like being able to say user.owns?(anything) for any object without really thinking about what the associations are. half-tip.

A simple distributed crawler in Ruby

A week ago, I took a break from Mobtropolis, and…of all things ended up writing a simple distributed crawler in Ruby. I hesitated posting it at first, since crawlers are conceptually pretty simple. But eh, what the heck.

This was just an exercise to do some DRb and Hpricot, so don’t use this for your production work, whatever it may be. An actual crawler is far more robust than what I wrote. And don’t keep it running hammering at stuff, since it’ll get you banned.

First, this is how you use it:


WebCrawler.start("http://en.wikipedia.org/") do |doc|
puts "#{doc.search("title").inner_html}"
end

And that’s it. It returns documents in an XPath traversable form, courtesy of Hpricot.

A web crawler is a program that simply downloads pages, takes notes of what links there are on that page, and puts those links on its queue of links to crawl. Then it takes the next link off its queue and downloads that page and does the same thing. Rise and Repeat.

First, we create a class method named start that creates an instance of a webcrawler and then starts it. Of course, we could have done without this helper method, but it makes it easier to call.


module Crawler
class WebCrawler
class << self
def start(url)
crawler = WebCrawler.new
crawler.start(url) do |doc|
yield doc
end
return crawler
end
end
end

So next, we define the initialization method.


module Crawler
class WebCrawler
def initialize
puts "Starting WebCrawler..."
begin
DRb.start_service "druby://localhost:9999"
puts "Initializing first crawler"
puts "Starting RingServer..."
Rinda::RingServer.new(Rinda::TupleSpace.new)

puts "Starting URL work queue"
@work_provider = Rinda::RingProvider.new(:urls_to_crawl, Rinda::TupleSpace.new, "Queue of URLs to crawl")
@work_provider.provide

puts "Starting URL visited tuple"
@visited_provider = Rinda::RingProvider.new(:urls_status, Hash.new, "Tuplespace of URLs visited")
@visited_provider.provide
rescue Errno::EADDRINUSE => e
puts "Initialize other crawlers"
DRb.start_service
end
puts "Looking for RingServer..."
@ring_server = Rinda::RingFinger.primary

@urls_to_crawl = @ring_server.read([:name, :urls_to_crawl, nil, nil])[2]
@urls_status = @ring_server.read([:name, :urls_status, nil, nil])[2]
@delay = 1
end
end
end

This bears a little explaining. The first webcrawler you start will create a DRb server if it doesn’t already exist and do the setup. Then, every subsequent webcrawler it’ll connect to the server and start picking URLs off the work queue.

So when you start a DRb server, you call start_server with a URI, then you start a RingServer. What a RingServer provides is a way from subsequent clients to find services provided by the server or other clients.

Next, we register a URL work queue and a URLs visited hash as services. The URL work queue is a TupleSpace. If you haven’t heard of TupleSpace, the easiest way to think of it is as like a bulletin board. Clients post items on there, and other clients can take them out. This is what we’ll use as a work queue of URLs to crawl.

The URLs visited is a Hash so we can check which URLs we’ve already visited. Ideally, we’d use the URL work queue, but DRb seems to only provide blocking calls for reading/taking from the TupleSpace. That doesn’t make sense, but I couldn’t find a call that day. Lemme know if I’m wrong.


module Crawler
class WebCrawler
def start(start_url)
@urls_to_crawl.write([:url, URI(start_url)])
crawl do |doc|
yield doc
end
end

private

def crawl
loop do
url = @urls_to_crawl.take([:url, nil])[1]
@urls_status[url.to_s] = true

doc = download_resource(url) do |file|
Hpricot(file)
end or next
yield doc

time_begin = Time.now
add_new_urls(extract_urls(doc, url))
puts "Elapsed: #{Time.now - time_begin}"
end
end
end
end

Here is the guts of the crawler. It loops forever taking a url off the work queue using take(). It looks for a pattern in the TupleSpace, and finds the first one that matches. Then, we mark it as ‘visited’ in @urls_status. Then, we download the resource at the url and use Hpricot to parse it into a document then yield it. If we can’t download it for whatever reason, then we grab the next URL. Lastly, extract all the urls in the document and add it to the work queue TupleSpace. Then we do it again.

The private methods download_resource(), extract_urls(), and add_new_urls() are merely details, and I won’t go over it. But if you want to check it out, you can download the entire file. There are weaknesses to it that I haven’t solved, of course. If the first client goes down, everyone goes down. Also, there’s no central place to put the processing done by the clients. But like lazy textbook writers, I’ll say I’ll leave that as an exercise for the readers. snippet!


webcrawler.rb

Communicating your intent in Ruby

I’ve been using Ruby most everyday for about two years now. While I’m no expert, I know enough to be fairly productive in it. And beyond liking the succinctness and power that you often hear other people talk about, it’s made me a better programmer. But there’s an aspect of Ruby that worries me somewhat.

To start, programming is recognized rightfully as a means to build something from pure thought. But it’s also a form of communication, to other programmers that will touch your code later, and to yourself when you look at it months from now. We’re at a point that other than embedded and spacecraft programming, we have the luxury of using programming languages that focus ease for the programmer, rather than for the ease of the machine. Fundamentally, that’s the philosophy that Ruby takes.

And while Ruby’s nice in a lot of ways, I’m not sure about how it communicates an object’s interface. When you’re allowed to modify objects and classes on the fly, how do you communicate interfaces between modules you mixin and methods/modules you add? By interface, I mean, how do you use this class so that it does what it’s suppose to? Normally, it’s pretty obvious–you look at the names of the methods declared in the code. A well-written class has public methods exposed, or you look at its ancestor’s public methods. You might need some documentation to figure out how to call them in the right order, but generally, you have some idea just by looking at the method signatures.

However, when you throw mixins and metaprogramming in the mix, it becomes less easy to tell just from looking at the method signatures in the code–the structure of the code. You have to specifically read the code, or you have to rely on someone who knew intent to document it in detail.

An example communicating interfaces for mixins: the module Enumerable contains a lot of the Collections related methods. The cool thing is that if you wanted these functions in your own class, all you have to do is define each() in your class, mixin the Enumerable module, and you get all of these “for free”. However, outside of documentation explicitly stating it, it’s not as immediately obvious in method signatures that this is what you have to do in order to use it. It’s only after scanning through the entire code that you notice each() being used for all the methods.

Of course, Ruby contains enough metaprogramming power to protect yourself against this. one can do something like this:


class MethodNeededError < RuntimeError
def initialize(method_symbol, klass)
super "Method #{method_symbol.to_s} needs to be in client class #{klass.inspect}"
end
end

module Enumerable
def self.included(mod)
raise MethodNeededError.new(:each, mod) unless mod.method_defined?(:each)
end
end

This only works if you put the include after you define each(). That’s just asking for trouble when the order of your definitions in your class matter.

A fair number of people are writing mini-DSLs in ruby using metaprogramming tricks. One of the common ones is the use method_missing to define or execute methods on the fly. ActiveRecord’s dynamic finds are implemented this way. The advantage of communication of interface here in the structure of the code is obvious. Unless it was documented well, you can’t tell just by looking at the method signatures.

Why do I harp on interface signatures? I mean, in the instance of requiring each(), it works by just letting it fail in the enumerated methods, since it’ll complain about each itself. In the instance of method_missing, just read the regex in the body. While these are true, none of these allow for rdoc to generate proper documentation. The whole point of documentation is to show you the interface–how to use that piece of code. I’m just afraid that given Ruby’s philosophy of being able to write clear, powerful, and succinct code, it might fall short when people start using these metaprogramming tricks like alias_method_chain and method_missing more and more. Maybe rdoc needs to be more powerful and read code bodies for regex in method_missing?. It already documents yields in code bodies, but that seems awfully specific.

I’m not a exactly a fan of dictating interfaces like in Java. When you’re first coding something up, you’re sketching, so things are bound to change. Having plumbing like interface declaration gets in the way, imo. However, when something’s a bit more nailed down, it’d be nice to be able to communicate to other programmers your intent without them having to read code bodies all the time.

In the end, I side on flexibility. However, I kinda wish Ruby had some type of pattern matching for methods so I didn’t have to read method_missing all the time. But then again, that would be messy in all but the simplest schemes. Can you imagine a class that responded to email addresses as method calls? I guess I’d have to file this one under “bad ideas”

Don’t reopen ActiveRecord in another file

The power of Ruby lies partially in how one can reopen classes to redefine them. Besides namespace clashes, this is usually a good way to extend and refine classes to your own uses. However, last night, I got bitten in the ass trying to refactor a couple classes. In Rails, you’re allowed to extend associations by adding a class the association call.


class User < ActiveRecord::Base
has_many :stories, :through => :entries, :source => :story,
:extend => StoryAssociationExtensions
end

where StoryAssociationExtensions is a class module holding methods, like expired() that I can perform on the challenges association, so I can do stuff like


@user = User.find(:first)
@user.stories.expired # gives all expired stories

So when refactoring and cleaning up, I renamed StoryAssociationExtensions to AssociationExtensions and reopened up Story class and put it in there. I just wanted to clean up the namespace, and put the association extensions somewhere that made semantic sense. Naturally, I thought putting association extensions for a class belongs in a class. Well, it doesn’t work. And don’t do it. Hopefully, I’m saving you some pain.


class Story < ActiveRecord::Base
module AssociationExtensions
def expired
self.select { |c| c.expired? }
end
end
end

Well, this works if you’ve reopened the class within the same model file, story.rb in this case. However, if you reopen the class in another file elsewhere, your model definition won’t get loaded properly, which leads to associations and methods you defined not to exist.

So imagine my bewilderment when associations didn’t work on only certain ActiveRecord Models. In addition, they worked on the unit tests and script/console, but didn’t work when the server was running. All that at 3am in the morning. 😦

Good thing for source control, so I could revert (but I have to say, svn isn’t as easy to use as it could be).

I ended up creating a directory in model called collection_associations and putting the associations in there under a module CollectionAssociations namespace. Not exactly the best arrangement but it’ll do for now.

I’m still not sure why ActiveRecord::Base instances don’t like being reopened, but I’m guessing it has something to do with only getting loaded once. If anyone has an explanation, I’d like to read about it.

free warning!