Find all posts with at least 2 comments (Or how I finally learned SQL with a rant afterwards)

Honestly, it’s shameful that I just never took the time to really learn the aggregation methods in SQL. I did lots of SQL in a database class a long time ago, but after picking up ActiveRecord, I’ve just been lazy and getting by with u.contacts.find(:all).select { u.traits.count > 2 } which I knew to be slow, but it was always just a one off query in the console.

Recently, I’ve had to write an admin panel that queries our database of users for some measurable stats. It’s actually quite simple using group by and having, especially when you’re operating on one table.

But in order to find all posts with at least 2 comments, you’ll need to do a proper join on the tables, group by the right id. In this case, you’ll need to start with posts table and do a left outer join with comments table, and then group all of them by post.id

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
# User has_many :posts
# Post has_many :comments

class Post < ActiveRecord::Model
  class << self
    def posts_with_ruby
      u = User.find_by_email("iamwilhelm@twitter.com")
      u.posts.find(:all).select { u.comments.count > 2 }
    end

    def posts_with_include
      u = User.find_by_email("iamwilhelm@twitter.com")
      u.posts.find(:all, :include => :comments).select { u.comments.count > 2 }
    end

    def posts_with_sql
      u = User.find_by_email("iamwilhelm@twitter.com")
      Post.find(:all,
                   :select => "posts.*, count(comments.id) as comment_num",
                   :joins => "left outer join comments on comments.post_id = posts.id",
                   :conditions => ["posts.user_id = ?", u.id],
                   :group => "posts.id, posts.name, posts.created_at, posts.updated_at, posts.user_id",
                   :having => "count(comments.id) > 2")
    end
  end
end

# > ./script/performance/benchmarker 100 "Post.posts_with_ruby" "Post.posts_with_include" "Post.posts_with_sql"
# user system total real
# #1 2.790000 0.170000 2.960000 ( 4.267908)
# #2 4.570000 0.210000 4.780000 ( 6.206539)
# #3 0.220000 0.010000 0.230000 ( 0.323412)
# Loaded suite ./script/performance/benchmarker

Note that I’m using postgres, which requires you to group by all the attributes you’ve selected. Since I selected post.*, I need all those attributes in the group argument. there probably is a shorter way to express it (maybe ‘post.*’ ?), but I haven’t tried it. 

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
class Post < ActiveRecord::Base
  named_scope :with_comment_count, lambda { |expression, num|
    { :select => "posts.*, count(comments.id) as comment_count",
      :joins => "left outer join comments on comments.post_id = posts.id",
      :group => "posts.id, posts.name, posts.created_at, posts.updated_at, posts.user_id",
      :having => "count(comments.id) #{expression} #{num}" }
  }
end

u = User.first

# finds all posts of a user with comment count greater than 2
u.posts.with_comment_count(:>, 2)

# finds all posts of a user with comment count less than 1
u.posts.with_comment_count(:<=, 1)

# finds all posts of a user with comment count not equal to 2
u.posts.with_comment_count(:"!=", 2)

Here, you can see that I benchmarked the different ways, and the SQL version is by far faster. Hell, let’s use it as a named scope

Pretty spiffy. Rant time.

But really, I wish I could do u.contacts.find(:all).select { u.traits.count > 2 } in the programming language of my choice. Being able to select the dataset of my choice through a method chain of filters is much easier to think about. You start with the entire set, and you start picking out records/objects you want and reject ones you don’t.

The complexity of SQL has much to do with having multiple tables to join. It means you have to know something about the structure of the database schema to use it in any meaningful way. Whenever I have to context-switch my brain from code to thinking about set combinations, it’s distracting and slows me down. It’s like how I have to think about doing memory management when coding in C. That’s much of the appeal of NoSQL databases for developers. There’s less to keep in mind. 

However, there are problems doing it this way. First, objects likes to bring everyone to the party, even those that aren’t invited. 

“The problem with object-oriented languages is they’ve got all this implicit environment that they carry around with them. You wanted a banana but what you got was a gorilla holding the banana and the entire jungle.” – Joe Armstrong

The whole point of a database is to store what you can’t fit in memory and grab only parts of the entire set you want to work on at a time. But where do you define the limits of how much to grab? Do you just pull in posts? Or do you pull in posts and all their comments?

Currently, we handle this manually, where we decide the extent of the object associations we want to instantiate from database records based on the work we know we’ll be doing. 

But sometimes, you may not know you need a smaller set at the time of the query until later in the program. You may call @posts = user.posts, but perhaps some time later in the method, you want all the comments with those posts, which then you’d need @posts.map(&:comments).flatten.sort.

This is a pity because mapping to comments goes to the database for every post, which has quite a bit of overhead, and sorting it in application layer is a waste because the database could have done that for us. To me, this has all the hallmarks of memory management–something that should be done for me.

I’m not sure what a good solution would look like. I know that ActiveRecord already caches the associations like post.comments, and using named_scope, it performs lazy evaluation on the set only when you need it at the last minute. But for now, it doesn’t cover many of the cases, and we’re forced to handle it ourselves manually.

If they can copy a fighter jet, they can copy your damn website

Media_httpcachegawker_kcxkj

They copy computers, cellphones, cameras, video players, watches, bags, and everything man-made—but did you know that the Chinese also copy entire combat airplanes? It’s simple: They buy one and measure every part and system to clone it.

Lately, I’ve taken note of both ASmartBear’s It’s not a Competitive Advantage and Sachin’s assertion (can’t find the quote right now) that when investors ask him about defensibility, he replies that they don’t understand the internet.

If they can copy a fighter jet, they can copy your damn website.

What’s missing from a copy? I’d say intent is one of them. Especially if your product iterates and always tries to do something hard. It’ll take them longer to copy, and they’ll always be on the defensive. For actual competitive advantages, check out another list by the bear.

More dead than alive

At some point soon we’ll start having an awful lot of dead people around. I’m very sure Google and other search engines will start to filter for deceased people search. It won’t be long at at until we have way more dead people online than live people online.

That’s a thought I never thought about. But it may be true.

But what would even be more memorialistic(?) is if we could have online versions of ourselves running around the online world commenting and participating, as if we were alive.

I remember thinking this when I saw “I, Robot”, where Will Smith talks to a holographic 2D projection of a dead investigator for clues, but the investigator can only reply to what he knew up to the point of his death.

Though I’m not sure we can recreate something like what’s in the movie, it’s not infeasible to be able to feed all the text archived in all your chat logs, emails, wall posts, and status messages into some machine learning algorithm so given some text directed at you, it would be able to generate what you would mostly likely say in reply. It may be just an elaborate hidden markov model or something else, but either way, just a elaborate quote machine–that quotes you.

It could be a service that you’d go to, in order to talk to someone dead–like at an oracle, shrine, or memorial. Or it would be a service that let your personality loose online and your facsimile would participate in the diggs, reddits, and facebooks of the day.

I don’t even want to think about what this would mean for future religious figures born today. Imagine if you could talk to Jesus or Budda’s quote machine.

It’s a bit creepy, to be able to talk to a facsimile of a dead person. But it could be it’s just because we’re not use to the idea. I suppose it’s the same as when photographs came out and you could see dead people.

However, just as the dead can’t reach out from photographs to the present, quote machines of the dead wouldn’t be able to reach out from the past into the present. At least, I don’t know that they should.

Picking the right problem

Nowadays, software can be updated continuously and iteratively. This lets us build what we often call minimum viable product, or MVP, and improve it from there. However, most of us don’t end up building MVP despite best intentions. 

The reasons are varied, from fear of negative feedback, not knowing what MVP really looks like, or thinking that your early customers want more than they actually do. The first problem founders-to-be come across when doing MVP is actually picking the scope for the problem.

When you’re first building a product, you spend time thinking about what features should be included and what the benefits would be to the end-user. Then you start thinking of all the things that a user might need before they start using it. The vision for your product is big and could go in any number of directions. The opportunity and potential may be huge!

I think this is the wrong way to go about it. Before you start thinking about all that, you need pick one problem to focus on. This problem is what you’re going learn whether people even want this problem solved.

But even when you find a problem that people want solved, the next most common pitfall is picking the wrong size of problem to work on. We often try to solve a version of the problem that’s too big. Don’t try save the world on first shot. If you can subdivide the problem into smaller problems, if it’s a problem to different types of people, or solves related problems in different contexts, your problem is too large.

The advantages of picking the right, focused problem are many. It makes the product easy enough to do, it’s tractable, and you can see an end in sight–which is extremely motivating. In addition, it’s easy to explain to others. Don’t underestimate the power that stems from the ease of conveying your idea and the problem it’s trying to solve.

Don’t worry about the problem being too shallow. All problems are interesting when you look deep enough. 

Paypal on the surface seems easy enough in the beginning. It’s like a web-bank that only does transfers, and you don’t even need to connect to financial institutions for paypal to paypal transactions. But what people don’t realize in trying to copy paypal is that fraud will increase the more popular your service gets, and that will eventually kill you unless you get it under control. 

Twitter seemed easy enough as well in the beginning. Search the web and you’ll read a lot of naysayers back in 2006-2008 where they can’t understand why posting a message to the web generates so many fail whales. By nature of the interconnectedness of the data in producing the feeds for a given user with a specific set of followers, it’s actually a non-trivial problem.

Groupon has been interesting because it changed how people thought of the intersection between local businesses and online ecommerce. With groupon, you go into group buying agreements with other strangers on the web to get deals. However, it’s actually a small part of a larger set of problems that you could solve with a similar set of mechanics, such as group campaigning and petitioning. Instead of trying to solve all these problems, the founder decided to just focus on one: group buying.

And even if you pick the wrong problem type or size, you can always iterate. There will always be more potential users to ask whether they have this particular problem. There will always be a myriad of interesting problems centered around human needs and wants. There’s no bank account that you’re withdrawing from when you ask people about whether they have a particular problem, as the world likes nothing better than people that are looking to solve their problems.