Find all posts with at least 2 comments (Or how I finally learned SQL with a rant afterwards)

Honestly, it’s shameful that I just never took the time to really learn the aggregation methods in SQL. I did lots of SQL in a database class a long time ago, but after picking up ActiveRecord, I’ve just been lazy and getting by with u.contacts.find(:all).select { u.traits.count > 2 } which I knew to be slow, but it was always just a one off query in the console.

Recently, I’ve had to write an admin panel that queries our database of users for some measurable stats. It’s actually quite simple using group by and having, especially when you’re operating on one table.

But in order to find all posts with at least 2 comments, you’ll need to do a proper join on the tables, group by the right id. In this case, you’ll need to start with posts table and do a left outer join with comments table, and then group all of them by post.id

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
# User has_many :posts
# Post has_many :comments

class Post < ActiveRecord::Model
  class << self
    def posts_with_ruby
      u = User.find_by_email("iamwilhelm@twitter.com")
      u.posts.find(:all).select { u.comments.count > 2 }
    end

    def posts_with_include
      u = User.find_by_email("iamwilhelm@twitter.com")
      u.posts.find(:all, :include => :comments).select { u.comments.count > 2 }
    end

    def posts_with_sql
      u = User.find_by_email("iamwilhelm@twitter.com")
      Post.find(:all,
                   :select => "posts.*, count(comments.id) as comment_num",
                   :joins => "left outer join comments on comments.post_id = posts.id",
                   :conditions => ["posts.user_id = ?", u.id],
                   :group => "posts.id, posts.name, posts.created_at, posts.updated_at, posts.user_id",
                   :having => "count(comments.id) > 2")
    end
  end
end

# > ./script/performance/benchmarker 100 "Post.posts_with_ruby" "Post.posts_with_include" "Post.posts_with_sql"
# user system total real
# #1 2.790000 0.170000 2.960000 ( 4.267908)
# #2 4.570000 0.210000 4.780000 ( 6.206539)
# #3 0.220000 0.010000 0.230000 ( 0.323412)
# Loaded suite ./script/performance/benchmarker

Note that I’m using postgres, which requires you to group by all the attributes you’ve selected. Since I selected post.*, I need all those attributes in the group argument. there probably is a shorter way to express it (maybe ‘post.*’ ?), but I haven’t tried it. 

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
class Post < ActiveRecord::Base
  named_scope :with_comment_count, lambda { |expression, num|
    { :select => "posts.*, count(comments.id) as comment_count",
      :joins => "left outer join comments on comments.post_id = posts.id",
      :group => "posts.id, posts.name, posts.created_at, posts.updated_at, posts.user_id",
      :having => "count(comments.id) #{expression} #{num}" }
  }
end

u = User.first

# finds all posts of a user with comment count greater than 2
u.posts.with_comment_count(:>, 2)

# finds all posts of a user with comment count less than 1
u.posts.with_comment_count(:<=, 1)

# finds all posts of a user with comment count not equal to 2
u.posts.with_comment_count(:"!=", 2)

Here, you can see that I benchmarked the different ways, and the SQL version is by far faster. Hell, let’s use it as a named scope

Pretty spiffy. Rant time.

But really, I wish I could do u.contacts.find(:all).select { u.traits.count > 2 } in the programming language of my choice. Being able to select the dataset of my choice through a method chain of filters is much easier to think about. You start with the entire set, and you start picking out records/objects you want and reject ones you don’t.

The complexity of SQL has much to do with having multiple tables to join. It means you have to know something about the structure of the database schema to use it in any meaningful way. Whenever I have to context-switch my brain from code to thinking about set combinations, it’s distracting and slows me down. It’s like how I have to think about doing memory management when coding in C. That’s much of the appeal of NoSQL databases for developers. There’s less to keep in mind. 

However, there are problems doing it this way. First, objects likes to bring everyone to the party, even those that aren’t invited. 

“The problem with object-oriented languages is they’ve got all this implicit environment that they carry around with them. You wanted a banana but what you got was a gorilla holding the banana and the entire jungle.” – Joe Armstrong

The whole point of a database is to store what you can’t fit in memory and grab only parts of the entire set you want to work on at a time. But where do you define the limits of how much to grab? Do you just pull in posts? Or do you pull in posts and all their comments?

Currently, we handle this manually, where we decide the extent of the object associations we want to instantiate from database records based on the work we know we’ll be doing. 

But sometimes, you may not know you need a smaller set at the time of the query until later in the program. You may call @posts = user.posts, but perhaps some time later in the method, you want all the comments with those posts, which then you’d need @posts.map(&:comments).flatten.sort.

This is a pity because mapping to comments goes to the database for every post, which has quite a bit of overhead, and sorting it in application layer is a waste because the database could have done that for us. To me, this has all the hallmarks of memory management–something that should be done for me.

I’m not sure what a good solution would look like. I know that ActiveRecord already caches the associations like post.comments, and using named_scope, it performs lazy evaluation on the set only when you need it at the last minute. But for now, it doesn’t cover many of the cases, and we’re forced to handle it ourselves manually.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s