A few years ago I wrote the following code.
When I first wrote it, Data.Map was at least 10 times as fast as Data.IntMap. A few months later it was the opposite, and then it switched again with one of the later GHC 6 series releases. Now, with 7.0.3, IntMap is back to being faster and uses much less memory. I love writing Haskell, but I wish the performance weren’t so mysterious at times.
My Ongoing Battle with Data.Map and Data.IntMap
PostgreSQL and MapReduce
I recently had to analyze some logs, and my first choice of tool was Hive. With its SQL-like interface and good scalability, it makes a lot of large-scale data analysis easy. Once I started planning the details of my analysis, two things made me change my mind. The first was that I would have to either reformat the data before giving it to Hive or write an InputFormat. The data was in a multiline format, so writing an InputFormat for it wouldn’t have been particularly easy, and the data was small enough that a single machine could do the reformatting, so MapReduce wasn’t necessary for that step. The second was that, since the data wasn’t that large, the significant overhead Hive has for starting jobs would probably outweigh the speed advantage I could get from using multiple machines. I ended up using a Ruby script to generate a CSV file from the logs and then used PostgreSQL’s COPY command to load it. Not as exciting as writing a distributed MapReduce program in Haskell, but it worked really well.
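To give a rough idea of what the reformatting step involves, here is a minimal sketch in Python rather than the Ruby I actually used. The log format is an assumption made purely for illustration: each record is a run of "key: value" lines, and records are separated by blank lines. The field names, the sample data, and the function names are all hypothetical.

```python
import csv
import io


def parse_records(lines):
    """Yield one dict per record from an iterable of log lines.

    Assumed (hypothetical) format: each record is a run of "key: value"
    lines, and records are separated by one or more blank lines.
    """
    record = {}
    for raw in lines:
        line = raw.strip()
        if not line:
            # A blank line closes the current record, if there is one.
            if record:
                yield record
                record = {}
            continue
        # Split on the first colon only, so values may contain colons.
        key, sep, value = line.partition(":")
        if sep:
            record[key.strip()] = value.strip()
        # Lines without a colon are ignored in this sketch.
    if record:
        # The final record may not be followed by a blank line.
        yield record


def records_to_csv(records, columns, out):
    """Write records to `out` as CSV with a header row.

    Missing fields become empty strings; fields not listed in
    `columns` are dropped.
    """
    writer = csv.DictWriter(
        out, fieldnames=columns, restval="", extrasaction="ignore"
    )
    writer.writeheader()
    for record in records:
        writer.writerow(record)


def convert(text, columns):
    """Convert multiline log text to a CSV string."""
    out = io.StringIO()
    records_to_csv(parse_records(text.splitlines()), columns, out)
    return out.getvalue()


# Made-up sample input in the assumed format.
SAMPLE_LOG = """\
time: 2011-05-01 12:00:00
user: alice
action: login

time: 2011-05-01 12:00:05
user: bob
action: search, "haskell"
"""

if __name__ == "__main__":
    print(convert(SAMPLE_LOG, ["time", "user", "action"]), end="")
```

The csv module takes care of quoting values that contain commas or quotes, which is the part that tends to go wrong with hand-rolled string joins. A file produced this way could then be loaded with something like `COPY log_entries FROM '/path/to/entries.csv' WITH CSV HEADER`, where the table name and path are placeholders.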
In this case, the analysis I wanted to do could have been done with something like Dremel, which would have given me both quick queries and multiple machines, but not everyone has access to a system like Dremel. What everyone does have access to is Hive and PostgreSQL. I’ve found that almost every time I want to analyze a lot of data, there’s only a small part of it that I’m actually interested in, and Hive is usually great for extracting that part. By combining Hive’s ability to export data as a CSV file with PostgreSQL’s previously mentioned COPY command, you can easily build a pipeline that processes more data than fits on one computer and still have quick queries for iterative analysis. Another option is for someone to build an open source version of Dremel, but that’s considerably more work.
