Taco Bell Programming
Ted Dziuba has an interesting article on what he calls Taco Bell Programming; it’s worth reading – there is a lot of value in the concept he’s promoting. I had some concerns about the practicality of the approach, so I ran some tests.
I produced a 297MB file containing 20 million lines of more-or-less random data:
ruby -e '20000000.times { |x|
puts "#{x}03#{rand(1000000)}"
}' > bigfile.txt
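A quick sanity check before kicking off runs that take an hour (this wasn't part of the original test, just a habit worth having): confirm the file really has the expected size and line count.

wc -l bigfile.txt   # should report 20000000
du -h bigfile.txt   # should show roughly 300M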
Then I ran:
cat bigfile.txt |
gawk -F '\03' '{print $1, $0}' |
xargs -n2 -P7 printf "%s -- %s\n" > newfile.txt
This is a baseline: a real data producer, such as a process pulling from a database, is going to produce data more slowly than `cat`, and `printf` is going to write lines more quickly than whatever we're actually doing to process the data. Here are the results:
| xargs -P | Time |
|----------|------|
| 7        | 68m  |
| 3        | 88m  |
| 1        | 241m |
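For anyone reproducing this: a simple way to get comparable numbers is to wrap the whole pipeline in bash's time and vary -P between runs, along these lines.

time ( cat bigfile.txt |
       gawk -F '\03' '{print $1, $0}' |
       xargs -n2 -P7 printf "%s -- %s\n" > newfile.txt )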
xargs was doing all of the work in this test, as far as `top` could tell, but running multiple processes in parallel helped. This confirmed my suspicions: xargs has to spawn a new Linux process for every line, which is not cheap. For comparison, a similar program written in Erlang (not known for being the fastest language in the world) was able to process the same amount of data, on a machine with half as many cores, in 20 minutes. On top of that, the machine it ran on was also (1) running a process pulling data from a SQL Server database, which itself consumed 90% of a core, (2) inserting the results into a MongoDB database, and (3) running the MongoDB server that was being inserted into. So: half the resources, doing much more work, and it still ran over 3x as fast.
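I didn't re-run the benchmark this way, but if the per-line fork and exec really is the bottleneck, xargs itself offers a cheap mitigation: printf reuses its format string when given extra arguments, so raising -n hands each printf a thousand lines at a time instead of one. A sketch:

cat bigfile.txt |
gawk -F '\03' '{print $1, $0}' |
xargs -n2000 -P7 printf "%s -- %s\n" > newfile.txt

Output order isn't preserved with -P greater than 1 in either version, so this only makes sense when, as here, the line order in newfile.txt doesn't matter.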
Ted’s main point – that code is a liability – is still valid, and it’s always worth asking yourself, when starting a project, whether you could solve the problem with tools like these. However, take the approach with a grain of salt: if you can afford lackluster performance, it’s probably a worthwhile solution; if performance is any sort of consideration, you may need to look for other solutions.