Taco Bell Programming

Ted Dziuba has an interesting article on what he calls Taco Bell Programming; it’s worth reading – there is a lot of value in the concept he’s promoting.  I had some concerns about the practicality of the approach, so I ran some tests.

I produced a 297MB file containing 20 million lines of more-or-less random data:

ruby -e '20000000.times { |x|
  puts "#{x}\003#{rand(1000000)}"
}' > bigfile.txt
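
A quick sanity check, if you want to reproduce this: wc should report the 20 million lines and a size in line with the 297MB above.

wc -lc bigfile.txt    # expect 20000000 lines and roughly 300MB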

Then I ran:

cat bigfile.txt |
  gawk -F '\003' '{print $1, $0}' |
  xargs -n2 -P7 printf "%s -- %s\n" > newfile.txt

This is a best-case baseline: a real data producer, such as a query against a database, will emit data more slowly than cat can, and printf will write its lines more quickly than whatever real processing we would be doing on each record.  Here are the results:

-P    Time
 7     68m
 3     88m
 1    241m
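
If you want to reproduce the measurements, a loop like this (bash) will time the same pipeline at each -P level; the absolute numbers will obviously vary with hardware:

for p in 1 3 7; do
  echo "-P $p"
  time ( cat bigfile.txt |
         gawk -F '\003' '{print $1, $0}' |
         xargs -n2 -P"$p" printf "%s -- %s\n" > newfile.txt )
done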

xargs was doing all of the work in this test, as far as top could tell, though running multiple processes in parallel clearly helped.  This confirmed my suspicion: xargs has to fork off a new process for every line of input (with -n2, each printf invocation handles exactly one line's worth of arguments), and that is not cheap.  For comparison, a similar program written in Erlang (not known for being the fastest language in the world) was able to process the same amount of data, on a machine with half as many cores, in 20 minutes.  On top of that, the machine it was running on was also: (1) running the process that pulled the data from a SQL Server database, which itself consumed 90% of a core, (2) inserting the results into a MongoDB database, and (3) running the MongoDB server being inserted into.  So: half the resources, doing much more work, and it still ran over 3x as fast.
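
For what it’s worth, the per-line fork cost can be amortized without leaving the shell: printf reuses its format string when it is given extra arguments, so raising xargs’s -n lets each invocation handle a thousand pairs at a time and fork far less often.  This is only a sketch; I haven’t timed this variant, so no promises about how much of the gap it closes:

cat bigfile.txt |
  gawk -F '\003' '{print $1, $0}' |
  xargs -n2000 -P7 printf "%s -- %s\n" > newfile.txt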

Ted’s main point – that code is a liability – is still valid, and it’s always worth starting a project by asking yourself whether you could solve the problem with such tools.  However, take the approach with a grain of salt: if you can afford lackluster performance, it’s probably a worthwhile solution; if performance is any sort of consideration, you may need to look elsewhere.