Taco Bell Programming
Ted Dziuba has an interesting article on what he calls Taco Bell Programming; it’s worth reading – there is a lot of value in the concept he’s promoting. I had some concerns about the practicality of the approach, so I ran some tests.
I produced a 297MB file containing 20 million lines of more-or-less random data:
ruby -e '20000000.times { |x|
puts "#{x}03#{rand(1000000)}"
}' > bigfile.txt
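A quick sanity check before kicking off runs that take an hour (this wasn't part of the original test, just a habit worth having): confirm the file really has the expected size and line count.

wc -l bigfile.txt   # should report 20000000
du -h bigfile.txt   # should show roughly 300M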
Then I ran:
cat bigfile.txt |
gawk -F '\03' '{print $1, $0}' |
xargs -n2 -P7 printf "%s -- %s\n" > newfile.txt
This is a baseline: a real data producer, such as a process pulling from a database, is going to produce data more slowly than `cat`, and `printf` is going to write lines more quickly than whatever we're actually doing to process the data. Here are the results:
| xargs -P | Time |
|----------|------|
| 7        | 68m  |
| 3        | 88m  |
| 1        | 241m |
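For anyone reproducing this: a simple way to get comparable numbers is to wrap the whole pipeline in bash's time and vary -P between runs, along these lines.

time ( cat bigfile.txt |
       gawk -F '\03' '{print $1, $0}' |
       xargs -n2 -P7 printf "%s -- %s\n" > newfile.txt )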
xargs was doing all of the work in this test, as far as `top` could tell, but running multiple processes in parallel helped. This confirmed my suspicions: xargs has to spawn a new Linux process for every line, which is not cheap. For comparison, a similar program written in Erlang (not known for being the fastest language in the world) was able to process the same amount of data, on a machine with half as many cores, in 20 minutes. On top of that, the machine it ran on was also (1) running a process pulling data from a SQL Server database, which itself consumed 90% of a core, (2) inserting the results into a MongoDB database, and (3) running the MongoDB server that was being inserted into. So: half the resources, doing much more work, and it still ran over 3x as fast.
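I didn't re-run the benchmark this way, but if the per-line fork and exec really is the bottleneck, xargs itself offers a cheap mitigation: printf reuses its format string when given extra arguments, so raising -n hands each printf a thousand lines at a time instead of one. A sketch:

cat bigfile.txt |
gawk -F '\03' '{print $1, $0}' |
xargs -n2000 -P7 printf "%s -- %s\n" > newfile.txt

Output order isn't preserved with -P greater than 1 in either version, so this only makes sense when, as here, the line order in newfile.txt doesn't matter.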
Ted’s main point – that code is a liability – is still valid, and it’s always worth asking yourself, when starting a project, whether you could solve the problem with tools like these. However, take the approach with a grain of salt: if you can afford lackluster performance, it’s probably a worthwhile solution; if performance is any sort of consideration, you may need to look for other solutions.