Monthly Archives: August 2013

Using GNU Parallel to roll-your-own Map Reduce!

Map Reduce is a popular paradigm for algorithms to process big data in parallel, but its use in multi-core servers and clustered servers has been primarily been the purview of frameworks like Hadoop.

Well guess what: rather than mucking around with big nasty Java systems like  Hadoop, you can actually run Map-Reduce style calculations using GNU Parallel. Yes, you can use all the cores on your server as well as *other* servers  nearby (in your cluster usually).

This is particularly handy when you want to operate on big files or a lot of files, and you know your tool (like grep) is not designed to use all your cores.

Here’s how to apply Map Reduce in a regime like this. Imagine you have a directory with thousands of files, and you know you want to search through a directory of thousands of logs, remove all the instances of integers, and then prints out the mean average of those numbers. You have a multi-core box  and you get anxious when you look at htop and see all those lonely cores looking for work. You get the idea.

 

So, assume your logs are in the directory /var/log/ 

sudo find /var/log/  -type f | sudo parallel egrep -I -i -o ‘[[:digit:]]+’ {} | awk ‘{s+=$1} END {print s/NR}’

…and you just mapped and reduced the shit out of those files, using all of your cores on your local machine.

Here’s the pipeline:

  • sudo find /var/log/  -type f

This is basically collecting the list of file names you want to process.

  • Mapping Step (Number of mappers equal to number of cores):
  • sudo parallel -j0 egrep -I -i -o ‘[[:digit:]]+’ {}

This is GNU Parallel, turning each core into a map task, like Hadoop! It is taking all your input files and using as many possible cores (-j0 flag). The ‘egrep’ call is just scraping out all instances of integers…you could do whatever you want here.

  • Reducer Step (One reducer in this case) 
  • awk ‘{s+=$1} END {print s/NR}’

Once the mappers above all emit their integers, this Reducer task effectively just takes list and adds them up.

This really works like multi-core magic.