Using GNU Parallel to roll-your-own Map Reduce!

Map Reduce is a popular paradigm for processing big data in parallel, but using it on multi-core servers and server clusters has primarily been the purview of frameworks like Hadoop.

Well guess what: rather than mucking around with big nasty Java systems like Hadoop, you can run Map-Reduce style calculations with GNU Parallel. Yes, you can use all the cores on your server as well as *other* servers nearby (usually the ones in your cluster).

This is particularly handy when you want to operate on big files or a lot of files, and you know your tool (like grep) is not designed to use all your cores.

Here’s how to apply Map Reduce in a setup like this. Imagine you have a directory with thousands of log files, and you want to search through them, pull out every integer, and then print the mean of those numbers. You have a multi-core box, and you get anxious when you look at htop and see all those lonely cores looking for work. You get the idea.

 

So, assume your logs are in the directory /var/log/ 

sudo find /var/log/ -type f | sudo parallel -j0 egrep -I -i -o '[[:digit:]]+' {} | awk '{s+=$1} END {print s/NR}'

…and you just mapped and reduced the shit out of those files, using all of your cores on your local machine.

Here’s the pipeline:

  • Collection Step:
  • sudo find /var/log/ -type f

This is basically collecting the list of file names you want to process.
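If you only want the actual log files rather than everything under /var/log/ (rotated gzip archives, binary journals, and so on), find can filter the list for you. A small variation, assuming your logs end in .log:

sudo find /var/log/ -type f -name '*.log'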

  • Mapping Step (one mapper per file, spread across your cores):
  • sudo parallel -j0 egrep -I -i -o '[[:digit:]]+' {}

This is GNU Parallel turning each file into a map task and spreading those tasks across your cores, like Hadoop! It takes all your input files and runs as many jobs at once as it can (that’s the -j0 flag; without it, Parallel defaults to one job per core). The egrep call is just scraping out every instance of an integer; you could do whatever you want here.
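The same trick works when the problem is one enormous file instead of thousands of small ones: GNU Parallel can split a single file into chunks and hand each chunk to a mapper on its own core using --pipepart and --block. A hedged sketch, where huge.log is just a stand-in name for your big file:

sudo parallel -j0 --pipepart -a /var/log/huge.log --block 100M egrep -o '[[:digit:]]+' | awk '{s+=$1} END {print s/NR}'

With --pipepart each egrep reads its chunk on stdin, which is why there is no {} in the command.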

  • Reducer Step (one reducer in this case):
  • awk '{s+=$1} END {print s/NR}'

Once the mappers above have all emitted their integers, this reducer task takes the whole list, sums it, and divides by the count to print the mean.
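If that single reducer ever becomes the bottleneck, you can push part of the reduction into the mappers, Hadoop-combiner style: each mapper emits a partial sum and count for its file, and the final awk merges them. A hedged sketch of that variant (the \$1 escape keeps your shell from expanding it before the per-file awk sees it):

sudo find /var/log/ -type f | sudo parallel -j0 "egrep -I -i -o '[[:digit:]]+' {} | awk '{s+=\$1; n++} END {if (n) print s, n}'" | awk '{s+=$1; n+=$2} END {print s/n}'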

This really works like multi-core magic.
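And about those *other* servers: GNU Parallel can farm the map tasks out over SSH with -S (--sshlogin), where : means "this machine too". A hedged sketch, assuming host1 and host2 are stand-in names, that you have passwordless SSH to them, and that they can read the same log files (say, over a shared mount):

sudo find /var/log/ -type f | parallel -j0 -S :,host1,host2 egrep -I -i -o '[[:digit:]]+' {} | awk '{s+=$1} END {print s/NR}'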

2 thoughts on “Using GNU Parallel to roll-your-own Map Reduce!”

  1. Hello Aris Vlasakakis,

    Thanks for sharing! Your posts are awesome. I had only used GNU Parallel for compression before, but you gave nice illustrations of using parallel with sed and awk. It’s a real time saver. We have workstations and servers with anywhere from 24 to 160 threads and hundreds of gigabytes of RAM, but they are not used effectively for grep- and sed-like tasks when I watch htop. I frequently Google for efficient ways of managing and processing large chunks of genomic data. Similar to CPU threads, I would also like to know about allocating extra RAM for in-memory processing (like the concepts behind SAP’s HANA database).
    I hope you’ll keep posting optimizations like these.
