Use multiple CPU Cores with your Linux commands — awk, sed, bzip2, grep, wc, etc.

Here’s a common problem: have you ever wanted to add up a very large list (hundreds of megabytes), grep through it, or run some other operation that is embarrassingly parallel? Data scientists, I am talking to you. You probably have four cores or more, but our tried-and-true tools like grep, bzip2, wc, awk, sed and so forth are single-threaded and will just use one CPU core. To paraphrase Cartman, “How do I reach these cores?” Let’s use all of the CPU cores on our Linux box by using GNU Parallel and doing a little in-machine map-reduce magic with the little-known parameter --pipe (otherwise known as --spreadstdin). Your pleasure is proportional to the number of CPUs, I promise.
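
If you are not sure how many cores you have to play with, nproc (from GNU coreutils) will tell you:

nproc    # prints the number of processing units available to the current process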

BZIP2

So, bzip2 is better compression than gzip, but it’s so slow! Put down the razor, we have the technology to solve this. Instead of this:

cat bigfile.bin | bzip2 --best > compressedfile.bz2

Do this:

cat bigfile.bin | parallel --pipe --recend '' -k bzip2 --best > compressedfile.bz2

Especially with bzip2, GNU Parallel is dramatically faster on multi-core machines. Give it a whirl and you will be sold.
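
One commenter below ran into a corrupted archive (caused by copy-pasting a mangled version of the command), so it is worth sanity-checking the output; a minimal round-trip test, assuming the filenames above, might be:

bunzip2 -c compressedfile.bz2 > roundtrip.bin    # decompress the parallel-compressed archive
cmp bigfile.bin roundtrip.bin && echo round-trip OK    # compare byte-for-byte with the original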

GREP

If you have an enormous text file, rather than this:

grep pattern bigfile.txt

do this:

cat bigfile.txt | parallel --pipe grep 'pattern'

or this:

cat bigfile.txt | parallel --block 10M --pipe grep 'pattern'

This second command shows the use of --block, handing each job 10 MB of data from your file; you might play with this parameter to find out how much input you want to feed each CPU core at a time. I gave a previous example of how to use grep with a large number of files, rather than just a single large file; a quick sketch of that pattern is below.
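
That earlier post is not reproduced here, but the basic many-files pattern looks something like this (the *.txt glob is just a placeholder; -H keeps the filename prefix, since each grep only sees one file):

parallel grep -H 'pattern' ::: *.txt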

AWK

Here’s an example of using awk to add up the numbers in a very large file. Rather than this:

cat rands20M.txt | awk '{s+=$1} END {print s}'

do this!

cat rands20M.txt | parallel --pipe awk \'{s+=\$1} END {print s}\' | awk '{s+=$1} END {print s}'

This is more involved: the --pipe option in parallel spreads the input out over multiple chunks for the awk call, giving a bunch of sub-totals. These sub-totals go into the second pipe with the identical awk call, which gives the final total. The first awk call has three backslashes in it due to the need to escape the awk call for GNU Parallel.
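
As a commenter suggests below, you can sidestep the escaping entirely by putting the summing snippet into its own little awk program file (sum.awk is just a hypothetical name):

echo '{s+=$1} END {print s}' > sum.awk    # one-line awk program that sums column 1
cat rands20M.txt | parallel --pipe awk -f sum.awk | awk -f sum.awk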

WC

Want to create a super-parallel count of lines in a file? Instead of this:

wc -l bigfile.txt

Do this:

cat bigfile.txt | parallel --pipe wc -l | awk '{s+=$1} END {print s}'

This is pretty neat: what is happening here is that during the parallel call we are ‘mapping’ a bunch of calls to wc -l, generating sub-totals, and finally adding them up with the final pipe into awk.
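
To see the ‘map’ stage on its own, drop the final awk and look at the per-chunk counts; by default --pipe hands each job a block of roughly 1 MB, so you get one count per block:

cat bigfile.txt | parallel --pipe wc -l    # one line count per block; the trailing awk above just sums these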

SED

Feel like using sed to do a huge number of replacements in a huge file? Instead of this:

sed s^old^new^g bigfile.txt

Do this:

cat bigfile.txt | parallel --pipe sed s^old^new^g

…and then pipe it into your favorite file to store the output.
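
For example (the output filename is just a placeholder; -k keeps the output chunks in the same order as the input):

cat bigfile.txt | parallel --pipe -k sed s^old^new^g > bigfile.replaced.txt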

Enjoy!

–Aris

36 thoughts on “Use multiple CPU Cores with your Linux commands — awk, sed, bzip2, grep, wc, etc.”

  1. “These second command shows you using --block with 10 million lines”: actually it’s ~10 MB of data per block, not 10 million lines.

      1. You may want to warn that this breaks grep’s semantics, as each grep process is not guaranteed to be working with whole lines.

        1. Thank you for the thought – but from my testing, this use case works well. If you use the grep -A (after) or -B (before) features, it probably will break semantics if you have a boundary problem. But…for straightforward grep filtering this should work. By default GNU Parallel cuts records on ‘\n’ newlines. Do you have an example where grep gets broken?

  2. Hi there, you might just want to include in this article that people can use pbzip2 and pigz; they are the multicore versions of bzip2 and gzip.

  3. Parallel is awesome! Just a note for Ubuntu fellows, if you don’t have parallel installed it will recommend the package moreutils. However, this comes with a different tool with different syntax, and will seem to fail silently. You should look for the package parallel when installing.

  4. For bzip2 and gzip there are multithreaded implementations already available, pbzip2 and pigz which appear in most major repositories. Definitely worth looking at, they make a terrific difference.

  5. I tried this with bzip2, and was impressed with the results – until I went to decompress the file. The file is damaged, and will not uncompress. I would test this if you are going to use it. The line I used was:

    time cat dds.sql | parallel –pipe –recend ” -k bzip2 –best > ddsp.sql.bz2

    1. You have a weird character error: you copied and pasted from the website and it somehow got mangled. You want it to say --recend '' (two single quotes). I have tested this many times.

    1. Thank you – an unexpected problem from a plug-in I am using to highlight my code.

  6. Your greater-than characters are showing up as ampersand-g-t-semicolon (at least in Chrome on Debian).

  7. Great stuff.

    One suggestion: You could really tidy things up if you put that awk summing snippet into its own file, or even a bash function. That’d also eliminate the need for awkward escaping.

    1. Good idea. I am used to doing a little escaping when necessary to keep things very simple.

  8. I don’t think it’s valid to create a single file from the parallel bzips. Bzip uses a Huffman encoding. Let’s say byte a comes before byte b in the compressed file. Byte b will depend on a. When you run the compression in parallel like that, you can no longer guarantee that property. As a result, decompressing the file with a different number of splits or with different split locations will fail.

    The only way this could work is if each bzip writes its own header and end-of-stream footers for each split and bzip understands what to do when it sees multiple bzip headers and footers.

    The valid way to do this would be to have separate output files instead of a single one.

    Secondly, the command you provided does not work:
    cat index.html | parallel –pipe –recend ” -k bzip2 –best > compressed.2.bzi
    parallel: invalid option — ‘-‘

    1. Thanks! It appears my double hyphens (–) are somehow converted to single hyphens or m-dashes or something on YOUR rendering. The command I provided does work – I just verified it. I don’t know why you had that weird double-to-single hyphen conversion happen. It is not happening on Google Chrome for me. It could have to do with a code highlight plug-in I am using….

      Secondly – your point on Huffman encoding could stand, but at least with the implementation of bzip2 on Ubuntu 12.04, bzip2 1.0.6, I have generated random data up to 1 GB in size, performed the parallel compression I describe, and then decompressed and compared it with the original. It all works fine. If you could generate an example that fails I would appreciate it, since I don’t know about the internals of Huffman encoding! Your analysis could stand, but my experiments show that the method works for large files.

      1. Seconded that it works for me, but I need to test on files larger than 12 MB.

        However, due to the parallel nature it should be noted that the encoding will not be as efficient, and the compressed files will be slightly larger than they would be if bzip2 were used by itself. Probably not a world of difference, but I would be interested to see how different the sizes would be given a 1 GB file. The difference in compression of a 12 MB file is ~50 KB.

  9. Hey all,

    Tested parallel on a dataset with ~8M lines and ~18 GB. Results surprised me for the wrong reasons:

    time cat dataset.csv | parallel --pipe wc -l | awk '{ sum += $1 }; END { print sum }'
    8087816

    real 1m10.816s
    user 0m37.240s
    sys 1m29.492s

    time wc -l dataset.csv
    8087816 dataset.csv

    real 0m2.158s
    user 0m0.564s
    sys 0m1.584s

    (8 CPUs and 32 GB RAM)
    Why is it taking so long?

  10. With regard to the parallel grep, does this effectively handle the case where a pattern spans multiple blocks?

  11. Note: This article only makes sense if you have an SSD. If you have a traditional rotating-platter disk, the disk I/O time so dominates the runtime that having an effectively faster CPU just can’t make any difference.

    1. This article is trying to demonstrate some basics of using multiple cores for various tasks – so take the lessons and expound on them. I do use SSDs, so your point makes sense – but depending on the types of computations you DO, it may be much more computationally expensive than a stupid little ‘wc’ of course.

      Teach a man to fish here….

  12. The other thing you can do to show comparative performance is to insert ‘pv’ (pipe viewer) into the piped commands. So you could change the first couple of commands to:

    cat bigfile.bin | pv | bzip2 --best > compressedfile.bz2

    and

    cat bigfile.bin | pv | parallel --pipe --recend '' -k bzip2 --best > compressedfile.bz2

    So you can see the speedup directly in the MiB/s. :)

  13. Why are you using cat all the time for single files? cat only makes sense to concatenate multiple files.

    1. Read up on how GNU Parallel takes input from stdin – piping in is an easy way to do it; otherwise you can append arguments via :::

  14. Didn’t know that grep is a single-threaded command; indeed, in today’s world, if you are not using GNU Parallel then you are missing something.
