How to split a file, process the pieces in multiple threads and combine results using a shell script

Say you are in a situation where you have a file with a huge number of records to process, and the processing of one record does not depend on data from the processing of previous records (i.e. a perfectly parallelizable situation). What can you do to speed things up? Well, here’s what I did when a client recently requested statistical data for 300K records instead of his usual request of 20 records: the program we had written earlier for the purpose wasn’t really built to run fast and wasn’t multi-threaded.

Use the “split” command to split the file by number of lines into an appropriate number of pieces, depending on the configuration of your hardware; the “-l” option specifies the number of lines per file. Then run multiple instances of your program to process the different files in parallel by launching each one in the background with “&”. Use “wait” to wait for all the background tasks to finish. And finally, when everything is done, merge the output files together with “cat” and append redirection (“>>”). Voila! Things finish much, much faster.

I used the above steps to make 20 files of 15K lines each, since the server I was running the script on was a Sun Solaris 10 T2000 with 32GB of RAM and an octa-core processor capable of running 32 hardware threads in parallel. It worked like a charm!
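If you don’t want to hard-code the chunk size, you can derive the “-l” value from the record count and the number of parallel jobs you want. A minimal sketch (the file name, the 300K record count, and the job count of 20 are just illustrative, matching the numbers above):

```shell
# Sketch: derive the split size from the total record count and the
# number of parallel jobs. Names and numbers here are illustrative.
seq 1 300000 > originalFile.txt      # stand-in for the real input file
TOTAL=`wc -l < originalFile.txt`     # total number of records
JOBS=20                              # how many chunks / workers to use
LINES=`expr \( $TOTAL + $JOBS - 1 \) / $JOBS`   # round up so no record is lost
echo $LINES                          # 15000
split -l $LINES originalFile.txt     # produces xaa, xab, ... xat
```

With 300K records and 20 jobs this reproduces the 15K-lines-per-file setup described above.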

Sample script follows:

split -l 15000 originalFile.txt

for f in x*
do
    runDataProcessor $f > $f.out &
done

wait

for k in *.out
do
    cat $k >> combinedResult.txt
done

This entry was posted on Monday, February 11th, 2008 at 1:20 pm and is filed under Uncategorized.

4 Responses to “How to split a file, process the pieces in multiple threads and combine results using a shell script”

  1. Using Parallel Processing for Text File Processing (and Shell Scripts) « UNIX Administratosphere

    […] February 2008 Over at Onkar Joshi’s Blog, he wrote an article about how to write a shell script to process a text file using parallel processing. He provided an […]

  2. TSO

    you’ve been tagged by me! go have a look at my blog.

  3. Carlos

Thanks, excellent tip. I used it in a situation where I had so many files in my directory that I couldn’t even do an rm (I got the “argument list too long” error), so I populated a file with the directory listing and treated the files in lots of 1000!
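    Carlos’s workaround can be sketched like this (directory and file names are made up for illustration; a lot size of 1000 keeps each rm’s argument list small):

    ```shell
    # Sketch of the batching trick: when "rm *" fails with
    # "argument list too long", write the listing to a file, split it
    # into lots of 1000 names, and remove each lot separately.
    mkdir -p /tmp/crowded && cd /tmp/crowded
    touch a b c d e                      # stand-in for the huge directory
    ls > /tmp/listing.txt                # the directory listing file
    split -l 1000 /tmp/listing.txt /tmp/lot.
    for f in /tmp/lot.*
    do
        xargs rm < $f                    # each rm sees at most 1000 names
    done
    ```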

  4. Ole Tange

    Consider having a look at Parallel. It makes the script more readable.

    split -l 15000 originalFile.txt
    ls x* | parallel runDataProcessor >> combinedResult.txt

    If the order of the input needs to be kept in the output:

    ls x* | parallel -k runDataProcessor >> combinedResult.txt

    If runDataProcessor is CPU intensive, run one for each core:

    ls x* | parallel -j+0 runDataProcessor >> combinedResult.txt