How to split a file, process the pieces in multiple threads and combine results using a shell script
Say you have a file with a huge number of records to process, and the processing of one record does not depend on the results of any previous record (i.e. a perfectly parallelizable situation). What can you do to speed things up? Well, here’s what I did when a client recently asked for statistical data on 300K records instead of his usual request of 20: the program we had written earlier for the purpose wasn’t really built for speed and wasn’t multi-threaded.
Use the “split” command to break the file into an appropriate number of smaller files, depending on the configuration of your hardware; the “-l” option specifies the number of lines per file. Then run multiple instances of your program, one per file, backgrounding each with “&” so they process the files in parallel. Use “wait” to block until all the background jobs have finished. And finally, when everything is done, merge the output files together with “cat”, appending each one to the combined file with “>>”. Voila! Things finish much, much faster.
I used the steps above to make 20 files of 15K lines each, since the server I was running the script on was a Sun T2000 running Solaris 10, with 32GB of RAM and an octa-core processor supposedly capable of running 32 threads in parallel. It worked like a charm!
Sample script follows:
# Split the input into chunks of 15000 lines each (xaa, xab, ...)
split -l 15000 originalFile.txt

# Process each chunk in a background job
for f in x*
do
    runDataProcessor "$f" > "$f.out" &
done

# Block until every background job has finished
wait

# Append the per-chunk outputs into one combined file
for k in *.out
do
    cat "$k" >> combinedResult.txt
done
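If you want to try the pattern end to end without the real data processor, here is a self-contained sketch. It generates its own sample input and uses `tr` (uppercasing each record) as a stand-in for the actual processing program, which is purely an assumption for illustration:

```shell
#!/bin/sh
# Demonstration of split / parallel-process / merge in a temp directory.
set -e
workdir=$(mktemp -d)
cd "$workdir"

# Build a sample input of 100 records
i=1
while [ $i -le 100 ]; do
    echo "record $i"
    i=$((i + 1))
done > originalFile.txt

# Split into chunks of 25 lines each (creates xaa, xab, xac, xad)
split -l 25 originalFile.txt

# Process each chunk in a background job; tr stands in for the
# real processor (an assumption for this sketch)
for f in x??; do
    tr '[:lower:]' '[:upper:]' < "$f" > "$f.out" &
done

# Block until every background job has finished
wait

# Merge the per-chunk outputs in order; the glob expands
# alphabetically, so the original line order is preserved
for k in x??.out; do
    cat "$k" >> combinedResult.txt
done
```

After it runs, combinedResult.txt holds the same 100 records as the input, uppercased, in their original order.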
Tags: Code, Linux, Shell script, Solaris, Threads, Tips, Unix