
A million files and one laptop

Consider the example of an online store: a million transaction files that need to be analyzed on a single laptop.



With a reasonably modern computer, "medium-sized" data can be processed with sensible use of the GNU Parallel utility and stream processing.

Step 1: Concatenating (cat * >> out.txt?!)


The cat utility is familiar to almost anyone who has ever opened a terminal on a Unix system: select all or some of the files in a folder and merge them into one large file. But here is what happens as soon as there are many files:

$ cat * >> out.txt
-bash: /bin/cat: Argument list too long


The number of files exceeds what the system allows on a single command line. Many Unix tools accept only around 10,000 arguments, but the asterisk in the cat command expands to 1,234,567 file names, all passed to the utility at once. Hence the error message.
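The hard ceiling is the operating system's ARG_MAX limit on the combined size of a command's argument list; as a quick aside, it can be inspected with getconf (the exact value varies by OS):

 # Show the OS limit, in bytes, on the combined size of a command's arguments and environment
 $ getconf ARG_MAX 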

You can do the following:

 for f in *; do cat "$f" >> ../transactions_cat/transactions.csv; done 


And after roughly 10,093 seconds, the combined file is ready.
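As a side note, a similar result can usually be obtained without invoking cat once per file by letting xargs batch the file names into argument lists that fit under the limit. This variant is only a sketch and was not part of the original benchmark (it assumes the output file lives outside the source directory, so find does not pick it up):

 $ find . -maxdepth 1 -type f -print0 | xargs -0 cat >> ../transactions_cat/transactions.csv 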

Step 2: GNU Parallel & Concatenation


But you can improve the process with GNU Parallel:

 ls | parallel -m -j $f "cat {} >> ../transactions_cat/transactions.csv" 


The $f argument passed to -j lets you choose the level of parallelism (the number of jobs); however, the speedup does not scale linearly with the number of jobs, as the figure below shows (the graph code is linked in the original post).
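A benchmark along these lines could look roughly as follows. This is only a sketch: the job counts and the per-run output files are illustrative choices, not the setup used for the original graph.

 # Time the parallel concatenation at several -j levels; each run writes to its own output file
 for j in 1 2 4 8; do
     echo "jobs: $j"
     rm -f ../transactions_cat/transactions_${j}.csv
     time (ls | parallel -m -j $j "cat {} >> ../transactions_cat/transactions_${j}.csv")
 done 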



Step 3: Data > RAM


Once the million files have been merged into one, another problem arises: the 19.93 GB of data does not fit into RAM (the machine is a 2014 MacBook Pro with 16 GB of RAM). So for the analysis you need either a more powerful machine, or stream processing, or processing the data in chunks.
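As a minimal sketch of that streaming/chunked idea with GNU Parallel (the 100 MB block size and the line-counting filter below are placeholders, not part of the original benchmark): --pipe reads stdin block by block and hands each block to its own worker, so the 19.93 GB file never has to fit in RAM at once.

 # Each worker counts the lines of its ~100 MB block; awk then sums the partial counts
 $ cat transactions.csv | parallel --pipe --block 100M wc -l | awk '{total += $1} END {print total}' 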

But staying with GNU Parallel, there are a few questions about operational data (our online store's transactions) worth answering:

How many unique products have been sold?
How many transactions were made per day?
How many items were sold per store, per month?

Unique products


 # Serial method (i.e., no parallelism)
 # This is a simple implementation of map & reduce; the tr statements represent one map, sort -u one reducer
 # cut -d ' ' -f 5- transactions.csv  - Using cut, take everything from the 5th column onward from the transactions.csv file
 # tr -d \"                           - Using tr, trim off double quotes. This leaves us with a comma-delimited string of products representing a transaction
 # sort -u                            - Using sort, put similar items together, but only output the unique values
 # wc -l                              - Count the number of unique lines, which, after de-duping, represents the number of unique products
 $ time cut -d ' ' -f 5- transactions.csv | tr -d \" | tr ',' '\n' | sort -u | wc -l
 331

 real    292m7.116s

 # Parallelized version, default chunk size of 1 MB. This will use 100% of all CPUs (real and virtual)
 # Also map & reduce; the tr statements are a single map, the sort -u statements multiple reducers (8 by default)
 $ time cut -d ' ' -f 5- transactions.csv | tr -d \" | tr ',' '\n' | parallel --pipe --block 1M sort -u | sort -u | wc -l
 331

 # Block size vs. performance - making the block size smaller might improve performance
 # The number of jobs can also be manipulated (not evaluated)
 # --block 500K:          73m57.232s
 # --block 1M (default):  75m55.268s (3.84x faster than serial)
 # --block 2M:            79m30.950s
 # --block 3M:            80m43.311s 


Transactions per day

While the file format was not ideal for answering the first question, it is perfect for this one. Since each line represents a transaction, all we need to do is perform the SQL equivalent of a "GROUP BY" on the date and count the lines:

 # Data is at the transaction level, so we just need to do the equivalent of a 'group by' operation
 # Using cut again, we choose field 3, which is the date part of the timestamp
 # sort | uniq -c is a common pattern for doing a 'group by' count operation
 # The final tr step trims the leading quotation mark from the date string
 time cut -d ' ' -f 3 transactions.csv | sort | uniq -c | tr -d \"

 real    76m51.223s

 # Parallelized version
 # Quoting can be annoying when using parallel, so writing a Bash function is often much easier than dealing with escaping quotes
 # To do a 'group by' operation using awk, we need an associative array
 # Because we are doing parallel operations, the awk output needs to be passed to awk again to return the final counts
 awksub () { awk '{a[$3]+=1;}END{for(i in a)print i" "a[i];}';}
 export -f awksub
 time parallel --pipe awksub < transactions.csv | awk '{a[$1]+=$2;}END{for(i in a)print i" "a[i];}' | tr -d \" | sort

 real    8m22.674s (9.05x faster than serial)


Items sold per store, per month

This is the example where the command-line fu starts to run a bit thin, and it is also the one where the serial method turns out to be the fastest. With a roughly 14-minute runtime, there is simply not much wall-clock time for parallelization to win back.

 # The serial method uses 40-50% of all available CPU prior to the `sort` step. Assuming linear scaling, the best we could achieve is halving the time.
 # Grand assertion: this pipeline actually gives the correct answer! This is a very complex way to calculate this; SQL would be so much easier...
 # cut -d ' ' -f 2,3,5       - Take fields 2, 3, and 5 (store, timestamp, transaction)
 # tr -d '[A-Za-z\"/\- ]'    - Strip out all the characters and spaces, to just leave the store number, the timestamp, and the commas that represent the number of items
 # awk '{print (substr($1,1,5)"-"substr($1,6,6)), length(substr($1,14))+1}'
 #                           - Split the string at the store / yearmo boundary, then count the number of commas + 1 (since 3 commas = 4 items)
 # awk '{a[$1]+=$2;}END{for(i in a)print i" "a[i];}'
 #                           - Sum by store-yearmo combination
 # sort                      - Sort so that each store's numbers are together, then by month
 time cut -d ' ' -f 2,3,5 transactions.csv | tr -d '[A-Za-z\"/\- ]' | awk '{print (substr($1,1,5)"-"substr($1,6,6)), length(substr($1,14))+1}' | awk '{a[$1]+=$2;}END{for(i in a)print i" "a[i];}' | sort

 real    14m5.657s

 # Parallelize the substring awk step
 # Actually lowers processor utilization!
 awksub2 () { awk '{print (substr($1,1,5)"-"substr($1,6,6)), length(substr($1,14))+1}';}
 export -f awksub2
 time cut -d ' ' -f 2,3,5 transactions.csv | tr -d '[A-Za-z\"/\- ]' | parallel --pipe -m awksub2 | awk '{a[$1]+=$2;}END{for(i in a)print i" "a[i];}' | sort

 real    19m27.407s (worse!)

 # Move parallel to the aggregation step
 awksub3 () { awk '{a[$1]+=$2;}END{for(i in a)print i" "a[i];}';}
 export -f awksub3
 time cut -d ' ' -f 2,3,5 transactions.csv | tr -d '[A-Za-z\"/\- ]' | awk '{print (substr($1,1,5)"-"substr($1,6,6)), length(substr($1,14))+1}' | parallel --pipe awksub3 | awksub3 | sort

 real    19m24.851s (about the same as the other parallel run)


These three examples show that GNU Parallel makes it possible to process data sets that do not fit in RAM within a reasonable time. They also show that working with Unix utilities can get complicated. Moving the commands into a shell script helps fight the "one-liner" syndrome, where a pipeline grows so long that the logic behind it is lost. But at some point the problems are simply easier to solve with other tools.

Source: https://habr.com/ru/post/280384/
