I often have this kind of construct, where one process generates a lot of data and a second one does something with it. Think, for instance, of a big SELECT on a database, the output being a CSV we want to compress:
'SELECT a lot of data FROM a big table' | gzip > data.csv.gz
I wanted to know the size of the original data (I know that in the case of gzip I can use the -l flag, but this is just an example).
There are three ways to do this. In my examples the big data process is yes | head -n 1000000000, which generates 1 billion rows (without any IO, which is nice for benchmarking), and the consumer process is just a dd which dumps everything into /dev/null.
awk
yes | head -n 1000000000 | awk '{print $0; count++} END{print count >"/dev/stderr";}' | dd bs=64M of=/dev/null
Good points:
- Quite easy to read, semantically easy to understand,
- does not duplicate the data stream.
Bad point:
- As the data still flows to STDOUT, the row count is printed on STDERR, which is not ideal but still usable afterwards (see the sketch below for capturing it in a script).
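If you need that count in a shell variable, one workaround (a minimal sketch, assuming bash and that nothing else in the pipeline writes to STDERR) is to capture the group's STDERR with a command substitution:

# Sketch, assuming bash: dd's own STDERR noise is discarded, so the only thing
# left on the group's STDERR is awk's final count, which we capture.
rowcount=$( { yes | head -n 1000000000 \
    | awk '{print $0; count++} END{print count > "/dev/stderr"}' \
    | dd bs=64M of=/dev/null 2>/dev/null; } 2>&1 )
echo "$rowcount"    # 1000000000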
tee and wc
yes | head -n 1000000000 | tee >(dd bs=64M of=/dev/null 2>/dev/null) | wc -l | { read -r rowcount; echo "$rowcount"; }
Good point:
- You get the rowcount in a variable, easy to use afterwards (see the note after this list about variable scope).
Bad points:
- Semantically weird, tricky to understand,
- duplicates the data stream.
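One caveat worth noting: in bash the last element of a pipeline runs in a subshell by default, so in the command above rowcount only exists inside the braces. Two sketches for getting it into the calling shell (both assume bash):

# Option 1: capture wc's output directly with a command substitution.
rowcount=$(yes | head -n 1000000000 | tee >(dd bs=64M of=/dev/null 2>/dev/null) | wc -l)

# Option 2: lastpipe runs the last pipeline element in the current shell
# (only takes effect when job control is off, i.e. in non-interactive scripts).
shopt -s lastpipe
yes | head -n 1000000000 | tee >(dd bs=64M of=/dev/null 2>/dev/null) | wc -l | read -r rowcount
echo "$rowcount"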
pv
yes | head -n 1000000000 | pv -l | dd bs=64M of=/dev/null 2>/dev/null
Good points:
- Semantically pleasing,
- pv has a lot of options which might be interesting.
Bad point:
- Nice for interactive use, but it displays a progress bar on STDERR, so it's next to impossible to get the output in a script (see the sketch below for a partial workaround).
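There is a partial workaround: pv has a numeric mode. A sketch of capturing the final count in a script (assuming that -n combined with -l and -b makes pv print the running line count, one value per line, on STDERR):

# Sketch: dd's STDERR chatter is discarded, so the group's STDERR only carries
# pv's numeric output; the last value printed should be the total line count.
rowcount=$( { yes | head -n 1000000000 | pv -l -b -n | dd bs=64M of=/dev/null 2>/dev/null; } 2>&1 | tail -n 1 )
echo "$rowcount"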
Out of these three options, which one is the fastest?
After running each option 10 times, here are the results, in seconds.
|       | awk   | tee  | pv   |
|-------|-------|------|------|
| Mean  | 139.5 | 18.3 | 6516 |
| Min   | 133   | 17   | 6446 |
| Max   | 154   | 22   | 6587 |
| Stdev | 8.28  | 1.81 | 48.4 |
Yes, I double-checked my data. We are indeed talking about 20 seconds for tee, about 2 minutes 20 seconds for awk, and about 1 hour 45 minutes for pv.
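The timing harness itself is not shown above; a minimal sketch of how one variant could be timed over 10 runs (the loop, TIMEFORMAT and the awk_times.txt file are illustrative assumptions, not the exact setup used):

# Sketch, assuming bash: record only wall-clock seconds for each of 10 runs.
TIMEFORMAT='%R'
for i in {1..10}; do
    time ( yes | head -n 1000000000 \
        | awk '{print $0; count++} END{print count > "/dev/stderr"}' 2>/dev/null \
        | dd bs=64M of=/dev/null 2>/dev/null )
done 2>> awk_times.txt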