Count bytes in a stream, in bash

I often have a pipeline where one process generates a lot of data and a second one does something with it. Think, for instance, of a big SELECT on a database whose CSV output we want to compress.

'SELECT a lot of data FROM a big table' | gzip > data.csv.gz

I wanted to know the size of the original data (I know that in the case of gzip I can use the -l flag, but this is just an example).

I found three ways to do this. In my examples the producer is yes | head -n 1000000000, which generates 1 billion rows (without I/O, which is nice for benchmarking), and the consumer is just a dd that dumps everything into /dev/null.

awk

yes | head -n 1000000000 | awk '{print $0; count++} END{print count >"/dev/stderr";}' | dd bs=64M of=/dev/null

Good points:

  • Quite easy to read, semantically easy to understand,
  • does not duplicate the data stream.

Bad point:

  • As the data still flows to STDOUT, the row count is printed on STDERR, which is not ideal but is still usable afterwards.
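To make "usable afterwards" concrete, here is a minimal sketch of one way to pull that STDERR count into a variable. It assumes the consumer writes nothing useful to STDOUT (dd with of=/dev/null here, or gzip redirected to a file) and that nothing else in the pipeline writes to STDERR; the small producer is only there to keep the demo quick.

```shell
#!/usr/bin/env bash
# Small stand-in for the article's producer (yes | head -n 1000000000).
# yes is silenced so that nothing but awk's count can reach the variable.
rowcount=$(
  {
    yes 2>/dev/null | head -n 1000 \
      | awk '{print; count++} END{print count+0 >"/dev/stderr"}' \
      | dd bs=64k of=/dev/null 2>/dev/null
  } 2>&1
)
echo "rows: $rowcount"
```

The brace group sends STDERR of the whole pipeline into the command substitution, so any warning from another stage would end up in the variable as well; that is the price of this approach.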

Tee and wc

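# In bash, a read at the end of a pipeline runs in a subshell (unless lastpipe is set),
# so the variable would be empty afterwards. Command substitution avoids that.
rowcount=$(yes | head -n 1000000000 | tee >(dd bs=64M of=/dev/null 2>/dev/null) | wc -l)
echo "$rowcount"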

Good point:

  • You get the rowcount in a variable, easy to use afterwards.

Bad points:

  • Semantically weird, tricky to understand,
  • duplicates the data stream.
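The examples count rows, but the original question was about the size of the data. A sketch of the byte-counting variant only swaps wc -l for wc -c. The small producer and the gzip consumer writing to /dev/null are stand-ins for illustration, and the count is taken with command substitution so that the variable is set in the current shell.

```shell
#!/usr/bin/env bash
# 1000 lines of "y\n" make 2000 bytes of uncompressed data.
bytecount=$(
  yes | head -n 1000 \
    | tee >(gzip > /dev/null) \
    | wc -c
)
# Some wc implementations pad their output with spaces; arithmetic
# expansion normalizes it to a plain integer.
bytecount=$((bytecount))
echo "bytes: $bytecount"
```

In real use the gzip inside >(...) would write to the target file, for instance data.csv.gz.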

pv

yes | head -n 1000000000 | pv -l | dd bs=64M of=/dev/null 2>/dev/null

Good points:

  • Semantically pleasing,
  • pv has a lot of options which might be interesting.

Bad point:

  • Nice for interactive use, but it displays a progress bar on STDERR so it’s next to impossible to get the output in a script.

Out of those 3 options, which one is the fastest?

After running each option 10 times, here are the results, in seconds.

         awk      tee      pv
Mean     139.5    18.3     6516
Min      133      17       6446
Max      154      22       6587
Stdev    8.28     1.81     48.4

Yes, I double checked my data. We are indeed talking about roughly 20 seconds for tee, 2:20 minutes for awk and about 1h49 for pv.
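For anyone who wants to repeat the measurement, here is a rough sketch of a timing harness. It is not the script used for the numbers above: the summarize and run_variant helpers are hypothetical names, the standard deviation is assumed to be the sample one (n-1 denominator), and the run count and row count are shrunk so the demo finishes quickly.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print mean, min, max and sample standard deviation
# (n-1 denominator, an assumption) of the numbers given as arguments.
summarize() {
  printf '%s\n' "$@" | awk '
    { sum += $1; sumsq += $1 * $1
      if (NR == 1 || $1 < min) min = $1
      if (NR == 1 || $1 > max) max = $1 }
    END {
      mean = sum / NR
      var = (NR > 1) ? (sumsq - NR * mean * mean) / (NR - 1) : 0
      if (var < 0) var = 0   # guard against rounding error
      printf "mean=%.1f min=%d max=%d stdev=%.2f\n", mean, min, max, sqrt(var)
    }'
}

# The variant under test; here the awk one, shrunk to 100000 rows.
run_variant() {
  yes 2>/dev/null | head -n 100000 \
    | awk '{print; count++} END{print count+0 >"/dev/stderr"}' 2>/dev/null \
    | dd bs=64k of=/dev/null 2>/dev/null
}

runs=3
times=""
i=0
while [ "$i" -lt "$runs" ]; do
  start=$(date +%s)
  run_variant
  end=$(date +%s)
  times="$times $((end - start))"   # whole seconds per run
  i=$((i + 1))
done

# Word splitting of $times is intentional: one argument per run.
summarize $times
```

Swapping the body of run_variant for the tee or pv pipeline and raising runs to 10 gives the same kind of table as above, at whole-second resolution.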
