I often have this kind of construct, where one process generates a lot of data and a second one does something with it. Think, for instance, of a big SELECT on a database, the output being a CSV we want to compress:
'SELECT a lot of data FROM a big table' | gzip > data.csv.gz
I wanted to know the size of the original data (I know that in the case of gzip I can use the -l flag, but this is just an example).
There are three ways to do this. In my examples the big data process is yes | head -n 1000000000, which generates 1 billion rows (without any IO, which is nice for benchmarking), and the consumer process is just a dd which dumps everything into /dev/null.
awk
yes | head -n 1000000000 | awk '{print $0; count++} END{print count >"/dev/stderr";}' | dd bs=64M of=/dev/null
Good points:
- Quite easy to read, semantically easy to understand,
- does not duplicate the data stream.
Bad point:
- As the data still flows to STDOUT, the row count is printed on STDERR, which is not ideal but still usable afterwards (see the sketch below for capturing it in a script).
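If you need that count in a shell variable, one workaround (a minimal sketch, assuming bash and that nothing else in the pipeline writes to STDERR) is to capture the group's STDERR with a command substitution:

# Sketch, assuming bash: dd's own STDERR noise is discarded, so the only thing
# left on the group's STDERR is awk's final count, which we capture.
rowcount=$( { yes | head -n 1000000000 \
    | awk '{print $0; count++} END{print count > "/dev/stderr"}' \
    | dd bs=64M of=/dev/null 2>/dev/null; } 2>&1 )
echo "$rowcount"    # 1000000000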
tee and wc
yes | head -n 1000000000 | tee >(dd bs=64M of=/dev/null 2>/dev/null) | wc -l | { read -r rowcount; echo "$rowcount"; }
Good point:
- You get the rowcount in a variable, easy to use afterwards (see the note after this list about variable scope).
Bad points:
- Semantically weird, tricky to understand,
- duplicates the data stream.
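One caveat worth noting: in bash the last element of a pipeline runs in a subshell by default, so in the command above rowcount only exists inside the braces. Two sketches for getting it into the calling shell (both assume bash):

# Option 1: capture wc's output directly with a command substitution.
rowcount=$(yes | head -n 1000000000 | tee >(dd bs=64M of=/dev/null 2>/dev/null) | wc -l)

# Option 2: lastpipe runs the last pipeline element in the current shell
# (only takes effect when job control is off, i.e. in non-interactive scripts).
shopt -s lastpipe
yes | head -n 1000000000 | tee >(dd bs=64M of=/dev/null 2>/dev/null) | wc -l | read -r rowcount
echo "$rowcount"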
pv
yes | head -n 1000000000 | pv -l | dd bs=64M of=/dev/null 2>/dev/null
Good points:
- Semantically pleasing,
- pv has a lot of options which might be interesting.
Bad point:
- Nice for interactive use, but it displays a progress bar on STDERR, so it's next to impossible to get the output in a script (see the sketch below for a partial workaround).
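There is a partial workaround: pv has a numeric mode. A sketch of capturing the final count in a script (assuming that -n combined with -l and -b makes pv print the running line count, one value per line, on STDERR):

# Sketch: dd's STDERR chatter is discarded, so the group's STDERR only carries
# pv's numeric output; the last value printed should be the total line count.
rowcount=$( { yes | head -n 1000000000 | pv -l -b -n | dd bs=64M of=/dev/null 2>/dev/null; } 2>&1 | tail -n 1 )
echo "$rowcount"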
Out of these three options, which one is the fastest?
After running each option 10 times, here are the results, in seconds.
|       | awk   | tee  | pv   |
|-------|-------|------|------|
| Mean  | 139.5 | 18.3 | 6516 |
| Min   | 133   | 17   | 6446 |
| Max   | 154   | 22   | 6587 |
| Stdev | 8.28  | 1.81 | 48.4 |
Yes, I double-checked my data. We are indeed talking about 20 seconds for tee, about 2 minutes 20 seconds for awk, and about 1 hour 45 minutes for pv.
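The timing harness itself is not shown above; a minimal sketch of how one variant could be timed over 10 runs (the loop, TIMEFORMAT and the awk_times.txt file are illustrative assumptions, not the exact setup used):

# Sketch, assuming bash: record only wall-clock seconds for each of 10 runs.
TIMEFORMAT='%R'
for i in {1..10}; do
    time ( yes | head -n 1000000000 \
        | awk '{print $0; count++} END{print count > "/dev/stderr"}' 2>/dev/null \
        | dd bs=64M of=/dev/null 2>/dev/null )
done 2>> awk_times.txt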