Summary: Compatibility and decompression speed are more important than compression ratio for many use cases. Gzip is nearly universal, whereas lz4, xz, and parallel bzip2 are not.
The challenge of sharing internet-wide scan data has unearthed a few issues with creating and processing large datasets.
The IC12 project[1] used zpaq, which ended up compressing to almost half the size of gzip. The downside is that it took nearly two weeks and 16 cores to convert the zpaq data to a format other tools could use.
The Critical.IO project[2] used pbzip2, which worked amazingly well, except when processing the data with Java-based tool chains (Hadoop, etc.). The Java BZ2 libraries had trouble with the parallel version of bzip2.
We chose gzip with Project Sonar[3], and although the compression isn't great, it is widely compatible with the tools people use to crunch the data, and we get parallel compression/decompression via pigz.
In the latest example, the Censys.io[4] project switched to LZ4 and threw data processing compatibility to the wind (in favor of bandwidth and a hosted search engine).
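To illustrate the compatibility point about gzip, here is a minimal sketch of streaming a gzip-compressed dataset using only Python's standard library. The function names and the one-record-per-line layout are assumptions made for illustration, not the format of any of the datasets above. Files written by pigz are ordinary gzip files, so the same reader should work on them.

```python
import gzip


def iter_gzip_lines(path, encoding="utf-8"):
    """Yield decoded lines from a gzip file without loading it all into memory."""
    # "rt" opens the stream in text mode; decompression happens as we read.
    with gzip.open(path, "rt", encoding=encoding, errors="replace") as handle:
        for line in handle:
            yield line.rstrip("\n")


def count_records(path):
    """Count records, assuming one record per line (a hypothetical layout)."""
    return sum(1 for _ in iter_gzip_lines(path))
```

Because the reader is a generator, memory use stays roughly constant regardless of file size, which matters for scan dumps that do not fit in RAM.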
-HD
1. http://internetcensus2012.bitbucket.org/images.html
2. https://scans.io/study/sonar.cio
3. https://sonar.labs.rapid7.com/
4. https://censys.io/