1) Test Setup ************* The tests were performed an a system with the following specifications: AMD Ryzen 7 3700X 32GiB DDR4 RAM Fedora 33 The following gcc versions of GCC and Linux were used: gcc (GCC) 10.2.1 20201125 (Red Hat 10.2.1-9) Linux 5.11.9-200.fc33.x86_64 The following squashfs-tools-ng commit was tested: 7d2b3b077d7e204e64a1c57845524250c5b4a142 An optimized build of squashfs-tools-ng was compiled and installed to a tmpfs: $ mkdir /dev/shm/temp $ ln -s /dev/shm/temp out $ ./autogen.sh $ ./configure CFLAGS="-O3 -Ofast -march=native -mtune=native" \ LDFLAGS="-O3 -Ofast" --prefix=$(pwd)/out $ make -j install-strip $ cd out Working in a tmpfs was done to eliminate any influence of I/O performance and I/O caching side effects to the extend possible and only measure the actual processing time. For all benchmark tests, a Debian image extracted from the Debian 10.2 LiveDVD for AMD64 with XFCE was used. The Debian image is expected to contain realistic input data for a Linux file system and also provide enough data for an interesting benchmark. For all performed benchmarks, graphical representations of the results and derived values can be seen in "benchmark.ods". 1) Parallel Compression Benchmark ********************************* 1.1) What was measured? The Debian image was first converted to a tarball: $ ./bin/sqfs2tar debian.sqfs > test.tar The tarball was then repacked and time was measured as follows: $ time -p ./bin/tar2sqfs -j <NUM_CPU> -c <COMPRESSOR> -f test.sqfs < test.tar The repacking was repeated 4 times and the worst wall-clock time ("real") was used for comparison. The <NUM_CPU> was varied from 1 to 16 and for <COMPRESSOR>, all available compressors were used. All possible combinations <NUM_CPU> and <COMPRESSOR> were measured. In addition, a serial reference version was compiled by running configure with the additional option --without-pthread and re-running the tests for all compressors without the <NUM_CPU> option. In addition to the existing compressors, the LZO compressor in libcommon.a was briefly patched to not perform any compression at all. This way, a baseline comparison was established for a completely uncompressed SquashFS image. 1.2) What was computed from the results? The relative and absolute speedup were determined as follows: runtime_parallel(compressor, num_cpu) spedup_rel(compressor, num_cpu) = ------------------------------------- runtime_parallel(compressor, 1) runtime_parallel(compressor, num_cpu) spedup_abs(compressor, num_cpu) = ------------------------------------- runtime_serial(compressor) In addition, relative and absolute efficiency of the parallel implementation were determined: speedup_rel(compressor, num_cpu) efficiency_rel(compressor, num_cpu) = -------------------------------- num_cpu speedup_abs(compressor, num_cpu) efficiency_abs(compressor, num_cpu) = -------------------------------- num_cpu Furthermore, although not relevant for this specific benchmark, having the converted tarballs available, the compression ratio was computed as follows: size(tarball) max_throughput(compressor) = -------------------------- min(runtime(compressor)) 1.4) Results The raw timing results are as follows: Jobs XZ lzma gzip LZO LZ4 zstd none serial 1108.39s 995.43s 609.79s 753.14s 13.58s 550.59s 5.86s 1 1116.06s 990.33s 598.85s 753.53s 11.25s 550.37s 4.23s 2 591.21s 536.61s 312.14s 394.21s 6.41s 294.12s 4.13s 3 415.90s 370.48s 215.92s 273.14s 4.84s 205.14s 4.58s 4 320.02s 288.35s 165.50s 210.32s 4.29s 159.71s 4.62s 5 263.94s 235.69s 136.28s 172.33s 4.19s 132.27s 4.94s 6 224.23s 200.63s 116.44s 146.80s 4.28s 112.79s 5.08s 7 196.78s 176.35s 100.66s 128.61s 4.24s 99.26s 5.43s 8 175.04s 157.82s 89.79s 113.47s 4.46s 88.22s 5.68s 9 166.52s 148.88s 83.01s 106.14s 4.64s 84.97s 5.76s 10 159.35s 141.08s 77.04s 99.92s 4.84s 81.61s 5.94s 11 151.08s 136.27s 71.52s 94.23s 5.00s 77.51s 6.14s 12 144.72s 128.91s 67.21s 89.33s 5.28s 74.10s 6.39s 13 137.91s 122.67s 63.43s 84.39s 5.41s 71.83s 6.51s 14 132.94s 117.79s 59.45s 80.87s 5.71s 68.86s 6.68s 15 126.76s 113.51s 56.37s 76.68s 5.74s 65.78s 6.91s 16 119.06s 107.15s 52.56s 71.49s 6.37s 62.52s 7.10s 1.5) Discussion Most obviously, the results indicate that LZ4, unlike the other compressors, is clearly I/O bound and not CPU bound and doesn't benefit from parallelization beyond 2-4 worker threads and even that benefit is marginal with efficiency plummeting immediately. The other compressors are clearly CPU bound. Speedup increases linearly until about 8 cores, but with a slope < 1, as evident by efficiency linearly decreasing and reaching 80% for 8 cores. A reason for this sub-linear scaling may be the choke point introduced by the creation of fragment blocks, that *requires* a synchronization. To test this theory, a second benchmark should be performed with fragment block generation completely disabled. This requires a new flag to be added to tar2sqfs (and also gensquashfs). Using more than 8 jobs causes a much slower increase in speedup and efficiency declines even faster. This is probably due to the fact that the test system only has 8 physical cores and beyond that, SMT has to be used. It should also be noted that for most of the compressors, as well as the uncompressed version, the thread pool compressor with only a single thread turns out to be *slightly* faster than the serial reference implementation. A possible explanation for this might be that the fragment blocks are actually assembled in the main thread, in parallel to the worker that can still continue with other data blocks. Because of this decoupling there is in fact some degree of parallelism, even if only one worker thread is used. For the uncompressed version, the work still done in the thread pool is the hashing of blocks and fragments for de-duplication. Also of interest are the changes from the previous version of the benchmark, performed on v0.9 of squashfs-tools-ng. Since then, the thread pool design has been overhauled to spend a lot less time in the critical regions, but to also perform byte-for-byte equivalence checks before considering blocks or fragments to be identical. This may require a read-back and decompression step in the main thread in order to access already written fragment blocks. While the overall behavior has stayed the same, performance for XZ & LZMA has decreased slightly, whereas performance for the gzip, LZ4 & ZSTD has improved slightly. As the decompression benchmark shows, the first two are a lot slower at decompression, which needs to be done when reading back a fragment block from disk, and due to the higher data density also have a higher chance of actually having to decompress a block, so as a net result, the performance penalty from exact fragment matching eats all gains from the new thread pool design. For the more I/O bound compressors like LZ4 & ZSTD, decompressing a block is done much faster and due to the low data density for LZ4, the chance of actually having to decompress a block is lowered. As a result, the gains from the new thread pool design apparently outweigh the read-back penalty. Also noteworthy, due to the inclusion of an uncompressed reference, is that the LZ4 compressor is actually very close in performance to the uncompressed version, in some cases even outperforming it. This might be due to the fact that LZ4 actually does compress blocks, so in many cases where the uncompressed version needs to read back a full block during deduplication, the LZ4 version only needs to read a considerably smaller amount of data, reducing the penalty of having to read back blocks. 2) Reference Decompression Benchmark ************************************ 2.1) What was measured? A SquashFS image was generated for each supported compressor: $ ./bin/sqfs2tar debian.sqfs | ./bin/tar2sqfs -c <COMPRESSOR> test.sqfs And then, for each compressor, the unpacking time was measured: $ time -p ./bin/sqfs2tar test.sqfs > /dev/null The unpacking step was repeated 4 times and the worst wall-clock time ("real") was used for comparison. 2.2) What was computed from the results? The throughput was established by dividing the size of the resulting tarball by the time taken to produce it from the image. For better comparison, this was also normalized to the throughput of the uncompressed SquashFS image. 2.3) Results xz 120.53s lzma 118.91s gzip 20.57s lzo 10.65s zstd 7.74s lz4 2.59s uncompressed 1.42s 2.4) Discussion From the measurement, it becomes obvious that LZ4 and zstd are the two fastest decompressors, both being very close to the uncompressed version. Zstd is particularly noteworthy here, because it is not far behind LZ4 in speed, but also achieves a substantially better compression ratio that is between gzip and lzma. LZ4, despite being the fastest in decompression and beating the others in compression speed by orders of magnitudes, has by far the worst compression ratio. It should be noted that the number of actually compressed blocks has not been determined. A worse compression ratio can lead to more blocks being stored uncompressed, reducing the workload and thus affecting decompression time. However, since zstd has a better compression ratio than gzip, takes only 30% of the time to decompress, and in the serial compression benchmark only takes 2% of the time to compress, we can safely say that in this benchmark, zstd beats gzip by every metric. Furthermore, while XZ stands out as the compressor with the best compression ratio, zstd only takes ~6% of the time to decompress the entire image, while being ~17% bigger than XZ. Shaving off 17% is definitely significant, especially considering that in absolute numbers it is in the 100MB range, but it clearly comes at a substantial performance cost. Also interesting are the results for the LZO compressor. Its compression speed is between gzip and LZMA, decompression speed is about 50% of gzip, and only a little bit worse than zstd, but its compression ratio is the second worst only after LZ4, which beats it by a factor of 5 in decompression speed and by ~60 in compression speed. Concluding, for applications where a good compression ratio is most important, XZ is obviously the best choice, but if speed is favored, zstd is probably a very good option to go with. LZ4 is much faster, but has a lot worse compression ratio. It is probably best suited as transparent compression for a read/write file system or network protocols. Finally, it should be noted, that this serial decompression benchmark is not representative of a real-life workload where only a small set of files are accessed in a random access fashion. In that case, a caching layer can largely mitigate the decompression cost, translating it into an initial or only occasionally occurring cache miss latency. But this benchmark should in theory give an approximate idea how those cache miss latencies are expected to compare between the different compressors. 3) Compression Size and Overhead Benchmark ****************************************** 3.1) What was measured? For each compressor, a SquashFS image was created in the way outlined in the parallel compression benchmark and the resulting file size was recorded. In addition, the raw tarball size was recorded for comparison. 3.2) What was computed from the results? The compression ratio was established as follows: size(compressor) ratio(compressor) = -------------------- size(uncompressed) 3.3) Results SquashFS tar Uncompressed ~6.1GiB (6,542,389,248) ~6.5GiB (7,008,118,272) LZ4 ~3.1GiB (3,381,751,808) LZO ~2.5GiB (2,732,015,616) gzip ~2.3GiB (2,471,276,544) zstd ~2.1GiB (2,295,078,912) lzma ~2.0GiB (2,102,169,600) XZ ~2.0GiB (2,098,466,816) 3.4) Discussion Obviously XZ and lzma achieve the highest data density, shrinking the SquashFS image down to less than a third of the input size. Noteworthy is also Zstd achieving higher data density than gzip while being faster in compression as well as decompression. Interestingly, even the uncompressed SquashFS image is still smaller than the uncompressed tarball. Obviously SquashFS packs data and meta data more efficiently than the tar format, shaving off ~7% in size.