1) Test Setup ************* The tests were performed an a system with the following specifications: AMD Ryzen 7 3700X 32GiB DDR4 RAM Fedora 32 An optimized build of squashfs-tools-ng was compiled and installed to a tmpfs: $ mkdir /dev/shm/temp $ ln -s /dev/shm/temp out $ ./autogen.sh $ ./configure CFLAGS="-O3 -Ofast -march=native -mtune=native" \ LDFLAGS="-O3 -Ofast" --prefix=$(pwd)/out $ make -j install $ cd out This was done to eliminate any influence of I/O performance and I/O caching side effects to the extend possible and only measure the actual processing time. For all benchmark tests, a Debian image extracted from the Debian 10.2 LiveDVD for AMD64 with XFCE was used. The Debian image is expected to contain realistic input data for a Linux file system and also provide enough data for an interesting benchmark. For all performed benchmarks, graphical representations of the results and derived values can be seen in "benchmark.ods". 1) Parallel Compression Benchmark ********************************* 1.1) What was measured? The Debian image was first converted to a tarball: $ ./bin/sqfs2tar debian.sqfs > test.tar The tarball was then repacked and time was measured as follows: $ time ./bin/tar2sqfs -j <NUM_CPU> -c <COMPRESSOR> -f test.sqfs < test.tar The repacking was repeated 4 times and the worst wall-clock time ("real") was used for comparison. Altough not relevant for this benchmark, the resulting image sizes were measured once for each compressor, so that the compression ratio could be estimated: $ stat test.tar $ stat test.sqfs The <NUM_CPU> was varied from 1 to 16 and for <COMPRESSOR>, all available compressors were used. All possible combinations <NUM_CPU> and <COMPRESSOR> were measured. In addition, a serial reference version was compiled by running configure with the additional option --without-pthread and re-running the tests for all compressors without the <NUM_CPU> option. 1.2) What was computed from the results? The relative and absolute speedup were determined as follows: runtime_parallel(compressor, num_cpu) spedup_rel(compressor, num_cpu) = ------------------------------------- runtime_parallel(compressor, 1) runtime_parallel(compressor, num_cpu) spedup_abs(compressor, num_cpu) = ------------------------------------- runtime_serial(compressor) In addition, relative and absolute efficiency of the parellel implementation were determined: speedup_rel(compressor, num_cpu) efficiency_rel(compressor, num_cpu) = -------------------------------- num_cpu speedup_abs(compressor, num_cpu) efficiency_abs(compressor, num_cpu) = -------------------------------- num_cpu Furthermore, altough not relevant for this specific benchmark, having the converted tarballs available, the compression ratio was computed as follows: file_size(tarball) compression_ratio(compressor) = --------------------- file_size(compressor) 1.3) What software versions were used? squashfs-tools-ng v0.9 TODO: update data and write the *exact* commit hash here, as well as gcc and Linux versions. 1.4) Results The raw timing results are as follows: Jobs XZ lzma gzip LZO LZ4 zstd serial 17m39.613s 16m10.710s 9m56.606s 13m22.337s 12.159s 9m33.600s 1 17m38.050s 15m49.753s 9m46.948s 13m06.705s 11.908s 9m23.445s 2 9m26.712s 8m24.706s 5m08.152s 6m53.872s 7.395s 5m 1.734s 3 6m29.733s 5m47.422s 3m33.235s 4m44.407s 6.069s 3m30.708s 4 5m02.993s 4m30.361s 2m43.447s 3m39.825s 5.864s 2m44.418s 5 4m07.959s 3m40.860s 2m13.454s 2m59.395s 5.749s 2m16.745s 6 3m30.514s 3m07.816s 1m53.641s 2m32.461s 5.926s 1m57.607s 7 3m04.009s 2m43.765s 1m39.742s 2m12.536s 6.281s 1m43.734s 8 2m45.050s 2m26.996s 1m28.776s 1m58.253s 6.395s 1m34.500s 9 2m34.993s 2m18.868s 1m21.668s 1m50.461s 6.890s 1m29.820s 10 2m27.399s 2m11.214s 1m15.461s 1m44.060s 7.225s 1m26.176s 11 2m20.068s 2m04.592s 1m10.286s 1m37.749s 7.557s 1m22.566s 12 2m13.131s 1m58.710s 1m05.957s 1m32.596s 8.127s 1m18.883s 13 2m07.472s 1m53.481s 1m02.041s 1m27.982s 8.704s 1m16.218s 14 2m02.365s 1m48.773s 1m00.337s 1m24.444s 9.494s 1m14.175s 15 1m58.298s 1m45.079s 58.348s 1m21.445s 10.192s 1m12.134s 16 1m55.940s 1m42.176s 56.615s 1m19.030s 10.964s 1m11.049s The sizes of the tarball and the resulting images: - LZ4 compressed SquashFS image: ~3.1GiB (3,381,751,808) - LZO compressed SquashFS image: ~2.5GiB (2,732,015,616) - zstd compressed SquashFS image: ~2.1GiB (2,295,017,472) - gzip compressed SquashFS image: ~2.3GiB (2,471,276,544) - lzma compressed SquashFS image: ~2.0GiB (2,102,169,600) - XZ compressed SquashFS image: ~2.0GiB (2,098,466,816) - raw tarball: ~6.5GiB (7,008,118,272) 1.5) Discussion Most obviously, the results indicate that LZ4, unlike the other compressors, is clearly I/O bound and not CPU bound and doesn't benefit from parallelization beyond 2-4 worker threads and even that benefit is marginal with efficiency plummetting immediately. The other compressors are clearly CPU bound. Speedup increases linearly until about 8 cores, but with a slope < 1, as evident by efficiency linearly decreasing and reaching 80% for 8 cores. A reason for this sub-linear scaling may be the choke point introduced by the creation of fragment blocks, that *requires* a synchronization. To test this theory, a second benchmark should be performed with fragment block generation completely disabled. This requires a new flag to be added to tar2sqfs (and also gensquashfs). Using more than 8 jobs causes a much slower increase in speedup and efficency declines even faster. This is probably due to the fact that the test system only has 8 physical cores and beyond that, SMT has to be used. It should also be noted that the thread pool compressor with only a single thread turns out to be *slightly* faster than the serial reference implementation. A possible explanation for this might be that the fragment blocks are actually assembled in the main thread, in parallel to the worker that can still continue with other data blocks. Because of this decoupling there is in fact some degree of parallelism, even if only one worker thread is used. As a side effect, this benchmark also produces some insights into the compression ratio and throughput of the supported compressors. Indicating that for the Debian live image, XZ clearly provides the highest data density, while LZ4 is clearly the fastest compressor available. The throughput of the zstd compressor is comparable to gzip, while the resulting compression ratio is closer to LZMA. Repeating the benchmark without tail-end-packing and with fragments completely disabled would also show the effectiveness of tail-end-packing and fragment packing as a side effect. 2) Reference Decompression Benchmark ************************************ 1.1) What was measured? A SquashFS image was generated for each supported compressor: $ ./bin/sqfs2tar debian.sqfs | ./bin/tar2sqfs -c <COMPRESSOR> test.sqfs And then, for each compressor, the unpacking time was measured: $ time ./bin/sqfs2tar test.sqfs > /dev/null The unpacking step was repeated 4 times and the worst wall-clock time ("real") was used for comparison. 2.2) What software version was used? squashfs-tools-ng commit cc1141984a03da003e15ff229d3b417f8e5a24ad gcc version: 10.2.1 20201016 (Red Hat 10.2.1-6) Linux version: 5.8.16-200.fc32.x86_64 2.3) Results gzip 20.466s lz4 2.519s lzma 1m58.455s lzo 10.521s xz 1m59.451s zstd 7.833s 2.4) Discussion From the measurement, it becomes obvious that LZ4 and zstd are the two fastest decompressors. Zstd is particularly noteworth here, because it is not far behind LZ4 in speed, but also achievs a substantially better compression ratio that is somewhere between gzip and lzma. LZ4, despite being the fastest in decompression and beating the others in compression speed by orders of magnitudes, has by far the worst compression ratio. It should be noted that the number of actually compressed blocks has not been determined. A worse compression ratio can lead to more blocks being stored uncompressed, reducing the workload and thus affecting decompression time. However, since zstd has a better compression ratio than gzip, takes only 30% of the time to decompress, and in the serial compression benchmark only takes 2% of the time to compress, we cane safely say that in this benchmark, zstd beats gzip by every metric. Furthermore, while XZ stands out as the compressor with the best compression ratio, zstd only takes ~6% of the time to decompress the entire image, while being ~17% bigger than XZ. Shaving off 17% is definitely signifficant, especially considering that in absolute numbers it is in the 100MB range, but it clearly comes at a substential performance cost. Also interesting are the results for the LZO compressor. Its compression speed is between gzip and LZMA, decompression speed is about 50% of gzip, and only a little bit worse than zstd, but its compression ratio is the second worst only after LZ4, which beats it by a factor of 5 in decompression speed and by ~60 in compression speed. Concluding, for applications where a good compression ratio is most imporant, XZ is obviously the best choice, but if speed is favoured, zstd is probably a very good option to go with. LZ4 is much faster, but has a lot worse compression ratio. It is probably best suited as transparent compression for a read/write file system or network protocols. Finally, it should be noted, that this serial decompression benchmark is not representative of a real-life workload where only a small set of files are accessed in a random access fashion. In that case, a caching layer can largely mitigate the decompression cost, translating it into an initial or only occasionally occouring cache miss latency. But this benchmark should in theory give an approximate idea how those cache miss latencies are expected to compare between the different compressors.