1) Test Setup
 *************

 The tests were performed an a system with the following specifications:

  AMD Ryzen 7 3700X
  32GiB DDR4 RAM
  Fedora 33

 The following gcc versions of GCC and Linux were used:

  gcc (GCC) 10.2.1 20201125 (Red Hat 10.2.1-9)
  Linux 5.11.9-200.fc33.x86_64

 The following squashfs-tools-ng commit was tested:

  7d2b3b077d7e204e64a1c57845524250c5b4a142


 An optimized build of squashfs-tools-ng was compiled and installed to a tmpfs:

  $ mkdir /dev/shm/temp
  $ ln -s /dev/shm/temp out
  $ ./autogen.sh
  $ ./configure CFLAGS="-O3 -Ofast -march=native -mtune=native" \
                LDFLAGS="-O3 -Ofast" --prefix=$(pwd)/out
  $ make -j install-strip
  $ cd out


 Working in a tmpfs was done to eliminate any influence of I/O performance and
 I/O caching side effects to the extend possible and only measure the actual
 processing time.


 For all benchmark tests, a Debian image extracted from the Debian 10.2 LiveDVD
 for AMD64 with XFCE was used.

 The Debian image is expected to contain realistic input data for a Linux
 file system and also provide enough data for an interesting benchmark.


 For all performed benchmarks, graphical representations of the results and
 derived values can be seen in "benchmark.ods".


 1) Parallel Compression Benchmark
 *********************************

 1.1) What was measured?

 The Debian image was first converted to a tarball:

  $ ./bin/sqfs2tar debian.sqfs > test.tar

 The tarball was then repacked and time was measured as follows:

  $ time -p ./bin/tar2sqfs -j <NUM_CPU> -c <COMPRESSOR> -f test.sqfs < test.tar


 The repacking was repeated 4 times and the worst wall-clock time ("real") was
 used for comparison.

 The <NUM_CPU> was varied from 1 to 16 and for <COMPRESSOR>, all available
 compressors were used. All possible combinations <NUM_CPU> and <COMPRESSOR>
 were measured.

 In addition, a serial reference version was compiled by running configure
 with the additional option --without-pthread and re-running the tests for
 all compressors without the <NUM_CPU> option.


 In addition to the existing compressors, the LZO compressor in libcommon.a was
 briefly patched to not perform any compression at all. This way, a baseline
 comparison was established for a completely uncompressed SquashFS image.


 1.2) What was computed from the results?

 The relative and absolute speedup were determined as follows:

                                     runtime_parallel(compressor, num_cpu)
   spedup_rel(compressor, num_cpu) = -------------------------------------
                                        runtime_parallel(compressor, 1)

                                     runtime_parallel(compressor, num_cpu)
   spedup_abs(compressor, num_cpu) = -------------------------------------
                                           runtime_serial(compressor)


 In addition, relative and absolute efficiency of the parallel implementation
 were determined:

                                         speedup_rel(compressor, num_cpu)
   efficiency_rel(compressor, num_cpu) = --------------------------------
                                                      num_cpu

                                         speedup_abs(compressor, num_cpu)
   efficiency_abs(compressor, num_cpu) = --------------------------------
                                                      num_cpu


 Furthermore, although not relevant for this specific benchmark, having the
 converted tarballs available, the compression ratio was computed as follows:

                                      size(tarball)
   max_throughput(compressor) = --------------------------
                                 min(runtime(compressor))


 1.4) Results

 The raw timing results are as follows:

 Jobs    XZ        lzma     gzip     LZO      LZ4     zstd     none
 serial  1108.39s  995.43s  609.79s  753.14s  13.58s  550.59s  5.86s
      1  1116.06s  990.33s  598.85s  753.53s  11.25s  550.37s  4.23s
      2   591.21s  536.61s  312.14s  394.21s   6.41s  294.12s  4.13s
      3   415.90s  370.48s  215.92s  273.14s   4.84s  205.14s  4.58s
      4   320.02s  288.35s  165.50s  210.32s   4.29s  159.71s  4.62s
      5   263.94s  235.69s  136.28s  172.33s   4.19s  132.27s  4.94s
      6   224.23s  200.63s  116.44s  146.80s   4.28s  112.79s  5.08s
      7   196.78s  176.35s  100.66s  128.61s   4.24s   99.26s  5.43s
      8   175.04s  157.82s   89.79s  113.47s   4.46s   88.22s  5.68s
      9   166.52s  148.88s   83.01s  106.14s   4.64s   84.97s  5.76s
     10   159.35s  141.08s   77.04s   99.92s   4.84s   81.61s  5.94s
     11   151.08s  136.27s   71.52s   94.23s   5.00s   77.51s  6.14s
     12   144.72s  128.91s   67.21s   89.33s   5.28s   74.10s  6.39s
     13   137.91s  122.67s   63.43s   84.39s   5.41s   71.83s  6.51s
     14   132.94s  117.79s   59.45s   80.87s   5.71s   68.86s  6.68s
     15   126.76s  113.51s   56.37s   76.68s   5.74s   65.78s  6.91s
     16   119.06s  107.15s   52.56s   71.49s   6.37s   62.52s  7.10s


 1.5) Discussion

 Most obviously, the results indicate that LZ4, unlike the other compressors,
 is clearly I/O bound and not CPU bound and doesn't benefit from parallelization
 beyond 2-4 worker threads and even that benefit is marginal with efficiency
 plummeting immediately.


 The other compressors are clearly CPU bound. Speedup increases linearly until
 about 8 cores, but with a slope < 1, as evident by efficiency linearly
 decreasing and reaching 80% for 8 cores.

 A reason for this sub-linear scaling may be the choke point introduced by the
 creation of fragment blocks, that *requires* a synchronization. To test this
 theory, a second benchmark should be performed with fragment block generation
 completely disabled. This requires a new flag to be added to tar2sqfs (and
 also gensquashfs).


 Using more than 8 jobs causes a much slower increase in speedup and efficiency
 declines even faster. This is probably due to the fact that the test system
 only has 8 physical cores and beyond that, SMT has to be used.


 It should also be noted that for most of the compressors, as well as the
 uncompressed version, the thread pool compressor with only a single thread
 turns out to be *slightly* faster than the serial reference implementation.
 A possible explanation for this might be that the fragment blocks are actually
 assembled in the main thread, in parallel to the worker that can still
 continue with other data blocks. Because of this decoupling there is in fact
 some degree of parallelism, even if only one worker thread is used. For the
 uncompressed version, the work still done in the thread pool is the hashing of
 blocks and fragments for de-duplication.


 Also of interest are the changes from the previous version of the benchmark,
 performed on v0.9 of squashfs-tools-ng. Since then, the thread pool design has
 been overhauled to spend a lot less time in the critical regions, but to also
 perform byte-for-byte equivalence checks before considering blocks or fragments
 to be identical. This may require a read-back and decompression step in the
 main thread in order to access already written fragment blocks.

 While the overall behavior has stayed the same, performance for XZ & LZMA has
 decreased slightly, whereas performance for the gzip, LZ4 & ZSTD has improved
 slightly. As the decompression benchmark shows, the first two are a lot slower
 at decompression, which needs to be done when reading back a fragment block
 from disk, and due to the higher data density also have a higher chance of
 actually having to decompress a block, so as a net result, the performance
 penalty from exact fragment matching eats all gains from the new thread pool
 design. For the more I/O bound compressors like LZ4 & ZSTD, decompressing a
 block is done much faster and due to the low data density for LZ4, the chance
 of actually having to decompress a block is lowered. As a result, the gains
 from the new thread pool design apparently outweigh the read-back penalty.


 Also noteworthy, due to the inclusion of an uncompressed reference, is that
 the LZ4 compressor is actually very close in performance to the uncompressed
 version, in some cases even outperforming it. This might be due to the fact
 that LZ4 actually does compress blocks, so in many cases where the
 uncompressed version needs to read back a full block during deduplication,
 the LZ4 version only needs to read a considerably smaller amount of data,
 reducing the penalty of having to read back blocks.


 2) Reference Decompression Benchmark
 ************************************

 2.1) What was measured?

 A SquashFS image was generated for each supported compressor:

  $ ./bin/sqfs2tar debian.sqfs | ./bin/tar2sqfs -c <COMPRESSOR> test.sqfs

 And then, for each compressor, the unpacking time was measured:

  $ time -p ./bin/sqfs2tar test.sqfs > /dev/null


 The unpacking step was repeated 4 times and the worst wall-clock time ("real")
 was used for comparison.


 2.2) What was computed from the results?

 The throughput was established by dividing the size of the resulting tarball by
 the time taken to produce it from the image.

 For better comparison, this was also normalized to the throughput of the
 uncompressed SquashFS image.


 2.3) Results

 xz             120.53s
 lzma           118.91s
 gzip            20.57s
 lzo             10.65s
 zstd             7.74s
 lz4              2.59s
 uncompressed     1.42s


 2.4) Discussion

 From the measurement, it becomes obvious that LZ4 and zstd are the two fastest
 decompressors, both being very close to the uncompressed version. Zstd is
 particularly noteworthy here, because it is not far behind LZ4 in speed, but
 also achieves a substantially better compression ratio that is
 between gzip and lzma. LZ4, despite being the fastest in decompression and
 beating the others in compression speed by orders of magnitudes, has by far
 the worst compression ratio.

 It should be noted that the number of actually compressed blocks has not been
 determined. A worse compression ratio can lead to more blocks being stored
 uncompressed, reducing the workload and thus affecting decompression time.

 However, since zstd has a better compression ratio than gzip, takes only 30% of
 the time to decompress, and in the serial compression benchmark only takes 2%
 of the time to compress, we can safely say that in this benchmark, zstd beats
 gzip by every metric.

 Furthermore, while XZ stands out as the compressor with the best compression
 ratio, zstd only takes ~6% of the time to decompress the entire image, while
 being ~17% bigger than XZ. Shaving off 17% is definitely significant,
 especially considering that in absolute numbers it is in the 100MB range, but
 it clearly comes at a substantial performance cost.


 Also interesting are the results for the LZO compressor. Its compression speed
 is between gzip and LZMA, decompression speed is about 50% of gzip, and only a
 little bit worse than zstd, but its compression ratio is the second worst only
 after LZ4, which beats it by a factor of 5 in decompression speed and by ~60
 in compression speed.


 Concluding, for applications where a good compression ratio is most important,
 XZ is obviously the best choice, but if speed is favored, zstd is probably a
 very good option to go with. LZ4 is much faster, but has a lot worse
 compression ratio. It is probably best suited as transparent compression for a
 read/write file system or network protocols.


 Finally, it should be noted, that this serial decompression benchmark is not
 representative of a real-life workload where only a small set of files are
 accessed in a random access fashion. In that case, a caching layer can largely
 mitigate the decompression cost, translating it into an initial or only
 occasionally occurring cache miss latency. But this benchmark should in theory
 give an approximate idea how those cache miss latencies are expected to
 compare between the different compressors.


 3) Compression Size and Overhead Benchmark
 ******************************************

 3.1) What was measured?

 For each compressor, a SquashFS image was created in the way outlined in the
 parallel compression benchmark and the resulting file size was recorded.

 In addition, the raw tarball size was recorded for comparison.


 3.2) What was computed from the results?

 The compression ratio was established as follows:

                        size(compressor)
  ratio(compressor) = --------------------
                       size(uncompressed)

 3.3) Results

               SquashFS                   tar
 Uncompressed  ~6.1GiB (6,542,389,248)    ~6.5GiB (7,008,118,272)
 LZ4           ~3.1GiB (3,381,751,808)
 LZO           ~2.5GiB (2,732,015,616)
 gzip          ~2.3GiB (2,471,276,544)
 zstd          ~2.1GiB (2,295,078,912)
 lzma          ~2.0GiB (2,102,169,600)
 XZ            ~2.0GiB (2,098,466,816)


 3.4) Discussion

 Obviously XZ and lzma achieve the highest data density, shrinking the SquashFS
 image down to less than a third of the input size.

 Noteworthy is also Zstd achieving higher data density than gzip while being
 faster in compression as well as decompression.


 Interestingly, even the uncompressed SquashFS image is still smaller than the
 uncompressed tarball. Obviously SquashFS packs data and meta data more
 efficiently than the tar format, shaving off ~7% in size.