From 105b454b944e24cf1e033d1b7211cadabbcfe7df Mon Sep 17 00:00:00 2001
From: David Oberhollenzer <david.oberhollenzer@sigma-star.at>
Date: Sat, 31 Oct 2020 21:19:07 +0100
Subject: Backport changes to the benchmark writeup

Signed-off-by: David Oberhollenzer <david.oberhollenzer@sigma-star.at>
---
 doc/benchmark.ods   | Bin 53760 -> 58458 bytes
 doc/benchmark.txt   | 277 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 doc/parallelism.txt | 131 -------------------------
 3 files changed, 277 insertions(+), 131 deletions(-)
 create mode 100644 doc/benchmark.txt
diff --git a/doc/benchmark.ods b/doc/benchmark.ods
index 62ee480..167d323 100644
Binary files a/doc/benchmark.ods and b/doc/benchmark.ods differ
diff --git a/doc/benchmark.txt b/doc/benchmark.txt
new file mode 100644
index 0000000..4b5e01e
--- /dev/null
+++ b/doc/benchmark.txt
@@ -0,0 +1,277 @@
+
+ 1) Test Setup
+ *************
+
+ The tests were performed an a system with the following specifications:
+
+  AMD Ryzen 7 3700X
+  32GiB DDR4 RAM
+  Fedora 32
+
+
+ An optimized build of squashfs-tools-ng was compiled and installed to a tmpfs:
+
+  $ mkdir /dev/shm/temp
+  $ ln -s /dev/shm/temp out
+  $ ./autogen.sh
+  $ ./configure CFLAGS="-O3 -Ofast -march=native -mtune=native" \
+                LDFLAGS="-O3 -Ofast" --prefix=$(pwd)/out
+  $ make -j install
+  $ cd out
+
+
+ This was done to eliminate any influence of I/O performance and I/O caching
+ side effects to the extend possible and only measure the actual processing
+ time.
+
+
+ For all benchmark tests, a Debian image extracted from the Debian 10.2 LiveDVD
+ for AMD64 with XFCE was used.
+
+ The Debian image is expected to contain realistic input data for a Linux
+ file system and also provide enough data for an interesting benchmark.
+
+
+ For all performed benchmarks, graphical representations of the results and
+ derived values can be seen in "benchmark.ods".
+
+
+ 1) Parallel Compression Benchmark
+ *********************************
+
+ 1.1) What was measured?
+
+ The Debian image was first converted to a tarball:
+
+  $ ./bin/sqfs2tar debian.sqfs > test.tar
+
+ The tarball was then repacked and time was measured as follows:
+
+  $ time ./bin/tar2sqfs -j <NUM_CPU> -c <COMPRESSOR> -f test.sqfs < test.tar
+
+
+ The repacking was repeated 4 times and the worst wall-clock time ("real") was
+ used for comparison.
+
+ Altough not relevant for this benchmark, the resulting image sizes where
+ for a specific compressor, so that the compression ratio could be estimated:
+
+  $ stat test.tar
+  $ stat test.sqfs
+
+
+
+ The <NUM_CPU> was varied from 1 to 16 and for <COMPRESSOR>, all available
+ compressors were used. All possible combinations <NUM_CPU> and <COMPRESSOR>
+ were measured.
+
+ In addition, a serial reference version was compiled by running configure
+ with the additional option --without-pthread and re-running the tests for
+ all compressors without the <NUM_CPU> option.
+
+
+ 1.2) What was computed from the results?
+
+ The relative and absolute speedup were determined as follows:
+
+                                     runtime_parallel(compressor, num_cpu)
+   spedup_rel(compressor, num_cpu) = -------------------------------------
+                                        runtime_parallel(compressor, 1)
+
+                                     runtime_parallel(compressor, num_cpu)
+   spedup_abs(compressor, num_cpu) = -------------------------------------
+                                           runtime_serial(compressor)
+
+
+ In addition, relative and absolute efficiency of the parellel implementation
+ was determined:
+
+                                         speedup_rel(compressor, num_cpu)
+   efficiency_rel(compressor, num_cpu) = --------------------------------
+                                                      num_cpu
+
+                                         speedup_abs(compressor, num_cpu)
+   efficiency_abs(compressor, num_cpu) = --------------------------------
+                                                      num_cpu
+
+
+ Furthermore, altough not relevant for this specific benchmark, having the
+ converted tarballs available, the compression ratio was computed as follows:
+
+                                    file_size(tarball)
+   compression_ratio(compressor) = ---------------------
+                                   file_size(compressor)
+
+
+ 1.3) What software versions were used?
+
+ squashfs-tools-ng v0.9
+
+ TODO: update data and write the *exact* commit hash here, as well as gcc and
+ Linux versions.
+
+
+ 1.4) Results
+
+ The raw timing results are as follows:
+
+ Jobs    XZ          lzma        gzip        LZO         LZ4      zstd
+ serial  17m39.613s  16m10.710s   9m56.606s  13m22.337s  12.159s  9m33.600s
+      1  17m38.050s  15m49.753s   9m46.948s  13m06.705s  11.908s  9m23.445s
+      2   9m26.712s   8m24.706s   5m08.152s   6m53.872s   7.395s  5m 1.734s
+      3   6m29.733s   5m47.422s   3m33.235s   4m44.407s   6.069s  3m30.708s
+      4   5m02.993s   4m30.361s   2m43.447s   3m39.825s   5.864s  2m44.418s
+      5   4m07.959s   3m40.860s   2m13.454s   2m59.395s   5.749s  2m16.745s
+      6   3m30.514s   3m07.816s   1m53.641s   2m32.461s   5.926s  1m57.607s
+      7   3m04.009s   2m43.765s   1m39.742s   2m12.536s   6.281s  1m43.734s
+      8   2m45.050s   2m26.996s   1m28.776s   1m58.253s   6.395s  1m34.500s
+      9   2m34.993s   2m18.868s   1m21.668s   1m50.461s   6.890s  1m29.820s
+     10   2m27.399s   2m11.214s   1m15.461s   1m44.060s   7.225s  1m26.176s
+     11   2m20.068s   2m04.592s   1m10.286s   1m37.749s   7.557s  1m22.566s
+     12   2m13.131s   1m58.710s   1m05.957s   1m32.596s   8.127s  1m18.883s
+     13   2m07.472s   1m53.481s   1m02.041s   1m27.982s   8.704s  1m16.218s
+     14   2m02.365s   1m48.773s   1m00.337s   1m24.444s   9.494s  1m14.175s
+     15   1m58.298s   1m45.079s     58.348s   1m21.445s  10.192s  1m12.134s
+     16   1m55.940s   1m42.176s     56.615s   1m19.030s  10.964s  1m11.049s
+
+
+ The sizes of the tarball and the resulting images:
+
+  - LZ4 compressed SquashFS image:  ~3.1GiB (3,381,751,808)
+  - LZO compressed SquashFS image:  ~2.5GiB (2,732,015,616)
+  - zstd compressed SquashFS image: ~2.1GiB (2,295,017,472)
+  - gzip compressed SquashFS image: ~2.3GiB (2,471,276,544)
+  - lzma compressed SquashFS image: ~2.0GiB (2,102,169,600)
+  - XZ compressed SquashFS image:   ~2.0GiB (2,098,466,816)
+  - raw tarball:                    ~6.5GiB (7,008,118,272)
+
+
+
+ 1.5) Discussion
+
+ Most obviously, the results indicate that LZ4, unlike the other compressors,
+ is clearly I/O bound and not CPU bound and doesn't benefit from parallelization
+ beyond 2-4 worker threads and even that benefit is marginal with efficiency
+ plummetting immediately.
+
+
+ The other compressors are clearly CPU bound. Speedup increases linearly until
+ about 8 cores, but with a slope < 1, as evident by efficiency linearly
+ decreasing and reaching 80% for 8 cores.
+
+ A reason for this sub-linear scaling may be the choke point introduced by the
+ creation of fragment blocks, that *requires* a synchronization. To test this
+ theory, a second benchmark should be performed with fragment block generation
+ completely disabled. This requires a new flag to be added to tar2sqfs (and
+ also gensquashfs).
+
+
+ Using more than 8 jobs causes a much slower increase in speedup and efficency
+ declines even faster. This is probably due to the fact that the test system
+ only has 8 physical cores and beyond that, SMT has to be used.
+
+
+ It should also be noted that the thread pool compressor with only a single
+ thread turns out to be *slightly* faster than the serial reference
+ implementation. A possible explanation for this might be that the fragment
+ blocks are actually assembled in the main thread, in parallel to the worker
+ that can still continue with other data blocks. Because of this decoupling
+ there is in fact some degree of parallelism, even if only one worker thread
+ is used.
+
+
+ As a side effect, this benchmark also produces some insights into the
+ compression ratio and throughput of the supported compressors. Indicating that
+ for the Debian live image, XZ clearly provides the highest data density, while
+ LZ4 is clearly the fastest compressor available.
+
+ The throughput of the zstd compressor is comparable to gzip, while the
+ resulting compression ratio is closer to LZMA.
+
+ Repeating the benchmark without tail-end-packing and with fragments completely
+ disabled would also show the effectiveness of tail-end-packing and fragment
+ packing as a side effect.
+
+
+ 2) Reference Decompression Benchmark
+ ************************************
+
+ 1.1) What was measured?
+
+ A SquashFS image was generated for each supported compressor:
+
+  $ ./bin/sqfs2tar debian.sqfs | ./bin/tar2sqfs -c <COMPRESSOR> test.sqfs
+
+ And then, for each compressor, the unpacking time was measured:
+
+  $ time ./bin/sqfs2tar test.sqfs > /dev/null
+
+
+ The unpacking step was repeated 4 times and the worst wall-clock time ("real")
+ was used for comparison.
+
+
+ 2.2) What software version was used?
+
+ squashfs-tools-ng commit cc1141984a03da003e15ff229d3b417f8e5a24ad
+
+ gcc version: 10.2.1 20201016 (Red Hat 10.2.1-6)
+ Linux version: 5.8.16-200.fc32.x86_64
+
+
+ 2.3) Results
+
+ gzip    20.466s
+ lz4      2.519s
+ lzma  1m58.455s
+ lzo     10.521s
+ xz    1m59.451s
+ zstd     7.833s
+
+
+ 2.4) Discussion
+
+ From the measurement, it becomes obvious that LZ4 and zstd are the two fastest
+ decompressors. Zstd is particularly noteworth here, because it is not far
+ behind LZ4 in speed, but also achievs a substantially better compression ratio
+ that is somewhere between gzip and lzma. LZ4, despite being the fastest in
+ decompression and beating the others in compression speed by orders of
+ magnitudes, has by far the worst compression ratio.
+
+ It should be noted that the actual number of actually compressed blocks has not
+ been determined. A worse compression ratio can lead to more blocks being stored
+ uncompressed, reducing the workload and thus affecting decompression time.
+
+ However, since zstd has a better compression ratio than gzip, takes only 30% of
+ the time to decompress, and in the serial compression benchmark only takes 2%
+ of the time to compress, we cane safely say that in this benchmark, zstd beats
+ gzip by every metric.
+
+ Furthermore, while XZ stands out as the compressor with the best compression
+ ratio, zstd only takes ~6% of the time to decompress the entire image, while
+ being ~17% bigger than XZ. Shaving off 17% is definitely signifficant,
+ especially considering that in absolute numbers it is in the 100MB range, but
+ it clearly comes at a substential performance cost.
+
+
+ Also interesting are the results for the LZO compressor. Its compression speed
+ is between gzip and LZMA, decompression speed is about 50% of gzip, and only a
+ little bit worse than zstd, but its compression ratio is the second worst only
+ after LZ4, which beats it by a factor of 5 in decompression speed and by ~60
+ in compression speed.
+
+
+ Concluding, for applications where a good compression ratio is most imporant,
+ XZ is obviously the best choice, but if speed is favoured, zstd is probably a
+ very good option to go with. LZ4 is much faster, but has a lot worse
+ compression ratio. It is probably best suited as transparent compression for a
+ read/write file system or network protocols.
+
+
+ Finally, it should be noted, that this serial decompression benchmark is not
+ representative of a real-life workload where only a small set of files are
+ accessed in a random access fashion. In that case, a caching layer can largely
+ mitigate the decompression cost, translating it into an initial or only
+ occasionally occouring cache miss latency. But this benchmark should in theory
+ give an approximate idea how those cache miss latencies are expected to
+ compare between the different compressors.
diff --git a/doc/parallelism.txt b/doc/parallelism.txt
index 315a631..ca18add 100644
--- a/doc/parallelism.txt
+++ b/doc/parallelism.txt
@@ -107,134 +107,3 @@
  add a lot for I/O bound compressors like zstd.
 
  If you have a better idea how to do this, please let me know.
-
-
- 2) Benchmarks
- *************
-
- 2.1) How was the Benchmark Performed?
-
- An optimized build of squashfs-tools-ng was compiled and installed to a tmpfs:
-
-  $ mkdir /dev/shm/temp
-  $ ln -s /dev/shm/temp out
-  $ ./autogen.sh
-  $ ./configure CFLAGS="-O3 -Ofast -march=native -mtune=native" \
-                LDFLAGS="-O3 -Ofast" --prefix=$(pwd)/out
-  $ make -j install
-  $ cd out
-
- A SquashFS image to be tested was unpacked in this directory:
-
-  $ ./bin/sqfs2tar <IMAGE> > test.tar
-
- And then repacked as follows:
-
-  $ time ./bin/tar2sqfs -j <NUM_CPU> -c <COMPRESSOR> -f test.sqfs < test.tar
-
-
- Out of 4 runs, the worst wall-clock time ("real") was used for comparison.
-
-
- For the serial reference version, configure was re-run with the option
- --without-pthread, the tools re-compiled and re-installed.
-
-
- 2.2) What Image was Tested?
-
- A Debian image extracted from the Debian 10.2 LiveDVD for AMD64 with XFCE
- was used.
-
- The input size and resulting output sizes turned out to be as follows:
-
-  - As uncompressed tarball:           ~6.5GiB (7,008,118,272)
-  - As LZ4 compressed SquashFS image:  ~3.1GiB (3,381,751,808)
-  - As LZO compressed SquashFS image:  ~2.5GiB (2,732,015,616)
-  - As zstd compressed SquashFS image: ~2.1GiB (2,295,017,472)
-  - As gzip compressed SquashFS image: ~2.3GiB (2,471,276,544)
-  - As lzma compressed SquashFS image: ~2.0GiB (2,102,169,600)
-  - As XZ compressed SquashFS image:   ~2.0GiB (2,098,466,816)
-
-
- The Debian image is expected to contain realistic input data for a Linux
- file system and also provide enough data for an interesting benchmark.
-
-
- 2.3) What Test System was used?
-
-  AMD Ryzen 7 3700X
-  32GiB DDR4 RAM
-  Fedora 31
-
-
- 2.4) Results
-
- The raw timing results are as follows:
-
- Jobs    XZ          lzma        gzip        LZO         LZ4      zstd
- serial  17m39.613s  16m10.710s   9m56.606s  13m22.337s  12.159s  9m33.600s
-      1  17m38.050s  15m49.753s   9m46.948s  13m06.705s  11.908s  9m23.445s
-      2   9m26.712s   8m24.706s   5m08.152s   6m53.872s   7.395s  5m 1.734s
-      3   6m29.733s   5m47.422s   3m33.235s   4m44.407s   6.069s  3m30.708s
-      4   5m02.993s   4m30.361s   2m43.447s   3m39.825s   5.864s  2m44.418s
-      5   4m07.959s   3m40.860s   2m13.454s   2m59.395s   5.749s  2m16.745s
-      6   3m30.514s   3m07.816s   1m53.641s   2m32.461s   5.926s  1m57.607s
-      7   3m04.009s   2m43.765s   1m39.742s   2m12.536s   6.281s  1m43.734s
-      8   2m45.050s   2m26.996s   1m28.776s   1m58.253s   6.395s  1m34.500s
-      9   2m34.993s   2m18.868s   1m21.668s   1m50.461s   6.890s  1m29.820s
-     10   2m27.399s   2m11.214s   1m15.461s   1m44.060s   7.225s  1m26.176s
-     11   2m20.068s   2m04.592s   1m10.286s   1m37.749s   7.557s  1m22.566s
-     12   2m13.131s   1m58.710s   1m05.957s   1m32.596s   8.127s  1m18.883s
-     13   2m07.472s   1m53.481s   1m02.041s   1m27.982s   8.704s  1m16.218s
-     14   2m02.365s   1m48.773s   1m00.337s   1m24.444s   9.494s  1m14.175s
-     15   1m58.298s   1m45.079s     58.348s   1m21.445s  10.192s  1m12.134s
-     16   1m55.940s   1m42.176s     56.615s   1m19.030s  10.964s  1m11.049s
-
- The file "benchmark.ods" contains those values, values derived from this and
- charts depicting the results.
-
-
- 2.5) Discussion
-
- Most obviously, the results indicate that LZ4, unlike the other compressors,
- is clearly I/O bound and not CPU bound and doesn't benefit from parallelization
- beyond 2-4 worker threads and even that benefit is marginal with efficiency
- plummetting immediately.
-
-
- The other compressors are clearly CPU bound. Speedup increases linearly until
- about 8 cores, but with a slope < 1, as evident by efficiency linearly
- decreasing and reaching 80% for 8 cores.
-
- A reason for this sub-linear scaling may be the choke point introduced by the
- creation of fragment blocks, that *requires* a synchronization. To test this
- theory, a second benchmark should be performed with fragment block generation
- completely disabled. This requires a new flag to be added to tar2sqfs (and
- also gensquashfs).
-
-
- Using more than 8 jobs causes a much slower increase in speedup and efficency
- declines even faster. This is probably due to the fact that the test system
- only has 8 physical cores and beyond that, SMT has to be used.
-
-
- It should also be noted that the thread pool compressor with only a single
- thread turns out to be *slightly* faster than the serial reference
- implementation. A possible explanation for this might be that the fragment
- blocks are actually assembled in the main thread, in parallel to the worker
- that can still continue with other data blocks. Because of this decoupling
- there is in fact some degree of parallelism, even if only one worker thread
- is used.
-
-
- As a side effect, this benchmark also produces some insights into the
- compression ratio and throughput of the supported compressors. Indicating that
- for the Debian live image, XZ clearly provides the highest data density, while
- LZ4 is clearly the fastest compressor available.
-
- The throughput of the zstd compressor is comparable to gzip, while the
- resulting compression ratio is closer to LZMA.
-
- Repeating the benchmark without tail-end-packing and with fragments completely
- disabled would also show the effectiveness of tail-end-packing and fragment
- packing as a side effect.
-- 
cgit v1.2.3