Diffstat (limited to 'doc')
-rw-r--r--  doc/benchmark.txt    137
-rw-r--r--  doc/parallelism.txt  131
2 files changed, 137 insertions, 131 deletions
diff --git a/doc/benchmark.txt b/doc/benchmark.txt
new file mode 100644
index 0000000..ed44cb3
--- /dev/null
+++ b/doc/benchmark.txt
@@ -0,0 +1,137 @@
+
+ 1) Parallel Compression Benchmark
+ *********************************
+
+ 1.1) How was the Benchmark Performed?
+
+ An optimized build of squashfs-tools-ng was compiled and installed to a tmpfs:
+
+ $ mkdir /dev/shm/temp
+ $ ln -s /dev/shm/temp out
+ $ ./autogen.sh
+ $ ./configure CFLAGS="-O3 -Ofast -march=native -mtune=native" \
+ LDFLAGS="-O3 -Ofast" --prefix=$(pwd)/out
+ $ make -j install
+ $ cd out
+
+ A SquashFS image to be tested was unpacked in this directory:
+
+ $ ./bin/sqfs2tar <IMAGE> > test.tar
+
+ And then repacked as follows:
+
+ $ time ./bin/tar2sqfs -j <NUM_CPU> -c <COMPRESSOR> -f test.sqfs < test.tar
+
+
+ Out of 4 runs, the worst wall-clock time ("real") was used for comparison.
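+
+ One way the four timed runs might have been scripted (an illustrative
+ sketch only; "-j 8 -c xz" stands in for the actual <NUM_CPU> and
+ <COMPRESSOR> values):
+
+ $ for i in 1 2 3 4; do
+ >     time ./bin/tar2sqfs -j 8 -c xz -f test.sqfs < test.tar
+ > done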
+
+
+ For the serial reference version, configure was re-run with the option
+ --without-pthread, and the tools were re-compiled and re-installed.
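+
+ That is, roughly the following (a sketch; same flags as the build above,
+ plus the --without-pthread option):
+
+ $ ./configure CFLAGS="-O3 -Ofast -march=native -mtune=native" \
+               LDFLAGS="-O3 -Ofast" --without-pthread --prefix=$(pwd)/out
+ $ make -j install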
+
+
+ 1.2) What Image was Tested?
+
+ A Debian image extracted from the Debian 10.2 LiveDVD for AMD64 with XFCE
+ was used.
+
+ The input size and resulting output sizes turned out to be as follows:
+
+ - As uncompressed tarball:            ~6.5GiB (7,008,118,272)
+ - As LZ4 compressed SquashFS image:   ~3.1GiB (3,381,751,808)
+ - As LZO compressed SquashFS image:   ~2.5GiB (2,732,015,616)
+ - As zstd compressed SquashFS image:  ~2.1GiB (2,295,017,472)
+ - As gzip compressed SquashFS image:  ~2.3GiB (2,471,276,544)
+ - As lzma compressed SquashFS image:  ~2.0GiB (2,102,169,600)
+ - As XZ compressed SquashFS image:    ~2.0GiB (2,098,466,816)
+
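+ As a point of reference, dividing the output size by the input size gives
+ the overall compression ratio, e.g. for the XZ image:
+
+   2,098,466,816 / 7,008,118,272 = ~0.30, i.e. roughly 30% of the input size.
+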
+
+ The Debian image is expected to contain realistic input data for a Linux
+ file system and also provide enough data for an interesting benchmark.
+
+
+ 1.3) What Test System was Used?
+
+ AMD Ryzen 7 3700X
+ 32GiB DDR4 RAM
+ Fedora 31
+
+
+ 1.4) What Software Version was Used?
+
+ squashfs-tools-ng v0.9
+
+ TODO: update data and write the *exact* commit hash here.
+
+
+ 1.5) Results
+
+ The raw timing results are as follows:
+
+ Jobs            XZ        lzma        gzip         LZO         LZ4        zstd
+ serial  17m39.613s  16m10.710s   9m56.606s  13m22.337s     12.159s   9m33.600s
+ 1       17m38.050s  15m49.753s   9m46.948s  13m06.705s     11.908s   9m23.445s
+ 2        9m26.712s   8m24.706s   5m08.152s   6m53.872s      7.395s   5m01.734s
+ 3        6m29.733s   5m47.422s   3m33.235s   4m44.407s      6.069s   3m30.708s
+ 4        5m02.993s   4m30.361s   2m43.447s   3m39.825s      5.864s   2m44.418s
+ 5        4m07.959s   3m40.860s   2m13.454s   2m59.395s      5.749s   2m16.745s
+ 6        3m30.514s   3m07.816s   1m53.641s   2m32.461s      5.926s   1m57.607s
+ 7        3m04.009s   2m43.765s   1m39.742s   2m12.536s      6.281s   1m43.734s
+ 8        2m45.050s   2m26.996s   1m28.776s   1m58.253s      6.395s   1m34.500s
+ 9        2m34.993s   2m18.868s   1m21.668s   1m50.461s      6.890s   1m29.820s
+ 10       2m27.399s   2m11.214s   1m15.461s   1m44.060s      7.225s   1m26.176s
+ 11       2m20.068s   2m04.592s   1m10.286s   1m37.749s      7.557s   1m22.566s
+ 12       2m13.131s   1m58.710s   1m05.957s   1m32.596s      8.127s   1m18.883s
+ 13       2m07.472s   1m53.481s   1m02.041s   1m27.982s      8.704s   1m16.218s
+ 14       2m02.365s   1m48.773s   1m00.337s   1m24.444s      9.494s   1m14.175s
+ 15       1m58.298s   1m45.079s     58.348s   1m21.445s     10.192s   1m12.134s
+ 16       1m55.940s   1m42.176s     56.615s   1m19.030s     10.964s   1m11.049s
+
+ The file "benchmark.ods" contains these values, values derived from them,
+ and charts depicting the results.
+
+
+ 1.6) Discussion
+
+ Most obviously, the results indicate that LZ4, unlike the other compressors,
+ is clearly I/O bound rather than CPU bound. It does not benefit from
+ parallelization beyond 2-4 worker threads, and even that benefit is
+ marginal, with efficiency plummeting immediately.
+
+
+ The other compressors are clearly CPU bound. Speedup increases linearly up
+ to about 8 cores, but with a slope < 1, as evidenced by efficiency
+ decreasing linearly and reaching ~80% at 8 cores.
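+
+ For illustration, taking the XZ column: the serial run takes 17m39.613s
+ (~1059.6s) and the 8-job run 2m45.050s (~165.1s), giving a speedup of
+ 1059.6 / 165.1 = ~6.42 and thus an efficiency of 6.42 / 8 = ~0.80.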
+
+ A reason for this sub-linear scaling may be the choke point introduced by
+ the creation of fragment blocks, which *requires* synchronization. To test
+ this theory, a second benchmark should be performed with fragment block
+ generation completely disabled. This requires a new flag to be added to
+ tar2sqfs (and also gensquashfs).
+
+
+ Using more than 8 jobs yields a much slower increase in speedup, and
+ efficiency declines even faster. This is most likely because the test
+ system only has 8 physical cores; beyond that, SMT has to be used.
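+
+ The core topology of the test machine can be verified with lscpu (not part
+ of the original benchmark, merely a sanity check of the 8-core/16-thread
+ assumption):
+
+ $ lscpu | grep -E '(Core|Thread|Socket)'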
+
+
+ It should also be noted that the thread pool compressor with only a single
+ thread turns out to be *slightly* faster than the serial reference
+ implementation. A possible explanation is that the fragment blocks are
+ actually assembled in the main thread, in parallel to the worker thread,
+ which can continue with other data blocks. Because of this decoupling,
+ there is in fact some degree of parallelism, even if only one worker
+ thread is used.
+
+
+ As a side effect, this benchmark also provides some insight into the
+ compression ratio and throughput of the supported compressors: for the
+ Debian live image, XZ clearly provides the highest data density, while
+ LZ4 is clearly the fastest compressor available.
+
+ The throughput of the zstd compressor is comparable to gzip, while the
+ resulting compression ratio is closer to LZMA.
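+
+ Both observations can be read directly off the data above: the serial zstd
+ run takes 9m33.600s versus 9m56.606s for gzip, while the zstd image
+ (2,295,017,472 bytes) is much closer in size to the lzma image
+ (2,102,169,600 bytes) than to the gzip image (2,471,276,544 bytes).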
+
+ Repeating the benchmark without tail-end packing, and with fragments
+ disabled entirely, would as a side effect also demonstrate how effective
+ those two mechanisms are.
diff --git a/doc/parallelism.txt b/doc/parallelism.txt
index 315a631..ca18add 100644
--- a/doc/parallelism.txt
+++ b/doc/parallelism.txt
@@ -107,134 +107,3 @@
add a lot for I/O bound compressors like zstd.
If you have a better idea how to do this, please let me know.
-
-
- 2) Benchmarks
- *************
-
- 2.1) How was the Benchmark Performed?
-
- An optimized build of squashfs-tools-ng was compiled and installed to a tmpfs:
-
- $ mkdir /dev/shm/temp
- $ ln -s /dev/shm/temp out
- $ ./autogen.sh
- $ ./configure CFLAGS="-O3 -Ofast -march=native -mtune=native" \
- LDFLAGS="-O3 -Ofast" --prefix=$(pwd)/out
- $ make -j install
- $ cd out
-
- A SquashFS image to be tested was unpacked in this directory:
-
- $ ./bin/sqfs2tar <IMAGE> > test.tar
-
- And then repacked as follows:
-
- $ time ./bin/tar2sqfs -j <NUM_CPU> -c <COMPRESSOR> -f test.sqfs < test.tar
-
-
- Out of 4 runs, the worst wall-clock time ("real") was used for comparison.
-
-
- For the serial reference version, configure was re-run with the option
- --without-pthread, the tools re-compiled and re-installed.
-
-
- 2.2) What Image was Tested?
-
- A Debian image extracted from the Debian 10.2 LiveDVD for AMD64 with XFCE
- was used.
-
- The input size and resulting output sizes turned out to be as follows:
-
- - As uncompressed tarball: ~6.5GiB (7,008,118,272)
- - As LZ4 compressed SquashFS image: ~3.1GiB (3,381,751,808)
- - As LZO compressed SquashFS image: ~2.5GiB (2,732,015,616)
- - As zstd compressed SquashFS image: ~2.1GiB (2,295,017,472)
- - As gzip compressed SquashFS image: ~2.3GiB (2,471,276,544)
- - As lzma compressed SquashFS image: ~2.0GiB (2,102,169,600)
- - As XZ compressed SquashFS image: ~2.0GiB (2,098,466,816)
-
-
- The Debian image is expected to contain realistic input data for a Linux
- file system and also provide enough data for an interesting benchmark.
-
-
- 2.3) What Test System was used?
-
- AMD Ryzen 7 3700X
- 32GiB DDR4 RAM
- Fedora 31
-
-
- 2.4) Results
-
- The raw timing results are as follows:
-
- Jobs XZ lzma gzip LZO LZ4 zstd
- serial 17m39.613s 16m10.710s 9m56.606s 13m22.337s 12.159s 9m33.600s
- 1 17m38.050s 15m49.753s 9m46.948s 13m06.705s 11.908s 9m23.445s
- 2 9m26.712s 8m24.706s 5m08.152s 6m53.872s 7.395s 5m 1.734s
- 3 6m29.733s 5m47.422s 3m33.235s 4m44.407s 6.069s 3m30.708s
- 4 5m02.993s 4m30.361s 2m43.447s 3m39.825s 5.864s 2m44.418s
- 5 4m07.959s 3m40.860s 2m13.454s 2m59.395s 5.749s 2m16.745s
- 6 3m30.514s 3m07.816s 1m53.641s 2m32.461s 5.926s 1m57.607s
- 7 3m04.009s 2m43.765s 1m39.742s 2m12.536s 6.281s 1m43.734s
- 8 2m45.050s 2m26.996s 1m28.776s 1m58.253s 6.395s 1m34.500s
- 9 2m34.993s 2m18.868s 1m21.668s 1m50.461s 6.890s 1m29.820s
- 10 2m27.399s 2m11.214s 1m15.461s 1m44.060s 7.225s 1m26.176s
- 11 2m20.068s 2m04.592s 1m10.286s 1m37.749s 7.557s 1m22.566s
- 12 2m13.131s 1m58.710s 1m05.957s 1m32.596s 8.127s 1m18.883s
- 13 2m07.472s 1m53.481s 1m02.041s 1m27.982s 8.704s 1m16.218s
- 14 2m02.365s 1m48.773s 1m00.337s 1m24.444s 9.494s 1m14.175s
- 15 1m58.298s 1m45.079s 58.348s 1m21.445s 10.192s 1m12.134s
- 16 1m55.940s 1m42.176s 56.615s 1m19.030s 10.964s 1m11.049s
-
- The file "benchmark.ods" contains those values, values derived from this and
- charts depicting the results.
-
-
- 2.5) Discussion
-
- Most obviously, the results indicate that LZ4, unlike the other compressors,
- is clearly I/O bound and not CPU bound and doesn't benefit from parallelization
- beyond 2-4 worker threads and even that benefit is marginal with efficiency
- plummetting immediately.
-
-
- The other compressors are clearly CPU bound. Speedup increases linearly until
- about 8 cores, but with a slope < 1, as evident by efficiency linearly
- decreasing and reaching 80% for 8 cores.
-
- A reason for this sub-linear scaling may be the choke point introduced by the
- creation of fragment blocks, that *requires* a synchronization. To test this
- theory, a second benchmark should be performed with fragment block generation
- completely disabled. This requires a new flag to be added to tar2sqfs (and
- also gensquashfs).
-
-
- Using more than 8 jobs causes a much slower increase in speedup and efficency
- declines even faster. This is probably due to the fact that the test system
- only has 8 physical cores and beyond that, SMT has to be used.
-
-
- It should also be noted that the thread pool compressor with only a single
- thread turns out to be *slightly* faster than the serial reference
- implementation. A possible explanation for this might be that the fragment
- blocks are actually assembled in the main thread, in parallel to the worker
- that can still continue with other data blocks. Because of this decoupling
- there is in fact some degree of parallelism, even if only one worker thread
- is used.
-
-
- As a side effect, this benchmark also produces some insights into the
- compression ratio and throughput of the supported compressors. Indicating that
- for the Debian live image, XZ clearly provides the highest data density, while
- LZ4 is clearly the fastest compressor available.
-
- The throughput of the zstd compressor is comparable to gzip, while the
- resulting compression ratio is closer to LZMA.
-
- Repeating the benchmark without tail-end-packing and with fragments completely
- disabled would also show the effectiveness of tail-end-packing and fragment
- packing as a side effect.