| author | David Oberhollenzer <david.oberhollenzer@sigma-star.at> | 2020-10-27 00:45:19 +0100 |
|---|---|---|
| committer | David Oberhollenzer <david.oberhollenzer@sigma-star.at> | 2020-10-28 13:57:50 +0100 |
| commit | bd9133b1b12c4fa481ff9dd9d0572f363e7bfb59 | |
| tree | 5882ad6f8d19a0874b2083a0d9daf6253d43f33b /doc | |
| parent | ef8a7085e5014662d1ca74bc13e762f5e900bc3f | |
documentation: move benchmark description to separate file
Signed-off-by: David Oberhollenzer <david.oberhollenzer@sigma-star.at>
Diffstat (limited to 'doc')
| -rw-r--r-- | doc/benchmark.txt | 137 |
| -rw-r--r-- | doc/parallelism.txt | 131 |
2 files changed, 137 insertions, 131 deletions
diff --git a/doc/benchmark.txt b/doc/benchmark.txt
new file mode 100644
index 0000000..ed44cb3
--- /dev/null
+++ b/doc/benchmark.txt
@@ -0,0 +1,137 @@
+
+ 1) Parallel Compression Benchmark
+ *********************************
+
+ 1.1) How was the Benchmark Performed?
+
+ An optimized build of squashfs-tools-ng was compiled and installed to a tmpfs:
+
+  $ mkdir /dev/shm/temp
+  $ ln -s /dev/shm/temp out
+  $ ./autogen.sh
+  $ ./configure CFLAGS="-O3 -Ofast -march=native -mtune=native" \
+                LDFLAGS="-O3 -Ofast" --prefix=$(pwd)/out
+  $ make -j install
+  $ cd out
+
+ A SquashFS image to be tested was unpacked in this directory:
+
+  $ ./bin/sqfs2tar <IMAGE> > test.tar
+
+ And then repacked as follows:
+
+  $ time ./bin/tar2sqfs -j <NUM_CPU> -c <COMPRESSOR> -f test.sqfs < test.tar
+
+
+ Out of 4 runs, the worst wall-clock time ("real") was used for comparison.
+
+
+ For the serial reference version, configure was re-run with the option
+ --without-pthread, the tools re-compiled and re-installed.
+
+
+ 1.2) What Image was Tested?
+
+ A Debian image extracted from the Debian 10.2 LiveDVD for AMD64 with XFCE
+ was used.
+
+ The input size and resulting output sizes turned out to be as follows:
+
+  - As uncompressed tarball:           ~6.5GiB (7,008,118,272)
+  - As LZ4 compressed SquashFS image:  ~3.1GiB (3,381,751,808)
+  - As LZO compressed SquashFS image:  ~2.5GiB (2,732,015,616)
+  - As zstd compressed SquashFS image: ~2.1GiB (2,295,017,472)
+  - As gzip compressed SquashFS image: ~2.3GiB (2,471,276,544)
+  - As lzma compressed SquashFS image: ~2.0GiB (2,102,169,600)
+  - As XZ compressed SquashFS image:   ~2.0GiB (2,098,466,816)
+
+
+ The Debian image is expected to contain realistic input data for a Linux
+ file system and also provide enough data for an interesting benchmark.
+
+
+ 1.3) What Test System was used?
+
+  AMD Ryzen 7 3700X
+  32GiB DDR4 RAM
+  Fedora 31
+
+
+ 1.4) What software version was used?
+
+ squashfs-tools-ng v0.9
+
+ TODO: update data and write the *exact* commit hash here.
+
+
+ 1.5) Results
+
+ The raw timing results are as follows:
+
+ Jobs    XZ          lzma        gzip        LZO         LZ4      zstd
+ serial  17m39.613s  16m10.710s   9m56.606s  13m22.337s  12.159s  9m33.600s
+      1  17m38.050s  15m49.753s   9m46.948s  13m06.705s  11.908s  9m23.445s
+      2   9m26.712s   8m24.706s   5m08.152s   6m53.872s   7.395s  5m 1.734s
+      3   6m29.733s   5m47.422s   3m33.235s   4m44.407s   6.069s  3m30.708s
+      4   5m02.993s   4m30.361s   2m43.447s   3m39.825s   5.864s  2m44.418s
+      5   4m07.959s   3m40.860s   2m13.454s   2m59.395s   5.749s  2m16.745s
+      6   3m30.514s   3m07.816s   1m53.641s   2m32.461s   5.926s  1m57.607s
+      7   3m04.009s   2m43.765s   1m39.742s   2m12.536s   6.281s  1m43.734s
+      8   2m45.050s   2m26.996s   1m28.776s   1m58.253s   6.395s  1m34.500s
+      9   2m34.993s   2m18.868s   1m21.668s   1m50.461s   6.890s  1m29.820s
+     10   2m27.399s   2m11.214s   1m15.461s   1m44.060s   7.225s  1m26.176s
+     11   2m20.068s   2m04.592s   1m10.286s   1m37.749s   7.557s  1m22.566s
+     12   2m13.131s   1m58.710s   1m05.957s   1m32.596s   8.127s  1m18.883s
+     13   2m07.472s   1m53.481s   1m02.041s   1m27.982s   8.704s  1m16.218s
+     14   2m02.365s   1m48.773s   1m00.337s   1m24.444s   9.494s  1m14.175s
+     15   1m58.298s   1m45.079s     58.348s   1m21.445s  10.192s  1m12.134s
+     16   1m55.940s   1m42.176s     56.615s   1m19.030s  10.964s  1m11.049s
+
+ The file "benchmark.ods" contains those values, values derived from this and
+ charts depicting the results.
+
+
+ 1.6) Discussion
+
+ Most obviously, the results indicate that LZ4, unlike the other compressors,
+ is clearly I/O bound and not CPU bound and doesn't benefit from parallelization
+ beyond 2-4 worker threads and even that benefit is marginal with efficiency
+ plummetting immediately.
+
+
+ The other compressors are clearly CPU bound. Speedup increases linearly until
+ about 8 cores, but with a slope < 1, as evident by efficiency linearly
+ decreasing and reaching 80% for 8 cores.
+
+ A reason for this sub-linear scaling may be the choke point introduced by the
+ creation of fragment blocks, that *requires* a synchronization. To test this
+ theory, a second benchmark should be performed with fragment block generation
+ completely disabled. This requires a new flag to be added to tar2sqfs (and
+ also gensquashfs).
+
+
+ Using more than 8 jobs causes a much slower increase in speedup and efficency
+ declines even faster. This is probably due to the fact that the test system
+ only has 8 physical cores and beyond that, SMT has to be used.
+
+
+ It should also be noted that the thread pool compressor with only a single
+ thread turns out to be *slightly* faster than the serial reference
+ implementation. A possible explanation for this might be that the fragment
+ blocks are actually assembled in the main thread, in parallel to the worker
+ that can still continue with other data blocks. Because of this decoupling
+ there is in fact some degree of parallelism, even if only one worker thread
+ is used.
+
+
+ As a side effect, this benchmark also produces some insights into the
+ compression ratio and throughput of the supported compressors. Indicating that
+ for the Debian live image, XZ clearly provides the highest data density, while
+ LZ4 is clearly the fastest compressor available.
+
+ The throughput of the zstd compressor is comparable to gzip, while the
+ resulting compression ratio is closer to LZMA.
+
+ Repeating the benchmark without tail-end-packing and with fragments completely
+ disabled would also show the effectiveness of tail-end-packing and fragment
+ packing as a side effect.
diff --git a/doc/parallelism.txt b/doc/parallelism.txt
index 315a631..ca18add 100644
--- a/doc/parallelism.txt
+++ b/doc/parallelism.txt
@@ -107,134 +107,3 @@
  add a lot for I/O bound compressors like zstd.
 
  If you have a better idea how to do this, please let me know.
-
-
- 2) Benchmarks
- *************
-
- 2.1) How was the Benchmark Performed?
-
- An optimized build of squashfs-tools-ng was compiled and installed to a tmpfs:
-
-  $ mkdir /dev/shm/temp
-  $ ln -s /dev/shm/temp out
-  $ ./autogen.sh
-  $ ./configure CFLAGS="-O3 -Ofast -march=native -mtune=native" \
-                LDFLAGS="-O3 -Ofast" --prefix=$(pwd)/out
-  $ make -j install
-  $ cd out
-
- A SquashFS image to be tested was unpacked in this directory:
-
-  $ ./bin/sqfs2tar <IMAGE> > test.tar
-
- And then repacked as follows:
-
-  $ time ./bin/tar2sqfs -j <NUM_CPU> -c <COMPRESSOR> -f test.sqfs < test.tar
-
-
- Out of 4 runs, the worst wall-clock time ("real") was used for comparison.
-
-
- For the serial reference version, configure was re-run with the option
- --without-pthread, the tools re-compiled and re-installed.
-
-
- 2.2) What Image was Tested?
-
- A Debian image extracted from the Debian 10.2 LiveDVD for AMD64 with XFCE
- was used.
-
- The input size and resulting output sizes turned out to be as follows:
-
-  - As uncompressed tarball:           ~6.5GiB (7,008,118,272)
-  - As LZ4 compressed SquashFS image:  ~3.1GiB (3,381,751,808)
-  - As LZO compressed SquashFS image:  ~2.5GiB (2,732,015,616)
-  - As zstd compressed SquashFS image: ~2.1GiB (2,295,017,472)
-  - As gzip compressed SquashFS image: ~2.3GiB (2,471,276,544)
-  - As lzma compressed SquashFS image: ~2.0GiB (2,102,169,600)
-  - As XZ compressed SquashFS image:   ~2.0GiB (2,098,466,816)
-
-
- The Debian image is expected to contain realistic input data for a Linux
- file system and also provide enough data for an interesting benchmark.
-
-
- 2.3) What Test System was used?
-
-  AMD Ryzen 7 3700X
-  32GiB DDR4 RAM
-  Fedora 31
-
-
- 2.4) Results
-
- The raw timing results are as follows:
-
- Jobs    XZ          lzma        gzip        LZO         LZ4      zstd
- serial  17m39.613s  16m10.710s   9m56.606s  13m22.337s  12.159s  9m33.600s
-      1  17m38.050s  15m49.753s   9m46.948s  13m06.705s  11.908s  9m23.445s
-      2   9m26.712s   8m24.706s   5m08.152s   6m53.872s   7.395s  5m 1.734s
-      3   6m29.733s   5m47.422s   3m33.235s   4m44.407s   6.069s  3m30.708s
-      4   5m02.993s   4m30.361s   2m43.447s   3m39.825s   5.864s  2m44.418s
-      5   4m07.959s   3m40.860s   2m13.454s   2m59.395s   5.749s  2m16.745s
-      6   3m30.514s   3m07.816s   1m53.641s   2m32.461s   5.926s  1m57.607s
-      7   3m04.009s   2m43.765s   1m39.742s   2m12.536s   6.281s  1m43.734s
-      8   2m45.050s   2m26.996s   1m28.776s   1m58.253s   6.395s  1m34.500s
-      9   2m34.993s   2m18.868s   1m21.668s   1m50.461s   6.890s  1m29.820s
-     10   2m27.399s   2m11.214s   1m15.461s   1m44.060s   7.225s  1m26.176s
-     11   2m20.068s   2m04.592s   1m10.286s   1m37.749s   7.557s  1m22.566s
-     12   2m13.131s   1m58.710s   1m05.957s   1m32.596s   8.127s  1m18.883s
-     13   2m07.472s   1m53.481s   1m02.041s   1m27.982s   8.704s  1m16.218s
-     14   2m02.365s   1m48.773s   1m00.337s   1m24.444s   9.494s  1m14.175s
-     15   1m58.298s   1m45.079s     58.348s   1m21.445s  10.192s  1m12.134s
-     16   1m55.940s   1m42.176s     56.615s   1m19.030s  10.964s  1m11.049s
-
- The file "benchmark.ods" contains those values, values derived from this and
- charts depicting the results.
-
-
- 2.5) Discussion
-
- Most obviously, the results indicate that LZ4, unlike the other compressors,
- is clearly I/O bound and not CPU bound and doesn't benefit from parallelization
- beyond 2-4 worker threads and even that benefit is marginal with efficiency
- plummetting immediately.
-
-
- The other compressors are clearly CPU bound. Speedup increases linearly until
- about 8 cores, but with a slope < 1, as evident by efficiency linearly
- decreasing and reaching 80% for 8 cores.
-
- A reason for this sub-linear scaling may be the choke point introduced by the
- creation of fragment blocks, that *requires* a synchronization. To test this
- theory, a second benchmark should be performed with fragment block generation
- completely disabled. This requires a new flag to be added to tar2sqfs (and
- also gensquashfs).
-
-
- Using more than 8 jobs causes a much slower increase in speedup and efficency
- declines even faster. This is probably due to the fact that the test system
- only has 8 physical cores and beyond that, SMT has to be used.
-
-
- It should also be noted that the thread pool compressor with only a single
- thread turns out to be *slightly* faster than the serial reference
- implementation. A possible explanation for this might be that the fragment
- blocks are actually assembled in the main thread, in parallel to the worker
- that can still continue with other data blocks. Because of this decoupling
- there is in fact some degree of parallelism, even if only one worker thread
- is used.
-
-
- As a side effect, this benchmark also produces some insights into the
- compression ratio and throughput of the supported compressors. Indicating that
- for the Debian live image, XZ clearly provides the highest data density, while
- LZ4 is clearly the fastest compressor available.
-
- The throughput of the zstd compressor is comparable to gzip, while the
- resulting compression ratio is closer to LZMA.
-
- Repeating the benchmark without tail-end-packing and with fragments completely
- disabled would also show the effectiveness of tail-end-packing and fragment
- packing as a side effect.
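
Note on the derived values: the discussion in the moved text refers to speedup and efficiency, which the text says are stored in "benchmark.ods" rather than in the document itself. The following is a minimal sketch of how such values could be recomputed from the raw timings. It assumes that speedup is the 1-job time divided by the n-job time and that efficiency is speedup divided by the job count; the spreadsheet may use the serial build as the baseline instead. The helper names `to_seconds` and `scaling` are hypothetical and not part of squashfs-tools-ng.

```python
import re


def to_seconds(stamp: str) -> float:
    """Convert a `time`-style stamp such as '2m45.050s' or '58.348s' to seconds."""
    # Minutes are optional; whitespace after 'm' is tolerated ('5m 1.734s').
    match = re.fullmatch(r"(?:(\d+)m\s*)?(\d+(?:\.\d+)?)s", stamp.strip())
    if match is None:
        raise ValueError(f"unrecognized time stamp: {stamp!r}")
    minutes = int(match.group(1) or 0)
    return minutes * 60 + float(match.group(2))


def scaling(baseline: str, measured: str, jobs: int) -> tuple[float, float]:
    """Return (speedup, efficiency) of `measured` relative to `baseline`.

    Assumption: speedup = T_baseline / T_measured, efficiency = speedup / jobs.
    """
    speedup = to_seconds(baseline) / to_seconds(measured)
    return speedup, speedup / jobs


if __name__ == "__main__":
    # A few XZ entries copied from the results table above.
    xz_times = {1: "17m38.050s", 8: "2m45.050s", 16: "1m55.940s"}
    for jobs, stamp in xz_times.items():
        speedup, efficiency = scaling(xz_times[1], stamp, jobs)
        print(f"{jobs:2d} jobs: speedup {speedup:.2f}, efficiency {efficiency:.0%}")
```

Under these assumptions the XZ column gives an efficiency of roughly 80% at 8 jobs, in line with the figure quoted in the discussion.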
