Parallelizing SquashFS Data Packing *********************************** 0) Overview *********** On a high level, data blocks are processed as follows: The "block processor" has a simple begin/append/end interface for submitting file data. Internally it chops the file data up into fixed size blocks that are each [optionally] compressed and hashed. If the "end" function is called and there is still left over data, a fragment is created. Fragments are only hashed. If another fragment exists with the same size and hash, it is discarded and the existing fragment is referenced. Fragments are collected in a fragment block that, once it overflows, is processed like a normal block. The final compressed & hashed data blocks & fragment blocks are passed on to the "block writer". The block writer simply writes blocks to the output file. Flags are used to communicate what the first and last block of a file are. Entire files are deduplicated by trying to find a sequence of identical size/hash pairs in the already written blocks. 0.1) Implementation The implementation of the block processor is in lib/sqfs/block_processor. The file common.c contains the frontend for file data submission and common functions for processing a single block, handling a completed block and handling a completed fragment. A reference serial implementation is provided in the file serial.c 1) Thread Pool Based Block Processor ************************************ 1.1) Goals and Challanges In addition to the goal of boosting performance, the thread pool based block processor must meet the following requirements: - Output MUST be deterministic and reproducible. I.e. feeding byte-for-byte the same input MUST ALWAYS produce byte-for-byte the same output. - Blocks the belong to a single file MUST be written in the order that they were submitted. - Output MUST be byte-for-byte equivalent to the serial reference implementation. Changing the serial reference implementation to achieve this is OK. - I/O cannot be done in worker threads. The underlying back-end must be assumed to not be thread safe and may get very confused by suddenly running in a different thread, even if only one thread at a time uses it. 1.2) The Current Approach The current implementation is in winpthread.c (based on pthread or Windows native threads, depending on whats available). It keeps track of blocks in 3 different FIFO queues: - A "work queue" that freshly submitted blocks go to. Worker threads take blocks from this queue for processing. - A "done queue" that worker threads submit blocks to, once completed. - An "I/O queue" that contains blocks ready to be written to disk. When the main thread submits a block, it gives it an incremental "processing" sequence number and appends it to the "work queue". Thread pool workers take the first best block of the queue, process it and add it to the "done" queue, sorted by its processing sequence number. The main thread dequeues blocks from the done queue sorted by their processing sequence number, using a second counter to make sure blocks are dequeued in the exact same order as they were added to the work queue. Regular data blocks from the "done queue" are given an incremental I/O sequence number and added to the "I/O queue", sorted by this number. Fragments are deduplicated and consolidated into a fragment block. If this block overflows, it is appended to the "work queue" the exact same way as regular blocks, but it is **already given an I/O sequence number**. If a block dequeued from the "done queue" turns out to be a fragment block, it is added to the "I/O queue", sorted by its I/O sequence number **that it already has**, i.e. no new sequence number is allocated. The I/O queue is dequeued in the same fashion as the "done queue", using a second counter to enforce ordering. The actual implementation interleaves enqueueing and dequeueing in the block submission function. It dequeues blocks if the queues reach a pre-set maximum backlog. In that case, it tries to dequeue from the I/O queue first and if that fails, tries to dequeue from the "done queue". If that also fails, it uses signal/await to be woken up by a worker thread once it adds a block to the "done queue". Fragment post-processing and re-queueing of blocks is done inside the critical region, but the actual I/O is done outside (for obvious reasons). Profiling on small filesystems using perf shows that the outlined approach seems to perform quite well for CPU bound compressors like XZ, but doesn't add a lot for I/O bound compressors like zstd. If you have a better idea how to do this, please let me know. 2) Benchmarks ************* 2.1) How was the Benchmark Performed? An optimized build of squashfs-tools-ng was compiled and installed to a tmpfs: $ mkdir /dev/shm/temp $ ln -s /dev/shm/temp out $ ./autogen.sh $ ./configure CFLAGS="-O3 -Ofast -march=native -mtune=native" \ LDFLAGS="-O3 -Ofast" --prefix=$(pwd)/out $ make -j install $ cd out A SquashFS image to be tested was unpacked in this directory: $ ./bin/sqfs2tar <IMAGE> > test.tar And then repacked as follows: $ time ./bin/tar2sqfs -j <NUM_CPU> -c <COMPRESSOR> -f test.sqfs < test.tar Out of 4 runs, the worst wall-clock time ("real") was used for comparison. For the serial reference version, configure was re-run with the option --without-pthread, the tools re-compiled and re-installed. 2.2) What Image was Tested? A Debian image extracted from the Debian 10.2 LiveDVD for AMD64 with XFCE was used. The input size and resulting output sizes turned out to be as follows: - As uncompressed tarball: ~6.5GiB (7,008,118,272) - As LZ4 compressed SquashFS image: ~3.1GiB (3,381,751,808) - As LZO compressed SquashFS image: ~2.5GiB (2,732,015,616) - As zstd compressed SquashFS image: ~2.1GiB (2,295,017,472) - As gzip compressed SquashFS image: ~2.3GiB (2,471,276,544) - As lzma compressed SquashFS image: ~2.0GiB (2,102,169,600) - As XZ compressed SquashFS image: ~2.0GiB (2,098,466,816) The Debian image is expected to contain realistic input data for a Linux file system and also provide enough data for an interesting benchmark. 2.3) What Test System was used? AMD Ryzen 7 3700X 32GiB DDR4 RAM Fedora 31 2.4) Results The raw timing results are as follows: Jobs XZ lzma gzip LZO LZ4 zstd serial 17m39.613s 16m10.710s 9m56.606s 13m22.337s 12.159s 9m33.600s 1 17m38.050s 15m49.753s 9m46.948s 13m06.705s 11.908s 9m23.445s 2 9m26.712s 8m24.706s 5m08.152s 6m53.872s 7.395s 5m 1.734s 3 6m29.733s 5m47.422s 3m33.235s 4m44.407s 6.069s 3m30.708s 4 5m02.993s 4m30.361s 2m43.447s 3m39.825s 5.864s 2m44.418s 5 4m07.959s 3m40.860s 2m13.454s 2m59.395s 5.749s 2m16.745s 6 3m30.514s 3m07.816s 1m53.641s 2m32.461s 5.926s 1m57.607s 7 3m04.009s 2m43.765s 1m39.742s 2m12.536s 6.281s 1m43.734s 8 2m45.050s 2m26.996s 1m28.776s 1m58.253s 6.395s 1m34.500s 9 2m34.993s 2m18.868s 1m21.668s 1m50.461s 6.890s 1m29.820s 10 2m27.399s 2m11.214s 1m15.461s 1m44.060s 7.225s 1m26.176s 11 2m20.068s 2m04.592s 1m10.286s 1m37.749s 7.557s 1m22.566s 12 2m13.131s 1m58.710s 1m05.957s 1m32.596s 8.127s 1m18.883s 13 2m07.472s 1m53.481s 1m02.041s 1m27.982s 8.704s 1m16.218s 14 2m02.365s 1m48.773s 1m00.337s 1m24.444s 9.494s 1m14.175s 15 1m58.298s 1m45.079s 58.348s 1m21.445s 10.192s 1m12.134s 16 1m55.940s 1m42.176s 56.615s 1m19.030s 10.964s 1m11.049s The file "benchmark.ods" contains those values, values derived from this and charts depicting the results. 2.5) Discussion Most obviously, the results indicate that LZ4, unlike the other compressors, is clearly I/O bound and not CPU bound and doesn't benefit from parallelization beyond 2-4 worker threads and even that benefit is marginal with efficiency plummetting immediately. The other compressors are clearly CPU bound. Speedup increases linearly until about 8 cores, but with a slope < 1, as evident by efficiency linearly decreasing and reaching 80% for 8 cores. A reason for this sub-linear scaling may be the choke point introduced by the creation of fragment blocks, that *requires* a synchronization. To test this theory, a second benchmark should be performed with fragment block generation completely disabled. This requires a new flag to be added to tar2sqfs (and also gensquashfs). Using more than 8 jobs causes a much slower increase in speedup and efficency declines even faster. This is probably due to the fact that the test system only has 8 physical cores and beyond that, SMT has to be used. It should also be noted that the thread pool compressor with only a single thread turns out to be *slightly* faster than the serial reference implementation. A possible explanation for this might be that the fragment blocks are actually assembled in the main thread, in parallel to the worker that can still continue with other data blocks. Because of this decoupling there is in fact some degree of parallelism, even if only one worker thread is used. As a side effect, this benchmark also produces some insights into the compression ratio and throughput of the supported compressors. Indicating that for the Debian live image, XZ clearly provides the highest data density, while LZ4 is clearly the fastest compressor available. The throughput of the zstd compressor is comparable to gzip, while the resulting compression ratio is closer to LZMA. Repeating the benchmark without tail-end-packing and with fragments completely disabled would also show the effectiveness of tail-end-packing and fragment packing as a side effect.