Parallelizing SquashFS Data Packing
***********************************

0) Overview
***********

On a high level, data blocks are processed as follows:

The "block processor" has a simple begin/append/end interface for submitting
file data. Internally it chops the file data up into fixed size blocks that
are each [optionally] compressed and hashed. If the "end" function is called
and there is still left over data, a fragment is created.

Fragments are only hashed. If another fragment exists with the same size and
hash, it is discarded and the existing fragment is referenced. Fragments are
collected in a fragment block that, once it overflows, is processed like a
normal block.

The final compressed & hashed data blocks & fragment blocks are passed on to
the "block writer".

The block writer simply writes blocks to the output file. Flags are used to
communicate what the first and last block of a file are. Entire files are
deduplicated by trying to find a sequence of identical size/hash pairs in
the already written blocks.

0.1) Implementation

The implementation of the block processor is in lib/sqfs/block_processor.
The file common.c contains the frontend for file data submission and common
functions for processing a single block, handling a completed block and
handling a completed fragment.

A reference serial implementation is provided in the file serial.c.
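
To make the begin/append/end flow a little more concrete, here is a minimal,
self-contained sketch of the block splitting step. All names in it
(file_begin, file_append, file_end, handle_block, handle_fragment) are made
up for illustration; this is not the actual code in lib/sqfs/block_processor,
which in addition compresses, hashes and deduplicates the resulting blocks.

/*
 * Illustration only: chop appended data into fixed size blocks, turn the
 * left over tail into a fragment when the file ends.
 */
#include <stdio.h>
#include <string.h>
#include <stddef.h>

#define BLOCK_SIZE 16   /* tiny block size so the example output is visible */

struct packer {
	unsigned char buffer[BLOCK_SIZE];
	size_t used;
};

static void handle_block(const unsigned char *data, size_t size)
{
	/* real code: [optionally] compress + hash, then pass to the block writer */
	(void)data;
	printf("block of %zu bytes\n", size);
}

static void handle_fragment(const unsigned char *data, size_t size)
{
	/* real code: hash only, deduplicate, collect into the current fragment block */
	(void)data;
	printf("fragment of %zu bytes\n", size);
}

static void file_begin(struct packer *p)
{
	p->used = 0;
}

static void file_append(struct packer *p, const void *data, size_t size)
{
	const unsigned char *ptr = data;

	while (size > 0) {
		size_t diff = BLOCK_SIZE - p->used;

		if (diff > size)
			diff = size;

		memcpy(p->buffer + p->used, ptr, diff);
		p->used += diff;
		ptr += diff;
		size -= diff;

		/* a full block is handed off immediately */
		if (p->used == BLOCK_SIZE) {
			handle_block(p->buffer, p->used);
			p->used = 0;
		}
	}
}

static void file_end(struct packer *p)
{
	/* left over data that did not fill a whole block becomes a fragment */
	if (p->used > 0) {
		handle_fragment(p->buffer, p->used);
		p->used = 0;
	}
}

int main(void)
{
	static const char payload[] = "The quick brown fox jumps over the lazy dog";
	struct packer p;

	file_begin(&p);
	file_append(&p, payload, sizeof(payload) - 1);
	file_end(&p);
	return 0;
}

For the 43 byte string fed in by main, this emits two full 16 byte blocks
followed by an 11 byte fragment.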

1) Thread Pool Based Block Processor
************************************

1.1) Goals and Challenges

In addition to the goal of boosting performance, the thread pool based block
processor must meet the following requirements:

 - Output MUST be deterministic and reproducible, i.e. feeding byte-for-byte
   the same input MUST ALWAYS produce byte-for-byte the same output.
 - Blocks that belong to a single file MUST be written in the order that
   they were submitted.
 - Output MUST be byte-for-byte equivalent to the serial reference
   implementation. Changing the serial reference implementation to achieve
   this is OK.
 - I/O cannot be done in worker threads. The underlying back-end must be
   assumed to not be thread safe and may get very confused by suddenly
   running in a different thread, even if only one thread at a time uses it.

1.2) The Current Approach

The current implementation is in winpthread.c (based on pthread or Windows
native threads, depending on what's available).

It keeps track of blocks in 3 different FIFO queues:

 - A "work queue" that freshly submitted blocks go to. Worker threads take
   blocks from this queue for processing.
 - A "done queue" that worker threads submit blocks to, once completed.
 - An "I/O queue" that contains blocks ready to be written to disk.

When the main thread submits a block, it gives it an incremental "processing"
sequence number and appends it to the "work queue". Thread pool workers take
the next block off the queue, process it and add it to the "done" queue,
sorted by its processing sequence number.

The main thread dequeues blocks from the "done" queue in processing sequence
number order, using a second counter to make sure blocks are dequeued in
exactly the same order as they were added to the work queue.

Regular data blocks from the "done queue" are given an incremental I/O
sequence number and added to the "I/O queue", sorted by this number.

Fragments are deduplicated and consolidated into a fragment block. If this
block overflows, it is appended to the "work queue" the exact same way as
regular blocks, but it is **already given an I/O sequence number**.

If a block dequeued from the "done queue" turns out to be a fragment block,
it is added to the "I/O queue", sorted by the I/O sequence number **that it
already has**, i.e. no new sequence number is allocated.

The I/O queue is dequeued in the same fashion as the "done queue", using a
second counter to enforce ordering.
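
This combination of a per-block sequence number and a "next expected" counter
on the consumer side is what keeps the output deterministic. The following
self-contained sketch shows just that reordering step, using pthreads; all
identifiers are invented for this example, and the real code in winpthread.c
handles considerably more (backlog limits, error handling, fragment
consolidation and the second, I/O level of ordering).

/*
 * Illustration of sequence number based reordering as used for the "done"
 * and "I/O" queues. Not the actual winpthread.c implementation.
 */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

struct block {
	struct block *next;
	uint32_t sequence_number;
};

struct ordered_queue {
	pthread_mutex_t mtx;
	pthread_cond_t cond;
	struct block *head;     /* singly linked list, sorted by sequence number */
	uint32_t next_dequeue;  /* second counter enforcing dequeue order */
};

static struct ordered_queue done_queue = {
	PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, NULL, 0
};

/* workers insert completed blocks, keeping the list sorted */
static void queue_insert_sorted(struct ordered_queue *q, struct block *blk)
{
	struct block **it;

	pthread_mutex_lock(&q->mtx);
	it = &q->head;
	while (*it != NULL && (*it)->sequence_number < blk->sequence_number)
		it = &(*it)->next;
	blk->next = *it;
	*it = blk;
	pthread_cond_broadcast(&q->cond);
	pthread_mutex_unlock(&q->mtx);
}

/* the consumer only takes the head once it carries exactly the sequence
   number it expects next, so blocks come out in submission order even if
   the workers finish them out of order */
static struct block *queue_dequeue_ordered(struct ordered_queue *q)
{
	struct block *blk;

	pthread_mutex_lock(&q->mtx);
	while (q->head == NULL || q->head->sequence_number != q->next_dequeue)
		pthread_cond_wait(&q->cond, &q->mtx);
	blk = q->head;
	q->head = blk->next;
	q->next_dequeue += 1;
	pthread_mutex_unlock(&q->mtx);
	return blk;
}

/* each "worker" completes one block; in the real code this is where
   compression and hashing happen, in arbitrary order */
static void *worker(void *arg)
{
	struct block *blk = calloc(1, sizeof(*blk));

	blk->sequence_number = (uint32_t)(uintptr_t)arg;
	queue_insert_sorted(&done_queue, blk);
	return NULL;
}

int main(void)
{
	pthread_t threads[8];
	size_t i;

	for (i = 0; i < 8; ++i)
		pthread_create(&threads[i], NULL, worker, (void *)(uintptr_t)i);

	/* blocks always come out as 0, 1, 2, ..., 7 */
	for (i = 0; i < 8; ++i) {
		struct block *blk = queue_dequeue_ordered(&done_queue);

		printf("dequeued block %u\n", (unsigned int)blk->sequence_number);
		free(blk);
	}

	for (i = 0; i < 8; ++i)
		pthread_join(threads[i], NULL);
	return 0;
}

Compiled with -pthread, this always prints the blocks in the order 0 through
7, regardless of the order in which the worker threads happen to finish.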

The actual implementation interleaves enqueueing and dequeueing in the block
submission function. It dequeues blocks if the queues reach a pre-set maximum
backlog. In that case, it tries to dequeue from the I/O queue first and, if
that fails, tries to dequeue from the "done queue". If that also fails, it
uses signal/await to be woken up by a worker thread once it adds a block to
the "done queue". Fragment post-processing and re-queueing of blocks are done
inside the critical region, but the actual I/O is obviously done outside.

Profiling on small filesystems using perf suggests that the outlined approach
performs quite well for CPU bound compressors like XZ, but does not add a lot
for I/O bound compressors like zstd. Actual measurements still need to be
done.

If you have a better idea how to do this, please let me know.

2) Benchmarks
*************

2.1) How was the Benchmark Performed?

An optimized build of squashfs-tools-ng was compiled and installed to a
tmpfs:

  $ mkdir /dev/shm/temp
  $ ln -s /dev/shm/temp out
  $ ./autogen.sh
  $ ./configure CFLAGS="-O3 -Ofast -march=native -mtune=native" \
                LDFLAGS="-O3 -Ofast" --prefix=$(pwd)/out
  $ make -j install
  $ cd out

A SquashFS image to be tested was unpacked in this directory:

  $ ./bin/sqfs2tar <IMAGE> > test.tar

And then repacked as follows:

  $ time ./bin/tar2sqfs -j <NUM_JOBS> -c <COMPRESSOR> -f test.sqfs < test.tar

Out of 4 runs, the worst wall-clock time ("real") was used for comparison.

For the serial reference version, configure was re-run with the option
--without-pthread and the tools were re-compiled and re-installed.

2.2) What Image was Tested?

A Debian image extracted from the Debian 10.2 LiveDVD for AMD64 with XFCE
was used.

The input size and the resulting output sizes turned out to be as follows:

 - As uncompressed tarball:             ~6.5GiB (7,008,118,272)
 - As LZ4 compressed SquashFS image:    ~3.1GiB (3,381,751,808)
 - As LZO compressed SquashFS image:    ~2.5GiB (2,732,015,616)
 - As zstd compressed SquashFS image:   ~2.4GiB (2,536,910,848)
 - As gzip compressed SquashFS image:   ~2.3GiB (2,471,276,544)
 - As lzma compressed SquashFS image:   ~2.0GiB (2,102,169,600)
 - As XZ compressed SquashFS image:     ~2.0GiB (2,098,466,816)

The Debian image is expected to contain realistic input data for a Linux
file system and also to provide enough data for an interesting benchmark.

2.3) What Test System was used?

 - AMD Ryzen 7 3700X
 - 32GiB DDR4 RAM
 - Fedora 31 with Linux 5.4.17

2.4) Results

The raw timing results are as follows:

 Jobs            XZ        lzma        gzip         LZO      LZ4     zstd
 serial  17m59.413s  16m08.868s  10m02.632s  13m17.956s  18.218s  35.280s
 1       18m01.695s  16m02.329s   9m57.334s  13m14.374s  16.727s  34.108s
 2        9m34.939s   8m32.806s   5m12.791s   6m56.017s  13.161s  21.696s
 3        6m37.701s   5m55.246s   3m35.409s   4m50.138s  12.798s  18.265s
 4        5m07.896s   4m34.419s   2m47.108s   3m43.153s  13.191s  16.885s
 5        4m11.593s   3m44.764s   2m17.371s   3m02.429s  14.251s  17.389s
 6        3m34.115s   3m12.032s   1m57.972s   2m35.601s  14.824s  17.023s
 7        3m07.806s   2m47.815s   1m44.661s   2m16.289s  15.643s  17.676s
 8        2m47.589s   2m30.433s   1m33.865s   2m01.389s  16.262s  17.524s
 9        2m38.737s   2m22.159s   1m27.477s   1m53.976s  16.887s  18.110s
 10       2m30.942s   2m14.427s   1m22.424s   1m47.411s  17.316s  18.497s
 11       2m23.512s   2m08.470s   1m17.419s   1m41.965s  17.759s  18.831s
 12       2m17.083s   2m02.814s   1m13.644s   1m36.742s  18.335s  19.082s
 13       2m11.450s   1m57.820s   1m10.310s   1m32.492s  18.827s  19.232s
 14       2m06.525s   1m53.951s   1m07.483s   1m28.779s  19.471s  20.070s
 15       2m02.338s   1m50.358s   1m04.954s   1m25.993s  19.772s  20.608s
 16       1m58.566s   1m47.371s   1m03.616s   1m23.241s  20.188s  21.779s

The file "benchmark.ods" contains those values, values derived from them and
charts depicting the results.

2.5) Discussion

Most obviously, the results indicate that LZ4 and zstd compression are
clearly I/O bound rather than CPU bound. They do not benefit from
parallelization beyond 2 to 4 worker threads, and even that benefit is
marginal, with efficiency plummeting immediately.

The other compressors (XZ, lzma, gzip, LZO) are clearly CPU bound. Speedup
increases linearly until about 8 cores, but with a factor k < 1, mirrored by
efficiency decreasing to roughly 80% at 8 cores (e.g. for XZ, 8 jobs take
2m47.589s versus 17m59.413s for the serial version, i.e. a speedup of about
6.4 and thus an efficiency of about 80%).

A reason for this sub-linear scaling may be the choke point introduced by the
creation of fragment blocks, which *requires* synchronization. To test this
theory, a second benchmark should be performed with fragment block generation
completely disabled. This requires a new flag to be added to tar2sqfs (and
also gensquashfs).

Using more than 8 jobs causes the speedup to grow much more slowly, while
efficiency declines even faster. This is probably due to the fact that the
test system only has 8 physical cores and beyond that, SMT has to be used.

It should also be noted that the thread pool compressor with only a single
worker thread turns out to be *slightly* faster than the serial reference
implementation. A possible explanation for this is that the fragment blocks
are actually assembled in the main thread, in parallel to the worker thread,
which can still continue with other data blocks. Because of this decoupling
there is in fact some degree of parallelism, even if only one worker thread
is used.

As a side effect, this benchmark also provides some insight into the
compression ratio and throughput of the supported compressors: for the Debian
live image, XZ clearly provides the highest data density, while LZ4 is
clearly the fastest compressor available, directly followed by zstd, which
achieves a much better compression ratio than LZ4, comparable to that of
gzip, while being almost 50 times faster. The throughput of the zstd
compressor is truly impressive, considering the compression ratio it
achieves.

Repeating the benchmark without tail-end-packing and with fragments
completely disabled would, as a side effect, also show the effectiveness of
tail-end-packing and fragment packing.
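
For reference, the speedup and efficiency figures referred to above (and
collected in benchmark.ods) can be derived from the raw wall-clock times in
section 2.4 as shown in the following sketch; it hard-codes the XZ column as
an example and is purely illustrative.

/*
 * Speedup is the serial wall-clock time divided by the parallel wall-clock
 * time; efficiency is speedup divided by the number of worker threads.
 */
#include <stdio.h>
#include <stddef.h>

int main(void)
{
	/* XZ wall-clock times from the table in section 2.4, in seconds */
	static const double xz_serial = 17.0 * 60.0 + 59.413;
	static const double xz_jobs[] = {
		18.0 * 60.0 +  1.695,   /* 1 job  */
		 9.0 * 60.0 + 34.939,   /* 2 jobs */
		 6.0 * 60.0 + 37.701,   /* 3 jobs */
		 5.0 * 60.0 +  7.896,   /* 4 jobs */
		 4.0 * 60.0 + 11.593,   /* 5 jobs */
		 3.0 * 60.0 + 34.115,   /* 6 jobs */
		 3.0 * 60.0 +  7.806,   /* 7 jobs */
		 2.0 * 60.0 + 47.589,   /* 8 jobs */
	};
	size_t i;

	for (i = 0; i < sizeof(xz_jobs) / sizeof(xz_jobs[0]); ++i) {
		double speedup = xz_serial / xz_jobs[i];
		double efficiency = speedup / (double)(i + 1);

		/* e.g. 8 jobs: speedup ~6.44, efficiency ~80% */
		printf("%zu jobs: speedup %.2f, efficiency %.1f%%\n",
		       i + 1, speedup, 100.0 * efficiency);
	}
	return 0;
}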