-rw-r--r--  doc/Makemodule.am   |   2 +-
-rw-r--r--  doc/parallelism.txt | 163 ++++++++++++++++++++++++++++++++++++++++
2 files changed, 164 insertions(+), 1 deletion(-)
diff --git a/doc/Makemodule.am b/doc/Makemodule.am
index d7010b6..b215d5d 100644
--- a/doc/Makemodule.am
+++ b/doc/Makemodule.am
@@ -1,4 +1,4 @@
dist_man1_MANS += doc/gensquashfs.1 doc/rdsquashfs.1 doc/sqfs2tar.1
dist_man1_MANS += doc/tar2sqfs.1 doc/sqfsdiff.1
-EXTRA_DIST += doc/format.txt
+EXTRA_DIST += doc/format.txt doc/parallelism.txt
diff --git a/doc/parallelism.txt b/doc/parallelism.txt
new file mode 100644
index 0000000..e509d86
--- /dev/null
+++ b/doc/parallelism.txt
@@ -0,0 +1,163 @@
+
+ Parallelizing SquashFS Data Packing
+ ***********************************
+
+ 0) Overview
+ ***********
+
+ On a high level, data blocks are processed as follows:
+
+ The "block processor" has a simple begin/append/end interface for submitting
+ file data. Internally it chops the file data up into fixed size blocks that
+ are each [optionally] compressed and hashed. If the "end" function is called
+ and there is still left over data, a fragment is created.
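+
+ The intended call pattern looks roughly like the sketch below. All type and
+ function names in it are placeholders for the begin/append/end interface
+ described above, not the actual library API:
+
+     #include <stdio.h>
+
+     /* hypothetical stand-ins for the real block processor interface */
+     typedef struct block_processor_t block_processor_t;
+
+     int block_processor_begin_file(block_processor_t *proc);
+     int block_processor_append(block_processor_t *proc,
+                                const void *data, size_t size);
+     int block_processor_end_file(block_processor_t *proc);
+
+     static int pack_file(block_processor_t *proc, FILE *fp)
+     {
+         char buffer[4096];
+         size_t diff;
+
+         if (block_processor_begin_file(proc))
+             return -1;
+
+         while ((diff = fread(buffer, 1, sizeof(buffer), fp)) > 0) {
+             /* internally chopped up into block sized chunks */
+             if (block_processor_append(proc, buffer, diff))
+                 return -1;
+         }
+
+         /* leftover data shorter than a block becomes a fragment */
+         return block_processor_end_file(proc);
+     }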
+
+ Fragments are only hashed. If another fragment exists with the same size and
+ hash, it is discarded and the existing fragment is referenced. Fragments are
+ collected in a fragment block that, once it overflows, is processed like a
+ normal block.
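+
+ A minimal sketch of that size/hash lookup (the structure and names are made
+ up for illustration):
+
+     #include <stddef.h>
+     #include <stdint.h>
+
+     typedef struct {
+         uint32_t size;
+         uint64_t hash;
+     } frag_info_t;
+
+     /* return the index of an existing fragment with identical size and
+        hash, or -1 if the new fragment has to be kept */
+     static long find_duplicate(const frag_info_t *list, size_t count,
+                                uint32_t size, uint64_t hash)
+     {
+         size_t i;
+
+         for (i = 0; i < count; ++i) {
+             if (list[i].size == size && list[i].hash == hash)
+                 return (long)i;
+         }
+         return -1;
+     }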
+
+ The final compressed & hashed data blocks & fragment blocks are passed on to
+ the "block writer".
+
+ The block writer simply writes blocks to the output file. Flags are used to
+ indicate which blocks are the first and last block of a file. Entire files
+ are deduplicated by trying to find a sequence of identical size/hash pairs
+ among the already written blocks.
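+
+ A hedged sketch of that whole-file deduplication: a file is a duplicate if
+ the (size, hash) pairs of its blocks occur as a consecutive run among the
+ blocks written so far. The names are again made up:
+
+     #include <stddef.h>
+     #include <stdint.h>
+
+     typedef struct {
+         uint32_t size;
+         uint64_t hash;
+     } blk_rec_t;
+
+     static long find_duplicate_run(const blk_rec_t *written, size_t count,
+                                    const blk_rec_t *file, size_t n)
+     {
+         size_t i, j;
+
+         for (i = 0; n > 0 && i + n <= count; ++i) {
+             for (j = 0; j < n; ++j) {
+                 if (written[i + j].size != file[j].size ||
+                     written[i + j].hash != file[j].hash)
+                     break;
+             }
+             if (j == n)
+                 return (long)i; /* duplicate found, reference it */
+         }
+         return -1; /* no duplicate, the blocks have to be written */
+     }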
+
+
+ 0.1) Implementation
+
+ The implementation of the block processor is in lib/sqfs/block_processor. The
+ file common.c contains the frontend for file data submission and common
+ functions for processing a single block, handling a completed block and
+ handling a completed fragment.
+
+ A reference serial implementation is provided in the file serial.c.
+
+
+ 1) Thread Pool Based Block Processor
+ ************************************
+
+ The main challenge of parallelizing the block processor lies in the fact that
+ the output HAS TO BE byte-for-byte equivalent to the serial reference
+ implementation.
+
+ This means:
+ - File blocks have to be written in the exact same order as they
+ are submitted.
+ - If storing a fragment overflows the fragment block, the resulting
+ fragment block has to be written next, before any further file data.
+
+
+ The current implementation in winpthread.c (based on pthread or Windows
+ native threads, depending on what's available) uses the following approach,
+ sketched in code after the list:
+
+ - Each submitted data block or fragment gets an incremental sequence number
+ and is appended to a FIFO queue.
+ - Multiple threads consume blocks from the queue and use the function
+ from common.c to process the dequeued blocks.
+ - Completed blocks are inserted into a "done" queue, sorted by their
+ sequence number.
+ - The main thread that submits blocks also dequeues the completed ones,
+ keeping track of the sequence numbers, and calls the respective common
+ functions for processing completed blocks and fragments.
+ - If a fragment block is created, it is submitted with *the same* sequence
+ number as the fragment that caused it to overflow and the next expected
+ sequence number is reset to that.
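+
+ The in-order hand-over described above could look roughly like this (a
+ hedged sketch; locking is omitted, all names are made up, and
+ process_completed() stands in for the common.c handlers):
+
+     #include <stddef.h>
+     #include <stdint.h>
+
+     typedef struct block_t {
+         struct block_t *next;
+         uint32_t seq_num;    /* processing sequence number */
+         /* ...payload... */
+     } block_t;
+
+     /* returns non-zero if handling the block made the fragment block
+        overflow, i.e. a fragment block was re-submitted with this very
+        sequence number */
+     int process_completed(block_t *blk);
+
+     /* insert a completed block into the "done" list, keeping it sorted
+        by sequence number */
+     static void done_insert(block_t **list, block_t *blk)
+     {
+         while (*list != NULL && (*list)->seq_num < blk->seq_num)
+             list = &(*list)->next;
+         blk->next = *list;
+         *list = blk;
+     }
+
+     /* drain all blocks that are ready, in submission order */
+     static void drain_done(block_t **list, uint32_t *next_seq)
+     {
+         block_t *blk;
+
+         while (*list != NULL && (*list)->seq_num == *next_seq) {
+             blk = *list;
+             *list = blk->next;
+
+             if (process_completed(blk)) {
+                 /* expect the re-submitted fragment block next */
+                 *next_seq = blk->seq_num;
+             } else {
+                 *next_seq = blk->seq_num + 1;
+             }
+         }
+     }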
+
+ To make sure the queue doesn't fill all RAM, submitted blocks are counted.
+ The counter is decremented when dequeuing completed blocks. If it reaches
+ a maximum, signal/await is used to wait for the worker threads to complete
+ some blocks that can then be processed, shrinking the backlog. Similarly,
+ the worker threads use signal/await to wait on the queue if it is empty.
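+
+ The throttling could look roughly like this (a sketch only; MAX_BACKLOG,
+ the queue helpers and the exact wake-up strategy are made up):
+
+     #include <pthread.h>
+
+     #define MAX_BACKLOG 32
+
+     typedef struct block_t block_t;
+
+     /* hypothetical queue helpers, all protected by mtx */
+     void work_queue_append(block_t *blk);
+     block_t *work_queue_take(void);
+     int work_queue_empty(void);
+     void process_completed_locked(void);   /* decrements backlog */
+
+     static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
+     static pthread_cond_t queue_cond = PTHREAD_COND_INITIALIZER;
+     static pthread_cond_t done_cond = PTHREAD_COND_INITIALIZER;
+     static unsigned int backlog = 0;
+
+     static void submit_block(block_t *blk)
+     {
+         pthread_mutex_lock(&mtx);
+
+         /* too much in flight: wait for the workers to complete blocks
+            and process those first, which shrinks the backlog */
+         while (backlog >= MAX_BACKLOG) {
+             pthread_cond_wait(&done_cond, &mtx);
+             process_completed_locked();
+         }
+
+         backlog += 1;
+         work_queue_append(blk);
+         pthread_cond_signal(&queue_cond);   /* wake an idle worker */
+         pthread_mutex_unlock(&mtx);
+     }
+
+     /* worker side: sleep on the condition if the queue is empty */
+     static block_t *worker_take_block(void)
+     {
+         block_t *blk;
+
+         pthread_mutex_lock(&mtx);
+         while (work_queue_empty())
+             pthread_cond_wait(&queue_cond, &mtx);
+         blk = work_queue_take();
+         pthread_mutex_unlock(&mtx);
+         return blk;
+     }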
+
+
+ 1.1) Problems
+
+ The outlined approach performs sub-optimally, with an efficiency somewhere
+ between 50% and 75% on the benchmark data used.
+
+ Profiling using perf shows that almost a third of the time, only one
+ worker thread is actually active, while the others are waiting.
+
+ The current hypothesis is that this is caused by many small input files
+ being processed, causing a workload consisting primarily of fragments.
+
+ - Fragments are only hashed, not compressed, so the work is
+ primarily I/O bound.
+ - After a number of fragments are consumed, a fragment block is created.
+ - The fragment block is submitted to the almost empty queue and the
+ I/O thread waits for it to be completed before doing anything else.
+ - One thread gets to handle the fragment block, which involves a lot more
+ work. Meanwhile the other threads starve on the empty queue.
+ - After that has finally been handed off to the I/O thread, another burst of
+ fragments comes in.
+ - Rinse and repeat.
+
+
+ 1.2) Proposed Solution
+
+ It makes no sense for the main thread to block until the fragment block is
+ done. It can process further fragments (just not write blocks), creating
+ more fragment blocks on the way.
+
+ A possible implementation might be to maintain 3 queues instead of 2:
+
+ - A queue with submitted blocks.
+ - A queue with completed blocks.
+ - A queue for blocks ready to be written to disk ("I/O queue").
+
+ A second sequence number is needed for keeping order on the I/O queue; the
+ scheme is sketched in code after the list:
+
+ - Submit blocks as before with incremental processing sequence number.
+ - Dequeue completed blocks in order by processing sequence number.
+ - For regular blocks, add them to the I/O queue with incremental I/O
+ sequence number.
+ - For fragments, consolidate them into fragment blocks. On overflow,
+ dispatch a fragment block with incremental processing sequence number,
+ BUT give it an I/O queue sequence number NOW.
+ - For fragment blocks, add them to the I/O queue without allocating an
+ I/O sequence number; they already have one.
+ - Dequeue ordered by I/O sequence number from the I/O queue and send the
+ completed blocks to the block writer.
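+
+ A sketch of the dispatching step under this scheme (all names, flags and
+ helpers below are made up and only illustrate the bookkeeping):
+
+     #include <stddef.h>
+     #include <stdint.h>
+
+     #define IS_FRAGMENT       0x01
+     #define IS_FRAGMENT_BLOCK 0x02
+
+     typedef struct block_t {
+         struct block_t *next;
+         uint32_t proc_seq;   /* processing sequence number */
+         uint32_t io_seq;     /* I/O sequence number */
+         uint32_t flags;
+     } block_t;
+
+     /* hypothetical helpers */
+     block_t *fragment_consolidate(block_t *frag); /* full block on overflow */
+     void work_queue_put(block_t *blk);
+     void io_queue_insert_sorted(block_t *blk);
+
+     static uint32_t next_io_seq = 0;
+
+     /* called for completed blocks, in processing sequence order */
+     static void dispatch_completed(block_t *blk)
+     {
+         block_t *frag_blk;
+
+         if (blk->flags & IS_FRAGMENT) {
+             frag_blk = fragment_consolidate(blk);
+
+             if (frag_blk != NULL) {
+                 /* overflow: re-submit for processing, but reserve the
+                    place in the output order right now */
+                 frag_blk->io_seq = next_io_seq++;
+                 work_queue_put(frag_blk);
+             }
+         } else {
+             if (!(blk->flags & IS_FRAGMENT_BLOCK)) {
+                 /* regular data block gets the next I/O number */
+                 blk->io_seq = next_io_seq++;
+             }
+             /* fragment blocks already got theirs when dispatched */
+             io_queue_insert_sorted(blk);
+         }
+     }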
+
+
+ If you have more insight or a better idea, please let me know.
+
+
+
+ 2) Benchmarks
+ *************
+
+ TODO: benchmarks with the following images:
+ - Debian live iso (2G)
+ - Arch Linux live iso (550M)
+ - Raspberry Pi 3 QT demo image (~300M)
+
+ sqfs2tar $IMAGE | tar2sqfs -j $NUM_CPU -f out.sqfs
+
+ Values to measure:
+ - Total wall clock time of tar2sqfs.
+ - Throughput (bytes read / time, bytes written / time).
+
+ Try the above for different compressors and stuff everything into
+ a huge spreadsheet. Then, determine the following and plot some
+ nice graphs:
+
+ - Absolute speedup (normalized to the serial implementation).
+ - Absolute efficiency (= absolute speedup / $NUM_CPU).
+ - Relative speedup (normalized to the thread pool with -j 1).
+ - Relative efficiency (= relative speedup / $NUM_CPU); see the summary below.
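+
+ For clarity, the definitions assumed above, with T(n) being the wall clock
+ time of the thread pool implementation at -j n and T_s that of the serial
+ reference implementation:
+
+     absolute speedup      S_abs(n) = T_s / T(n)
+     absolute efficiency   E_abs(n) = S_abs(n) / n
+     relative speedup      S_rel(n) = T(1) / T(n)
+     relative efficiency   E_rel(n) = S_rel(n) / n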
+
+
+ Available test hardware:
+ - 8(16) core AMD Ryzen 7 3700X, 32GiB DDR4 RAM.
+ - Various 4 core Intel Xeon servers. Precise specs are not known yet.
+ - TODO: Check if my credentials on LCC2 still work. The cluster nodes AFAIK
+ have dual-socket Xeons. Not sure if that means 8 cores per CPU or 8 in total.
+
+ For some compressors and workloads, tar2sqfs may be I/O bound rather than CPU
+ bound. The different machines have different storage, which may impact the
+ results. Should this be taken into account for comparison, or eliminated by
+ using a ramdisk or fiddling with the queue backlog?