| author | David Oberhollenzer <david.oberhollenzer@sigma-star.at> | 2020-02-17 23:16:19 +0100 | 
|---|---|---|
| committer | David Oberhollenzer <david.oberhollenzer@sigma-star.at> | 2020-02-17 23:16:19 +0100 | 
| commit | 7bfb01489bdc11ceb7a79f38e2c90b61232f605f (patch) | |
| tree | 64cec13ef9d3de272bb5c0ec5d366f213e6e17d1 /doc | |
| parent | 1c82a9f2fa98e68b68e6868298fba1830d91d1db (diff) | |
Write up some thoughts on the thread pool based block processor
Signed-off-by: David Oberhollenzer <david.oberhollenzer@sigma-star.at>
Diffstat (limited to 'doc')
| -rw-r--r-- | doc/Makemodule.am | 2 |
| -rw-r--r-- | doc/parallelism.txt | 163 | 
2 files changed, 164 insertions, 1 deletions
diff --git a/doc/Makemodule.am b/doc/Makemodule.am
index d7010b6..b215d5d 100644
--- a/doc/Makemodule.am
+++ b/doc/Makemodule.am
@@ -1,4 +1,4 @@
 dist_man1_MANS += doc/gensquashfs.1 doc/rdsquashfs.1 doc/sqfs2tar.1
 dist_man1_MANS += doc/tar2sqfs.1 doc/sqfsdiff.1
-EXTRA_DIST += doc/format.txt
+EXTRA_DIST += doc/format.txt doc/parallelism.txt
diff --git a/doc/parallelism.txt b/doc/parallelism.txt
new file mode 100644
index 0000000..e509d86
--- /dev/null
+++ b/doc/parallelism.txt
@@ -0,0 +1,163 @@

                      Parallelizing SquashFS Data Packing
                      ***********************************

 0) Overview
 ***********

 On a high level, data blocks are processed as follows:

 The "block processor" has a simple begin/append/end interface for submitting
 file data. Internally, it chops the file data up into fixed size blocks that
 are each [optionally] compressed and hashed. If the "end" function is called
 and there is still leftover data, a fragment is created.

 Fragments are only hashed. If another fragment exists with the same size and
 hash, the new fragment is discarded and the existing one is referenced
 instead. Fragments are collected in a fragment block that, once it overflows,
 is processed like a normal block.

 The compressed & hashed data blocks & fragment blocks are finally passed on
 to the "block writer".

 The block writer simply writes blocks to the output file. Flags are used to
 mark the first and the last block of a file. Entire files are deduplicated
 by trying to find a sequence of identical size/hash pairs among the already
 written blocks.


 0.1) Implementation

 The implementation of the block processor is in lib/sqfs/block_processor.
 The file common.c contains the frontend for file data submission and common
 functions for processing a single block, handling a completed block and
 handling a completed fragment.

 A reference serial implementation is provided in the file serial.c.
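 To make the begin/append/end semantics easier to picture, below is a
 minimal, self-contained C sketch of the chopping behaviour described above.
 All names, types and the (unrealistically small) block size are invented
 for illustration; this is NOT the actual libsquashfs interface:

    /* Illustrative sketch only: a miniature begin/append/end frontend
     * that chops a stream into fixed size blocks and leaves the tail
     * end as a fragment. */
    #include <stdio.h>
    #include <string.h>

    #define BLOCK_SIZE 8 /* tiny on purpose, for demonstration */

    struct file_state {
        unsigned char buf[BLOCK_SIZE];
        size_t used;
    };

    static void flush_block(const unsigned char *data, size_t size)
    {
        /* real code would [optionally] compress + hash here */
        printf("block of %zu bytes\n", size);
    }

    static void begin_file(struct file_state *st)
    {
        st->used = 0;
    }

    static void append(struct file_state *st, const void *data, size_t size)
    {
        const unsigned char *ptr = data;

        while (size > 0) {
            size_t diff = BLOCK_SIZE - st->used;
            if (diff > size)
                diff = size;

            memcpy(st->buf + st->used, ptr, diff);
            st->used += diff;
            ptr += diff;
            size -= diff;

            if (st->used == BLOCK_SIZE) {
                flush_block(st->buf, st->used);
                st->used = 0;
            }
        }
    }

    static void end_file(struct file_state *st)
    {
        /* leftover data becomes a fragment */
        if (st->used > 0)
            printf("fragment of %zu bytes\n", st->used);
    }

    int main(void)
    {
        struct file_state st;

        begin_file(&st);
        append(&st, "Hello, parallel SquashFS!", 25);
        end_file(&st); /* prints 3 full blocks + a 1 byte fragment */
        return 0;
    }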
 1) Thread Pool Based Block Processor
 ************************************

 The main challenge of parallelizing the block processor lies in the fact
 that the output HAS TO BE byte-for-byte equivalent to the serial reference
 implementation.

 This means:
  - File blocks have to be written in the exact same order as they
    are submitted.
  - If storing a fragment overflows the fragment block, the resulting
    fragment block has to be written next, before any further file data.


 The current implementation in winpthread.c (based on pthread or Windows
 native threads, depending on what's available) uses the following approach:

  - Each submitted data block or fragment gets an incremental sequence number
    and is appended to a FIFO queue.
  - Multiple threads consume blocks from the queue and use the functions
    from common.c to process the dequeued blocks.
  - Completed blocks are inserted into a "done" queue, sorted by their
    sequence number.
  - The main thread that submits blocks also dequeues the completed ones,
    keeping track of the sequence numbers, and calls the respective common
    functions for processing completed blocks and fragments.
  - If a fragment block is created, it is submitted with *the same* sequence
    number as the fragment that caused it to overflow, and the next expected
    sequence number is reset to that.

 To make sure the queue doesn't fill all RAM, submitted blocks are counted.
 The counter is decremented when dequeueing completed blocks. If it reaches
 a maximum, signal/await is used to wait for the worker threads to complete
 some blocks. Similarly, the worker threads use signal/await to wait on the
 queue if it is empty.
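 The ordering mechanism can be sketched as follows. This is simplified,
 illustrative code, not the actual winpthread.c implementation; the sequence
 number reset for fragment blocks and the back pressure counter are omitted:

    /* Workers insert completed blocks into a "done" list, sorted by
     * sequence number; the main thread only dequeues a block once the
     * next expected sequence number sits at the head of the list. */
    #include <pthread.h>
    #include <stddef.h>

    struct block {
        struct block *next;
        unsigned int seq_num;
        /* ... payload ... */
    };

    static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t done_cond = PTHREAD_COND_INITIALIZER;
    static struct block *done_list = NULL;
    static unsigned int next_seq = 0;

    /* called by worker threads when a block is completed */
    static void enqueue_done(struct block *blk)
    {
        struct block *it, **prev;

        pthread_mutex_lock(&mtx);
        prev = &done_list;
        it = done_list;

        while (it != NULL && it->seq_num < blk->seq_num) {
            prev = &it->next;
            it = it->next;
        }

        blk->next = it;
        *prev = blk;
        pthread_cond_broadcast(&done_cond);
        pthread_mutex_unlock(&mtx);
    }

    /* called by the main thread; blocks until the next expected
     * block (by sequence number) is available */
    static struct block *dequeue_ordered(void)
    {
        struct block *blk;

        pthread_mutex_lock(&mtx);
        while (done_list == NULL || done_list->seq_num != next_seq)
            pthread_cond_wait(&done_cond, &mtx);

        blk = done_list;
        done_list = blk->next;
        next_seq += 1;
        pthread_mutex_unlock(&mtx);
        return blk;
    }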
 1.1) Problems

 The outlined approach performs sub-optimally, with an efficiency somewhere
 between 50% and 75% on the benchmark data used.

 Profiling using perf shows that almost a third of the time, only one
 worker thread is actually active, while the others are waiting.

 The current hypothesis is that this is caused by many small input files
 being processed, resulting in a workload consisting primarily of fragments:

  - Fragments are only hashed, not compressed, so the work is
    primarily I/O bound.
  - After a number of fragments are consumed, a fragment block is created.
  - The fragment block is submitted to the almost empty queue and the
    I/O thread waits for it to be completed before doing anything else.
  - One thread gets to handle the fragment block, which involves a lot more
    work. Meanwhile, the other threads starve on the empty queue.
  - After that has finally been handed off to the I/O thread, another burst
    of fragments comes in.
  - Rinse and repeat.


 1.2) Proposed Solution

 It makes no sense for the main thread to block until the fragment block is
 done. It can process further fragments (just not write blocks), creating
 more fragment blocks on the way.

 A possible implementation might maintain 3 queues instead of 2:

  - A queue with submitted blocks.
  - A queue with completed blocks.
  - A queue for blocks ready to be written to disk ("I/O queue").

 A second sequence number is needed for keeping order on the I/O queue:

  - Submit blocks as before, with an incremental processing sequence number.
  - Dequeue completed blocks in order by processing sequence number.
     - For regular blocks, add them to the I/O queue with an incremental I/O
       sequence number.
     - For fragments, consolidate them into fragment blocks. On overflow,
       dispatch a fragment block with an incremental processing sequence
       number, BUT give it an I/O queue sequence number NOW.
     - For fragment blocks, add them to the I/O queue without allocating an
       I/O sequence number, as they already have one.
  - Dequeue from the I/O queue, ordered by I/O sequence number, and send the
    completed blocks to the block writer.

 If you have more insight or a better idea, please let me know.
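 To make the dispatch rules above concrete, here is a rough sketch of what
 the main thread might do for each completed block it dequeues in processing
 order. All helper functions, field and flag names are hypothetical:

    #include <stddef.h>

    enum {
        BLK_IS_FRAGMENT = 0x01,
        BLK_FRAGMENT_BLOCK = 0x02,
    };

    struct block {
        struct block *next;
        unsigned int proc_seq_num; /* order of completed blocks */
        unsigned int io_seq_num;   /* order of writes to disk */
        unsigned int flags;
        /* ... payload ... */
    };

    /* hypothetical helpers, named for illustration only; submitting a
     * block assigns it the next processing sequence number */
    struct block *append_to_fragment_block(struct block *frag);
    void submit_for_processing(struct block *blk);
    void io_queue_insert_sorted(struct block *blk);
    struct block *io_queue_dequeue_ordered(void);
    void block_writer_write(struct block *blk);

    static unsigned int io_enqueue_seq = 0;

    /* main thread: called once per completed block, in order of
     * processing sequence number */
    static void handle_completed(struct block *blk)
    {
        if (blk->flags & BLK_IS_FRAGMENT) {
            /* consolidate into the current fragment block; on
             * overflow, a finished fragment block comes back... */
            struct block *frag_blk = append_to_fragment_block(blk);

            if (frag_blk != NULL) {
                /* ...and is submitted for processing, BUT its
                 * position in the I/O order is pinned NOW */
                frag_blk->io_seq_num = io_enqueue_seq++;
                submit_for_processing(frag_blk);
            }
        } else if (blk->flags & BLK_FRAGMENT_BLOCK) {
            /* I/O sequence number was already allocated above */
            io_queue_insert_sorted(blk);
        } else {
            /* regular data block: gets the next I/O sequence number */
            blk->io_seq_num = io_enqueue_seq++;
            io_queue_insert_sorted(blk);
        }

        /* drain whatever is ready, in I/O order, to the block writer */
        while ((blk = io_queue_dequeue_ordered()) != NULL)
            block_writer_write(blk);
    }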
 2) Benchmarks
 *************

 TODO: benchmarks with the following images:
        - Debian live iso (2G)
        - Arch Linux live iso (550M)
        - Raspberry Pi 3 QT demo image (~300M)

       sqfs2tar $IMAGE | tar2sqfs -j $NUM_CPU -f out.sqfs

       Values to measure:
        - Total wall clock time of tar2sqfs.
        - Throughput (bytes read / time, bytes written / time).

       Try the above for different compressors and stuff everything into
       a huge spreadsheet. Then, determine the following and plot some
       nice graphs:

        - Absolute speedup (normalized to the serial implementation).
        - Absolute efficiency (= speedup / $NUM_CPU).
        - Relative speedup (normalized to the thread pool with -j 1).
        - Relative efficiency.


 Available test hardware:
  - 8(16) core AMD Ryzen 7 3700X, 32GiB DDR4 RAM.
  - Various 4 core Intel Xeon servers. Precise specs not known yet.
  - TODO: Check if my credentials on LCC2 still work. The cluster nodes
    AFAIK have dual socket Xeons. Not sure if that is 8 cores per CPU or
    8 in total.

 For some compressors and workloads, tar2sqfs may be I/O bound rather than
 CPU bound. The different machines have different storage, which may impact
 the results. Should this be taken into account for comparison, or be
 eliminated by using a ramdisk or by fiddling with the queue backlog?
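 As a footnote, a quick made-up example to illustrate the metrics defined
 above: if the serial implementation needs 100 seconds, the thread pool
 version with -j 1 needs 110 seconds and the one with -j 8 needs 20 seconds,
 then:

  - Absolute speedup    = 100s / 20s = 5.0
  - Absolute efficiency = 5.0 / 8 = 62.5%
  - Relative speedup    = 110s / 20s = 5.5
  - Relative efficiency = 5.5 / 8 = 68.75%

 (The numbers are invented purely to illustrate the formulas.)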