Diffstat (limited to 'doc')
-rw-r--r--	doc/parallelism.txt	96
1 files changed, 31 insertions, 65 deletions
diff --git a/doc/parallelism.txt b/doc/parallelism.txt
index e509d86..142444f 100644
--- a/doc/parallelism.txt
+++ b/doc/parallelism.txt
@@ -60,80 +60,46 @@
   - Completed blocks are inserted into a "done" queue, sorted by their
     sequence number.
   - The main thread that submits blocks also dequeues the completed ones,
-    keeping track of the sequence numbers, and calls the respective common
-    functions for processing completed blocks and fragments.
-  - If a fragment block is created, it is submitted with *the same* sequence
-    number as the fragment that caused it to overflow and the next expected
-    sequence number is reset to that.
-
- To make sure the queue doesn't fill all RAM, submitted blocks are counted.
- The counter is decremented when dequeueing completed blocks. If it reaches
- a maximum, signal/await is used to wait for the worker threads to complete
- some blocks to process. Similarly, the worker threads use signal/await to
- wait on the queue if it is empty.
-
-
- 1.1) Problems
-
- The outlined approach performs sub-optimal, with an efficiency somewhere
- between 50% to 75% on the bench mark data used.
-
- Profiling using perf shows that almost a third of the time, only one
- worker thread is actually active, while the others are waiting.
-
- The current hypothesis is that this is caused by many small input files
- being processed, causing a work load consisting primarily of fragments.
-
-  - Fragments are only hashed, not compressed, so the work is
-    primarily I/O bound.
-  - After a number of fragments are consumed, a fragment block is created.
-  - The fragment block is submitted to the almost empty queue and the
-    I/O thread waits for it to be completed before doing anything else.
-  - One thread gets to handle the fragment block, which involves a lot more
-    work. Meanwhile the other threads starve on the empty queue.
-  - After that has finally been handed of to the I/O thread, another burst of
-    fragments comes in.
-  - Rinse and repeat.
-
-
- 1.2) Proposed Solution
-
- It makes no sense for the main thread to block until the fragment block is
- done. It can process further fragments (just not write blocks), creating
- more fragment blocks on the way.
-
- A possible implementation might be to maintain 3 queues instead of 2:
-
-  - A queue with submitted blocks.
-  - A queue with completed blocks.
-  - A queue for blocks ready to be written to disk ("I/O queue").
-
- A second sequence number is needed for keeping order on the I/O queue:
-
-  - Submit blocks as before with incremental processing sequence number.
-  - Dequeue completed blocks in order by processing sequence number.
-     - For regular blocks, add them to the I/O queue with incremental I/O
-       sequence number.
-     - For fragments, consolidate them into fragment blocks. On overflow,
-       dispatch a fragment block with incremental processing sequence number,
-       BUT give it an I/O queue sequence number NOW.
-     - For fragment blocks, add them to the I/O queue without allocating an
-       I/O sequence number, it already has one.
-  - Dequeue ordered by I/O sequence number from the I/O queue and send the
-    completed blocks to the block writer.
-
+    in order of their sequence number.
+      - Data blocks are given an "i/o sequence number" and inserted into
+        an I/O queue in order by this number.
+      - For fragments, it calls the respective function from common.c
+        to deduplicate and append to the current fragment block.
+      - If the current fragment block overflows, it is given the next free
+        **I/O sequence number** and inserted into the **processing queue**.
+      - Completed fragment blocks are inserted into the I/O queue with
+        regard to the I/O sequence number they already have.
+  - The main thread also dequeues I/O blocks ordered by their I/O sequence
+    number, and passes them to the respective common.c function that eventually
+    forwards them to the block writer.
+
+ To make sure the queue doesn't fill all RAM, submitted blocks are counted. The
+ counter is also incremented when submitting to the I/O queue. The counter is
+ decremented when dequeueing completed blocks or blocks from the I/O queue.
+
+ The block submission function tries to add blocks only if the counter is below
+ a pre-set threshold. If it is above, it first tries to dequeue an I/O block,
+ and if there are no I/O blocks, dequeue and handle "done" blocks. If the
+ "done" queue is also empty, it uses signal/await to wait for the worker
+ threads to complete some blocks to process. The main thread in turn uses
+ signal/await to tell the worker threads it has submitted a block (they wait
+ if the work queue is empty).
+
+ Profiling on small filesystems using perf shows that the outlined approach
+ seems to perform quite well for CPU-bound compressors like XZ, but doesn't
+ add a lot for I/O-bound compressors like zstd. Actual measurements still
+ need to be done.
 
 If you have more insights or a better idea, please let me know.
-
 
 2) Benchmarks
 *************
 
 TODO: benchmarks with the following images:
 
 	- Debian live iso (2G)
-	- Arch Linux live iso (550M)
-	- Raspberry Pi 3 QT demo image (~300M)
+	- Arch Linux live iso (~550M)
+	- Raspberry Pi 3 QT demo image (~390M)
 
 	sqfs2tar $IMAGE | tar2sqfs -j $NUM_CPU -f out.sqfs
 
