aboutsummaryrefslogtreecommitdiff
diff options
context:
space:
mode:
authorDavid Oberhollenzer <david.oberhollenzer@sigma-star.at>2020-02-22 17:28:52 +0100
committerDavid Oberhollenzer <david.oberhollenzer@sigma-star.at>2020-02-22 17:37:35 +0100
commit61bceccd5a9506d670bbe2ec225c0bb37bec1e9b (patch)
tree17ae5bad026fea691ac1d116f2699bae0d9ec52b
parent3d444455ee62eedc91b2eeeddb01d39ec0bcfecb (diff)
Try to clarify the parallel block processor documentation
Signed-off-by: David Oberhollenzer <david.oberhollenzer@sigma-star.at>
-rw-r--r--doc/parallelism.txt108
1 files changed, 62 insertions, 46 deletions
diff --git a/doc/parallelism.txt b/doc/parallelism.txt
index 142444f..046c559 100644
--- a/doc/parallelism.txt
+++ b/doc/parallelism.txt
@@ -39,58 +39,74 @@
1) Thread Pool Based Block Processor
************************************
- The main challenge of parallelizing the block processor lies in the fact the
- output HAS TO BE byte-for-byte equivalent to the serial reference
- implementation.
-
- This means:
- - File blocks have to be written in the exact same order as they
- are submitted.
- - If storing a fragment overflows the fragment block, the resulting
- fragment block has to be written next, no file data.
-
-
- The current implementation in winpthread.c (based on pthread or Windows
- native threads, depending on whats available) uses the following approach:
-
- - Each submitted data block or fragment gets an incremental sequence number
- and is appended to a FIFO queue.
- - Multiple threads consume blocks from the queue and use the function
- from common.c to process the dequeued blocks.
- - Completed blocks are inserted into a "done" queue, sorted by their
- sequence number.
- - The main thread that submits blocks also dequeues the completed ones,
- in order of their sequence number.
- - Data blocks are given an "i/o sequence number" and inserted into
- an I/O queue in order by this number.
- - For fragments, it and calls the respective function from common.c
- to deduplicate and append to the current fragment block.
- - If the current fragment block overflows, it is given the next free
- **I/O sequence number** and inserted into the **processing queue**.
- - Completed fragment blocks are inserted into the I/O queue with
- regard to the I/O sequence number they already have
- - The main thread also dequeues I/O blocks ordered by their I/O sequence
- number, and passes them to the respective common.c function that eventually
- forwards it to the block writer.
-
- To make sure the queue doesn't fill all RAM, submitted blocks are counted. The
- counter is also incremented when submitting to the I/O queue. The counter is
- decremented when dequeueing completed blocks or blocks from the I/O queue.
-
- The block submission function tries to add blocks only if the counter is below
- a pre-set threshold. If it is above, it first tries to dequeue I/O block, and
- if there are no I/O blocks, dequeue and handle "done" blocks. If the "done"
- queue is also empty, it uses signal/await to wait for the worker threads to
- complete some blocks to process. The main thread in turn uses signal/await to
- tell the worker threads it has submitted block (they wait if the work queue is
- empty).
+ 1.1) Goals and Challanges
+
+ In addition to the goal of boosting performance, the thread pool based block
+ processor must meet the following requirements:
+
+ - Output MUST be deterministic and reproducible. I.e. feeding byte-for-byte
+ the same input MUST ALWAYS produce byte-for-byte the same output.
+ - Blocks the belong to a single file MUST be written in the order that
+ they were submitted.
+ - Output MUST be byte-for-byte equivalent to the serial reference
+ implementation. Changing the serial reference implementation to
+ achieve this is OK.
+ - I/O cannot be done in worker threads. The underlying back-end must be
+ assumed to not be thread safe and may get very confused by suddenly running
+ in a different thread, even if only one thread at a time uses it.
+
+
+ 1.2) The Current Approach
+
+ The current implementation is in winpthread.c (based on pthread or Windows
+ native threads, depending on whats available).
+
+ It keeps track of blocks in 3 different FIFO queues:
+ - A "work queue" that freshly submitted blocks go to. Worker threads take
+ blocks from this queue for processing.
+ - A "done queue" that worker threads submit blocks to, once completed.
+ - An "I/O queue" that contains blocks ready to be written to disk.
+
+
+ When the main thread submits a block, it gives it an incremental "processing"
+ sequence number and appends it to the "work queue". Thread pool workers take
+ the first best block of the queue, process it and added it to the "done"
+ queue, sorted by its processing sequence number.
+
+ The main thread dequeues blocks from the done queue sorted by their processing
+ sequence number, using a second counter to make sure blocks are dequeued in
+ the exact same order as they were added to the work queue.
+
+ Regular data blocks from the "done queue" are given an incremental I/O
+ sequence number and added to the "I/O queue", sorted by this number.
+
+ Fragments are deduplicated and consolidated into a fragment block. If this
+ block overflows, it is appended to the "work queue" the exact same way as
+ regular blocks, but it is **already given an I/O sequence number**.
+
+ If a block dequeued from the "done queue" turns out to be a fragment block, it
+ is added to the "I/O queue", sorted by its I/O sequence number **that it
+ already has**, i.e. no new sequence number is allocated.
+
+ The I/O queue is dequeued in the same fashion as the "done queue", using a
+ second counter to enforce ordering.
+
+
+ The actual implementation interleaves enqueueing and dequeueing in the block
+ submission function. It dequeues blocks if the queues reach a pre-set maximum
+ backlog. In that case, it tries to dequeue from the I/O queue first and if
+ that fails, tries to dequeue from the "done queue". If that also fails, it
+ uses signal/await to be woken up by a worker thread once it adds a block to
+ the "done queue". Fragment post-processing and re-queueing of blocks is done
+ inside the critical region, but the actual I/O is obviously done outside.
+
Profiling on small filesystems using perf shows that the outlined approach
seems to perform quite well for CPU bound compressors like XZ, but doesn't
add a lot for I/O bound compressors like zstd. Actual measurements still
need to be done.
- If you have a more insights or a better idea, please let me know.
+ If you have a better idea how to do this, please let me know.
2) Benchmarks