From 61bceccd5a9506d670bbe2ec225c0bb37bec1e9b Mon Sep 17 00:00:00 2001
From: David Oberhollenzer
Date: Sat, 22 Feb 2020 17:28:52 +0100
Subject: Try to clarify the parallel block processor documentation

Signed-off-by: David Oberhollenzer
---
 doc/parallelism.txt | 108 ++++++++++++++++++++++++++++++----------------------
 1 file changed, 62 insertions(+), 46 deletions(-)

diff --git a/doc/parallelism.txt b/doc/parallelism.txt
index 142444f..046c559 100644
--- a/doc/parallelism.txt
+++ b/doc/parallelism.txt
@@ -39,58 +39,74 @@
 1) Thread Pool Based Block Processor
 ************************************
 
- The main challenge of parallelizing the block processor lies in the fact the
- output HAS TO BE byte-for-byte equivalent to the serial reference
- implementation.
-
- This means:
- - File blocks have to be written in the exact same order as they
-   are submitted.
- - If storing a fragment overflows the fragment block, the resulting
-   fragment block has to be written next, no file data.
-
-
- The current implementation in winpthread.c (based on pthread or Windows
- native threads, depending on whats available) uses the following approach:
-
- - Each submitted data block or fragment gets an incremental sequence number
-   and is appended to a FIFO queue.
- - Multiple threads consume blocks from the queue and use the function
-   from common.c to process the dequeued blocks.
- - Completed blocks are inserted into a "done" queue, sorted by their
-   sequence number.
- - The main thread that submits blocks also dequeues the completed ones,
-   in order of their sequence number.
-   - Data blocks are given an "i/o sequence number" and inserted into
-     an I/O queue in order by this number.
-   - For fragments, it and calls the respective function from common.c
-     to deduplicate and append to the current fragment block.
-   - If the current fragment block overflows, it is given the next free
-     **I/O sequence number** and inserted into the **processing queue**.
-   - Completed fragment blocks are inserted into the I/O queue with
-     regard to the I/O sequence number they already have
- - The main thread also dequeues I/O blocks ordered by their I/O sequence
-   number, and passes them to the respective common.c function that eventually
-   forwards it to the block writer.
-
- To make sure the queue doesn't fill all RAM, submitted blocks are counted. The
- counter is also incremented when submitting to the I/O queue. The counter is
- decremented when dequeueing completed blocks or blocks from the I/O queue.
-
- The block submission function tries to add blocks only if the counter is below
- a pre-set threshold. If it is above, it first tries to dequeue I/O block, and
- if there are no I/O blocks, dequeue and handle "done" blocks. If the "done"
- queue is also empty, it uses signal/await to wait for the worker threads to
- complete some blocks to process. The main thread in turn uses signal/await to
- tell the worker threads it has submitted block (they wait if the work queue is
- empty).
+ 1.1) Goals and Challenges
+
+ In addition to the goal of boosting performance, the thread pool based block
+ processor must meet the following requirements:
+
+ - Output MUST be deterministic and reproducible, i.e. feeding byte-for-byte
+   the same input MUST ALWAYS produce byte-for-byte the same output.
+ - Blocks that belong to a single file MUST be written in the order that
+   they were submitted.
+ - Output MUST be byte-for-byte equivalent to the serial reference
+   implementation. Changing the serial reference implementation to
+   achieve this is OK.
+ - I/O cannot be done in worker threads. The underlying back-end must be
+   assumed to not be thread safe and may get very confused by suddenly running
+   in a different thread, even if only one thread at a time uses it.
+
+
+ 1.2) The Current Approach
+
+ The current implementation is in winpthread.c (based on pthread or Windows
+ native threads, depending on what's available).
+
+ It keeps track of blocks in 3 different FIFO queues:
+ - A "work queue" that freshly submitted blocks go to. Worker threads take
+   blocks from this queue for processing.
+ - A "done queue" that worker threads submit blocks to, once completed.
+ - An "I/O queue" that contains blocks ready to be written to disk.
+
+
+ When the main thread submits a block, it gives it an incremental "processing"
+ sequence number and appends it to the "work queue". Thread pool workers take
+ the first available block off the queue, process it, and add it to the "done"
+ queue, sorted by its processing sequence number.
+
+ The main thread dequeues blocks from the done queue sorted by their processing
+ sequence number, using a second counter to make sure blocks are dequeued in
+ the exact same order as they were added to the work queue.
+
+ Regular data blocks from the "done queue" are given an incremental I/O
+ sequence number and added to the "I/O queue", sorted by this number.
+
+ Fragments are deduplicated and consolidated into a fragment block. If this
+ block overflows, it is appended to the "work queue" the exact same way as
+ regular blocks, but it is **already given an I/O sequence number**.
+
+ If a block dequeued from the "done queue" turns out to be a fragment block, it
+ is added to the "I/O queue", sorted by the I/O sequence number **that it
+ already has**, i.e. no new sequence number is allocated.
+
+ The I/O queue is dequeued in the same fashion as the "done queue", using a
+ second counter to enforce ordering.
+
+
+ The actual implementation interleaves enqueueing and dequeueing in the block
+ submission function. It dequeues blocks if the queues reach a pre-set maximum
+ backlog. In that case, it tries to dequeue from the I/O queue first and, if
+ that fails, tries to dequeue from the "done queue". If that also fails, it
+ uses signal/await to be woken up by a worker thread once it adds a block to
+ the "done queue". Fragment post-processing and re-queueing of blocks are done
+ inside the critical region, but the actual I/O is obviously done outside.
+
 
 Profiling on small filesystems using perf shows that the outlined approach
 seems to perform quite well for CPU bound compressors like XZ, but doesn't
 add a lot for I/O bound compressors like zstd.
 
 Actual measurements still need to be done.
 
-If you have a more insights or a better idea, please let me know.
+If you have a better idea how to do this, please let me know.
 
 2) Benchmarks
--
cgit v1.2.3
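
As an illustration of the ordering mechanism described in section 1.2 of the
patched text, the sketch below shows a queue that is filled out of order by
workers but drained strictly in sequence number order, which is how the "done"
and I/O queues are described. This is a minimal sketch, not the actual
winpthread.c code: all identifiers (blk_t, ordered_queue_t, ordered_queue_put,
ordered_queue_take) are invented for the example, and error handling is
omitted.

    #include <pthread.h>
    #include <stddef.h>

    /* Hypothetical block object; the real one carries payload and flags. */
    typedef struct blk_t {
        struct blk_t *next;
        unsigned int seq_num;  /* processing or I/O sequence number */
    } blk_t;

    typedef struct {
        pthread_mutex_t mtx;
        pthread_cond_t cond;
        blk_t *head;
        unsigned int next_seq; /* the "second counter" enforcing order */
    } ordered_queue_t;

    /* Insert a block, keeping the list sorted by sequence number. */
    static void ordered_queue_put(ordered_queue_t *q, blk_t *blk)
    {
        blk_t **it;

        pthread_mutex_lock(&q->mtx);
        for (it = &q->head; *it != NULL; it = &(*it)->next) {
            if ((*it)->seq_num > blk->seq_num)
                break;
        }
        blk->next = *it;
        *it = blk;
        pthread_cond_broadcast(&q->cond);
        pthread_mutex_unlock(&q->mtx);
    }

    /* Remove the block carrying the next expected sequence number,
       waiting until exactly that block sits at the head. */
    static blk_t *ordered_queue_take(ordered_queue_t *q)
    {
        blk_t *blk;

        pthread_mutex_lock(&q->mtx);
        while (q->head == NULL || q->head->seq_num != q->next_seq)
            pthread_cond_wait(&q->cond, &q->mtx);

        blk = q->head;
        q->head = blk->next;
        blk->next = NULL;
        q->next_seq += 1;
        pthread_mutex_unlock(&q->mtx);
        return blk;
    }

The while loop around pthread_cond_wait() is what makes the "second counter"
work: even if later blocks are already sitting in the queue, the consumer
sleeps until the block with the expected sequence number arrives, so blocks
come out in the exact order their sequence numbers were handed out, no matter
in which order the workers finished them.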
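
The backlog throttling in the block submission path can be sketched the same
way, reusing blk_t from the previous example. Again, every name below
(processor_t, MAX_BACKLOG, the *_try_take helpers) is an assumption made for
illustration; the sketch only mirrors the control flow the patch describes,
including popping a block from the I/O queue under the lock but doing the
actual write outside the critical region.

    #define MAX_BACKLOG 10  /* arbitrary threshold for the example */

    typedef struct {
        pthread_mutex_t mtx;
        pthread_cond_t work_cond;   /* signalled when work is submitted */
        pthread_cond_t done_cond;   /* signalled when a worker finishes */
        unsigned int backlog;       /* blocks currently in flight */
        unsigned int proc_seq_next; /* next processing sequence number */
        /* the three queues would live here as well */
    } processor_t;

    /* Assumed helpers, not shown: the *_try_take functions pop a block
       from the respective queue if one is ready, or return NULL;
       handle_done_block() does the post-processing described above and
       may move blocks to the I/O queue or re-queue fragment blocks. */
    blk_t *io_queue_try_take(processor_t *proc);
    blk_t *done_queue_try_take(processor_t *proc);
    void handle_done_block(processor_t *proc, blk_t *blk);
    void work_queue_append(processor_t *proc, blk_t *blk);
    void write_to_disk(blk_t *blk);

    static void submit_block(processor_t *proc, blk_t *blk)
    {
        blk_t *out;

        pthread_mutex_lock(&proc->mtx);

        while (proc->backlog >= MAX_BACKLOG) {
            /* 1st choice: flush a block that is ready for writing */
            out = io_queue_try_take(proc);
            if (out != NULL) {
                /* the write itself happens outside the critical region */
                pthread_mutex_unlock(&proc->mtx);
                write_to_disk(out);
                pthread_mutex_lock(&proc->mtx);
                proc->backlog -= 1;
                continue;
            }

            /* 2nd choice: post-process a completed block; this stays
               inside the critical region and eventually feeds the
               I/O queue, which lowers the backlog on the next pass */
            out = done_queue_try_take(proc);
            if (out != NULL) {
                handle_done_block(proc, out);
                continue;
            }

            /* nothing ready: wait for a worker to finish a block */
            pthread_cond_wait(&proc->done_cond, &proc->mtx);
        }

        blk->seq_num = proc->proc_seq_next++;
        work_queue_append(proc, blk);
        proc->backlog += 1;

        /* workers wait on work_cond if the work queue is empty */
        pthread_cond_broadcast(&proc->work_cond);
        pthread_mutex_unlock(&proc->mtx);
    }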