-rw-r--r--  CHANGELOG.md         |  1
-rw-r--r--  doc/parallelism.txt  | 96
2 files changed, 32 insertions, 65 deletions
diff --git a/CHANGELOG.md b/CHANGELOG.md
index 7cf9b4d..81080e6 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -20,6 +20,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- Split data writer up into a block writer and a block processor.
- All abstract data types inherit from an sqfs_object_t with common
 functionality.
+- Lots of performance improvements.
### Fixed
- Include sys/sysmacros.h on any GNU libc platform.
diff --git a/doc/parallelism.txt b/doc/parallelism.txt
index e509d86..142444f 100644
--- a/doc/parallelism.txt
+++ b/doc/parallelism.txt
@@ -60,80 +60,46 @@
- Completed blocks are inserted into a "done" queue, sorted by their
sequence number.
- The main thread that submits blocks also dequeues the completed ones,
- keeping track of the sequence numbers, and calls the respective common
- functions for processing completed blocks and fragments.
- - If a fragment block is created, it is submitted with *the same* sequence
- number as the fragment that caused it to overflow and the next expected
- sequence number is reset to that.
-
- To make sure the queue doesn't fill all RAM, submitted blocks are counted.
- The counter is decremented when dequeueing completed blocks. If it reaches
- a maximum, signal/await is used to wait for the worker threads to complete
- some blocks to process. Similarly, the worker threads use signal/await to
- wait on the queue if it is empty.
-
-
- 1.1) Problems
-
- The outlined approach performs sub-optimally, with an efficiency somewhere
- between 50% and 75% on the benchmark data used.
-
- Profiling using perf shows that almost a third of the time, only one
- worker thread is actually active, while the others are waiting.
-
- The current hypothesis is that this is caused by many small input files
- being processed, causing a work load consisting primarily of fragments.
-
- - Fragments are only hashed, not compressed, so the work is
- primarily I/O bound.
- - After a number of fragments are consumed, a fragment block is created.
- - The fragment block is submitted to the almost empty queue and the
- I/O thread waits for it to be completed before doing anything else.
- - One thread gets to handle the fragment block, which involves a lot more
- work. Meanwhile the other threads starve on the empty queue.
- After that has finally been handed off to the I/O thread, another burst of
-   fragments comes in.
- - Rinse and repeat.
-
-
- 1.2) Proposed Solution
-
- It makes no sense for the main thread to block until the fragment block is
- done. It can process further fragments (just not write blocks), creating
- more fragment blocks on the way.
-
- A possible implementation might be to maintain 3 queues instead of 2:
-
- - A queue with submitted blocks.
- - A queue with completed blocks.
- - A queue for blocks ready to be written to disk ("I/O queue").
-
- A second sequence number is needed for keeping order on the I/O queue:
-
- - Submit blocks as before with incremental processing sequence number.
- - Dequeue completed blocks in order by processing sequence number.
- - For regular blocks, add them to the I/O queue with incremental I/O
- sequence number.
- - For fragments, consolidate them into fragment blocks. On overflow,
- dispatch a fragment block with incremental processing sequence number,
- BUT give it an I/O queue sequence number NOW.
- - For fragment blocks, add them to the I/O queue without allocating an
- I/O sequence number, it already has one.
- - Dequeue ordered by I/O sequence number from the I/O queue and send the
- completed blocks to the block writer.
-
+ in order of their sequence number.
+ - Data blocks are given an "I/O sequence number" and inserted into
+   an I/O queue in order by this number.
+ - For fragments, the respective function from common.c is called to
+   deduplicate them and append them to the current fragment block.
+ - If the current fragment block overflows, it is given the next free
+ **I/O sequence number** and inserted into the **processing queue**.
+ - Completed fragment blocks are inserted into the I/O queue in order by
+   the I/O sequence number they already have.
+ - The main thread also dequeues I/O blocks ordered by their I/O sequence
+   number and passes them to the respective common.c function that
+   eventually forwards them to the block writer (see the sketch after
+   this list).
+
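+ The following is a minimal C sketch of the ordering logic described above.
+ All type and helper names (block_t, io_queue_insert_sorted, and so on) are
+ made up for illustration and are not the actual libsquashfs API:
+
+   #include <stdint.h>
+   #include <stddef.h>
+
+   typedef enum { BLK_DATA, BLK_FRAGMENT, BLK_FRAGMENT_BLOCK } blk_kind_t;
+
+   typedef struct block_t {
+       struct block_t *next;
+       blk_kind_t kind;
+       uint32_t proc_seq;  /* processing order, assigned on submission */
+       uint32_t io_seq;    /* on-disk order, assigned by the main thread */
+       /* ... compressed payload, size, checksum ... */
+   } block_t;
+
+   /* assumed helpers, definitions omitted */
+   void io_queue_insert_sorted(block_t *blk);       /* ordered by io_seq */
+   block_t *io_queue_pop_if_seq(uint32_t expected); /* NULL if not next */
+   void fragment_append_and_maybe_flush(block_t *f, uint32_t *io_seq_ctr);
+   void block_writer_write(block_t *blk);
+
+   static uint32_t io_seq_next = 0; /* next free I/O sequence number */
+   static uint32_t io_seq_out  = 0; /* next I/O seq expected by the writer */
+
+   /* Called for every completed block, strictly in order of proc_seq. */
+   static void handle_done_block(block_t *blk)
+   {
+       switch (blk->kind) {
+       case BLK_DATA:
+           /* regular data block: gets the next free I/O sequence number */
+           blk->io_seq = io_seq_next++;
+           io_queue_insert_sorted(blk);
+           break;
+       case BLK_FRAGMENT:
+           /* dedup + append to the current fragment block; on overflow,
+              the fragment block gets io_seq_next++ right away and goes
+              back onto the processing queue */
+           fragment_append_and_maybe_flush(blk, &io_seq_next);
+           break;
+       case BLK_FRAGMENT_BLOCK:
+           /* keeps the io_seq it received when it was dispatched */
+           io_queue_insert_sorted(blk);
+           break;
+       }
+
+       /* forward everything that is now in order to the block writer */
+       for (;;) {
+           block_t *ready = io_queue_pop_if_seq(io_seq_out);
+           if (ready == NULL)
+               break;
+           block_writer_write(ready);
+           io_seq_out++;
+       }
+   }
+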
+ To make sure the queue doesn't fill all RAM, submitted blocks are counted. The
+ counter is also incremented when submitting to the I/O queue. The counter is
+ decremented when dequeueing completed blocks or blocks from the I/O queue.
+
+ The block submission function adds blocks only while the counter is below a
+ pre-set threshold. If it is above the threshold, it first tries to dequeue
+ an I/O block; if there are none, it dequeues and handles "done" blocks
+ instead. If the "done" queue is also empty, it uses signal/await to wait for
+ the worker threads to complete some blocks (see the sketch below). The main
+ thread in turn uses signal/await to tell the worker threads that it has
+ submitted a block (they wait if the work queue is empty).
+
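+ A rough sketch of that submission path, reusing the names from the previous
+ sketch; the threshold and the remaining helpers are again purely
+ illustrative, not the real implementation:
+
+   #define MAX_BACKLOG 10              /* made-up threshold */
+
+   static uint32_t proc_seq_next = 0;  /* next free processing seq number */
+   static unsigned int backlog   = 0;  /* blocks in flight */
+
+   /* more assumed helpers, definitions omitted */
+   block_t *done_queue_pop_if_next(void); /* next by proc_seq, or NULL */
+   void work_queue_append(block_t *blk);
+   void wait_on_done_queue(void);         /* await: workers signal progress */
+   void signal_workers(void);             /* wake workers waiting for work */
+
+   static void submit_block(block_t *blk)
+   {
+       /* drain other queues until the backlog is small enough again */
+       while (backlog >= MAX_BACKLOG) {
+           block_t *done;
+
+           /* 1) prefer blocks that are already in order on the I/O queue */
+           done = io_queue_pop_if_seq(io_seq_out);
+           if (done != NULL) {
+               block_writer_write(done);
+               io_seq_out++;
+               backlog--;
+               continue;
+           }
+
+           /* 2) otherwise handle a completed block, if one is in order
+                 (in the full scheme the counter is also bumped when this
+                 pushes a block onto the I/O queue; omitted here) */
+           done = done_queue_pop_if_next();
+           if (done != NULL) {
+               backlog--;
+               handle_done_block(done);
+               continue;
+           }
+
+           /* 3) nothing ready: wait until the workers complete something */
+           wait_on_done_queue();
+       }
+
+       blk->proc_seq = proc_seq_next++;
+       work_queue_append(blk);
+       backlog++;
+       signal_workers();
+   }
+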
+ Profiling on small filesystems using perf shows that the outlined approach
+ seems to perform quite well for CPU bound compressors like XZ, but doesn't
+ add a lot for I/O bound compressors like zstd. Actual measurements still
+ need to be done.
If you have more insights or a better idea, please let me know.
-
2) Benchmarks
*************
TODO: benchmarks with the following images:
- Debian live iso (2G)
- - Arch Linux live iso (550M)
- - Raspberry Pi 3 QT demo image (~300M)
+ - Arch Linux live iso (~550M)
+ - Raspberry Pi 3 QT demo image (~390M)
sqfs2tar $IMAGE | tar2sqfs -j $NUM_CPU -f out.sqfs