Parallelizing SquashFS Data Packing
***********************************
0) Overview
***********
On a high level, data blocks are processed as follows:
The "block processor" has a simple begin/append/end interface for submitting
file data. Internally it chops the file data up into fixed size blocks that
are each [optionally] compressed and hashed. If the "end" function is called
and there is still left over data, a fragment is created.
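As an illustration only, the submission interface can be pictured roughly like
this (the names and signatures below are a simplified sketch, not the actual
library API):

    #include <stddef.h>

    /* Hypothetical, simplified submission interface; the names and signatures
       here are illustrative and not the actual library API. */
    typedef struct block_processor_t block_processor_t;

    /* Begin submitting data for a new file. */
    int block_processor_begin_file(block_processor_t *proc);

    /* Append file data; internally it is chopped into fixed size blocks that
       are [optionally] compressed and hashed. */
    int block_processor_append(block_processor_t *proc, const void *data,
                               size_t size);

    /* Finish the current file; left over data that does not fill a whole
       block becomes a fragment. */
    int block_processor_end_file(block_processor_t *proc);
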
Fragments are only hashed. If another fragment exists with the same size and
hash, it is discarded and the existing fragment is referenced. Fragments are
collected in a fragment block that, once it overflows, is processed like a
normal block.
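A minimal sketch of what the size/hash based fragment lookup amounts to (the
struct layout and names here are made up for illustration):

    #include <stddef.h>
    #include <stdint.h>
    #include <stdbool.h>

    /* Illustrative only: one entry per fragment already collected. */
    typedef struct {
        uint32_t size;
        uint32_t hash;
    } frag_info_t;

    /* Return true and store the index if an identical fragment already
       exists, so the new one can be discarded and the old one referenced. */
    static bool find_duplicate_fragment(const frag_info_t *table, size_t count,
                                        uint32_t size, uint32_t hash,
                                        size_t *index_out)
    {
        for (size_t i = 0; i < count; ++i) {
            if (table[i].size == size && table[i].hash == hash) {
                *index_out = i;
                return true;
            }
        }
        return false;
    }
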
The final compressed & hashed data blocks & fragment blocks are passed on to
the "block writer".
The block writer simply writes blocks to the output file. Flags are used to
communicate what the first and last block of a file are. Entire files are
deduplicated by trying to find a sequence of identical size/hash pairs in
the already written blocks.
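Conceptually, the file level deduplication can be thought of as a naive search
for a matching run of size/hash pairs among the blocks already written. A
sketch under assumed data structures, not the actual implementation:

    #include <stddef.h>
    #include <stdint.h>

    /* Illustrative record for a block that was already written out. */
    typedef struct {
        uint32_t size;
        uint32_t hash;
    } blk_info_t;

    /* Find the start index of a run of `count` already written blocks whose
       size/hash pairs match `file_blocks`. Returns -1 if there is none. */
    static long find_file_duplicate(const blk_info_t *written,
                                    size_t written_count,
                                    const blk_info_t *file_blocks, size_t count)
    {
        if (count == 0 || written_count < count)
            return -1;

        for (size_t i = 0; i + count <= written_count; ++i) {
            size_t j;

            for (j = 0; j < count; ++j) {
                if (written[i + j].size != file_blocks[j].size ||
                    written[i + j].hash != file_blocks[j].hash)
                    break;
            }
            if (j == count)
                return (long)i;
        }
        return -1;
    }
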
0.1) Implementation
The implementation of the block processor is in lib/sqfs/block_processor. The
file common.c contains the frontend for file data submission and common
functions for processing a single block, handling a completed block and
handling a completed fragment.
A reference serial implementation is provided in the file serial.c.
1) Thread Pool Based Block Processor
************************************
1.1) Goals and Challenges
In addition to the goal of boosting performance, the thread pool based block
processor must meet the following requirements:
- Output MUST be deterministic and reproducible. I.e. feeding byte-for-byte
the same input MUST ALWAYS produce byte-for-byte the same output.
- Blocks that belong to a single file MUST be written in the order that
they were submitted.
- Output MUST be byte-for-byte equivalent to the serial reference
implementation. Changing the serial reference implementation to
achieve this is OK.
- I/O cannot be done in worker threads. The underlying back-end must be
assumed to not be thread safe and may get very confused by suddenly running
in a different thread, even if only one thread at a time uses it.
1.2) The Current Approach
The current implementation is in winpthread.c (based on pthread or Windows
native threads, depending on what's available).
It keeps track of blocks in 3 different FIFO queues:
- A "work queue" that freshly submitted blocks go to. Worker threads take
blocks from this queue for processing.
- A "done queue" that worker threads submit blocks to, once completed.
- An "I/O queue" that contains blocks ready to be written to disk.
When the main thread submits a block, it gives it an incremental "processing"
sequence number and appends it to the "work queue". Thread pool workers take
the first block off the queue, process it and add it to the "done"
queue, sorted by its processing sequence number.
The main thread dequeues blocks from the done queue sorted by their processing
sequence number, using a second counter to make sure blocks are dequeued in
the exact same order as they were added to the work queue.
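The ordering logic boils down to a sorted insert plus a dequeue that only
succeeds if the head of the queue carries the next expected sequence number.
A rough sketch, building on the block_t definition from the previous sketch:

    /* Insert a block into a singly linked queue, keeping it sorted by its
       processing sequence number. */
    static void queue_insert_sorted(block_t **queue, block_t *blk)
    {
        while (*queue != NULL && (*queue)->proc_seq < blk->proc_seq)
            queue = &(*queue)->next;

        blk->next = *queue;
        *queue = blk;
    }

    /* Remove the head only if it is the block expected next, so blocks leave
       the queue in exactly the order they were submitted. */
    static block_t *queue_pop_ordered(block_t **queue, uint32_t *expected_seq)
    {
        block_t *blk = *queue;

        if (blk == NULL || blk->proc_seq != *expected_seq)
            return NULL;

        *queue = blk->next;
        blk->next = NULL;
        *expected_seq += 1;
        return blk;
    }
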
Regular data blocks from the "done queue" are given an incremental I/O
sequence number and added to the "I/O queue", sorted by this number.
Fragments are deduplicated and consolidated into a fragment block. If this
block overflows, it is appended to the "work queue" the exact same way as
regular blocks, but it is **already given an I/O sequence number**.
If a block dequeued from the "done queue" turns out to be a fragment block, it
is added to the "I/O queue", sorted by its I/O sequence number **that it
already has**, i.e. no new sequence number is allocated.
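In other words, when draining the done queue the handling could look roughly
like this (again a sketch building on the structures above; the flag name and
value are made up, and consolidation of plain fragments is not shown):

    /* Illustrative flag; the real code uses its own block flags. */
    #define BLK_IS_FRAGMENT_BLOCK 0x01

    /* io_seq analogue of queue_insert_sorted() from the previous sketch. */
    static void io_queue_insert_sorted(block_t **queue, block_t *blk)
    {
        while (*queue != NULL && (*queue)->io_seq < blk->io_seq)
            queue = &(*queue)->next;

        blk->next = *queue;
        *queue = blk;
    }

    /* Handle one block taken off the done queue: a fragment block already
       carries the I/O sequence number it was given when it was re-queued,
       a regular data block is assigned the next free one here. */
    static void handle_done_block(thread_pool_processor_t *proc, block_t *blk)
    {
        if (!(blk->flags & BLK_IS_FRAGMENT_BLOCK))
            blk->io_seq = proc->io_enqueue_id++;

        io_queue_insert_sorted(&proc->io_queue, blk);
    }
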
The I/O queue is dequeued in the same fashion as the "done queue", using a
second counter to enforce ordering.
The actual implementation interleaves enqueueing and dequeueing in the block
submission function. It dequeues blocks if the queues reach a pre-set maximum
backlog. In that case, it tries to dequeue from the I/O queue first and if
that fails, tries to dequeue from the "done queue". If that also fails, it
uses signal/await to be woken up by a worker thread once it adds a block to
the "done queue". Fragment post-processing and re-queueing of blocks is done
inside the critical region, but the actual I/O is obviously done outside.
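Putting the pieces together, the submit path might be structured roughly as
follows. This is a heavily simplified sketch built on the earlier illustrative
helpers; error handling, fragment consolidation and the actual write to disk
(which happens outside the lock) are omitted:

    #define MAX_BACKLOG 64  /* illustrative; the real limit is configurable */

    /* io_seq analogue of queue_pop_ordered() from the earlier sketch. */
    static block_t *io_queue_pop_ordered(block_t **queue, uint32_t *expected_seq)
    {
        block_t *blk = *queue;

        if (blk == NULL || blk->io_seq != *expected_seq)
            return NULL;

        *queue = blk->next;
        blk->next = NULL;
        *expected_seq += 1;
        return blk;
    }

    /* Illustrative submit path: enqueue a freshly created block and, once the
       backlog limit is hit, drain completed blocks before returning. */
    static void submit_block(thread_pool_processor_t *proc, block_t *blk)
    {
        pthread_mutex_lock(&proc->mtx);

        blk->proc_seq = proc->proc_enqueue_id++;
        queue_insert_sorted(&proc->work_queue, blk);
        proc->backlog += 1;
        pthread_cond_broadcast(&proc->work_cond);    /* wake idle workers */

        while (proc->backlog >= MAX_BACKLOG) {
            /* try the I/O queue first: a block popped here is ready to be
               written out (the write itself would happen outside the lock) */
            block_t *out = io_queue_pop_ordered(&proc->io_queue,
                                                &proc->io_dequeue_id);
            if (out != NULL) {
                proc->backlog -= 1;
                /* ... hand `out` to the block writer after unlocking ... */
                continue;
            }

            /* otherwise try the done queue: a block popped here is routed to
               the I/O queue (or fragment handling) via handle_done_block() */
            out = queue_pop_ordered(&proc->done_queue, &proc->proc_dequeue_id);
            if (out != NULL) {
                handle_done_block(proc, out);
                continue;
            }

            /* nothing ready yet: wait for a worker to finish a block */
            pthread_cond_wait(&proc->done_cond, &proc->mtx);
        }

        pthread_mutex_unlock(&proc->mtx);
    }
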
Profiling on small filesystems using perf shows that the outlined approach
seems to perform quite well for CPU-bound compressors like XZ, but doesn't
add much for I/O-bound compressors like zstd. Actual measurements still
need to be done.
If you have a better idea how to do this, please let me know.
2) Benchmarks
*************
TODO: benchmarks with the following images:
- Debian live iso (2G)
- Arch Linux live iso (~550M)
- Raspberry Pi 3 QT demo image (~390M)
sqfs2tar $IMAGE | tar2sqfs -j $NUM_CPU -f out.sqfs
Values to measure:
- Total wall clock time of tar2sqfs.
- Throughput (bytes read / time, bytes written / time).
Try the above for different compressors and stuff everything into
a huge spreadsheet. Then, determine the following and plot some
nice graphs:
- Absolute speedup (normalized to serial implementation).
- Absolute efficiency (= speedup / $NUM_CPU)
- Relative speedup (normalized to thread pool with -j 1).
- Relative efficiency
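For reference, these metrics would presumably be computed as follows, with
T_serial the wall clock time of the serial reference implementation and T(n)
that of the thread pool version at -j n:

    absolute speedup:      S_abs(n) = T_serial / T(n)
    absolute efficiency:   E_abs(n) = S_abs(n) / n
    relative speedup:      S_rel(n) = T(1) / T(n)
    relative efficiency:   E_rel(n) = S_rel(n) / n
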
Available test hardware:
- 8 core (16 thread) AMD Ryzen 7 3700X, 32 GiB DDR4 RAM.
- Various 4 core Intel Xeon servers. Precise specs not known yet.
- TODO: Check if my credentials on LCC2 still work. The cluster nodes AFAIK
have dual socket Xeons. Not sure if 8 cores per CPU or 8 in total?
For some compressors and workloads, tar2sqfs may be I/O bound rather than CPU
bound. The different machines have different storage which may impact the
result. Should this be taken into account for comparison or eliminated by
using a ramdisk or fiddling with the queue backlog?