Parallelizing SquashFS Data Packing
***********************************
0) Overview
***********
On a high level, data blocks are processed as follows:
The "block processor" has a simple begin/append/end interface for submitting
file data. Internally it chops the file data up into fixed size blocks that
are each [optionally] compressed and hashed. If the "end" function is called
and there is still left over data, a fragment is created.
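As an illustration of that interface, a minimal usage sketch is shown below. The
function names (blk_processor_begin/append/end) and the error handling are
simplified placeholders, not the exact libsquashfs API.

    /* Illustrative sketch of the begin/append/end submission interface.
     * Names and signatures are placeholders, not the real API. */
    #include <fcntl.h>
    #include <unistd.h>

    typedef struct blk_processor_t blk_processor_t;   /* opaque placeholder */

    static int submit_file(blk_processor_t *proc, const char *path)
    {
        char buffer[4096];
        ssize_t diff;
        int fd, ret;

        fd = open(path, O_RDONLY);
        if (fd < 0)
            return -1;

        ret = blk_processor_begin(proc);      /* start a new file */

        while (ret == 0 && (diff = read(fd, buffer, sizeof(buffer))) > 0) {
            /* internally chopped into fixed size blocks that are
               [optionally] compressed and hashed */
            ret = blk_processor_append(proc, buffer, (size_t)diff);
        }

        /* any left over data smaller than a block becomes a fragment */
        if (ret == 0)
            ret = blk_processor_end(proc);

        close(fd);
        return ret;
    }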
Fragments are only hashed. If another fragment exists with the same size and
hash, it is discarded and the existing fragment is referenced. Fragments are
collected in a fragment block that, once it overflows, is processed like a
normal block.
The final compressed & hashed data blocks & fragment blocks are passed on to
the "block writer".
The block writer simply writes blocks to the output file. Flags are used to
communicate what the first and last block of a file are. Entire files are
deduplicated by trying to find a sequence of identical size/hash pairs in
the already written blocks.
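The following is a minimal sketch of the size/hash based deduplication described
above (a single size/hash pair comparison of the same kind is used for
fragments). The structures and names are illustrative placeholders, not the
actual block writer implementation.

    /* Illustrative only: try to find an already written run of blocks whose
     * (size, hash) pairs match those of a newly submitted file. */
    #include <stddef.h>
    #include <stdint.h>

    typedef struct {
        uint32_t size;
        uint64_t hash;
    } blk_info_t;

    /* returns the start index of a matching run, or -1 if the file is unique */
    static long find_duplicate_run(const blk_info_t *written, size_t num_written,
                                   const blk_info_t *file, size_t num_file)
    {
        if (num_file == 0 || num_file > num_written)
            return -1;

        for (size_t i = 0; i + num_file <= num_written; ++i) {
            size_t j;

            for (j = 0; j < num_file; ++j) {
                if (written[i + j].size != file[j].size ||
                    written[i + j].hash != file[j].hash)
                    break;
            }

            if (j == num_file)
                return (long)i;   /* reuse the existing blocks */
        }

        return -1;   /* no match, the blocks have to be written out */
    }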
0.1) Implementation
The implementation of the block processor is in lib/sqfs/block_processor. The
file common.c contains the frontend for file data submission and common
functions for processing a single block, handling a completed block and
handling a completed fragment.
A reference serial implementation is provided in the file serial.c.
1) Thread Pool Based Block Processor
************************************
1.1) Goals and Challenges
In addition to the goal of boosting performance, the thread pool based block
processor must meet the following requirements:
- Output MUST be deterministic and reproducible. I.e. feeding byte-for-byte
the same input MUST ALWAYS produce byte-for-byte the same output.
- Blocks that belong to a single file MUST be written in the order that
they were submitted.
- Output MUST be byte-for-byte equivalent to the serial reference
implementation. Changing the serial reference implementation to
achieve this is OK.
- I/O cannot be done in worker threads. The underlying back-end must be
assumed to not be thread safe and may get very confused by suddenly running
in a different thread, even if only one thread at a time uses it.
1.2) The Current Approach
The current implementation is in winpthread.c (based on pthread or Windows
native threads, depending on what's available).
It keeps track of blocks in 3 different FIFO queues:
- A "work queue" that freshly submitted blocks go to. Worker threads take
blocks from this queue for processing.
- A "done queue" that worker threads submit blocks to, once completed.
- An "I/O queue" that contains blocks ready to be written to disk.
When the main thread submits a block, it gives it an incremental "processing"
sequence number and appends it to the "work queue". Thread pool workers take
the next available block from the queue, process it and add it to the "done"
queue, sorted by its processing sequence number.
The main thread dequeues blocks from the done queue sorted by their processing
sequence number, using a second counter to make sure blocks are dequeued in
the exact same order as they were added to the work queue.
Regular data blocks from the "done queue" are given an incremental I/O
sequence number and added to the "I/O queue", sorted by this number.
Fragments are deduplicated and consolidated into a fragment block. If this
block overflows, it is appended to the "work queue" the exact same way as
regular blocks, but it is **already given an I/O sequence number**.
If a block dequeued from the "done queue" turns out to be a fragment block, it
is added to the "I/O queue", sorted by its I/O sequence number **that it
already has**, i.e. no new sequence number is allocated.
The I/O queue is dequeued in the same fashion as the "done queue", using a
second counter to enforce ordering.
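The sequence number based ordering described above can be sketched as follows;
the data structures and names are illustrative and do not mirror winpthread.c,
and locking is omitted for brevity.

    /* Illustrative only: enqueue sorted by sequence number, dequeue strictly
     * in submission order. */
    #include <stddef.h>
    #include <stdint.h>

    typedef struct block_t {
        struct block_t *next;
        uint32_t seq_num;          /* processing or I/O sequence number */
        /* ... payload ... */
    } block_t;

    typedef struct {
        block_t *head;
        uint32_t next_dequeue;     /* sequence number expected next */
    } ordered_queue_t;

    static void queue_insert_sorted(ordered_queue_t *q, block_t *blk)
    {
        block_t **it = &q->head;

        while (*it != NULL && (*it)->seq_num < blk->seq_num)
            it = &(*it)->next;

        blk->next = *it;
        *it = blk;
    }

    static block_t *queue_try_dequeue(ordered_queue_t *q)
    {
        block_t *blk = q->head;

        /* only hand out the block if it is the exact next one in order */
        if (blk == NULL || blk->seq_num != q->next_dequeue)
            return NULL;

        q->head = blk->next;
        q->next_dequeue += 1;
        return blk;
    }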
The actual implementation interleaves enqueueing and dequeueing in the block
submission function. It dequeues blocks if the queues reach a pre-set maximum
backlog. In that case, it tries to dequeue from the I/O queue first and if
that fails, tries to dequeue from the "done queue". If that also fails, it
uses signal/await to be woken up by a worker thread once it adds a block to
the "done queue". Fragment post-processing and re-queueing of blocks is done
inside the critical region, but the actual I/O is obviously done outside.
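Below is a rough sketch of how this interleaved dequeueing in the submission
path might look, reusing the queue helper from the previous sketch.
worker_pool_t, postprocess_done_block and write_block_to_disk are hypothetical
placeholders, not names from the actual code.

    /* Illustrative only: drain the queues when the backlog limit is hit. */
    #include <pthread.h>

    static void drain_backlog(worker_pool_t *pool)
    {
        block_t *blk;

        pthread_mutex_lock(&pool->mtx);

        while (pool->backlog >= pool->max_backlog) {
            /* prefer blocks that are already ready for I/O */
            blk = queue_try_dequeue(&pool->io_queue);
            if (blk != NULL) {
                pool->backlog -= 1;
                pthread_mutex_unlock(&pool->mtx);
                write_block_to_disk(blk);   /* I/O outside the lock */
                pthread_mutex_lock(&pool->mtx);
                continue;
            }

            /* otherwise post-process a completed block: assign an I/O
               sequence number, consolidate fragments, possibly re-queue a
               full fragment block - all inside the critical region */
            blk = queue_try_dequeue(&pool->done_queue);
            if (blk != NULL) {
                postprocess_done_block(pool, blk);
                continue;
            }

            /* nothing ready yet, wait for a worker to finish a block */
            pthread_cond_wait(&pool->done_cond, &pool->mtx);
        }

        pthread_mutex_unlock(&pool->mtx);
    }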
Profiling on small filesystems using perf shows that the outlined approach
seems to perform quite well for CPU bound compressors like XZ, but doesn't
add a lot for I/O bound compressors like zstd. Actual measurements still
need to be done.
If you have a better idea how to do this, please let me know.
2) Benchmarks
*************
2.1) How was the Benchmark Performed?
An optimized build of squashfs-tools-ng was compiled and installed to a tmpfs:
$ mkdir /dev/shm/temp
$ ln -s /dev/shm/temp out
$ ./autogen.sh
$ ./configure CFLAGS="-O3 -Ofast -march=native -mtune=native" \
              LDFLAGS="-O3 -Ofast" --prefix=$(pwd)/out
$ make -j install
$ cd out
A SquashFS image to be tested was unpacked in this directory:
$ ./bin/sqfs2tar <IMAGE> > test.tar
And then repacked as follows:
$ time ./bin/tar2sqfs -j <NUM_CPU> -c <COMPRESSOR> -f test.sqfs < test.tar
Out of 4 runs, the worst wall-clock time ("real") was used for comparison.
For the serial reference version, configure was re-run with the option
--without-pthread, the tools re-compiled and re-installed.
2.2) What Image was Tested?
A Debian image extracted from the Debian 10.2 LiveDVD for AMD64 with XFCE
was used.
The input size and resulting output sizes turned out to be as follows:
- As uncompressed tarball: ~6.5GiB (7,008,118,272)
- As LZ4 compressed SquashFS image: ~3.1GiB (3,381,751,808)
- As LZO compressed SquashFS image: ~2.5GiB (2,732,015,616)
- As zstd compressed SquashFS image: ~2.4GiB (2,536,910,848)
- As gzip compressed SquashFS image: ~2.3GiB (2,471,276,544)
- As lzma compressed SquashFS image: ~2.0GiB (2,102,169,600)
- As XZ compressed SquashFS image: ~2.0GiB (2,098,466,816)
The Debian image is expected to contain realistic input data for a Linux
file system and also provide enough data for an interesting benchmark.
2.3) What Test System was used?
AMD Ryzen 7 3700X
32GiB DDR4 RAM
Fedora 31 with Linux 5.4.17
2.4) Results
The raw timing results are as follows:
  Jobs    XZ          lzma        gzip        LZO         LZ4      zstd
  serial  17m59.413s  16m08.868s  10m02.632s  13m17.956s  18.218s  35.280s
  1       18m01.695s  16m02.329s   9m57.334s  13m14.374s  16.727s  34.108s
  2        9m34.939s   8m32.806s   5m12.791s   6m56.017s  13.161s  21.696s
  3        6m37.701s   5m55.246s   3m35.409s   4m50.138s  12.798s  18.265s
  4        5m07.896s   4m34.419s   2m47.108s   3m43.153s  13.191s  16.885s
  5        4m11.593s   3m44.764s   2m17.371s   3m02.429s  14.251s  17.389s
  6        3m34.115s   3m12.032s   1m57.972s   2m35.601s  14.824s  17.023s
  7        3m07.806s   2m47.815s   1m44.661s   2m16.289s  15.643s  17.676s
  8        2m47.589s   2m30.433s   1m33.865s   2m01.389s  16.262s  17.524s
  9        2m38.737s   2m22.159s   1m27.477s   1m53.976s  16.887s  18.110s
  10       2m30.942s   2m14.427s   1m22.424s   1m47.411s  17.316s  18.497s
  11       2m23.512s   2m08.470s   1m17.419s   1m41.965s  17.759s  18.831s
  12       2m17.083s   2m02.814s   1m13.644s   1m36.742s  18.335s  19.082s
  13       2m11.450s   1m57.820s   1m10.310s   1m32.492s  18.827s  19.232s
  14       2m06.525s   1m53.951s   1m07.483s   1m28.779s  19.471s  20.070s
  15       2m02.338s   1m50.358s   1m04.954s   1m25.993s  19.772s  20.608s
  16       1m58.566s   1m47.371s   1m03.616s   1m23.241s  20.188s  21.779s
The file "benchmark.ods" contains those values, values derived from this and
charts depicting the results.
2.5) Discussion
Most obviously, the results indicate that LZ4 and zstd compression are clearly
I/O bound and not CPU bound. They don't benefit from parallelization beyond
2-4 worker threads and even that benefit is marginal with efficiency
plummeting immediately.
The other compressors (XZ, lzma, gzip, LZO) are clearly CPU bound. Speedup
increases roughly linearly up to about 8 cores, but with a slope k < 1, i.e.
efficiency decreases, dropping to roughly 80% at 8 cores.
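(As an example of how these figures are derived: for XZ, the serial version
takes 17m59.413s ~ 1079.4s and the 8 job version 2m47.589s ~ 167.6s, i.e. a
speedup of 1079.4 / 167.6 ~ 6.4 and thus an efficiency of 6.4 / 8 ~ 80%.)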
A reason for this sub-linear scaling may be the choke point introduced by the
creation of fragment blocks, which *requires* synchronization. To test this
theory, a second benchmark should be performed with fragment block generation
completely disabled. This requires a new flag to be added to tar2sqfs (and
also gensquashfs).
Using more than 8 jobs yields a much smaller additional speedup, and efficiency
declines even faster. This is probably due to the fact that the test system
only has 8 physical cores and beyond that, SMT has to be used.
It should also be noted that the thread pool compressor with only a single
thread turns out to be *slightly* faster than the serial reference
implementation. A possible explanation for this might be that the fragment
blocks are actually assembled in the main thread, in parallel with the workers,
which can continue processing other data blocks. Because of this decoupling
there is in fact some degree of parallelism, even if only one worker thread
is used.
As a side effect, this benchmark also provides some insight into the
compression ratio and throughput of the supported compressors: for the Debian
live image, XZ clearly provides the highest data density, while LZ4 is clearly
the fastest compressor, directly followed by zstd, which achieves a much better
compression ratio than LZ4 (comparable to gzip) while being almost 50 times
faster. The throughput of the zstd compressor is truly impressive, considering
the compression ratio it achieves.
Repeating the benchmark without tail-end-packing and with fragments completely
disabled would also show the effectiveness of tail-end-packing and fragment
packing as a side effect.