0) Test Setup
*************
The tests were performed on a system with the following specifications:
AMD Ryzen 7 3700X
32GiB DDR4 RAM
Fedora 33
The following versions of GCC and Linux were used:
gcc (GCC) 10.2.1 20201125 (Red Hat 10.2.1-9)
Linux 5.11.9-200.fc33.x86_64
The following squashfs-tools-ng commit was tested:
7d2b3b077d7e204e64a1c57845524250c5b4a142
An optimized build of squashfs-tools-ng was compiled and installed to a tmpfs:
$ mkdir /dev/shm/temp
$ ln -s /dev/shm/temp out
$ ./autogen.sh
$ ./configure CFLAGS="-O3 -Ofast -march=native -mtune=native" \
LDFLAGS="-O3 -Ofast" --prefix=$(pwd)/out
$ make -j install-strip
$ cd out
Working in a tmpfs was done to eliminate the influence of I/O performance and
I/O caching side effects to the extent possible, so that only the actual
processing time is measured.
For all benchmark tests, a Debian image extracted from the Debian 10.2 LiveDVD
for AMD64 with XFCE was used.
The Debian image is expected to contain realistic input data for a Linux
file system and also provide enough data for an interesting benchmark.
For all performed benchmarks, graphical representations of the results and
derived values can be seen in "benchmark.ods".
1) Parallel Compression Benchmark
*********************************
1.1) What was measured?
The Debian image was first converted to a tarball:
$ ./bin/sqfs2tar debian.sqfs > test.tar
The tarball was then repacked and time was measured as follows:
$ time -p ./bin/tar2sqfs -j <NUM_CPU> -c <COMPRESSOR> -f test.sqfs < test.tar
The repacking was repeated 4 times and the worst wall-clock time ("real") was
used for comparison.
The <NUM_CPU> value was varied from 1 to 16 and for <COMPRESSOR>, all available
compressors were used. All possible combinations of <NUM_CPU> and <COMPRESSOR>
were measured.
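For reference, the full sweep could be scripted roughly as follows. This is
only an illustrative sketch, not the script that was actually used: the
compressor names should match the identifiers accepted by tar2sqfs -c, and the
per-combination log files are made up for the example.
  $ for comp in xz lzma gzip lzo lz4 zstd; do
        for jobs in $(seq 1 16); do
            for run in 1 2 3 4; do
                # the "real" line printed by time -p ends up on stderr
                { time -p ./bin/tar2sqfs -j $jobs -c $comp -f test.sqfs \
                      < test.tar; } 2>> time-$comp-$jobs.log
            done
        done
    done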
In addition, a serial reference version was compiled by running configure
with the additional option --without-pthread and re-running the tests for
all compressors without the <NUM_CPU> option.
In addition to the existing compressors, the LZO compressor in libcommon.a was
briefly patched to not perform any compression at all. This way, a baseline
comparison was established for a completely uncompressed SquashFS image.
1.2) What was computed from the results?
The relative and absolute speedup were determined as follows:

                                            runtime_parallel(compressor, 1)
  speedup_rel(compressor, num_cpu) = ---------------------------------------
                                      runtime_parallel(compressor, num_cpu)

                                              runtime_serial(compressor)
  speedup_abs(compressor, num_cpu) = ---------------------------------------
                                      runtime_parallel(compressor, num_cpu)
In addition, relative and absolute efficiency of the parallel implementation
were determined:

                                          speedup_rel(compressor, num_cpu)
  efficiency_rel(compressor, num_cpu) = ----------------------------------
                                                     num_cpu

                                          speedup_abs(compressor, num_cpu)
  efficiency_abs(compressor, num_cpu) = ----------------------------------
                                                     num_cpu
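As a quick worked example (not part of the original methodology), the derived
values for zstd with 8 jobs can be recomputed from the timings in the results
table below:
  $ awk 'BEGIN {
        t1 = 550.37; t8 = 88.22; ts = 550.59   # zstd: 1 job, 8 jobs, serial
        printf "speedup_rel=%.2f speedup_abs=%.2f efficiency_abs=%.2f\n",
               t1 / t8, ts / t8, (ts / t8) / 8
    }'
  speedup_rel=6.24 speedup_abs=6.24 efficiency_abs=0.78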
Furthermore, although not relevant for this specific benchmark, having the
converted tarballs available, the maximum throughput of each compressor was
computed as follows:

                                      size(tarball)
  max_throughput(compressor) = --------------------------
                                min(runtime(compressor))
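As a worked example, taking the tarball size from the size benchmark in
section 3 (7,008,118,272 bytes) and the fastest zstd run from the results
below (62.52s with 16 jobs):
  $ awk 'BEGIN { printf "%.0f MB/s\n", 7008118272 / 62.52 / 1000000 }'
  112 MB/s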
1.3) Results
The raw timing results are as follows:
Jobs XZ lzma gzip LZO LZ4 zstd none
serial 1108.39s 995.43s 609.79s 753.14s 13.58s 550.59s 5.86s
1 1116.06s 990.33s 598.85s 753.53s 11.25s 550.37s 4.23s
2 591.21s 536.61s 312.14s 394.21s 6.41s 294.12s 4.13s
3 415.90s 370.48s 215.92s 273.14s 4.84s 205.14s 4.58s
4 320.02s 288.35s 165.50s 210.32s 4.29s 159.71s 4.62s
5 263.94s 235.69s 136.28s 172.33s 4.19s 132.27s 4.94s
6 224.23s 200.63s 116.44s 146.80s 4.28s 112.79s 5.08s
7 196.78s 176.35s 100.66s 128.61s 4.24s 99.26s 5.43s
8 175.04s 157.82s 89.79s 113.47s 4.46s 88.22s 5.68s
9 166.52s 148.88s 83.01s 106.14s 4.64s 84.97s 5.76s
10 159.35s 141.08s 77.04s 99.92s 4.84s 81.61s 5.94s
11 151.08s 136.27s 71.52s 94.23s 5.00s 77.51s 6.14s
12 144.72s 128.91s 67.21s 89.33s 5.28s 74.10s 6.39s
13 137.91s 122.67s 63.43s 84.39s 5.41s 71.83s 6.51s
14 132.94s 117.79s 59.45s 80.87s 5.71s 68.86s 6.68s
15 126.76s 113.51s 56.37s 76.68s 5.74s 65.78s 6.91s
16 119.06s 107.15s 52.56s 71.49s 6.37s 62.52s 7.10s
1.4) Discussion
Most obviously, the results indicate that LZ4, unlike the other compressors,
is clearly I/O bound rather than CPU bound and doesn't benefit from
parallelization beyond 2-4 worker threads; even that benefit is marginal,
with efficiency plummeting immediately.
The other compressors are clearly CPU bound. Speedup increases linearly up to
about 8 cores, but with a slope < 1, as evidenced by the efficiency decreasing
linearly and reaching 80% at 8 cores.
A reason for this sub-linear scaling may be the choke point introduced by the
creation of fragment blocks, which *requires* synchronization. To test this
theory, a second benchmark should be performed with fragment block generation
completely disabled. This requires a new flag to be added to tar2sqfs (and
also gensquashfs).
Using more than 8 jobs yields a much slower increase in speedup, and efficiency
declines even faster. This is probably because the test system has only
8 physical cores; beyond that, SMT has to be used.
It should also be noted that for most of the compressors, as well as for the
uncompressed version, the thread pool compressor with only a single worker
turns out to be *slightly* faster than the serial reference implementation.
A possible explanation is that the fragment blocks are assembled in the main
thread, in parallel to the worker thread, which can continue processing other
data blocks. Because of this decoupling there is in fact some degree of
parallelism, even if only one worker thread is used. For the uncompressed
version, the work still done in the thread pool is the hashing of blocks and
fragments for de-duplication.
Also of interest are the changes from the previous version of the benchmark,
performed on v0.9 of squashfs-tools-ng. Since then, the thread pool design has
been overhauled to spend a lot less time in the critical regions, but to also
perform byte-for-byte equivalence checks before considering blocks or fragments
to be identical. This may require a read-back and decompression step in the
main thread in order to access already written fragment blocks.
While the overall behavior has stayed the same, performance for XZ & LZMA has
decreased slightly, whereas performance for gzip, LZ4 & zstd has improved
slightly. As the decompression benchmark shows, the first two are a lot slower
at decompression, which has to be done when reading back a fragment block from
disk, and due to their higher data density they also have a higher chance of
actually having to decompress a block. As a net result, the performance
penalty from exact fragment matching eats up the gains from the new thread
pool design. For the more I/O bound compressors like LZ4 & zstd, decompressing
a block is much faster and, due to the low data density of LZ4, the chance of
actually having to decompress a block is lower. As a result, the gains from
the new thread pool design apparently outweigh the read-back penalty.
Also noteworthy, due to the inclusion of an uncompressed reference, is that
the LZ4 compressor is actually very close in performance to the uncompressed
version, in some cases even outperforming it. This might be due to the fact
that LZ4 actually does compress blocks, so in many cases where the
uncompressed version needs to read back a full block during deduplication,
the LZ4 version only needs to read a considerably smaller amount of data,
reducing the penalty of having to read back blocks.
2) Reference Decompression Benchmark
************************************
2.1) What was measured?
A SquashFS image was generated for each supported compressor:
$ ./bin/sqfs2tar debian.sqfs | ./bin/tar2sqfs -c <COMPRESSOR> test.sqfs
And then, for each compressor, the unpacking time was measured:
$ time -p ./bin/sqfs2tar test.sqfs > /dev/null
The unpacking step was repeated 4 times and the worst wall-clock time ("real")
was used for comparison.
2.2) What was computed from the results?
The throughput was established by dividing the size of the resulting tarball by
the time taken to produce it from the image.
For better comparison, this was also normalized to the throughput of the
uncompressed SquashFS image.
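As a worked example (the variable names are only for illustration), using the
tarball size of 7,008,118,272 bytes from section 3 together with the zstd and
uncompressed timings from the results below:
  $ awk 'BEGIN {
        size = 7008118272; t_zstd = 7.74; t_none = 1.42
        printf "zstd: %.0f MB/s, normalized: %.2f\n",
               size / t_zstd / 1000000, t_none / t_zstd
    }'
  zstd: 905 MB/s, normalized: 0.18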
2.3) Results
xz 120.53s
lzma 118.91s
gzip 20.57s
lzo 10.65s
zstd 7.74s
lz4 2.59s
uncompressed 1.42s
2.4) Discussion
From the measurements, it becomes obvious that LZ4 and zstd are the two fastest
decompressors, both being very close to the uncompressed version. Zstd is
particularly noteworthy here, because it is not far behind LZ4 in speed, but
also achieves a substantially better compression ratio that sits between gzip
and lzma. LZ4, despite being the fastest at decompression and beating the
others in compression speed by orders of magnitude, has by far the worst
compression ratio.
It should be noted that the number of actually compressed blocks has not been
determined. A worse compression ratio can lead to more blocks being stored
uncompressed, reducing the workload and thus affecting decompression time.
However, since zstd has a better compression ratio than gzip, takes less than
40% of the time to decompress, and is also slightly faster in the serial
compression benchmark, we can safely say that in this benchmark, zstd beats
gzip by every metric.
Furthermore, while XZ stands out as the compressor with the best compression
ratio, zstd takes only ~6% of the time to decompress the entire image, while
the resulting image is only ~9% bigger than the XZ one. Shaving off those ~9%
is definitely significant, especially considering that in absolute numbers the
difference is close to 200MB, but it clearly comes at a substantial
performance cost.
Also interesting are the results for the LZO compressor. Its compression speed
is between gzip and LZMA, it decompresses in roughly half the time gzip needs,
and is only a little slower than zstd. However, its compression ratio is the
second worst, only ahead of LZ4, which beats it by a factor of ~4 in
decompression speed and by a factor of ~55 in compression speed.
In conclusion, for applications where a good compression ratio is most
important, XZ is obviously the best choice, but if speed is favored, zstd is
probably a very good option to go with. LZ4 is much faster, but has a much
worse compression ratio. It is probably best suited as transparent compression
for a read/write file system or for network protocols.
Finally, it should be noted that this serial decompression benchmark is not
representative of a real-life workload where only a small set of files is
accessed in a random-access fashion. In that case, a caching layer can largely
mitigate the decompression cost, turning it into an initial or only
occasionally occurring cache miss latency. This benchmark should nevertheless
give an approximate idea of how those cache miss latencies compare between the
different compressors.
3) Compression Size and Overhead Benchmark
******************************************
3.1) What was measured?
For each compressor, a SquashFS image was created in the way outlined in the
parallel compression benchmark and the resulting file size was recorded.
In addition, the raw tarball size was recorded for comparison.
3.2) What was computed from the results?
The compression ratio was established as follows:
                          size(compressor)
  ratio(compressor) = --------------------
                       size(uncompressed)
3.3) Results
                 SquashFS                    tar
  Uncompressed   ~6.1GiB (6,542,389,248)     ~6.5GiB (7,008,118,272)
  LZ4            ~3.1GiB (3,381,751,808)
  LZO            ~2.5GiB (2,732,015,616)
  gzip           ~2.3GiB (2,471,276,544)
  zstd           ~2.1GiB (2,295,078,912)
  lzma           ~2.0GiB (2,102,169,600)
  XZ             ~2.0GiB (2,098,466,816)
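As a quick cross-check of the figures above, the ratios for XZ and zstd, and
the relative size difference between the two, work out to:
  $ awk 'BEGIN {
        none = 6542389248; xz = 2098466816; zstd = 2295078912
        printf "xz=%.3f zstd=%.3f zstd/xz=%.3f\n",
               xz / none, zstd / none, zstd / xz
    }'
  xz=0.321 zstd=0.351 zstd/xz=1.094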
3.4) Discussion
Obviously XZ and lzma achieve the highest data density, shrinking the SquashFS
image down to less than a third of the input size.
Also noteworthy is that zstd achieves a higher data density than gzip, while
being faster at both compression and decompression.
Interestingly, even the uncompressed SquashFS image is still smaller than the
uncompressed tarball. Obviously SquashFS packs data and meta data more
efficiently than the tar format, shaving off ~7% in size.