Parallelizing SquashFS Data Packing
***********************************
0) Overview
***********
At a high level, data blocks are processed as follows:
The "block processor" has a simple begin/append/end interface for submitting
file data. Internally, it chops the file data up into fixed-size blocks that
are each [optionally] compressed and hashed. If the "end" function is called
and there is still leftover data, a fragment is created.
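
The following is a minimal, self-contained sketch of what such a
begin/append/end frontend boils down to. All names, the block size and the
stand-in submit functions are made up for illustration and do not reflect the
actual implementation in lib/sqfs/block_processor:

    #include <stddef.h>
    #include <string.h>
    #include <stdio.h>

    #define BLOCK_SIZE 4096     /* example value, not the real default */

    typedef struct {
        unsigned char buffer[BLOCK_SIZE];
        size_t used;            /* number of bytes currently buffered */
    } file_state;

    /* stand-in for passing a complete block on to compression/hashing */
    static void submit_block(const unsigned char *data, size_t size)
    {
        printf("block:    %zu bytes\n", size);
    }

    /* stand-in for adding leftover data to the current fragment block */
    static void submit_fragment(const unsigned char *data, size_t size)
    {
        printf("fragment: %zu bytes\n", size);
    }

    static void file_begin(file_state *f)
    {
        f->used = 0;
    }

    /* chop appended data into fixed-size blocks, buffer the remainder */
    static void file_append(file_state *f, const unsigned char *data,
                            size_t size)
    {
        while (size > 0) {
            size_t diff = BLOCK_SIZE - f->used;

            if (diff > size)
                diff = size;

            memcpy(f->buffer + f->used, data, diff);
            f->used += diff;
            data += diff;
            size -= diff;

            if (f->used == BLOCK_SIZE) {
                submit_block(f->buffer, BLOCK_SIZE);
                f->used = 0;
            }
        }
    }

    /* whatever does not fill a whole block at the end becomes a fragment */
    static void file_end(file_state *f)
    {
        if (f->used > 0)
            submit_fragment(f->buffer, f->used);
    }

    int main(void)
    {
        static const unsigned char payload[10000];
        file_state f;

        file_begin(&f);
        file_append(&f, payload, sizeof(payload)); /* emits two 4096 byte blocks */
        file_end(&f);                              /* emits a 1808 byte fragment */
        return 0;
    }
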
Fragments are only hashed. If another fragment with the same size and hash
already exists, the new fragment is discarded and the existing one is
referenced instead. Fragments are
collected in a fragment block that, once it overflows, is processed like a
normal block.
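
The deduplication amounts to a lookup by (size, hash) pair. A rough sketch,
using a simple linear search and made-up structure names (the real code
presumably organizes this differently):

    #include <stddef.h>
    #include <stdint.h>

    typedef struct {
        uint64_t hash;     /* hash of the fragment data */
        uint32_t size;     /* size of the fragment in bytes */
        uint32_t index;    /* where the existing fragment is stored */
    } fragment_info;

    /* Return the index of an existing fragment with identical size and
     * hash, or -1 if the new fragment has to be kept. */
    static long lookup_fragment(const fragment_info *table, size_t count,
                                uint64_t hash, uint32_t size)
    {
        for (size_t i = 0; i < count; ++i) {
            if (table[i].hash == hash && table[i].size == size)
                return (long)table[i].index;
        }
        return -1;
    }

    int main(void)
    {
        fragment_info table[] = { { 0xabcdu, 100, 0 }, { 0x1234u, 50, 1 } };

        /* identical size & hash -> reuse fragment 1 instead of storing */
        return lookup_fragment(table, 2, 0x1234u, 50) == 1 ? 0 : 1;
    }
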
The final compressed & hashed data blocks & fragment blocks are passed on to
the "block writer".
The block writer simply writes blocks to the output file. Flags are used to
mark the first and the last block of a file. Entire files are
deduplicated by trying to find a sequence of identical size/hash pairs in
the already written blocks.
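
A rough sketch of that search, with illustrative types that do not mirror the
actual block writer internals; a file is described by the (size, hash) pairs
of its blocks and compared against the pairs recorded for already written
blocks:

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    typedef struct {
        uint64_t hash;   /* hash of the block data */
        uint32_t size;   /* on-disk size of the block */
    } block_desc;

    /* Returns the start index of a run of `count` blocks inside `written`
     * (length `num_written`) that matches `file`, or -1 if there is none. */
    static long find_equal_run(const block_desc *written, size_t num_written,
                               const block_desc *file, size_t count)
    {
        if (count == 0 || count > num_written)
            return -1;

        for (size_t start = 0; start + count <= num_written; ++start) {
            size_t i;

            for (i = 0; i < count; ++i) {
                if (written[start + i].hash != file[i].hash ||
                    written[start + i].size != file[i].size)
                    break;
            }
            if (i == count)
                return (long)start;  /* reuse blocks [start, start+count) */
        }
        return -1;
    }

    int main(void)
    {
        block_desc written[] = { {1,10}, {2,20}, {3,30}, {4,40} };
        block_desc file[]    = { {2,20}, {3,30} };

        printf("match at %ld\n", find_equal_run(written, 4, file, 2)); /* 1 */
        return 0;
    }
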
0.1) Implementation
The implementation of the block processor is in lib/sqfs/block_processor. The
file common.c contains the frontend for file data submission and common
functions for processing a single block, handling a completed block and
handling a completed fragment.
A reference serial implementation is provided in the file serial.c.
1) Thread Pool Based Block Processor
************************************
The main challenge in parallelizing the block processor lies in the fact that
the output HAS TO BE byte-for-byte equivalent to the serial reference
implementation.
This means:
- File blocks have to be written in the exact same order as they
are submitted.
- If storing a fragment overflows the fragment block, the resulting
fragment block has to be written next, before any further file data.
The current implementation in winpthread.c (based on pthread or Windows
native threads, depending on what's available) uses the following approach:
- Each submitted data block or fragment gets an incremental sequence number
and is appended to a FIFO queue.
- Multiple threads consume blocks from the queue and use the function
from common.c to process the dequeued blocks.
- Completed blocks are inserted into a "done" queue, sorted by their
sequence number.
- The main thread that submits blocks also dequeues the completed ones,
in order of their sequence number.
- Data blocks are given an "I/O sequence number" and inserted into
an I/O queue in order of this number.
- For fragments, it calls the respective function from common.c
to deduplicate them and append them to the current fragment block.
- If the current fragment block overflows, it is given the next free
**I/O sequence number** and inserted into the **processing queue**.
- Completed fragment blocks are inserted into the I/O queue with
regard to the I/O sequence number they already have.
- The main thread also dequeues I/O blocks ordered by their I/O sequence
number, and passes them to the respective common.c function that eventually
forwards them to the block writer.
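
The part that keeps the output deterministic is the ordered dequeueing. Below
is a thread-free sketch of that mechanism with made-up structure names:
completed blocks are inserted into the "done" list sorted by their sequence
number, and a block is only taken off the list if it carries the next expected
number:

    #include <stddef.h>
    #include <stdio.h>

    typedef struct block_t {
        struct block_t *next;
        unsigned int sequence_number;
        /* payload, flags, etc. omitted */
    } block_t;

    /* insert into a singly linked list, keeping it sorted by sequence number */
    static void queue_insert_sorted(block_t **head, block_t *blk)
    {
        while (*head != NULL && (*head)->sequence_number < blk->sequence_number)
            head = &(*head)->next;

        blk->next = *head;
        *head = blk;
    }

    /* remove and return the head, but only if it carries the expected
     * sequence number; otherwise we have to keep waiting for the workers */
    static block_t *queue_pop_if_next(block_t **head, unsigned int expected)
    {
        block_t *blk = *head;

        if (blk == NULL || blk->sequence_number != expected)
            return NULL;

        *head = blk->next;
        blk->next = NULL;
        return blk;
    }

    int main(void)
    {
        block_t blocks[3] = { { NULL, 2 }, { NULL, 0 }, { NULL, 1 } };
        block_t *done = NULL;
        unsigned int next = 0;

        /* workers finish out of order... */
        for (int i = 0; i < 3; ++i)
            queue_insert_sorted(&done, &blocks[i]);

        /* ...but the main thread drains them strictly in order 0, 1, 2 */
        for (block_t *b; (b = queue_pop_if_next(&done, next)) != NULL; ++next)
            printf("dequeued block %u\n", b->sequence_number);

        return 0;
    }
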
To make sure the queue doesn't fill all RAM, submitted blocks are counted. The
counter is also incremented when submitting to the I/O queue. The counter is
decremented when dequeueing completed blocks or blocks from the I/O queue.
The block submission function tries to add blocks only if the counter is below
a pre-set threshold. If it is above, it first tries to dequeue an I/O block, and
if there are no I/O blocks, dequeue and handle "done" blocks. If the "done"
queue is also empty, it uses signal/await to wait for the worker threads to
complete some blocks to process. The main thread in turn uses signal/await to
tell the worker threads that it has submitted a block (they wait if the work
queue is empty).
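
Roughly, the control flow in the submission path looks like the sketch below.
The helpers are stubs standing in for the real queue and condition variable
operations; only the handling of the backlog counter and the threshold is of
interest here, and MAX_BACKLOG is an arbitrary example value:

    #include <stddef.h>
    #include <stdbool.h>
    #include <stdio.h>

    #define MAX_BACKLOG 10   /* example threshold, not the real default */

    static size_t backlog = 0;

    /* stubs standing in for the real queues / condition variable */
    static bool try_dequeue_io_block(void)   { return false; }
    static bool try_dequeue_done_block(void) { return false; }
    static void wait_for_worker_signal(void) { /* e.g. pthread_cond_wait */ }
    static void submit_to_workers(void)      { /* append + signal workers */ }

    static void submit_block(void)
    {
        /* drain completed work until there is room in the backlog */
        while (backlog >= MAX_BACKLOG) {
            if (try_dequeue_io_block()) {
                backlog -= 1;        /* a block left the I/O queue */
            } else if (try_dequeue_done_block()) {
                backlog -= 1;        /* a block left the "done" queue */
            } else {
                wait_for_worker_signal();  /* workers will finish something */
            }
        }

        submit_to_workers();
        backlog += 1;                /* counted until it is dequeued again */
    }

    int main(void)
    {
        for (int i = 0; i < 5; ++i)
            submit_block();

        printf("backlog after 5 submissions: %zu\n", backlog);
        return 0;
    }
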
Profiling on small filesystems using perf shows that the outlined approach
seems to perform quite well for CPU bound compressors like XZ, but doesn't
add a lot for I/O bound compressors like zstd. Actual measurements still
need to be done.
If you have more insight or a better idea, please let me know.
2) Benchmarks
*************
TODO: benchmarks with the following images:
- Debian live iso (2G)
- Arch Linux live iso (~550M)
- Raspberry Pi 3 QT demo image (~390M)
sqfs2tar $IMAGE | tar2sqfs -j $NUM_CPU -f out.sqfs
Values to measure:
- Total wall clock time of tar2sqfs.
- Throughput (bytes read / time, bytes written / time).
Try the above for different compressors and stuff everything into
a huge spreadsheet. Then, determine the following and plot some
nice graphs:
- Absolute speedup (normalized to serial implementation).
- Absolute efficiency (= speedup / $NUM_CPU)
- Relative speedup (normalized to thread pool with -j 1).
- Relative efficiency (= relative speedup / $NUM_CPU)
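
Spelled out, with T_serial denoting the wall clock time of the serial
implementation and T(n) that of the thread pool run with -j n (notation only
used here):

    S_abs(n) = T_serial / T(n)        E_abs(n) = S_abs(n) / n
    S_rel(n) = T(1) / T(n)            E_rel(n) = S_rel(n) / n
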
Available test hardware:
- 8(16) core AMD Ryzen 7 3700X, 32GiB DDR4 RAM.
- Various 4 core Intel Xeon servers. Precise specs are not known yet.
- TODO: Check if my credentials on LCC2 still work. The cluster nodes AFAIK
have dual socket Xeons. Not sure if 8 cores per CPU or 8 in total?
For some compressors and workloads, tar2sqfs may be I/O bound rather than CPU
bound. The different machines have different storage which may impact the
result. Should this be taken into account for comparison or eliminated by
using a ramdisk or fiddling with the queue backlog?