diff options
Diffstat (limited to 'doc')
-rw-r--r-- | doc/benchmark.ods | bin | 58458 -> 103733 bytes | |||
-rw-r--r-- | doc/benchmark.txt | 257 | ||||
-rw-r--r-- | doc/format.adoc | 1067 | ||||
-rw-r--r-- | doc/format.txt | 1216 |
4 files changed, 1221 insertions, 1319 deletions
diff --git a/doc/benchmark.ods b/doc/benchmark.ods Binary files differindex 167d323..2ffd0f9 100644 --- a/doc/benchmark.ods +++ b/doc/benchmark.ods diff --git a/doc/benchmark.txt b/doc/benchmark.txt index 407cb26..841407a 100644 --- a/doc/benchmark.txt +++ b/doc/benchmark.txt @@ -6,7 +6,16 @@ AMD Ryzen 7 3700X 32GiB DDR4 RAM - Fedora 32 + Fedora 33 + + The following gcc versions of GCC and Linux were used: + + gcc (GCC) 10.2.1 20201125 (Red Hat 10.2.1-9) + Linux 5.11.9-200.fc33.x86_64 + + The following squashfs-tools-ng commit was tested: + + 7d2b3b077d7e204e64a1c57845524250c5b4a142 An optimized build of squashfs-tools-ng was compiled and installed to a tmpfs: @@ -16,13 +25,13 @@ $ ./autogen.sh $ ./configure CFLAGS="-O3 -Ofast -march=native -mtune=native" \ LDFLAGS="-O3 -Ofast" --prefix=$(pwd)/out - $ make -j install + $ make -j install-strip $ cd out - This was done to eliminate any influence of I/O performance and I/O caching - side effects to the extend possible and only measure the actual processing - time. + Working in a tmpfs was done to eliminate any influence of I/O performance and + I/O caching side effects to the extend possible and only measure the actual + processing time. For all benchmark tests, a Debian image extracted from the Debian 10.2 LiveDVD @@ -47,21 +56,12 @@ The tarball was then repacked and time was measured as follows: - $ time ./bin/tar2sqfs -j <NUM_CPU> -c <COMPRESSOR> -f test.sqfs < test.tar + $ time -p ./bin/tar2sqfs -j <NUM_CPU> -c <COMPRESSOR> -f test.sqfs < test.tar The repacking was repeated 4 times and the worst wall-clock time ("real") was used for comparison. - Altough not relevant for this benchmark, the resulting image sizes were - measured once for each compressor, so that the compression ratio could - be estimated: - - $ stat test.tar - $ stat test.sqfs - - - The <NUM_CPU> was varied from 1 to 16 and for <COMPRESSOR>, all available compressors were used. All possible combinations <NUM_CPU> and <COMPRESSOR> were measured. @@ -71,6 +71,11 @@ all compressors without the <NUM_CPU> option. + In addition to the existing compressors, the LZO compressor in libcommon.a was + briefly patched to not perform any compression at all. This way, a baseline + comparison was established for a completely uncompressed SquashFS image. + + 1.2) What was computed from the results? The relative and absolute speedup were determined as follows: @@ -84,7 +89,7 @@ runtime_serial(compressor) - In addition, relative and absolute efficiency of the parellel implementation + In addition, relative and absolute efficiency of the parallel implementation were determined: speedup_rel(compressor, num_cpu) @@ -96,56 +101,36 @@ num_cpu - Furthermore, altough not relevant for this specific benchmark, having the + Furthermore, although not relevant for this specific benchmark, having the converted tarballs available, the compression ratio was computed as follows: - file_size(tarball) - compression_ratio(compressor) = --------------------- - file_size(compressor) - - - 1.3) What software versions were used? - - squashfs-tools-ng v0.9 - - TODO: update data and write the *exact* commit hash here, as well as gcc and - Linux versions. + size(tarball) + max_throughput(compressor) = -------------------------- + min(runtime(compressor)) 1.4) Results The raw timing results are as follows: - Jobs XZ lzma gzip LZO LZ4 zstd - serial 17m39.613s 16m10.710s 9m56.606s 13m22.337s 12.159s 9m33.600s - 1 17m38.050s 15m49.753s 9m46.948s 13m06.705s 11.908s 9m23.445s - 2 9m26.712s 8m24.706s 5m08.152s 6m53.872s 7.395s 5m 1.734s - 3 6m29.733s 5m47.422s 3m33.235s 4m44.407s 6.069s 3m30.708s - 4 5m02.993s 4m30.361s 2m43.447s 3m39.825s 5.864s 2m44.418s - 5 4m07.959s 3m40.860s 2m13.454s 2m59.395s 5.749s 2m16.745s - 6 3m30.514s 3m07.816s 1m53.641s 2m32.461s 5.926s 1m57.607s - 7 3m04.009s 2m43.765s 1m39.742s 2m12.536s 6.281s 1m43.734s - 8 2m45.050s 2m26.996s 1m28.776s 1m58.253s 6.395s 1m34.500s - 9 2m34.993s 2m18.868s 1m21.668s 1m50.461s 6.890s 1m29.820s - 10 2m27.399s 2m11.214s 1m15.461s 1m44.060s 7.225s 1m26.176s - 11 2m20.068s 2m04.592s 1m10.286s 1m37.749s 7.557s 1m22.566s - 12 2m13.131s 1m58.710s 1m05.957s 1m32.596s 8.127s 1m18.883s - 13 2m07.472s 1m53.481s 1m02.041s 1m27.982s 8.704s 1m16.218s - 14 2m02.365s 1m48.773s 1m00.337s 1m24.444s 9.494s 1m14.175s - 15 1m58.298s 1m45.079s 58.348s 1m21.445s 10.192s 1m12.134s - 16 1m55.940s 1m42.176s 56.615s 1m19.030s 10.964s 1m11.049s - - - The sizes of the tarball and the resulting images: - - - LZ4 compressed SquashFS image: ~3.1GiB (3,381,751,808) - - LZO compressed SquashFS image: ~2.5GiB (2,732,015,616) - - zstd compressed SquashFS image: ~2.1GiB (2,295,017,472) - - gzip compressed SquashFS image: ~2.3GiB (2,471,276,544) - - lzma compressed SquashFS image: ~2.0GiB (2,102,169,600) - - XZ compressed SquashFS image: ~2.0GiB (2,098,466,816) - - raw tarball: ~6.5GiB (7,008,118,272) - + Jobs XZ lzma gzip LZO LZ4 zstd none + serial 1108.39s 995.43s 609.79s 753.14s 13.58s 550.59s 5.86s + 1 1116.06s 990.33s 598.85s 753.53s 11.25s 550.37s 4.23s + 2 591.21s 536.61s 312.14s 394.21s 6.41s 294.12s 4.13s + 3 415.90s 370.48s 215.92s 273.14s 4.84s 205.14s 4.58s + 4 320.02s 288.35s 165.50s 210.32s 4.29s 159.71s 4.62s + 5 263.94s 235.69s 136.28s 172.33s 4.19s 132.27s 4.94s + 6 224.23s 200.63s 116.44s 146.80s 4.28s 112.79s 5.08s + 7 196.78s 176.35s 100.66s 128.61s 4.24s 99.26s 5.43s + 8 175.04s 157.82s 89.79s 113.47s 4.46s 88.22s 5.68s + 9 166.52s 148.88s 83.01s 106.14s 4.64s 84.97s 5.76s + 10 159.35s 141.08s 77.04s 99.92s 4.84s 81.61s 5.94s + 11 151.08s 136.27s 71.52s 94.23s 5.00s 77.51s 6.14s + 12 144.72s 128.91s 67.21s 89.33s 5.28s 74.10s 6.39s + 13 137.91s 122.67s 63.43s 84.39s 5.41s 71.83s 6.51s + 14 132.94s 117.79s 59.45s 80.87s 5.71s 68.86s 6.68s + 15 126.76s 113.51s 56.37s 76.68s 5.74s 65.78s 6.91s + 16 119.06s 107.15s 52.56s 71.49s 6.37s 62.52s 7.10s 1.5) Discussion @@ -153,7 +138,7 @@ Most obviously, the results indicate that LZ4, unlike the other compressors, is clearly I/O bound and not CPU bound and doesn't benefit from parallelization beyond 2-4 worker threads and even that benefit is marginal with efficiency - plummetting immediately. + plummeting immediately. The other compressors are clearly CPU bound. Speedup increases linearly until @@ -167,37 +152,55 @@ also gensquashfs). - Using more than 8 jobs causes a much slower increase in speedup and efficency + Using more than 8 jobs causes a much slower increase in speedup and efficiency declines even faster. This is probably due to the fact that the test system only has 8 physical cores and beyond that, SMT has to be used. - It should also be noted that the thread pool compressor with only a single - thread turns out to be *slightly* faster than the serial reference - implementation. A possible explanation for this might be that the fragment - blocks are actually assembled in the main thread, in parallel to the worker - that can still continue with other data blocks. Because of this decoupling - there is in fact some degree of parallelism, even if only one worker thread - is used. - - - As a side effect, this benchmark also produces some insights into the - compression ratio and throughput of the supported compressors. Indicating that - for the Debian live image, XZ clearly provides the highest data density, while - LZ4 is clearly the fastest compressor available. - - The throughput of the zstd compressor is comparable to gzip, while the - resulting compression ratio is closer to LZMA. - - Repeating the benchmark without tail-end-packing and with fragments completely - disabled would also show the effectiveness of tail-end-packing and fragment - packing as a side effect. + It should also be noted that for most of the compressors, as well as the + uncompressed version, the thread pool compressor with only a single thread + turns out to be *slightly* faster than the serial reference implementation. + A possible explanation for this might be that the fragment blocks are actually + assembled in the main thread, in parallel to the worker that can still + continue with other data blocks. Because of this decoupling there is in fact + some degree of parallelism, even if only one worker thread is used. For the + uncompressed version, the work still done in the thread pool is the hashing of + blocks and fragments for de-duplication. + + + Also of interest are the changes from the previous version of the benchmark, + performed on v0.9 of squashfs-tools-ng. Since then, the thread pool design has + been overhauled to spend a lot less time in the critical regions, but to also + perform byte-for-byte equivalence checks before considering blocks or fragments + to be identical. This may require a read-back and decompression step in the + main thread in order to access already written fragment blocks. + + While the overall behavior has stayed the same, performance for XZ & LZMA has + decreased slightly, whereas performance for the gzip, LZ4 & ZSTD has improved + slightly. As the decompression benchmark shows, the first two are a lot slower + at decompression, which needs to be done when reading back a fragment block + from disk, and due to the higher data density also have a higher chance of + actually having to decompress a block, so as a net result, the performance + penalty from exact fragment matching eats all gains from the new thread pool + design. For the more I/O bound compressors like LZ4 & ZSTD, decompressing a + block is done much faster and due to the low data density for LZ4, the chance + of actually having to decompress a block is lowered. As a result, the gains + from the new thread pool design apparently outweigh the read-back penalty. + + + Also noteworthy, due to the inclusion of an uncompressed reference, is that + the LZ4 compressor is actually very close in performance to the uncompressed + version, in some cases even outperforming it. This might be due to the fact + that LZ4 actually does compress blocks, so in many cases where the + uncompressed version needs to read back a full block during deduplication, + the LZ4 version only needs to read a considerably smaller amount of data, + reducing the penalty of having to read back blocks. 2) Reference Decompression Benchmark ************************************ - 1.1) What was measured? + 2.1) What was measured? A SquashFS image was generated for each supported compressor: @@ -205,39 +208,42 @@ And then, for each compressor, the unpacking time was measured: - $ time ./bin/sqfs2tar test.sqfs > /dev/null + $ time -p ./bin/sqfs2tar test.sqfs > /dev/null The unpacking step was repeated 4 times and the worst wall-clock time ("real") was used for comparison. - 2.2) What software version was used? + 2.2) What was computed from the results? - squashfs-tools-ng commit cc1141984a03da003e15ff229d3b417f8e5a24ad + The throughput was established by dividing the size of the resulting tarball by + the time taken to produce it from the image. - gcc version: 10.2.1 20201016 (Red Hat 10.2.1-6) - Linux version: 5.8.16-200.fc32.x86_64 + For better comparison, this was also normalized to the throughput of the + uncompressed SquashFS image. 2.3) Results - gzip 20.466s - lz4 2.519s - lzma 1m58.455s - lzo 10.521s - xz 1m59.451s - zstd 7.833s + xz 120.53s + lzma 118.91s + gzip 20.57s + lzo 10.65s + zstd 7.74s + lz4 2.59s + uncompressed 1.42s 2.4) Discussion From the measurement, it becomes obvious that LZ4 and zstd are the two fastest - decompressors. Zstd is particularly noteworth here, because it is not far - behind LZ4 in speed, but also achievs a substantially better compression ratio - that is somewhere between gzip and lzma. LZ4, despite being the fastest in - decompression and beating the others in compression speed by orders of - magnitudes, has by far the worst compression ratio. + decompressors, both being very close to the uncompressed version. Zstd is + particularly noteworthy here, because it is not far behind LZ4 in speed, but + also achieves a substantially better compression ratio that is + between gzip and lzma. LZ4, despite being the fastest in decompression and + beating the others in compression speed by orders of magnitudes, has by far + the worst compression ratio. It should be noted that the number of actually compressed blocks has not been determined. A worse compression ratio can lead to more blocks being stored @@ -245,14 +251,14 @@ However, since zstd has a better compression ratio than gzip, takes only 30% of the time to decompress, and in the serial compression benchmark only takes 2% - of the time to compress, we cane safely say that in this benchmark, zstd beats + of the time to compress, we can safely say that in this benchmark, zstd beats gzip by every metric. Furthermore, while XZ stands out as the compressor with the best compression ratio, zstd only takes ~6% of the time to decompress the entire image, while - being ~17% bigger than XZ. Shaving off 17% is definitely signifficant, + being ~17% bigger than XZ. Shaving off 17% is definitely significant, especially considering that in absolute numbers it is in the 100MB range, but - it clearly comes at a substential performance cost. + it clearly comes at a substantial performance cost. Also interesting are the results for the LZO compressor. Its compression speed @@ -262,8 +268,8 @@ in compression speed. - Concluding, for applications where a good compression ratio is most imporant, - XZ is obviously the best choice, but if speed is favoured, zstd is probably a + Concluding, for applications where a good compression ratio is most important, + XZ is obviously the best choice, but if speed is favored, zstd is probably a very good option to go with. LZ4 is much faster, but has a lot worse compression ratio. It is probably best suited as transparent compression for a read/write file system or network protocols. @@ -273,6 +279,51 @@ representative of a real-life workload where only a small set of files are accessed in a random access fashion. In that case, a caching layer can largely mitigate the decompression cost, translating it into an initial or only - occasionally occouring cache miss latency. But this benchmark should in theory + occasionally occurring cache miss latency. But this benchmark should in theory give an approximate idea how those cache miss latencies are expected to compare between the different compressors. + + + 3) Compression Size and Overhead Benchmark + ****************************************** + + 3.1) What was measured? + + For each compressor, a SquashFS image was created in the way outlined in the + parallel compression benchmark and the resulting file size was recorded. + + In addition, the raw tarball size was recorded for comparison. + + + 3.2) What was computed from the results? + + The compression ratio was established as follows: + + size(compressor) + ratio(compressor) = -------------------- + size(uncompressed) + + 3.3) Results + + SquashFS tar + Uncompressed ~6.1GiB (6,542,389,248) ~6.5GiB (7,008,118,272) + LZ4 ~3.1GiB (3,381,751,808) + LZO ~2.5GiB (2,732,015,616) + gzip ~2.3GiB (2,471,276,544) + zstd ~2.1GiB (2,295,078,912) + lzma ~2.0GiB (2,102,169,600) + XZ ~2.0GiB (2,098,466,816) + + + 3.4) Discussion + + Obviously XZ and lzma achieve the highest data density, shrinking the SquashFS + image down to less than a third of the input size. + + Noteworthy is also Zstd achieving higher data density than gzip while being + faster in compression as well as decompression. + + + Interestingly, even the uncompressed SquashFS image is still smaller than the + uncompressed tarball. Obviously SquashFS packs data and meta data more + efficiently than the tar format, shaving off ~7% in size. diff --git a/doc/format.adoc b/doc/format.adoc new file mode 100644 index 0000000..f81b6f8 --- /dev/null +++ b/doc/format.adoc @@ -0,0 +1,1067 @@ += Squashfs Binary Format +:toc: left +:toclevels: 4 +:sectnums: + +== About + +SquashFS is a compressed, read-only filesystem for Linux that can also be used +as a flexible, general purpose, compressed archive format, optimized for fast +random access with support for Unix permissions, sparse files and extended +attributes. + +SquashFS supports data and metadata compression through zlib, lz4, lzo, lzma, +xz or zstd. + +For fast random access, compressed files are split up in fixed size blocks +that are compressed separately. +The block size can be set between 4k and 1M (default for squashfs-tools and +squashfs-tools-ng is 128K). + +This document attempts to specify the on-disk format in detail. +It is based on a previous on-line version that was originally written by +Zachary Dremann and subsequently expanded by David Oberhollenzer during +reverse engineering attempts and available here: https://dr-emann.github.io/squashfs/. + +== Overview + +SquashFS always stores integers in little endian format. +The data blocks that make up the SquashFS archive are byte aligned, +i.e. they typically do not care for alignment. +The implementation in the Linux kernel requires the archive itself to +be a multiple of either 1k or 4k in size (called the device block size) +and user space tools typically use 4k to be compatible with both. + +A SquashFS archive consists of a maximum of nine parts: + +[%nowrap] +---- + _______________ +| | Important information about the archive, including +| Superblock | locations of other sections. +|_______________| +| | If non-default compression options have been used, +| Compression | they can optionally be stored here, to facilitate +| options | later, offline editing of the archive. +|_______________| +| | +| Data blocks | The contents of the files in the archive, +| & fragments | split into separately compressed blocks. +|_______________| +| | Metadata (ownership, permissions, etc) for +| Inode table | items in the archive. +|_______________| +| | +| Directory | Directory listings, including file names, and +| table | references to inodes. +|_______________| +| | +| Fragment | Description of fragment locations within the +| table | Datablocks & Fragments section. +|_______________| +| | A mapping from inode numbers to disk locations, +| Export table | required for NFS export. +|_______________| +| | +| UID/GID | A list of unique UID/GIDs. Inodes use an index into +| lookup table | this table to save memory. +|_______________| +| | +| Xattr | Extended attributes for items in the archive. +| table | +|_______________| +---- + +Although the super block details the exact positions of each section, most +implementations, including the one in the Linux kernel, insist on this exact +order. + +=== Packing File Data + +The file data is packed into the archive after the super block (and optional +compressor options). + +Files are divided into fixed size blocks that are separately compressed and +stored in order. SquashFS supports optional tail-end-packing of files that +are not an exact multiple of the block size. The remaining ends can either +be treated as a short block, or can be packed together with the tail ends of +other files in a single "fragment block". Files that are less than block size +are treated the same way. + +If the size of a data or fragment block would exceed the input size after +compression, the original, uncompressed data is stored, so that the size of a +block after compression never exceeds the input block size. + +=== Packing Metadata + +Metadata (e.g. inodes, directory listings, etc...) is treated as a continuous +stream of records that is chopped up into 8KiB blocks that are separately +compressed into special metadata blocks. + +The input size of 8KiB is fixed and independent of the data block size. +Similar to data blocks, if the compressed size would exceed 8KiB, the +uncompressed block is stored instead, so the on-disk size of a metadata +block never exceeds 8KiB. + +Individual entries are allowed to cross the block boundary, so e.g. an inode +may be located at the end of a metadata block with some part of it located at +the start of the next block. Both have to be read and decompressed when +reading this inode. If an entry is written across block boundaries, there +*MUST NOT* be any gap between the compressed metadata blocks on-disk. + + +In contrast to data blocks, every metadata block is preceded by a single, +16 bit unsigned integer. This integer holds the on-disk size of the block +that follows. The MSB is set if the block is stored uncompressed. Whenever +a metadata block is referenced, the position of this integer is given. + +To read a metadata block, seek to the indicated position and read the 16 bit +header. Sanity check that the lower 15 bit are less than 8KiB and proceed +to read that many bytes. If the highest bit of the header is cleared, +uncompress the data into an 8KiB buffer that *MUST NOT* overflow. + + +In the SquashFS archive format, metadata entries (e.g. inodes) are often +referenced using a 64 bit integer. The lower 16 bit hold an offset into the +uncompressed block and the upper 48 bit point to the on-disk location of the +block. + +The on-disk location is relative to the type of metadata entry, e.g. for +inodes it is relative to the start of the inode table given by the +super block. + +=== Storing Lookup Tables + +Lookup tables are arrays (i.e. sequences of identical sized records) that are +addressed by an index. + +Such tables are stored in the SquashFS format as metadata blocks, i.e. by +dividing the table data into 8KiB chunks that are separately compressed and +stored in sequence. + +To allow constant time lookup, a list of 64 bit unsigned integers is stored, +holding the on-disk locations of each metadata block. + +This list itself is stored uncompressed and not preceded by a header. + +When referring to a lookup table, the superblock gives the number of table +entries and points to this location list. + +Since the table entry size is a known, fixed value, the required number of +metadata blocks can be computed: + + block_count = ceil(table_count * entry_size / 8192) + +Which is also the number of 64 bit integers in the location list. + +When resolving a lookup table index, first work out the index of the +metadata block: + + meta_index = floor(index * entry_size / 8192) + +Using this index on the location list yields the on-disk location of +the metadata block containing the entry. + +After reading this metadata block, the byte offset into the block can +be computed to get the entry: + + offset = index * entry_size % 8192 + +The location list can be cached in memory. Resolving an index requires at +worst a single metadata block read (at most 8194 bytes fetched from an +unaligned on-disk location). + +=== Supported Compressors + +The SquashFS format supports the following compressors: + +* zlib deflate (referred to as "gzip" but only uses raw zlib streams) +* lzo +* lzma 1 (considered deprecated) +* lzma 2 (referred to as "xz") +* lz4 +* zstd + +The archive can only specify one compressor in the super block and has to use +it for both file data and metadata compression. Using one compressor for data +and switching to a different compressor for e.g. inodes is not supported. + +While it is technically not possible to pick a "null" compressor in the super +block, an implementation can still deliberately write only uncompressed blocks +to a SquashFS archive, or choose to store certain metadata blocks without +compression. + +The lzma 2 aka xz compressor *MUST* use `CRC32` checksums only. Using `SHA-256` is +not supported. + +== The Superblock + +The superblock is the first section of a SquashFS archive. It is always +96 bytes in size and contains important information about the archive, +including the locations of other sections. + +[cols="1,3,13a",frame="none",grid="none",options="header"] +|=== +| Type | Name | Description +| u32 | magic | Must be set to `0x73717368` ("hsqs" on disk). +| u32 | inode count | The number of inodes stored in the archive. +| u32 | mod time | Last modification time of the archive. Count seconds + since 00:00, Jan 1st 1970 UTC (not counting leap + seconds). This is unsigned, so it expires in the + year 2106 (as opposed to 2038). +| u32 | block size | The size of a data block in bytes. Must be a power + of two between 4096 (4k) and 1048576 (1 MiB). +| u32 | frag count | The number of entries in the fragment table. +| u16 | compressor | An ID designating the compressor used for both data + and meta data blocks. + +[cols=">1,2,8",frame="none",grid="none",options="header"] +!=== +! Value ! Name ! Comment +! 1 ! GZIP ! just zlib streams (no gzip headers\!) +! 2 ! LZO ! +! 3 ! LZMA ! LZMA version 1 +! 4 ! XZ ! LZMA version 2 as used by xz-utils +! 5 ! LZ4 ! +! 6 ! ZSTD ! +!=== + +| u16 | block log | The log~2~ of the block size. If the two fields do not + agree, the archive is considered corrupted. +| u16 | flags | Bit wise *OR* of the flag bits below. + + +[cols=">1m,10",frame="none",grid="none",options="header"] +!=== +! Value ! Meaning +! 0x0001 ! Inodes are stored uncompressed. +! 0x0002 ! Data blocks are stored uncompressed. +! 0x0004 ! Unused, should always be unset. +! 0x0008 ! Fragments are stored uncompressed. +! 0x0010 ! Fragments are not used. +! 0x0020 ! Fragments are always generated. +! 0x0040 ! Data has been deduplicated. +! 0x0080 ! NFS export table exists. +! 0x0100 ! Xattrs are stored uncompressed. +! 0x0200 ! There are no Xattrs in the archive. +! 0x0400 ! Compressor options are present. +! 0x0800 ! The ID table is uncompressed. +!=== + +| u16 | id count | The number of entries in the ID lookup table. +| u16 | version major | Major version of the format. Must be set to 4. +| u16 | version minor | Minor version of the format. Must be set to 0. +| u64 | root inode | A reference to the inode of the root directory. +| u64 | bytes used | The number of bytes used by the archive. Because + SquashFS archives must be padded to a multiple of the underlying + device block size, this can be less than the actual file size. +| u64 | ID table | The byte offset at which the id table starts. +| u64 | Xattr table | The byte offset at which the xattr id table starts. +| u64 | Inode table | The byte offset at which the inode table starts. +| u64 | Dir. table | The byte offset at which the directory table starts. +| u64 | Frag table | The byte offset at which the fragment table starts. +| u64 | Export table | The byte offset at which the export table starts. +|=== + + +The Xattr table, fragment table and export table are optional. If they are +omitted from the archive, the respective fields indicating their position +must be set to `0xFFFFFFFFFFFFFFFF` (i.e. all bits set). + +Most of the flags only serve an informational purpose and are only useful +when editing the archive to convey the original packer settings. + +The only flag that actually carries information is the "Compressor options are +present" flag. In fact, this is the only flag that the Linux kernel +implementation actually tests for. + +The compressor options, however, are also only there for informal purpose, as +most compression libraries understand their own stream format irregardless of +the options used to compress and in fact don't provide any options for the +decompressor. In the Linux kernel, the XZ decompressor is currently the only +one that processes those options to pre-allocate the LZMA dictionary if a +non-default size was used. + +=== Compression Options + +If the compressor options flag is set in the superblock, the superblock is +immediately followed by a single metadata block, which is always uncompressed. + +The data stored in this block is compressor dependent. + +There are two special cases: + +* For LZ4, the compressor options always have to be present. +* The LZMA compressor does not support compressor options, so this section + must never be present. + +For the compressors currently implemented, a 4 to 8 byte payload follows. + +The following sub sections outline the contents for each compressor that +supports options. The default values if the options are missing are outlined +as well. + +==== GZIP + +[cols="1,4,20a",frame="none",grid="none",options="header"] +|=== +| Type | Name | Description +| u32 | compression level | In the range 1 to 9 (inclusive). Defaults to 9. +| u16 | window size | In the rage 8 to 15 (inclusive). Defaults to 15. +| u16 | strategies | A bit field describing the enabled strategies. + If no flags are set, the default strategy is + implicitly used. Please consult the ZLIB manual + for details on specific strategies. + +[cols=">1m,10",frame="none",grid="none",options="header"] +!=== +! Value ! Comment +! 0x0001 ! Default strategy. +! 0x0002 ! Filtered. +! 0x0004 ! Huffman Only. +! 0x0008 ! Run Length Encoded. +! 0x0010 ! Fixed. +!=== +|=== + +NOTE: The SquashFS writer typically tries all selected strategies (including +not setting any and letting zlib work with defaults) and stores the result +with the smallest size. + +==== XZ + + +[cols="1,4,20a",frame="none",grid="none",options="header"] +|=== +| Type | Name | Description +| u32 | dictionary size | *SHOULD* be >= 8KiB, and must be either a power of + 2, or the sum of two consecutive powers of 2. +| u32 | Filters | A bit field describing the additional enabled + filters attempted to better compress executable + code. + +[cols=">1m,10",frame="none",grid="none",options="header"] +!=== +! Value ! Comment +! 0x0001 ! x86 +! 0x0002 ! PowerPC +! 0x0004 ! IA64 +! 0x0008 ! ARM +! 0x0010 ! ARM thumb +! 0x0020 ! SPARC +!=== +|=== + +NOTE: A SquashFS writer typically tries all selected VLI filters (including +not setting any and letting libxz work with defaults) and stores the resulting +block that has the smallest size. + +Also note that further options, such as XZ presets, are not included. The +compressor typically uses the libxz defaults, i.e. level 6 and not using the +extreme flag. Likewise for `lc`, `lp` and `pb` (defaults are 3, 0 and 2 +respectively). + +If the encoder chooses to change those values, the decoder will still be +able to read the data, but there is currently no way to convey that those +values were changed. + +This is specifically problematic for the compression level, since increasing +the level can result in drastically increasing the decoders memory consumption. + +==== LZ4 + +[cols="1,4,20a",frame="none",grid="none",options="header"] +|=== +| Type | Name | Description +| u32 | Version | *MUST* be set to 1. +| u32 | Flags | A bit field describing the enabled LZ4 flags. + There is currently only one possible flag: + + +[cols=">1m,10",frame="none",grid="none",options="header"] +!=== +! Value ! Comment +! 0x0001 ! Use LZ4 High Compression(HC) mode. +!=== +|=== + +==== ZSTD + +[cols="1,4,20a",frame="none",grid="none",options="header"] +|=== +| Type | Name | Description +| u32 | compression level | Should be in range 1 to 22 (inclusive). The real + maximum is the zstd defined ZSTD_maxCLevel(). + + + + + The default value is 15. +|=== + +==== LZO + +[cols="1,4,20a",frame="none",grid="none",options="header"] +|=== +| Type | Name | Description +| u32 | algorithm | Which variant of LZO to use. + +[cols=">1m,10",frame="none",grid="none",options="header"] +!=== +! Value ! Comment +! 0 ! lzo1x_1 +! 1 ! lzo1x_1_11 +! 2 ! lzo1x_1_12 +! 3 ! lzo1x_1_15 +! 4 ! lzo1x_999 (default) +!=== + +| u32 | compression level | For lzo1x_999, this can be a value between 0 + and 9 inclusive (defaults to 8). *MUST* be 0 + for all other algorithms. +|=== + +== Data and Fragment Blocks + +As outlined in 2.1, file data is packed by dividing the input files into fixed +size chunks (the block size from the super block) that are stored in sequence. + +The picture below tries to illustrate this concept: + +.Packing of File Data +[%nowrap] +.... + _____ _____ _____ _ _____ _____ _ _ +File A: |__A__|__A__|__A__|A| File B: |__B__|__B__|B| File C: |C| + | | | | | | | | + | +---+ | | | | | | + | | +------+ | | | | | + | | | | | | | | + | | | +------|---------------+ | | | + | | | | +--|---------------------+ | | + | | | | | | | | + | | | | | +-----------------------+ | +------------+ + | | | | | | | | + V V V V V V V V + __ _ ___ ___ ___ __ Fragment block: |A|B|C| + Output: |_A|A|_A_|_B_|_B_|_F| | + __V__ + A |__F__| + | | + +------------------------+ +.... + +In the above diagram, file A consists of 3 blocks and a single tail end, file B has +2 blocks and one tail end while file C is smaller than block size. + + +For each file, the blocks are individually compressed and stored on disk +in order. + +The tail ends of A and B, together with the entire contents of C are packed +together into a fragment block F, that is compressed and stored on disk once +it is full. + +This tail-end-packing is completely optional. The tail ends (or in case of C +the entire file) can also be treated as truncated blocks that expand to less +than block size when uncompressed. + + +There are no headers in front of data or fragment blocks and there *MUST NOT* be +any gaps between data blocks from a single file, but a SquashFS packer is free +to leave gaps between two different files or fragment blocks. The packer is +also free to decide how to arrange fragments within a fragment block and what +fragments to pack together. + +To locate file data, the file inodes store the following information: + +* The uncompressed size of the file. From this, the number of blocks can + be computed: + + block_count = floor(file_size / block_size) # if tail end packing is used + block_count = ceil(file_size / block_size) # otherwise + +* The exact location of the first block, if one exists. +* For each consecutive block, the on-disk size. ++ +A 32 bit integer is used with bit 24 (i.e. `1 << 24`) set if the block +is stored uncompressed. + +* If tail-end-packing was done, the location of the fragment block and a + byte offset into the uncompressed fragment block. The size of the tail + end can be computed easily: + + tail_end_size = file_size % block_size + +Since a fragment block will likely be referred to by multiple files, inodes +don't store its on-disk location and size directly, but instead use a 32 bit +index into a fragment block lookup table (see the <<Fragment Table>>). + +If a data block other than the last one unpacks to less than block size, the +rest of the buffer is filled with 0 bytes. This way, sparse files are +implemented. Specifically if a block has an on-disk size of 0 this translates +to an entire block filled with 0 bytes without having to retrieve any data +from disk. + +The on-disk locations of file blocks *MAY* overlap and different file inodes are +free to refer to the same fragment. Typical SquashFS packers would explicitly +use this to for files that are duplicates of others. Doing so is NOT counted +as a hard link. + +If an inode references on-disk locations outside the data area, the result is +undefined. + +== Inode Table + +Inodes are packed into metadata blocks and are not aligned, i.e. they can span +the boundary between metadata blocks. To save space, there are different +inodes for each type (regular file, directory, device, etc.) of varying +contents and size. + +To further save more space, inodes come in two flavors: simple inode types +optimized for a simple, standard use case, and extended inode types where +extra information has to be stored. + +SquashFS more or less supports 32 bit UIDs and GIDs. As an optimization, those +IDs are stored in a lookup table (see <<ID Table>>) and the inodes themselves +hold a 16 bit index into this table. This allows to 32 bit UIDs/GIDs, but only +among 2^16^ unique values. + +The location of the first metadata block is indicated by the inode table start +in the superblock. The inode table ends at the start of the directory table. + +=== Common Inode Header + +All Inodes share a common header, which contains some common information, +as well as describing the type of Inode which follows. This header has the +following structure: + +[cols="1,4,20a",frame="none",grid="none",options="header"] +|=== +| Type | Name | Description +| u16 | type | The type of item described by the inode which follows this header + +[cols=">1,10",frame="none",grid="none",options="header"] +!=== +! Value ! Comment +! 1 ! Basic Directory +! 2 ! Basic File +! 3 ! Basic Symlink +! 4 ! Basic Block Device +! 5 ! Basic Character Device +! 6 ! Basic Named Pipe (FIFO) +! 7 ! Basic Socked +! 8 ! Extended Directory +! 9 ! Extended File +! 10 ! Extended Symlink +! 11 ! Extended Block Device +! 12 ! Extended Character Device +! 13 ! Extended Named Pipe (FIFO) +! 14 ! Extended Socked +!=== + +| u16 | permissions | A bit mask representing Unix file system permissions + for the inode. This only stores permissions, not the + type. The type is reconstructed from the field above. +| u16 | uid | An index into the <<ID Table>>, giving the user ID of the owner. +| u16 | gid | An index into the <<ID Table>>, giving the group ID of the owner. +| u32 | mtime | The unsigned number of seconds (not counting leap + seconds) since 00:00, Jan 1st, 1970 UTC when the item + described by the inode was last modified. +| u32 | inode number | Unique node number. Must be at least 1 and at most + the inode count from the super block. +|=== + +=== Directory Inodes + +Directory inodes mainly contain a reference into the directory table where +the listing of entries is stored. + +A basic directory has an entry listing of at most 64k (uncompressed) and +no extended attributes. The layout of the inode data is as follows: + +[cols="1,4,20a",frame="none",grid="none",options="header"] +|=== +| Type | Name | Description +| u32 | block index | The location of the metadata block in the directory + table where the entry information starts. This is + relative to the directory table location. +| u32 | link count | The number of hard links to this directory. +| u16 | file size | Total (uncompressed) size in bytes of the entry + listing in the directory table, including headers. + + + + + This value is 3 bytes larger than the real listing. + The Linux kernel creates "." and ".." entries for + offsets 0 and 1, and only after 3 looks into the + listing, subtracting 3 from the size. +| u16 | block offset | The (uncompressed) offset within the metadata block + in the directory table where the directory listing + starts. +| u32 | parent inode | The inode number of the parent of this directory. If + this is the root directory, this *SHOULD* be 0. +|=== + +NOTE: For historical reasons, the hard link count of a directory includes +the number of entries in the directory and is initialized to 2 for an empty +directory. I.e. a directory with N entries has at least N + 2 link count. + +If the "file size" is set to a value < 4, the directory is empty and there is +no corresponding listing in the directory table. + +An extended directory can have a listing that is at most 4GiB in size, may +have extended attributes and can have an optional index for faster lookup: + +[cols="1,4,20a",frame="none",grid="none",options="header"] +|=== +| Type | Name | Description +| u32 | link count | Same as above. +| u32 | file size | Same as above. +| u32 | block index | Same as above. +| u32 | parent inode | Same as above. +| u16 | index count | The number of directory index entries following the + inode structure. +| u16 | block offset | Same as above. +| u32 | xattr index | An index into the <<Xattr Table>> or `0xFFFFFFFF` + if the inode has no extended attributes. +|=== + + +The index follows directly after the inode. See <<Directory Index>> for details on +how the directory index is structured. + +=== File Inodes + +Basic files can be at most 4 GiB in size (uncompressed), must be located +within the first 4 GiB of the SquashFS image, cannot have any extended +attributes and don't support hard-link or sparse file accounting: + + +[cols="1,4,20a",frame="none",grid="none",options="header"] +|=== +| Type | Name | Description +| u32 | blocks start | The offset from the start of the archive to the first + data block. +| u32 | frag index | An index into the <<Fragment Table>> which describes the fragment + block that the tail end of this file is stored in. If not used, + this is set to `0xFFFFFFFF`. +| u32 | block offset | The (uncompressed) offset within the fragment block + where the tail end of this file is. See <<Data and Fragment Blocks>> + for details. +| u32 | file size | The (uncompressed) size of this file. +| u32[] | block sizes | An array of consecutive block sizes. See <<Data and Fragment Blocks>> for details. +|=== + +If 'frag index' is set to `0xFFFFFFFF`, the number of blocks is computed as + + ceil(file_size / block_size) + +otherwise, if 'frag index' is a valid fragment index, the block count is +computed as + + floor(file_size / block_size) + +and the size of the tail end is + + file_size % block_size + + +To access a data block, first compute the block index as + + index = floor(offset / block_size) + +then compute the on-disk location of the block by summing up the sizes of the +blocks that come before it: + + location = block_start + + for i = 0; i < index; i++ + location += block_sizes[i] & 0x00FFFFFF + + +The tail end, if present, is accessed by resolving the fragment index through +the fragment lookup table (see the <<Fragment Table>>), loading the fragment block and +using the given 'block offset' into the fragment block. + +Extended files have a 64 bit location and size, have additional counters for +sparse file accounting and hard links, and can have extended attributes: + +[cols="1,4,20a",frame="none",grid="none",options="header"] +|=== +| Type | Name | Description +| u64 | blocks start | Same as above (but larger). +| u64 | file size | Same as above (but larger). +| u64 | sparse | The number of bytes saved by omitting zero bytes. + Used in the kernel for sparse file accounting. +| u32 | link count | The number of hard links to this node. +| u32 | frag index | Same as above. +| u32 | block offset | Same as above. +| u32 | xattr index | An index into the <<Xattr Table>> or `0xFFFFFFFF` + if the inode has no extended attributes. +| u32[] | block sizes | Same as above. +|=== + +=== Symbolic Links + +Symbolic links mainly have a target path stored directly after the inode +header, as well as a hard-link counter (yes, you can have hard links to +symlinks): + +[cols="1,4,20a",frame="none",grid="none",options="header"] +|=== +| Type | Name | Description +| u32 | link count | The number of hard links to this symlink. +| u32 | target size | The size in bytes of the target path this symlink + points to. +| u8[] | target path | An array of bytes holding the target path this + symlink points to. The path is 'target size' bytes + long and NOT null-terminated. +|=== + +The extended symlink type adds an additional extended attribute index: + +[cols="1,4,20a",frame="none",grid="none",options="header"] +|=== +| Type | Name | Description +| u32 | link count | Same as above. +| u32 | target size | Same as above. +| u8[] | target path | Same as above. +| u32 | xattr index | An index into the <<Xattr Table>> +|=== + +=== Device Special Files + +Basic device special files only store a hard-link counter and a device number. +The layout is identical for both character and block devices: + +[cols="1,4,20a",frame="none",grid="none",options="header"] +|=== +| Type | Name | Description +| u32 | link count | The number of hard links to this entry. +| u32 | device number | The system specific device number. + + + + + On Linux, this consists of major and minor device + numbers that can be extracted as follows: + + major = (dev & 0xFFF00) >> 8. + minor = (dev & 0x000FF) | ((dev >> 12) & 0xFFF00) +|=== + +The extended device file inode adds an additional extended attribute index: + +[cols="1,4,20a",frame="none",grid="none",options="header"] +|=== +| Type | Name | Description +| u32 | link count | Same as above. +| u32 | device number | Same as above. +| u32 | xattr index | An index into the <<Xattr Table>> +|=== + +=== IPC Inodes (FIFO or Socket) + +Named pipe (FIFO) and socket special files only add a hard-link counter +after the inode header: + +[cols="1,4,20a",frame="none",grid="none",options="header"] +|=== +| Type | Name | Description +| u32 | link count | The number of hard links to this entry. +|=== + +The extended versions add an additional extended attribute index: + +[cols="1,4,20a",frame="none",grid="none",options="header"] +|=== +| Type | Name | Description +| u32 | link count | Same as above. +| u32 | xattr index | An index into the <<Xattr Table>> +|=== + +== Directory Table + +For each directory inode, the directory table stores a linear list of all +entries, with references back to the inodes that describe those entries. + +The entry list itself is sorted ASCIIbetically by entry name and split into +multiple runs, each preceded by a short header. + +The directory inodes store the total, uncompressed size of the entire listing, +including headers. Using this size, a SquashFS reader can determine if another +header with further entries should be following once it reaches the end of a +run. + +To save space, the header indicates a metadata block and a reference inode +number. The entries that follow simply store a difference to that inode number +and an offset into the specified metadata block. + +Every time, the inode block changes or the difference of the inode number +to the reference in the header cannot be encoded in 16 bits anymore, a new +header is emitted. + +A header must be followed by *AT MOST* 256 entries. If there are more entries, +a new header *MUST* be emitted. + +Typically, inode allocation strategies would sort the children of a directory +and then allocate inode numbers incrementally, to optimize directory entry +listings. + +Since hard links might be further further away than ±32k of the reference +number, they might require a new header to be emitted. Inode number allocation +and picking of the reference could of course be optimized to prevent this. + +The directory header has the following structure: + +[cols="1,4,20a",frame="none",grid="none",options="header"] +|=== +| Type | Name | Description +| u32 | count | Number of entries following the header. +| u32 | start | The location of the metadata block in the inode table + where the inodes are stored. This is relative to the + inode table start from the super block. +| s32 | inode number | An arbitrary inode number. The entries that follow + store their inode number as a difference to this. +|=== + +The counter is stored off-by-one, i.e. a value of 0 indicates 1 entry follows. +This also makes it impossible to encode a size of 0, which wouldn't make any +sense. Empty directories simply have their size set to 0 in the inode instead, +so no extra dummy header has to be stored or looked up. + +The header is followed by multiple entries that each have this structure: + +[cols="1,4,20a",frame="none",grid="none",options="header"] +|=== +| Type | Name | Description +| u16 | offset | An offset into the uncompressed inode metadata block. +| s16 | inode offset | The difference of this inode's number to the reference + stored in the header. +| u16 | type | The inode type. For extended inodes, the basic type is stored + here instead. +| u16 | name size | One less than the size of the entry name. +| u8[] | name | The file name of the entry without a trailing null byte. Has + `name size` + 1 bytes. +|=== + +In the entry structure itself, the file names are stored without trailing null +bytes. Since a zero length name makes no sense, the name length is stored +off-by-one, i.e. the value 0 cannot be encoded. + +The inode type is stored in the entry, but always as the corresponding +basic type. + +While the field is technically 16 bits, the kernel implementation currently +imposes an arbitrary limit of 255 on the name size field. Since the field is +off-by-one, this means that a file name in SquashFS can be at most 256 +characters long. + +=== Directory Index + +To speed up lookups on directories with lots of entries, the extended +directory inode can store an index, holding the locations of all directory +headers and the name of the first entry after the header. + +When searching for an entry, the reader can then iterate over the index to +find a range of metadata blocks that should contain a given entry and then +only scan over the given range. + +To allow for even faster lookups, a new header should be emitted every time +the entry list crosses a metadata block boundary. This narrows the boundary +down to a single metadata block lookup in most cases. + +The index entries have the following structure: + +[cols="1,4,20a",frame="none",grid="none",options="header"] +|=== +| Type | Name | Description +| u32 | index | This stores a byte offset from the first directory + header to the current header, as if the uncompressed + directory metadata blocks were laid out in memory + consecutively. +| u32 | start | Start offset of a directory table metadata block, + relative to the directory table start. +| u32 | name size | One less than the size of the entry name. +| u8[] | name | The name of the first entry following the header + without a trailing null byte. +|=== + +== Fragment Table + +Tail-ends and smaller than block size files can be combined into fragment +blocks that are at most 'block size' bytes long. + +The fragment table describes the location and size of the fragment blocks +(not the tail-ends within them). + +This is a lookup table which stores entries of the following shape: + +[cols="1,4,20a",frame="none",grid="none",options="header"] +|=== +| Type | Name | Description +| u64 | start | The offset within the archive where the fragment block starts +| u32 | size | The on-disk size of the fragment block. If the block is + uncompressed, bit 24 (i.e. `1 << 24`) is set. +| u32 | unused | *SHOULD* be set to 0. +|=== + +The table is stored on-disk as described in <<Storing Lookup Tables>>. + +The fragment table location in the superblock points to an array of 64 bit +integers that store the on-disk locations of the metadata blocks containing +the lookup table. + +Each metadata block can store up to 512 entries (`8129 / 16`). + +The "unused" field is there for alignment and *SHOULD* be set to 0, however the +Linux kernel currently ignores this field completely, making it impossible for +Linux to ever re-purpose this field. + +== Export Table + +To support NFS exports, SquashFS needs a fast way to resolve an inode number +to an inode structure. + +For this purpose, a SquashFS archive can optionally contain an export table, +which is basically a flat array of 64 bit inode references, with the inode +number being used as an index into the array. + +Because the inode number 0 is not used (reserved as a sentinel value), the +array actually starts at inode number 1 and the index is thus +inode_number - 1. + +The array itself is stored in a series of metadata blocks, as outlined in +<<Storing Lookup Tables>>. + +Since each block can store 1024 references (`8192 / 8`), there will be +`ceil(inode_count / 1024)` metadata blocks for the entire array. + +== ID Table + +As outlined in <<Common Inode Header>>, SquashFS supports 32 bit user and group IDs. To +compact the inode table, the unique UID/GID values are collected in a lookup +table and a 16 bit table index is stored in the inode instead. + +This lookup table is stored as outlined in <<Storing Lookup Tables>>. + +Each metadata block can store up to 2048 IDs (`8192 / 4`). + +[[_xattr_table,Xattr Table]] +== Extended Attribute Table + +Extended attributes are arbitrary key value pairs attached to inodes. The key +names use dots as separators to create a hierarchy of name spaces. + +The key value pairs of all inodes are stored consecutively in a series of +metadata blocks. + +The values can either be stored inline, i.e. a key entry is directly followed +by a value, or out-of-line to deduplicate identical values and use a reference +instead. Typically, the first occurrence of a value is stored in line and +every consecutive use of the same value uses an out-of-line reference back to +the first one. + +The keys are stored using the following data structure: + +[cols="1,4,20a",frame="none",grid="none",options="header"] +|=== +| Type | Name | Description +| u16 | type | A prefix ID for the key name. If the value that follows + is stored out-of-line, the flag `0x0100` is **OR**ed to the + type ID. + +[cols=">1,10",frame="none",grid="none",options="header"] +!=== +! Value ! Comment +! 0 ! Prefix the name with `"user."` +! 1 ! Prefix the name with `"trusted."` +! 2 ! Prefix the name with `"security."` +!=== + +| u16 | name size | The number of key bytes the follows. +| u8[] | name | The remainder of the key without the prefix and without a + trailing null byte. +|=== + +Following the key, this structure is used to store the value: + +[cols="1,4,20a",frame="none",grid="none",options="header"] +|=== +| Type | Name | Description +| u32 | value size | The size of the value string. If the value is stored + out of line, this is always 8, i.e. the size of an + unsigned 64 bit integer. +| u8[] | value | This is 'value size' bytes of arbitrary binary data. + If the value is stored out-of-line, this is a 64 bit + reference, i.e. a location of a metadata block, + shifted left by 16 and **OR**ed with an offset into the + uncompressed block, giving the location of another + value structure. +|=== + +The metadata block location given by an out-of-line reference is relative to +the location of the first block. + +To actually address a block of key value pairs associated with an inode, a +lookup table is used that specifies the start and size of a sequence of key +value pairs. + +All an inode needs to store is a 32 bit index into this table. If two inodes +have an identical attribute sets, the key/value sequence is only written once, +there is only one lookup table entry and both inodes have the same index. + +Each lookup table entry has the following structure: + +[cols="1,4,20a",frame="none",grid="none",options="header"] +|=== +| Type | Name | Description +| u64 | xattr ref | A reference to the start of the key value block, i.e. + the metadata block location shifted left by 16, **OR**ed + with an offset into the uncompressed block. +| u32 | count | The number of key value pairs. +| u32 | size | The exact, uncompressed size in bytes of the entire + block of key value pairs, counting what has been + written to disk and including the key/value entry + structures. +|=== + +This lookup table is stored as outlined in <<Storing Lookup Tables>> + +Each metadata block can hold 512 (`8192 / 16`) entries. + +However, in contrast to <<Storing Lookup Tables>>, additional data is given before +the list of metadata block locations, to locate the key-value pairs, as well as the +actual number of lookup table entries that are not specified in the super +block. + +The 'Xattr table' entry in the superblock gives the absolute location of the +following data structure which is stored on-disk as is, uncompressed: + +[cols="1,4,20a",frame="none",grid="none",options="header"] +|=== +| Type | Name | Description +| u64 | kv start | The absolute position of the first metadata block holding the + key/value pairs. +| u32 | count | The number of entries in the lookup table. +| u32 | unused | *SHOULD* be set to 0, however Linux currently ignores + this field completely and squashfs-tools used to leak + stack data here, making it impossible for Linux to + ever re-purpose this field. +| u64[] | locations | An array holding the absolute on-disk location of each + metadata block of the lookup table. +|=== + +If an inode has a a valid xattr index (i.e. not `0xFFFFFFFF`), the metadata +block index is computed as + + block_idx = floor(index / 512) + +which is then used to retrieve the metadata block index from the locations +array. + +Once the block has been read from disk and uncompressed, the byte offset into +the metadata block can be computed as + + offset = (index * 16) % 8192 + +From this position, the structure can be read that holds a reference to the +metadata block that contains the key/value pairs (and byte offset into the +uncompressed block where the pairs start), as well as the number of key/value +pairs and their total, uncompressed size. diff --git a/doc/format.txt b/doc/format.txt deleted file mode 100644 index a783c07..0000000 --- a/doc/format.txt +++ /dev/null @@ -1,1216 +0,0 @@ - - Squashfs Binary Format - ********************** - - 0) Index - ******** - - 0............Index - 1............About - 2............Overview - 2.1........Packing File Data - 2.2........Packing Metadata - 2.3........Storing Lookup Tables - 3............The Superblock - 3.1........Compression Options - 3.1.1....GZIP - 3.1.2....XZ - 3.1.3....LZ4 - 3.1.4....ZSTD - 3.1.5....LZO - 4............Data and Fragment Blocks - 5............Inode Table - 5.1........Common Inode Header - 5.2........Directory inodes - 5.3........File Inodes - 5.4........Symbolic Links - 5.5........Device Special Files - 5.6........IPC inodes (FIFO or Socket) - 6............Directory Table - 6.1........Directory Index - 7............Fragment Table - 8............Export Table - 9............ID Table - 10...........Extended Attribute Table - - 1) About - ******** - - SquashFS is a compressed, read-only filesystem for Linux that can also be used - as a flexible, general purpose, compressed archive format, optimized for fast - random access with support for Unix permissions, sparse files and extended - attributes. - - SquashFS supports data and metadata compression through zlib, lz4, lzo, lzma, - xz or zstd. - - For fast random access, compressed files are split up in fixed size blocks - that are compressed separately. The block size can be set between 4k and 1M - (default for squashfs-tools and squashfs-tools-ng is 128K). - - - This document attempts to specify the on-disk format in detail. - - It is based on a previous on-line version that was originally written by - Zachary Dremann and subsequently expanded by David Oberhollenzer during - reverse engineering attempts and available here: - - https://dr-emann.github.io/squashfs/ - - - 2) Overview - *********** - - SquashFS always stores integers in little endian format. The data blocks that - make up the SquashFS archive are byte aligned, i.e. they typically do not care - for alignment. The implementation in the Linux kernel requires the archive - itself to be a multiple of either 1k or 4k in size (called the device block - size) and user space tools typically use 4k to be compatible with both. - - A SquashFS archive consists of a maximum of nine parts: - - _______________ - | | Important information about the archive, including - | Superblock | locations of other sections. - |_______________| - | | If non-default compression options have been used, - | Compression | they can optionally be stored here, to facilitate - | options | later, offline editing of the archive. - |_______________| - | | - | Data blocks | The contents of the files in the archive, - | & fragments | split into separately compressed blocks. - |_______________| - | | Metadata (ownership, permissions, etc) for - | Inode table | items in the archive. - |_______________| - | | - | Directory | Directory listings, including file names, and - | table | references to inodes. - |_______________| - | | - | Fragment | Description of fragment locations within the - | table | Datablocks & Fragments section. - |_______________| - | | A mapping from inode numbers to disk locations, - | Export table | required for NFS export. - |_______________| - | | - | UID/GID | A list of unique UID/GIDs. Inodes use an index into - | lookup table | this table to save memory. - |_______________| - | | - | Xattr | Extended attributes for items in the archive. - | table | - |_______________| - - - Although the super block details the exact positions of each section, most - implementations, including the one in the Linux kernel, insist on this exact - order. - - - 2.1) Packing File Data - - The file data is packed into the archive after the super block (and optional - compressor options). - - Files are divided into fixed size blocks that are separately compressed and - stored in order. SquashFS supports optional tail-end-packing of files that - are not an exact multiple of the block size. The remaining ends can either - be treated as a short block, or can be packed together with the tail ends of - other files in a single "fragment block". Files that are less than block size - are treated the same way. - - If the size of a data or fragment block would exceed the input size after - compression, the original, uncompressed data is stored, so that the size of a - block after compression never exceeds the input block size. - - - 2.2) Packing Metadata - - Metadata (e.g. inodes, directory listings, etc...) is treated as a continuous - stream of records that is chopped up into 8KiB blocks that are separately - compressed into special metadata blocks. - - The input size of 8KiB is fixed and independent of the data block size. - Similar to data blocks, if the compressed size would exceed 8KiB, the - uncompressed block is stored instead, so the on-disk size of a metadata - block never exceeds 8KiB. - - Individual entries are allowed to cross the block boundary, so e.g. an inode - may be located at the end of a metadata block with some part of it located at - the start of the next block. Both have to be read and decompressed when - reading this inode. If an entry is written across block boundaries, there - MUST NOT be any gap between the compressed metadata blocks on-disk. - - - In contrast to data blocks, every metadata block is preceded by a single, - 16 bit unsigned integer. This integer holds the on-disk size of the block - that follows. The MSB is set if the block is stored uncompressed. Whenever - a metadata block is referenced, the position of this integer is given. - - To read a metadata block, seek to the indicated position and read the 16 bit - header. Sanity check that the lower 15 bit are less than 8KiB and proceed - to read that many bytes. If the highest bit of the header is cleared, - uncompress the data into an 8KiB buffer that MUST NOT overflow. - - - In the SquashFS archive format, metadata entries (e.g. inodes) are often - referenced using a 64 bit integer. The lower 16 bit hold an offset into the - uncompressed block and the upper 48 bit point to the on-disk location of the - block. - - The on-disk location is relative to the type of metadata entry, e.g. for - inodes it is relative to the start of the inode table given by the - super block. - - - 2.3) Storing Lookup Tables - - Lookup tables are arrays (i.e. sequences of identical sized records) that are - addressed by an index. - - Such tables are stored in the SquashFS format as metadata blocks, i.e. by - dividing the table data into 8KiB chunks that are separately compressed and - stored in sequence. - - To allow constant time lookup, a list of 64 bit unsigned integers is stored, - holding the on-disk locations of each metadata block. - - This list itself is stored uncompressed and not preceded by a header. - - When referring to a lookup table, the superblock gives the number of table - entries and points to this location list. - - Since the table entry size is a known, fixed value, the required number of - metadata blocks can be computed: - - block_count = ceil(table_count * entry_size / 8192) - - Which is also the number of 64 bit integers in the location list. - - - When resolving a lookup table index, first work out the index of the - metadata block: - - meta_index = floor(index * entry_size / 8192) - - Using this index on the location list yields the on-disk location of - the metadata block containing the entry. - - After reading this metadata block, the byte offset into the block can - be computed to get the entry: - - offset = index * entry_size % 8192 - - - The location list can be cached in memory. Resolving an index requires at - worst a single metadata block read (at most 8194 bytes fetched from an - unaligned on-disk location). - - - 2.4) Supported Compressors - - The SquashFS format supports the following compressors: - - - zlib deflate (referred to as "gzip" but only uses raw zlib streams) - - lzo - - lzma 1 (considered deprecated) - - lzma 2 (referred to as "xz") - - lz4 - - zstd - - The archive can only specify one compressor in the super block and has to use - it for both file data and metadata compression. Using one compressor for data - and switching to a different compressor for e.g. inodes is not supported. - - While it is technically not possible to pick a "null" compressor in the super - block, an implementation can still deliberately write only uncompressed blocks - to a SquashFS archive, or choose to store certain metadata blocks without - compression. - - The lzma 2 aka xz compressor MUST use CRC32 checksums only. Using SHA-256 is - not supported. - - - 3) The superblock - ***************** - - The superblock is the first section of a SquashFS archive. It is always - 96 bytes in size and contains important information about the archive, - including the locations of other sections. - - +======+===============+=====================================================+ - | Type | Name | Description | - +======+===============+=====================================================+ - | u32 | magic | Must be set to 0x73717368 ("hsqs" on disk). | - +------+---------------+-----------------------------------------------------+ - | u32 | inode count | The number of inodes stored in the archive. | - +------+---------------+-----------------------------------------------------+ - | u32 | mod time | Last modification time of the archive. Count seconds| - | | | since 00:00, Jan 1st 1970 UTC (not counting leap | - | | | seconds). This is unsigned, so it expires in the | - | | | year 2106 (as opposed to 2038). | - +------+---------------+-----------------------------------------------------+ - | u32 | block size | The size of a data block in bytes. Must be a power | - | | | of two between 4096 (4k) and 1048576 (1 MiB) | - +------+---------------+-----------------------------------------------------+ - | u32 | frag count | The number of entries in the fragment table | - +------+---------------+-----------------------------------------------------+ - | u16 | compressor | An ID designating the compressor used for both data | - | | | and meta data blocks. | - | | | | - | | +-------+------+--------------------------------------+ - | | | Value | Name | Comment | - | | +-------+------+--------------------------------------+ - | | | 1 | GZIP | just zlib streams (no gzip headers!) | - | | | 2 | LZO | | - | | | 3 | LZMA | LZMA version 1 | - | | | 4 | XZ | LZMA version 2 as used by xz-utils | - | | | 5 | LZ4 | | - | | | 6 | ZSTD | | - +------+---------------+-------+------+--------------------------------------+ - | u16 | block log | The log2 of the block size. If the two fields do not| - | | | agree, the archive is considered corrupted. | - +------+---------------+-----------------------------------------------------+ - | u16 | flags | Bit wise OR of the flag bits below. | - | | | | - | | +--------+--------------------------------------------+ - | | | Value | Meaing | - | | +--------+--------------------------------------------+ - | | | 0x0001 | Inodes are stored uncompressed. | - | | | 0x0002 | Data blocks are stored uncompressed. | - | | | 0x0008 | Fragments are stored uncompressed. | - | | | 0x0010 | Fragments are not used. | - | | | 0x0020 | Fragments are always generated. | - | | | 0x0040 | Data has been deduplicated. | - | | | 0x0080 | NFS export table exists. | - | | | 0x0100 | Xattrs are stored uncompressed. | - | | | 0x0200 | There are no Xattrs in the archive. | - | | | 0x0400 | Compressor options are present. | - | | | 0x0800 | The ID table is uncompressed. | - +------+---------------+--------+--------------------------------------------+ - | u16 | id count | The number of entries in the ID lookup table. | - +------+---------------+-----------------------------------------------------+ - | u16 | version major | Major version of the format. Must be set to 4. | - +------+---------------+-----------------------------------------------------+ - | u16 | version minor | Minor version of the format. Must be set to 0. | - +------+---------------+-----------------------------------------------------+ - | u64 | root inode | A reference to the inode of the root directory. | - +------+---------------+-----------------------------------------------------+ - | u64 | bytes used | The number of bytes used by the archive. Because | - | | | SquashFS archives must be padded to a multiple of | - | | | the underlying device block size, this can be less | - | | | than the actual file size. | - +------+---------------+-----------------------------------------------------+ - | u64 | ID table | The byte offset at which the id table starts. | - +------+---------------+-----------------------------------------------------+ - | u64 | Xattr table | The byte offset at which the xattr id table starts. | - +------+---------------+-----------------------------------------------------+ - | u64 | Inode table | The byte offset at which the inode table starts. | - +------+---------------+-----------------------------------------------------+ - | u64 | Dir. table | The byte offset at which the directory table starts.| - +------+---------------+-----------------------------------------------------+ - | u64 | Frag table | The byte offset at which the fragment table starts. | - +------+---------------+-----------------------------------------------------+ - | u64 | Export table | The byte offset at which the export table starts. | - +------+---------------+-----------------------------------------------------+ - - The Xattr table, fragment table and export table are optional. If they are - omitted from the archive, the respective fields indicating their position - must be set to 0xFFFFFFFFFFFFFFFF (i.e. all bits set). - - Most of the flags only serve an informational purpose and are only useful - when editing the archive to convey the original packer settings. - - The only flag that actually carries information is the "Compressor options are - present" flag. In fact, this is the only flag that the Linux kernel - implementation actually tests for. - - The compressor options, however, are also only there for informal purpose, as - most compression libraries understand their own stream format irregardless of - the options used to compress and in fact don't provide any options for the - decompressor. In the Linux kernel, the XZ decompressor is currently the only - one that processes those options to pre-allocate the LZMA dictionary if a - non-default size was used. - - - 3.1) Compression Options - - If the compressor options flag is set in the superblock, the superblock is - immediately followed by a single metadata block, which is always uncompressed. - - The data stored in this block is compressor dependent. - - There are two special cases: - - For LZ4, the compressor options always have to be present. - - The LZMA compressor does not support compressor options, so this section - must never be present. - - For the compressors currently implemented, a 4 to 8 byte payload follows. - - The following sub sections outline the contents for each compressor that - supports options. The default values if the options are missing are outlined - as well. - - - 3.1.1) GZIP - - +======+===================+=================================================+ - | Type | Name | Description | - +======+===================+=================================================+ - | u32 | compression level | In the range 1 to 9 (inclusive). Defaults to 9. | - +------+-------------------+-------------------------------------------------+ - | u16 | window size | In the range 8 to 15 (inclusive) Defaults to 15.| - +------+-------------------+-------------------------------------------------+ - | u16 | strategies | A bit field describing the enabled strategies. | - | | | If no flags are set, the default strategy is | - | | | implicitly used. Please consult the ZLIB manual | - | | | for details on specific strategies. | - | | | | - | | +--------+----------------------------------------+ - | | | Value | Comment | - | | +--------+----------------------------------------+ - | | | 0x0001 | Default strategy. | - | | | 0x0002 | Filtered. | - | | | 0x0004 | Huffman Only. | - | | | 0x0008 | Run Length Encoded. | - | | | 0x0010 | Fixed. | - +------+-------------------+--------+----------------------------------------+ - - Note: The SquashFS writer typically tries all selected strategies (including - not setting any and letting zlib work with defaults) and stores the result - with the smallest size. - - - 3.1.2) XZ - - +======+===================+=================================================+ - | Type | Name | Description | - +======+===================+=================================================+ - | u32 | dictionary size | SHOULD be >= 8KiB, and must be either a power of| - | | | 2, or the sum of two consecutive powers of 2. | - +------+-------------------+-------------------------------------------------+ - | u32 | Filters | A bit field describing the additional enabled | - | | | filters attempted to better compress executable | - | | | code. | - | | | | - | | +--------+----------------------------------------+ - | | | Value | Comment | - | | +--------+----------------------------------------+ - | | | 0x0001 | x86 | - | | | 0x0002 | PowerPC | - | | | 0x0004 | IA64 | - | | | 0x0008 | ARM | - | | | 0x0010 | ARM thumb | - | | | 0x0020 | SPARC | - +------+-------------------+--------+----------------------------------------+ - - Note: A SquashFS writer typically tries all selected VLI filters (including - not setting any and letting libxz work with defaults) and stores the resulting - block that has the smallest size. - - Also note that further options, such as XZ presets, are not included. The - compressor typically uses the libxz defaults, i.e. level 6 and not using the - extreme flag. Likewise for lc, lp and pb (defaults are 3, 0 and 2 - respectively). - - If the encoder chooses to change those values, the decoder will still be - able to read the data, but there is currently no way to convey that those - values were changed. - - This is specifically problematic for the compression level, since increasing - the level can result in drastically increasing the decoders memory consumption. - - - 3.1.3) LZ4 - - +======+===================+=================================================+ - | Type | Name | Description | - +======+===================+=================================================+ - | u32 | Version | MUST be set to 1. | - +------+-------------------+-------------------------------------------------+ - | u32 | Flags | A bit field describing the enabled LZ4 flags. | - | | | There is currently only one possible flag: | - | | | | - | | +--------+----------------------------------------+ - | | | Value | Comment | - | | +--------+----------------------------------------+ - | | | 0x0001 | Use LZ4 High Compression(HC) mode. | - +------+-------------------+--------+----------------------------------------+ - - 3.1.4) ZSTD - - +======+===================+=================================================+ - | Type | Name | Description | - +======+===================+=================================================+ - | u32 | compression level | Should be in range 1 to 22 (inclusive). The real| - | | | maximum is the zstd defined ZSTD_maxCLevel(). | - | | | | - | | | The default value is 15. | - +------+-------------------+-------------------------------------------------+ - - 3.1.5) LZO - - +======+===================+=================================================+ - | Type | Name | Description | - +======+===================+=================================================+ - | u32 | algorithm | Which variant of LZO to use. | - | | | | - | | +--------+----------------------------------------+ - | | | Value | Comment | - | | +--------+----------------------------------------+ - | | | 0 | lzo1x_1 | - | | | 1 | lzo1x_1_11 | - | | | 2 | lzo1x_1_12 | - | | | 3 | lzo1x_1_15 | - | | | 4 | lzo1x_999 (default) | - +------+-------------------+--------+----------------------------------------+ - | u32 | compression level | For lzo1x_999, this can be a value between 0 | - | | | and 9 inclusive (defaults to 8). MUST be 0 | - | | | for all other algorithms. | - +------+-------------------+-------------------------------------------------+ - - - - 4) Data and Fragment Blocks - *************************** - - As outlined in 2.1, file data is packed by dividing the input files into fixed - size chunks (the block size from the super block) that are stored in sequence. - - The picture below tries to illustrate this concept: - - - _____ _____ _____ _ _____ _____ _ _ - File A: |__A__|__A__|__A__|A| File B: |__B__|__B__|B| File C: |C| - | | | | | | | | - | +---+ | | | | | | - | | +------+ | | | | | - | | | | | | | | - | | | +------|---------------+ | | | - | | | | +--|---------------------+ | | - | | | | | | | | - | | | | | +-----------------------+ | +------------+ - | | | | | | | | - V V V V V V V V - __ _ ___ ___ ___ __ Fragment block: |A|B|C| - Output: |_A|A|_A_|_B_|_B_|_F| | - __V__ - A |__F__| - | | - +------------------------+ - - Figure 2.1: Packing of File Data. - - - In Figure 1, file A consists of 3 blocks and a single tail end, file B has - 2 blocks and one tail end while file C is smaller than block size. - - For each file, the blocks are individually compressed and stored on disk - in order. - - The tail ends of A and B, together with the entire contents of C are packed - together into a fragment block F, that is compressed and stored on disk once - it is full. - - This tail-end-packing is completely optional. The tail ends (or in case of C - the entire file) can also be treated as truncated blocks that expand to less - than block size when uncompressed. - - - There are no headers in front of data or fragment blocks and there MUST NOT be - any gaps between data blocks from a single file, but a SquashFS packer is free - to leave gaps between two different files or fragment blocks. The packer is - also free to decide how to arrange fragments within a fragment block and what - fragments to pack together. - - - - To locate file data, the file inodes store the following information: - - - The uncompressed size of the file. From this, the number of blocks can - be computed: - - block_count = floor(file_size / block_size) if tail end packing is used - block_count = ceil(file_size / block_size) otherwise - - - The exact location of the first block, if one exists. - - For each consecutive block, the on-disk size. - - A 32 bit integer is used with bit 24 (i.e. 1 << 24) set if the block - is stored uncompressed. - - - If tail-end-packing was done, the location of the fragment block and a - byte offset into the uncompressed fragment block. The size of the tail - end can be computed easily: - - tail_end_size = file_size % block_size - - - Since a fragment block will likely be referred to by multiple files, inodes - don't store its on-disk location and size directly, but instead use a 32 bit - index into a fragment block lookup table (see section 7). - - - If a data block other than the last one unpacks to less than block size, the - rest of the buffer is filled with 0 bytes. This way, sparse files are - implemented. Specifically if a block has an on-disk size of 0 this translates - to an entire block filled with 0 bytes without having to retrieve any data - from disk. - - - The on-disk locations of file blocks MAY overlap and different file inodes are - free to refer to the same fragment. Typical SquashFS packers would explicitly - use this to for files that are duplicates of others. Doing so is NOT counted - as a hard link. - - If an inode references on-disk locations outside the data area, the result is - undefined. - - - 5) Inode Table - ************** - - Inodes are packed into metadata blocks and are not aligned, i.e. they can span - the boundary between metadata blocks. To save space, there are different - inodes for each type (regular file, directory, device, etc.) of varying - contents and size. - - To further save more space, inodes come in two flavors: simple inode types - optimized for a simple, standard use case, and extended inode types where - extra information has to be stored. - - SquashFS more or less supports 32 bit UIDs and GIDs. As an optimization, those - IDs are stored in a lookup table (see section 9) and the inodes themselves - hold a 16 bit index into this table. This allows to 32 bit UIDs/GIDs, but only - among 2^16 unique values. - - - The location of the first metadata block is indicated by the inode table start - in the superblock. The inode table ends at the start of the directory table. - - - 5.1) Common Inode Header - - All Inodes share a common header, which contains some common information, - as well as describing the type of Inode which follows. This header has the - following structure: - - +======+==============+======================================================+ - | Type | Name | Description | - +======+==============+======================================================+ - | u16 | type | The type of item described by the inode which follows| - | | | this header. | - | | | | - | | +-------+----------------------------------------------+ - | | | Value | Comment | - | | +-------+----------------------------------------------+ - | | | 1 | Basic Directory | - | | | 2 | Basic File | - | | | 3 | Basic Symlink | - | | | 4 | Basic Block Device | - | | | 5 | Basic Character Device | - | | | 6 | Basic Named Pipe (FIFO) | - | | | 7 | Basic Socket | - | | | 8 | Extended Directory | - | | | 9 | Extended File | - | | | 10 | Extended Symlink | - | | | 11 | Extended Block Device | - | | | 12 | Extended Character Device | - | | | 13 | Extended Named Pipe (FIFO) | - | | | 14 | Extended Socket | - +------+--------------+-------+----------------------------------------------+ - | u16 | permissions | A bit mask representing Unix file system permissions | - | | | for the inode. This only stores permissions, not the | - | | | type. The type is reconstructed from the field above.| - +------+--------------+------------------------------------------------------+ - | u16 | uid | An index into the ID table, giving the user ID | - | | | of the owner. | - +------+--------------+------------------------------------------------------+ - | u16 | gid | An index into the ID table, giving the group ID | - | | | of the owner. | - +------+--------------+------------------------------------------------------+ - | u32 | mtime | The unsigned number of seconds (not counting leap | - | | | seconds) since 00:00, Jan 1st, 1970 UTC when the item| - | | | described by the inode was last modified. | - +------+--------------+------------------------------------------------------+ - | u32 | inode number | Unique node number. Must be at least 1 and at most | - | | | the inode count from the super block. | - +------+--------------+------------------------------------------------------+ - - Depending on the type, additional data follows, outlined in sections 5.2 - to 5.6. - - - - 5.2) Directory inodes - - Directory inodes mainly contain a reference into the directory table where - the listing of entries is stored. - - A basic directory has an entry listing of at most 64k (uncompressed) and - no extended attributes. The layout of the inode data is as follows: - - +======+==============+======================================================+ - | Type | Name | Description | - +======+==============+======================================================+ - | u32 | block index | The location of the metadata block in the directory | - | | | table where the entry information starts. This is | - | | | relative to the directory table location. | - +------+--------------+------------------------------------------------------+ - | u32 | link count | The number of hard links to this directory. | - +------+--------------+------------------------------------------------------+ - | u16 | file size | Total (uncompressed) size in bytes of the entry | - | | | listing in the directory table, including headers. | - | | | | - | | | This value is 3 bytes larger than the real listing. | - | | | The Linux kernel creates "." and ".." entries for | - | | | offsets 0 and 1, and only after 3 looks into the | - | | | listing, subtracting 3 from the size. | - +------+--------------+------------------------------------------------------+ - | u16 | block offset | The (uncompressed) offset within the metadata block | - | | | in the directory table where the directory listing | - | | | starts. | - +------+--------------+------------------------------------------------------+ - | u32 | parent inode | The inode number of the parent of this directory. If | - | | | this is the root directory, this SHOULD be 0. | - +------+--------------+------------------------------------------------------+ - - - Note that for historical reasons, the hard link count of a directory includes - the number of entries in the directory and is initialized to 2 for an empty - directory. I.e. a directory with N entries has at least N + 2 link count. - - - If the "file size" is set to a value < 4, the directory is empty and there is - no corresponding listing in the directory table. - - - An extended directory can have a listing that is at most 4GiB in size, may - have extended attributes and can have an optional index for faster lookup: - - +======+==============+======================================================+ - | Type | Name | Description | - +======+==============+======================================================+ - | u32 | link count | Same as above. | - +------+--------------+------------------------------------------------------+ - | u32 | file size | Same as above. | - +------+--------------+------------------------------------------------------+ - | u32 | block index | Same as above. | - +------+--------------+------------------------------------------------------+ - | u32 | parent inode | Same as above. | - +------+--------------+------------------------------------------------------+ - | u16 | index count | The number of directory index entries following the | - | | | inode structure. | - +------+--------------+------------------------------------------------------+ - | u16 | block offset | Same as above. | - +------+--------------+------------------------------------------------------+ - | u32 | xattr index | An index into the xattr lookup table or 0xFFFFFFFF | - | | | if the inode has no extended attributes. | - +------+--------------+------------------------------------------------------+ - - The index follows directly after the inode. See section 6.1 for details on - how the directory index is structured. - - - 5.3) File Inodes - - Basic files can be at most 4 GiB in size (uncompressed), must be located - within the first 4 GiB of the SquashFS image, cannot have any extended - attributes and don't support hard-link or sparse file accounting: - - +======+==============+======================================================+ - | Type | Name | Description | - +======+==============+======================================================+ - | u32 | blocks start | The offset from the start of the archive to the first| - | | | data block. | - +------+--------------+------------------------------------------------------+ - | u32 | frag index | An index into the fragment table which describes the | - | | | fragment block that the tail end of this file is | - | | | stored in. If not used, this is set to 0xFFFFFFFF. | - +------+--------------+------------------------------------------------------+ - | u32 | block offset | The (uncompressed) offset within the fragment block | - | | | where the tail end of this file is. See section 3 | - | | | for details. | - +------+--------------+------------------------------------------------------+ - | u32 | file size | The (uncompressed) size of this file. | - +------+--------------+------------------------------------------------------+ - | u32[]| block sizes | An array of consecutive block sizes. See section 3 | - | | | for details. | - +------+--------------+------------------------------------------------------+ - - If 'frag index' is set to 0xFFFFFFFF, the number of blocks is computed as - - ceil(file_size / block_size) - - otherwise, if 'frag index' is a valid fragment index, the block count is - computed as - - floor(file_size / block_size) - - and the size of the tail end is - - file_size % block_size - - - To access a data block, first compute the block index as - - index = floor(offset / block_size) - - then compute the on-disk location of the block by summing up the sizes of the - blocks that come before it: - - location = block_start - - for i = 0; i < index; i++ - location += block_sizes[i] & 0x00FFFFFF - - - The tail end, if present, is accessed by resolving the fragment index through - the fragment lookup table (see section 7), loading the fragment block and - using the given 'block offset' into the fragment block. - - - - Extended files have a 64 bit location and size, have additional counters for - sparse file accounting and hard links, and can have extended attributes: - - - +======+==============+======================================================+ - | Type | Name | Description | - +======+==============+======================================================+ - | u64 | blocks start | Same as above (but larger). | - +------+--------------+------------------------------------------------------+ - | u64 | file size | Same as above (but larger). | - +------+--------------+------------------------------------------------------+ - | u64 | sparse | The number of bytes saved by omitting zero bytes. | - | | | Used in the kernel for sparse file accounting. | - +------+--------------+------------------------------------------------------+ - | u32 | link count | The number of hard links to this node. | - +------+--------------+------------------------------------------------------+ - | u32 | frag index | Same as above. | - +------+--------------+------------------------------------------------------+ - | u32 | block offset | Same as above. | - +------+--------------+------------------------------------------------------+ - | u32 | xattr index | An index into the xattr lookup table or 0xFFFFFFFF | - | | | if the inode has no extended attributes. | - +------+--------------+------------------------------------------------------+ - | u32[]| block sizes | Same as above. | - +------+--------------+------------------------------------------------------+ - - - 5.4) Symbolic Links - - Symbolic links mainly have a target path stored directly after the inode - header, as well as a hard-link counter (yes, you can have hard links to - symlinks): - - +======+==============+======================================================+ - | Type | Name | Description | - +======+==============+======================================================+ - | u32 | link count | The number of hard links to this symlink. | - +------+--------------+------------------------------------------------------+ - | u32 | target size | The size in bytes of the target path this symlink | - | | | points to. | - +------+--------------+------------------------------------------------------+ - | u8[] | target path | An array of bytes holding the target path this | - | | | symlink points to. The path is 'target size' bytes | - | | | long and NOT null-terminated. | - +------+--------------+------------------------------------------------------+ - - The extended symlink type adds an additional extended attribute index: - - +======+==============+=======================================+ - | Type | Name | Description | - +======+==============+=======================================+ - | u32 | link count | Same as above. | - +------+--------------+---------------------------------------+ - | u32 | target size | Same as above. | - +------+--------------+---------------------------------------+ - | u8[] | target path | Same as above. | - +------+--------------+---------------------------------------+ - | u32 | xattr index | An index into the xattr lookup table. | - +------+--------------+---------------------------------------+ - - - 5.5) Device Special Files - - Basic device special files only store a hard-link counter and a device number. - The layout is identical for both character and block devices: - - +======+===============+=====================================================+ - | Type | Name | Description | - +======+===============+=====================================================+ - | u32 | link count | The number of hard links to this entry. | - +------+---------------+-----------------------------------------------------+ - | u32 | device number | The system specific device number. | - | | | | - | | | On Linux, this consists of major and minor device | - | | | numbers that can be extracted as follows: | - | | | major = (dev & 0xFFF00) >> 8. | - | | | minor = (dev & 0x000FF) | ((dev >> 12) & 0xFFF00) | - +------+---------------+-----------------------------------------------------+ - - The extended device file inode adds an additional extended attribute index: - - +======+===============+=========================================+ - | Type | Name | Description | - +======+===============+=========================================+ - | u32 | link count | Same as above. | - +------+---------------+-----------------------------------------+ - | u32 | device number | Same as above. | - +------+---------------+-----------------------------------------+ - | u32 | xattr index | An index into the xattr lookup table. | - +------+---------------+-----------------------------------------+ - - - 5.6) IPC inodes (FIFO or Socket) - - Named pipe (FIFO) and socket special files only add a hard-link counter - after the inode header: - - +======+=============+=========================================+ - | Type | Name | Description | - +======+=============+=========================================+ - | u32 | link count | The number of hard links to this entry. | - +------+-------------+-----------------------------------------+ - - The extended versions add an additional extended attribute index: - - +======+=============+=========================================+ - | Type | Name | Description | - +======+=============+=========================================+ - | u32 | link count | Same as above. | - +------+-------------+-----------------------------------------+ - | u32 | xattr index | An index into the xattr lookup table. | - +------+-------------+-----------------------------------------+ - - - - 6) Directory Table - ****************** - - For each directory inode, the directory table stores a linear list of all - entries, with references back to the inodes that describe those entries. - - The entry list itself is sorted ASCIIbetically by entry name and split into - multiple runs, each preceded by a short header. - - The directory inodes store the total, uncompressed size of the entire listing, - including headers. Using this size, a SquashFS reader can determine if another - header with further entries should be following once it reaches the end of a - run. - - To save space, the header indicates a metadata block and a reference inode - number. The entries that follow simply store a difference to that inode number - and an offset into the specified metadata block. - - Every time, the inode block changes or the difference of the inode number - to the reference in the header cannot be encoded in 16 bits anymore, a new - header is emitted. - - A header must be followed by AT MOST 256 entries. If there are more entries, - a new header MUST be emitted. - - Typically, inode allocation strategies would sort the children of a directory - and then allocate inode numbers incrementally, to optimize directory entry - listings. - - Since hard links might be further further away than +/- 32k of the reference - number, they might require a new header to be emitted. Inode number allocation - and picking of the reference could of course be optimized to prevent this. - - The directory header has the following structure: - - +======+==============+======================================================+ - | Type | Name | Description | - +======+==============+======================================================+ - | u32 | count | Number of entries following the header | - +------+--------------+------------------------------------------------------+ - | u32 | start | The location of the metadata block in the inode table| - | | | where the inodes are stored. This is relative to the | - | | | inode table start from the super block. | - +------+--------------+------------------------------------------------------+ - | s32 | inode number | An arbitrary inode number. The entries that follow | - | | | store their inode number as a difference to this. | - +======+==============+======================================================+ - - The counter is stored off-by-one, i.e. a value of 0 indicates 1 entry follows. - This also makes it impossible to encode a size of 0, which wouldn't make any - sense. Empty directories simply have their size set to 0 in the inode instead, - so no extra dummy header has to be stored or looked up. - - - The header is followed by multiple entries that each have this structure: - - +======+==============+======================================================+ - | Type | Name | Description | - +======+==============+======================================================+ - | u16 | offset | An offset into the uncompressed inode metadata block.| - +------+--------------+------------------------------------------------------+ - | s16 | inode offset | The difference of this inodes number to the reference| - | | | stored in the header. | - +------+--------------+------------------------------------------------------+ - | u16 | type | The inode type. For extended inodes, the basic type | - | | | is stored here instead. | - +------+--------------+------------------------------------------------------+ - | u16 | name size | One less than the size of the entry name. | - +------+--------------+------------------------------------------------------+ - | u8[] | name | The file name of the entry without a trailing null | - | | | byte. Has 'name size' + 1 bytes. | - +------+--------------+------------------------------------------------------+ - - In the entry structure itself, the file names are stored without trailing null - bytes. Since a zero length name makes no sense, the name length is stored - off-by-one, i.e. the value 0 cannot be encoded. - - The inode type is stored in the entry, but always as the corresponding - basic type. - - While the field is technically 16 bits, the kernel implementation currently - imposes an arbitrary limit of 255 on the name size field. Since the field is - off-by-one, this means that a file name in SquashFS can be at most 256 - characters long. - - - 6.1) Directory Index - - To speed up lookups on directories with lots of entries, the extended - directory inode can store an index, holding the locations of all directory - headers and the name of the first entry after the header. - - When searching for an entry, the reader can then iterate over the index to - find a range of metadata blocks that should contain a given entry and then - only scan over the given range. - - To allow for even faster lookups, a new header should be emitted every time - the entry list crosses a metadata block boundary. This narrows the boundary - down to a single metadata block lookup in most cases. - - The index entries have the following structure: - - +======+==============+======================================================+ - | Type | Name | Description | - +======+==============+======================================================+ - | u32 | index | This stores a byte offset from the first directory | - | | | header to the current header, as if the uncompressed | - | | | directory metadata blocks were laid out in memory | - | | | consecutively. | - +------+--------------+------------------------------------------------------+ - | u32 | start | Start offset of a directory table metadata block, | - | | | relative to the directory table start. | - +------+--------------+------------------------------------------------------+ - | u32 | name size | One less than the size of the entry name. | - +------+--------------+------------------------------------------------------+ - | u8[] | name | The name of the first entry following the header | - | | | without a trailing null byte. | - +------+--------------+------------------------------------------------------+ - - - 7) Fragment Table - ***************** - - Tail-ends and smaller than block size files can be combined into fragment - blocks that are at most 'block size' bytes long. - - The fragment table describes the location and size of the fragment blocks - (not the tail-ends within them). - - - This is a lookup table which stores entries of the following shape: - - +======+==============+======================================================+ - | Type | Name | Description | - +======+==============+======================================================+ - | u64 | start | The offset within the archive where the fragment | - | | | block starts. | - +------+--------------+------------------------------------------------------+ - | u32 | size | The on-disk size of the fragment block. If the block | - | | | is uncompressed, bit 24 (i.e. 1 << 24) is set. | - +------+--------------+------------------------------------------------------+ - | u32 | unused | SHOULD be set to 0. | - +------+--------------+------------------------------------------------------+ - - - The table is stored on-disk as described in section 2.3. - - The fragment table location in the superblock points to an array of 64 bit - integers that store the on-disk locations of the metadata blocks containing - the lookup table. - - Each metadata block can store up to 512 entries (= 8129 / 16). - - The "unused" field is there for alignment and SHOULD be set to 0, however the - Linux kernel currently ignores this field completely, making it impossible for - Linux to ever re-purpose this field. - - - 8) Export Table - *************** - - To support NFS exports, SquashFS needs a fast way to resolve an inode number - to an inode structure. - - For this purpose, a SquashFS archive can optionally contain an export table, - which is basically a flat array of 64 bit inode references, with the inode - number being used as an index into the array. - - Because the inode number 0 is not used (reserved as a sentinel value), the - array actually starts at inode number 1 and the index is thus - inode_number - 1. - - The array itself is stored in a series of metadata blocks, as outlined in - section 2.3. - - Since each block can store 1024 references (= 8192 / 8), there will be - ceil(inode_count / 1024) metadata blocks for the entire array. - - - 9) ID Table - *********** - - As outlined in section 5.1, SquashFS supports 32 bit user and group IDs. To - compact the inode table, the unique UID/GID values are collected in a lookup - table and a 16 bit table index is stored in the inode instead. - - This lookup table is stored as outlined in section 2.3. - - Each metadata block can store up to 2048 IDs (=8192 / 4). - - - 10) Extended Attribute Table - **************************** - - Extended attributes are arbitrary key value pairs attached to inodes. The key - names use dots as separators to create a hierarchy of name spaces. - - The key value pairs of all inodes are stored consecutively in a series of - metadata blocks. - - The values can either be stored inline, i.e. a key entry is directly followed - by a value, or out-of-line to deduplicate identical values and use a reference - instead. Typically, the first occurrence of a value is stored in line and - every consecutive use of the same value uses an out-of-line reference back to - the first one. - - - The keys are stored using the following data structure: - - +======+===========+=========================================================+ - | Type | Name | Description | - +======+===========+=========================================================+ - | u16 | type | A prefix ID for the key name. If the value that follows | - | | | is stored out-of-line, the flag 0x0100 is ORed to the | - | | | type ID. | - | | | | - | | +-------+-------------------------------------------------+ - | | | Value | Comment | - | | +-------+-------------------------------------------------+ - | | | 0 | Prefix the name with "user." | - | | | 1 | Prefix the name with "trusted." | - | | | 2 | Prefix the name with "security." | - +------+-----------+-------+-------------------------------------------------+ - | u16 | name size | The number of key bytes that follows. | - +------+-----------+---------------------------------------------------------+ - | u8[] | name | The remainder of the key without the prefix and without | - | | | trailing null byte. | - +------+-----------+---------------------------------------------------------+ - - - Following the key, this structure is used to store the value: - - +======+============+========================================================+ - | Type | Name | Description | - +======+============+========================================================+ - | u32 | value size | The size of the value string. If the value is stored | - | | | out of line, this is always 8, i.e. the size of an | - | | | unsigned 64 bit integer. | - +------+------------+--------------------------------------------------------+ - | u8[] | value | This is 'value size' bytes of arbitrary binary data. | - | | | If the value is stored out-of-line, this is a 64 bit | - | | | reference, i.e. a location of a metadata block, | - | | | shifted left by 16 and OR-ed with an offset into the | - | | | uncompressed block, giving the location of another | - | | | value structure. | - +------+------------+--------------------------------------------------------+ - - - The metadata block location given by an out-of-line reference is relative to - the location of the first block. - - - To actually address a block of key value pairs associated with an inode, a - lookup table is used that specifies the start and size of a sequence of key - value pairs. - - All an inode needs to store is a 32 bit index into this table. If two inodes - have an identical attribute sets, the key/value sequence is only written once, - there is only one lookup table entry and both inodes have the same index. - - Each lookup table entry has the following structure: - - +======+============+========================================================+ - | Type | Name | Description | - +======+============+========================================================+ - | u64 | xattr ref | A reference to the start of the key value block, i.e. | - | | | the metadata block location shifted left by 16, OR-ed | - | | | with am offset into the uncompressed block. | - +------+------------+--------------------------------------------------------+ - | u32 | count | The number of key value pairs. | - +------+------------+--------------------------------------------------------+ - | u32 | size | The exact, uncompressed size in bytes of the entire | - | | | block of key value pairs, counting what has been | - | | | written to disk and including the key/value entry | - | | | structures. | - +------+------------+--------------------------------------------------------+ - - This lookup table is stored as outlined in section 2.3. - - Each metadata block can hold 512 (= 8192 / 16) entries. - - However, in contrast to section 2.3, additional data is given before the list - of metdata block locations, to locate the key-value pairs, as well as the - actual number of lookup table entries that are not specified in the super - block. - - - The 'Xattr table' entry in the superblock gives the absolute location of the - following data structure which is stored on-disk as is, uncompressed: - - +=======+===========+========================================================+ - | Type | Name | Description | - +=======+===========+========================================================+ - | u64 | kv start | The absolute position of the first metadata block | - | | | holding the key/value pairs. | - +-------+-----------+--------------------------------------------------------+ - | u32 | count | The number of entries in the lookup table. | - +-------+-----------+--------------------------------------------------------+ - | u32 | unused | SHOULD be set to 0, however Linux currently ignores | - | | | this field completely and squashfs-tools used to leak | - | | | stack data here, making it impossible for Linux to | - | | | ever re-purpose this field. | - +-------+-----------+--------------------------------------------------------+ - | u64[] | locations | An array holding the absolute on-disk location of each | - | | | metadata block of the lookup table. | - +-------+-----------+--------------------------------------------------------+ - - If an inode has a a valid xattr index (i.e. not 0xFFFFFFFF), the metadata - block index is computed as - - block_idx = floor(index / 512) - - which is then used to retrieve the metadata block index from the locations - array. - - Once the block has been read from disk and uncompressed, the byte offset into - the metadata block can be computed as - - offset = (index * 16) % 8192 - - From this position, the structure can be read that holds a reference to the - metadata block that contains the key/value pairs (and byte offset into the - uncompressed block where the pairs start), as well as the number of key/value - pairs and their total, uncompressed size. - |