author | Zachary Dremann <dremann@gmail.com> | 2021-08-01 11:50:38 -0400 |
---|---|---|
committer | David Oberhollenzer <david.oberhollenzer@sigma-star.at> | 2021-08-12 18:24:12 +0200 |
commit | 23b81d2554973fed261f51bfe878a9d983caf53d | |
tree | 29df2ac99aa8ec38dd785334c411414f05cb4b4f | |
parent | 915808aaf0660b03fc2e3fa5e95a2f2e2aaa6daf |
Replace format.txt with an asciidoc version
-rw-r--r-- | doc/format.adoc | 1067 |
-rw-r--r-- | doc/format.txt | 1216 |
2 files changed, 1067 insertions, 1216 deletions
diff --git a/doc/format.adoc b/doc/format.adoc new file mode 100644 index 0000000..f81b6f8 --- /dev/null +++ b/doc/format.adoc @@ -0,0 +1,1067 @@ += Squashfs Binary Format +:toc: left +:toclevels: 4 +:sectnums: + +== About + +SquashFS is a compressed, read-only filesystem for Linux that can also be used +as a flexible, general purpose, compressed archive format, optimized for fast +random access with support for Unix permissions, sparse files and extended +attributes. + +SquashFS supports data and metadata compression through zlib, lz4, lzo, lzma, +xz or zstd. + +For fast random access, compressed files are split up in fixed size blocks +that are compressed separately. +The block size can be set between 4k and 1M (default for squashfs-tools and +squashfs-tools-ng is 128K). + +This document attempts to specify the on-disk format in detail. +It is based on a previous on-line version that was originally written by +Zachary Dremann and subsequently expanded by David Oberhollenzer during +reverse engineering attempts and available here: https://dr-emann.github.io/squashfs/. + +== Overview + +SquashFS always stores integers in little endian format. +The data blocks that make up the SquashFS archive are byte aligned, +i.e. they typically do not care for alignment. +The implementation in the Linux kernel requires the archive itself to +be a multiple of either 1k or 4k in size (called the device block size) +and user space tools typically use 4k to be compatible with both. + +A SquashFS archive consists of a maximum of nine parts: + +[%nowrap] +---- + _______________ +| | Important information about the archive, including +| Superblock | locations of other sections. +|_______________| +| | If non-default compression options have been used, +| Compression | they can optionally be stored here, to facilitate +| options | later, offline editing of the archive. 
+|_______________| +| | +| Data blocks | The contents of the files in the archive, +| & fragments | split into separately compressed blocks. +|_______________| +| | Metadata (ownership, permissions, etc) for +| Inode table | items in the archive. +|_______________| +| | +| Directory | Directory listings, including file names, and +| table | references to inodes. +|_______________| +| | +| Fragment | Description of fragment locations within the +| table | Datablocks & Fragments section. +|_______________| +| | A mapping from inode numbers to disk locations, +| Export table | required for NFS export. +|_______________| +| | +| UID/GID | A list of unique UID/GIDs. Inodes use an index into +| lookup table | this table to save memory. +|_______________| +| | +| Xattr | Extended attributes for items in the archive. +| table | +|_______________| +---- + +Although the super block details the exact positions of each section, most +implementations, including the one in the Linux kernel, insist on this exact +order. + +=== Packing File Data + +The file data is packed into the archive after the super block (and optional +compressor options). + +Files are divided into fixed size blocks that are separately compressed and +stored in order. SquashFS supports optional tail-end-packing of files that +are not an exact multiple of the block size. The remaining ends can either +be treated as a short block, or can be packed together with the tail ends of +other files in a single "fragment block". Files that are less than block size +are treated the same way. + +If the size of a data or fragment block would exceed the input size after +compression, the original, uncompressed data is stored, so that the size of a +block after compression never exceeds the input block size. + +=== Packing Metadata + +Metadata (e.g. inodes, directory listings, etc...) 
is treated as a continuous +stream of records that is chopped up into 8KiB blocks that are separately +compressed into special metadata blocks. + +The input size of 8KiB is fixed and independent of the data block size. +Similar to data blocks, if the compressed size would exceed 8KiB, the +uncompressed block is stored instead, so the on-disk size of a metadata +block never exceeds 8KiB. + +Individual entries are allowed to cross the block boundary, so e.g. an inode +may be located at the end of a metadata block with some part of it located at +the start of the next block. Both have to be read and decompressed when +reading this inode. If an entry is written across block boundaries, there +*MUST NOT* be any gap between the compressed metadata blocks on-disk. + + +In contrast to data blocks, every metadata block is preceded by a single, +16 bit unsigned integer. This integer holds the on-disk size of the block +that follows. The MSB is set if the block is stored uncompressed. Whenever +a metadata block is referenced, the position of this integer is given. + +To read a metadata block, seek to the indicated position and read the 16 bit +header. Sanity check that the lower 15 bit are less than 8KiB and proceed +to read that many bytes. If the highest bit of the header is cleared, +uncompress the data into an 8KiB buffer that *MUST NOT* overflow. + + +In the SquashFS archive format, metadata entries (e.g. inodes) are often +referenced using a 64 bit integer. The lower 16 bit hold an offset into the +uncompressed block and the upper 48 bit point to the on-disk location of the +block. + +The on-disk location is relative to the type of metadata entry, e.g. for +inodes it is relative to the start of the inode table given by the +super block. + +=== Storing Lookup Tables + +Lookup tables are arrays (i.e. sequences of identical sized records) that are +addressed by an index. + +Such tables are stored in the SquashFS format as metadata blocks, i.e. 
by +dividing the table data into 8KiB chunks that are separately compressed and +stored in sequence. + +To allow constant time lookup, a list of 64 bit unsigned integers is stored, +holding the on-disk locations of each metadata block. + +This list itself is stored uncompressed and not preceded by a header. + +When referring to a lookup table, the superblock gives the number of table +entries and points to this location list. + +Since the table entry size is a known, fixed value, the required number of +metadata blocks can be computed: + + block_count = ceil(table_count * entry_size / 8192) + +Which is also the number of 64 bit integers in the location list. + +When resolving a lookup table index, first work out the index of the +metadata block: + + meta_index = floor(index * entry_size / 8192) + +Using this index on the location list yields the on-disk location of +the metadata block containing the entry. + +After reading this metadata block, the byte offset into the block can +be computed to get the entry: + + offset = index * entry_size % 8192 + +The location list can be cached in memory. Resolving an index requires at +worst a single metadata block read (at most 8194 bytes fetched from an +unaligned on-disk location). + +=== Supported Compressors + +The SquashFS format supports the following compressors: + +* zlib deflate (referred to as "gzip" but only uses raw zlib streams) +* lzo +* lzma 1 (considered deprecated) +* lzma 2 (referred to as "xz") +* lz4 +* zstd + +The archive can only specify one compressor in the super block and has to use +it for both file data and metadata compression. Using one compressor for data +and switching to a different compressor for e.g. inodes is not supported. + +While it is technically not possible to pick a "null" compressor in the super +block, an implementation can still deliberately write only uncompressed blocks +to a SquashFS archive, or choose to store certain metadata blocks without +compression. 
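Because a writer may store any individual block uncompressed, a reader cannot rely on the compressor ID alone and has to check the flag in each block header. The following is a minimal sketch of the metadata block framing described under "Packing Metadata", assuming the archive uses the zlib compressor; the function name `read_metadata_block` is hypothetical and not part of any existing tool.

```python
import struct
import zlib

META_BLOCK_SIZE = 8192  # fixed uncompressed size of a metadata block


def read_metadata_block(image: bytes, pos: int) -> tuple:
    """Return (uncompressed payload, position of the next block header)."""
    (header,) = struct.unpack_from("<H", image, pos)  # little endian u16
    stored_uncompressed = bool(header & 0x8000)       # MSB set: stored as-is
    size = header & 0x7FFF                            # lower 15 bits: on-disk size
    if size > META_BLOCK_SIZE:
        raise ValueError("metadata block larger than 8 KiB")

    start = pos + 2
    raw = image[start:start + size]
    if len(raw) != size:
        raise ValueError("truncated metadata block")

    if stored_uncompressed:
        data = raw
    else:
        # Assumption: the superblock names the zlib compressor.
        # Ask for at most one byte more than allowed to detect overflow.
        data = zlib.decompressobj().decompress(raw, META_BLOCK_SIZE + 1)
        if len(data) > META_BLOCK_SIZE:
            raise ValueError("metadata block expands beyond 8 KiB")
    return data, start + size
```

The returned position can be fed back in to read the next block when an entry crosses a block boundary.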
+ +The lzma 2 aka xz compressor *MUST* use `CRC32` checksums only. Using `SHA-256` is +not supported. + +== The Superblock + +The superblock is the first section of a SquashFS archive. It is always +96 bytes in size and contains important information about the archive, +including the locations of other sections. + +[cols="1,3,13a",frame="none",grid="none",options="header"] +|=== +| Type | Name | Description +| u32 | magic | Must be set to `0x73717368` ("hsqs" on disk). +| u32 | inode count | The number of inodes stored in the archive. +| u32 | mod time | Last modification time of the archive. Count seconds + since 00:00, Jan 1st 1970 UTC (not counting leap + seconds). This is unsigned, so it expires in the + year 2106 (as opposed to 2038). +| u32 | block size | The size of a data block in bytes. Must be a power + of two between 4096 (4k) and 1048576 (1 MiB). +| u32 | frag count | The number of entries in the fragment table. +| u16 | compressor | An ID designating the compressor used for both data + and meta data blocks. + +[cols=">1,2,8",frame="none",grid="none",options="header"] +!=== +! Value ! Name ! Comment +! 1 ! GZIP ! just zlib streams (no gzip headers\!) +! 2 ! LZO ! +! 3 ! LZMA ! LZMA version 1 +! 4 ! XZ ! LZMA version 2 as used by xz-utils +! 5 ! LZ4 ! +! 6 ! ZSTD ! +!=== + +| u16 | block log | The log~2~ of the block size. If the two fields do not + agree, the archive is considered corrupted. +| u16 | flags | Bit wise *OR* of the flag bits below. + + +[cols=">1m,10",frame="none",grid="none",options="header"] +!=== +! Value ! Meaning +! 0x0001 ! Inodes are stored uncompressed. +! 0x0002 ! Data blocks are stored uncompressed. +! 0x0004 ! Unused, should always be unset. +! 0x0008 ! Fragments are stored uncompressed. +! 0x0010 ! Fragments are not used. +! 0x0020 ! Fragments are always generated. +! 0x0040 ! Data has been deduplicated. +! 0x0080 ! NFS export table exists. +! 0x0100 ! Xattrs are stored uncompressed. +! 0x0200 ! 
There are no Xattrs in the archive. +! 0x0400 ! Compressor options are present. +! 0x0800 ! The ID table is uncompressed. +!=== + +| u16 | id count | The number of entries in the ID lookup table. +| u16 | version major | Major version of the format. Must be set to 4. +| u16 | version minor | Minor version of the format. Must be set to 0. +| u64 | root inode | A reference to the inode of the root directory. +| u64 | bytes used | The number of bytes used by the archive. Because + SquashFS archives must be padded to a multiple of the underlying + device block size, this can be less than the actual file size. +| u64 | ID table | The byte offset at which the id table starts. +| u64 | Xattr table | The byte offset at which the xattr id table starts. +| u64 | Inode table | The byte offset at which the inode table starts. +| u64 | Dir. table | The byte offset at which the directory table starts. +| u64 | Frag table | The byte offset at which the fragment table starts. +| u64 | Export table | The byte offset at which the export table starts. +|=== + + +The Xattr table, fragment table and export table are optional. If they are +omitted from the archive, the respective fields indicating their position +must be set to `0xFFFFFFFFFFFFFFFF` (i.e. all bits set). + +Most of the flags only serve an informational purpose and are only useful +when editing the archive to convey the original packer settings. + +The only flag that actually carries information is the "Compressor options are +present" flag. In fact, this is the only flag that the Linux kernel +implementation actually tests for. + +The compressor options, however, are also only there for informational purposes, as +most compression libraries understand their own stream format regardless of +the options used to compress and in fact don't provide any options for the +decompressor. 
In the Linux kernel, the XZ decompressor is currently the only +one that processes those options to pre-allocate the LZMA dictionary if a +non-default size was used. + +=== Compression Options + +If the compressor options flag is set in the superblock, the superblock is +immediately followed by a single metadata block, which is always uncompressed. + +The data stored in this block is compressor dependent. + +There are two special cases: + +* For LZ4, the compressor options always have to be present. +* The LZMA compressor does not support compressor options, so this section + must never be present. + +For the compressors currently implemented, a 4 to 8 byte payload follows. + +The following sub sections outline the contents for each compressor that +supports options. The default values if the options are missing are outlined +as well. + +==== GZIP + +[cols="1,4,20a",frame="none",grid="none",options="header"] +|=== +| Type | Name | Description +| u32 | compression level | In the range 1 to 9 (inclusive). Defaults to 9. +| u16 | window size | In the range 8 to 15 (inclusive). Defaults to 15. +| u16 | strategies | A bit field describing the enabled strategies. + If no flags are set, the default strategy is + implicitly used. Please consult the ZLIB manual + for details on specific strategies. + +[cols=">1m,10",frame="none",grid="none",options="header"] +!=== +! Value ! Comment +! 0x0001 ! Default strategy. +! 0x0002 ! Filtered. +! 0x0004 ! Huffman Only. +! 0x0008 ! Run Length Encoded. +! 0x0010 ! Fixed. +!=== +|=== + +NOTE: The SquashFS writer typically tries all selected strategies (including +not setting any and letting zlib work with defaults) and stores the result +with the smallest size. + +==== XZ + + +[cols="1,4,20a",frame="none",grid="none",options="header"] +|=== +| Type | Name | Description +| u32 | dictionary size | *SHOULD* be >= 8KiB, and must be either a power of + 2, or the sum of two consecutive powers of 2. 
+| u32 | Filters | A bit field describing the additional enabled + filters attempted to better compress executable + code. + +[cols=">1m,10",frame="none",grid="none",options="header"] +!=== +! Value ! Comment +! 0x0001 ! x86 +! 0x0002 ! PowerPC +! 0x0004 ! IA64 +! 0x0008 ! ARM +! 0x0010 ! ARM thumb +! 0x0020 ! SPARC +!=== +|=== + +NOTE: A SquashFS writer typically tries all selected VLI filters (including +not setting any and letting libxz work with defaults) and stores the resulting +block that has the smallest size. + +Also note that further options, such as XZ presets, are not included. The +compressor typically uses the libxz defaults, i.e. level 6 and not using the +extreme flag. Likewise for `lc`, `lp` and `pb` (defaults are 3, 0 and 2 +respectively). + +If the encoder chooses to change those values, the decoder will still be +able to read the data, but there is currently no way to convey that those +values were changed. + +This is specifically problematic for the compression level, since increasing +the level can result in drastically increasing the decoder's memory consumption. + +==== LZ4 + +[cols="1,4,20a",frame="none",grid="none",options="header"] +|=== +| Type | Name | Description +| u32 | Version | *MUST* be set to 1. +| u32 | Flags | A bit field describing the enabled LZ4 flags. + There is currently only one possible flag: + + +[cols=">1m,10",frame="none",grid="none",options="header"] +!=== +! Value ! Comment +! 0x0001 ! Use LZ4 High Compression (HC) mode. +!=== +|=== + +==== ZSTD + +[cols="1,4,20a",frame="none",grid="none",options="header"] +|=== +| Type | Name | Description +| u32 | compression level | Should be in range 1 to 22 (inclusive). The real + maximum is the zstd defined ZSTD_maxCLevel(). + + + + + The default value is 15. +|=== + +==== LZO + +[cols="1,4,20a",frame="none",grid="none",options="header"] +|=== +| Type | Name | Description +| u32 | algorithm | Which variant of LZO to use. 
+ +[cols=">1m,10",frame="none",grid="none",options="header"] +!=== +! Value ! Comment +! 0 ! lzo1x_1 +! 1 ! lzo1x_1_11 +! 2 ! lzo1x_1_12 +! 3 ! lzo1x_1_15 +! 4 ! lzo1x_999 (default) +!=== + +| u32 | compression level | For lzo1x_999, this can be a value between 0 + and 9 inclusive (defaults to 8). *MUST* be 0 + for all other algorithms. +|=== + +== Data and Fragment Blocks + +As outlined in 2.1, file data is packed by dividing the input files into fixed +size chunks (the block size from the super block) that are stored in sequence. + +The picture below tries to illustrate this concept: + +.Packing of File Data +[%nowrap] +.... + _____ _____ _____ _ _____ _____ _ _ +File A: |__A__|__A__|__A__|A| File B: |__B__|__B__|B| File C: |C| + | | | | | | | | + | +---+ | | | | | | + | | +------+ | | | | | + | | | | | | | | + | | | +------|---------------+ | | | + | | | | +--|---------------------+ | | + | | | | | | | | + | | | | | +-----------------------+ | +------------+ + | | | | | | | | + V V V V V V V V + __ _ ___ ___ ___ __ Fragment block: |A|B|C| + Output: |_A|A|_A_|_B_|_B_|_F| | + __V__ + A |__F__| + | | + +------------------------+ +.... + +In the above diagram, file A consists of 3 blocks and a single tail end, file B has +2 blocks and one tail end while file C is smaller than block size. + + +For each file, the blocks are individually compressed and stored on disk +in order. + +The tail ends of A and B, together with the entire contents of C are packed +together into a fragment block F, that is compressed and stored on disk once +it is full. + +This tail-end-packing is completely optional. The tail ends (or in case of C +the entire file) can also be treated as truncated blocks that expand to less +than block size when uncompressed. + + +There are no headers in front of data or fragment blocks and there *MUST NOT* be +any gaps between data blocks from a single file, but a SquashFS packer is free +to leave gaps between two different files or fragment blocks. 
The packer is +also free to decide how to arrange fragments within a fragment block and what +fragments to pack together. + +To locate file data, the file inodes store the following information: + +* The uncompressed size of the file. From this, the number of blocks can + be computed: + + block_count = floor(file_size / block_size) # if tail end packing is used + block_count = ceil(file_size / block_size) # otherwise + +* The exact location of the first block, if one exists. +* For each consecutive block, the on-disk size. ++ +A 32 bit integer is used with bit 24 (i.e. `1 << 24`) set if the block +is stored uncompressed. + +* If tail-end-packing was done, the location of the fragment block and a + byte offset into the uncompressed fragment block. The size of the tail + end can be computed easily: + + tail_end_size = file_size % block_size + +Since a fragment block will likely be referred to by multiple files, inodes +don't store its on-disk location and size directly, but instead use a 32 bit +index into a fragment block lookup table (see the <<Fragment Table>>). + +If a data block other than the last one unpacks to less than block size, the +rest of the buffer is filled with 0 bytes. This way, sparse files are +implemented. Specifically, if a block has an on-disk size of 0, this translates +to an entire block filled with 0 bytes without having to retrieve any data +from disk. + +The on-disk locations of file blocks *MAY* overlap and different file inodes are +free to refer to the same fragment. Typical SquashFS packers would explicitly +use this for files that are duplicates of others. Doing so is NOT counted +as a hard link. + +If an inode references on-disk locations outside the data area, the result is +undefined. + +== Inode Table + +Inodes are packed into metadata blocks and are not aligned, i.e. they can span +the boundary between metadata blocks. To save space, there are different +inodes for each type (regular file, directory, device, etc.) 
of varying +contents and size. + +To save even more space, inodes come in two flavors: simple inode types +optimized for a simple, standard use case, and extended inode types where +extra information has to be stored. + +SquashFS more or less supports 32 bit UIDs and GIDs. As an optimization, those +IDs are stored in a lookup table (see <<ID Table>>) and the inodes themselves +hold a 16 bit index into this table. This allows 32 bit UIDs/GIDs, but only +among 2^16^ unique values. + +The location of the first metadata block is indicated by the inode table start +in the superblock. The inode table ends at the start of the directory table. + +=== Common Inode Header + +All Inodes share a common header, which contains some common information, +as well as describing the type of Inode which follows. This header has the +following structure: + +[cols="1,4,20a",frame="none",grid="none",options="header"] +|=== +| Type | Name | Description +| u16 | type | The type of item described by the inode which follows this header + +[cols=">1,10",frame="none",grid="none",options="header"] +!=== +! Value ! Comment +! 1 ! Basic Directory +! 2 ! Basic File +! 3 ! Basic Symlink +! 4 ! Basic Block Device +! 5 ! Basic Character Device +! 6 ! Basic Named Pipe (FIFO) +! 7 ! Basic Socket +! 8 ! Extended Directory +! 9 ! Extended File +! 10 ! Extended Symlink +! 11 ! Extended Block Device +! 12 ! Extended Character Device +! 13 ! Extended Named Pipe (FIFO) +! 14 ! Extended Socket +!=== + +| u16 | permissions | A bit mask representing Unix file system permissions + for the inode. This only stores permissions, not the + type. The type is reconstructed from the field above. +| u16 | uid | An index into the <<ID Table>>, giving the user ID of the owner. +| u16 | gid | An index into the <<ID Table>>, giving the group ID of the owner. 
+| u32 | mtime | The unsigned number of seconds (not counting leap + seconds) since 00:00, Jan 1st, 1970 UTC when the item + described by the inode was last modified. +| u32 | inode number | Unique node number. Must be at least 1 and at most + the inode count from the super block. +|=== + +=== Directory Inodes + +Directory inodes mainly contain a reference into the directory table where +the listing of entries is stored. + +A basic directory has an entry listing of at most 64k (uncompressed) and +no extended attributes. The layout of the inode data is as follows: + +[cols="1,4,20a",frame="none",grid="none",options="header"] +|=== +| Type | Name | Description +| u32 | block index | The location of the metadata block in the directory + table where the entry information starts. This is + relative to the directory table location. +| u32 | link count | The number of hard links to this directory. +| u16 | file size | Total (uncompressed) size in bytes of the entry + listing in the directory table, including headers. + + + + + This value is 3 bytes larger than the real listing. + The Linux kernel creates "." and ".." entries for + offsets 0 and 1, and only after 3 looks into the + listing, subtracting 3 from the size. +| u16 | block offset | The (uncompressed) offset within the metadata block + in the directory table where the directory listing + starts. +| u32 | parent inode | The inode number of the parent of this directory. If + this is the root directory, this *SHOULD* be 0. +|=== + +NOTE: For historical reasons, the hard link count of a directory includes +the number of entries in the directory and is initialized to 2 for an empty +directory. I.e. a directory with N entries has at least N + 2 link count. + +If the "file size" is set to a value < 4, the directory is empty and there is +no corresponding listing in the directory table. 
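As an illustration of the two tables above, here is a sketch that decodes the common inode header followed by a basic directory inode from an already uncompressed metadata buffer. The class and function names (`InodeHeader`, `BasicDirInode`, `parse_basic_dir`, `is_empty`, `listing_size`) are made up for this example; only the field order and widths come from the tables.

```python
import struct
from typing import NamedTuple


class InodeHeader(NamedTuple):
    type: int
    permissions: int
    uid_index: int
    gid_index: int
    mtime: int
    inode_number: int


class BasicDirInode(NamedTuple):
    block_index: int
    link_count: int
    file_size: int
    block_offset: int
    parent_inode: int


# "<" selects little endian without padding, matching the on-disk layout.
HEADER = struct.Struct("<HHHHII")    # u16 x4, u32 x2 = 16 bytes
BASIC_DIR = struct.Struct("<IIHHI")  # u32, u32, u16, u16, u32 = 16 bytes


def parse_basic_dir(buf: bytes, offset: int = 0):
    """Decode the common header plus a basic directory inode (type 1)."""
    header = InodeHeader(*HEADER.unpack_from(buf, offset))
    if header.type != 1:
        raise ValueError("not a basic directory inode")
    body = BasicDirInode(*BASIC_DIR.unpack_from(buf, offset + HEADER.size))
    return header, body


def is_empty(inode: BasicDirInode) -> bool:
    """A 'file size' below 4 means there is no listing in the directory table."""
    return inode.file_size < 4


def listing_size(inode: BasicDirInode) -> int:
    """Real size of the listing: the stored value is 3 bytes larger."""
    return 0 if is_empty(inode) else inode.file_size - 3
```

If the inode straddles two metadata blocks, the caller would have to concatenate both uncompressed blocks before calling `parse_basic_dir`.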
+ +An extended directory can have a listing that is at most 4GiB in size, may +have extended attributes and can have an optional index for faster lookup: + +[cols="1,4,20a",frame="none",grid="none",options="header"] +|=== +| Type | Name | Description +| u32 | link count | Same as above. +| u32 | file size | Same as above. +| u32 | block index | Same as above. +| u32 | parent inode | Same as above. +| u16 | index count | The number of directory index entries following the + inode structure. +| u16 | block offset | Same as above. +| u32 | xattr index | An index into the <<Xattr Table>> or `0xFFFFFFFF` + if the inode has no extended attributes. +|=== + + +The index follows directly after the inode. See <<Directory Index>> for details on +how the directory index is structured. + +=== File Inodes + +Basic files can be at most 4 GiB in size (uncompressed), must be located +within the first 4 GiB of the SquashFS image, cannot have any extended +attributes and don't support hard-link or sparse file accounting: + + +[cols="1,4,20a",frame="none",grid="none",options="header"] +|=== +| Type | Name | Description +| u32 | blocks start | The offset from the start of the archive to the first + data block. +| u32 | frag index | An index into the <<Fragment Table>> which describes the fragment + block that the tail end of this file is stored in. If not used, + this is set to `0xFFFFFFFF`. +| u32 | block offset | The (uncompressed) offset within the fragment block + where the tail end of this file is. See <<Data and Fragment Blocks>> + for details. +| u32 | file size | The (uncompressed) size of this file. +| u32[] | block sizes | An array of consecutive block sizes. See <<Data and Fragment Blocks>> for details. 
+|=== + +If 'frag index' is set to `0xFFFFFFFF`, the number of blocks is computed as + + ceil(file_size / block_size) + +otherwise, if 'frag index' is a valid fragment index, the block count is +computed as + + floor(file_size / block_size) + +and the size of the tail end is + + file_size % block_size + + +To access a data block, first compute the block index as + + index = floor(offset / block_size) + +then compute the on-disk location of the block by summing up the sizes of the +blocks that come before it: + + location = block_start + + for i = 0; i < index; i++ + location += block_sizes[i] & 0x00FFFFFF + + +The tail end, if present, is accessed by resolving the fragment index through +the fragment lookup table (see the <<Fragment Table>>), loading the fragment block and +using the given 'block offset' into the fragment block. + +Extended files have a 64 bit location and size, have additional counters for +sparse file accounting and hard links, and can have extended attributes: + +[cols="1,4,20a",frame="none",grid="none",options="header"] +|=== +| Type | Name | Description +| u64 | blocks start | Same as above (but larger). +| u64 | file size | Same as above (but larger). +| u64 | sparse | The number of bytes saved by omitting zero bytes. + Used in the kernel for sparse file accounting. +| u32 | link count | The number of hard links to this node. +| u32 | frag index | Same as above. +| u32 | block offset | Same as above. +| u32 | xattr index | An index into the <<Xattr Table>> or `0xFFFFFFFF` + if the inode has no extended attributes. +| u32[] | block sizes | Same as above. +|=== + +=== Symbolic Links + +Symbolic links mainly have a target path stored directly after the inode +header, as well as a hard-link counter (yes, you can have hard links to +symlinks): + +[cols="1,4,20a",frame="none",grid="none",options="header"] +|=== +| Type | Name | Description +| u32 | link count | The number of hard links to this symlink. 
+| u32 | target size | The size in bytes of the target path this symlink + points to. +| u8[] | target path | An array of bytes holding the target path this + symlink points to. The path is 'target size' bytes + long and NOT null-terminated. +|=== + +The extended symlink type adds an additional extended attribute index: + +[cols="1,4,20a",frame="none",grid="none",options="header"] +|=== +| Type | Name | Description +| u32 | link count | Same as above. +| u32 | target size | Same as above. +| u8[] | target path | Same as above. +| u32 | xattr index | An index into the <<Xattr Table>> +|=== + +=== Device Special Files + +Basic device special files only store a hard-link counter and a device number. +The layout is identical for both character and block devices: + +[cols="1,4,20a",frame="none",grid="none",options="header"] +|=== +| Type | Name | Description +| u32 | link count | The number of hard links to this entry. +| u32 | device number | The system specific device number. + + + + + On Linux, this consists of major and minor device + numbers that can be extracted as follows: + + major = (dev & 0xFFF00) >> 8. + minor = (dev & 0x000FF) | ((dev >> 12) & 0xFFF00) +|=== + +The extended device file inode adds an additional extended attribute index: + +[cols="1,4,20a",frame="none",grid="none",options="header"] +|=== +| Type | Name | Description +| u32 | link count | Same as above. +| u32 | device number | Same as above. +| u32 | xattr index | An index into the <<Xattr Table>> +|=== + +=== IPC Inodes (FIFO or Socket) + +Named pipe (FIFO) and socket special files only add a hard-link counter +after the inode header: + +[cols="1,4,20a",frame="none",grid="none",options="header"] +|=== +| Type | Name | Description +| u32 | link count | The number of hard links to this entry. 
+|=== + +The extended versions add an additional extended attribute index: + +[cols="1,4,20a",frame="none",grid="none",options="header"] +|=== +| Type | Name | Description +| u32 | link count | Same as above. +| u32 | xattr index | An index into the <<Xattr Table>> +|=== + +== Directory Table + +For each directory inode, the directory table stores a linear list of all +entries, with references back to the inodes that describe those entries. + +The entry list itself is sorted ASCIIbetically by entry name and split into +multiple runs, each preceded by a short header. + +The directory inodes store the total, uncompressed size of the entire listing, +including headers. Using this size, a SquashFS reader can determine if another +header with further entries should be following once it reaches the end of a +run. + +To save space, the header indicates a metadata block and a reference inode +number. The entries that follow simply store a difference to that inode number +and an offset into the specified metadata block. + +Every time the inode block changes or the difference of the inode number +to the reference in the header cannot be encoded in 16 bits anymore, a new +header is emitted. + +A header must be followed by *AT MOST* 256 entries. If there are more entries, +a new header *MUST* be emitted. + +Typically, inode allocation strategies would sort the children of a directory +and then allocate inode numbers incrementally, to optimize directory entry +listings. + +Since hard links might be further away than ±32k of the reference +number, they might require a new header to be emitted. Inode number allocation +and picking of the reference could of course be optimized to prevent this. + +The directory header has the following structure: + +[cols="1,4,20a",frame="none",grid="none",options="header"] +|=== +| Type | Name | Description +| u32 | count | Number of entries following the header. 
+| u32 | start | The location of the metadata block in the inode table + where the inodes are stored. This is relative to the + inode table start from the super block. +| s32 | inode number | An arbitrary inode number. The entries that follow + store their inode number as a difference to this. +|=== + +The counter is stored off-by-one, i.e. a value of 0 indicates 1 entry follows. +This also makes it impossible to encode a size of 0, which wouldn't make any +sense. Empty directories simply have their size set to 0 in the inode instead, +so no extra dummy header has to be stored or looked up. + +The header is followed by multiple entries that each have this structure: + +[cols="1,4,20a",frame="none",grid="none",options="header"] +|=== +| Type | Name | Description +| u16 | offset | An offset into the uncompressed inode metadata block. +| s16 | inode offset | The difference of this inode's number to the reference + stored in the header. +| u16 | type | The inode type. For extended inodes, the basic type is stored + here instead. +| u16 | name size | One less than the size of the entry name. +| u8[] | name | The file name of the entry without a trailing null byte. Has + `name size` + 1 bytes. +|=== + +In the entry structure itself, the file names are stored without trailing null +bytes. Since a zero length name makes no sense, the name length is stored +off-by-one, i.e. the value 0 cannot be encoded. + +The inode type is stored in the entry, but always as the corresponding +basic type. + +While the field is technically 16 bits, the kernel implementation currently +imposes an arbitrary limit of 255 on the name size field. Since the field is +off-by-one, this means that a file name in SquashFS can be at most 256 +characters long. + +=== Directory Index + +To speed up lookups on directories with lots of entries, the extended +directory inode can store an index, holding the locations of all directory +headers and the name of the first entry after the header. 
+ +When searching for an entry, the reader can then iterate over the index to +find a range of metadata blocks that should contain a given entry and then +only scan over the given range. + +To allow for even faster lookups, a new header should be emitted every time +the entry list crosses a metadata block boundary. This narrows the search +down to a single metadata block lookup in most cases. + +The index entries have the following structure: + +[cols="1,4,20a",frame="none",grid="none",options="header"] +|=== +| Type | Name | Description +| u32 | index | This stores a byte offset from the first directory + header to the current header, as if the uncompressed + directory metadata blocks were laid out in memory + consecutively. +| u32 | start | Start offset of a directory table metadata block, + relative to the directory table start. +| u32 | name size | One less than the size of the entry name. +| u8[] | name | The name of the first entry following the header + without a trailing null byte. +|=== + +== Fragment Table + +Tail ends and files smaller than the block size can be combined into fragment +blocks that are at most 'block size' bytes long. + +The fragment table describes the location and size of the fragment blocks +(not the tail-ends within them). + +This is a lookup table which stores entries of the following shape: + +[cols="1,4,20a",frame="none",grid="none",options="header"] +|=== +| Type | Name | Description +| u64 | start | The offset within the archive where the fragment block starts +| u32 | size | The on-disk size of the fragment block. If the block is + uncompressed, bit 24 (i.e. `1 << 24`) is set. +| u32 | unused | *SHOULD* be set to 0. +|=== + +The table is stored on-disk as described in <<Storing Lookup Tables>>. + +The fragment table location in the superblock points to an array of 64 bit +integers that store the on-disk locations of the metadata blocks containing +the lookup table. + +Each metadata block can store up to 512 entries (`8192 / 16`). 
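As an illustration of the fragment table entry layout, the sketch below decodes one 16 byte entry and works out where a given fragment index lives. It assumes 8192 byte metadata blocks as used for all lookup tables; the function names are hypothetical and chosen only for this example.

```python
import struct

FRAGMENT_ENTRY = struct.Struct("<QII")  # start, size, unused (16 bytes)
UNCOMPRESSED_BIT = 1 << 24              # set in `size` if the block is stored uncompressed
ENTRIES_PER_BLOCK = 8192 // FRAGMENT_ENTRY.size  # 512


def parse_fragment_entry(raw):
    """Decode one entry into (start, on-disk size, is_compressed)."""
    start, size, _unused = FRAGMENT_ENTRY.unpack(raw)
    on_disk_size = size & ~UNCOMPRESSED_BIT
    is_compressed = not (size & UNCOMPRESSED_BIT)
    return start, on_disk_size, is_compressed


def locate_fragment_entry(index):
    """Return (metadata block index, byte offset inside that block)."""
    return index // ENTRIES_PER_BLOCK, (index % ENTRIES_PER_BLOCK) * FRAGMENT_ENTRY.size
```

The block index returned by `locate_fragment_entry` selects one of the 64 bit locations that the superblock's fragment table field points to; the byte offset applies to that block after it has been uncompressed.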
+ +The "unused" field is there for alignment and *SHOULD* be set to 0, however the +Linux kernel currently ignores this field completely, making it impossible for +Linux to ever re-purpose this field. + +== Export Table + +To support NFS exports, SquashFS needs a fast way to resolve an inode number +to an inode structure. + +For this purpose, a SquashFS archive can optionally contain an export table, +which is basically a flat array of 64 bit inode references, with the inode +number being used as an index into the array. + +Because the inode number 0 is not used (reserved as a sentinel value), the +array actually starts at inode number 1 and the index is thus +inode_number - 1. + +The array itself is stored in a series of metadata blocks, as outlined in +<<Storing Lookup Tables>>. + +Since each block can store 1024 references (`8192 / 8`), there will be +`ceil(inode_count / 1024)` metadata blocks for the entire array. + +== ID Table + +As outlined in <<Common Inode Header>>, SquashFS supports 32 bit user and group IDs. To +compact the inode table, the unique UID/GID values are collected in a lookup +table and a 16 bit table index is stored in the inode instead. + +This lookup table is stored as outlined in <<Storing Lookup Tables>>. + +Each metadata block can store up to 2048 IDs (`8192 / 4`). + +[[_xattr_table,Xattr Table]] +== Extended Attribute Table + +Extended attributes are arbitrary key value pairs attached to inodes. The key +names use dots as separators to create a hierarchy of name spaces. + +The key value pairs of all inodes are stored consecutively in a series of +metadata blocks. + +The values can either be stored inline, i.e. a key entry is directly followed +by a value, or out-of-line to deduplicate identical values and use a reference +instead. Typically, the first occurrence of a value is stored in line and +every consecutive use of the same value uses an out-of-line reference back to +the first one. 
+ +The keys are stored using the following data structure: + +[cols="1,4,20a",frame="none",grid="none",options="header"] +|=== +| Type | Name | Description +| u16 | type | A prefix ID for the key name. If the value that follows + is stored out-of-line, the flag `0x0100` is **OR**ed to the + type ID. + +[cols=">1,10",frame="none",grid="none",options="header"] +!=== +! Value ! Comment +! 0 ! Prefix the name with `"user."` +! 1 ! Prefix the name with `"trusted."` +! 2 ! Prefix the name with `"security."` +!=== + +| u16 | name size | The number of key bytes that follow. +| u8[] | name | The remainder of the key without the prefix and without a + trailing null byte. +|=== + +Following the key, this structure is used to store the value: + +[cols="1,4,20a",frame="none",grid="none",options="header"] +|=== +| Type | Name | Description +| u32 | value size | The size of the value string. If the value is stored + out-of-line, this is always 8, i.e. the size of an + unsigned 64 bit integer. +| u8[] | value | This is 'value size' bytes of arbitrary binary data. + If the value is stored out-of-line, this is a 64 bit + reference, i.e. a location of a metadata block, + shifted left by 16 and **OR**ed with an offset into the + uncompressed block, giving the location of another + value structure. +|=== + +The metadata block location given by an out-of-line reference is relative to +the location of the first block. + +To actually address a block of key value pairs associated with an inode, a +lookup table is used that specifies the start and size of a sequence of key +value pairs. + +All an inode needs to store is a 32 bit index into this table. If two inodes +have identical attribute sets, the key/value sequence is only written once, +there is only one lookup table entry and both inodes have the same index. 
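The key and value structures above could be decoded roughly as follows. This is a sketch, not a reference implementation: it works on a buffer of uncompressed key/value data, it does not follow out-of-line references (it only splits them into block location and offset), and `parse_xattr_pair` is a name invented for this example.

```python
import struct

PREFIXES = {0: b"user.", 1: b"trusted.", 2: b"security."}
OUT_OF_LINE_FLAG = 0x0100


def parse_xattr_pair(buf, pos=0):
    """Decode one key followed by its value structure.

    Returns (full key, inline value or None, out-of-line reference or None,
    position after the pair). The reference is a tuple of
    (metadata block location, offset into the uncompressed block).
    An unknown prefix ID raises KeyError in this sketch.
    """
    key_type, name_size = struct.unpack_from("<HH", buf, pos)
    pos += 4
    name = bytes(buf[pos:pos + name_size])
    pos += name_size

    out_of_line = bool(key_type & OUT_OF_LINE_FLAG)
    key = PREFIXES[key_type & ~OUT_OF_LINE_FLAG] + name

    (value_size,) = struct.unpack_from("<I", buf, pos)
    pos += 4
    raw = bytes(buf[pos:pos + value_size])
    pos += value_size

    if out_of_line:
        # The 8 byte payload is a 64 bit reference to another value structure.
        (ref,) = struct.unpack("<Q", raw)
        return key, None, (ref >> 16, ref & 0xFFFF), pos
    return key, raw, None, pos
```

Keeping the reference unresolved leaves it to the caller to decide how metadata blocks are read and cached.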
+ +Each lookup table entry has the following structure: + +[cols="1,4,20a",frame="none",grid="none",options="header"] +|=== +| Type | Name | Description +| u64 | xattr ref | A reference to the start of the key value block, i.e. + the metadata block location shifted left by 16, **OR**ed + with an offset into the uncompressed block. +| u32 | count | The number of key value pairs. +| u32 | size | The exact, uncompressed size in bytes of the entire + block of key value pairs, counting what has been + written to disk and including the key/value entry + structures. +|=== + +This lookup table is stored as outlined in <<Storing Lookup Tables>>. + +Each metadata block can hold 512 (`8192 / 16`) entries. + +However, in contrast to <<Storing Lookup Tables>>, additional data is given before +the list of metadata block locations, to locate the key-value pairs, as well as the +actual number of lookup table entries, which is not specified in the super +block. + +The 'Xattr table' entry in the superblock gives the absolute location of the +following data structure which is stored on-disk as is, uncompressed: + +[cols="1,4,20a",frame="none",grid="none",options="header"] +|=== +| Type | Name | Description +| u64 | kv start | The absolute position of the first metadata block holding the + key/value pairs. +| u32 | count | The number of entries in the lookup table. +| u32 | unused | *SHOULD* be set to 0, however Linux currently ignores + this field completely and squashfs-tools used to leak + stack data here, making it impossible for Linux to + ever re-purpose this field. +| u64[] | locations | An array holding the absolute on-disk location of each + metadata block of the lookup table. +|=== + +If an inode has a valid xattr index (i.e. not `0xFFFFFFFF`), the metadata +block index is computed as + + block_idx = floor(index / 512) + +which is then used to retrieve the metadata block location from the locations +array. 
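To make the lookup concrete, the sketch below parses the uncompressed structure that the superblock points to and resolves an xattr index to a metadata block location plus a byte offset. It assumes 16 byte lookup entries and 8192 byte metadata blocks, and derives the length of the locations array as `ceil(count / 512)`; the helper names are hypothetical.

```python
import struct

XATTR_TABLE_HEADER = struct.Struct("<QII")  # kv start, count, unused
ENTRY_SIZE = 16                             # xattr ref (u64), count (u32), size (u32)
ENTRIES_PER_BLOCK = 8192 // ENTRY_SIZE      # 512
NO_XATTR = 0xFFFFFFFF


def parse_xattr_table_header(buf):
    """Return (kv start, entry count, list of lookup table block locations)."""
    kv_start, count, _unused = XATTR_TABLE_HEADER.unpack_from(buf, 0)
    block_count = -(-count // ENTRIES_PER_BLOCK)  # ceiling division
    locations = struct.unpack_from("<%dQ" % block_count, buf, XATTR_TABLE_HEADER.size)
    return kv_start, count, list(locations)


def locate_lookup_entry(index, locations):
    """Map an inode's xattr index to (block location, offset in the block).

    Returns None for the sentinel that marks an inode without xattrs.
    """
    if index == NO_XATTR:
        return None
    block_idx = index // ENTRIES_PER_BLOCK
    offset = (index * ENTRY_SIZE) % 8192
    return locations[block_idx], offset
```

The offset is only meaningful after the selected metadata block has been read and uncompressed.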
+ +Once the block has been read from disk and uncompressed, the byte offset into +the metadata block can be computed as + + offset = (index * 16) % 8192 + +From this position, the structure can be read that holds a reference to the +metadata block that contains the key/value pairs (and byte offset into the +uncompressed block where the pairs start), as well as the number of key/value +pairs and their total, uncompressed size. diff --git a/doc/format.txt b/doc/format.txt deleted file mode 100644 index 3eb4932..0000000 --- a/doc/format.txt +++ /dev/null @@ -1,1216 +0,0 @@ - - Squashfs Binary Format - ********************** - -0) Index -******** - - 0............Index - 1............About - 2............Overview - 2.1........Packing File Data - 2.2........Packing Metadata - 2.3........Storing Lookup Tables - 3............The Superblock - 3.1........Compression Options - 3.1.1....GZIP - 3.1.2....XZ - 3.1.3....LZ4 - 3.1.4....ZSTD - 3.1.5....LZO - 4............Data and Fragment Blocks - 5............Inode Table - 5.1........Common Inode Header - 5.2........Directory inodes - 5.3........File Inodes - 5.4........Symbolic Links - 5.5........Device Special Files - 5.6........IPC inodes (FIFO or Socket) - 6............Directory Table - 6.1........Directory Index - 7............Fragment Table - 8............Export Table - 9............ID Table - 10...........Extended Attribute Table - -1) About -******** - -SquashFS is a compressed, read-only filesystem for Linux that can also be used -as a flexible, general purpose, compressed archive format, optimized for fast -random access with support for Unix permissions, sparse files and extended -attributes. - -SquashFS supports data and metadata compression through zlib, lz4, lzo, lzma, -xz or zstd. - -For fast random access, compressed files are split up in fixed size blocks -that are compressed separately. The block size can be set between 4k and 1M -(default for squashfs-tools and squashfs-tools-ng is 128K). 
- - -This document attempts to specify the on-disk format in detail. - -It is based on a previous on-line version that was originally written by -Zachary Dremann and subsequently expanded by David Oberhollenzer during -reverse engineering attempts and available here: - - https://dr-emann.github.io/squashfs/ - - -2) Overview -*********** - -SquashFS always stores integers in little endian format. The data blocks that -make up the SquashFS archive are byte aligned, i.e. they typically do not care -for alignment. The implementation in the Linux kernel requires the archive -itself to be a multiple of either 1k or 4k in size (called the device block -size) and user space tools typically use 4k to be compatible with both. - -A SquashFS archive consists of a maximum of nine parts: - - _______________ - | | Important information about the archive, including - | Superblock | locations of other sections. - |_______________| - | | If non-default compression options have been used, - | Compression | they can optionally be stored here, to facilitate - | options | later, offline editing of the archive. - |_______________| - | | - | Data blocks | The contents of the files in the archive, - | & fragments | split into separately compressed blocks. - |_______________| - | | Metadata (ownership, permissions, etc) for - | Inode table | items in the archive. - |_______________| - | | - | Directory | Directory listings, including file names, and - | table | references to inodes. - |_______________| - | | - | Fragment | Description of fragment locations within the - | table | Datablocks & Fragments section. - |_______________| - | | A mapping from inode numbers to disk locations, - | Export table | required for NFS export. - |_______________| - | | - | UID/GID | A list of unique UID/GIDs. Inodes use an index into - | lookup table | this table to save memory. - |_______________| - | | - | Xattr | Extended attributes for items in the archive. 
- | table | - |_______________| - - -Although the super block details the exact positions of each section, most -implementations, including the one in the Linux kernel, insist on this exact -order. - - -2.1) Packing File Data - -The file data is packed into the archive after the super block (and optional -compressor options). - -Files are divided into fixed size blocks that are separately compressed and -stored in order. SquashFS supports optional tail-end-packing of files that -are not an exact multiple of the block size. The remaining ends can either -be treated as a short block, or can be packed together with the tail ends of -other files in a single "fragment block". Files that are less than block size -are treated the same way. - -If the size of a data or fragment block would exceed the input size after -compression, the original, uncompressed data is stored, so that the size of a -block after compression never exceeds the input block size. - - -2.2) Packing Metadata - -Metadata (e.g. inodes, directory listings, etc...) is treated as a continuous -stream of records that is chopped up into 8KiB blocks that are separately -compressed into special metadata blocks. - -The input size of 8KiB is fixed and independent of the data block size. -Similar to data blocks, if the compressed size would exceed 8KiB, the -uncompressed block is stored instead, so the on-disk size of a metadata -block never exceeds 8KiB. - -Individual entries are allowed to cross the block boundary, so e.g. an inode -may be located at the end of a metadata block with some part of it located at -the start of the next block. Both have to be read and decompressed when -reading this inode. If an entry is written across block boundaries, there -MUST NOT be any gap between the compressed metadata blocks on-disk. - - -In contrast to data blocks, every metadata block is preceded by a single, -16 bit unsigned integer. This integer holds the on-disk size of the block -that follows. 
The MSB is set if the block is stored uncompressed. Whenever -a metadata block is referenced, the position of this integer is given. - -To read a metadata block, seek to the indicated position and read the 16 bit -header. Sanity check that the lower 15 bit are less than 8KiB and proceed -to read that many bytes. If the highest bit of the header is cleared, -uncompress the data into an 8KiB buffer that MUST NOT overflow. - - -In the SquashFS archive format, metadata entries (e.g. inodes) are often -referenced using a 64 bit integer. The lower 16 bit hold an offset into the -uncompressed block and the upper 48 bit point to the on-disk location of the -block. - -The on-disk location is relative to the type of metadata entry, e.g. for -inodes it is relative to the start of the inode table given by the -super block. - - -2.3) Storing Lookup Tables - -Lookup tables are arrays (i.e. sequences of identical sized records) that are -addressed by an index. - -Such tables are stored in the SquashFS format as metadata blocks, i.e. by -dividing the table data into 8KiB chunks that are separately compressed and -stored in sequence. - -To allow constant time lookup, a list of 64 bit unsigned integers is stored, -holding the on-disk locations of each metadata block. - -This list itself is stored uncompressed and not preceded by a header. - -When referring to a lookup table, the superblock gives the number of table -entries and points to this location list. - -Since the table entry size is a known, fixed value, the required number of -metadata blocks can be computed: - - block_count = ceil(table_count * entry_size / 8192) - -Which is also the number of 64 bit integers in the location list. - - -When resolving a lookup table index, first work out the index of the -metadata block: - - meta_index = floor(index * entry_size / 8192) - -Using this index on the location list yields the on-disk location of -the metadata block containing the entry. 
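The metadata block header, the 64 bit metadata references and the lookup table arithmetic described above can be summarized in a short sketch. It assumes 8192 byte metadata blocks and treats a stored size above 8192 as corrupt, on the grounds that an uncompressed block can be exactly 8KiB; all function names are illustrative only.

```python
import struct

METADATA_BLOCK_SIZE = 8192


def decode_metadata_header(raw):
    """Decode the 16 bit header preceding a metadata block.

    Returns (on-disk size, stored_uncompressed).
    """
    (header,) = struct.unpack("<H", raw)
    size = header & 0x7FFF
    stored_uncompressed = bool(header & 0x8000)  # MSB set: no decompression needed
    if size > METADATA_BLOCK_SIZE:
        raise ValueError("metadata block larger than 8KiB")
    return size, stored_uncompressed


def split_metadata_ref(ref):
    """Split a 64 bit reference into (block location, offset into the block)."""
    return ref >> 16, ref & 0xFFFF


def lookup_block_count(table_count, entry_size):
    """Number of metadata blocks (and location list entries) for a lookup table."""
    return -(-(table_count * entry_size) // METADATA_BLOCK_SIZE)


def lookup_meta_index(index, entry_size):
    """Index into the location list for a given lookup table index."""
    return (index * entry_size) // METADATA_BLOCK_SIZE
```

The block location returned by `split_metadata_ref` is relative to the start of whichever table the reference belongs to, so the caller still has to add the table start from the super block.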
- -After reading this metadata block, the byte offset into the block can -be computed to get the entry: - - offset = index * entry_size % 8192 - - -The location list can be cached in memory. Resolving an index requires at -worst a single metadata block read (at most 8194 bytes fetched from an -unaligned on-disk location). - - -2.4) Supported Compressors - -The SquashFS format supports the following compressors: - - - zlib deflate (referred to as "gzip" but only uses raw zlib streams) - - lzo - - lzma 1 (considered deprecated) - - lzma 2 (referred to as "xz") - - lz4 - - zstd - -The archive can only specify one compressor in the super block and has to use -it for both file data and metadata compression. Using one compressor for data -and switching to a different compressor for e.g. inodes is not supported. - -While it is technically not possible to pick a "null" compressor in the super -block, an implementation can still deliberately write only uncompressed blocks -to a SquashFS archive, or choose to store certain metadata blocks without -compression. - -The lzma 2 aka xz compressor MUST use CRC32 checksums only. Using SHA-256 is -not supported. - - -3) The superblock -***************** - -The superblock is the first section of a SquashFS archive. It is always -96 bytes in size and contains important information about the archive, -including the locations of other sections. - -+======+===============+=====================================================+ -| Type | Name | Description | -+======+===============+=====================================================+ -| u32 | magic | Must be set to 0x73717368 ("hsqs" on disk). | -+------+---------------+-----------------------------------------------------+ -| u32 | inode count | The number of inodes stored in the archive. | -+------+---------------+-----------------------------------------------------+ -| u32 | mod time | Last modification time of the archive. 
Count seconds| -| | | since 00:00, Jan 1st 1970 UTC (not counting leap | -| | | seconds). This is unsigned, so it expires in the | -| | | year 2106 (as opposed to 2038). | -+------+---------------+-----------------------------------------------------+ -| u32 | block size | The size of a data block in bytes. Must be a power | -| | | of two between 4096 (4k) and 1048576 (1 MiB) | -+------+---------------+-----------------------------------------------------+ -| u32 | frag count | The number of entries in the fragment table | -+------+---------------+-----------------------------------------------------+ -| u16 | compressor | An ID designating the compressor used for both data | -| | | and meta data blocks. | -| | | | -| | +-------+------+--------------------------------------+ -| | | Value | Name | Comment | -| | +-------+------+--------------------------------------+ -| | | 1 | GZIP | just zlib streams (no gzip headers!) | -| | | 2 | LZO | | -| | | 3 | LZMA | LZMA version 1 | -| | | 4 | XZ | LZMA version 2 as used by xz-utils | -| | | 5 | LZ4 | | -| | | 6 | ZSTD | | -+------+---------------+-------+------+--------------------------------------+ -| u16 | block log | The log2 of the block size. If the two fields do not| -| | | agree, the archive is considered corrupted. | -+------+---------------+-----------------------------------------------------+ -| u16 | flags | Bit wise OR of the flag bits below. | -| | | | -| | +--------+--------------------------------------------+ -| | | Value | Meaing | -| | +--------+--------------------------------------------+ -| | | 0x0001 | Inodes are stored uncompressed. | -| | | 0x0002 | Data blocks are stored uncompressed. | -| | | 0x0008 | Fragments are stored uncompressed. | -| | | 0x0010 | Fragments are not used. | -| | | 0x0020 | Fragments are always generated. | -| | | 0x0040 | Data has been deduplicated. | -| | | 0x0080 | NFS export table exists. | -| | | 0x0100 | Xattrs are stored uncompressed. 
| -| | | 0x0200 | There are no Xattrs in the archive. | -| | | 0x0400 | Compressor options are present. | -| | | 0x0800 | The ID table is uncompressed. | -+------+---------------+--------+--------------------------------------------+ -| u16 | id count | The number of entries in the ID lookup table. | -+------+---------------+-----------------------------------------------------+ -| u16 | version major | Major version of the format. Must be set to 4. | -+------+---------------+-----------------------------------------------------+ -| u16 | version minor | Minor version of the format. Must be set to 0. | -+------+---------------+-----------------------------------------------------+ -| u64 | root inode | A reference to the inode of the root directory. | -+------+---------------+-----------------------------------------------------+ -| u64 | bytes used | The number of bytes used by the archive. Because | -| | | SquashFS archives must be padded to a multiple of | -| | | the underlying device block size, this can be less | -| | | than the actual file size. | -+------+---------------+-----------------------------------------------------+ -| u64 | ID table | The byte offset at which the id table starts. | -+------+---------------+-----------------------------------------------------+ -| u64 | Xattr table | The byte offset at which the xattr id table starts. | -+------+---------------+-----------------------------------------------------+ -| u64 | Inode table | The byte offset at which the inode table starts. | -+------+---------------+-----------------------------------------------------+ -| u64 | Dir. table | The byte offset at which the directory table starts.| -+------+---------------+-----------------------------------------------------+ -| u64 | Frag table | The byte offset at which the fragment table starts. | -+------+---------------+-----------------------------------------------------+ -| u64 | Export table | The byte offset at which the export table starts. 
| -+------+---------------+-----------------------------------------------------+ - -The Xattr table, fragment table and export table are optional. If they are -omitted from the archive, the respective fields indicating their position -must be set to 0xFFFFFFFFFFFFFFFF (i.e. all bits set). - -Most of the flags only serve an informational purpose and are only useful -when editing the archive to convey the original packer settings. - -The only flag that actually carries information is the "Compressor options are -present" flag. In fact, this is the only flag that the Linux kernel -implementation actually tests for. - -The compressor options, however, are also only there for informal purpose, as -most compression libraries understand their own stream format irregardless of -the options used to compress and in fact don't provide any options for the -decompressor. In the Linux kernel, the XZ decompressor is currently the only -one that processes those options to pre-allocate the LZMA dictionary if a -non-default size was used. - - -3.1) Compression Options - -If the compressor options flag is set in the superblock, the superblock is -immediately followed by a single metadata block, which is always uncompressed. - -The data stored in this block is compressor dependent. - -There are two special cases: - - For LZ4, the compressor options always have to be present. - - The LZMA compressor does not support compressor options, so this section - must never be present. - -For the compressors currently implemented, a 4 to 8 byte payload follows. - -The following sub sections outline the contents for each compressor that -supports options. The default values if the options are missing are outlined -as well. - - -3.1.1) GZIP - -+======+===================+=================================================+ -| Type | Name | Description | -+======+===================+=================================================+ -| u32 | compression level | In the range 1 to 9 (inclusive). 
Defaults to 9. | -+------+-------------------+-------------------------------------------------+ -| u16 | window size | In the range 8 to 15 (inclusive) Defaults to 15.| -+------+-------------------+-------------------------------------------------+ -| u16 | strategies | A bit field describing the enabled strategies. | -| | | If no flags are set, the default strategy is | -| | | implicitly used. Please consult the ZLIB manual | -| | | for details on specific strategies. | -| | | | -| | +--------+----------------------------------------+ -| | | Value | Comment | -| | +--------+----------------------------------------+ -| | | 0x0001 | Default strategy. | -| | | 0x0002 | Filtered. | -| | | 0x0004 | Huffman Only. | -| | | 0x0008 | Run Length Encoded. | -| | | 0x0010 | Fixed. | -+------+-------------------+--------+----------------------------------------+ - -Note: The SquashFS writer typically tries all selected strategies (including -not setting any and letting zlib work with defaults) and stores the result -with the smallest size. - - -3.1.2) XZ - -+======+===================+=================================================+ -| Type | Name | Description | -+======+===================+=================================================+ -| u32 | dictionary size | SHOULD be >= 8KiB, and must be either a power of| -| | | 2, or the sum of two consecutive powers of 2. | -+------+-------------------+-------------------------------------------------+ -| u32 | Filters | A bit field describing the additional enabled | -| | | filters attempted to better compress executable | -| | | code. 
| -| | | | -| | +--------+----------------------------------------+ -| | | Value | Comment | -| | +--------+----------------------------------------+ -| | | 0x0001 | x86 | -| | | 0x0002 | PowerPC | -| | | 0x0004 | IA64 | -| | | 0x0008 | ARM | -| | | 0x0010 | ARM thumb | -| | | 0x0020 | SPARC | -+------+-------------------+--------+----------------------------------------+ - -Note: A SquashFS writer typically tries all selected VLI filters (including -not setting any and letting libxz work with defaults) and stores the resulting -block that has the smallest size. - -Also note that further options, such as XZ presets, are not included. The -compressor typically uses the libxz defaults, i.e. level 6 and not using the -extreme flag. Likewise for lc, lp and pb (defaults are 3, 0 and 2 -respectively). - -If the encoder chooses to change those values, the decoder will still be -able to read the data, but there is currently no way to convey that those -values were changed. - -This is specifically problematic for the compression level, since increasing -the level can result in drastically increasing the decoders memory consumption. - - -3.1.3) LZ4 - -+======+===================+=================================================+ -| Type | Name | Description | -+======+===================+=================================================+ -| u32 | Version | MUST be set to 1. | -+------+-------------------+-------------------------------------------------+ -| u32 | Flags | A bit field describing the enabled LZ4 flags. | -| | | There is currently only one possible flag: | -| | | | -| | +--------+----------------------------------------+ -| | | Value | Comment | -| | +--------+----------------------------------------+ -| | | 0x0001 | Use LZ4 High Compression(HC) mode. 
| -+------+-------------------+--------+----------------------------------------+ - -3.1.4) ZSTD - -+======+===================+=================================================+ -| Type | Name | Description | -+======+===================+=================================================+ -| u32 | compression level | Should be in range 1 to 22 (inclusive). The real| -| | | maximum is the zstd defined ZSTD_maxCLevel(). | -| | | | -| | | The default value is 15. | -+------+-------------------+-------------------------------------------------+ - -3.1.5) LZO - -+======+===================+=================================================+ -| Type | Name | Description | -+======+===================+=================================================+ -| u32 | algorithm | Which variant of LZO to use. | -| | | | -| | +--------+----------------------------------------+ -| | | Value | Comment | -| | +--------+----------------------------------------+ -| | | 0 | lzo1x_1 | -| | | 1 | lzo1x_1_11 | -| | | 2 | lzo1x_1_12 | -| | | 3 | lzo1x_1_15 | -| | | 4 | lzo1x_999 (default) | -+------+-------------------+--------+----------------------------------------+ -| u32 | compression level | For lzo1x_999, this can be a value between 0 | -| | | and 9 inclusive (defaults to 8). MUST be 0 | -| | | for all other algorithms. | -+------+-------------------+-------------------------------------------------+ - - - -4) Data and Fragment Blocks -*************************** - -As outlined in 2.1, file data is packed by dividing the input files into fixed -size chunks (the block size from the super block) that are stored in sequence. 
- -The picture below tries to illustrate this concept: - - - _____ _____ _____ _ _____ _____ _ _ - File A: |__A__|__A__|__A__|A| File B: |__B__|__B__|B| File C: |C| - | | | | | | | | - | +---+ | | | | | | - | | +------+ | | | | | - | | | | | | | | - | | | +------|---------------+ | | | - | | | | +--|---------------------+ | | - | | | | | | | | - | | | | | +-----------------------+ | +------------+ - | | | | | | | | - V V V V V V V V - __ _ ___ ___ ___ __ Fragment block: |A|B|C| - Output: |_A|A|_A_|_B_|_B_|_F| | - __V__ - A |__F__| - | | - +------------------------+ - -Figure 2.1: Packing of File Data. - - -In Figure 1, file A consists of 3 blocks and a single tail end, file B has -2 blocks and one tail end while file C is smaller than block size. - -For each file, the blocks are individually compressed and stored on disk -in order. - -The tail ends of A and B, together with the entire contents of C are packed -together into a fragment block F, that is compressed and stored on disk once -it is full. - -This tail-end-packing is completely optional. The tail ends (or in case of C -the entire file) can also be treated as truncated blocks that expand to less -than block size when uncompressed. - - -There are no headers in front of data or fragment blocks and there MUST NOT be -any gaps between data blocks from a single file, but a SquashFS packer is free -to leave gaps between two different files or fragment blocks. The packer is -also free to decide how to arrange fragments within a fragment block and what -fragments to pack together. - - - -To locate file data, the file inodes store the following information: - - - The uncompressed size of the file. From this, the number of blocks can - be computed: - - block_count = floor(file_size / block_size) if tail end packing is used - block_count = ceil(file_size / block_size) otherwise - - - The exact location of the first block, if one exists. - - For each consecutive block, the on-disk size. 
- - A 32 bit integer is used with bit 24 (i.e. 1 << 24) set if the block - is stored uncompressed. - - - If tail-end-packing was done, the location of the fragment block and a - byte offset into the uncompressed fragment block. The size of the tail - end can be computed easily: - - tail_end_size = file_size % block_size - - -Since a fragment block will likely be referred to by multiple files, inodes -don't store its on-disk location and size directly, but instead use a 32 bit -index into a fragment block lookup table (see section 7). - - -If a data block other than the last one unpacks to less than block size, the -rest of the buffer is filled with 0 bytes. This way, sparse files are -implemented. Specifically if a block has an on-disk size of 0 this translates -to an entire block filled with 0 bytes without having to retrieve any data -from disk. - - -The on-disk locations of file blocks MAY overlap and different file inodes are -free to refer to the same fragment. Typical SquashFS packers would explicitly -use this to for files that are duplicates of others. Doing so is NOT counted -as a hard link. - -If an inode references on-disk locations outside the data area, the result is -undefined. - - -5) Inode Table -************** - -Inodes are packed into metadata blocks and are not aligned, i.e. they can span -the boundary between metadata blocks. To save space, there are different -inodes for each type (regular file, directory, device, etc.) of varying -contents and size. - -To further save more space, inodes come in two flavors: simple inode types -optimized for a simple, standard use case, and extended inode types where -extra information has to be stored. - -SquashFS more or less supports 32 bit UIDs and GIDs. As an optimization, those -IDs are stored in a lookup table (see section 9) and the inodes themselves -hold a 16 bit index into this table. This allows to 32 bit UIDs/GIDs, but only -among 2^16 unique values. 
- - -The location of the first metadata block is indicated by the inode table start -in the superblock. The inode table ends at the start of the directory table. - - -5.1) Common Inode Header - -All Inodes share a common header, which contains some common information, -as well as describing the type of Inode which follows. This header has the -following structure: - -+======+==============+======================================================+ -| Type | Name | Description | -+======+==============+======================================================+ -| u16 | type | The type of item described by the inode which follows| -| | | this header. | -| | | | -| | +-------+----------------------------------------------+ -| | | Value | Comment | -| | +-------+----------------------------------------------+ -| | | 1 | Basic Directory | -| | | 2 | Basic File | -| | | 3 | Basic Symlink | -| | | 4 | Basic Block Device | -| | | 5 | Basic Character Device | -| | | 6 | Basic Named Pipe (FIFO) | -| | | 7 | Basic Socket | -| | | 8 | Extended Directory | -| | | 9 | Extended File | -| | | 10 | Extended Symlink | -| | | 11 | Extended Block Device | -| | | 12 | Extended Character Device | -| | | 13 | Extended Named Pipe (FIFO) | -| | | 14 | Extended Socket | -+------+--------------+-------+----------------------------------------------+ -| u16 | permissions | A bit mask representing Unix file system permissions | -| | | for the inode. This only stores permissions, not the | -| | | type. The type is reconstructed from the field above.| -+------+--------------+------------------------------------------------------+ -| u16 | uid | An index into the ID table, giving the user ID | -| | | of the owner. | -+------+--------------+------------------------------------------------------+ -| u16 | gid | An index into the ID table, giving the group ID | -| | | of the owner. 
| -+------+--------------+------------------------------------------------------+ -| u32 | mtime | The unsigned number of seconds (not counting leap | -| | | seconds) since 00:00, Jan 1st, 1970 UTC when the item| -| | | described by the inode was last modified. | -+------+--------------+------------------------------------------------------+ -| u32 | inode number | Unique node number. Must be at least 1 and at most | -| | | the inode count from the super block. | -+------+--------------+------------------------------------------------------+ - -Depending on the type, additional data follows, outlined in sections 5.2 -to 5.6. - - - -5.2) Directory inodes - -Directory inodes mainly contain a reference into the directory table where -the listing of entries is stored. - -A basic directory has an entry listing of at most 64k (uncompressed) and -no extended attributes. The layout of the inode data is as follows: - -+======+==============+======================================================+ -| Type | Name | Description | -+======+==============+======================================================+ -| u32 | block index | The location of the metadata block in the directory | -| | | table where the entry information starts. This is | -| | | relative to the directory table location. | -+------+--------------+------------------------------------------------------+ -| u32 | link count | The number of hard links to this directory. | -+------+--------------+------------------------------------------------------+ -| u16 | file size | Total (uncompressed) size in bytes of the entry | -| | | listing in the directory table, including headers. | -| | | | -| | | This value is 3 bytes larger than the real listing. | -| | | The Linux kernel creates "." and ".." entries for | -| | | offsets 0 and 1, and only after 3 looks into the | -| | | listing, subtracting 3 from the size. 
| -+------+--------------+------------------------------------------------------+ -| u16 | block offset | The (uncompressed) offset within the metadata block | -| | | in the directory table where the directory listing | -| | | starts. | -+------+--------------+------------------------------------------------------+ -| u32 | parent inode | The inode number of the parent of this directory. If | -| | | this is the root directory, this SHOULD be 0. | -+------+--------------+------------------------------------------------------+ - - -Note that for historical reasons, the hard link count of a directory includes -the number of entries in the directory and is initialized to 2 for an empty -directory. I.e. a directory with N entries has at least N + 2 link count. - - -If the "file size" is set to a value < 4, the directory is empty and there is -no corresponding listing in the directory table. - - -An extended directory can have a listing that is at most 4GiB in size, may -have extended attributes and can have an optional index for faster lookup: - -+======+==============+======================================================+ -| Type | Name | Description | -+======+==============+======================================================+ -| u32 | link count | Same as above. | -+------+--------------+------------------------------------------------------+ -| u32 | file size | Same as above. | -+------+--------------+------------------------------------------------------+ -| u32 | block index | Same as above. | -+------+--------------+------------------------------------------------------+ -| u32 | parent inode | Same as above. | -+------+--------------+------------------------------------------------------+ -| u16 | index count | The number of directory index entries following the | -| | | inode structure. | -+------+--------------+------------------------------------------------------+ -| u16 | block offset | Same as above. 
| -+------+--------------+------------------------------------------------------+ -| u32 | xattr index | An index into the xattr lookup table or 0xFFFFFFFF | -| | | if the inode has no extended attributes. | -+------+--------------+------------------------------------------------------+ - -The index follows directly after the inode. See section 6.1 for details on -how the directory index is structured. - - -5.3) File Inodes - -Basic files can be at most 4 GiB in size (uncompressed), must be located -within the first 4 GiB of the SquashFS image, cannot have any extended -attributes and don't support hard-link or sparse file accounting: - -+======+==============+======================================================+ -| Type | Name | Description | -+======+==============+======================================================+ -| u32 | blocks start | The offset from the start of the archive to the first| -| | | data block. | -+------+--------------+------------------------------------------------------+ -| u32 | frag index | An index into the fragment table which describes the | -| | | fragment block that the tail end of this file is | -| | | stored in. If not used, this is set to 0xFFFFFFFF. | -+------+--------------+------------------------------------------------------+ -| u32 | block offset | The (uncompressed) offset within the fragment block | -| | | where the tail end of this file is. See section 4 | -| | | for details. | -+------+--------------+------------------------------------------------------+ -| u32 | file size | The (uncompressed) size of this file. | -+------+--------------+------------------------------------------------------+ -| u32[]| block sizes | An array of consecutive block sizes. See section 4 | -| | | for details. 
| -+------+--------------+------------------------------------------------------+ - -If 'frag index' is set to 0xFFFFFFFF, the number of blocks is computed as - - ceil(file_size / block_size) - -otherwise, if 'frag index' is a valid fragment index, the block count is -computed as - - floor(file_size / block_size) - -and the size of the tail end is - - file_size % block_size - - -To access a data block, first compute the block index as - - index = floor(offset / block_size) - -then compute the on-disk location of the block by summing up the sizes of the -blocks that come before it: - - location = block_start - - for i = 0; i < index; i++ - location += block_sizes[i] & 0x00FFFFFF - - -The tail end, if present, is accessed by resolving the fragment index through -the fragment lookup table (see section 7), loading the fragment block and -using the given 'block offset' into the fragment block. - - - -Extended files have a 64 bit location and size, have additional counters for -sparse file accounting and hard links, and can have extended attributes: - - -+======+==============+======================================================+ -| Type | Name | Description | -+======+==============+======================================================+ -| u64 | blocks start | Same as above (but larger). | -+------+--------------+------------------------------------------------------+ -| u64 | file size | Same as above (but larger). | -+------+--------------+------------------------------------------------------+ -| u64 | sparse | The number of bytes saved by omitting zero bytes. | -| | | Used in the kernel for sparse file accounting. | -+------+--------------+------------------------------------------------------+ -| u32 | link count | The number of hard links to this node. | -+------+--------------+------------------------------------------------------+ -| u32 | frag index | Same as above. 
| -+------+--------------+------------------------------------------------------+ -| u32 | block offset | Same as above. | -+------+--------------+------------------------------------------------------+ -| u32 | xattr index | An index into the xattr lookup table or 0xFFFFFFFF | -| | | if the inode has no extended attributes. | -+------+--------------+------------------------------------------------------+ -| u32[]| block sizes | Same as above. | -+------+--------------+------------------------------------------------------+ - - -5.4) Symbolic Links - -Symbolic links mainly have a target path stored directly after the inode -header, as well as a hard-link counter (yes, you can have hard links to -symlinks): - -+======+==============+======================================================+ -| Type | Name | Description | -+======+==============+======================================================+ -| u32 | link count | The number of hard links to this symlink. | -+------+--------------+------------------------------------------------------+ -| u32 | target size | The size in bytes of the target path this symlink | -| | | points to. | -+------+--------------+------------------------------------------------------+ -| u8[] | target path | An array of bytes holding the target path this | -| | | symlink points to. The path is 'target size' bytes | -| | | long and NOT null-terminated. | -+------+--------------+------------------------------------------------------+ - -The extended symlink type adds an additional extended attribute index: - -+======+==============+=======================================+ -| Type | Name | Description | -+======+==============+=======================================+ -| u32 | link count | Same as above. | -+------+--------------+---------------------------------------+ -| u32 | target size | Same as above. | -+------+--------------+---------------------------------------+ -| u8[] | target path | Same as above. 
| -+------+--------------+---------------------------------------+ -| u32 | xattr index | An index into the xattr lookup table. | -+------+--------------+---------------------------------------+ - - -5.5) Device Special Files - -Basic device special files only store a hard-link counter and a device number. -The layout is identical for both character and block devices: - -+======+===============+=====================================================+ -| Type | Name | Description | -+======+===============+=====================================================+ -| u32 | link count | The number of hard links to this entry. | -+------+---------------+-----------------------------------------------------+ -| u32 | device number | The system specific device number. | -| | | | -| | | On Linux, this consists of major and minor device | -| | | numbers that can be extracted as follows: | -| | | major = (dev & 0xFFF00) >> 8. | -| | | minor = (dev & 0x000FF) | ((dev >> 12) & 0xFFF00) | -+------+---------------+-----------------------------------------------------+ - -The extended device file inode adds an additional extended attribute index: - -+======+===============+=========================================+ -| Type | Name | Description | -+======+===============+=========================================+ -| u32 | link count | Same as above. | -+------+---------------+-----------------------------------------+ -| u32 | device number | Same as above. | -+------+---------------+-----------------------------------------+ -| u32 | xattr index | An index into the xattr lookup table. 
|
-+------+---------------+-----------------------------------------+
-
-
-5.6) IPC inodes (FIFO or Socket)
-
-Named pipe (FIFO) and socket special files only add a hard-link counter
-after the inode header:
-
-+======+=============+=========================================+
-| Type | Name        | Description                             |
-+======+=============+=========================================+
-| u32  | link count  | The number of hard links to this entry. |
-+------+-------------+-----------------------------------------+
-
-The extended versions add an extended attribute index:
-
-+======+=============+=========================================+
-| Type | Name        | Description                             |
-+======+=============+=========================================+
-| u32  | link count  | Same as above.                          |
-+------+-------------+-----------------------------------------+
-| u32  | xattr index | An index into the xattr lookup table.   |
-+------+-------------+-----------------------------------------+
-
-
-
-6) Directory Table
-******************
-
-For each directory inode, the directory table stores a linear list of all
-entries, with references back to the inodes that describe those entries.
-
-The entry list itself is sorted ASCIIbetically by entry name and split into
-multiple runs, each preceded by a short header.
-
-The directory inodes store the total, uncompressed size of the entire listing,
-including headers. Using this size, a SquashFS reader can determine if another
-header with further entries should be following once it reaches the end of a
-run.
-
-To save space, the header indicates a metadata block and a reference inode
-number. The entries that follow simply store a difference to that inode number
-and an offset into the specified metadata block.
-
-Every time the inode block changes, or the difference of the inode number
-to the reference in the header cannot be encoded in 16 bits anymore, a new
-header is emitted.
-
-A header must be followed by AT MOST 256 entries.
If there are more entries,
-a new header MUST be emitted.
-
-Typically, inode allocation strategies would sort the children of a directory
-and then allocate inode numbers incrementally, to optimize directory entry
-listings.
-
-Since hard links might be further away than +/- 32k from the reference
-number, they might require a new header to be emitted. Inode number allocation
-and picking of the reference could of course be optimized to prevent this.
-
-The directory header has the following structure:
-
-+======+==============+======================================================+
-| Type | Name         | Description                                          |
-+======+==============+======================================================+
-| u32  | count        | Number of entries following the header               |
-+------+--------------+------------------------------------------------------+
-| u32  | start        | The location of the metadata block in the inode table|
-|      |              | where the inodes are stored. This is relative to the |
-|      |              | inode table start from the super block.              |
-+------+--------------+------------------------------------------------------+
-| s32  | inode number | An arbitrary inode number. The entries that follow   |
-|      |              | store their inode number as a difference to this.    |
-+======+==============+======================================================+
-
-The counter is stored off-by-one, i.e. a value of 0 indicates 1 entry follows.
-This also makes it impossible to encode a size of 0, which wouldn't make any
-sense. Empty directories simply have their size set to 0 in the inode instead,
-so no extra dummy header has to be stored or looked up.
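
To make the off-by-one counter concrete, here is a minimal Python sketch that
decodes a directory header from an already uncompressed buffer. The name
parse_dir_header and its return layout are assumptions made for this example
only.

```python
import struct

# count (u32), start (u32), inode number (s32), all little endian
DIR_HEADER = struct.Struct("<IIi")


def parse_dir_header(buf, pos=0):
    """Return (entry_count, start, inode_number, next_pos) for one header."""
    count, start, inode_number = DIR_HEADER.unpack_from(buf, pos)
    # The count is stored off-by-one: 0 on disk means one entry follows.
    return count + 1, start, inode_number, pos + DIR_HEADER.size
```

The returned next_pos points at the first entry that follows the header.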
-
-
-The header is followed by multiple entries that each have this structure:
-
-+======+==============+======================================================+
-| Type | Name         | Description                                          |
-+======+==============+======================================================+
-| u16  | offset       | An offset into the uncompressed inode metadata block.|
-+------+--------------+------------------------------------------------------+
-| s16  | inode offset | The difference of this inode's number to the         |
-|      |              | reference stored in the header.                      |
-+------+--------------+------------------------------------------------------+
-| u16  | type         | The inode type. For extended inodes, the basic type  |
-|      |              | is stored here instead.                              |
-+------+--------------+------------------------------------------------------+
-| u16  | name size    | One less than the size of the entry name.            |
-+------+--------------+------------------------------------------------------+
-| u8[] | name         | The file name of the entry without a trailing null   |
-|      |              | byte. Has 'name size' + 1 bytes.                     |
-+------+--------------+------------------------------------------------------+
-
-In the entry structure itself, the file names are stored without trailing null
-bytes. Since a zero length name makes no sense, the name length is stored
-off-by-one, i.e. the value 0 cannot be encoded.
-
-The inode type is stored in the entry, but always as the corresponding
-basic type.
-
-While the field is technically 16 bits, the kernel implementation currently
-imposes an arbitrary limit of 255 on the name size field. Since the field is
-off-by-one, this means that a file name in SquashFS can be at most 256
-characters long.
-
-
-6.1) Directory Index
-
-To speed up lookups on directories with lots of entries, the extended
-directory inode can store an index, holding the locations of all directory
-headers and the name of the first entry after the header.
- -When searching for an entry, the reader can then iterate over the index to -find a range of metadata blocks that should contain a given entry and then -only scan over the given range. - -To allow for even faster lookups, a new header should be emitted every time -the entry list crosses a metadata block boundary. This narrows the boundary -down to a single metadata block lookup in most cases. - -The index entries have the following structure: - -+======+==============+======================================================+ -| Type | Name | Description | -+======+==============+======================================================+ -| u32 | index | This stores a byte offset from the first directory | -| | | header to the current header, as if the uncompressed | -| | | directory metadata blocks were laid out in memory | -| | | consecutively. | -+------+--------------+------------------------------------------------------+ -| u32 | start | Start offset of a directory table metadata block, | -| | | relative to the directory table start. | -+------+--------------+------------------------------------------------------+ -| u32 | name size | One less than the size of the entry name. | -+------+--------------+------------------------------------------------------+ -| u8[] | name | The name of the first entry following the header | -| | | without a trailing null byte. | -+------+--------------+------------------------------------------------------+ - - -7) Fragment Table -***************** - -Tail-ends and smaller than block size files can be combined into fragment -blocks that are at most 'block size' bytes long. - -The fragment table describes the location and size of the fragment blocks -(not the tail-ends within them). 
-
-
-This is a lookup table which stores entries of the following shape:
-
-+======+==============+======================================================+
-| Type | Name         | Description                                          |
-+======+==============+======================================================+
-| u64  | start        | The offset within the archive where the fragment     |
-|      |              | block starts.                                        |
-+------+--------------+------------------------------------------------------+
-| u32  | size         | The on-disk size of the fragment block. If the block |
-|      |              | is uncompressed, bit 24 (i.e. 1 << 24) is set.       |
-+------+--------------+------------------------------------------------------+
-| u32  | unused       | SHOULD be set to 0.                                  |
-+------+--------------+------------------------------------------------------+
-
-
-The table is stored on-disk as described in section 2.3.
-
-The fragment table location in the superblock points to an array of 64 bit
-integers that store the on-disk locations of the metadata blocks containing
-the lookup table.
-
-Each metadata block can store up to 512 entries (= 8192 / 16).
-
-The "unused" field is there for alignment and SHOULD be set to 0. However, the
-Linux kernel currently ignores this field completely, making it impossible for
-Linux to ever re-purpose this field.
-
-
-8) Export Table
-***************
-
-To support NFS exports, SquashFS needs a fast way to resolve an inode number
-to an inode structure.
-
-For this purpose, a SquashFS archive can optionally contain an export table,
-which is basically a flat array of 64 bit inode references, with the inode
-number being used as an index into the array.
-
-Because the inode number 0 is not used (reserved as a sentinel value), the
-array actually starts at inode number 1 and the index is thus
-inode_number - 1.
-
-The array itself is stored in a series of metadata blocks, as outlined in
-section 2.3.
-
-Since each block can store 1024 references (= 8192 / 8), there will be
-ceil(inode_count / 1024) metadata blocks for the entire array.
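
The addressing arithmetic for the fragment and export tables can be sketched
in Python as follows. All names are hypothetical, and masking the fragment
size with 0x00FFFFFF is an assumption carried over from the data block size
words in section 5.3; the text above only states that bit 24 marks an
uncompressed block.

```python
import struct

METADATA_BLOCK_SIZE = 8192  # uncompressed payload of a full metadata block


def locate_entry(index, entry_size):
    """Return (metadata_block_number, byte_offset) of a lookup table entry."""
    per_block = METADATA_BLOCK_SIZE // entry_size
    return index // per_block, (index % per_block) * entry_size


def locate_export_entry(inode_number):
    """Export table: 8 byte references, indexed by inode_number - 1."""
    if inode_number < 1:
        raise ValueError("inode numbers start at 1")
    return locate_entry(inode_number - 1, 8)


def parse_fragment_entry(buf, pos=0):
    """Decode a 16 byte fragment entry into (start, on_disk_size, is_compressed)."""
    start, size, _unused = struct.unpack_from("<QII", buf, pos)
    return start, size & 0x00FFFFFF, (size & (1 << 24)) == 0
```

The metadata block number returned by locate_entry() is an index into the
array of 64 bit block locations that the superblock points to.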
- - -9) ID Table -*********** - -As outlined in section 5.1, SquashFS supports 32 bit user and group IDs. To -compact the inode table, the unique UID/GID values are collected in a lookup -table and a 16 bit table index is stored in the inode instead. - -This lookup table is stored as outlined in section 2.3. - -Each metadata block can store up to 2048 IDs (=8192 / 4). - - -10) Extended Attribute Table -**************************** - -Extended attributes are arbitrary key value pairs attached to inodes. The key -names use dots as separators to create a hierarchy of name spaces. - -The key value pairs of all inodes are stored consecutively in a series of -metadata blocks. - -The values can either be stored inline, i.e. a key entry is directly followed -by a value, or out-of-line to deduplicate identical values and use a reference -instead. Typically, the first occurrence of a value is stored in line and -every consecutive use of the same value uses an out-of-line reference back to -the first one. - - -The keys are stored using the following data structure: - -+======+===========+=========================================================+ -| Type | Name | Description | -+======+===========+=========================================================+ -| u16 | type | A prefix ID for the key name. If the value that follows | -| | | is stored out-of-line, the flag 0x0100 is ORed to the | -| | | type ID. | -| | | | -| | +-------+-------------------------------------------------+ -| | | Value | Comment | -| | +-------+-------------------------------------------------+ -| | | 0 | Prefix the name with "user." | -| | | 1 | Prefix the name with "trusted." | -| | | 2 | Prefix the name with "security." | -+------+-----------+-------+-------------------------------------------------+ -| u16 | name size | The number of key bytes that follows. 
|
-+------+-----------+---------------------------------------------------------+
-| u8[] | name      | The remainder of the key without the prefix and without |
-|      |           | trailing null byte.                                     |
-+------+-----------+---------------------------------------------------------+
-
-
-Following the key, this structure is used to store the value:
-
-+======+============+========================================================+
-| Type | Name       | Description                                            |
-+======+============+========================================================+
-| u32  | value size | The size of the value string. If the value is stored   |
-|      |            | out of line, this is always 8, i.e. the size of an     |
-|      |            | unsigned 64 bit integer.                               |
-+------+------------+--------------------------------------------------------+
-| u8[] | value      | This is 'value size' bytes of arbitrary binary data.   |
-|      |            | If the value is stored out-of-line, this is a 64 bit   |
-|      |            | reference, i.e. a location of a metadata block,        |
-|      |            | shifted left by 16 and OR-ed with an offset into the   |
-|      |            | uncompressed block, giving the location of another     |
-|      |            | value structure.                                       |
-+------+------------+--------------------------------------------------------+
-
-
-The metadata block location given by an out-of-line reference is relative to
-the location of the first block.
-
-
-To actually address a block of key value pairs associated with an inode, a
-lookup table is used that specifies the start and size of a sequence of key
-value pairs.
-
-All an inode needs to store is a 32 bit index into this table. If two inodes
-have identical attribute sets, the key/value sequence is written only once,
-there is only one lookup table entry and both inodes have the same index.
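
The key and value structures described above can be decoded roughly as shown
in this Python sketch. It works on an already uncompressed buffer, the
function names are invented for the example, and decoding the key name as
UTF-8 is an assumption; the format itself only defines a byte string.

```python
import struct

PREFIXES = {0: "user.", 1: "trusted.", 2: "security."}
OUT_OF_LINE_FLAG = 0x0100


def split_ref(ref):
    """Split a 64 bit reference into (metadata block location, offset)."""
    return ref >> 16, ref & 0xFFFF


def parse_pair(buf, pos=0):
    """Return (key, value_or_ref, is_out_of_line, next_pos) for one pair."""
    key_type, name_size = struct.unpack_from("<HH", buf, pos)
    pos += 4
    name = bytes(buf[pos:pos + name_size]).decode("utf-8", "replace")
    pos += name_size
    (value_size,) = struct.unpack_from("<I", buf, pos)
    pos += 4
    value = bytes(buf[pos:pos + value_size])
    pos += value_size
    out_of_line = bool(key_type & OUT_OF_LINE_FLAG)
    key = PREFIXES[key_type & ~OUT_OF_LINE_FLAG] + name
    if out_of_line:
        # The 8 byte value is a reference to another value structure.
        (ref,) = struct.unpack("<Q", value)
        return key, split_ref(ref), True, pos
    return key, value, False, pos
```

For an out-of-line pair, the block location in the returned tuple is relative
to the first key/value metadata block, as stated above.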
-
-Each lookup table entry has the following structure:
-
-+======+============+========================================================+
-| Type | Name       | Description                                            |
-+======+============+========================================================+
-| u64  | xattr ref  | A reference to the start of the key value block, i.e.  |
-|      |            | the metadata block location shifted left by 16, OR-ed  |
-|      |            | with an offset into the uncompressed block.            |
-+------+------------+--------------------------------------------------------+
-| u32  | count      | The number of key value pairs.                         |
-+------+------------+--------------------------------------------------------+
-| u32  | size       | The exact, uncompressed size in bytes of the entire    |
-|      |            | block of key value pairs, counting what has been       |
-|      |            | written to disk and including the key/value entry      |
-|      |            | structures.                                            |
-+------+------------+--------------------------------------------------------+
-
-This lookup table is stored as outlined in section 2.3.
-
-Each metadata block can hold 512 (= 8192 / 16) entries.
-
-However, in contrast to section 2.3, additional data is given before the list
-of metadata block locations, to locate the key-value pairs, as well as the
-actual number of lookup table entries, which is not specified in the super
-block.
-
-
-The 'Xattr table' entry in the superblock gives the absolute location of the
-following data structure which is stored on-disk as is, uncompressed:
-
-+=======+===========+========================================================+
-| Type  | Name      | Description                                            |
-+=======+===========+========================================================+
-| u64   | kv start  | The absolute position of the first metadata block      |
-|       |           | holding the key/value pairs.                           |
-+-------+-----------+--------------------------------------------------------+
-| u32   | count     | The number of entries in the lookup table.
|
-+-------+-----------+--------------------------------------------------------+
-| u32   | unused    | SHOULD be set to 0, however Linux currently ignores    |
-|       |           | this field completely and squashfs-tools used to leak  |
-|       |           | stack data here, making it impossible for Linux to     |
-|       |           | ever re-purpose this field.                            |
-+-------+-----------+--------------------------------------------------------+
-| u64[] | locations | An array holding the absolute on-disk location of each |
-|       |           | metadata block of the lookup table.                    |
-+-------+-----------+--------------------------------------------------------+
-
-If an inode has a valid xattr index (i.e. not 0xFFFFFFFF), the metadata
-block index is computed as
-
-    block_idx = floor(index / 512)
-
-which is then used to retrieve the on-disk location of the metadata block
-from the locations array.
-
-Once the block has been read from disk and uncompressed, the byte offset into
-the metadata block can be computed as
-
-    offset = (index * 16) % 8192
-
-From this position, the structure can be read that holds a reference to the
-metadata block that contains the key/value pairs (and byte offset into the
-uncompressed block where the pairs start), as well as the number of key/value
-pairs and their total, uncompressed size.
-
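
The two formulas above translate directly into code. The following Python
sketch is only an illustration: locate_xattr_entry and parse_xattr_entry are
hypothetical names, and reading and decompressing the metadata block is left
to the caller.

```python
import struct

ENTRY_SIZE = 16
ENTRIES_PER_BLOCK = 8192 // ENTRY_SIZE  # 512 entries per metadata block
NO_XATTR = 0xFFFFFFFF


def locate_xattr_entry(index, locations):
    """Return (metadata block location, byte offset), or None without xattrs."""
    if index == NO_XATTR:
        return None
    block_idx = index // ENTRIES_PER_BLOCK
    offset = (index * ENTRY_SIZE) % 8192
    return locations[block_idx], offset


def parse_xattr_entry(block, offset):
    """Decode a lookup table entry from an uncompressed metadata block."""
    ref, count, size = struct.unpack_from("<QII", block, offset)
    # The reference packs a block location (upper bits) and a 16 bit offset.
    return (ref >> 16, ref & 0xFFFF), count, size
```

Here locations stands for the u64 array that follows the header structure
shown in the last table.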