Diffstat (limited to 'Documentation/filesystems/ext4/ondisk')
23 files changed, 0 insertions, 3747 deletions
diff --git a/Documentation/filesystems/ext4/ondisk/about.rst b/Documentation/filesystems/ext4/ondisk/about.rst deleted file mode 100644 index 0aadba0522644..0000000000000 --- a/Documentation/filesystems/ext4/ondisk/about.rst +++ /dev/null @@ -1,44 +0,0 @@ -.. SPDX-License-Identifier: GPL-2.0 - -About this Book -=============== - -This document attempts to describe the on-disk format for ext4 -filesystems. The same general ideas should apply to ext2/3 filesystems -as well, though they do not support all the features that ext4 supports, -and the fields will be shorter. - -**NOTE**: This is a work in progress, based on notes that the author -(djwong) made while picking apart a filesystem by hand. The data -structure definitions should be current as of Linux 4.18 and -e2fsprogs-1.44. All comments and corrections are welcome, since there is -undoubtedly plenty of lore that might not be reflected in freshly -created demonstration filesystems. - -License -------- -This book is licensed under the terms of the GNU Public License, v2. - -Terminology ------------ - -ext4 divides a storage device into an array of logical blocks both to -reduce bookkeeping overhead and to increase throughput by forcing larger -transfer sizes. Generally, the block size will be 4KiB (the same size as -pages on x86 and the block layer's default block size), though the -actual size is calculated as 2 ^ (10 + ``sb.s_log_block_size``) bytes. -Throughout this document, disk locations are given in terms of these -logical blocks, not raw LBAs, and not 1024-byte blocks. For the sake of -convenience, the logical block size will be referred to as -``$block_size`` throughout the rest of the document. - -When referenced in ``preformatted text`` blocks, ``sb`` refers to fields -in the super block, and ``inode`` refers to fields in an inode table -entry. - -Other References ----------------- - -Also see http://www.nongnu.org/ext2-doc/ for quite a collection of -information about ext2/3. 
Here's another old reference: -http://wiki.osdev.org/Ext2 diff --git a/Documentation/filesystems/ext4/ondisk/allocators.rst b/Documentation/filesystems/ext4/ondisk/allocators.rst deleted file mode 100644 index 7aa85152ace3d..0000000000000 --- a/Documentation/filesystems/ext4/ondisk/allocators.rst +++ /dev/null @@ -1,56 +0,0 @@ -.. SPDX-License-Identifier: GPL-2.0 - -Block and Inode Allocation Policy ---------------------------------- - -ext4 recognizes (better than ext3, anyway) that data locality is -generally a desirable quality of a filesystem. On a spinning disk, -keeping related blocks near each other reduces the amount of movement -that the head actuator and disk must perform to access a data block, -thus speeding up disk IO. On an SSD there of course are no moving parts, -but locality can increase the size of each transfer request while -reducing the total number of requests. This locality may also have the -effect of concentrating writes on a single erase block, which can speed -up file rewrites significantly. Therefore, it is useful to reduce -fragmentation whenever possible. - -The first tool that ext4 uses to combat fragmentation is the multi-block -allocator. When a file is first created, the block allocator -speculatively allocates 8KiB of disk space to the file on the assumption -that the space will get written soon. When the file is closed, the -unused speculative allocations are of course freed, but if the -speculation is correct (typically the case for full writes of small -files) then the file data gets written out in a single multi-block -extent. A second related trick that ext4 uses is delayed allocation. -Under this scheme, when a file needs more blocks to absorb file writes, -the filesystem defers deciding the exact placement on the disk until all -the dirty buffers are being written out to disk. 
By not committing to a -particular placement until it's absolutely necessary (the commit timeout -is hit, or sync() is called, or the kernel runs out of memory), the hope -is that the filesystem can make better location decisions. - -The third trick that ext4 (and ext3) uses is that it tries to keep a -file's data blocks in the same block group as its inode. This cuts down -on the seek penalty when the filesystem first has to read a file's inode -to learn where the file's data blocks live and then seek over to the -file's data blocks to begin I/O operations. - -The fourth trick is that all the inodes in a directory are placed in the -same block group as the directory, when feasible. The working assumption -here is that all the files in a directory might be related, therefore it -is useful to try to keep them all together. - -The fifth trick is that the disk volume is cut up into 128MB block -groups; these mini-containers are used as outlined above to try to -maintain data locality. However, there is a deliberate quirk -- when a -directory is created in the root directory, the inode allocator scans -the block groups and puts that directory into the least heavily loaded -block group that it can find. This encourages directories to spread out -over a disk; as the top-level directory/file blobs fill up one block -group, the allocators simply move on to the next block group. Allegedly -this scheme evens out the loading on the block groups, though the author -suspects that the directories which are so unlucky as to land towards -the end of a spinning drive get a raw deal performance-wise. - -Of course if all of these mechanisms fail, one can always use e4defrag -to defragment files. diff --git a/Documentation/filesystems/ext4/ondisk/attributes.rst b/Documentation/filesystems/ext4/ondisk/attributes.rst deleted file mode 100644 index 0b01b67b81fe5..0000000000000 --- a/Documentation/filesystems/ext4/ondisk/attributes.rst +++ /dev/null @@ -1,191 +0,0 @@ -.. 
SPDX-License-Identifier: GPL-2.0 - -Extended Attributes -------------------- - -Extended attributes (xattrs) are typically stored in a separate data -block on the disk and referenced from inodes via ``inode.i_file_acl*``. -The first use of extended attributes seems to have been for storing file -ACLs and other security data (selinux). With the ``user_xattr`` mount -option it is possible for users to store extended attributes so long as -all attribute names begin with “user”; this restriction seems to have -disappeared as of Linux 3.0. - -There are two places where extended attributes can be found. The first -place is between the end of each inode entry and the beginning of the -next inode entry. For example, if inode.i\_extra\_isize = 28 and -sb.inode\_size = 256, then there are 256 - (128 + 28) = 100 bytes -available for in-inode extended attribute storage. The second place -where extended attributes can be found is in the block pointed to by -``inode.i_file_acl``. As of Linux 3.11, it is not possible for this -block to contain a pointer to a second extended attribute block (or even -the remaining blocks of a cluster). In theory it is possible for each -attribute's value to be stored in a separate data block, though as of -Linux 3.11 the code does not permit this. - -Keys are generally assumed to be ASCIIZ strings, whereas values can be -strings or binary data. - -Extended attributes, when stored after the inode, have a header -``ext4_xattr_ibody_header`` that is 4 bytes long: - -.. list-table:: - :widths: 1 1 1 77 - :header-rows: 1 - - * - Offset - - Type - - Name - - Description - * - 0x0 - - \_\_le32 - - h\_magic - - Magic number for identification, 0xEA020000. This value is set by the - Linux driver, though e2fsprogs doesn't seem to check it(?) - -The beginning of an extended attribute block is in -``struct ext4_xattr_header``, which is 32 bytes long: - -.. 
list-table:: - :widths: 1 1 1 77 - :header-rows: 1 - - * - Offset - - Type - - Name - - Description - * - 0x0 - - \_\_le32 - - h\_magic - - Magic number for identification, 0xEA020000. - * - 0x4 - - \_\_le32 - - h\_refcount - - Reference count. - * - 0x8 - - \_\_le32 - - h\_blocks - - Number of disk blocks used. - * - 0xC - - \_\_le32 - - h\_hash - - Hash value of all attributes. - * - 0x10 - - \_\_le32 - - h\_checksum - - Checksum of the extended attribute block. - * - 0x14 - - \_\_u32 - - h\_reserved[3] - - Zero. - -The checksum is calculated against the FS UUID, the 64-bit block number -of the extended attribute block, and the entire block (header + -entries). - -Following the ``struct ext4_xattr_header`` or -``struct ext4_xattr_ibody_header`` is an array of -``struct ext4_xattr_entry``; each of these entries is at least 16 bytes -long. When stored in an external block, the ``struct ext4_xattr_entry`` -entries must be stored in sorted order. The sort order is -``e_name_index``, then ``e_name_len``, and finally ``e_name``. -Attributes stored inside an inode do not need to be stored in sorted order. - -.. list-table:: - :widths: 1 1 1 77 - :header-rows: 1 - - * - Offset - - Type - - Name - - Description - * - 0x0 - - \_\_u8 - - e\_name\_len - - Length of name. - * - 0x1 - - \_\_u8 - - e\_name\_index - - Attribute name index. There is a discussion of this below. - * - 0x2 - - \_\_le16 - - e\_value\_offs - - Location of this attribute's value on the disk block where it is stored. - Multiple attributes can share the same value. For an inode attribute - this value is relative to the start of the first entry; for a block this - value is relative to the start of the block (i.e. the header). - * - 0x4 - - \_\_le32 - - e\_value\_inum - - The inode where the value is stored. Zero indicates the value is in the - same block as this entry. This field is only used if the - INCOMPAT\_EA\_INODE feature is enabled. 
- * - 0x8 - - \_\_le32 - - e\_value\_size - - Length of attribute value. - * - 0xC - - \_\_le32 - - e\_hash - - Hash value of attribute name and attribute value. The kernel doesn't - update the hash for in-inode attributes, so for that case this value - must be zero, because e2fsck validates any non-zero hash regardless of - where the xattr lives. - * - 0x10 - - char - - e\_name[e\_name\_len] - - Attribute name. Does not include trailing NULL. - -Attribute values can follow the end of the entry table. There appears to -be a requirement that they be aligned to 4-byte boundaries. The values -are stored starting at the end of the block and grow towards the -xattr\_header/xattr\_entry table. When the two collide, the overflow is -put into a separate disk block. If the disk block fills up, the -filesystem returns -ENOSPC. - -The first four fields of the ``ext4_xattr_entry`` are set to zero to -mark the end of the key list. - -Attribute Name Indices -~~~~~~~~~~~~~~~~~~~~~~ - -Logically speaking, extended attributes are a series of key=value pairs. -The keys are assumed to be NULL-terminated strings. To reduce the amount -of on-disk space that the keys consume, the beginning of the key string -is matched against the attribute name index. If a match is found, the -attribute name index field is set, and matching string is removed from -the key name. Here is a map of name index values to key prefixes: - -.. list-table:: - :widths: 1 79 - :header-rows: 1 - - * - Name Index - - Key Prefix - * - 0 - - (no prefix) - * - 1 - - “user.” - * - 2 - - “system.posix\_acl\_access” - * - 3 - - “system.posix\_acl\_default” - * - 4 - - “trusted.” - * - 6 - - “security.” - * - 7 - - “system.” (inline\_data only?) - * - 8 - - “system.richacl” (SuSE kernels only?) - -For example, if the attribute key is “user.fubar”, the attribute name -index is set to 1 and the “fubar” name is recorded on disk. 
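The prefix substitution described above can be sketched in a few lines of Python. This is only an illustration: the table is copied from the list of name indices above, while ``split_key`` and ``join_key`` are hypothetical helper names, and choosing the longest matching prefix is an assumption made here so that “system.posix\_acl\_access” is not shadowed by the shorter “system.” prefix.

```python
# Name index -> key prefix, copied from the table above.
XATTR_PREFIXES = {
    0: "",
    1: "user.",
    2: "system.posix_acl_access",
    3: "system.posix_acl_default",
    4: "trusted.",
    6: "security.",
    7: "system.",
    8: "system.richacl",
}


def split_key(key):
    """Return (name_index, stored_name) for a full attribute key.

    Assumption: the longest matching prefix wins; index 0 means no prefix.
    """
    best = 0
    for index, prefix in XATTR_PREFIXES.items():
        if prefix and key.startswith(prefix) and len(prefix) > len(XATTR_PREFIXES[best]):
            best = index
    return best, key[len(XATTR_PREFIXES[best]):]


def join_key(name_index, stored_name):
    """Rebuild the full attribute key from the on-disk index and name."""
    return XATTR_PREFIXES[name_index] + stored_name
```

With this sketch, ``split_key("user.fubar")`` yields index 1 and the stored name ``fubar``, matching the example in the text, and ``join_key`` reverses the mapping.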
- -POSIX ACLs -~~~~~~~~~~ - -POSIX ACLs are stored in a reduced version of the Linux kernel (and -libacl's) internal ACL format. The key difference is that the version -number is different (1) and the ``e_id`` field is only stored for named -user and group ACLs. diff --git a/Documentation/filesystems/ext4/ondisk/bigalloc.rst b/Documentation/filesystems/ext4/ondisk/bigalloc.rst deleted file mode 100644 index c6d88557553c6..0000000000000 --- a/Documentation/filesystems/ext4/ondisk/bigalloc.rst +++ /dev/null @@ -1,22 +0,0 @@ -.. SPDX-License-Identifier: GPL-2.0 - -Bigalloc --------- - -At the moment, the default size of a block is 4KiB, which is a commonly -supported page size on most MMU-capable hardware. This is fortunate, as -ext4 code is not prepared to handle the case where the block size -exceeds the page size. However, for a filesystem of mostly huge files, -it is desirable to be able to allocate disk blocks in units of multiple -blocks to reduce both fragmentation and metadata overhead. The -`bigalloc <Bigalloc>`__ feature provides exactly this ability. The -administrator can set a block cluster size at mkfs time (which is stored -in the s\_log\_cluster\_size field in the superblock); from then on, the -block bitmaps track clusters, not individual blocks. This means that -block groups can be several gigabytes in size (instead of just 128MiB); -however, the minimum allocation unit becomes a cluster, not a block, -even for directories. TaoBao had a patchset to extend the “use units of -clusters instead of blocks” to the extent tree, though it is not clear -where those patches went-- they eventually morphed into “extent tree v2” -but that code has not landed as of May 2015. - diff --git a/Documentation/filesystems/ext4/ondisk/bitmaps.rst b/Documentation/filesystems/ext4/ondisk/bitmaps.rst deleted file mode 100644 index c7546dbc197ae..0000000000000 --- a/Documentation/filesystems/ext4/ondisk/bitmaps.rst +++ /dev/null @@ -1,28 +0,0 @@ -.. 
SPDX-License-Identifier: GPL-2.0 - -Block and inode Bitmaps ------------------------ - -The data block bitmap tracks the usage of data blocks within the block -group. - -The inode bitmap records which entries in the inode table are in use. - -As with most bitmaps, one bit represents the usage status of one data -block or inode table entry. This implies a block group size of 8 \* -number\_of\_bytes\_in\_a\_logical\_block. - -NOTE: If ``BLOCK_UNINIT`` is set for a given block group, various parts -of the kernel and e2fsprogs code pretend that the block bitmap contains -zeros (i.e. all blocks in the group are free). However, it is not -necessarily the case that no blocks are in use -- if ``meta_bg`` is set, -the bitmaps and group descriptor live inside the group. Unfortunately, -ext2fs\_test\_block\_bitmap2() will return '0' for those locations, -which produces confusing debugfs output. - -Inode Table ------------ -Inode tables are statically allocated at mkfs time. Each block group -descriptor points to the start of the table, and the superblock records -the number of inodes per group. See the section on inodes for more -information. diff --git a/Documentation/filesystems/ext4/ondisk/blockgroup.rst b/Documentation/filesystems/ext4/ondisk/blockgroup.rst deleted file mode 100644 index baf888e4c06a7..0000000000000 --- a/Documentation/filesystems/ext4/ondisk/blockgroup.rst +++ /dev/null @@ -1,135 +0,0 @@ -.. SPDX-License-Identifier: GPL-2.0 - -Layout ------- - -The layout of a standard block group is approximately as follows (each -of these fields is discussed in a separate section below): - -.. 
list-table:: - :widths: 1 1 1 1 1 1 1 1 - :header-rows: 1 - - * - Group 0 Padding - - ext4 Super Block - - Group Descriptors - - Reserved GDT Blocks - - Data Block Bitmap - - inode Bitmap - - inode Table - - Data Blocks - * - 1024 bytes - - 1 block - - many blocks - - many blocks - - 1 block - - 1 block - - many blocks - - many more blocks - -For the special case of block group 0, the first 1024 bytes are unused, -to allow for the installation of x86 boot sectors and other oddities. -The superblock will start at offset 1024 bytes, whichever block that -happens to be (usually 0). However, if for some reason the block size = -1024, then block 0 is marked in use and the superblock goes in block 1. -For all other block groups, there is no padding. - -The ext4 driver primarily works with the superblock and the group -descriptors that are found in block group 0. Redundant copies of the -superblock and group descriptors are written to some of the block groups -across the disk in case the beginning of the disk gets trashed, though -not all block groups necessarily host a redundant copy (see following -paragraph for more details). If the group does not have a redundant -copy, the block group begins with the data block bitmap. Note also that -when the filesystem is freshly formatted, mkfs will allocate “reserve -GDT block” space after the block group descriptors and before the start -of the block bitmaps to allow for future expansion of the filesystem. By -default, a filesystem is allowed to increase in size by a factor of -1024x over the original filesystem size. - -The location of the inode table is given by ``grp.bg_inode_table_*``. It -is a contiguous range of blocks large enough to contain -``sb.s_inodes_per_group * sb.s_inode_size`` bytes. - -As for the ordering of items in a block group, it is generally -established that the super block and the group descriptor table, if -present, will be at the beginning of the block group. 
The bitmaps and -the inode table can be anywhere, and it is quite possible for the -bitmaps to come after the inode table, or for both to be in different -groups (flex\_bg). Leftover space is used for file data blocks, indirect -block maps, extent tree blocks, and extended attributes. - -Flexible Block Groups ---------------------- - -Starting in ext4, there is a new feature called flexible block groups -(flex\_bg). In a flex\_bg, several block groups are tied together as one -logical block group; the bitmap spaces and the inode table space in the -first block group of the flex\_bg are expanded to include the bitmaps -and inode tables of all other block groups in the flex\_bg. For example, -if the flex\_bg size is 4, then group 0 will contain (in order) the -superblock, group descriptors, data block bitmaps for groups 0-3, inode -bitmaps for groups 0-3, inode tables for groups 0-3, and the remaining -space in group 0 is for file data. The effect of this is to group the -block metadata close together for faster loading, and to enable large -files to be contiguous on disk. Backup copies of the superblock and -group descriptors are always at the beginning of block groups, even if -flex\_bg is enabled. The number of block groups that make up a flex\_bg -is given by 2 ^ ``sb.s_log_groups_per_flex``. - -Meta Block Groups ------------------ - -Without the option META\_BG, for safety concerns, all block group -descriptor copies are kept in the first block group. Given the default -128MiB (2^27 bytes) block group size and 64-byte group descriptors, ext4 -can have at most 2^27/64 = 2^21 block groups. This limits the entire -filesystem size to 2^21 ∗ 2^27 = 2^48 bytes or 256TiB. - -The solution to this problem is to use the metablock group feature -(META\_BG), which is already in ext3 for all 2.6 releases. With the -META\_BG feature, ext4 filesystems are partitioned into many metablock -groups. 
Each metablock group is a cluster of block groups whose group -descriptor structures can be stored in a single disk block. For ext4 -filesystems with 4 KB block size, a single metablock group partition -includes 64 block groups, or 8 GiB of disk space. The metablock group -feature moves the location of the group descriptors from the congested -first block group of the whole filesystem into the first group of each -metablock group itself. The backups are in the second and last group of -each metablock group. This increases the 2^21 maximum block groups limit -to the hard limit 2^32, allowing support for a 512PiB filesystem. - -The change in the filesystem format replaces the current scheme where -the superblock is followed by a variable-length set of block group -descriptors. Instead, the superblock and a single block group descriptor -block are placed at the beginning of the first, second, and last block -groups in a meta-block group. A meta-block group is a collection of -block groups which can be described by a single block group descriptor -block. Since the size of the block group descriptor structure is 32 -bytes, a meta-block group contains 32 block groups for filesystems with -a 1KB block size, and 128 block groups for filesystems with a 4KB -blocksize. Filesystems can either be created using this new block group -descriptor layout, or existing filesystems can be resized on-line, and -the field s\_first\_meta\_bg in the superblock will indicate the first -block group using this new layout. - -Please see an important note about ``BLOCK_UNINIT`` in the section about -block and inode bitmaps. - -Lazy Block Group Initialization -------------------------------- - -New in ext4 are three block group descriptor flags that -enable mkfs to skip initializing other parts of the block group -metadata. 
Specifically, the INODE\_UNINIT and BLOCK\_UNINIT flags mean -that the inode and block bitmaps for that group can be calculated and -therefore the on-disk bitmap blocks are not initialized. This is -generally the case for an empty block group or a block group containing -only fixed-location block group metadata. The INODE\_ZEROED flag means -that the inode table has been initialized; mkfs will unset this flag and -rely on the kernel to initialize the inode tables in the background. - -By not writing zeroes to the bitmaps and inode table, mkfs time is -reduced considerably. Note the feature flag is RO\_COMPAT\_GDT\_CSUM, -but the dumpe2fs output prints this as “uninit\_bg”. They are the same -thing. diff --git a/Documentation/filesystems/ext4/ondisk/blockmap.rst b/Documentation/filesystems/ext4/ondisk/blockmap.rst deleted file mode 100644 index 30e25750d88a4..0000000000000 --- a/Documentation/filesystems/ext4/ondisk/blockmap.rst +++ /dev/null @@ -1,49 +0,0 @@ -.. SPDX-License-Identifier: GPL-2.0 - -+---------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ -| i.i\_block Offset | Where It Points | -+=====================+==============================================================================================================================================================================================================================+ -| 0 to 11 | Direct map to file blocks 0 to 11. 
| -+---------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ -| 12 | Indirect block: (file blocks 12 to (``$block_size`` / 4) + 11, or 12 to 1035 if 4KiB blocks) | -| | | -| | +------------------------------+--------------------------------------------------------------------+ | -| | | Indirect Block Offset | Where It Points | | -| | +==============================+====================================================================+ | -| | | 0 to (``$block_size`` / 4) | Direct map to (``$block_size`` / 4) blocks (1024 if 4KiB blocks) | | -| | +------------------------------+--------------------------------------------------------------------+ | -+---------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ -| 13 | Double-indirect block: (file blocks ``$block_size``/4 + 12 to (``$block_size`` / 4) ^ 2 + (``$block_size`` / 4) + 11, or 1036 to 1049611 if 4KiB blocks) | -| | | -| | +--------------------------------+---------------------------------------------------------------------------------------------------------+ | -| | | Double Indirect Block Offset | Where It Points | | -| | +================================+=========================================================================================================+ | -| | | 0 to (``$block_size`` / 4) | Map to (``$block_size`` / 4) indirect blocks (1024 if 4KiB blocks) | | -| | | | | | -| | | | +------------------------------+--------------------------------------------------------------------+ | | -| | | | | Indirect Block Offset | Where It Points | | | -| | | | 
+==============================+====================================================================+ | | -| | | | | 0 to (``$block_size`` / 4) | Direct map to (``$block_size`` / 4) blocks (1024 if 4KiB blocks) | | | -| | | | +------------------------------+--------------------------------------------------------------------+ | | -| | +--------------------------------+---------------------------------------------------------------------------------------------------------+ | -+---------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ -| 14 | Triple-indirect block: (file blocks (``$block_size`` / 4) ^ 2 + (``$block_size`` / 4) + 12 to (``$block_size`` / 4) ^ 3 + (``$block_size`` / 4) ^ 2 + (``$block_size`` / 4) + 12, or 1049612 to 1074791436 if 4KiB blocks) | -| | | -| | +--------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------+ | -| | | Triple Indirect Block Offset | Where It Points | | -| | +================================+================================================================================================================================================+ | -| | | 0 to (``$block_size`` / 4) | Map to (``$block_size`` / 4) double indirect blocks (1024 if 4KiB blocks) | | -| | | | | | -| | | | +--------------------------------+---------------------------------------------------------------------------------------------------------+ | | -| | | | | Double Indirect Block Offset | Where It Points | | | -| | | | +================================+=========================================================================================================+ | | -| | | | | 0 to (``$block_size`` / 4) | Map to (``$block_size`` / 4) indirect blocks (1024 if 4KiB 
blocks) | | | -| | | | | | | | | -| | | | | | +------------------------------+--------------------------------------------------------------------+ | | | -| | | | | | | Indirect Block Offset | Where It Points | | | | -| | | | | | +==============================+====================================================================+ | | | -| | | | | | | 0 to (``$block_size`` / 4) | Direct map to (``$block_size`` / 4) blocks (1024 if 4KiB blocks) | | | | -| | | | | | +------------------------------+--------------------------------------------------------------------+ | | | -| | | | +--------------------------------+---------------------------------------------------------------------------------------------------------+ | | -| | +--------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------+ | -+---------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ diff --git a/Documentation/filesystems/ext4/ondisk/blocks.rst b/Documentation/filesystems/ext4/ondisk/blocks.rst deleted file mode 100644 index 73d4dc0f7bda8..0000000000000 --- a/Documentation/filesystems/ext4/ondisk/blocks.rst +++ /dev/null @@ -1,142 +0,0 @@ -.. SPDX-License-Identifier: GPL-2.0 - -Blocks ------- - -ext4 allocates storage space in units of “blocks”. A block is a group of -sectors between 1KiB and 64KiB, and the number of sectors must be an -integral power of 2. Blocks are in turn grouped into larger units called -block groups. Block size is specified at mkfs time and typically is -4KiB. You may experience mounting problems if block size is greater than -page size (i.e. 64KiB blocks on a i386 which only has 4KiB memory -pages). 
By default a filesystem can contain 2^32 blocks; if the '64bit' -feature is enabled, then a filesystem can have 2^64 blocks. - -For 32-bit filesystems, limits are as follows: - -.. list-table:: - :widths: 1 1 1 1 1 - :header-rows: 1 - - * - Item - - 1KiB - - 2KiB - - 4KiB - - 64KiB - * - Blocks - - 2^32 - - 2^32 - - 2^32 - - 2^32 - * - Inodes - - 2^32 - - 2^32 - - 2^32 - - 2^32 - * - File System Size - - 4TiB - - 8TiB - - 16TiB - - 256TiB - * - Blocks Per Block Group - - 8,192 - - 16,384 - - 32,768 - - 524,288 - * - Inodes Per Block Group - - 8,192 - - 16,384 - - 32,768 - - 524,288 - * - Block Group Size - - 8MiB - - 32MiB - - 128MiB - - 32GiB - * - Blocks Per File, Extents - - 2^32 - - 2^32 - - 2^32 - - 2^32 - * - Blocks Per File, Block Maps - - 16,843,020 - - 134,480,396 - - 1,074,791,436 - - 4,398,314,962,956 (really 2^32 due to field size limitations) - * - File Size, Extents - - 4TiB - - 8TiB - - 16TiB - - 256TiB - * - File Size, Block Maps - - 16GiB - - 256GiB - - 4TiB - - 256TiB - -For 64-bit filesystems, limits are as follows: - -.. list-table:: - :widths: 1 1 1 1 1 - :header-rows: 1 - - * - Item - - 1KiB - - 2KiB - - 4KiB - - 64KiB - * - Blocks - - 2^64 - - 2^64 - - 2^64 - - 2^64 - * - Inodes - - 2^32 - - 2^32 - - 2^32 - - 2^32 - * - File System Size - - 16ZiB - - 32ZiB - - 64ZiB - - 1YiB - * - Blocks Per Block Group - - 8,192 - - 16,384 - - 32,768 - - 524,288 - * - Inodes Per Block Group - - 8,192 - - 16,384 - - 32,768 - - 524,288 - * - Block Group Size - - 8MiB - - 32MiB - - 128MiB - - 32GiB - * - Blocks Per File, Extents - - 2^32 - - 2^32 - - 2^32 - - 2^32 - * - Blocks Per File, Block Maps - - 16,843,020 - - 134,480,396 - - 1,074,791,436 - - 4,398,314,962,956 (really 2^32 due to field size limitations) - * - File Size, Extents - - 4TiB - - 8TiB - - 16TiB - - 256TiB - * - File Size, Block Maps - - 16GiB - - 256GiB - - 4TiB - - 256TiB - -Note: Files not using extents (i.e. 
files using block maps) must be -placed within the first 2^32 blocks of a filesystem. Files with extents -must be placed within the first 2^48 blocks of a filesystem. It's not -clear what happens with larger filesystems. diff --git a/Documentation/filesystems/ext4/ondisk/checksums.rst b/Documentation/filesystems/ext4/ondisk/checksums.rst deleted file mode 100644 index 9d6a793b2e030..0000000000000 --- a/Documentation/filesystems/ext4/ondisk/checksums.rst +++ /dev/null @@ -1,73 +0,0 @@ -.. SPDX-License-Identifier: GPL-2.0 - -Checksums ---------- - -Starting in early 2012, metadata checksums were added to all major ext4 -and jbd2 data structures. The associated feature flag is metadata\_csum. -The desired checksum algorithm is indicated in the superblock, though as -of October 2012 the only supported algorithm is crc32c. Some data -structures did not have space to fit a full 32-bit checksum, so only the -lower 16 bits are stored. Enabling the 64bit feature increases the data -structure size so that full 32-bit checksums can be stored for many data -structures. However, existing 32-bit filesystems cannot be extended to -enable 64bit mode, at least not without the experimental resize2fs -patches to do so. - -Existing filesystems can have checksumming added by running -``tune2fs -O metadata_csum`` against the underlying device. If tune2fs -encounters directory blocks that lack sufficient empty space to add a -checksum, it will request that you run ``e2fsck -D`` to have the -directories rebuilt with checksums. This has the added benefit of -removing slack space from the directory files and rebalancing the htree -indexes. If you \_ignore\_ this step, your directories will not be -protected by a checksum! - -The following table describes the data elements that go into each type -of checksum. The checksum function is whatever the superblock describes -(crc32c as of October 2013) unless noted otherwise. - -.. 
list-table:: - :widths: 1 1 4 - :header-rows: 1 - - * - Metadata - - Length - - Ingredients - * - Superblock - - \_\_le32 - - The entire superblock up to the checksum field. The UUID lives inside - the superblock. - * - MMP - - \_\_le32 - - UUID + the entire MMP block up to the checksum field. - * - Extended Attributes - - \_\_le32 - - UUID + the entire extended attribute block. The checksum field is set to - zero. - * - Directory Entries - - \_\_le32 - - UUID + inode number + inode generation + the directory block up to the - fake entry enclosing the checksum field. - * - HTREE Nodes - - \_\_le32 - - UUID + inode number + inode generation + all valid extents + HTREE tail. - The checksum field is set to zero. - * - Extents - - \_\_le32 - - UUID + inode number + inode generation + the entire extent block up to - the checksum field. - * - Bitmaps - - \_\_le32 or \_\_le16 - - UUID + the entire bitmap. Checksums are stored in the group descriptor, - and truncated if the group descriptor size is 32 bytes (i.e. ^64bit) - * - Inodes - - \_\_le32 - - UUID + inode number + inode generation + the entire inode. The checksum - field is set to zero. Each inode has its own checksum. - * - Group Descriptors - - \_\_le16 - - If metadata\_csum, then UUID + group number + the entire descriptor; - else if gdt\_csum, then crc16(UUID + group number + the entire - descriptor). In all cases, only the lower 16 bits are stored. - diff --git a/Documentation/filesystems/ext4/ondisk/directory.rst b/Documentation/filesystems/ext4/ondisk/directory.rst deleted file mode 100644 index 8fcba68c28848..0000000000000 --- a/Documentation/filesystems/ext4/ondisk/directory.rst +++ /dev/null @@ -1,426 +0,0 @@ -.. SPDX-License-Identifier: GPL-2.0 - -Directory Entries ------------------ - -In an ext4 filesystem, a directory is more or less a flat file that maps -an arbitrary byte string (usually ASCII) to an inode number on the -filesystem. 
There can be many directory entries across the filesystem -that reference the same inode number--these are known as hard links, and -that is why hard links cannot reference files on other filesystems. As -such, directory entries are found by reading the data block(s) -associated with a directory file for the particular directory entry that -is desired. - -Linear (Classic) Directories -~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -By default, each directory lists its entries in an “almost-linear” -array. I write “almost” because it's not a linear array in the memory -sense because directory entries are not split across filesystem blocks. -Therefore, it is more accurate to say that a directory is a series of -data blocks and that each block contains a linear array of directory -entries. The end of each per-block array is signified by reaching the -end of the block; the last entry in the block has a record length that -takes it all the way to the end of the block. The end of the entire -directory is of course signified by reaching the end of the file. Unused -directory entries are signified by inode = 0. By default the filesystem -uses ``struct ext4_dir_entry_2`` for directory entries unless the -“filetype” feature flag is not set, in which case it uses -``struct ext4_dir_entry``. - -The original directory entry format is ``struct ext4_dir_entry``, which -is at most 263 bytes long, though on disk you'll need to reference -``dirent.rec_len`` to know for sure. - -.. list-table:: - :widths: 1 1 1 77 - :header-rows: 1 - - * - Offset - - Size - - Name - - Description - * - 0x0 - - \_\_le32 - - inode - - Number of the inode that this directory entry points to. - * - 0x4 - - \_\_le16 - - rec\_len - - Length of this directory entry. Must be a multiple of 4. - * - 0x6 - - \_\_le16 - - name\_len - - Length of the file name. - * - 0x8 - - char - - name[EXT4\_NAME\_LEN] - - File name. 
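The per-block walk described above can be sketched in a few lines of Python. This is an illustrative sketch rather than kernel or e2fsprogs code: the names ``iter_dir_entries`` and ``pack_entry`` are hypothetical, the layout is the original ``struct ext4_dir_entry`` (16-bit ``name_len``), and it assumes a block size below 64KiB so that ``rec_len`` can be read as a plain byte count.

```python
import struct

# inode (le32), rec_len (le16), name_len (le16): the fixed 8-byte header
DIRENT_HEADER = struct.Struct("<IHH")


def iter_dir_entries(block: bytes):
    """Yield (inode, name) for every in-use entry in one directory block."""
    offset = 0
    while offset + DIRENT_HEADER.size <= len(block):
        inode, rec_len, name_len = DIRENT_HEADER.unpack_from(block, offset)
        # rec_len must be a multiple of 4, cover the header and the name,
        # and never cross the end of the block.
        if (rec_len < DIRENT_HEADER.size or rec_len % 4
                or offset + rec_len > len(block)
                or name_len > rec_len - DIRENT_HEADER.size):
            raise ValueError(f"corrupt entry at offset {offset}")
        if inode != 0:  # inode 0 marks an unused entry
            start = offset + DIRENT_HEADER.size
            yield inode, block[start:start + name_len]
        offset += rec_len  # the last entry's rec_len reaches the block end


def pack_entry(inode: int, name: bytes, rec_len: int) -> bytes:
    """Hypothetical helper: build one on-disk entry, zero-padded to rec_len."""
    raw = DIRENT_HEADER.pack(inode, rec_len, len(name)) + name
    return raw.ljust(rec_len, b"\0")


# A synthetic 1KiB block: ".", "..", one unused slot, and one file whose
# rec_len stretches to the end of the block.
block = (pack_entry(2, b".", 12) + pack_entry(2, b"..", 12)
         + pack_entry(0, b"gone", 16)
         + pack_entry(12, b"hello.txt", 1024 - 40))
entries = list(iter_dir_entries(block))
```

The walk never looks at ``name_len`` to find the next entry; only ``rec_len`` is used for that, which is what lets an entry absorb the slack space left behind by a deleted neighbour.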
- -Since file names cannot be longer than 255 bytes, the new directory -entry format shortens the name\_len field and uses the space for a file -type flag, probably to avoid having to load every inode during directory -tree traversal. This format is ``ext4_dir_entry_2``, which is at most -263 bytes long, though on disk you'll need to reference -``dirent.rec_len`` to know for sure. - -.. list-table:: - :widths: 1 1 1 77 - :header-rows: 1 - - * - Offset - - Size - - Name - - Description - * - 0x0 - - \_\_le32 - - inode - - Number of the inode that this directory entry points to. - * - 0x4 - - \_\_le16 - - rec\_len - - Length of this directory entry. - * - 0x6 - - \_\_u8 - - name\_len - - Length of the file name. - * - 0x7 - - \_\_u8 - - file\_type - - File type code, see ftype_ table below. - * - 0x8 - - char - - name[EXT4\_NAME\_LEN] - - File name. - -.. _ftype: - -The directory file type is one of the following values: - -.. list-table:: - :widths: 1 79 - :header-rows: 1 - - * - Value - - Description - * - 0x0 - - Unknown. - * - 0x1 - - Regular file. - * - 0x2 - - Directory. - * - 0x3 - - Character device file. - * - 0x4 - - Block device file. - * - 0x5 - - FIFO. - * - 0x6 - - Socket. - * - 0x7 - - Symbolic link. - -In order to add checksums to these classic directory blocks, a phony -``struct ext4_dir_entry`` is placed at the end of each leaf block to -hold the checksum. The directory entry is 12 bytes long. The inode -number and name\_len fields are set to zero to fool old software into -ignoring an apparently empty directory entry, and the checksum is stored -in the place where the name normally goes. The structure is -``struct ext4_dir_entry_tail``: - -.. list-table:: - :widths: 1 1 1 77 - :header-rows: 1 - - * - Offset - - Size - - Name - - Description - * - 0x0 - - \_\_le32 - - det\_reserved\_zero1 - - Inode number, which must be zero. - * - 0x4 - - \_\_le16 - - det\_rec\_len - - Length of this directory entry, which must be 12.
- * - 0x6 - - \_\_u8 - - det\_reserved\_zero2 - - Length of the file name, which must be zero. - * - 0x7 - - \_\_u8 - - det\_reserved\_ft - - File type, which must be 0xDE. - * - 0x8 - - \_\_le32 - - det\_checksum - - Directory leaf block checksum. - -The leaf directory block checksum is calculated against the FS UUID, the -directory's inode number, the directory's inode generation number, and -the entire directory entry block up to (but not including) the fake -directory entry. - -Hash Tree Directories -~~~~~~~~~~~~~~~~~~~~~ - -A linear array of directory entries isn't great for performance, so a -new feature was added to ext3 to provide a faster (but peculiar) -balanced tree keyed off a hash of the directory entry name. If the -EXT4\_INDEX\_FL (0x1000) flag is set in the inode, this directory uses a -hashed btree (htree) to organize and find directory entries. For -backwards read-only compatibility with ext2, this tree is actually -hidden inside the directory file, masquerading as “empty” directory data -blocks! It was stated previously that the end of the linear directory -entry table was signified with an entry pointing to inode 0; this is -(ab)used to fool the old linear-scan algorithm into thinking that the -rest of the directory block is empty so that it moves on. - -The root of the tree always lives in the first data block of the -directory. By ext2 custom, the '.' and '..' entries must appear at the -beginning of this first block, so they are put here as two -``struct ext4_dir_entry_2``\ s and not stored in the tree. The rest of -the root node contains metadata about the tree and finally a hash->block -map to find nodes that are lower in the htree. If -``dx_root.info.indirect_levels`` is non-zero then the htree has two -levels; the data block pointed to by the root node's map is an interior -node, which is indexed by a minor hash. 
Interior nodes in this tree -contain a zeroed out ``struct ext4_dir_entry_2`` followed by a -minor\_hash->block map to find leaf nodes. Leaf nodes contain a linear -array of all ``struct ext4_dir_entry_2``; all of these entries -(presumably) hash to the same value. If there is an overflow, the -entries simply overflow into the next leaf node, and the -least-significant bit of the hash (in the interior node map) that gets -us to this next leaf node is set. - -To traverse the directory as an htree, the code calculates the hash of -the desired file name and uses it to find the corresponding block -number. If the tree is flat, the block is a linear array of directory -entries that can be searched; otherwise, the minor hash of the file name -is computed and used against this second block to find the corresponding -third block number. That third block will be a linear array of -directory entries. - -To traverse the directory as a linear array (such as the old code does), -the code simply reads every data block in the directory. The blocks used -for the htree will appear to have no entries (aside from '.' and '..') -and so only the leaf nodes will appear to have any interesting content. - -The root of the htree is in ``struct dx_root``, which is the full length -of a data block: - -.. list-table:: - :widths: 1 1 1 77 - :header-rows: 1 - - * - Offset - - Type - - Name - - Description - * - 0x0 - - \_\_le32 - - dot.inode - - inode number of this directory. - * - 0x4 - - \_\_le16 - - dot.rec\_len - - Length of this record, 12. - * - 0x6 - - u8 - - dot.name\_len - - Length of the name, 1. - * - 0x7 - - u8 - - dot.file\_type - - File type of this entry, 0x2 (directory) (if the feature flag is set). - * - 0x8 - - char - - dot.name[4] - - “.\\0\\0\\0” - * - 0xC - - \_\_le32 - - dotdot.inode - - inode number of parent directory. - * - 0x10 - - \_\_le16 - - dotdot.rec\_len - - block\_size - 12. The record length is long enough to cover all htree - data.
- * - 0x12 - - u8 - - dotdot.name\_len - - Length of the name, 2. - * - 0x13 - - u8 - - dotdot.file\_type - - File type of this entry, 0x2 (directory) (if the feature flag is set). - * - 0x14 - - char - - dotdot\_name[4] - - “..\\0\\0” - * - 0x18 - - \_\_le32 - - struct dx\_root\_info.reserved\_zero - - Zero. - * - 0x1C - - u8 - - struct dx\_root\_info.hash\_version - - Hash type, see dirhash_ table below. - * - 0x1D - - u8 - - struct dx\_root\_info.info\_length - - Length of the tree information, 0x8. - * - 0x1E - - u8 - - struct dx\_root\_info.indirect\_levels - - Depth of the htree. Cannot be larger than 3 if the INCOMPAT\_LARGEDIR - feature is set; cannot be larger than 2 otherwise. - * - 0x1F - - u8 - - struct dx\_root\_info.unused\_flags - - - * - 0x20 - - \_\_le16 - - limit - - Maximum number of dx\_entries that can follow this header, plus 1 for - the header itself. - * - 0x22 - - \_\_le16 - - count - - Actual number of dx\_entries that follow this header, plus 1 for the - header itself. - * - 0x24 - - \_\_le32 - - block - - The block number (within the directory file) that goes with hash=0. - * - 0x28 - - struct dx\_entry - - entries[0] - - As many 8-byte ``struct dx_entry`` as fits in the rest of the data block. - -.. _dirhash: - -The directory hash is one of the following values: - -.. list-table:: - :widths: 1 79 - :header-rows: 1 - - * - Value - - Description - * - 0x0 - - Legacy. - * - 0x1 - - Half MD4. - * - 0x2 - - Tea. - * - 0x3 - - Legacy, unsigned. - * - 0x4 - - Half MD4, unsigned. - * - 0x5 - - Tea, unsigned. - -Interior nodes of an htree are recorded as ``struct dx_node``, which is -also the full length of a data block: - -.. list-table:: - :widths: 1 1 1 77 - :header-rows: 1 - - * - Offset - - Type - - Name - - Description - * - 0x0 - - \_\_le32 - - fake.inode - - Zero, to make it look like this entry is not in use. - * - 0x4 - - \_\_le16 - - fake.rec\_len - - The size of the block, in order to hide all of the dx\_node data. 
- * - 0x6 - - u8 - - name\_len - - Zero. There is no name for this “unused” directory entry. - * - 0x7 - - u8 - - file\_type - - Zero. There is no file type for this “unused” directory entry. - * - 0x8 - - \_\_le16 - - limit - - Maximum number of dx\_entries that can follow this header, plus 1 for - the header itself. - * - 0xA - - \_\_le16 - - count - - Actual number of dx\_entries that follow this header, plus 1 for the - header itself. - * - 0xC - - \_\_le32 - - block - - The block number (within the directory file) that goes with the lowest - hash value of this block. This value is stored in the parent block. - * - 0x10 - - struct dx\_entry - - entries[0] - - As many 8-byte ``struct dx_entry`` as fits in the rest of the data block. - -The hash maps that exist in both ``struct dx_root`` and -``struct dx_node`` are recorded as ``struct dx_entry``, which is 8 bytes -long: - -.. list-table:: - :widths: 1 1 1 77 - :header-rows: 1 - - * - Offset - - Type - - Name - - Description - * - 0x0 - - \_\_le32 - - hash - - Hash code. - * - 0x4 - - \_\_le32 - - block - - Block number (within the directory file, not filesystem blocks) of the - next node in the htree. - -(If you think this is all quite clever and peculiar, so does the -author.) - -If metadata checksums are enabled, the last 8 bytes of the directory -block (precisely the length of one dx\_entry) are used to store a -``struct dx_tail``, which contains the checksum. The ``limit`` and -``count`` entries in the dx\_root/dx\_node structures are adjusted as -necessary to fit the dx\_tail into the block. If there is no space for -the dx\_tail, the user is notified to run e2fsck -D to rebuild the -directory index (which will ensure that there's space for the checksum). -The dx\_tail structure is 8 bytes long and looks like this: - -.. list-table:: - :widths: 1 1 1 77 - :header-rows: 1 - - * - Offset - - Type - - Name - - Description - * - 0x0 - - u32 - - dt\_reserved - - Zero.
- * - 0x4 - - \_\_le32 - - dt\_checksum - - Checksum of the htree directory block. - -The checksum is calculated against the FS UUID, the htree index header -(dx\_root or dx\_node), all of the htree indices (dx\_entry) that are in -use, and the tail block (dx\_tail). diff --git a/Documentation/filesystems/ext4/ondisk/dynamic.rst b/Documentation/filesystems/ext4/ondisk/dynamic.rst deleted file mode 100644 index bb0c84333341a..0000000000000 --- a/Documentation/filesystems/ext4/ondisk/dynamic.rst +++ /dev/null @@ -1,12 +0,0 @@ -.. SPDX-License-Identifier: GPL-2.0 - -Dynamic Structures -================== - -Dynamic metadata are created on the fly when files and blocks are -allocated to files. - -.. include:: inodes.rst -.. include:: ifork.rst -.. include:: directory.rst -.. include:: attributes.rst diff --git a/Documentation/filesystems/ext4/ondisk/eainode.rst b/Documentation/filesystems/ext4/ondisk/eainode.rst deleted file mode 100644 index ecc0d01a0a72c..0000000000000 --- a/Documentation/filesystems/ext4/ondisk/eainode.rst +++ /dev/null @@ -1,18 +0,0 @@ -.. SPDX-License-Identifier: GPL-2.0 - -Large Extended Attribute Values -------------------------------- - -To enable ext4 to store extended attribute values that do not fit in the -inode or in the single extended attribute block attached to an inode, -the EA\_INODE feature allows us to store the value in the data blocks of -a regular file inode. This “EA inode” is linked only from the extended -attribute name index and must not appear in a directory entry. The -inode's i\_atime field is used to store a checksum of the xattr value; -and i\_ctime/i\_version store a 64-bit reference count, which enables -sharing of large xattr values between multiple owning inodes. 
For -backward compatibility with older versions of this feature, the -i\_mtime/i\_generation *may* store a back-reference to the inode number -and i\_generation of the **one** owning inode (in cases where the EA -inode is not referenced by multiple inodes) to verify that the EA inode -is the correct one being accessed. diff --git a/Documentation/filesystems/ext4/ondisk/globals.rst b/Documentation/filesystems/ext4/ondisk/globals.rst deleted file mode 100644 index 368bf7662b968..0000000000000 --- a/Documentation/filesystems/ext4/ondisk/globals.rst +++ /dev/null @@ -1,13 +0,0 @@ -.. SPDX-License-Identifier: GPL-2.0 - -Global Structures -================= - -The filesystem is sharded into a number of block groups, each of which -has static metadata at fixed locations. - -.. include:: super.rst -.. include:: group_descr.rst -.. include:: bitmaps.rst -.. include:: mmp.rst -.. include:: journal.rst diff --git a/Documentation/filesystems/ext4/ondisk/group_descr.rst b/Documentation/filesystems/ext4/ondisk/group_descr.rst deleted file mode 100644 index 759827e5d2cf9..0000000000000 --- a/Documentation/filesystems/ext4/ondisk/group_descr.rst +++ /dev/null @@ -1,170 +0,0 @@ -.. SPDX-License-Identifier: GPL-2.0 - -Block Group Descriptors ------------------------ - -Each block group on the filesystem has one of these descriptors -associated with it. As noted in the Layout section above, the group -descriptors (if present) are the second item in the block group. The -standard configuration is for each block group to contain a full copy of -the block group descriptor table unless the sparse\_super feature flag -is set. - -Notice how the group descriptor records the location of both bitmaps and -the inode table (i.e. they can float). This means that within a block -group, the only data structures with fixed locations are the superblock -and the group descriptor table.
The flex\_bg mechanism uses this -property to group several block groups into a flex group and lay out all -of the groups' bitmaps and inode tables into one long run in the first -group of the flex group. - -If the meta\_bg feature flag is set, then several block groups are -grouped together into a meta group. Note that in the meta\_bg case, -however, the first and last two block groups within the larger meta -group contain only group descriptors for the groups inside the meta -group. - -flex\_bg and meta\_bg do not appear to be mutually exclusive features. - -In ext2, ext3, and ext4 (when the 64bit feature is not enabled), the -block group descriptor was only 32 bytes long and therefore ends at -bg\_checksum. On an ext4 filesystem with the 64bit feature enabled, the -block group descriptor expands to at least the 64 bytes described below; -the size is stored in the superblock. - -If gdt\_csum is set and metadata\_csum is not set, the block group -checksum is the crc16 of the FS UUID, the group number, and the group -descriptor structure. If metadata\_csum is set, then the block group -checksum is the lower 16 bits of the checksum of the FS UUID, the group -number, and the group descriptor structure. Both block and inode bitmap -checksums are calculated against the FS UUID, the group number, and the -entire bitmap. - -The block group descriptor is laid out in ``struct ext4_group_desc``. - -.. list-table:: - :widths: 1 1 1 77 - :header-rows: 1 - - * - Offset - - Size - - Name - - Description - * - 0x0 - - \_\_le32 - - bg\_block\_bitmap\_lo - - Lower 32-bits of location of block bitmap. - * - 0x4 - - \_\_le32 - - bg\_inode\_bitmap\_lo - - Lower 32-bits of location of inode bitmap. - * - 0x8 - - \_\_le32 - - bg\_inode\_table\_lo - - Lower 32-bits of location of inode table. - * - 0xC - - \_\_le16 - - bg\_free\_blocks\_count\_lo - - Lower 16-bits of free block count. - * - 0xE - - \_\_le16 - - bg\_free\_inodes\_count\_lo - - Lower 16-bits of free inode count. 
- * - 0x10 - - \_\_le16 - - bg\_used\_dirs\_count\_lo - - Lower 16-bits of directory count. - * - 0x12 - - \_\_le16 - - bg\_flags - - Block group flags. See the bgflags_ table below. - * - 0x14 - - \_\_le32 - - bg\_exclude\_bitmap\_lo - - Lower 32-bits of location of snapshot exclusion bitmap. - * - 0x18 - - \_\_le16 - - bg\_block\_bitmap\_csum\_lo - - Lower 16-bits of the block bitmap checksum. - * - 0x1A - - \_\_le16 - - bg\_inode\_bitmap\_csum\_lo - - Lower 16-bits of the inode bitmap checksum. - * - 0x1C - - \_\_le16 - - bg\_itable\_unused\_lo - - Lower 16-bits of unused inode count. If set, we needn't scan past the - ``(sb.s_inodes_per_group - gdt.bg_itable_unused)``\ th entry in the - inode table for this group. - * - 0x1E - - \_\_le16 - - bg\_checksum - - Group descriptor checksum; crc16(sb\_uuid+group+desc) if the - RO\_COMPAT\_GDT\_CSUM feature is set, or crc32c(sb\_uuid+group\_desc) & - 0xFFFF if the RO\_COMPAT\_METADATA\_CSUM feature is set. - * - - - - - - - These fields only exist if the 64bit feature is enabled and s_desc_size - > 32. - * - 0x20 - - \_\_le32 - - bg\_block\_bitmap\_hi - - Upper 32-bits of location of block bitmap. - * - 0x24 - - \_\_le32 - - bg\_inode\_bitmap\_hi - - Upper 32-bits of location of inodes bitmap. - * - 0x28 - - \_\_le32 - - bg\_inode\_table\_hi - - Upper 32-bits of location of inodes table. - * - 0x2C - - \_\_le16 - - bg\_free\_blocks\_count\_hi - - Upper 16-bits of free block count. - * - 0x2E - - \_\_le16 - - bg\_free\_inodes\_count\_hi - - Upper 16-bits of free inode count. - * - 0x30 - - \_\_le16 - - bg\_used\_dirs\_count\_hi - - Upper 16-bits of directory count. - * - 0x32 - - \_\_le16 - - bg\_itable\_unused\_hi - - Upper 16-bits of unused inode count. - * - 0x34 - - \_\_le32 - - bg\_exclude\_bitmap\_hi - - Upper 32-bits of location of snapshot exclusion bitmap. - * - 0x38 - - \_\_le16 - - bg\_block\_bitmap\_csum\_hi - - Upper 16-bits of the block bitmap checksum. 
- * - 0x3A - - \_\_le16 - - bg\_inode\_bitmap\_csum\_hi - - Upper 16-bits of the inode bitmap checksum. - * - 0x3C - - \_\_u32 - - bg\_reserved - - Padding to 64 bytes. - -.. _bgflags: - -Block group flags can be any combination of the following: - -.. list-table:: - :widths: 1 79 - :header-rows: 1 - - * - Value - - Description - * - 0x1 - - inode table and bitmap are not initialized (EXT4\_BG\_INODE\_UNINIT). - * - 0x2 - - block bitmap is not initialized (EXT4\_BG\_BLOCK\_UNINIT). - * - 0x4 - - inode table is zeroed (EXT4\_BG\_INODE\_ZEROED). diff --git a/Documentation/filesystems/ext4/ondisk/ifork.rst b/Documentation/filesystems/ext4/ondisk/ifork.rst deleted file mode 100644 index 5dbe3b2b121ab..0000000000000 --- a/Documentation/filesystems/ext4/ondisk/ifork.rst +++ /dev/null @@ -1,194 +0,0 @@ -.. SPDX-License-Identifier: GPL-2.0 - -The Contents of inode.i\_block ------------------------------- - -Depending on the type of file an inode describes, the 60 bytes of -storage in ``inode.i_block`` can be used in different ways. In general, -regular files and directories will use it for file block indexing -information, and special files will use it for special purposes. - -Symbolic Links -~~~~~~~~~~~~~~ - -The target of a symbolic link will be stored in this field if the target -string is less than 60 bytes long. Otherwise, either extents or block -maps will be used to allocate data blocks to store the link target. - -Direct/Indirect Block Addressing -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -In ext2/3, file block numbers were mapped to logical block numbers by -means of an (up to) three level 1-1 block map. To find the logical block -that stores a particular file block, the code would navigate through -this increasingly complicated structure. Notice that there is neither a -magic number nor a checksum to provide any level of confidence that the -block isn't full of garbage. - -.. ifconfig:: builder != 'latex' - - .. include:: blockmap.rst - -.. 
ifconfig:: builder == 'latex' - - [Table omitted because LaTeX doesn't support nested tables.] - -Note that with this block mapping scheme, it is necessary to fill out a -lot of mapping data even for a large contiguous file! This inefficiency -led to the creation of the extent mapping scheme, discussed below. - -Notice also that a file using this mapping scheme cannot be placed -higher than 2^32 blocks. - -Extent Tree -~~~~~~~~~~~ - -In ext4, the file to logical block map has been replaced with an extent -tree. Under the old scheme, allocating a contiguous run of 1,000 blocks -requires an indirect block to map all 1,000 entries; with extents, the -mapping is reduced to a single ``struct ext4_extent`` with -``ee_len = 1000``. If flex\_bg is enabled, it is possible to allocate -very large files with a single extent, at a considerable reduction in -metadata block use, and some improvement in disk efficiency. The inode -must have the extents flag (0x80000) set for this feature to be in -use. - -Extents are arranged as a tree. Each node of the tree begins with a -``struct ext4_extent_header``. If the node is an interior node -(``eh.eh_depth > 0``), the header is followed by ``eh.eh_entries`` -instances of ``struct ext4_extent_idx``; each of these index entries -points to a block containing more nodes in the extent tree. If the node -is a leaf node (``eh.eh_depth == 0``), then the header is followed by -``eh.eh_entries`` instances of ``struct ext4_extent``; these instances -point to the file's data blocks. The root node of the extent tree is -stored in ``inode.i_block``, which allows for the first four extents to -be recorded without the use of extra metadata blocks. - -The extent tree header is recorded in ``struct ext4_extent_header``, -which is 12 bytes long: - -.. list-table:: - :widths: 1 1 1 77 - :header-rows: 1 - - * - Offset - - Size - - Name - - Description - * - 0x0 - - \_\_le16 - - eh\_magic - - Magic number, 0xF30A.
- * - 0x2 - - \_\_le16 - - eh\_entries - - Number of valid entries following the header. - * - 0x4 - - \_\_le16 - - eh\_max - - Maximum number of entries that could follow the header. - * - 0x6 - - \_\_le16 - - eh\_depth - - Depth of this extent node in the extent tree. 0 = this extent node - points to data blocks; otherwise, this extent node points to other - extent nodes. The extent tree can be at most 5 levels deep: a logical - block number can be at most ``2^32``, and the smallest ``n`` that - satisfies ``4*(((blocksize - 12)/12)^n) >= 2^32`` is 5. - * - 0x8 - - \_\_le32 - - eh\_generation - - Generation of the tree. (Used by Lustre, but not standard ext4). - -Internal nodes of the extent tree, also known as index nodes, are -recorded as ``struct ext4_extent_idx``, and are 12 bytes long: - -.. list-table:: - :widths: 1 1 1 77 - :header-rows: 1 - - * - Offset - - Size - - Name - - Description - * - 0x0 - - \_\_le32 - - ei\_block - - This index node covers file blocks from 'block' onward. - * - 0x4 - - \_\_le32 - - ei\_leaf\_lo - - Lower 32-bits of the block number of the extent node that is the next - level lower in the tree. The tree node pointed to can be either another - internal node or a leaf node, described below. - * - 0x8 - - \_\_le16 - - ei\_leaf\_hi - - Upper 16-bits of the previous field. - * - 0xA - - \_\_u16 - - ei\_unused - - - -Leaf nodes of the extent tree are recorded as ``struct ext4_extent``, -and are also 12 bytes long: - -.. list-table:: - :widths: 1 1 1 77 - :header-rows: 1 - - * - Offset - - Size - - Name - - Description - * - 0x0 - - \_\_le32 - - ee\_block - - First file block number that this extent covers. - * - 0x4 - - \_\_le16 - - ee\_len - - Number of blocks covered by extent. If the value of this field is <= - 32768, the extent is initialized. If the value of the field is > 32768, - the extent is uninitialized and the actual extent length is ``ee_len`` - - 32768. 
Therefore, the maximum length of an initialized extent is 32768 - blocks, and the maximum length of an uninitialized extent is 32767. - * - 0x6 - - \_\_le16 - - ee\_start\_hi - - Upper 16-bits of the block number to which this extent points. - * - 0x8 - - \_\_le32 - - ee\_start\_lo - - Lower 32-bits of the block number to which this extent points. - -Prior to the introduction of metadata checksums, the extent header + -extent entries always left at least 4 bytes of unallocated space at the -end of each extent tree data block (because (2^x % 12) >= 4). Therefore, -the 32-bit checksum is inserted into this space. The 4 extents in the -inode do not need checksumming, since the inode is already checksummed. -The checksum is calculated against the FS UUID, the inode number, the -inode generation, and the entire extent block leading up to (but not -including) the checksum itself. - -``struct ext4_extent_tail`` is 4 bytes long: - -.. list-table:: - :widths: 1 1 1 77 - :header-rows: 1 - - * - Offset - - Size - - Name - - Description - * - 0x0 - - \_\_le32 - - eb\_checksum - - Checksum of the extent block, crc32c(uuid+inum+igeneration+extentblock) - -Inline Data -~~~~~~~~~~~ - -If the inline data feature is enabled for the filesystem and the flag is -set for the inode, it is possible that the first 60 bytes of the file -data are stored here. diff --git a/Documentation/filesystems/ext4/ondisk/index.rst b/Documentation/filesystems/ext4/ondisk/index.rst deleted file mode 100644 index f7d082c3a4359..0000000000000 --- a/Documentation/filesystems/ext4/ondisk/index.rst +++ /dev/null @@ -1,9 +0,0 @@ -.. SPDX-License-Identifier: GPL-2.0 - -============================== -Data Structures and Algorithms -============================== -.. include:: about.rst -.. include:: overview.rst -.. include:: globals.rst -..
include:: dynamic.rst diff --git a/Documentation/filesystems/ext4/ondisk/inlinedata.rst b/Documentation/filesystems/ext4/ondisk/inlinedata.rst deleted file mode 100644 index d1075178ce0b2..0000000000000 --- a/Documentation/filesystems/ext4/ondisk/inlinedata.rst +++ /dev/null @@ -1,37 +0,0 @@ -.. SPDX-License-Identifier: GPL-2.0 - -Inline Data ------------ - -The inline data feature was designed to handle the case that a file's -data is so tiny that it readily fits inside the inode, which -(theoretically) reduces disk block consumption and reduces seeks. If the -file is smaller than 60 bytes, then the data are stored inline in -``inode.i_block``. If the rest of the file would fit inside the extended -attribute space, then it might be found as an extended attribute -“system.data” within the inode body (“ibody EA”). This of course -constrains the amount of extended attributes one can attach to an inode. -If the data size increases beyond i\_block + ibody EA, a regular block -is allocated and the contents moved to that block. - -Pending a change to compact the extended attribute key used to store -inline data, one ought to be able to store 160 bytes of data in a -256-byte inode (as of June 2015, when i\_extra\_isize is 28). Prior to -that, the limit was 156 bytes due to inefficient use of inode space. - -The inline data feature requires the presence of an extended attribute -for “system.data”, even if the attribute value is zero length. - -Inline Directories -~~~~~~~~~~~~~~~~~~ - -The first four bytes of i\_block are the inode number of the parent -directory. Following that is a 56-byte space for an array of directory -entries; see ``struct ext4_dir_entry``. If there is a “system.data” -attribute in the inode body, the EA value is an array of -``struct ext4_dir_entry`` as well. Note that for inline directories, the -i\_block and EA space are treated as separate dirent blocks; directory -entries cannot span the two. 
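The ``i_block`` half of an inline directory can be decoded with a short Python sketch. It is a sketch under stated assumptions, not the kernel's implementation: ``parse_inline_dir`` and ``pack_entry`` are hypothetical names, the entries are read with the ``struct ext4_dir_entry`` layout named above (16-bit ``name_len``), and the separate “system.data” EA area is not handled.

```python
import struct

# inode (le32), rec_len (le16), name_len (le16): the fixed 8-byte header
DIRENT_HEADER = struct.Struct("<IHH")
I_BLOCK_SIZE = 60  # bytes of storage in inode.i_block


def parse_inline_dir(i_block: bytes):
    """Return (parent_inode, [(inode, name), ...]) from a 60-byte i_block."""
    if len(i_block) != I_BLOCK_SIZE:
        raise ValueError("i_block must be exactly 60 bytes")
    (parent,) = struct.unpack_from("<I", i_block, 0)  # first four bytes
    area = i_block[4:]  # remaining 56 bytes hold the directory entries
    entries = []
    offset = 0
    while offset + DIRENT_HEADER.size <= len(area):
        inode, rec_len, name_len = DIRENT_HEADER.unpack_from(area, offset)
        # Entries may not run past the 56-byte area (they cannot span
        # into the EA space), and the name must fit inside the record.
        if (rec_len < DIRENT_HEADER.size or offset + rec_len > len(area)
                or name_len > rec_len - DIRENT_HEADER.size):
            raise ValueError(f"corrupt inline entry at offset {offset}")
        if inode != 0:  # inode 0 marks an unused entry
            start = offset + DIRENT_HEADER.size
            entries.append((inode, area[start:start + name_len]))
        offset += rec_len
    return parent, entries


def pack_entry(inode: int, name: bytes, rec_len: int) -> bytes:
    """Hypothetical helper: build one entry, zero-padded to rec_len."""
    raw = DIRENT_HEADER.pack(inode, rec_len, len(name)) + name
    return raw.ljust(rec_len, b"\0")


# Parent inode 2, then two entries that together fill the 56-byte area.
i_block = (struct.pack("<I", 2)
           + pack_entry(15, b"a.txt", 16) + pack_entry(16, b"b", 40))
parent, entries = parse_inline_dir(i_block)
```

Treating the 56-byte area as its own miniature dirent block mirrors the rule that entries cannot span ``i_block`` and the EA space; a reader for the EA half would run the same loop over the attribute value.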
- -Inline directory entries are not checksummed, as the inode checksum -should protect all inline data contents. diff --git a/Documentation/filesystems/ext4/ondisk/inodes.rst b/Documentation/filesystems/ext4/ondisk/inodes.rst deleted file mode 100644 index 655ce898f3f5c..0000000000000 --- a/Documentation/filesystems/ext4/ondisk/inodes.rst +++ /dev/null @@ -1,575 +0,0 @@ -.. SPDX-License-Identifier: GPL-2.0 - -Index Nodes ------------ - -In a regular UNIX filesystem, the inode stores all the metadata -pertaining to the file (time stamps, block maps, extended attributes, -etc), not the directory entry. To find the information associated with a -file, one must traverse the directory files to find the directory entry -associated with a file, then load the inode to find the metadata for -that file. ext4 appears to cheat (for performance reasons) a little bit -by storing a copy of the file type (normally stored in the inode) in the -directory entry. (Compare all this to FAT, which stores all the file -information directly in the directory entry, but does not support hard -links and is in general more seek-happy than ext4 due to its simpler -block allocator and extensive use of linked lists.) - -The inode table is a linear array of ``struct ext4_inode``. The table is -sized to have enough blocks to store at least -``sb.s_inode_size * sb.s_inodes_per_group`` bytes. The number of the -block group containing an inode can be calculated as -``(inode_number - 1) / sb.s_inodes_per_group``, and the offset into the -group's table is ``(inode_number - 1) % sb.s_inodes_per_group``. There -is no inode 0. - -The inode checksum is calculated against the FS UUID, the inode number, -and the inode structure itself. - -The inode table entry is laid out in ``struct ext4_inode``. - -.. list-table:: - :widths: 1 1 1 77 - :header-rows: 1 - - * - Offset - - Size - - Name - - Description - * - 0x0 - - \_\_le16 - - i\_mode - - File mode. See the table i_mode_ below. 
- * - 0x2 - - \_\_le16 - - i\_uid - - Lower 16-bits of Owner UID. - * - 0x4 - - \_\_le32 - - i\_size\_lo - - Lower 32-bits of size in bytes. - * - 0x8 - - \_\_le32 - - i\_atime - - Last access time, in seconds since the epoch. However, if the EA\_INODE - inode flag is set, this inode stores an extended attribute value and - this field contains the checksum of the value. - * - 0xC - - \_\_le32 - - i\_ctime - - Last inode change time, in seconds since the epoch. However, if the - EA\_INODE inode flag is set, this inode stores an extended attribute - value and this field contains the lower 32 bits of the attribute value's - reference count. - * - 0x10 - - \_\_le32 - - i\_mtime - - Last data modification time, in seconds since the epoch. However, if the - EA\_INODE inode flag is set, this inode stores an extended attribute - value and this field contains the number of the inode that owns the - extended attribute. - * - 0x14 - - \_\_le32 - - i\_dtime - - Deletion Time, in seconds since the epoch. - * - 0x18 - - \_\_le16 - - i\_gid - - Lower 16-bits of GID. - * - 0x1A - - \_\_le16 - - i\_links\_count - - Hard link count. Normally, ext4 does not permit an inode to have more - than 65,000 hard links. This applies to files as well as directories, - which means that there cannot be more than 64,998 subdirectories in a - directory (each subdirectory's '..' entry counts as a hard link, as does - the '.' entry in the directory itself). With the DIR\_NLINK feature - enabled, ext4 supports more than 64,998 subdirectories by setting this - field to 1 to indicate that the number of hard links is not known. - * - 0x1C - - \_\_le32 - - i\_blocks\_lo - - Lower 32-bits of “block” count. If the huge\_file feature flag is not - set on the filesystem, the file consumes ``i_blocks_lo`` 512-byte blocks - on disk. If huge\_file is set and EXT4\_HUGE\_FILE\_FL is NOT set in - ``inode.i_flags``, then the file consumes ``i_blocks_lo + (i_blocks_hi - << 32)`` 512-byte blocks on disk. 
If huge\_file is set and - EXT4\_HUGE\_FILE\_FL IS set in ``inode.i_flags``, then this file - consumes ``i_blocks_lo + (i_blocks_hi << 32)`` filesystem blocks on - disk. - * - 0x20 - - \_\_le32 - - i\_flags - - Inode flags. See the table i_flags_ below. - * - 0x24 - - 4 bytes - - i\_osd1 - - See the table i_osd1_ for more details. - * - 0x28 - - 60 bytes - - i\_block[EXT4\_N\_BLOCKS=15] - - Block map or extent tree. See the section “The Contents of inode.i\_block”. - * - 0x64 - - \_\_le32 - - i\_generation - - File version (for NFS). - * - 0x68 - - \_\_le32 - - i\_file\_acl\_lo - - Lower 32-bits of extended attribute block. ACLs are of course one of - many possible extended attributes; I think the name of this field is a - result of the first use of extended attributes being for ACLs. - * - 0x6C - - \_\_le32 - - i\_size\_high / i\_dir\_acl - - Upper 32-bits of file/directory size. In ext2/3 this field was named - i\_dir\_acl, though it was usually set to zero and never used. - * - 0x70 - - \_\_le32 - - i\_obso\_faddr - - (Obsolete) fragment address. - * - 0x74 - - 12 bytes - - i\_osd2 - - See the table i_osd2_ for more details. - * - 0x80 - - \_\_le16 - - i\_extra\_isize - - Size of this inode - 128. Alternately, the size of the extended inode - fields beyond the original ext2 inode, including this field. - * - 0x82 - - \_\_le16 - - i\_checksum\_hi - - Upper 16-bits of the inode checksum. - * - 0x84 - - \_\_le32 - - i\_ctime\_extra - - Extra change time bits. This provides sub-second precision. See Inode - Timestamps section. - * - 0x88 - - \_\_le32 - - i\_mtime\_extra - - Extra modification time bits. This provides sub-second precision. - * - 0x8C - - \_\_le32 - - i\_atime\_extra - - Extra access time bits. This provides sub-second precision. - * - 0x90 - - \_\_le32 - - i\_crtime - - File creation time, in seconds since the epoch. - * - 0x94 - - \_\_le32 - - i\_crtime\_extra - - Extra file creation time bits. This provides sub-second precision. 
- * - 0x98 - - \_\_le32 - - i\_version\_hi - - Upper 32-bits for version number. - * - 0x9C - - \_\_le32 - - i\_projid - - Project ID. - -.. _i_mode: - -The ``i_mode`` value is a combination of the following flags: - -.. list-table:: - :widths: 1 79 - :header-rows: 1 - - * - Value - - Description - * - 0x1 - - S\_IXOTH (Others may execute) - * - 0x2 - - S\_IWOTH (Others may write) - * - 0x4 - - S\_IROTH (Others may read) - * - 0x8 - - S\_IXGRP (Group members may execute) - * - 0x10 - - S\_IWGRP (Group members may write) - * - 0x20 - - S\_IRGRP (Group members may read) - * - 0x40 - - S\_IXUSR (Owner may execute) - * - 0x80 - - S\_IWUSR (Owner may write) - * - 0x100 - - S\_IRUSR (Owner may read) - * - 0x200 - - S\_ISVTX (Sticky bit) - * - 0x400 - - S\_ISGID (Set GID) - * - 0x800 - - S\_ISUID (Set UID) - * - - - These are mutually-exclusive file types: - * - 0x1000 - - S\_IFIFO (FIFO) - * - 0x2000 - - S\_IFCHR (Character device) - * - 0x4000 - - S\_IFDIR (Directory) - * - 0x6000 - - S\_IFBLK (Block device) - * - 0x8000 - - S\_IFREG (Regular file) - * - 0xA000 - - S\_IFLNK (Symbolic link) - * - 0xC000 - - S\_IFSOCK (Socket) - -.. _i_flags: - -The ``i_flags`` field is a combination of these values: - -.. list-table:: - :widths: 1 79 - :header-rows: 1 - - * - Value - - Description - * - 0x1 - - This file requires secure deletion (EXT4\_SECRM\_FL). (not implemented) - * - 0x2 - - This file should be preserved, should undeletion be desired - (EXT4\_UNRM\_FL). (not implemented) - * - 0x4 - - File is compressed (EXT4\_COMPR\_FL). (not really implemented) - * - 0x8 - - All writes to the file must be synchronous (EXT4\_SYNC\_FL). - * - 0x10 - - File is immutable (EXT4\_IMMUTABLE\_FL). - * - 0x20 - - File can only be appended (EXT4\_APPEND\_FL). - * - 0x40 - - The dump(1) utility should not dump this file (EXT4\_NODUMP\_FL). - * - 0x80 - - Do not update access time (EXT4\_NOATIME\_FL). - * - 0x100 - - Dirty compressed file (EXT4\_DIRTY\_FL). 
(not used) - * - 0x200 - - File has one or more compressed clusters (EXT4\_COMPRBLK\_FL). (not used) - * - 0x400 - - Do not compress file (EXT4\_NOCOMPR\_FL). (not used) - * - 0x800 - - Encrypted inode (EXT4\_ENCRYPT\_FL). This bit value previously was - EXT4\_ECOMPR\_FL (compression error), which was never used. - * - 0x1000 - - Directory has hashed indexes (EXT4\_INDEX\_FL). - * - 0x2000 - - AFS magic directory (EXT4\_IMAGIC\_FL). - * - 0x4000 - - File data must always be written through the journal - (EXT4\_JOURNAL\_DATA\_FL). - * - 0x8000 - - File tail should not be merged (EXT4\_NOTAIL\_FL). (not used by ext4) - * - 0x10000 - - All directory entry data should be written synchronously (see - ``dirsync``) (EXT4\_DIRSYNC\_FL). - * - 0x20000 - - Top of directory hierarchy (EXT4\_TOPDIR\_FL). - * - 0x40000 - - This is a huge file (EXT4\_HUGE\_FILE\_FL). - * - 0x80000 - - Inode uses extents (EXT4\_EXTENTS\_FL). - * - 0x200000 - - Inode stores a large extended attribute value in its data blocks - (EXT4\_EA\_INODE\_FL). - * - 0x400000 - - This file has blocks allocated past EOF (EXT4\_EOFBLOCKS\_FL). - (deprecated) - * - 0x01000000 - - Inode is a snapshot (``EXT4_SNAPFILE_FL``). (not in mainline) - * - 0x04000000 - - Snapshot is being deleted (``EXT4_SNAPFILE_DELETED_FL``). (not in - mainline) - * - 0x08000000 - - Snapshot shrink has completed (``EXT4_SNAPFILE_SHRUNK_FL``). (not in - mainline) - * - 0x10000000 - - Inode has inline data (EXT4\_INLINE\_DATA\_FL). - * - 0x20000000 - - Create children with the same project ID (EXT4\_PROJINHERIT\_FL). - * - 0x80000000 - - Reserved for ext4 library (EXT4\_RESERVED\_FL). - * - - - Aggregate flags: - * - 0x4BDFFF - - User-visible flags. - * - 0x4B80FF - - User-modifiable flags. 
Note that while EXT4\_JOURNAL\_DATA\_FL and - EXT4\_EXTENTS\_FL can be set with setattr, they are not in the kernel's - EXT4\_FL\_USER\_MODIFIABLE mask, since it needs to handle the setting of - these flags in a special manner and they are masked out of the set of - flags that are saved directly to i\_flags. - -.. _i_osd1: - -The ``osd1`` field has multiple meanings depending on the creator: - -Linux: - -.. list-table:: - :widths: 1 1 1 77 - :header-rows: 1 - - * - Offset - - Size - - Name - - Description - * - 0x0 - - \_\_le32 - - l\_i\_version - - Inode version. However, if the EA\_INODE inode flag is set, this inode - stores an extended attribute value and this field contains the upper 32 - bits of the attribute value's reference count. - -Hurd: - -.. list-table:: - :widths: 1 1 1 77 - :header-rows: 1 - - * - Offset - - Size - - Name - - Description - * - 0x0 - - \_\_le32 - - h\_i\_translator - - ?? - -Masix: - -.. list-table:: - :widths: 1 1 1 77 - :header-rows: 1 - - * - Offset - - Size - - Name - - Description - * - 0x0 - - \_\_le32 - - m\_i\_reserved - - ?? - -.. _i_osd2: - -The ``osd2`` field has multiple meanings depending on the filesystem creator: - -Linux: - -.. list-table:: - :widths: 1 1 1 77 - :header-rows: 1 - - * - Offset - - Size - - Name - - Description - * - 0x0 - - \_\_le16 - - l\_i\_blocks\_high - - Upper 16-bits of the block count. Please see the note attached to - i\_blocks\_lo. - * - 0x2 - - \_\_le16 - - l\_i\_file\_acl\_high - - Upper 16-bits of the extended attribute block (historically, the file - ACL location). See the Extended Attributes section below. - * - 0x4 - - \_\_le16 - - l\_i\_uid\_high - - Upper 16-bits of the Owner UID. - * - 0x6 - - \_\_le16 - - l\_i\_gid\_high - - Upper 16-bits of the GID. - * - 0x8 - - \_\_le16 - - l\_i\_checksum\_lo - - Lower 16-bits of the inode checksum. - * - 0xA - - \_\_le16 - - l\_i\_reserved - - Unused. - -Hurd: - -.. 
list-table:: - :widths: 1 1 1 77 - :header-rows: 1 - - * - Offset - - Size - - Name - - Description - * - 0x0 - - \_\_le16 - - h\_i\_reserved1 - - ?? - * - 0x2 - - \_\_u16 - - h\_i\_mode\_high - - Upper 16-bits of the file mode. - * - 0x4 - - \_\_le16 - - h\_i\_uid\_high - - Upper 16-bits of the Owner UID. - * - 0x6 - - \_\_le16 - - h\_i\_gid\_high - - Upper 16-bits of the GID. - * - 0x8 - - \_\_u32 - - h\_i\_author - - Author code? - -Masix: - -.. list-table:: - :widths: 1 1 1 77 - :header-rows: 1 - - * - Offset - - Size - - Name - - Description - * - 0x0 - - \_\_le16 - - h\_i\_reserved1 - - ?? - * - 0x2 - - \_\_u16 - - m\_i\_file\_acl\_high - - Upper 16-bits of the extended attribute block (historically, the file - ACL location). - * - 0x4 - - \_\_u32 - - m\_i\_reserved2[2] - - ?? - -Inode Size -~~~~~~~~~~ - -In ext2 and ext3, the inode structure size was fixed at 128 bytes -(``EXT2_GOOD_OLD_INODE_SIZE``) and each inode had a disk record size of -128 bytes. Starting with ext4, it is possible to allocate a larger -on-disk inode at format time for all inodes in the filesystem to provide -space beyond the end of the original ext2 inode. The on-disk inode -record size is recorded in the superblock as ``s_inode_size``. The -number of bytes actually used by struct ext4\_inode beyond the original -128-byte ext2 inode is recorded in the ``i_extra_isize`` field for each -inode, which allows struct ext4\_inode to grow for a new kernel without -having to upgrade all of the on-disk inodes. Access to fields beyond -EXT2\_GOOD\_OLD\_INODE\_SIZE should be verified to be within -``i_extra_isize``. By default, ext4 inode records are 256 bytes, and (as -of October 2013) the inode structure is 156 bytes -(``i_extra_isize = 28``). The extra space between the end of the inode -structure and the end of the inode record can be used to store extended -attributes. Each inode record can be as large as the filesystem block -size, though this is not terribly efficient. 
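The rule that fields beyond the first 128 bytes must be checked against ``i_extra_isize`` might be sketched like this. It is only an illustration with hypothetical helper names; the offsets (0x80 for ``i_extra_isize``, 0x90 for ``i_crtime``, 0x9C for ``i_projid``) are taken from the inode table above.

```python
import struct

GOOD_OLD_INODE_SIZE = 128  # EXT2_GOOD_OLD_INODE_SIZE


def extra_isize(raw_inode: bytes, s_inode_size: int) -> int:
    """Read i_extra_isize, or return 0 when the record has no extra area."""
    if s_inode_size <= GOOD_OLD_INODE_SIZE:
        return 0
    (extra,) = struct.unpack_from("<H", raw_inode, 0x80)
    return extra


def field_present(extra: int, offset: int, size: int) -> bool:
    """True if a field at [offset, offset + size) lies inside the used inode."""
    return offset + size <= GOOD_OLD_INODE_SIZE + extra
```

With ``i_extra_isize = 28`` the structure ends at byte 156, so ``i_crtime`` (0x90) is covered while ``i_projid`` (0x9C) is not; a reader should treat the latter as absent for such an inode.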
- -Finding an Inode -~~~~~~~~~~~~~~~~ - -Each block group contains ``sb->s_inodes_per_group`` inodes. Because -inode 0 is defined not to exist, this formula can be used to find the -block group that an inode lives in: -``bg = (inode_num - 1) / sb->s_inodes_per_group``. The particular inode -can be found within the block group's inode table at -``index = (inode_num - 1) % sb->s_inodes_per_group``. To get the byte -address within the inode table, use -``offset = index * sb->s_inode_size``. - -Inode Timestamps -~~~~~~~~~~~~~~~~ - -Four timestamps are recorded in the lower 128 bytes of the inode -structure -- inode change time (ctime), access time (atime), data -modification time (mtime), and deletion time (dtime). The four fields -are 32-bit signed integers that represent seconds since the Unix epoch -(1970-01-01 00:00:00 GMT), which means that the fields will overflow in -January 2038. For inodes that are not linked from any directory but are -still open (orphan inodes), the dtime field is overloaded for use with -the orphan list. The superblock field ``s_last_orphan`` points to the -first inode in the orphan list; dtime is then the number of the next -orphaned inode, or zero if there are no more orphans. - -If the inode structure size ``sb->s_inode_size`` is larger than 128 -bytes and the ``i_extra_isize`` field is large enough to encompass the -respective ``i_[cma]time_extra`` field, the ctime, atime, and mtime -inode fields are widened to 64 bits. Within this “extra” 32-bit field, -the lower two bits are used to extend the 32-bit seconds field to be 34 -bits wide; the upper 30 bits are used to provide nanosecond timestamp -accuracy. Therefore, timestamps should not overflow until May 2446. -dtime was not widened. There is also a fifth timestamp to record inode -creation time (crtime); this field is 64 bits wide and decoded in the -same manner as 64-bit [cma]time. 
Neither crtime nor dtime are accessible -through the regular stat() interface, though debugfs will report them. - -We use the 32-bit signed time value plus (2^32 \* (extra epoch bits)). -In other words: - -.. list-table:: - :widths: 20 20 20 20 20 - :header-rows: 1 - - * - Extra epoch bits - - MSB of 32-bit time - - Adjustment for signed 32-bit to 64-bit tv\_sec - - Decoded 64-bit tv\_sec - - valid time range - * - 0 0 - - 1 - - 0 - - ``-0x80000000 - -0x00000001`` - - 1901-12-13 to 1969-12-31 - * - 0 0 - - 0 - - 0 - - ``0x000000000 - 0x07fffffff`` - - 1970-01-01 to 2038-01-19 - * - 0 1 - - 1 - - 0x100000000 - - ``0x080000000 - 0x0ffffffff`` - - 2038-01-19 to 2106-02-07 - * - 0 1 - - 0 - - 0x100000000 - - ``0x100000000 - 0x17fffffff`` - - 2106-02-07 to 2174-02-25 - * - 1 0 - - 1 - - 0x200000000 - - ``0x180000000 - 0x1ffffffff`` - - 2174-02-25 to 2242-03-16 - * - 1 0 - - 0 - - 0x200000000 - - ``0x200000000 - 0x27fffffff`` - - 2242-03-16 to 2310-04-04 - * - 1 1 - - 1 - - 0x300000000 - - ``0x280000000 - 0x2ffffffff`` - - 2310-04-04 to 2378-04-22 - * - 1 1 - - 0 - - 0x300000000 - - ``0x300000000 - 0x37fffffff`` - - 2378-04-22 to 2446-05-10 - -This is a somewhat odd encoding since there are effectively seven times -as many positive values as negative values. There have also been -long-standing bugs decoding and encoding dates beyond 2038, which don't -seem to be fixed as of kernel 3.12 and e2fsprogs 1.42.8. 64-bit kernels -incorrectly use the extra epoch bits 1,1 for dates between 1901 and -1970. At some point the kernel will be fixed and e2fsck will fix this -situation, assuming that it is run before 2310. diff --git a/Documentation/filesystems/ext4/ondisk/journal.rst b/Documentation/filesystems/ext4/ondisk/journal.rst deleted file mode 100644 index e7031af868767..0000000000000 --- a/Documentation/filesystems/ext4/ondisk/journal.rst +++ /dev/null @@ -1,611 +0,0 @@ -.. 
SPDX-License-Identifier: GPL-2.0 - -Journal (jbd2) --------------- - -Introduced in ext3, the ext4 filesystem employs a journal to protect the -filesystem against corruption in the case of a system crash. A small -contiguous region of disk (default 128MiB) is reserved inside the -filesystem as a place to land “important” data writes on-disk as quickly -as possible. Once the important data transaction is fully written to the -disk and flushed from the disk write cache, a record of the data being -committed is also written to the journal. At some later point in time, -the journal code writes the transactions to their final locations on -disk (this could involve a lot of seeking or a lot of small -read-write-erases) before erasing the commit record. Should the system -crash during the second slow write, the journal can be replayed all the -way to the latest commit record, guaranteeing the atomicity of whatever -gets written through the journal to the disk. The effect of this is to -guarantee that the filesystem does not become stuck midway through a -metadata update. - -For performance reasons, ext4 by default only writes filesystem metadata -through the journal. This means that file data blocks are /not/ -guaranteed to be in any consistent state after a crash. If this default -guarantee level (``data=ordered``) is not satisfactory, there is a mount -option to control journal behavior. If ``data=journal``, all data and -metadata are written to disk through the journal. This is slower but -safest. If ``data=writeback``, dirty data blocks are not flushed to the -disk before the metadata are written to disk through the journal. - -The journal inode is typically inode 8. The first 68 bytes of the -journal inode are replicated in the ext4 superblock. The journal itself -is a normal (but hidden) file within the filesystem. The file usually -consumes an entire block group, though mke2fs tries to put it in the -middle of the disk. 
- -All fields in jbd2 are written to disk in big-endian order. This is the -opposite of ext4. - -NOTE: Both ext4 and ocfs2 use jbd2. - -The maximum size of a journal embedded in an ext4 filesystem is 2^32 -blocks. jbd2 itself does not seem to care. - -Layout -~~~~~~ - -Generally speaking, the journal has this format: - -.. list-table:: - :widths: 1 1 78 - :header-rows: 1 - - * - Superblock - - descriptor\_block (data\_blocks or revocation\_block) [more data or - revocations] commit\_block - - [more transactions...] - * - - - One transaction - - - -Notice that a transaction begins with either a descriptor and some data, -or a block revocation list. A finished transaction always ends with a -commit. If there is no commit record (or the checksums don't match), the -transaction will be discarded during replay. - -External Journal -~~~~~~~~~~~~~~~~ - -Optionally, an ext4 filesystem can be created with an external journal -device (as opposed to an internal journal, which uses a reserved inode). -In this case, on the filesystem device, ``s_journal_inum`` should be -zero and ``s_journal_uuid`` should be set. On the journal device there -will be an ext4 super block in the usual place, with a matching UUID. -The journal superblock will be in the next full block after the -superblock. - -.. list-table:: - :widths: 1 1 1 1 76 - :header-rows: 1 - - * - 1024 bytes of padding - - ext4 Superblock - - Journal Superblock - - descriptor\_block (data\_blocks or revocation\_block) [more data or - revocations] commit\_block - - [more transactions...] - * - - - - - - - One transaction - - - -Block Header -~~~~~~~~~~~~ - -Every block in the journal starts with a common 12-byte header -``struct journal_header_s``: - -.. list-table:: - :widths: 1 1 1 77 - :header-rows: 1 - - * - Offset - - Type - - Name - - Description - * - 0x0 - - \_\_be32 - - h\_magic - - jbd2 magic number, 0xC03B3998. - * - 0x4 - - \_\_be32 - - h\_blocktype - - Description of what this block contains. 
See the jbd2_blocktype_ table - below. - * - 0x8 - - \_\_be32 - - h\_sequence - - The transaction ID that goes with this block. - -.. _jbd2_blocktype: - -The journal block type can be any one of: - -.. list-table:: - :widths: 1 79 - :header-rows: 1 - - * - Value - - Description - * - 1 - - Descriptor. This block precedes a series of data blocks that were - written through the journal during a transaction. - * - 2 - - Block commit record. This block signifies the completion of a - transaction. - * - 3 - - Journal superblock, v1. - * - 4 - - Journal superblock, v2. - * - 5 - - Block revocation records. This speeds up recovery by enabling the - journal to skip writing blocks that were subsequently rewritten. - -Super Block -~~~~~~~~~~~ - -The super block for the journal is much simpler as compared to ext4's. -The key data kept within are size of the journal, and where to find the -start of the log of transactions. - -The journal superblock is recorded as ``struct journal_superblock_s``, -which is 1024 bytes long: - -.. list-table:: - :widths: 1 1 1 77 - :header-rows: 1 - - * - Offset - - Type - - Name - - Description - * - - - - - - - Static information describing the journal. - * - 0x0 - - journal\_header\_t (12 bytes) - - s\_header - - Common header identifying this as a superblock. - * - 0xC - - \_\_be32 - - s\_blocksize - - Journal device block size. - * - 0x10 - - \_\_be32 - - s\_maxlen - - Total number of blocks in this journal. - * - 0x14 - - \_\_be32 - - s\_first - - First block of log information. - * - - - - - - - Dynamic information describing the current state of the log. - * - 0x18 - - \_\_be32 - - s\_sequence - - First commit ID expected in log. - * - 0x1C - - \_\_be32 - - s\_start - - Block number of the start of log. Contrary to the comments, this field - being zero does not imply that the journal is clean! - * - 0x20 - - \_\_be32 - - s\_errno - - Error value, as set by jbd2\_journal\_abort(). 
- * - - - - - - - The remaining fields are only valid in a v2 superblock. - * - 0x24 - - \_\_be32 - - s\_feature\_compat; - - Compatible feature set. See the table jbd2_compat_ below. - * - 0x28 - - \_\_be32 - - s\_feature\_incompat - - Incompatible feature set. See the table jbd2_incompat_ below. - * - 0x2C - - \_\_be32 - - s\_feature\_ro\_compat - - Read-only compatible feature set. There aren't any of these currently. - * - 0x30 - - \_\_u8 - - s\_uuid[16] - - 128-bit uuid for journal. This is compared against the copy in the ext4 - super block at mount time. - * - 0x40 - - \_\_be32 - - s\_nr\_users - - Number of file systems sharing this journal. - * - 0x44 - - \_\_be32 - - s\_dynsuper - - Location of dynamic super block copy. (Not used?) - * - 0x48 - - \_\_be32 - - s\_max\_transaction - - Limit of journal blocks per transaction. (Not used?) - * - 0x4C - - \_\_be32 - - s\_max\_trans\_data - - Limit of data blocks per transaction. (Not used?) - * - 0x50 - - \_\_u8 - - s\_checksum\_type - - Checksum algorithm used for the journal. See jbd2_checksum_type_ for - more info. - * - 0x51 - - \_\_u8[3] - - s\_padding2 - - - * - 0x54 - - \_\_u32 - - s\_padding[42] - - - * - 0xFC - - \_\_be32 - - s\_checksum - - Checksum of the entire superblock, with this field set to zero. - * - 0x100 - - \_\_u8 - - s\_users[16\*48] - - ids of all file systems sharing the log. e2fsprogs/Linux don't allow - shared external journals, but I imagine Lustre (or ocfs2?), which use - the jbd2 code, might. - -.. _jbd2_compat: - -The journal compat features are any combination of the following: - -.. list-table:: - :widths: 1 79 - :header-rows: 1 - - * - Value - - Description - * - 0x1 - - Journal maintains checksums on the data blocks. - (JBD2\_FEATURE\_COMPAT\_CHECKSUM) - -.. _jbd2_incompat: - -The journal incompat features are any combination of the following: - -.. list-table:: - :widths: 1 79 - :header-rows: 1 - - * - Value - - Description - * - 0x1 - - Journal has block revocation records. 
(JBD2\_FEATURE\_INCOMPAT\_REVOKE) - * - 0x2 - - Journal can deal with 64-bit block numbers. - (JBD2\_FEATURE\_INCOMPAT\_64BIT) - * - 0x4 - - Journal commits asynchronously. (JBD2\_FEATURE\_INCOMPAT\_ASYNC\_COMMIT) - * - 0x8 - - This journal uses v2 of the checksum on-disk format. Each journal - metadata block gets its own checksum, and the block tags in the - descriptor table contain checksums for each of the data blocks in the - journal. (JBD2\_FEATURE\_INCOMPAT\_CSUM\_V2) - * - 0x10 - - This journal uses v3 of the checksum on-disk format. This is the same as - v2, but the journal block tag size is fixed regardless of the size of - block numbers. (JBD2\_FEATURE\_INCOMPAT\_CSUM\_V3) - -.. _jbd2_checksum_type: - -Journal checksum type codes are one of the following. crc32 or crc32c are the -most likely choices. - -.. list-table:: - :widths: 1 79 - :header-rows: 1 - - * - Value - - Description - * - 1 - - CRC32 - * - 2 - - MD5 - * - 3 - - SHA1 - * - 4 - - CRC32C - -Descriptor Block -~~~~~~~~~~~~~~~~ - -The descriptor block contains an array of journal block tags that -describe the final locations of the data blocks that follow in the -journal. Descriptor blocks are open-coded instead of being completely -described by a data structure, but here is the block structure anyway. -Descriptor blocks consume at least 36 bytes, but use a full block: - -.. list-table:: - :widths: 1 1 1 77 - :header-rows: 1 - - * - Offset - - Type - - Name - - Descriptor - * - 0x0 - - journal\_header\_t - - (open coded) - - Common block header. - * - 0xC - - struct journal\_block\_tag\_s - - open coded array[] - - Enough tags either to fill up the block or to describe all the data - blocks that follow this descriptor block. - -Journal block tags have any of the following formats, depending on which -journal feature and block tag flags are set. - -If JBD2\_FEATURE\_INCOMPAT\_CSUM\_V3 is set, the journal block tag is -defined as ``struct journal_block_tag3_s``, which looks like the -following. 
The size is 16 or 32 bytes. - -.. list-table:: - :widths: 1 1 1 77 - :header-rows: 1 - - * - Offset - - Type - - Name - - Descriptor - * - 0x0 - - \_\_be32 - - t\_blocknr - - Lower 32-bits of the location of where the corresponding data block - should end up on disk. - * - 0x4 - - \_\_be32 - - t\_flags - - Flags that go with the descriptor. See the table jbd2_tag_flags_ for - more info. - * - 0x8 - - \_\_be32 - - t\_blocknr\_high - - Upper 32-bits of the location of where the corresponding data block - should end up on disk. This is zero if JBD2\_FEATURE\_INCOMPAT\_64BIT is - not enabled. - * - 0xC - - \_\_be32 - - t\_checksum - - Checksum of the journal UUID, the sequence number, and the data block. - * - - - - - - - This field appears to be open coded. It always comes at the end of the - tag, after t_checksum. This field is not present if the "same UUID" flag - is set. - * - 0x8 or 0xC - - char - - uuid[16] - - A UUID to go with this tag. This field appears to be copied from the - ``j_uuid`` field in ``struct journal_s``, but only tune2fs touches that - field. - -.. _jbd2_tag_flags: - -The journal tag flags are any combination of the following: - -.. list-table:: - :widths: 1 79 - :header-rows: 1 - - * - Value - - Description - * - 0x1 - - On-disk block is escaped. The first four bytes of the data block just - happened to match the jbd2 magic number. - * - 0x2 - - This block has the same UUID as previous, therefore the UUID field is - omitted. - * - 0x4 - - The data block was deleted by the transaction. (Not used?) - * - 0x8 - - This is the last tag in this descriptor block. - -If JBD2\_FEATURE\_INCOMPAT\_CSUM\_V3 is NOT set, the journal block tag -is defined as ``struct journal_block_tag_s``, which looks like the -following. The size is 8, 12, 24, or 28 bytes: - -.. 
list-table:: - :widths: 1 1 1 77 - :header-rows: 1 - - * - Offset - - Type - - Name - - Descriptor - * - 0x0 - - \_\_be32 - - t\_blocknr - - Lower 32-bits of the location of where the corresponding data block - should end up on disk. - * - 0x4 - - \_\_be16 - - t\_checksum - - Checksum of the journal UUID, the sequence number, and the data block. - Note that only the lower 16 bits are stored. - * - 0x6 - - \_\_be16 - - t\_flags - - Flags that go with the descriptor. See the table jbd2_tag_flags_ for - more info. - * - - - - - - - This next field is only present if the super block indicates support for - 64-bit block numbers. - * - 0x8 - - \_\_be32 - - t\_blocknr\_high - - Upper 32-bits of the location of where the corresponding data block - should end up on disk. - * - - - - - - - This field appears to be open coded. It always comes at the end of the - tag, after t_flags or t_blocknr_high. This field is not present if the - "same UUID" flag is set. - * - 0x8 or 0xC - - char - - uuid[16] - - A UUID to go with this tag. This field appears to be copied from the - ``j_uuid`` field in ``struct journal_s``, but only tune2fs touches that - field. - -If JBD2\_FEATURE\_INCOMPAT\_CSUM\_V2 or -JBD2\_FEATURE\_INCOMPAT\_CSUM\_V3 are set, the end of the block is a -``struct jbd2_journal_block_tail``, which looks like this: - -.. list-table:: - :widths: 1 1 1 77 - :header-rows: 1 - - * - Offset - - Type - - Name - - Descriptor - * - 0x0 - - \_\_be32 - - t\_checksum - - Checksum of the journal UUID + the descriptor block, with this field set - to zero. - -Data Block -~~~~~~~~~~ - -In general, the data blocks being written to disk through the journal -are written verbatim into the journal file after the descriptor block. -However, if the first four bytes of the block match the jbd2 magic -number then those four bytes are replaced with zeroes and the “escaped” -flag is set in the descriptor block tag. 
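The escaping rule for data blocks can be illustrated with a short sketch. The helper names are hypothetical; the magic number (0xC03B3998, stored big-endian like all jbd2 fields) and the “escaped” tag flag value (0x1) come from the tables above.

```python
import struct

JBD2_MAGIC = 0xC03B3998  # h_magic, stored big-endian on disk
TAG_FLAG_ESCAPE = 0x1    # "on-disk block is escaped" tag flag


def escape_block(block: bytes):
    """Return (journal_copy, tag_flags) for a data block entering the journal."""
    (first,) = struct.unpack_from(">I", block, 0)
    if first == JBD2_MAGIC:
        # Zero the colliding bytes so the copy cannot be mistaken
        # for a journal metadata block during replay.
        return b"\x00\x00\x00\x00" + block[4:], TAG_FLAG_ESCAPE
    return block, 0


def unescape_block(journal_copy: bytes, tag_flags: int) -> bytes:
    """Restore the original first four bytes when replaying an escaped block."""
    if tag_flags & TAG_FLAG_ESCAPE:
        return struct.pack(">I", JBD2_MAGIC) + journal_copy[4:]
    return journal_copy
```

Because the replaced bytes are always the magic number, the flag alone is enough to reconstruct the block on replay; nothing else needs to be stored.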
- -Revocation Block -~~~~~~~~~~~~~~~~ - -A revocation block is used to prevent replay of a block in an earlier -transaction. This is used to mark blocks that were journalled at one -time but are no longer journalled. Typically this happens if a metadata -block is freed and re-allocated as a file data block; in this case, a -journal replay after the file block was written to disk will cause -corruption. - -**NOTE**: This mechanism is NOT used to express “this journal block is -superseded by this other journal block”, as the author (djwong) -mistakenly thought. Any block being added to a transaction will cause -the removal of all existing revocation records for that block. - -Revocation blocks are described in -``struct jbd2_journal_revoke_header_s``, are at least 16 bytes in -length, but use a full block: - -.. list-table:: - :widths: 1 1 1 77 - :header-rows: 1 - - * - Offset - - Type - - Name - - Description - * - 0x0 - - journal\_header\_t - - r\_header - - Common block header. - * - 0xC - - \_\_be32 - - r\_count - - Number of bytes used in this block. - * - 0x10 - - \_\_be32 or \_\_be64 - - blocks[0] - - Blocks to revoke. - -After r\_count is a linear array of block numbers that are effectively -revoked by this transaction. The size of each block number is 8 bytes if -the superblock advertises 64-bit block number support, or 4 bytes -otherwise. - -If JBD2\_FEATURE\_INCOMPAT\_CSUM\_V2 or -JBD2\_FEATURE\_INCOMPAT\_CSUM\_V3 are set, the end of the revocation -block is a ``struct jbd2_journal_revoke_tail``, which has this format: - -.. list-table:: - :widths: 1 1 1 77 - :header-rows: 1 - - * - Offset - - Type - - Name - - Description - * - 0x0 - - \_\_be32 - - r\_checksum - - Checksum of the journal UUID + revocation block - -Commit Block -~~~~~~~~~~~~ - -The commit block is a sentry that indicates that a transaction has been -completely written to the journal. 
Once this commit block reaches the -journal, the data stored with this transaction can be written to their -final locations on disk. - -The commit block is described by ``struct commit_header``, which is 32 -bytes long (but uses a full block): - -.. list-table:: - :widths: 1 1 1 77 - :header-rows: 1 - - * - Offset - - Type - - Name - - Descriptor - * - 0x0 - - journal\_header\_s - - (open coded) - - Common block header. - * - 0xC - - unsigned char - - h\_chksum\_type - - The type of checksum to use to verify the integrity of the data blocks - in the transaction. See jbd2_checksum_type_ for more info. - * - 0xD - - unsigned char - - h\_chksum\_size - - The number of bytes used by the checksum. Most likely 4. - * - 0xE - - unsigned char - - h\_padding[2] - - - * - 0x10 - - \_\_be32 - - h\_chksum[JBD2\_CHECKSUM\_BYTES] - - 32 bytes of space to store checksums. If - JBD2\_FEATURE\_INCOMPAT\_CSUM\_V2 or JBD2\_FEATURE\_INCOMPAT\_CSUM\_V3 - are set, the first ``__be32`` is the checksum of the journal UUID and - the entire commit block, with this field zeroed. If - JBD2\_FEATURE\_COMPAT\_CHECKSUM is set, the first ``__be32`` is the - crc32 of all the blocks already written to the transaction. - * - 0x30 - - \_\_be64 - - h\_commit\_sec - - The time that the transaction was committed, in seconds since the epoch. - * - 0x38 - - \_\_be32 - - h\_commit\_nsec - - Nanoseconds component of the above timestamp. - diff --git a/Documentation/filesystems/ext4/ondisk/mmp.rst b/Documentation/filesystems/ext4/ondisk/mmp.rst deleted file mode 100644 index b7d7a3137f803..0000000000000 --- a/Documentation/filesystems/ext4/ondisk/mmp.rst +++ /dev/null @@ -1,77 +0,0 @@ -.. SPDX-License-Identifier: GPL-2.0 - -Multiple Mount Protection -------------------------- - -Multiple mount protection (MMP) is a feature that protects the -filesystem against multiple hosts trying to use the filesystem -simultaneously. 
When a filesystem is opened (for mounting, or fsck, -etc.), the MMP code running on the node (call it node A) checks a -sequence number. If the sequence number is EXT4\_MMP\_SEQ\_CLEAN, the -open continues. If the sequence number is EXT4\_MMP\_SEQ\_FSCK, then -fsck is (hopefully) running, and open fails immediately. Otherwise, the -open code will wait for twice the specified MMP check interval and check -the sequence number again. If the sequence number has changed, then the -filesystem is active on another machine and the open fails. If the MMP -code passes all of those checks, a new MMP sequence number is generated -and written to the MMP block, and the mount proceeds. - -While the filesystem is live, the kernel sets up a timer to re-check the -MMP block at the specified MMP check interval. To perform the re-check, -the MMP sequence number is re-read; if it does not match the in-memory -MMP sequence number, then another node (node B) has mounted the -filesystem, and node A remounts the filesystem read-only. If the -sequence numbers match, the sequence number is incremented both in -memory and on disk, and the re-check is complete. - -The hostname and device filename are written into the MMP block whenever -an open operation succeeds. The MMP code does not use these values; they -are provided purely for informational purposes. - -The checksum is calculated against the FS UUID and the MMP structure. -The MMP structure (``struct mmp_struct``) is as follows: - -.. list-table:: - :widths: 1 1 1 77 - :header-rows: 1 - - * - Offset - - Type - - Name - - Description - * - 0x0 - - \_\_le32 - - mmp\_magic - - Magic number for MMP, 0x004D4D50 (“MMP”). - * - 0x4 - - \_\_le32 - - mmp\_seq - - Sequence number, updated periodically. - * - 0x8 - - \_\_le64 - - mmp\_time - - Time that the MMP block was last updated. - * - 0x10 - - char[64] - - mmp\_nodename - - Hostname of the node that opened the filesystem. 
- * - 0x50 - - char[32] - - mmp\_bdevname - - Block device name of the filesystem. - * - 0x70 - - \_\_le16 - - mmp\_check\_interval - - The MMP re-check interval, in seconds. - * - 0x72 - - \_\_le16 - - mmp\_pad1 - - Zero. - * - 0x74 - - \_\_le32[226] - - mmp\_pad2 - - Zero. - * - 0x3FC - - \_\_le32 - - mmp\_checksum - - Checksum of the MMP block. diff --git a/Documentation/filesystems/ext4/ondisk/overview.rst b/Documentation/filesystems/ext4/ondisk/overview.rst deleted file mode 100644 index cbab18baba121..0000000000000 --- a/Documentation/filesystems/ext4/ondisk/overview.rst +++ /dev/null @@ -1,26 +0,0 @@ -.. SPDX-License-Identifier: GPL-2.0 - -High Level Design -================= - -An ext4 file system is split into a series of block groups. To reduce -performance difficulties due to fragmentation, the block allocator tries -very hard to keep each file's blocks within the same group, thereby -reducing seek times. The size of a block group is specified in -``sb.s_blocks_per_group`` blocks, though it can also be calculated as 8 \* -``block_size_in_bytes``. With the default block size of 4KiB, each group -will contain 32,768 blocks, for a length of 128MiB. The number of block -groups is the size of the device divided by the size of a block group. - -All fields in ext4 are written to disk in little-endian order. HOWEVER, -all fields in jbd2 (the journal) are written to disk in big-endian -order. - -.. include:: blocks.rst -.. include:: blockgroup.rst -.. include:: special_inodes.rst -.. include:: allocators.rst -.. include:: checksums.rst -.. include:: bigalloc.rst -.. include:: inlinedata.rst -.. include:: eainode.rst diff --git a/Documentation/filesystems/ext4/ondisk/special_inodes.rst b/Documentation/filesystems/ext4/ondisk/special_inodes.rst deleted file mode 100644 index a82f70c9baeb3..0000000000000 --- a/Documentation/filesystems/ext4/ondisk/special_inodes.rst +++ /dev/null @@ -1,38 +0,0 @@ -.. 
SPDX-License-Identifier: GPL-2.0 - -Special inodes --------------- - -ext4 reserves some inodes for special features, as follows: - -.. list-table:: - :widths: 1 79 - :header-rows: 1 - - * - inode Number - - Purpose - * - 0 - - Doesn't exist; there is no inode 0. - * - 1 - - List of defective blocks. - * - 2 - - Root directory. - * - 3 - - User quota. - * - 4 - - Group quota. - * - 5 - - Boot loader. - * - 6 - - Undelete directory. - * - 7 - - Reserved group descriptors inode. (“resize inode”) - * - 8 - - Journal inode. - * - 9 - - The “exclude” inode, for snapshots(?) - * - 10 - - Replica inode, used for some non-upstream feature? - * - 11 - - Traditional first non-reserved inode. Usually this is the lost+found directory. See s\_first\_ino in the superblock. - diff --git a/Documentation/filesystems/ext4/ondisk/super.rst b/Documentation/filesystems/ext4/ondisk/super.rst deleted file mode 100644 index 5f81dd87e0b93..0000000000000 --- a/Documentation/filesystems/ext4/ondisk/super.rst +++ /dev/null @@ -1,801 +0,0 @@ -.. SPDX-License-Identifier: GPL-2.0 - -Super Block ------------ - -The superblock records various information about the enclosing -filesystem, such as block counts, inode counts, supported features, -maintenance information, and more. - -If the sparse\_super feature flag is set, redundant copies of the -superblock and group descriptors are kept only in the groups whose group -number is either 0 or a power of 3, 5, or 7. If the flag is not set, -redundant copies are kept in all groups. - -The superblock checksum is calculated against the superblock structure, -which includes the FS UUID. - -The ext4 superblock is laid out as follows in -``struct ext4_super_block``: - -.. list-table:: - :widths: 1 1 1 77 - :header-rows: 1 - - * - Offset - - Size - - Name - - Description - * - 0x0 - - \_\_le32 - - s\_inodes\_count - - Total inode count. - * - 0x4 - - \_\_le32 - - s\_blocks\_count\_lo - - Total block count. 
- * - 0x8 - - \_\_le32 - - s\_r\_blocks\_count\_lo - - This number of blocks can only be allocated by the super-user. - * - 0xC - - \_\_le32 - - s\_free\_blocks\_count\_lo - - Free block count. - * - 0x10 - - \_\_le32 - - s\_free\_inodes\_count - - Free inode count. - * - 0x14 - - \_\_le32 - - s\_first\_data\_block - - First data block. This must be at least 1 for 1k-block filesystems and - is typically 0 for all other block sizes. - * - 0x18 - - \_\_le32 - - s\_log\_block\_size - - Block size is 2 ^ (10 + s\_log\_block\_size). - * - 0x1C - - \_\_le32 - - s\_log\_cluster\_size - - Cluster size is (2 ^ s\_log\_cluster\_size) blocks if bigalloc is - enabled. Otherwise s\_log\_cluster\_size must equal s\_log\_block\_size. - * - 0x20 - - \_\_le32 - - s\_blocks\_per\_group - - Blocks per group. - * - 0x24 - - \_\_le32 - - s\_clusters\_per\_group - - Clusters per group, if bigalloc is enabled. Otherwise - s\_clusters\_per\_group must equal s\_blocks\_per\_group. - * - 0x28 - - \_\_le32 - - s\_inodes\_per\_group - - Inodes per group. - * - 0x2C - - \_\_le32 - - s\_mtime - - Mount time, in seconds since the epoch. - * - 0x30 - - \_\_le32 - - s\_wtime - - Write time, in seconds since the epoch. - * - 0x34 - - \_\_le16 - - s\_mnt\_count - - Number of mounts since the last fsck. - * - 0x36 - - \_\_le16 - - s\_max\_mnt\_count - - Number of mounts beyond which a fsck is needed. - * - 0x38 - - \_\_le16 - - s\_magic - - Magic signature, 0xEF53 - * - 0x3A - - \_\_le16 - - s\_state - - File system state. See super_state_ for more info. - * - 0x3C - - \_\_le16 - - s\_errors - - Behaviour when detecting errors. See super_errors_ for more info. - * - 0x3E - - \_\_le16 - - s\_minor\_rev\_level - - Minor revision level. - * - 0x40 - - \_\_le32 - - s\_lastcheck - - Time of last check, in seconds since the epoch. - * - 0x44 - - \_\_le32 - - s\_checkinterval - - Maximum time between checks, in seconds. - * - 0x48 - - \_\_le32 - - s\_creator\_os - - Creator OS. 
See the table super_creator_ for more info. - * - 0x4C - - \_\_le32 - - s\_rev\_level - - Revision level. See the table super_revision_ for more info. - * - 0x50 - - \_\_le16 - - s\_def\_resuid - - Default uid for reserved blocks. - * - 0x52 - - \_\_le16 - - s\_def\_resgid - - Default gid for reserved blocks. - * - - - - - - - These fields are for EXT4_DYNAMIC_REV superblocks only. - - Note: the difference between the compatible feature set and the - incompatible feature set is that if there is a bit set in the - incompatible feature set that the kernel doesn't know about, it should - refuse to mount the filesystem. - - e2fsck's requirements are more strict; if it doesn't know - about a feature in either the compatible or incompatible feature set, it - must abort and not try to meddle with things it doesn't understand... - * - 0x54 - - \_\_le32 - - s\_first\_ino - - First non-reserved inode. - * - 0x58 - - \_\_le16 - - s\_inode\_size - - Size of inode structure, in bytes. - * - 0x5A - - \_\_le16 - - s\_block\_group\_nr - - Block group # of this superblock. - * - 0x5C - - \_\_le32 - - s\_feature\_compat - - Compatible feature set flags. Kernel can still read/write this fs even - if it doesn't understand a flag; fsck should not do that. See the - super_compat_ table for more info. - * - 0x60 - - \_\_le32 - - s\_feature\_incompat - - Incompatible feature set. If the kernel or fsck doesn't understand one - of these bits, it should stop. See the super_incompat_ table for more - info. - * - 0x64 - - \_\_le32 - - s\_feature\_ro\_compat - - Readonly-compatible feature set. If the kernel doesn't understand one of - these bits, it can still mount read-only. See the super_rocompat_ table - for more info. - * - 0x68 - - \_\_u8 - - s\_uuid[16] - - 128-bit UUID for volume. - * - 0x78 - - char - - s\_volume\_name[16] - - Volume label. - * - 0x88 - - char - - s\_last\_mounted[64] - - Directory where filesystem was last mounted. 
- * - 0xC8 - - \_\_le32 - - s\_algorithm\_usage\_bitmap - - For compression (Not used in e2fsprogs/Linux) - * - - - - - - - Performance hints. Directory preallocation should only happen if the - EXT4_FEATURE_COMPAT_DIR_PREALLOC flag is on. - * - 0xCC - - \_\_u8 - - s\_prealloc\_blocks - - #. of blocks to try to preallocate for ... files? (Not used in - e2fsprogs/Linux) - * - 0xCD - - \_\_u8 - - s\_prealloc\_dir\_blocks - - #. of blocks to preallocate for directories. (Not used in - e2fsprogs/Linux) - * - 0xCE - - \_\_le16 - - s\_reserved\_gdt\_blocks - - Number of reserved GDT entries for future filesystem expansion. - * - - - - - - - Journalling support is valid only if EXT4_FEATURE_COMPAT_HAS_JOURNAL is - set. - * - 0xD0 - - \_\_u8 - - s\_journal\_uuid[16] - - UUID of journal superblock - * - 0xE0 - - \_\_le32 - - s\_journal\_inum - - inode number of journal file. - * - 0xE4 - - \_\_le32 - - s\_journal\_dev - - Device number of journal file, if the external journal feature flag is - set. - * - 0xE8 - - \_\_le32 - - s\_last\_orphan - - Start of list of orphaned inodes to delete. - * - 0xEC - - \_\_le32 - - s\_hash\_seed[4] - - HTREE hash seed. - * - 0xFC - - \_\_u8 - - s\_def\_hash\_version - - Default hash algorithm to use for directory hashes. See super_def_hash_ - for more info. - * - 0xFD - - \_\_u8 - - s\_jnl\_backup\_type - - If this value is 0 or EXT3\_JNL\_BACKUP\_BLOCKS (1), then the - ``s_jnl_blocks`` field contains a duplicate copy of the inode's - ``i_block[]`` array and ``i_size``. - * - 0xFE - - \_\_le16 - - s\_desc\_size - - Size of group descriptors, in bytes, if the 64bit incompat feature flag - is set. - * - 0x100 - - \_\_le32 - - s\_default\_mount\_opts - - Default mount options. See the super_mountopts_ table for more info. - * - 0x104 - - \_\_le32 - - s\_first\_meta\_bg - - First metablock block group, if the meta\_bg feature is enabled. - * - 0x108 - - \_\_le32 - - s\_mkfs\_time - - When the filesystem was created, in seconds since the epoch. 
- * - 0x10C - - \_\_le32 - - s\_jnl\_blocks[17] - - Backup copy of the journal inode's ``i_block[]`` array in the first 15 - elements and i\_size\_high and i\_size in the 16th and 17th elements, - respectively. - * - - - - - - - 64bit support is valid only if EXT4_FEATURE_INCOMPAT_64BIT is set. - * - 0x150 - - \_\_le32 - - s\_blocks\_count\_hi - - High 32-bits of the block count. - * - 0x154 - - \_\_le32 - - s\_r\_blocks\_count\_hi - - High 32-bits of the reserved block count. - * - 0x158 - - \_\_le32 - - s\_free\_blocks\_count\_hi - - High 32-bits of the free block count. - * - 0x15C - - \_\_le16 - - s\_min\_extra\_isize - - All inodes have at least # bytes. - * - 0x15E - - \_\_le16 - - s\_want\_extra\_isize - - New inodes should reserve # bytes. - * - 0x160 - - \_\_le32 - - s\_flags - - Miscellaneous flags. See the super_flags_ table for more info. - * - 0x164 - - \_\_le16 - - s\_raid\_stride - - RAID stride. This is the number of logical blocks read from or written - to the disk before moving to the next disk. This affects the placement - of filesystem metadata, which will hopefully make RAID storage faster. - * - 0x166 - - \_\_le16 - - s\_mmp\_interval - - #. seconds to wait in multi-mount prevention (MMP) checking. In theory, - MMP is a mechanism to record in the superblock which host and device - have mounted the filesystem, in order to prevent multiple mounts. This - feature does not seem to be implemented... - * - 0x168 - - \_\_le64 - - s\_mmp\_block - - Block # for multi-mount protection data. - * - 0x170 - - \_\_le32 - - s\_raid\_stripe\_width - - RAID stripe width. This is the number of logical blocks read from or - written to the disk before coming back to the current disk. This is used - by the block allocator to try to reduce the number of read-modify-write - operations in a RAID5/6. - * - 0x174 - - \_\_u8 - - s\_log\_groups\_per\_flex - - Size of a flexible block group is 2 ^ ``s_log_groups_per_flex``. 
- * - 0x175 - - \_\_u8 - - s\_checksum\_type - - Metadata checksum algorithm type. The only valid value is 1 (crc32c). - * - 0x176 - - \_\_le16 - - s\_reserved\_pad - - - * - 0x178 - - \_\_le64 - - s\_kbytes\_written - - Number of KiB written to this filesystem over its lifetime. - * - 0x180 - - \_\_le32 - - s\_snapshot\_inum - - inode number of active snapshot. (Not used in e2fsprogs/Linux.) - * - 0x184 - - \_\_le32 - - s\_snapshot\_id - - Sequential ID of active snapshot. (Not used in e2fsprogs/Linux.) - * - 0x188 - - \_\_le64 - - s\_snapshot\_r\_blocks\_count - - Number of blocks reserved for active snapshot's future use. (Not used in - e2fsprogs/Linux.) - * - 0x190 - - \_\_le32 - - s\_snapshot\_list - - inode number of the head of the on-disk snapshot list. (Not used in - e2fsprogs/Linux.) - * - 0x194 - - \_\_le32 - - s\_error\_count - - Number of errors seen. - * - 0x198 - - \_\_le32 - - s\_first\_error\_time - - First time an error happened, in seconds since the epoch. - * - 0x19C - - \_\_le32 - - s\_first\_error\_ino - - inode involved in first error. - * - 0x1A0 - - \_\_le64 - - s\_first\_error\_block - - Number of the block involved in the first error. - * - 0x1A8 - - \_\_u8 - - s\_first\_error\_func[32] - - Name of function where the error happened. - * - 0x1C8 - - \_\_le32 - - s\_first\_error\_line - - Line number where error happened. - * - 0x1CC - - \_\_le32 - - s\_last\_error\_time - - Time of most recent error, in seconds since the epoch. - * - 0x1D0 - - \_\_le32 - - s\_last\_error\_ino - - inode involved in most recent error. - * - 0x1D4 - - \_\_le32 - - s\_last\_error\_line - - Line number where most recent error happened. - * - 0x1D8 - - \_\_le64 - - s\_last\_error\_block - - Number of the block involved in the most recent error. - * - 0x1E0 - - \_\_u8 - - s\_last\_error\_func[32] - - Name of function where the most recent error happened. - * - 0x200 - - \_\_u8 - - s\_mount\_opts[64] - - ASCIIZ string of mount options. 
- * - 0x240 - - \_\_le32 - - s\_usr\_quota\_inum - - Inode number of user `quota <quota>`__ file. - * - 0x244 - - \_\_le32 - - s\_grp\_quota\_inum - - Inode number of group `quota <quota>`__ file. - * - 0x248 - - \_\_le32 - - s\_overhead\_blocks - - Overhead blocks/clusters in fs. (Huh? This field is always zero, which - means that the kernel calculates it dynamically.) - * - 0x24C - - \_\_le32 - - s\_backup\_bgs[2] - - Block groups containing superblock backups (if sparse\_super2) - * - 0x254 - - \_\_u8 - - s\_encrypt\_algos[4] - - Encryption algorithms in use. There can be up to four algorithms in use - at any time; valid algorithm codes are given in the super_encrypt_ table - below. - * - 0x258 - - \_\_u8 - - s\_encrypt\_pw\_salt[16] - - Salt for the string2key algorithm for encryption. - * - 0x268 - - \_\_le32 - - s\_lpf\_ino - - Inode number of lost+found - * - 0x26C - - \_\_le32 - - s\_prj\_quota\_inum - - Inode that tracks project quotas. - * - 0x270 - - \_\_le32 - - s\_checksum\_seed - - Checksum seed used for metadata\_csum calculations. This value is - crc32c(~0, $orig\_fs\_uuid). - * - 0x274 - - \_\_u8 - - s\_wtime_hi - - Upper 8 bits of the s_wtime field. - * - 0x275 - - \_\_u8 - - s\_mtime_hi - - Upper 8 bits of the s_mtime field. - * - 0x276 - - \_\_u8 - - s\_mkfs_time_hi - - Upper 8 bits of the s_mkfs_time field. - * - 0x277 - - \_\_u8 - - s\_lastcheck_hi - - Upper 8 bits of the s_lastcheck field. - * - 0x278 - - \_\_u8 - - s\_first_error_time_hi - - Upper 8 bits of the s_first_error_time field. - * - 0x279 - - \_\_u8 - - s\_last_error_time_hi - - Upper 8 bits of the s_last_error_time field. - * - 0x27A - - \_\_u8[2] - - s\_pad - - Zero padding. - * - 0x27C - - \_\_le32 - - s\_reserved[96] - - Padding to the end of the block. - * - 0x3FC - - \_\_le32 - - s\_checksum - - Superblock checksum. - -.. _super_state: - -The superblock state is some combination of the following: - -.. 
list-table:: - :widths: 1 79 - :header-rows: 1 - - * - Value - - Description - * - 0x0001 - - Cleanly umounted - * - 0x0002 - - Errors detected - * - 0x0004 - - Orphans being recovered - -.. _super_errors: - -The superblock error policy is one of the following: - -.. list-table:: - :widths: 1 79 - :header-rows: 1 - - * - Value - - Description - * - 1 - - Continue - * - 2 - - Remount read-only - * - 3 - - Panic - -.. _super_creator: - -The filesystem creator is one of the following: - -.. list-table:: - :widths: 1 79 - :header-rows: 1 - - * - Value - - Description - * - 0 - - Linux - * - 1 - - Hurd - * - 2 - - Masix - * - 3 - - FreeBSD - * - 4 - - Lites - -.. _super_revision: - -The superblock revision is one of the following: - -.. list-table:: - :widths: 1 79 - :header-rows: 1 - - * - Value - - Description - * - 0 - - Original format - * - 1 - - v2 format w/ dynamic inode sizes - -Note that ``EXT4_DYNAMIC_REV`` refers to a revision 1 or newer filesystem. - -.. _super_compat: - -The superblock compatible features field is a combination of any of the -following: - -.. list-table:: - :widths: 1 79 - :header-rows: 1 - - * - Value - - Description - * - 0x1 - - Directory preallocation (COMPAT\_DIR\_PREALLOC). - * - 0x2 - - “imagic inodes”. Not clear from the code what this does - (COMPAT\_IMAGIC\_INODES). - * - 0x4 - - Has a journal (COMPAT\_HAS\_JOURNAL). - * - 0x8 - - Supports extended attributes (COMPAT\_EXT\_ATTR). - * - 0x10 - - Has reserved GDT blocks for filesystem expansion - (COMPAT\_RESIZE\_INODE). Requires RO\_COMPAT\_SPARSE\_SUPER. - * - 0x20 - - Has directory indices (COMPAT\_DIR\_INDEX). - * - 0x40 - - “Lazy BG”. Not in Linux kernel, seems to have been for uninitialized - block groups? (COMPAT\_LAZY\_BG) - * - 0x80 - - “Exclude inode”. Not used. (COMPAT\_EXCLUDE\_INODE). - * - 0x100 - - “Exclude bitmap”. Seems to be used to indicate the presence of - snapshot-related exclude bitmaps? Not defined in kernel or used in - e2fsprogs (COMPAT\_EXCLUDE\_BITMAP). 
- * - 0x200 - - Sparse Super Block, v2. If this flag is set, the SB field s\_backup\_bgs - points to the two block groups that contain backup superblocks - (COMPAT\_SPARSE\_SUPER2). - -.. _super_incompat: - -The superblock incompatible features field is a combination of any of the -following: - -.. list-table:: - :widths: 1 79 - :header-rows: 1 - - * - Value - - Description - * - 0x1 - - Compression (INCOMPAT\_COMPRESSION). - * - 0x2 - - Directory entries record the file type. See ext4\_dir\_entry\_2 below - (INCOMPAT\_FILETYPE). - * - 0x4 - - Filesystem needs recovery (INCOMPAT\_RECOVER). - * - 0x8 - - Filesystem has a separate journal device (INCOMPAT\_JOURNAL\_DEV). - * - 0x10 - - Meta block groups. See the earlier discussion of this feature - (INCOMPAT\_META\_BG). - * - 0x40 - - Files in this filesystem use extents (INCOMPAT\_EXTENTS). - * - 0x80 - - Enable a filesystem size of 2^64 blocks (INCOMPAT\_64BIT). - * - 0x100 - - Multiple mount protection. Not implemented (INCOMPAT\_MMP). - * - 0x200 - - Flexible block groups. See the earlier discussion of this feature - (INCOMPAT\_FLEX\_BG). - * - 0x400 - - Inodes can be used to store large extended attribute values - (INCOMPAT\_EA\_INODE). - * - 0x1000 - - Data in directory entry (INCOMPAT\_DIRDATA). (Not implemented?) - * - 0x2000 - - Metadata checksum seed is stored in the superblock. This feature enables - the administrator to change the UUID of a metadata\_csum filesystem - while the filesystem is mounted; without it, the checksum definition - requires all metadata blocks to be rewritten (INCOMPAT\_CSUM\_SEED). - * - 0x4000 - - Large directory >2GB or 3-level htree (INCOMPAT\_LARGEDIR). Prior to - this feature, directories could not be larger than 4GiB and could not - have an htree more than 2 levels deep. If this feature is enabled, - directories can be larger than 4GiB and have a maximum htree depth of 3. - * - 0x8000 - - Data in inode (INCOMPAT\_INLINE\_DATA). 
- * - 0x10000 - - Encrypted inodes are present on the filesystem. (INCOMPAT\_ENCRYPT). - -.. _super_rocompat: - -The superblock read-only compatible features field is a combination of any of -the following: - -.. list-table:: - :widths: 1 79 - :header-rows: 1 - - * - Value - - Description - * - 0x1 - - Sparse superblocks. See the earlier discussion of this feature - (RO\_COMPAT\_SPARSE\_SUPER). - * - 0x2 - - This filesystem has been used to store a file greater than 2GiB - (RO\_COMPAT\_LARGE\_FILE). - * - 0x4 - - Not used in kernel or e2fsprogs (RO\_COMPAT\_BTREE\_DIR). - * - 0x8 - - This filesystem has files whose sizes are represented in units of - logical blocks, not 512-byte sectors. This implies a very large file - indeed! (RO\_COMPAT\_HUGE\_FILE) - * - 0x10 - - Group descriptors have checksums. In addition to detecting corruption, - this is useful for lazy formatting with uninitialized groups - (RO\_COMPAT\_GDT\_CSUM). - * - 0x20 - - Indicates that the old ext3 32,000 subdirectory limit no longer applies - (RO\_COMPAT\_DIR\_NLINK). A directory's i\_links\_count will be set to 1 - if it is incremented past 64,999. - * - 0x40 - - Indicates that large inodes exist on this filesystem - (RO\_COMPAT\_EXTRA\_ISIZE). - * - 0x80 - - This filesystem has a snapshot (RO\_COMPAT\_HAS\_SNAPSHOT). - * - 0x100 - - `Quota <Quota>`__ (RO\_COMPAT\_QUOTA). - * - 0x200 - - This filesystem supports “bigalloc”, which means that file extents are - tracked in units of clusters (of blocks) instead of blocks - (RO\_COMPAT\_BIGALLOC). - * - 0x400 - - This filesystem supports metadata checksumming. - (RO\_COMPAT\_METADATA\_CSUM; implies RO\_COMPAT\_GDT\_CSUM, though - GDT\_CSUM must not be set) - * - 0x800 - - Filesystem supports replicas. This feature is neither in the kernel nor - e2fsprogs. (RO\_COMPAT\_REPLICA) - * - 0x1000 - - Read-only filesystem image; the kernel will not mount this image - read-write and most tools will refuse to write to the image. 
- (RO\_COMPAT\_READONLY) - * - 0x2000 - - Filesystem tracks project quotas. (RO\_COMPAT\_PROJECT) - -.. _super_def_hash: - -The ``s_def_hash_version`` field is one of the following: - -.. list-table:: - :widths: 1 79 - :header-rows: 1 - - * - Value - - Description - * - 0x0 - - Legacy. - * - 0x1 - - Half MD4. - * - 0x2 - - Tea. - * - 0x3 - - Legacy, unsigned. - * - 0x4 - - Half MD4, unsigned. - * - 0x5 - - Tea, unsigned. - -.. _super_mountopts: - -The ``s_default_mount_opts`` field is any combination of the following: - -.. list-table:: - :widths: 1 79 - :header-rows: 1 - - * - Value - - Description - * - 0x0001 - - Print debugging info upon (re)mount. (EXT4\_DEFM\_DEBUG) - * - 0x0002 - - New files take the gid of the containing directory (instead of the fsgid - of the current process). (EXT4\_DEFM\_BSDGROUPS) - * - 0x0004 - - Support userspace-provided extended attributes. (EXT4\_DEFM\_XATTR\_USER) - * - 0x0008 - - Support POSIX access control lists (ACLs). (EXT4\_DEFM\_ACL) - * - 0x0010 - - Do not support 32-bit UIDs. (EXT4\_DEFM\_UID16) - * - 0x0020 - - All data and metadata are committed to the journal. - (EXT4\_DEFM\_JMODE\_DATA) - * - 0x0040 - - All data are flushed to the disk before metadata are committed to the - journal. (EXT4\_DEFM\_JMODE\_ORDERED) - * - 0x0060 - - Data ordering is not preserved; data may be written after the metadata - has been written. (EXT4\_DEFM\_JMODE\_WBACK) - * - 0x0100 - - Disable write flushes. (EXT4\_DEFM\_NOBARRIER) - * - 0x0200 - - Track which blocks in a filesystem are metadata and therefore should not - be used as data blocks. This option will be enabled by default on 3.18, - hopefully. (EXT4\_DEFM\_BLOCK\_VALIDITY) - * - 0x0400 - - Enable DISCARD support, where the storage device is told about blocks - becoming unused. (EXT4\_DEFM\_DISCARD) - * - 0x0800 - - Disable delayed allocation. (EXT4\_DEFM\_NODELALLOC) - -.. _super_flags: - -The ``s_flags`` field is any combination of the following: - -.. 
list-table:: - :widths: 1 79 - :header-rows: 1 - - * - Value - - Description - * - 0x0001 - - Signed directory hash in use. - * - 0x0002 - - Unsigned directory hash in use. - * - 0x0004 - - To test development code. - -.. _super_encrypt: - -The ``s_encrypt_algos`` list can contain any of the following: - -.. list-table:: - :widths: 1 79 - :header-rows: 1 - - * - Value - - Description - * - 0 - - Invalid algorithm (ENCRYPTION\_MODE\_INVALID). - * - 1 - - 256-bit AES in XTS mode (ENCRYPTION\_MODE\_AES\_256\_XTS). - * - 2 - - 256-bit AES in GCM mode (ENCRYPTION\_MODE\_AES\_256\_GCM). - * - 3 - - 256-bit AES in CBC mode (ENCRYPTION\_MODE\_AES\_256\_CBC). - -Total size of the superblock is 1024 bytes.
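The superblock layout described above can be exercised with a short decoder. What follows is a minimal sketch, not a reference implementation: it assumes the caller has already read the 1024-byte superblock into memory, it decodes only a handful of the fields listed in the table (using the offsets given there), and the names ``parse_superblock``, ``EXT4_MAGIC`` and ``INCOMPAT_64BIT`` are illustrative choices rather than identifiers taken from the kernel or e2fsprogs.

```python
import struct

EXT4_MAGIC = 0xEF53      # s_magic value from the superblock table
INCOMPAT_64BIT = 0x80    # from the incompatible-features table


def parse_superblock(sb: bytes) -> dict:
    """Decode a few little-endian fields from a 1024-byte ext4 superblock."""
    if len(sb) < 1024:
        raise ValueError("expected at least 1024 bytes of superblock data")

    def le16(offset: int) -> int:
        return struct.unpack_from("<H", sb, offset)[0]

    def le32(offset: int) -> int:
        return struct.unpack_from("<I", sb, offset)[0]

    if le16(0x38) != EXT4_MAGIC:
        raise ValueError("bad magic; this does not look like a superblock")

    incompat = le32(0x60)                 # s_feature_incompat
    blocks = le32(0x4)                    # s_blocks_count_lo
    if incompat & INCOMPAT_64BIT:
        blocks |= le32(0x150) << 32       # s_blocks_count_hi

    # s_volume_name is a 16-byte field; stop at the first NUL if there is one.
    label = sb[0x78:0x88].split(b"\0", 1)[0].decode("ascii", "replace")

    return {
        "inodes_count": le32(0x0),
        "blocks_count": blocks,
        "block_size": 1 << (10 + le32(0x18)),   # 2 ^ (10 + s_log_block_size)
        "blocks_per_group": le32(0x20),
        "feature_compat": le32(0x5C),
        "feature_incompat": incompat,
        "feature_ro_compat": le32(0x64),
        "volume_name": label,
    }
```

The high half of the block count is only folded in when the 64bit incompat flag is set, since the ``_hi`` fields are documented as valid only in that case. Unknown bits in ``feature_incompat`` are returned untouched; a real tool would refuse to proceed on bits it does not understand, as the feature-flag descriptions require.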