mirror of
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git
synced 2025-01-01 02:36:02 +00:00
e080a26725
As commit 2e6506e1c4
("mm/migrate: fix deadlock in
migrate_pages_batch() on large folios") has landed upstream, large
folios can be safely enabled for compressed inodes since all
prerequisites have already landed in 6.11-rc1.
Stress tests has been running on my fleet for over 20 days without any
regression. Additionally, users [1] have requested it for months.
Let's allow large folios for EROFS full cases upstream now for wider
testing.
[1] https://lore.kernel.org/r/CAGsJ_4wtE8OcpinuqVwG4jtdx6Qh5f+TON6wz+4HMCq=A2qFcA@mail.gmail.com
Cc: Barry Song <21cnbao@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
[ Gao Xiang: minor commit typo fixes. ]
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20240819025207.3808649-1-hsiangkao@linux.alibaba.com
369 lines
18 KiB
ReStructuredText
369 lines
18 KiB
ReStructuredText
.. SPDX-License-Identifier: GPL-2.0
|
|
|
|
======================================
|
|
EROFS - Enhanced Read-Only File System
|
|
======================================
|
|
|
|
Overview
|
|
========
|
|
|
|
EROFS filesystem stands for Enhanced Read-Only File System. It aims to form a
|
|
generic read-only filesystem solution for various read-only use cases instead
|
|
of just focusing on storage space saving without considering any side effects
|
|
of runtime performance.
|
|
|
|
It is designed to meet the needs of flexibility, feature extendability and user
|
|
payload friendly, etc. Apart from those, it is still kept as a simple
|
|
random-access friendly high-performance filesystem to get rid of unneeded I/O
|
|
amplification and memory-resident overhead compared to similar approaches.
|
|
|
|
It is implemented to be a better choice for the following scenarios:
|
|
|
|
- read-only storage media or
|
|
|
|
- part of a fully trusted read-only solution, which means it needs to be
|
|
immutable and bit-for-bit identical to the official golden image for
|
|
their releases due to security or other considerations and
|
|
|
|
- hope to minimize extra storage space with guaranteed end-to-end performance
|
|
by using compact layout, transparent file compression and direct access,
|
|
especially for those embedded devices with limited memory and high-density
|
|
hosts with numerous containers.
|
|
|
|
Here are the main features of EROFS:
|
|
|
|
- Little endian on-disk design;
|
|
|
|
- Block-based distribution and file-based distribution over fscache are
|
|
supported;
|
|
|
|
- Support multiple devices to refer to external blobs, which can be used
|
|
for container images;
|
|
|
|
- 32-bit block addresses for each device, therefore 16TiB address space at
|
|
most with 4KiB block size for now;
|
|
|
|
- Two inode layouts for different requirements:
|
|
|
|
===================== ============ ======================================
|
|
compact (v1) extended (v2)
|
|
===================== ============ ======================================
|
|
Inode metadata size 32 bytes 64 bytes
|
|
Max file size 4 GiB 16 EiB (also limited by max. vol size)
|
|
Max uids/gids 65536 4294967296
|
|
Per-inode timestamp no yes (64 + 32-bit timestamp)
|
|
Max hardlinks 65536 4294967296
|
|
Metadata reserved 8 bytes 18 bytes
|
|
===================== ============ ======================================
|
|
|
|
- Support extended attributes as an option;
|
|
|
|
- Support a bloom filter that speeds up negative extended attribute lookups;
|
|
|
|
- Support POSIX.1e ACLs by using extended attributes;
|
|
|
|
- Support transparent data compression as an option:
|
|
LZ4, MicroLZMA and DEFLATE algorithms can be used on a per-file basis; In
|
|
addition, inplace decompression is also supported to avoid bounce compressed
|
|
buffers and unnecessary page cache thrashing.
|
|
|
|
- Support chunk-based data deduplication and rolling-hash compressed data
|
|
deduplication;
|
|
|
|
- Support tailpacking inline compared to byte-addressed unaligned metadata
|
|
or smaller block size alternatives;
|
|
|
|
- Support merging tail-end data into a special inode as fragments.
|
|
|
|
- Support large folios to make use of THPs (Transparent Hugepages);
|
|
|
|
- Support direct I/O on uncompressed files to avoid double caching for loop
|
|
devices;
|
|
|
|
- Support FSDAX on uncompressed images for secure containers and ramdisks in
|
|
order to get rid of unnecessary page cache.
|
|
|
|
- Support file-based on-demand loading with the Fscache infrastructure.
|
|
|
|
The following git tree provides the file system user-space tools under
|
|
development, such as a formatting tool (mkfs.erofs), an on-disk consistency &
|
|
compatibility checking tool (fsck.erofs), and a debugging tool (dump.erofs):
|
|
|
|
- git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs-utils.git
|
|
|
|
For more information, please also refer to the documentation site:
|
|
|
|
- https://erofs.docs.kernel.org
|
|
|
|
Bugs and patches are welcome, please kindly help us and send to the following
|
|
linux-erofs mailing list:
|
|
|
|
- linux-erofs mailing list <linux-erofs@lists.ozlabs.org>
|
|
|
|
Mount options
|
|
=============
|
|
|
|
=================== =========================================================
|
|
(no)user_xattr Setup Extended User Attributes. Note: xattr is enabled
|
|
by default if CONFIG_EROFS_FS_XATTR is selected.
|
|
(no)acl Setup POSIX Access Control List. Note: acl is enabled
|
|
by default if CONFIG_EROFS_FS_POSIX_ACL is selected.
|
|
cache_strategy=%s Select a strategy for cached decompression from now on:
|
|
|
|
========== =============================================
|
|
disabled In-place I/O decompression only;
|
|
readahead Cache the last incomplete compressed physical
|
|
cluster for further reading. It still does
|
|
in-place I/O decompression for the rest
|
|
compressed physical clusters;
|
|
readaround Cache the both ends of incomplete compressed
|
|
physical clusters for further reading.
|
|
It still does in-place I/O decompression
|
|
for the rest compressed physical clusters.
|
|
========== =============================================
|
|
dax={always,never} Use direct access (no page cache). See
|
|
Documentation/filesystems/dax.rst.
|
|
dax A legacy option which is an alias for ``dax=always``.
|
|
device=%s Specify a path to an extra device to be used together.
|
|
fsid=%s Specify a filesystem image ID for Fscache back-end.
|
|
domain_id=%s Specify a domain ID in fscache mode so that different images
|
|
with the same blobs under a given domain ID can share storage.
|
|
=================== =========================================================
|
|
|
|
Sysfs Entries
|
|
=============
|
|
|
|
Information about mounted erofs file systems can be found in /sys/fs/erofs.
|
|
Each mounted filesystem will have a directory in /sys/fs/erofs based on its
|
|
device name (i.e., /sys/fs/erofs/sda).
|
|
(see also Documentation/ABI/testing/sysfs-fs-erofs)
|
|
|
|
On-disk details
|
|
===============
|
|
|
|
Summary
|
|
-------
|
|
Different from other read-only file systems, an EROFS volume is designed
|
|
to be as simple as possible::
|
|
|
|
|-> aligned with the block size
|
|
____________________________________________________________
|
|
| |SB| | ... | Metadata | ... | Data | Metadata | ... | Data |
|
|
|_|__|_|_____|__________|_____|______|__________|_____|______|
|
|
0 +1K
|
|
|
|
All data areas should be aligned with the block size, but metadata areas
|
|
may not. All metadatas can be now observed in two different spaces (views):
|
|
|
|
1. Inode metadata space
|
|
|
|
Each valid inode should be aligned with an inode slot, which is a fixed
|
|
value (32 bytes) and designed to be kept in line with compact inode size.
|
|
|
|
Each inode can be directly found with the following formula:
|
|
inode offset = meta_blkaddr * block_size + 32 * nid
|
|
|
|
::
|
|
|
|
|-> aligned with 8B
|
|
|-> followed closely
|
|
+ meta_blkaddr blocks |-> another slot
|
|
_____________________________________________________________________
|
|
| ... | inode | xattrs | extents | data inline | ... | inode ...
|
|
|________|_______|(optional)|(optional)|__(optional)_|_____|__________
|
|
|-> aligned with the inode slot size
|
|
. .
|
|
. .
|
|
. .
|
|
. .
|
|
. .
|
|
. .
|
|
.____________________________________________________|-> aligned with 4B
|
|
| xattr_ibody_header | shared xattrs | inline xattrs |
|
|
|____________________|_______________|_______________|
|
|
|-> 12 bytes <-|->x * 4 bytes<-| .
|
|
. . .
|
|
. . .
|
|
. . .
|
|
._______________________________.______________________.
|
|
| id | id | id | id | ... | id | ent | ... | ent| ... |
|
|
|____|____|____|____|______|____|_____|_____|____|_____|
|
|
|-> aligned with 4B
|
|
|-> aligned with 4B
|
|
|
|
Inode could be 32 or 64 bytes, which can be distinguished from a common
|
|
field which all inode versions have -- i_format::
|
|
|
|
__________________ __________________
|
|
| i_format | | i_format |
|
|
|__________________| |__________________|
|
|
| ... | | ... |
|
|
| | | |
|
|
|__________________| 32 bytes | |
|
|
| |
|
|
|__________________| 64 bytes
|
|
|
|
Xattrs, extents, data inline are placed after the corresponding inode with
|
|
proper alignment, and they could be optional for different data mappings.
|
|
_currently_ total 5 data layouts are supported:
|
|
|
|
== ====================================================================
|
|
0 flat file data without data inline (no extent);
|
|
1 fixed-sized output data compression (with non-compacted indexes);
|
|
2 flat file data with tail packing data inline (no extent);
|
|
3 fixed-sized output data compression (with compacted indexes, v5.3+);
|
|
4 chunk-based file (v5.15+).
|
|
== ====================================================================
|
|
|
|
The size of the optional xattrs is indicated by i_xattr_count in inode
|
|
header. Large xattrs or xattrs shared by many different files can be
|
|
stored in shared xattrs metadata rather than inlined right after inode.
|
|
|
|
2. Shared xattrs metadata space
|
|
|
|
Shared xattrs space is similar to the above inode space, started with
|
|
a specific block indicated by xattr_blkaddr, organized one by one with
|
|
proper align.
|
|
|
|
Each share xattr can also be directly found by the following formula:
|
|
xattr offset = xattr_blkaddr * block_size + 4 * xattr_id
|
|
|
|
::
|
|
|
|
|-> aligned by 4 bytes
|
|
+ xattr_blkaddr blocks |-> aligned with 4 bytes
|
|
_________________________________________________________________________
|
|
| ... | xattr_entry | xattr data | ... | xattr_entry | xattr data ...
|
|
|________|_____________|_____________|_____|______________|_______________
|
|
|
|
Directories
|
|
-----------
|
|
All directories are now organized in a compact on-disk format. Note that
|
|
each directory block is divided into index and name areas in order to support
|
|
random file lookup, and all directory entries are _strictly_ recorded in
|
|
alphabetical order in order to support improved prefix binary search
|
|
algorithm (could refer to the related source code).
|
|
|
|
::
|
|
|
|
___________________________
|
|
/ |
|
|
/ ______________|________________
|
|
/ / | nameoff1 | nameoffN-1
|
|
____________.______________._______________v________________v__________
|
|
| dirent | dirent | ... | dirent | filename | filename | ... | filename |
|
|
|___.0___|____1___|_____|___N-1__|____0_____|____1_____|_____|___N-1____|
|
|
\ ^
|
|
\ | * could have
|
|
\ | trailing '\0'
|
|
\________________________| nameoff0
|
|
Directory block
|
|
|
|
Note that apart from the offset of the first filename, nameoff0 also indicates
|
|
the total number of directory entries in this block since it is no need to
|
|
introduce another on-disk field at all.
|
|
|
|
Chunk-based files
|
|
-----------------
|
|
In order to support chunk-based data deduplication, a new inode data layout has
|
|
been supported since Linux v5.15: Files are split in equal-sized data chunks
|
|
with ``extents`` area of the inode metadata indicating how to get the chunk
|
|
data: these can be simply as a 4-byte block address array or in the 8-byte
|
|
chunk index form (see struct erofs_inode_chunk_index in erofs_fs.h for more
|
|
details.)
|
|
|
|
By the way, chunk-based files are all uncompressed for now.
|
|
|
|
Long extended attribute name prefixes
|
|
-------------------------------------
|
|
There are use cases where extended attributes with different values can have
|
|
only a few common prefixes (such as overlayfs xattrs). The predefined prefixes
|
|
work inefficiently in both image size and runtime performance in such cases.
|
|
|
|
The long xattr name prefixes feature is introduced to address this issue. The
|
|
overall idea is that, apart from the existing predefined prefixes, the xattr
|
|
entry could also refer to user-specified long xattr name prefixes, e.g.
|
|
"trusted.overlay.".
|
|
|
|
When referring to a long xattr name prefix, the highest bit (bit 7) of
|
|
erofs_xattr_entry.e_name_index is set, while the lower bits (bit 0-6) as a whole
|
|
represent the index of the referred long name prefix among all long name
|
|
prefixes. Therefore, only the trailing part of the name apart from the long
|
|
xattr name prefix is stored in erofs_xattr_entry.e_name, which could be empty if
|
|
the full xattr name matches exactly as its long xattr name prefix.
|
|
|
|
All long xattr prefixes are stored one by one in the packed inode as long as
|
|
the packed inode is valid, or in the meta inode otherwise. The
|
|
xattr_prefix_count (of the on-disk superblock) indicates the total number of
|
|
long xattr name prefixes, while (xattr_prefix_start * 4) indicates the start
|
|
offset of long name prefixes in the packed/meta inode. Note that, long extended
|
|
attribute name prefixes are disabled if xattr_prefix_count is 0.
|
|
|
|
Each long name prefix is stored in the format: ALIGN({__le16 len, data}, 4),
|
|
where len represents the total size of the data part. The data part is actually
|
|
represented by 'struct erofs_xattr_long_prefix', where base_index represents the
|
|
index of the predefined xattr name prefix, e.g. EROFS_XATTR_INDEX_TRUSTED for
|
|
"trusted.overlay." long name prefix, while the infix string keeps the string
|
|
after stripping the short prefix, e.g. "overlay." for the example above.
|
|
|
|
Data compression
|
|
----------------
|
|
EROFS implements fixed-sized output compression which generates fixed-sized
|
|
compressed data blocks from variable-sized input in contrast to other existing
|
|
fixed-sized input solutions. Relatively higher compression ratios can be gotten
|
|
by using fixed-sized output compression since nowadays popular data compression
|
|
algorithms are mostly LZ77-based and such fixed-sized output approach can be
|
|
benefited from the historical dictionary (aka. sliding window).
|
|
|
|
In details, original (uncompressed) data is turned into several variable-sized
|
|
extents and in the meanwhile, compressed into physical clusters (pclusters).
|
|
In order to record each variable-sized extent, logical clusters (lclusters) are
|
|
introduced as the basic unit of compress indexes to indicate whether a new
|
|
extent is generated within the range (HEAD) or not (NONHEAD). Lclusters are now
|
|
fixed in block size, as illustrated below::
|
|
|
|
|<- variable-sized extent ->|<- VLE ->|
|
|
clusterofs clusterofs clusterofs
|
|
| | |
|
|
_________v_________________________________v_______________________v________
|
|
... | . | | . | | . ...
|
|
____|____._________|______________|________.___ _|______________|__.________
|
|
|-> lcluster <-|-> lcluster <-|-> lcluster <-|-> lcluster <-|
|
|
(HEAD) (NONHEAD) (HEAD) (NONHEAD) .
|
|
. CBLKCNT . .
|
|
. . .
|
|
. . .
|
|
_______._____________________________.______________._________________
|
|
... | | | | ...
|
|
_______|______________|______________|______________|_________________
|
|
|-> big pcluster <-|-> pcluster <-|
|
|
|
|
A physical cluster can be seen as a container of physical compressed blocks
|
|
which contains compressed data. Previously, only lcluster-sized (4KB) pclusters
|
|
were supported. After big pcluster feature is introduced (available since
|
|
Linux v5.13), pcluster can be a multiple of lcluster size.
|
|
|
|
For each HEAD lcluster, clusterofs is recorded to indicate where a new extent
|
|
starts and blkaddr is used to seek the compressed data. For each NONHEAD
|
|
lcluster, delta0 and delta1 are available instead of blkaddr to indicate the
|
|
distance to its HEAD lcluster and the next HEAD lcluster. A PLAIN lcluster is
|
|
also a HEAD lcluster except that its data is uncompressed. See the comments
|
|
around "struct z_erofs_vle_decompressed_index" in erofs_fs.h for more details.
|
|
|
|
If big pcluster is enabled, pcluster size in lclusters needs to be recorded as
|
|
well. Let the delta0 of the first NONHEAD lcluster store the compressed block
|
|
count with a special flag as a new called CBLKCNT NONHEAD lcluster. It's easy
|
|
to understand its delta0 is constantly 1, as illustrated below::
|
|
|
|
__________________________________________________________
|
|
| HEAD | NONHEAD | NONHEAD | ... | NONHEAD | HEAD | HEAD |
|
|
|__:___|_(CBLKCNT)_|_________|_____|_________|__:___|____:_|
|
|
|<----- a big pcluster (with CBLKCNT) ------>|<-- -->|
|
|
a lcluster-sized pcluster (without CBLKCNT) ^
|
|
|
|
If another HEAD follows a HEAD lcluster, there is no room to record CBLKCNT,
|
|
but it's easy to know the size of such pcluster is 1 lcluster as well.
|
|
|
|
Since Linux v6.1, each pcluster can be used for multiple variable-sized extents,
|
|
therefore it can be used for compressed data deduplication.
|