License cleanup: add SPDX GPL-2.0 license identifier to files with no license
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.
By default all files without license information are under the default
license of the kernel, which is GPL version 2.
Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.
This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.
How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information in it,
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information.
Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.
The analysis to determine which SPDX License Identifier to apply to
a file was done in a spreadsheet of side-by-side results from the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few thousand files.
The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.
Criteria used to select files for SPDX license identifier tagging were:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).
All documentation files were explicitly excluded.
The following heuristics were used to determine which SPDX license
identifiers to apply.
- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.
For non */uapi/* files that summary was:
  SPDX license identifier                                # files
  ------------------------------------------------------|-------
  GPL-2.0                                                  11139
and resulted in the first patch in this series.
If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note"; otherwise it was "GPL-2.0". The results of that were:
  SPDX license identifier                                # files
  ------------------------------------------------------|-------
  GPL-2.0 WITH Linux-syscall-note                            930
and resulted in the second patch in this series.
- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:
  SPDX license identifier                                # files
  ------------------------------------------------------|-------
  GPL-2.0 WITH Linux-syscall-note                            270
  GPL-2.0+ WITH Linux-syscall-note                           169
  ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)         21
  ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)         17
  LGPL-2.1+ WITH Linux-syscall-note                           15
  GPL-1.0+ WITH Linux-syscall-note                            14
  ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)         5
  LGPL-2.0+ WITH Linux-syscall-note                            4
  LGPL-2.1 WITH Linux-syscall-note                             3
  ((GPL-2.0 WITH Linux-syscall-note) OR MIT)                   3
  ((GPL-2.0 WITH Linux-syscall-note) AND MIT)                  1
and that resulted in the third patch in this series.
- when the two scanners agreed on the detected license(s), that became
the concluded license(s).
- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.
- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).
- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.
- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.
In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.
Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there were new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.
Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with the SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.
In the initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.
Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct
This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.
These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-01 15:07:57 +01:00
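The tagging step described in this commit message (choose the default identifier for a file in which no license text was found, then wrap it in the comment style the file type expects) can be sketched in C. This is an illustration only and not the script written by Thomas and refined by Greg. The helper names are hypothetical, and the choice of a block comment for headers versus a // comment for .c files is an assumption drawn from the remark that the two need different comment types.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: default identifier for a file with no license text.
 * Paths matching the uapi pattern get the syscall note, others plain GPL-2.0. */
static const char *default_spdx_id(const char *path)
{
	return strstr(path, "/uapi/") ? "GPL-2.0 WITH Linux-syscall-note"
				      : "GPL-2.0";
}

/* Returns 1 if path ends with suffix, 0 otherwise. */
static int has_suffix(const char *path, const char *suffix)
{
	size_t plen = strlen(path), slen = strlen(suffix);

	return plen >= slen && strcmp(path + plen - slen, suffix) == 0;
}

/* Assumption: headers take a block comment, everything else a // comment. */
static int format_spdx_line(const char *path, char *buf, size_t len)
{
	const char *id = default_spdx_id(path);

	if (has_suffix(path, ".h"))
		return snprintf(buf, len, "/* SPDX-License-Identifier: %s */", id);
	return snprintf(buf, len, "// SPDX-License-Identifier: %s", id);
}
```

Calling format_spdx_line() with "fs/ext4/ialloc.c" yields the same first line that this file carries.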
// SPDX-License-Identifier: GPL-2.0
2006-10-11 01:20:50 -07:00
/*
2006-10-11 01:20:53 -07:00
 * linux/fs/ext4/ialloc.c
2006-10-11 01:20:50 -07:00
 *
 * Copyright (C) 1992, 1993, 1994, 1995
 * Remy Card (card@masi.ibp.fr)
 * Laboratoire MASI - Institut Blaise Pascal
 * Universite Pierre et Marie Curie (Paris VI)
 *
 * BSD ufs-inspired inode and directory allocation by
 * Stephen Tweedie (sct@redhat.com), 1993
 * Big-endian to little-endian byte-swapping/bitmaps by
 * David S. Miller (davem@caip.rutgers.edu), 1995
 */

#include <linux/time.h>
#include <linux/fs.h>
#include <linux/stat.h>
#include <linux/string.h>
#include <linux/quotaops.h>
#include <linux/buffer_head.h>
#include <linux/random.h>
#include <linux/bitops.h>
2006-10-11 01:21:05 -07:00
#include <linux/blkdev.h>
2017-02-02 17:54:15 +01:00
#include <linux/cred.h>

2006-10-11 01:20:50 -07:00
#include <asm/byteorder.h>
2009-06-17 11:48:11 -04:00

2008-04-29 18:13:32 -04:00
#include "ext4.h"
#include "ext4_jbd2.h"
2006-10-11 01:20:50 -07:00
#include "xattr.h"
#include "acl.h"

2009-06-17 11:48:11 -04:00
#include <trace/events/ext4.h>

2006-10-11 01:20:50 -07:00
/*
 * ialloc.c contains the inodes allocation and deallocation routines
 */

/*
 * The free inodes are managed by bitmaps. A file system contains several
 * blocks groups. Each group contains 1 bitmap block for blocks, 1 bitmap
 * block for inodes, N blocks for the inode table and data blocks.
 *
 * The file system contains group descriptors which are located after the
 * super block. Each descriptor contains the number of the bitmap block and
 * the free blocks count in the block.
 */

Ext4: Uninitialized Block Groups
In pass1 of e2fsck, every inode table in the filesystem is scanned and checked,
regardless of whether it is in use. This is the most time consuming part
of the filesystem check. The uninitialized block group feature can greatly
reduce e2fsck time by eliminating checking of uninitialized inodes.
With this feature, there is a high water mark of used inodes for each block
group. Block and inode bitmaps can be uninitialized on disk via a flag in the
group descriptor to avoid reading or scanning them at e2fsck time. A checksum
of each group descriptor is used to ensure that corruption in the group
descriptor's bit flags does not cause incorrect operation.
The feature is enabled through a mkfs option
mke2fs /dev/ -O uninit_groups
A patch adding support for uninitialized block groups to e2fsprogs tools has
been posted to the linux-ext4 mailing list.
The patches have been stress tested with fsstress and fsx. In performance
tests of e2fsck time, we have seen that e2fsck time on ext3 grows
linearly with the total number of inodes in the filesystem. In ext4 with the
uninitialized block groups feature, the e2fsck time is independent of the
total inode count and depends solely on the number of used inodes.
Since typical ext4 filesystems only use 1-10% of their inodes, this feature can
greatly reduce e2fsck time for users, with performance improvements of 2-20
times, depending on how full the filesystem is.
The attached graph shows the major improvements in e2fsck times in filesystems
with a large total inode count, but few inodes in use.
In each group descriptor if we have
EXT4_BG_INODE_UNINIT set in bg_flags:
Inode table is not initialized/used in this group. So we can skip
the consistency check during fsck.
EXT4_BG_BLOCK_UNINIT set in bg_flags:
No block in the group is used. So we can skip the block bitmap
verification for this group.
We also add two new fields to group descriptor as a part of
uninitialized group patch.
__le16 bg_itable_unused; /* Unused inodes count */
__le16 bg_checksum; /* crc16(sb_uuid+group+desc) */
bg_itable_unused:
If we have EXT4_BG_INODE_UNINIT not set in bg_flags
then bg_itable_unused will give the offset within
the inode table till the inodes are used. This can be
used by fsck to skip list of inodes that are marked unused.
bg_checksum:
Now that we depend on bg_flags and bg_itable_unused to determine
the block and inode usage, we need to make sure group descriptor
is not corrupt. We add checksum to group descriptor to
detect corruption. If the descriptor is found to be corrupt, we
mark all the blocks and inodes in the group used.
Signed-off-by: Avantika Mathur <mathur@us.ibm.com>
Signed-off-by: Andreas Dilger <adilger@clusterfs.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
2007-10-16 18:38:25 -04:00
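How a checker might use bg_flags, bg_itable_unused and the checksum result can be sketched as follows. This is a minimal userspace illustration, not e2fsck or kernel code: the struct, the function name and the _SKETCH constants are hypothetical, the flag values are assumed, the fields are held in host byte order rather than as on-disk __le16, and checksum verification is reduced to a boolean supplied by the caller.

```c
#include <stdint.h>

/* Assumed flag values, named after the flags in the commit message. */
#define BG_INODE_UNINIT_SKETCH 0x0001 /* inode table/bitmap not in use */
#define BG_BLOCK_UNINIT_SKETCH 0x0002 /* no block in the group is used */

/* Hypothetical, simplified view of the fields the commit message discusses. */
struct group_desc_sketch {
	uint16_t bg_flags;
	uint16_t bg_itable_unused; /* unused inodes count */
};

/* Number of inode table slots a checker would have to examine in one group. */
static uint32_t inodes_to_scan(const struct group_desc_sketch *gd,
			       uint32_t inodes_per_group, int checksum_ok)
{
	if (!checksum_ok)
		return inodes_per_group; /* corrupt: treat everything as used */
	if (gd->bg_flags & BG_INODE_UNINIT_SKETCH)
		return 0;                /* whole table can be skipped */
	if (gd->bg_itable_unused > inodes_per_group)
		return inodes_per_group; /* implausible count: scan everything */
	return inodes_per_group - gd->bg_itable_unused;
}
```

With 8192 inodes per group and bg_itable_unused of 8000, only 192 slots need scanning, which is where the reduction in e2fsck time comes from.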
/*
 * To avoid calling the atomic setbit hundreds or thousands of times, we only
 * need to use it within a single byte (to ensure we get endianness right).
 * We can use memset for the rest of the bitmap as there are no other users.
 */
2010-10-27 21:30:15 -04:00
void ext4_mark_bitmap_end(int start_bit, int end_bit, char *bitmap)
2007-10-16 18:38:25 -04:00
{
	int i;

	if (start_bit >= end_bit)
		return;

	ext4_debug("mark end bits +%d through +%d used\n", start_bit, end_bit);
	for (i = start_bit; i < ((start_bit + 7) & ~7UL); i++)
		ext4_set_bit(i, bitmap);
	if (i < end_bit)
		memset(bitmap + (i >> 3), 0xff, (end_bit - i) >> 3);
}
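The technique used by ext4_mark_bitmap_end() (set individual bits up to the next byte boundary, then memset the remaining whole bytes) can be tried in userspace with the sketch below. The names are hypothetical, set_bit_le() is a plain non-atomic stand-in for ext4_set_bit(), and the sketch assumes, as the kernel caller does, that end_bit is a multiple of 8.

```c
#include <string.h>

/* Non-atomic stand-in for ext4_set_bit(): little-endian bit numbering. */
static void set_bit_le(int nr, unsigned char *bitmap)
{
	bitmap[nr >> 3] |= (unsigned char)(1u << (nr & 7));
}

/* Mark bits [start_bit, end_bit) as used; end_bit is assumed byte-aligned. */
static void mark_bitmap_end_sketch(int start_bit, int end_bit,
				   unsigned char *bitmap)
{
	int i;

	if (start_bit >= end_bit)
		return;

	/* Bit by bit until the next byte boundary... */
	for (i = start_bit; i < ((start_bit + 7) & ~7); i++)
		set_bit_le(i, bitmap);
	/* ...then whole bytes for the rest. */
	if (i < end_bit)
		memset(bitmap + (i >> 3), 0xff, (size_t)(end_bit - i) >> 3);
}
```

For a 32-bit bitmap with start_bit 10, the loop sets bits 10 through 15 and the memset fills the last two bytes.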

2012-02-20 17:52:46 -05:00
void ext4_end_bitmap_read(struct buffer_head *bh, int uptodate)
{
	if (uptodate) {
		set_buffer_uptodate(bh);
		set_bitmap_uptodate(bh);
	}
	unlock_buffer(bh);
	put_bh(bh);
}

2015-10-17 21:33:24 -04:00
static int ext4_validate_inode_bitmap(struct super_block *sb,
				      struct ext4_group_desc *desc,
				      ext4_group_t block_group,
				      struct buffer_head *bh)
{
	ext4_fsblk_t blk;
2020-10-15 13:37:59 -07:00
	struct ext4_group_info *grp;

	if (EXT4_SB(sb)->s_mount_state & EXT4_FC_REPLAY)
		return 0;

2015-10-17 21:33:24 -04:00
	if (buffer_verified(bh))
		return 0;
2024-08-20 21:22:34 +08:00

	grp = ext4_get_group_info(sb, block_group);
ext4: allow ext4_get_group_info() to fail
Previously, ext4_get_group_info() would treat an invalid group number
as BUG(), since in theory it should never happen. However, if a
malicious attacker (or fuzzer) modifies the superblock via the block
device while the file system is mounted, it is possible for
s_first_data_block to get set to a very large number. In that case,
calculating the block group of some block number (such as the
starting block of a preallocation region) could result in an
underflow and a very large block group number. Then the BUG_ON check in
ext4_get_group_info() would fire, resulting in a denial of service
attack that can be triggered by root or someone with write access to
the block device.
From a quality of implementation perspective, it's best that even if
the system administrator does something that they shouldn't, it
will not trigger a BUG. So instead of BUG'ing, ext4_get_group_info()
will call ext4_error and return NULL. We also add fallback code in
all of the callers of ext4_get_group_info() to handle a NULL return.
Also, since ext4_get_group_info() was already borderline to be an
inline function, un-inline it. This results in a net reduction of the
compiled text size of ext4 by roughly 2k.
Cc: stable@kernel.org
Link: https://lore.kernel.org/r/20230430154311.579720-2-tytso@mit.edu
Reported-by: syzbot+e2efa3efc15a1c9e95c3@syzkaller.appspotmail.com
Link: https://syzkaller.appspot.com/bug?id=69b28112e098b070f639efb356393af3ffec4220
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2023-04-29 00:06:28 -04:00
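The failure mode and the fix described in this commit message can be illustrated with a small userspace sketch. All names below are hypothetical stand-ins rather than the kernel's structures: the error counter takes the place of ext4_error(), and the block-to-group arithmetic is a simplified assumption of how a large s_first_data_block leads to an underflow.

```c
#include <stddef.h>
#include <stdint.h>

struct group_info_sketch {
	int ibitmap_corrupt;
};

struct fs_sketch {
	uint32_t ngroups;
	struct group_info_sketch *groups;
	int error_count; /* stands in for calling ext4_error() */
};

/* Unsigned subtraction wraps when first_data_block exceeds block. */
static uint32_t block_to_group_sketch(uint64_t block, uint64_t first_data_block,
				      uint32_t blocks_per_group)
{
	return (uint32_t)((block - first_data_block) / blocks_per_group);
}

/* Report the problem and return NULL instead of crashing on a bad group. */
static struct group_info_sketch *get_group_info_sketch(struct fs_sketch *fs,
							uint32_t group)
{
	if (group >= fs->ngroups) {
		fs->error_count++;
		return NULL;
	}
	return &fs->groups[group];
}

/* Caller-side fallback, same shape as the check in the blamed line below. */
static int check_group_sketch(struct fs_sketch *fs, uint32_t group)
{
	struct group_info_sketch *grp = get_group_info_sketch(fs, group);

	if (!grp || grp->ibitmap_corrupt)
		return -1;
	return 0;
}
```

The caller treats a NULL return the same way as a group already known to be corrupt, so a bogus group number becomes an error code rather than a crash.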
	if (!grp || EXT4_MB_GRP_IBITMAP_CORRUPT(grp))
2015-10-17 21:33:24 -04:00
		return -EFSCORRUPTED;

	ext4_lock_group(sb, block_group);
2018-07-12 19:08:05 -04:00
	if (buffer_verified(bh))
		goto verified;
2015-10-17 21:33:24 -04:00
	blk = ext4_inode_bitmap(sb, desc);
2024-08-20 21:22:32 +08:00
	if (!ext4_inode_bitmap_csum_verify(sb, desc, bh) ||
2019-11-21 13:09:43 -05:00
	    ext4_simulate_fail(sb, EXT4_SIM_IBITMAP_CRC)) {
2015-10-17 21:33:24 -04:00
		ext4_unlock_group(sb, block_group);
		ext4_error(sb, "Corrupt inode bitmap - block_group = %u, "
			   "inode_bitmap = %llu", block_group, blk);
2018-05-12 11:39:40 -04:00
		ext4_mark_group_bitmap_corrupted(sb, block_group,
					EXT4_GROUP_INFO_IBITMAP_CORRUPT);
2015-10-17 21:33:24 -04:00
		return -EFSBADCRC;
	}
	set_buffer_verified(bh);
2018-07-12 19:08:05 -04:00
verified:
2015-10-17 21:33:24 -04:00
	ext4_unlock_group(sb, block_group);
	return 0;
}

2006-10-11 01:20:50 -07:00
/*
 * Read the inode allocation bitmap for a given block_group, reading
 * into the specified slot in the superblock's bitmap cache.
 *
2020-03-29 13:21:41 -07:00
 * Return buffer_head of bitmap on success, or an ERR_PTR on error.
2006-10-11 01:20:50 -07:00
 */
static struct buffer_head *
2008-08-02 21:21:02 -04:00
ext4_read_inode_bitmap(struct super_block *sb, ext4_group_t block_group)
2006-10-11 01:20:50 -07:00
{
2006-10-11 01:20:53 -07:00
	struct ext4_group_desc *desc;
2018-03-26 23:54:10 -04:00
	struct ext4_sb_info *sbi = EXT4_SB(sb);
2006-10-11 01:20:50 -07:00
	struct buffer_head *bh = NULL;
2008-08-02 21:21:02 -04:00
	ext4_fsblk_t bitmap_blk;
2015-10-17 21:33:24 -04:00
	int err;
2006-10-11 01:20:50 -07:00

2006-10-11 01:20:53 -07:00
	desc = ext4_get_group_desc(sb, block_group, NULL);
2006-10-11 01:20:50 -07:00
	if (!desc)
2015-10-17 21:33:24 -04:00
		return ERR_PTR(-EFSCORRUPTED);
ext4: add support for lazy inode table initialization
When the lazy_itable_init extended option is passed to mke2fs, it
considerably speeds up filesystem creation because inode tables are
not zeroed out. The fact that parts of the inode table are
uninitialized is not a problem so long as the block group descriptors,
which contain information regarding how much of the inode table has
been initialized, have not been corrupted. However, if the block group
checksums are not valid, e2fsck must scan the entire inode table, and
the old, uninitialized data could potentially cause e2fsck to
report false problems.
Hence, it is important for the inode tables to be initialized as soon
as possible. This commit adds this feature so that mke2fs can safely
use the lazy inode table initialization feature to speed up formatting
file systems.
This is done via a new kernel thread called ext4lazyinit, which is
created on demand and destroyed when it is no longer needed. There
is only one thread for all ext4 filesystems in the system. When the
first filesystem with the init_itable mount option is mounted, the
ext4lazyinit thread is created, then the filesystem can register its
request in the request list.
This thread then walks through the list of requests picking up
scheduled requests and invoking ext4_init_inode_table(). The next schedule
time for the request is computed by multiplying the time it took to
zero out the last inode table by the wait multiplier, which can be set with
the (init_itable=n) mount option (default is 10). We are doing
this so we do not take the whole I/O bandwidth. When the thread is no
longer necessary (request list is empty) it frees the appropriate
structures and exits (and can be created later by another
filesystem).
We do not disturb regular inode allocations in any way; the allocator does
not care whether the inode table is zeroed or not. But when zeroing, we
have to skip used inodes, obviously. Also we should prevent new inode
allocations from the group while zeroing is under way. For that we
take the write alloc_sem lock in ext4_init_inode_table() and the read
alloc_sem in ext4_claim_inode, so when we are unlucky and the allocator hits
the group which is currently being zeroed, it just has to wait.
This can be suppressed using the mount option no_init_itable.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:05 -04:00
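The scheduling rule in this commit message (wait for the time the last zeroing took, multiplied by the wait multiplier) can be written down as a short sketch. The struct and function names are hypothetical and time is an abstract tick count; only the multiplication and the default of 10 come from the commit message.

```c
#include <stdint.h>

#define DEFAULT_WAIT_MULT_SKETCH 10u /* default given in the commit message */

/* Hypothetical per-filesystem request tracked by the lazy-init thread. */
struct lazyinit_request_sketch {
	uint64_t next_sched;    /* earliest tick for zeroing the next table */
	unsigned int wait_mult; /* value of the init_itable=n mount option */
};

/* After zeroing one inode table between start and end, schedule the next. */
static void reschedule_sketch(struct lazyinit_request_sketch *req,
			      uint64_t start, uint64_t end)
{
	uint64_t elapsed = end - start;

	req->next_sched = end + elapsed * req->wait_mult;
}
```

With the default multiplier the thread is busy for roughly one tick in eleven, which is how it avoids taking the whole I/O bandwidth.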

2008-08-02 21:21:02 -04:00
	bitmap_blk = ext4_inode_bitmap(sb, desc);
2018-03-26 23:54:10 -04:00
	if ((bitmap_blk <= le32_to_cpu(sbi->s_es->s_first_data_block)) ||
	    (bitmap_blk >= ext4_blocks_count(sbi->s_es))) {
		ext4_error(sb, "Invalid inode bitmap blk %llu in "
			   "block_group %u", bitmap_blk, block_group);
2018-05-12 12:15:21 -04:00
		ext4_mark_group_bitmap_corrupted(sb, block_group,
					EXT4_GROUP_INFO_IBITMAP_CORRUPT);
2018-03-26 23:54:10 -04:00
		return ERR_PTR(-EFSCORRUPTED);
	}
2008-08-02 21:21:02 -04:00
	bh = sb_getblk(sb, bitmap_blk);
	if (unlikely(!bh)) {
2018-08-01 12:02:31 -04:00
		ext4_warning(sb, "Cannot read inode bitmap - "
			     "block_group = %u, inode_bitmap = %llu",
			     block_group, bitmap_blk);
2018-05-12 11:35:01 -04:00
		return ERR_PTR(-ENOMEM);
2008-08-02 21:21:02 -04:00
	}
2009-01-05 21:49:55 -05:00
	if (bitmap_uptodate(bh))
2012-04-29 18:33:10 -04:00
		goto verify;
2008-08-02 21:21:02 -04:00

ext4: fix initialization of UNINIT bitmap blocks
This fixes a bug which caused on-line resizing of filesystems with a
1k blocksize to fail. The root cause of this bug was the fact that if
an uninitialized bitmap block gets read in by userspace (which
e2fsprogs does try to avoid, but can happen when the blocksize is less
than the pagesize and an adjacent block is read into memory)
ext4_read_block_bitmap() was erroneously depending on the buffer
uptodate flag to decide whether it needed to initialize the bitmap
block in memory --- i.e., to set the standard set of blocks in use by
a block group (superblock, bitmaps, inode table, etc.). Essentially,
ext4_read_block_bitmap() assumed it was the only routine that might
try to read a block containing a block bitmap, which is simply not
true.
To fix this, ext4_read_block_bitmap() and ext4_read_inode_bitmap()
must always initialize uninitialized bitmap blocks. Once a block or
inode is allocated out of that bitmap, it will be marked as
initialized in the block group descriptor, so in general this won't
result in any extra unnecessary work.
Signed-off-by: Frederic Bohe <frederic.bohe@bull.net>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-10-10 08:09:18 -04:00
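The distinction this fix relies on, between "the buffer holds whatever was on disk" and "the bitmap contents are valid for the allocator", can be modelled with two separate flags. The sketch below is a simplified userspace illustration with hypothetical names: an 8-byte array stands in for the bitmap block, inodes_per_group is assumed to be a multiple of 8, and disk I/O is not modelled.

```c
#include <string.h>

#define BITMAP_BYTES_SKETCH 8 /* tiny stand-in for one bitmap block */

struct bitmap_buf_sketch {
	int buffer_uptodate; /* data was read from disk by someone */
	int bitmap_uptodate; /* contents are valid as an inode bitmap */
	unsigned char data[BITMAP_BYTES_SKETCH];
};

/* Returns 1 if the caller still has to read the block from disk, else 0. */
static int read_bitmap_sketch(struct bitmap_buf_sketch *b, int group_uninit,
			      unsigned int inodes_per_group)
{
	unsigned int used_bytes = inodes_per_group / 8;

	if (b->bitmap_uptodate)
		return 0;
	if (group_uninit) {
		/* Always rebuild, even if stale on-disk data is already cached. */
		memset(b->data, 0, used_bytes);
		memset(b->data + used_bytes, 0xff,
		       BITMAP_BYTES_SKETCH - used_bytes);
		b->bitmap_uptodate = 1;
		b->buffer_uptodate = 1;
		return 0;
	}
	if (b->buffer_uptodate) {
		/* Initialized group: the cached on-disk contents are the bitmap. */
		b->bitmap_uptodate = 1;
		return 0;
	}
	return 1;
}
```

The key point is the second branch: for an uninitialized group the in-memory bitmap is rebuilt regardless of buffer_uptodate, so stale data pulled in by another reader is never trusted.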
	lock_buffer(bh);
2009-01-05 21:49:55 -05:00
	if (bitmap_uptodate(bh)) {
		unlock_buffer(bh);
2012-04-29 18:33:10 -04:00
		goto verify;
2009-01-05 21:49:55 -05:00
	}
2010-10-27 21:30:05 -04:00

2009-05-02 20:35:09 -04:00
	ext4_lock_group(sb, block_group);
2018-06-14 00:58:00 -04:00
	if (ext4_has_group_desc_csum(sb) &&
	    (desc->bg_flags & cpu_to_le16(EXT4_BG_INODE_UNINIT))) {
		if (block_group == 0) {
			ext4_unlock_group(sb, block_group);
			unlock_buffer(bh);
			ext4_error(sb, "Inode bitmap for bg 0 marked "
				   "uninitialized");
			err = -EFSCORRUPTED;
			goto out;
		}
2018-02-19 14:16:47 -05:00
		memset(bh->b_data, 0, (EXT4_INODES_PER_GROUP(sb) + 7) / 8);
		ext4_mark_bitmap_end(EXT4_INODES_PER_GROUP(sb),
				     sb->s_blocksize * 8, bh->b_data);
2009-01-05 21:49:55 -05:00
		set_bitmap_uptodate(bh);
2008-08-02 21:21:02 -04:00
		set_buffer_uptodate(bh);
2012-04-29 18:33:10 -04:00
		set_buffer_verified(bh);
2009-05-02 20:35:09 -04:00
		ext4_unlock_group(sb, block_group);
2009-01-03 22:33:39 -05:00
		unlock_buffer(bh);
2008-08-02 21:21:02 -04:00
		return bh;
2007-10-16 18:38:25 -04:00
	}
2009-05-02 20:35:09 -04:00
	ext4_unlock_group(sb, block_group);
ext4: add support for lazy inode table initialization
When the lazy_itable_init extended option is passed to mke2fs, it
considerably speeds up filesystem creation because inode tables are
not zeroed out. The fact that parts of the inode table are
uninitialized is not a problem so long as the block group descriptors,
which contain information regarding how much of the inode table has
been initialized, has not been corrupted However, if the block group
checksums are not valid, e2fsck must scan the entire inode table, and
the the old, uninitialized data could potentially cause e2fsck to
report false problems.
Hence, it is important for the inode tables to be initialized as soon
as possible. This commit adds this feature so that mke2fs can safely
use the lazy inode table initialization feature to speed up formatting
file systems.
This is done via a new kernel thread called ext4lazyinit, which is
created on demand and destroyed when it is no longer needed. There
is only one thread for all ext4 filesystems in the system. When the
first filesystem with the init_itable mount option is mounted, the
ext4lazyinit thread is created, and the filesystem can then register
its request in the request list.
This thread then walks through the list of requests, picking up
scheduled requests and invoking ext4_init_inode_table(). The next
schedule time for a request is computed by multiplying the time it
took to zero out the last inode table by a wait multiplier, which can
be set with the (init_itable=n) mount option (default is 10). We do
this so that we do not take the whole I/O bandwidth. When the thread
is no longer necessary (the request list is empty) it frees the
appropriate structures and exits (and can be created again later by
another filesystem).
We do not disturb regular inode allocations in any way; they simply do
not care whether the inode table is zeroed or not. But when zeroing, we
have to skip used inodes, obviously. We should also prevent new inode
allocations from the group while zeroing is under way. For that we
take the alloc_sem lock for writing in ext4_init_inode_table() and for
reading in ext4_claim_inode(), so when we are unlucky and the allocator
hits the group which is currently being zeroed, it just has to wait.
This can be suppressed using the mount option no_init_itable.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:05 -04:00
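The throttling rule in this commit message (time spent zeroing, multiplied by the wait multiplier) can be sketched as below. The function name and the unit-less tick arguments are hypothetical, and treating the product as a delay added to the current time is an assumption of this sketch, not something the message spells out.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_DEFAULT_WAIT_MULT 10 /* default stated in the commit message */

/*
 * Return the time (in the same ticks as 'now') at which the next
 * zeroing pass for this request may start. A slow device therefore
 * backs off for longer than a fast one.
 */
uint64_t sketch_next_schedule(uint64_t now, uint64_t zeroing_took,
                              unsigned int wait_mult)
{
    return now + zeroing_took * wait_mult;
}
```

With the default multiplier, a pass that took 5 ticks is followed by an idle period of 50 ticks, so under this model the thread is busy for roughly one eleventh of the time.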
|
|
|
|
2009-01-05 21:49:55 -05:00
|
|
|
if (buffer_uptodate(bh)) {
|
|
|
|
/*
|
|
|
|
* if not uninit if bh is uptodate,
|
|
|
|
* bitmap is also uptodate
|
|
|
|
*/
|
|
|
|
set_bitmap_uptodate(bh);
|
|
|
|
unlock_buffer(bh);
|
2012-04-29 18:33:10 -04:00
|
|
|
goto verify;
|
2009-01-05 21:49:55 -05:00
|
|
|
}
|
|
|
|
/*
|
2012-02-20 17:52:46 -05:00
|
|
|
* submit the buffer_head for reading
|
2009-01-05 21:49:55 -05:00
|
|
|
*/
|
2011-03-21 21:38:05 -04:00
|
|
|
trace_ext4_load_inode_bitmap(sb, block_group);
|
2024-09-06 17:17:46 +08:00
|
|
|
ext4_read_bh(bh, REQ_META | REQ_PRIO,
|
|
|
|
ext4_end_bitmap_read,
|
|
|
|
ext4_simulate_fail(sb, EXT4_SIM_IBITMAP_EIO));
|
2012-02-20 17:52:46 -05:00
|
|
|
if (!buffer_uptodate(bh)) {
|
2008-08-02 21:21:02 -04:00
|
|
|
put_bh(bh);
|
2020-03-28 19:33:43 -04:00
|
|
|
ext4_error_err(sb, EIO, "Cannot read inode bitmap - "
|
|
|
|
"block_group = %u, inode_bitmap = %llu",
|
|
|
|
block_group, bitmap_blk);
|
2018-05-12 12:15:21 -04:00
|
|
|
ext4_mark_group_bitmap_corrupted(sb, block_group,
|
|
|
|
EXT4_GROUP_INFO_IBITMAP_CORRUPT);
|
2015-10-17 21:33:24 -04:00
|
|
|
return ERR_PTR(-EIO);
|
2008-08-02 21:21:02 -04:00
|
|
|
}
|
2012-04-29 18:33:10 -04:00
|
|
|
|
|
|
|
verify:
|
2015-10-17 21:33:24 -04:00
|
|
|
err = ext4_validate_inode_bitmap(sb, desc, block_group, bh);
|
|
|
|
if (err)
|
|
|
|
goto out;
|
2006-10-11 01:20:50 -07:00
|
|
|
return bh;
|
2015-10-17 21:33:24 -04:00
|
|
|
out:
|
|
|
|
put_bh(bh);
|
|
|
|
return ERR_PTR(err);
|
2006-10-11 01:20:50 -07:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* NOTE! When we get the inode, we're the only people
|
|
|
|
* that have access to it, and as such there are no
|
|
|
|
* race conditions we have to worry about. The inode
|
|
|
|
* is not on the hash-lists, and it cannot be reached
|
|
|
|
* through the filesystem because the directory entry
|
|
|
|
* has been deleted earlier.
|
|
|
|
*
|
|
|
|
* HOWEVER: we must make sure that we get no aliases,
|
|
|
|
* which means that we have to call "clear_inode()"
|
|
|
|
* _before_ we mark the inode not in use in the inode
|
|
|
|
* bitmaps. Otherwise a newly created file might use
|
|
|
|
* the same inode number (not actually the same pointer
|
|
|
|
* though), and then we'd have two inodes sharing the
|
|
|
|
* same inode number and space on the harddisk.
|
|
|
|
*/
|
2008-09-08 22:25:24 -04:00
|
|
|
void ext4_free_inode(handle_t *handle, struct inode *inode)
|
2006-10-11 01:20:50 -07:00
|
|
|
{
|
2008-09-08 22:25:24 -04:00
|
|
|
struct super_block *sb = inode->i_sb;
|
2006-10-11 01:20:50 -07:00
|
|
|
int is_directory;
|
|
|
|
unsigned long ino;
|
|
|
|
struct buffer_head *bitmap_bh = NULL;
|
|
|
|
struct buffer_head *bh2;
|
2008-01-28 23:58:27 -05:00
|
|
|
ext4_group_t block_group;
|
2006-10-11 01:20:50 -07:00
|
|
|
unsigned long bit;
|
2008-09-08 22:25:24 -04:00
|
|
|
struct ext4_group_desc *gdp;
|
|
|
|
struct ext4_super_block *es;
|
2006-10-11 01:20:53 -07:00
|
|
|
struct ext4_sb_info *sbi;
|
2009-03-04 18:38:18 -05:00
|
|
|
int fatal = 0, err, count, cleared;
|
2013-08-28 18:32:58 -04:00
|
|
|
struct ext4_group_info *grp;
|
2006-10-11 01:20:50 -07:00
|
|
|
|
2012-03-19 23:41:49 -04:00
|
|
|
if (!sb) {
|
|
|
|
printk(KERN_ERR "EXT4-fs: %s:%d: inode on "
|
|
|
|
"nonexistent device\n", __func__, __LINE__);
|
2006-10-11 01:20:50 -07:00
|
|
|
return;
|
|
|
|
}
|
2012-03-19 23:41:49 -04:00
|
|
|
if (atomic_read(&inode->i_count) > 1) {
|
|
|
|
ext4_msg(sb, KERN_ERR, "%s:%d: inode #%lu: count=%d",
|
|
|
|
__func__, __LINE__, inode->i_ino,
|
|
|
|
atomic_read(&inode->i_count));
|
2006-10-11 01:20:50 -07:00
|
|
|
return;
|
|
|
|
}
|
2012-03-19 23:41:49 -04:00
|
|
|
if (inode->i_nlink) {
|
|
|
|
ext4_msg(sb, KERN_ERR, "%s:%d: inode #%lu: nlink=%d\n",
|
|
|
|
__func__, __LINE__, inode->i_ino, inode->i_nlink);
|
2006-10-11 01:20:50 -07:00
|
|
|
return;
|
|
|
|
}
|
2006-10-11 01:20:53 -07:00
|
|
|
sbi = EXT4_SB(sb);
|
2006-10-11 01:20:50 -07:00
|
|
|
|
|
|
|
ino = inode->i_ino;
|
2008-09-08 22:25:24 -04:00
|
|
|
ext4_debug("freeing inode %lu\n", ino);
|
2009-06-17 11:48:11 -04:00
|
|
|
trace_ext4_free_inode(inode);
|
2006-10-11 01:20:50 -07:00
|
|
|
|
2010-03-03 09:05:07 -05:00
|
|
|
dquot_initialize(inode);
|
2010-03-03 09:05:01 -05:00
|
|
|
dquot_free_inode(inode);
|
2006-10-11 01:20:50 -07:00
|
|
|
|
|
|
|
is_directory = S_ISDIR(inode->i_mode);
|
|
|
|
|
|
|
|
/* Do this BEFORE marking the inode not in use or returning an error */
|
2010-06-07 13:16:22 -04:00
|
|
|
ext4_clear_inode(inode);
|
2006-10-11 01:20:50 -07:00
|
|
|
|
2018-01-11 13:17:49 -05:00
|
|
|
es = sbi->s_es;
|
2006-10-11 01:20:53 -07:00
|
|
|
if (ino < EXT4_FIRST_INO(sb) || ino > le32_to_cpu(es->s_inodes_count)) {
|
2010-02-15 14:19:27 -05:00
|
|
|
ext4_error(sb, "reserved or nonexistent inode %lu", ino);
|
2006-10-11 01:20:50 -07:00
|
|
|
goto error_return;
|
|
|
|
}
|
2006-10-11 01:20:53 -07:00
|
|
|
block_group = (ino - 1) / EXT4_INODES_PER_GROUP(sb);
|
|
|
|
bit = (ino - 1) % EXT4_INODES_PER_GROUP(sb);
|
2008-08-02 21:21:02 -04:00
|
|
|
bitmap_bh = ext4_read_inode_bitmap(sb, block_group);
|
2013-08-28 18:32:58 -04:00
|
|
|
/* Don't bother if the inode bitmap is corrupt. */
|
2015-10-17 21:33:24 -04:00
|
|
|
if (IS_ERR(bitmap_bh)) {
|
|
|
|
fatal = PTR_ERR(bitmap_bh);
|
|
|
|
bitmap_bh = NULL;
|
|
|
|
goto error_return;
|
|
|
|
}
|
2020-10-15 13:37:59 -07:00
|
|
|
if (!(sbi->s_mount_state & EXT4_FC_REPLAY)) {
|
|
|
|
grp = ext4_get_group_info(sb, block_group);
|
ext4: allow ext4_get_group_info() to fail
Previously, ext4_get_group_info() would treat an invalid group number
as BUG(), since in theory it should never happen. However, if a
malicious attacker (or fuzzer) modifies the superblock via the block
device while the file system is mounted, it is possible for
s_first_data_block to get set to a very large number. In that case,
calculating the block group of some block number (such as the
starting block of a preallocation region) could result in an
underflow and a very large block group number. Then the BUG_ON check in
ext4_get_group_info() would fire, resulting in a denial of service
attack that can be triggered by root or someone with write access to
the block device.
From a quality of implementation perspective, it's best that even if
the system administrator does something that they shouldn't, it
will not trigger a BUG. So instead of BUG'ing, ext4_get_group_info()
will call ext4_error and return NULL. We also add fallback code in
all of the callers of ext4_get_group_info() to handle a NULL return.
Also, since ext4_get_group_info() was already borderline to be an
inline function, un-inline it. This results in a net reduction of the
compiled text size of ext4 by roughly 2k.
Cc: stable@kernel.org
Link: https://lore.kernel.org/r/20230430154311.579720-2-tytso@mit.edu
Reported-by: syzbot+e2efa3efc15a1c9e95c3@syzkaller.appspotmail.com
Link: https://syzkaller.appspot.com/bug?id=69b28112e098b070f639efb356393af3ffec4220
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2023-04-29 00:06:28 -04:00
|
|
|
if (!grp || unlikely(EXT4_MB_GRP_IBITMAP_CORRUPT(grp))) {
|
2020-10-15 13:37:59 -07:00
|
|
|
fatal = -EFSCORRUPTED;
|
|
|
|
goto error_return;
|
|
|
|
}
|
2015-10-17 21:33:24 -04:00
|
|
|
}
|
2006-10-11 01:20:50 -07:00
|
|
|
|
|
|
|
BUFFER_TRACE(bitmap_bh, "get_write_access");
|
2021-08-16 11:57:04 +02:00
|
|
|
fatal = ext4_journal_get_write_access(handle, sb, bitmap_bh,
|
|
|
|
EXT4_JTR_NONE);
|
2006-10-11 01:20:50 -07:00
|
|
|
if (fatal)
|
|
|
|
goto error_return;
|
|
|
|
|
2010-05-16 07:00:00 -04:00
|
|
|
fatal = -ESRCH;
|
|
|
|
gdp = ext4_get_group_desc(sb, block_group, &bh2);
|
|
|
|
if (gdp) {
|
2006-10-11 01:20:50 -07:00
|
|
|
BUFFER_TRACE(bh2, "get_write_access");
|
2021-08-16 11:57:04 +02:00
|
|
|
fatal = ext4_journal_get_write_access(handle, sb, bh2,
|
|
|
|
EXT4_JTR_NONE);
|
2010-05-16 07:00:00 -04:00
|
|
|
}
|
|
|
|
ext4_lock_group(sb, block_group);
|
ext4: use proper little-endian bitops
ext4_{set,clear}_bit() is defined as __test_and_{set,clear}_bit_le() for
ext4. Only two ext4_{set,clear}_bit() calls check the return value. The
rest of calls ignore the return value and they can be replaced with
__{set,clear}_bit_le().
This changes ext4_{set,clear}_bit() from __test_and_{set,clear}_bit_le()
to __{set,clear}_bit_le() and introduces ext4_test_and_{set,clear}_bit()
for the two places where old bit needs to be returned.
This ext4_{set,clear}_bit() change is considered safe, because if someone
uses these macros without noticing the change, the new
ext4_{set,clear}_bit() have no return value and cause compiler errors
wherever the return value is used.
This also removes unused ext4_find_first_zero_bit().
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-12-28 20:32:07 -05:00
|
|
|
cleared = ext4_test_and_clear_bit(bit, bitmap_bh->b_data);
|
2010-05-16 07:00:00 -04:00
|
|
|
if (fatal || !cleared) {
|
|
|
|
ext4_unlock_group(sb, block_group);
|
|
|
|
goto out;
|
|
|
|
}
|
2009-03-04 19:31:53 -05:00
|
|
|
|
2010-05-16 07:00:00 -04:00
|
|
|
count = ext4_free_inodes_count(sb, gdp) + 1;
|
|
|
|
ext4_free_inodes_set(sb, gdp, count);
|
|
|
|
if (is_directory) {
|
|
|
|
count = ext4_used_dirs_count(sb, gdp) - 1;
|
|
|
|
ext4_used_dirs_set(sb, gdp, count);
|
2021-04-29 16:13:44 +05:30
|
|
|
if (percpu_counter_initialized(&sbi->s_dirs_counter))
|
|
|
|
percpu_counter_dec(&sbi->s_dirs_counter);
|
2006-10-11 01:20:50 -07:00
|
|
|
}
|
2024-08-20 21:22:32 +08:00
|
|
|
ext4_inode_bitmap_csum_set(sb, gdp, bitmap_bh);
|
2012-04-29 18:45:10 -04:00
|
|
|
ext4_group_desc_csum_set(sb, block_group, gdp);
|
2010-05-16 07:00:00 -04:00
|
|
|
ext4_unlock_group(sb, block_group);
|
2006-10-11 01:20:50 -07:00
|
|
|
|
2021-04-29 16:13:44 +05:30
|
|
|
if (percpu_counter_initialized(&sbi->s_freeinodes_counter))
|
|
|
|
percpu_counter_inc(&sbi->s_freeinodes_counter);
|
2010-05-16 07:00:00 -04:00
|
|
|
if (sbi->s_log_groups_per_flex) {
|
2020-02-18 19:08:51 -08:00
|
|
|
struct flex_groups *fg;
|
2009-03-04 19:09:10 -05:00
|
|
|
|
2020-02-18 19:08:51 -08:00
|
|
|
fg = sbi_array_rcu_deref(sbi, s_flex_groups,
|
|
|
|
ext4_flex_group(sbi, block_group));
|
|
|
|
atomic_inc(&fg->free_inodes);
|
2010-05-16 07:00:00 -04:00
|
|
|
if (is_directory)
|
2020-02-18 19:08:51 -08:00
|
|
|
atomic_dec(&fg->used_dirs);
|
2006-10-11 01:20:50 -07:00
|
|
|
}
|
2010-05-16 07:00:00 -04:00
|
|
|
BUFFER_TRACE(bh2, "call ext4_handle_dirty_metadata");
|
|
|
|
fatal = ext4_handle_dirty_metadata(handle, NULL, bh2);
|
|
|
|
out:
|
|
|
|
if (cleared) {
|
|
|
|
BUFFER_TRACE(bitmap_bh, "call ext4_handle_dirty_metadata");
|
|
|
|
err = ext4_handle_dirty_metadata(handle, NULL, bitmap_bh);
|
|
|
|
if (!fatal)
|
|
|
|
fatal = err;
|
2013-08-28 18:32:58 -04:00
|
|
|
} else {
|
2010-05-16 07:00:00 -04:00
|
|
|
ext4_error(sb, "bit already cleared for inode %lu", ino);
|
2018-05-12 11:39:40 -04:00
|
|
|
ext4_mark_group_bitmap_corrupted(sb, block_group,
|
|
|
|
EXT4_GROUP_INFO_IBITMAP_CORRUPT);
|
2013-08-28 18:32:58 -04:00
|
|
|
}
|
2010-05-16 07:00:00 -04:00
|
|
|
|
2006-10-11 01:20:50 -07:00
|
|
|
error_return:
|
|
|
|
brelse(bitmap_bh);
|
2006-10-11 01:20:53 -07:00
|
|
|
ext4_std_error(sb, fatal);
|
2006-10-11 01:20:50 -07:00
|
|
|
}
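The inode-number arithmetic and the test-and-clear step used by ext4_free_inode() above can be sketched in isolation as follows. The type and function names are hypothetical; the bit helper is a plain, non-atomic stand-in for a little-endian test-and-clear operation, not the kernel's implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct sketch_inode_pos {
    uint32_t group; /* block group holding the inode */
    uint32_t bit;   /* index of the inode within that group's bitmap */
};

/* Inode numbers start at 1, so callers must reject ino == 0 first. */
struct sketch_inode_pos sketch_locate_inode(uint32_t ino,
                                            uint32_t inodes_per_group)
{
    struct sketch_inode_pos pos;

    pos.group = (ino - 1) / inodes_per_group;
    pos.bit = (ino - 1) % inodes_per_group;
    return pos;
}

/* Clear 'bit' (little-endian bit order) and report its previous value. */
bool sketch_test_and_clear_bit(uint32_t bit, uint8_t *bitmap)
{
    uint8_t mask = (uint8_t)(1u << (bit & 7));
    bool was_set = (bitmap[bit >> 3] & mask) != 0;

    bitmap[bit >> 3] &= (uint8_t)~mask;
    return was_set;
}
```

A false return from the second helper corresponds to the "bit already cleared" case, which the function above reports as an error and uses to mark the group's inode bitmap as corrupt.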
|
|
|
|
|
2009-03-12 12:18:34 -04:00
|
|
|
struct orlov_stats {
|
2013-03-11 23:39:59 -04:00
|
|
|
__u64 free_clusters;
|
2009-03-12 12:18:34 -04:00
|
|
|
__u32 free_inodes;
|
|
|
|
__u32 used_dirs;
|
|
|
|
};
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Helper function for Orlov's allocator; returns critical information
|
|
|
|
* for a particular block group or flex_bg. If flex_size is 1, then g
|
|
|
|
* is a block group number; otherwise it is flex_bg number.
|
|
|
|
*/
|
2010-10-27 21:30:14 -04:00
|
|
|
static void get_orlov_stats(struct super_block *sb, ext4_group_t g,
|
|
|
|
int flex_size, struct orlov_stats *stats)
|
2009-03-12 12:18:34 -04:00
|
|
|
{
|
|
|
|
struct ext4_group_desc *desc;
|
|
|
|
|
2009-03-04 19:31:53 -05:00
|
|
|
if (flex_size > 1) {
|
2020-02-18 19:08:51 -08:00
|
|
|
struct flex_groups *fg = sbi_array_rcu_deref(EXT4_SB(sb),
|
|
|
|
s_flex_groups, g);
|
|
|
|
stats->free_inodes = atomic_read(&fg->free_inodes);
|
|
|
|
stats->free_clusters = atomic64_read(&fg->free_clusters);
|
|
|
|
stats->used_dirs = atomic_read(&fg->used_dirs);
|
2009-03-04 19:31:53 -05:00
|
|
|
return;
|
|
|
|
}
|
2009-03-12 12:18:34 -04:00
|
|
|
|
2009-03-04 19:31:53 -05:00
|
|
|
desc = ext4_get_group_desc(sb, g, NULL);
|
|
|
|
if (desc) {
|
|
|
|
stats->free_inodes = ext4_free_inodes_count(sb, desc);
|
2011-09-09 19:08:51 -04:00
|
|
|
stats->free_clusters = ext4_free_group_clusters(sb, desc);
|
2009-03-04 19:31:53 -05:00
|
|
|
stats->used_dirs = ext4_used_dirs_count(sb, desc);
|
|
|
|
} else {
|
|
|
|
stats->free_inodes = 0;
|
2011-09-09 18:58:51 -04:00
|
|
|
stats->free_clusters = 0;
|
2009-03-04 19:31:53 -05:00
|
|
|
stats->used_dirs = 0;
|
2009-03-12 12:18:34 -04:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2006-10-11 01:20:50 -07:00
|
|
|
/*
|
|
|
|
* Orlov's allocator for directories.
|
|
|
|
*
|
|
|
|
* We always try to spread first-level directories.
|
|
|
|
*
|
2021-05-25 15:36:56 +08:00
|
|
|
* If there are blockgroups with both free inodes and free clusters counts
|
2006-10-11 01:20:50 -07:00
|
|
|
* not worse than average we return one with smallest directory count.
|
|
|
|
* Otherwise we simply return a random group.
|
|
|
|
*
|
|
|
|
* For the rest, the rules are as follows:
|
|
|
|
*
|
|
|
|
* It's OK to put directory into a group unless
|
|
|
|
* it has too many directories already (max_dirs) or
|
|
|
|
* it has too few free inodes left (min_inodes) or
|
2021-05-25 15:36:56 +08:00
|
|
|
* it has too few free clusters left (min_clusters) or
|
2008-04-21 22:45:55 +00:00
|
|
|
* Parent's group is preferred, if it doesn't satisfy these
|
2006-10-11 01:20:50 -07:00
|
|
|
* conditions we search cyclically through the rest. If none
|
|
|
|
* of the groups look good we just look for a group with more
|
|
|
|
* free inodes than average (starting at parent's group).
|
|
|
|
*/
|
|
|
|
|
2008-01-28 23:58:27 -05:00
|
|
|
static int find_group_orlov(struct super_block *sb, struct inode *parent,
|
2011-07-26 02:48:06 -04:00
|
|
|
ext4_group_t *group, umode_t mode,
|
2009-06-13 11:09:42 -04:00
|
|
|
const struct qstr *qstr)
|
2006-10-11 01:20:50 -07:00
|
|
|
{
|
2008-01-28 23:58:27 -05:00
|
|
|
ext4_group_t parent_group = EXT4_I(parent)->i_block_group;
|
2006-10-11 01:20:53 -07:00
|
|
|
struct ext4_sb_info *sbi = EXT4_SB(sb);
|
2009-05-01 08:50:38 -04:00
|
|
|
ext4_group_t real_ngroups = ext4_get_groups_count(sb);
|
2006-10-11 01:20:53 -07:00
|
|
|
int inodes_per_group = EXT4_INODES_PER_GROUP(sb);
|
2011-12-28 20:25:13 -05:00
|
|
|
unsigned int freei, avefreei, grp_free;
|
2021-05-25 15:36:56 +08:00
|
|
|
ext4_fsblk_t freec, avefreec;
|
2006-10-11 01:20:50 -07:00
|
|
|
unsigned int ndirs;
|
2009-03-12 12:18:34 -04:00
|
|
|
int max_dirs, min_inodes;
|
2011-09-09 18:58:51 -04:00
|
|
|
ext4_grpblk_t min_clusters;
|
2009-05-01 08:50:38 -04:00
|
|
|
ext4_group_t i, grp, g, ngroups;
|
2006-10-11 01:20:53 -07:00
|
|
|
struct ext4_group_desc *desc;
|
2009-03-12 12:18:34 -04:00
|
|
|
struct orlov_stats stats;
|
|
|
|
int flex_size = ext4_flex_bg_size(sbi);
|
2009-06-13 11:09:42 -04:00
|
|
|
struct dx_hash_info hinfo;
|
2009-03-12 12:18:34 -04:00
|
|
|
|
2009-05-01 08:50:38 -04:00
|
|
|
ngroups = real_ngroups;
|
2009-03-12 12:18:34 -04:00
|
|
|
if (flex_size > 1) {
|
2009-05-01 08:50:38 -04:00
|
|
|
ngroups = (real_ngroups + flex_size - 1) >>
|
2009-03-12 12:18:34 -04:00
|
|
|
sbi->s_log_groups_per_flex;
|
|
|
|
parent_group >>= sbi->s_log_groups_per_flex;
|
|
|
|
}
|
2006-10-11 01:20:50 -07:00
|
|
|
|
|
|
|
freei = percpu_counter_read_positive(&sbi->s_freeinodes_counter);
|
|
|
|
avefreei = freei / ngroups;
|
2021-05-25 15:36:56 +08:00
|
|
|
freec = percpu_counter_read_positive(&sbi->s_freeclusters_counter);
|
|
|
|
avefreec = freec;
|
2011-09-09 18:58:51 -04:00
|
|
|
do_div(avefreec, ngroups);
|
2006-10-11 01:20:50 -07:00
|
|
|
ndirs = percpu_counter_read_positive(&sbi->s_dirs_counter);
|
|
|
|
|
2009-03-12 12:18:34 -04:00
|
|
|
if (S_ISDIR(mode) &&
|
2015-03-17 22:25:59 +00:00
|
|
|
((parent == d_inode(sb->s_root)) ||
|
2010-05-16 22:00:00 -04:00
|
|
|
(ext4_test_inode_flag(parent, EXT4_INODE_TOPDIR)))) {
|
2006-10-11 01:20:50 -07:00
|
|
|
int best_ndir = inodes_per_group;
|
2008-01-28 23:58:27 -05:00
|
|
|
int ret = -1;
|
2006-10-11 01:20:50 -07:00
|
|
|
|
2009-06-13 11:09:42 -04:00
|
|
|
if (qstr) {
|
|
|
|
hinfo.hash_version = DX_HASH_HALF_MD4;
|
|
|
|
hinfo.seed = sbi->s_hash_seed;
|
ext4: Support case-insensitive file name lookups
This patch implements the actual support for case-insensitive file name
lookups in ext4, based on the feature bit and the encoding stored in the
superblock.
A filesystem that has the casefold feature set is able to configure
directories with the +F (EXT4_CASEFOLD_FL) attribute, enabling lookups
to succeed in that directory in a case-insensitive fashion, i.e: match
a directory entry even if the name used by userspace is not a byte per
byte match with the disk name, but is an equivalent case-insensitive
version of the Unicode string. This operation is called a
case-insensitive file name lookup.
The feature is configured as an inode attribute applied to directories
and inherited by its children. This attribute can only be enabled on
empty directories for filesystems that support the encoding feature,
thus preventing collision of file names that only differ by case.
* dcache handling:
For a +F directory, ext4 only stores the dentry for the first
equivalent name used in the dcache. This is done to prevent unintentional
duplication of dentries in the dcache, while also allowing the VFS code
to quickly find the right entry in the cache regardless of which
equivalent string was used in a previous lookup, without having to
resort to ->lookup().
d_hash() of casefolded directories is implemented as the hash of the
casefolded string, such that we always have a well-known bucket for all
the equivalencies of the same string. d_compare() uses the
utf8_strncasecmp() infrastructure, which handles the comparison of
equivalent, same case, names as well.
For now, negative lookups are not inserted in the dcache, since they
would need to be invalidated anyway, because we can't trust missing file
dentries. This is bad for performance but requires some leveraging of
the vfs layer to fix. We can live without that for now, and so does
everyone else.
* on-disk data:
Despite using a specific version of the name as the internal
representation within the dcache, the name stored and fetched from the
disk is a byte-per-byte match with what the user requested, making this
implementation 'name-preserving'. i.e. no actual information is lost
when writing to storage.
DX is supported by modifying the hashes used in +F directories to make
them case/encoding-aware. The new disk hashes are calculated as the
hash of the full casefolded string, instead of the string directly.
This allows us to efficiently search for file names in the htree without
requiring the user to provide an exact name.
* Dealing with invalid sequences:
By default, when an invalid UTF-8 sequence is identified, ext4 will treat
it as an opaque byte sequence, ignoring the encoding and reverting to
the old behavior for that file only. This means that case-insensitive
file name lookup will not work for that particular file. An optional bit can
be set in the superblock telling the filesystem code and userspace tools
to enforce the encoding. When that optional bit is set, any attempt to
create a file name using an invalid UTF-8 sequence will fail and return
an error to userspace.
* Normalization algorithm:
The UTF-8 algorithm used to compare strings in ext4
lives in fs/unicode, and is based on a previous version developed by
SGI. It implements the Canonical decomposition (NFD) algorithm
described by the Unicode specification 12.1, or higher, combined with
the elimination of ignorable code points (NFDi) and full
case-folding (CF) as documented in fs/unicode/utf8_norm.c.
NFD seems to be the best normalization method for EXT4 because:
- It has a lower cost than NFC/NFKC (which requires
decomposing to NFD as an intermediary step)
- It doesn't eliminate important semantic meaning like
compatibility decompositions.
Although:
- This implementation is not completely linguistically accurate, because
different languages have conflicting rules, which would require the
specialization of the filesystem to a given locale, which brings all
sorts of problems for removable media and for users who use more than
one language.
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.co.uk>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-04-25 14:12:08 -04:00
|
|
|
ext4fs_dirhash(parent, qstr->name, qstr->len, &hinfo);
|
2022-10-05 16:43:38 +02:00
|
|
|
parent_group = hinfo.hash % ngroups;
|
2009-06-13 11:09:42 -04:00
|
|
|
} else
|
2022-10-09 20:44:02 -06:00
|
|
|
parent_group = get_random_u32_below(ngroups);
|
2006-10-11 01:20:50 -07:00
|
|
|
for (i = 0; i < ngroups; i++) {
|
2009-03-12 12:18:34 -04:00
|
|
|
g = (parent_group + i) % ngroups;
|
|
|
|
get_orlov_stats(sb, g, flex_size, &stats);
|
|
|
|
if (!stats.free_inodes)
|
2006-10-11 01:20:50 -07:00
|
|
|
continue;
|
2009-03-12 12:18:34 -04:00
|
|
|
if (stats.used_dirs >= best_ndir)
|
2006-10-11 01:20:50 -07:00
|
|
|
continue;
|
2009-03-12 12:18:34 -04:00
|
|
|
if (stats.free_inodes < avefreei)
|
2006-10-11 01:20:50 -07:00
|
|
|
continue;
|
2011-09-09 18:58:51 -04:00
|
|
|
if (stats.free_clusters < avefreec)
|
2006-10-11 01:20:50 -07:00
|
|
|
continue;
|
2009-03-12 12:18:34 -04:00
|
|
|
grp = g;
|
2008-01-28 23:58:27 -05:00
|
|
|
ret = 0;
|
2009-03-12 12:18:34 -04:00
|
|
|
best_ndir = stats.used_dirs;
|
|
|
|
}
|
|
|
|
if (ret)
|
|
|
|
goto fallback;
|
|
|
|
found_flex_bg:
|
|
|
|
if (flex_size == 1) {
|
|
|
|
*group = grp;
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We pack inodes at the beginning of the flexgroup's
|
|
|
|
* inode tables. Block allocation decisions will do
|
|
|
|
* something similar, although regular files will
|
|
|
|
* start at 2nd block group of the flexgroup. See
|
|
|
|
* ext4_ext_find_goal() and ext4_find_near().
|
|
|
|
*/
|
|
|
|
grp *= flex_size;
|
|
|
|
for (i = 0; i < flex_size; i++) {
|
2009-05-01 08:50:38 -04:00
|
|
|
if (grp+i >= real_ngroups)
|
2009-03-12 12:18:34 -04:00
|
|
|
break;
|
|
|
|
desc = ext4_get_group_desc(sb, grp+i, NULL);
|
|
|
|
if (desc && ext4_free_inodes_count(sb, desc)) {
|
|
|
|
*group = grp+i;
|
|
|
|
return 0;
|
|
|
|
}
|
2006-10-11 01:20:50 -07:00
|
|
|
}
|
|
|
|
goto fallback;
|
|
|
|
}
|
|
|
|
|
2022-09-08 11:21:26 +02:00
|
|
|
max_dirs = ndirs / ngroups + inodes_per_group*flex_size / 16;
|
2009-03-12 12:18:34 -04:00
|
|
|
min_inodes = avefreei - inodes_per_group*flex_size / 4;
|
|
|
|
if (min_inodes < 1)
|
|
|
|
min_inodes = 1;
|
2011-09-09 18:58:51 -04:00
|
|
|
min_clusters = avefreec - EXT4_CLUSTERS_PER_GROUP(sb)*flex_size / 4;
|
2024-08-20 21:22:30 +08:00
|
|
|
if (min_clusters < 0)
|
|
|
|
min_clusters = 0;
|
2009-03-12 12:18:34 -04:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Start looking in the flex group where we last allocated an
|
|
|
|
* inode for this parent directory
|
|
|
|
*/
|
|
|
|
if (EXT4_I(parent)->i_last_alloc_group != ~0) {
|
|
|
|
parent_group = EXT4_I(parent)->i_last_alloc_group;
|
|
|
|
if (flex_size > 1)
|
|
|
|
parent_group >>= sbi->s_log_groups_per_flex;
|
|
|
|
}
|
2006-10-11 01:20:50 -07:00
|
|
|
|
|
|
|
for (i = 0; i < ngroups; i++) {
|
2009-03-12 12:18:34 -04:00
|
|
|
grp = (parent_group + i) % ngroups;
|
|
|
|
get_orlov_stats(sb, grp, flex_size, &stats);
|
|
|
|
if (stats.used_dirs >= max_dirs)
|
2006-10-11 01:20:50 -07:00
|
|
|
continue;
|
2009-03-12 12:18:34 -04:00
|
|
|
if (stats.free_inodes < min_inodes)
|
2006-10-11 01:20:50 -07:00
|
|
|
continue;
|
2011-09-09 18:58:51 -04:00
|
|
|
if (stats.free_clusters < min_clusters)
|
2006-10-11 01:20:50 -07:00
|
|
|
continue;
|
2009-03-12 12:18:34 -04:00
|
|
|
goto found_flex_bg;
|
2006-10-11 01:20:50 -07:00
|
|
|
}
|
|
|
|
|
|
|
|
fallback:
|
2009-05-01 08:50:38 -04:00
|
|
|
ngroups = real_ngroups;
|
2009-03-12 12:18:34 -04:00
|
|
|
avefreei = freei / ngroups;
|
2009-04-22 21:00:36 -04:00
|
|
|
fallback_retry:
|
2009-03-12 12:18:34 -04:00
|
|
|
parent_group = EXT4_I(parent)->i_block_group;
|
2006-10-11 01:20:50 -07:00
|
|
|
for (i = 0; i < ngroups; i++) {
|
2009-03-12 12:18:34 -04:00
|
|
|
grp = (parent_group + i) % ngroups;
|
|
|
|
desc = ext4_get_group_desc(sb, grp, NULL);
|
2012-05-28 14:16:57 -04:00
|
|
|
if (desc) {
|
|
|
|
grp_free = ext4_free_inodes_count(sb, desc);
|
|
|
|
if (grp_free && grp_free >= avefreei) {
|
|
|
|
*group = grp;
|
|
|
|
return 0;
|
|
|
|
}
|
2009-03-12 12:18:34 -04:00
|
|
|
}
|
2006-10-11 01:20:50 -07:00
|
|
|
}
|
|
|
|
|
|
|
|
if (avefreei) {
|
|
|
|
/*
|
|
|
|
* The free-inodes counter is approximate, and for really small
|
|
|
|
* filesystems the above test can fail to find any blockgroups
|
|
|
|
*/
|
|
|
|
avefreei = 0;
|
2009-04-22 21:00:36 -04:00
|
|
|
goto fallback_retry;
|
2006-10-11 01:20:50 -07:00
|
|
|
}
|
|
|
|
|
|
|
|
return -1;
|
|
|
|
}
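The acceptance thresholds that find_group_orlov() computes for non-top-level directories can be restated as a small standalone sketch. The struct and function names are hypothetical, and signed 64-bit integers are used throughout to keep the clamping obvious; the kernel code above mixes several narrower types.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct sketch_stats {
    int64_t free_clusters;
    int64_t free_inodes;
    int64_t used_dirs;
};

struct sketch_limits {
    int64_t max_dirs;
    int64_t min_inodes;
    int64_t min_clusters;
};

/* Derive the three thresholds from filesystem-wide averages. */
struct sketch_limits sketch_orlov_limits(int64_t ndirs, int64_t ngroups,
                                         int64_t avefreei, int64_t avefreec,
                                         int64_t inodes_per_group,
                                         int64_t clusters_per_group,
                                         int64_t flex_size)
{
    struct sketch_limits lim;

    lim.max_dirs = ndirs / ngroups + inodes_per_group * flex_size / 16;
    lim.min_inodes = avefreei - inodes_per_group * flex_size / 4;
    if (lim.min_inodes < 1)
        lim.min_inodes = 1;
    lim.min_clusters = avefreec - clusters_per_group * flex_size / 4;
    if (lim.min_clusters < 0)
        lim.min_clusters = 0;
    return lim;
}

/* A group is acceptable unless it trips one of the three limits. */
bool sketch_group_ok(const struct sketch_stats *s,
                     const struct sketch_limits *lim)
{
    return s->used_dirs < lim->max_dirs &&
           s->free_inodes >= lim->min_inodes &&
           s->free_clusters >= lim->min_clusters;
}
```

Clamping min_inodes to 1 rather than 0 means a group with no free inodes is never accepted, even on a nearly full filesystem.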
|
|
|
|
|
2008-01-28 23:58:27 -05:00
|
|
|
static int find_group_other(struct super_block *sb, struct inode *parent,
|
2011-07-26 02:48:06 -04:00
|
|
|
ext4_group_t *group, umode_t mode)
|
2006-10-11 01:20:50 -07:00
|
|
|
{
|
2008-01-28 23:58:27 -05:00
|
|
|
ext4_group_t parent_group = EXT4_I(parent)->i_block_group;
|
2009-05-01 08:50:38 -04:00
|
|
|
ext4_group_t i, last, ngroups = ext4_get_groups_count(sb);
|
2006-10-11 01:20:53 -07:00
|
|
|
struct ext4_group_desc *desc;
|
2009-03-12 12:18:34 -04:00
|
|
|
int flex_size = ext4_flex_bg_size(EXT4_SB(sb));
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Try to place the inode in the same flex group as its
|
|
|
|
* parent. If we can't find space, use the Orlov algorithm to
|
|
|
|
* find another flex group, and store that information in the
|
|
|
|
* parent directory's inode information so that we use that flex
|
|
|
|
* group for future allocations.
|
|
|
|
*/
|
|
|
|
if (flex_size > 1) {
|
|
|
|
int retry = 0;
|
|
|
|
|
|
|
|
try_again:
|
|
|
|
parent_group &= ~(flex_size-1);
|
|
|
|
last = parent_group + flex_size;
|
|
|
|
if (last > ngroups)
|
|
|
|
last = ngroups;
|
|
|
|
for (i = parent_group; i < last; i++) {
|
|
|
|
desc = ext4_get_group_desc(sb, i, NULL);
|
|
|
|
if (desc && ext4_free_inodes_count(sb, desc)) {
|
|
|
|
*group = i;
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
if (!retry && EXT4_I(parent)->i_last_alloc_group != ~0) {
|
|
|
|
retry = 1;
|
|
|
|
parent_group = EXT4_I(parent)->i_last_alloc_group;
|
|
|
|
goto try_again;
|
|
|
|
}
|
|
|
|
/*
|
|
|
|
* If this didn't work, use the Orlov search algorithm
|
|
|
|
* to find a new flex group; we pass in the mode to
|
|
|
|
* avoid the topdir algorithms.
|
|
|
|
*/
|
|
|
|
*group = parent_group + flex_size;
|
|
|
|
if (*group > ngroups)
|
|
|
|
*group = 0;
|
2011-02-21 21:01:42 -05:00
|
|
|
return find_group_orlov(sb, parent, group, mode, NULL);
|
2009-03-12 12:18:34 -04:00
|
|
|
}
|
2006-10-11 01:20:50 -07:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Try to place the inode in its parent directory
|
|
|
|
*/
|
2008-01-28 23:58:27 -05:00
|
|
|
*group = parent_group;
|
|
|
|
desc = ext4_get_group_desc(sb, *group, NULL);
|
2009-01-05 22:20:24 -05:00
|
|
|
if (desc && ext4_free_inodes_count(sb, desc) &&
|
2011-09-09 19:08:51 -04:00
|
|
|
ext4_free_group_clusters(sb, desc))
|
2008-01-28 23:58:27 -05:00
|
|
|
return 0;
|
2006-10-11 01:20:50 -07:00
|
|
|
|
|
|
|
/*
|
|
|
|
* We're going to place this inode in a different blockgroup from its
|
|
|
|
* parent. We want to cause files in a common directory to all land in
|
|
|
|
* the same blockgroup. But we want files which are in a different
|
|
|
|
* directory which shares a blockgroup with our parent to land in a
|
|
|
|
* different blockgroup.
|
|
|
|
*
|
|
|
|
* So add our directory's i_ino into the starting point for the hash.
|
|
|
|
*/
|
2008-01-28 23:58:27 -05:00
|
|
|
*group = (*group + parent->i_ino) % ngroups;
|
2006-10-11 01:20:50 -07:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Use a quadratic hash to find a group with a free inode and some free
|
|
|
|
* blocks.
|
|
|
|
*/
|
|
|
|
for (i = 1; i < ngroups; i <<= 1) {
|
2008-01-28 23:58:27 -05:00
|
|
|
*group += i;
|
|
|
|
if (*group >= ngroups)
|
|
|
|
*group -= ngroups;
|
|
|
|
desc = ext4_get_group_desc(sb, *group, NULL);
|
2009-01-05 22:20:24 -05:00
|
|
|
if (desc && ext4_free_inodes_count(sb, desc) &&
|
2011-09-09 19:08:51 -04:00
|
|
|
ext4_free_group_clusters(sb, desc))
|
2008-01-28 23:58:27 -05:00
|
|
|
return 0;
|
2006-10-11 01:20:50 -07:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* That failed: try linear search for a free inode, even if that group
|
|
|
|
* has no free blocks.
|
|
|
|
*/
|
2008-01-28 23:58:27 -05:00
|
|
|
*group = parent_group;
|
2006-10-11 01:20:50 -07:00
|
|
|
for (i = 0; i < ngroups; i++) {
|
2008-01-28 23:58:27 -05:00
|
|
|
if (++*group >= ngroups)
|
|
|
|
*group = 0;
|
|
|
|
desc = ext4_get_group_desc(sb, *group, NULL);
|
2009-01-05 22:20:24 -05:00
|
|
|
if (desc && ext4_free_inodes_count(sb, desc))
|
2008-01-28 23:58:27 -05:00
|
|
|
return 0;
|
2006-10-11 01:20:50 -07:00
|
|
|
}
|
|
|
|
|
|
|
|
return -1;
|
|
|
|
}
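The probe order used by the "quadratic hash" loop in find_group_other() above is easier to see when written out on its own. The sketch below records which groups would be visited and also shows the flex-group alignment mask from the start of the function; the function names and the output-array interface are hypothetical, and it assumes ngroups is non-zero and well below 2^31.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* First group of the flex group containing 'group' (flex_size: power of 2). */
uint32_t sketch_flex_start(uint32_t group, uint32_t flex_size)
{
    return group & ~(flex_size - 1);
}

/*
 * Record the groups probed by the doubling-step search, starting from
 * the parent's group offset by the parent's inode number. Returns the
 * number of entries written to out[].
 */
size_t sketch_probe_sequence(uint32_t parent_group, uint64_t parent_ino,
                             uint32_t ngroups, uint32_t *out, size_t max_out)
{
    uint32_t group = (uint32_t)((parent_group + parent_ino) % ngroups);
    size_t n = 0;

    for (uint32_t i = 1; i < ngroups && n < max_out; i <<= 1) {
        group += i;            /* step sizes 1, 2, 4, 8, ... */
        if (group >= ngroups)
            group -= ngroups;  /* wrap around the end of the group list */
        out[n++] = group;
    }
    return n;
}
```

Because the step doubles each round, only about log2(ngroups) groups are probed, which is why the function above follows this search with a linear scan over every group.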
|
|
|
|
|
ext4: avoid reusing recently deleted inodes in no journal mode
In no journal mode, if an inode has recently been deleted, we
shouldn't reuse it right away. Otherwise it's possible, after an
unclean shutdown, to hit a situation where a recently deleted inode
gets reused for some other purpose before the inode table block has
been written to disk. However, if the directory entry has been
updated, then the directory entry will be pointing at the old inode
contents.
E2fsck will make sure the file system is consistent after the
unclean shutdown. However, if the recently deleted inode is a
character mode device, or an inode with the immutable bit set, even
after the file system has been fixed up by e2fsck, it can be
possible for a *.pyc file to be pointing at a character mode
device, and when python tries to open the *.pyc file, Hilarity
Ensues. We could change all of userspace to be very suspicious
about stat'ing files before opening them, and clearing the
immutable flag if necessary --- or we can just avoid reusing an
inode number if it has been recently deleted.
Google-Bug-Id: 10017573
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-08-16 22:06:55 -04:00
|
|
|
/*
|
|
|
|
* In no journal mode, if an inode has recently been deleted, we want
|
|
|
|
* to avoid reusing it until we're reasonably sure the inode table
|
|
|
|
* block has been written back to disk. (Yes, these values are
|
|
|
|
* somewhat arbitrary...)
|
|
|
|
*/
|
2020-04-13 22:30:52 -04:00
|
|
|
#define RECENTCY_MIN 60
|
2017-08-31 11:09:45 -04:00
|
|
|
#define RECENTCY_DIRTY 300
|
2013-08-16 22:06:55 -04:00
|
|
|
|
|
|
|
static int recently_deleted(struct super_block *sb, ext4_group_t group, int ino)
|
|
|
|
{
|
|
|
|
struct ext4_group_desc *gdp;
|
|
|
|
struct ext4_inode *raw_inode;
|
|
|
|
struct buffer_head *bh;
|
2017-08-31 11:09:45 -04:00
|
|
|
int inodes_per_block = EXT4_SB(sb)->s_inodes_per_block;
|
|
|
|
int offset, ret = 0;
|
|
|
|
int recentcy = RECENTCY_MIN;
|
|
|
|
u32 dtime, now;
|
2013-08-16 22:06:55 -04:00
|
|
|
|
|
|
|
gdp = ext4_get_group_desc(sb, group, NULL);
|
|
|
|
if (unlikely(!gdp))
|
|
|
|
return 0;
|
|
|
|
|
2017-08-24 11:52:21 -04:00
|
|
|
bh = sb_find_get_block(sb, ext4_inode_table(sb, gdp) +
|
2013-08-16 22:06:55 -04:00
|
|
|
(ino / inodes_per_block));
|
2017-08-24 11:52:21 -04:00
|
|
|
if (!bh || !buffer_uptodate(bh))
|
2013-08-16 22:06:55 -04:00
|
|
|
/*
|
|
|
|
* If the block is not in the buffer cache, then it
|
|
|
|
* must have been written out.
|
|
|
|
*/
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
offset = (ino % inodes_per_block) * EXT4_INODE_SIZE(sb);
|
|
|
|
raw_inode = (struct ext4_inode *) (bh->b_data + offset);
|
2017-08-31 11:09:45 -04:00
|
|
|
|
|
|
|
/* i_dtime is only 32 bits on disk, but we only care about relative
|
|
|
|
* times in the range of a few minutes (i.e. long enough to sync a
|
|
|
|
* recently-deleted inode to disk), so using the low 32 bits of the
|
|
|
|
* clock (a 68 year range) is enough, see time_before32() */
|
2013-08-16 22:06:55 -04:00
|
|
|
dtime = le32_to_cpu(raw_inode->i_dtime);
|
2017-08-31 11:09:45 -04:00
|
|
|
now = ktime_get_real_seconds();
|
2013-08-16 22:06:55 -04:00
|
|
|
if (buffer_dirty(bh))
|
|
|
|
recentcy += RECENTCY_DIRTY;
|
|
|
|
|
2017-08-31 11:09:45 -04:00
|
|
|
if (dtime && time_before32(dtime, now) &&
|
|
|
|
time_before32(now, dtime + recentcy))
|
2013-08-16 22:06:55 -04:00
|
|
|
ret = 1;
|
|
|
|
out:
|
|
|
|
brelse(bh);
|
|
|
|
return ret;
|
|
|
|
}
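The wrap-safe time comparison used by recently_deleted() can be sketched in userspace as follows. This is a minimal illustration, not the kernel implementation: before32() is a stand-in for the kernel's time_before32(), and deleted_within() and before32_selfcheck() are hypothetical helpers that only mirror the final test in the function above. It assumes the usual two's-complement conversion from uint32_t to int32_t.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Stand-in for time_before32(): true if a is earlier than b, treating
 * both as 32-bit second counters that may wrap around.
 */
static bool before32(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) < 0;
}

/*
 * Hypothetical helper: true if dtime is set, lies strictly in the past,
 * and now is still inside the window [dtime, dtime + window).
 */
static bool deleted_within(uint32_t dtime, uint32_t now, uint32_t window)
{
	return dtime != 0 && before32(dtime, now) &&
	       before32(now, dtime + window);
}

/* Hypothetical self-check showing the intended behaviour. */
static void before32_selfcheck(void)
{
	assert(deleted_within(1000, 1030, 60));		/* 30s ago: recent */
	assert(!deleted_within(1000, 1100, 60));	/* 100s ago: not recent */
	assert(!deleted_within(0, 30, 60));		/* dtime unset */
	/* dtime just before the 32-bit wrap, now just after it */
	assert(deleted_within(0xFFFFFFF0u, 0x10u, 60));
}
```

Because the subtraction is done in 32 bits and then interpreted as signed, the comparison stays correct across the counter wrap as long as the two times are less than about 68 years apart.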
|
|
|
|
|
ext4: reduce lock contention in __ext4_new_inode
While running a number of file-creating threads concurrently,
we found heavy lock contention on the group spinlock:
FUNC                           TOTAL_TIME(us)  COUNT      AVG(us)
ext4_create                    1707443399      1440000    1185.72
_raw_spin_lock                 1317641501      180899929  7.28
jbd2__journal_start            287821030       1453950    197.96
jbd2_journal_get_write_access  33441470        73077185   0.46
ext4_add_nondir                29435963        1440000    20.44
ext4_add_entry                 26015166        1440049    18.07
ext4_dx_add_entry              25729337        1432814    17.96
ext4_mark_inode_dirty          12302433        5774407    2.13
Most of the CPU time is spent in _raw_spin_lock. Here are some test
numbers with and without the patch.
Test environment:
Server : SuperMicro Server (2 x E5-2690 v3@2.60GHz, 128GB 2133MHz
DDR4 Memory, 8GbFC)
Storage : 2 x RAID1 (DDN SFA7700X, 4 x Toshiba PX02SMU020 200GB
Read Intensive SSD)
format command:
mkfs.ext4 -J size=4096
test command:
mpirun -np 48 mdtest -n 30000 -d /ext4/mdtest.out -F -C \
-r -i 1 -v -p 10 -u #first run to load inode
mpirun -np 48 mdtest -n 30000 -d /ext4/mdtest.out -F -C \
-r -i 3 -v -p 10 -u
Kernel version: 4.13.0-rc3
Test 1,440,000 files with 48 directories by 48 processes:
Without patch:
File Creation   File Removal   (ops per second)
79,033          289,569
81,463          285,359
79,875          288,475
With patch:
File Creation   File Removal   (ops per second)
810,669         301,694
812,805         302,711
813,965         297,670
Creation performance improves by more than 10X with a large
journal size. The main problem is that we test the bitmap,
then do some checks and journal operations that may sleep,
and only then test-and-set the bit with the lock held. This is
racy and lets another process steal the inode in between.
However, after the first try we know the handle has been
started and the inode bitmap has been journaled, so
we can find and set the bit directly with the lock held; this
will mostly guarantee success on the second try.
Tested-by: Shuichi Ihara <sihara@ddn.com>
Signed-off-by: Wang Shilong <wshilong@ddn.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2017-08-24 12:56:35 -04:00
|
|
|
static int find_inode_bit(struct super_block *sb, ext4_group_t group,
|
|
|
|
struct buffer_head *bitmap, unsigned long *ino)
|
|
|
|
{
|
2020-03-18 13:13:17 +01:00
|
|
|
bool check_recently_deleted = EXT4_SB(sb)->s_journal == NULL;
|
|
|
|
unsigned long recently_deleted_ino = EXT4_INODES_PER_GROUP(sb);
|
|
|
|
|
2017-08-24 12:56:35 -04:00
|
|
|
next:
|
|
|
|
*ino = ext4_find_next_zero_bit((unsigned long *)
|
|
|
|
bitmap->b_data,
|
|
|
|
EXT4_INODES_PER_GROUP(sb), *ino);
|
|
|
|
if (*ino >= EXT4_INODES_PER_GROUP(sb))
|
2020-03-18 13:13:17 +01:00
|
|
|
goto not_found;
|
2017-08-24 12:56:35 -04:00
|
|
|
|
2020-03-18 13:13:17 +01:00
|
|
|
if (check_recently_deleted && recently_deleted(sb, group, *ino)) {
|
|
|
|
recently_deleted_ino = *ino;
|
2017-08-24 12:56:35 -04:00
|
|
|
*ino = *ino + 1;
|
|
|
|
if (*ino < EXT4_INODES_PER_GROUP(sb))
|
|
|
|
goto next;
|
2020-03-18 13:13:17 +01:00
|
|
|
goto not_found;
|
2017-08-24 12:56:35 -04:00
|
|
|
}
|
2020-03-18 13:13:17 +01:00
|
|
|
return 1;
|
|
|
|
not_found:
|
|
|
|
if (recently_deleted_ino >= EXT4_INODES_PER_GROUP(sb))
|
|
|
|
return 0;
|
|
|
|
/*
|
|
|
|
* Not reusing recently deleted inodes is mostly a preference. We don't
|
|
|
|
* want to report ENOSPC or skew allocation patterns because of that.
|
|
|
|
* So return even a recently deleted inode if we couldn't find a better one in the
|
|
|
|
* given range.
|
|
|
|
*/
|
|
|
|
*ino = recently_deleted_ino;
|
2017-08-24 12:56:35 -04:00
|
|
|
return 1;
|
|
|
|
}
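The search-with-fallback pattern in find_inode_bit() can be sketched outside the kernel as follows. This is a simplified illustration under stated assumptions, not the ext4 code: the bitmap is replaced by a plain bool array, and find_free_slot(), avoid_fn and avoid_even() are hypothetical names introduced only for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical predicate type: true if slot i should be avoided if possible. */
typedef bool (*avoid_fn)(size_t i);

/* Hypothetical sample predicate: treat even-numbered slots as "recently deleted". */
static bool avoid_even(size_t i)
{
	return i % 2 == 0;
}

/*
 * Scan used[] from *pos for a free slot. Prefer slots the predicate does
 * not flag, but fall back to the last flagged free slot rather than fail.
 * Returns true and sets *pos on success, false if no slot is free.
 */
static bool find_free_slot(const bool *used, size_t n, avoid_fn avoid,
			   size_t *pos)
{
	size_t fallback = n;	/* n means "no fallback recorded" */

	assert(used && pos);
	for (size_t i = *pos; i < n; i++) {
		if (used[i])
			continue;
		if (avoid && avoid(i)) {
			fallback = i;	/* remember it, keep looking */
			continue;
		}
		*pos = i;
		return true;
	}
	if (fallback == n)
		return false;
	*pos = fallback;
	return true;
}
```

As in the function above, avoiding a slot is only a preference: the caller gets a failure only when nothing at all is free in the given range.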
|
|
|
|
|
2020-10-15 13:37:59 -07:00
|
|
|
int ext4_mark_inode_used(struct super_block *sb, int ino)
|
|
|
|
{
|
|
|
|
unsigned long max_ino = le32_to_cpu(EXT4_SB(sb)->s_es->s_inodes_count);
|
|
|
|
struct buffer_head *inode_bitmap_bh = NULL, *group_desc_bh = NULL;
|
|
|
|
struct ext4_group_desc *gdp;
|
|
|
|
ext4_group_t group;
|
|
|
|
int bit;
|
2024-08-20 21:22:28 +08:00
|
|
|
int err;
|
2020-10-15 13:37:59 -07:00
|
|
|
|
|
|
|
if (ino < EXT4_FIRST_INO(sb) || ino > max_ino)
|
2024-08-20 21:22:28 +08:00
|
|
|
return -EFSCORRUPTED;
|
2020-10-15 13:37:59 -07:00
|
|
|
|
|
|
|
group = (ino - 1) / EXT4_INODES_PER_GROUP(sb);
|
|
|
|
bit = (ino - 1) % EXT4_INODES_PER_GROUP(sb);
|
|
|
|
inode_bitmap_bh = ext4_read_inode_bitmap(sb, group);
|
|
|
|
if (IS_ERR(inode_bitmap_bh))
|
|
|
|
return PTR_ERR(inode_bitmap_bh);
|
|
|
|
|
|
|
|
if (ext4_test_bit(bit, inode_bitmap_bh->b_data)) {
|
|
|
|
err = 0;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
gdp = ext4_get_group_desc(sb, group, &group_desc_bh);
|
2024-08-20 21:22:33 +08:00
|
|
|
if (!gdp) {
|
2020-10-15 13:37:59 -07:00
|
|
|
err = -EINVAL;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
ext4_set_bit(bit, inode_bitmap_bh->b_data);
|
|
|
|
|
|
|
|
BUFFER_TRACE(inode_bitmap_bh, "call ext4_handle_dirty_metadata");
|
|
|
|
err = ext4_handle_dirty_metadata(NULL, NULL, inode_bitmap_bh);
|
|
|
|
if (err) {
|
|
|
|
ext4_std_error(sb, err);
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
err = sync_dirty_buffer(inode_bitmap_bh);
|
|
|
|
if (err) {
|
|
|
|
ext4_std_error(sb, err);
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* We may have to initialize the block bitmap if it isn't already */
|
|
|
|
if (ext4_has_group_desc_csum(sb) &&
|
|
|
|
gdp->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)) {
|
|
|
|
struct buffer_head *block_bitmap_bh;
|
|
|
|
|
|
|
|
block_bitmap_bh = ext4_read_block_bitmap(sb, group);
|
|
|
|
if (IS_ERR(block_bitmap_bh)) {
|
|
|
|
err = PTR_ERR(block_bitmap_bh);
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
BUFFER_TRACE(block_bitmap_bh, "dirty block bitmap");
|
|
|
|
err = ext4_handle_dirty_metadata(NULL, NULL, block_bitmap_bh);
|
|
|
|
sync_dirty_buffer(block_bitmap_bh);
|
|
|
|
|
|
|
|
/* recheck and clear flag under lock if we still need to */
|
|
|
|
ext4_lock_group(sb, group);
|
|
|
|
if (ext4_has_group_desc_csum(sb) &&
|
|
|
|
(gdp->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT))) {
|
|
|
|
gdp->bg_flags &= cpu_to_le16(~EXT4_BG_BLOCK_UNINIT);
|
|
|
|
ext4_free_group_clusters_set(sb, gdp,
|
|
|
|
ext4_free_clusters_after_init(sb, group, gdp));
|
2023-02-22 04:30:27 +08:00
|
|
|
ext4_block_bitmap_csum_set(sb, gdp, block_bitmap_bh);
|
2020-10-15 13:37:59 -07:00
|
|
|
ext4_group_desc_csum_set(sb, group, gdp);
|
|
|
|
}
|
|
|
|
ext4_unlock_group(sb, group);
|
|
|
|
brelse(block_bitmap_bh);
|
|
|
|
|
|
|
|
if (err) {
|
|
|
|
ext4_std_error(sb, err);
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Update the relevant bg descriptor fields */
|
|
|
|
if (ext4_has_group_desc_csum(sb)) {
|
|
|
|
int free;
|
|
|
|
|
|
|
|
ext4_lock_group(sb, group); /* while we modify the bg desc */
|
|
|
|
free = EXT4_INODES_PER_GROUP(sb) -
|
|
|
|
ext4_itable_unused_count(sb, gdp);
|
|
|
|
if (gdp->bg_flags & cpu_to_le16(EXT4_BG_INODE_UNINIT)) {
|
|
|
|
gdp->bg_flags &= cpu_to_le16(~EXT4_BG_INODE_UNINIT);
|
|
|
|
free = 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Check the relative inode number against the last used
|
|
|
|
* relative inode number in this group. If it is greater,
|
|
|
|
* we need to update the bg_itable_unused count
|
|
|
|
*/
|
|
|
|
if (bit >= free)
|
|
|
|
ext4_itable_unused_set(sb, gdp,
|
|
|
|
(EXT4_INODES_PER_GROUP(sb) - bit - 1));
|
|
|
|
} else {
|
|
|
|
ext4_lock_group(sb, group);
|
|
|
|
}
|
|
|
|
|
|
|
|
ext4_free_inodes_set(sb, gdp, ext4_free_inodes_count(sb, gdp) - 1);
|
|
|
|
if (ext4_has_group_desc_csum(sb)) {
|
2024-08-20 21:22:32 +08:00
|
|
|
ext4_inode_bitmap_csum_set(sb, gdp, inode_bitmap_bh);
|
2020-10-15 13:37:59 -07:00
|
|
|
ext4_group_desc_csum_set(sb, group, gdp);
|
|
|
|
}
|
|
|
|
|
|
|
|
ext4_unlock_group(sb, group);
|
|
|
|
err = ext4_handle_dirty_metadata(NULL, NULL, group_desc_bh);
|
|
|
|
sync_dirty_buffer(group_desc_bh);
|
|
|
|
out:
|
2024-08-20 21:22:28 +08:00
|
|
|
brelse(inode_bitmap_bh);
|
2020-10-15 13:37:59 -07:00
|
|
|
return err;
|
|
|
|
}
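The mapping from an inode number to its block group and bitmap bit, as used in ext4_mark_inode_used() above, can be sketched in isolation. This is a minimal userspace illustration; struct ino_pos and ino_to_pos() are hypothetical names, and inodes_per_group stands in for EXT4_INODES_PER_GROUP(sb).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type: block group plus bit index inside its bitmap. */
struct ino_pos {
	uint32_t group;
	uint32_t bit;
};

/*
 * Inode numbers start at 1, so subtract 1 before dividing: inode 1 is
 * bit 0 of group 0, and the first inode of group g is
 * g * inodes_per_group + 1.
 */
static struct ino_pos ino_to_pos(uint32_t ino, uint32_t inodes_per_group)
{
	struct ino_pos p;

	assert(ino != 0 && inodes_per_group != 0);
	p.group = (ino - 1) / inodes_per_group;
	p.bit = (ino - 1) % inodes_per_group;
	return p;
}
```

With 8192 inodes per group, for example, inode 8192 is the last bit of group 0 and inode 8193 is bit 0 of group 1.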
|
|
|
|
|
2020-09-16 21:11:25 -07:00
|
|
|
static int ext4_xattr_credits_for_new_inode(struct inode *dir, mode_t mode,
|
|
|
|
bool encrypt)
|
|
|
|
{
|
|
|
|
struct super_block *sb = dir->i_sb;
|
|
|
|
int nblocks = 0;
|
|
|
|
#ifdef CONFIG_EXT4_FS_POSIX_ACL
|
2022-09-22 17:17:00 +02:00
|
|
|
struct posix_acl *p = get_inode_acl(dir, ACL_TYPE_DEFAULT);
|
2020-09-16 21:11:25 -07:00
|
|
|
|
|
|
|
if (IS_ERR(p))
|
|
|
|
return PTR_ERR(p);
|
|
|
|
if (p) {
|
|
|
|
int acl_size = p->a_count * sizeof(ext4_acl_entry);
|
|
|
|
|
|
|
|
nblocks += (S_ISDIR(mode) ? 2 : 1) *
|
|
|
|
__ext4_xattr_set_credits(sb, NULL /* inode */,
|
|
|
|
NULL /* block_bh */, acl_size,
|
|
|
|
true /* is_create */);
|
|
|
|
posix_acl_release(p);
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
|
|
|
#ifdef CONFIG_SECURITY
|
|
|
|
{
|
|
|
|
int num_security_xattrs = 1;
|
|
|
|
|
|
|
|
#ifdef CONFIG_INTEGRITY
|
|
|
|
num_security_xattrs++;
|
|
|
|
#endif
|
|
|
|
/*
|
|
|
|
* We assume that security xattrs are never more than 1k.
|
|
|
|
* In practice they are under 128 bytes.
|
|
|
|
*/
|
|
|
|
nblocks += num_security_xattrs *
|
|
|
|
__ext4_xattr_set_credits(sb, NULL /* inode */,
|
|
|
|
NULL /* block_bh */, 1024,
|
|
|
|
true /* is_create */);
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
if (encrypt)
|
|
|
|
nblocks += __ext4_xattr_set_credits(sb,
|
|
|
|
NULL /* inode */,
|
|
|
|
NULL /* block_bh */,
|
|
|
|
FSCRYPT_SET_CONTEXT_MAX_SIZE,
|
|
|
|
true /* is_create */);
|
|
|
|
return nblocks;
|
|
|
|
}
|
|
|
|
|
2006-10-11 01:20:50 -07:00
|
|
|
/*
|
|
|
|
* There are two policies for allocating an inode. If the new inode is
|
|
|
|
* a directory, then a forward search is made for a block group with both
|
|
|
|
* free space and a low directory-to-inode ratio; if that fails, then of
|
|
|
|
* the groups with above-average free space, that group with the fewest
|
|
|
|
* directories already is chosen.
|
|
|
|
*
|
|
|
|
* For other inodes, search forward from the parent directory's block
|
|
|
|
* group to find a free inode.
|
|
|
|
*/
|
2023-01-13 12:49:25 +01:00
|
|
|
struct inode *__ext4_new_inode(struct mnt_idmap *idmap,
|
2021-01-21 14:19:57 +01:00
|
|
|
handle_t *handle, struct inode *dir,
|
2013-02-09 16:27:09 -05:00
|
|
|
umode_t mode, const struct qstr *qstr,
|
2017-06-21 21:21:39 -04:00
|
|
|
__u32 goal, uid_t *owner, __u32 i_flags,
|
|
|
|
int handle_type, unsigned int line_no,
|
|
|
|
int nblocks)
|
2006-10-11 01:20:50 -07:00
|
|
|
{
|
|
|
|
struct super_block *sb;
|
2009-01-03 22:33:39 -05:00
|
|
|
struct buffer_head *inode_bitmap_bh = NULL;
|
|
|
|
struct buffer_head *group_desc_bh;
|
2009-05-01 08:50:38 -04:00
|
|
|
ext4_group_t ngroups, group = 0;
|
2006-10-11 01:20:50 -07:00
|
|
|
unsigned long ino = 0;
|
2008-09-08 22:25:24 -04:00
|
|
|
struct inode *inode;
|
|
|
|
struct ext4_group_desc *gdp = NULL;
|
2006-10-11 01:20:53 -07:00
|
|
|
struct ext4_inode_info *ei;
|
|
|
|
struct ext4_sb_info *sbi;
|
2015-06-29 16:22:54 +02:00
|
|
|
int ret2, err;
|
2006-10-11 01:20:50 -07:00
|
|
|
struct inode *ret;
|
2008-01-28 23:58:27 -05:00
|
|
|
ext4_group_t i;
|
2008-07-11 19:27:31 -04:00
|
|
|
ext4_group_t flex_group;
|
2020-10-15 13:37:59 -07:00
|
|
|
struct ext4_group_info *grp = NULL;
|
2020-09-16 21:11:26 -07:00
|
|
|
bool encrypt = false;
|
2006-10-11 01:20:50 -07:00
|
|
|
|
|
|
|
/* Cannot create files in a deleted directory */
|
|
|
|
if (!dir || !dir->i_nlink)
|
|
|
|
return ERR_PTR(-EPERM);
|
|
|
|
|
2017-07-06 00:01:59 -04:00
|
|
|
sb = dir->i_sb;
|
|
|
|
sbi = EXT4_SB(sb);
|
|
|
|
|
2023-06-16 18:50:49 +02:00
|
|
|
if (unlikely(ext4_forced_shutdown(sb)))
|
2017-02-05 01:28:48 -05:00
|
|
|
return ERR_PTR(-EIO);
|
|
|
|
|
2009-05-01 08:50:38 -04:00
|
|
|
ngroups = ext4_get_groups_count(sb);
|
2009-06-17 11:48:11 -04:00
|
|
|
trace_ext4_request_inode(dir, mode);
|
2006-10-11 01:20:50 -07:00
|
|
|
inode = new_inode(sb);
|
|
|
|
if (!inode)
|
|
|
|
return ERR_PTR(-ENOMEM);
|
2006-10-11 01:20:53 -07:00
|
|
|
ei = EXT4_I(inode);
|
2008-07-11 19:27:31 -04:00
|
|
|
|
2013-04-19 13:38:14 -04:00
|
|
|
/*
|
2016-03-09 23:49:05 -05:00
|
|
|
* Initialize owners and quota early so that we don't have to account
|
2013-04-19 13:38:14 -04:00
|
|
|
* for quota initialization worst case in standard inode creating
|
|
|
|
* transaction
|
|
|
|
*/
|
|
|
|
if (owner) {
|
|
|
|
inode->i_mode = mode;
|
|
|
|
i_uid_write(inode, owner[0]);
|
|
|
|
i_gid_write(inode, owner[1]);
|
|
|
|
} else if (test_opt(sb, GRPID)) {
|
|
|
|
inode->i_mode = mode;
|
2023-01-13 12:49:31 +01:00
|
|
|
inode_fsuid_set(inode, idmap);
|
2013-04-19 13:38:14 -04:00
|
|
|
inode->i_gid = dir->i_gid;
|
|
|
|
} else
|
2023-01-13 12:49:25 +01:00
|
|
|
inode_init_owner(idmap, inode, dir, mode);
|
2016-01-08 16:01:21 -05:00
|
|
|
|
2016-09-05 23:11:58 -04:00
|
|
|
if (ext4_has_feature_project(sb) &&
|
2016-01-08 16:01:21 -05:00
|
|
|
ext4_test_inode_flag(dir, EXT4_INODE_PROJINHERIT))
|
|
|
|
ei->i_projid = EXT4_I(dir)->i_projid;
|
|
|
|
else
|
|
|
|
ei->i_projid = make_kprojid(&init_user_ns, EXT4_DEF_PROJID);
|
|
|
|
|
2020-09-16 21:11:26 -07:00
|
|
|
if (!(i_flags & EXT4_EA_INODE_FL)) {
|
|
|
|
err = fscrypt_prepare_new_inode(dir, inode, &encrypt);
|
|
|
|
if (err)
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
2015-06-29 16:22:54 +02:00
|
|
|
err = dquot_initialize(inode);
|
|
|
|
if (err)
|
|
|
|
goto out;
|
2013-04-19 13:38:14 -04:00
|
|
|
|
2020-09-16 21:11:26 -07:00
|
|
|
if (!handle && sbi->s_journal && !(i_flags & EXT4_EA_INODE_FL)) {
|
|
|
|
ret2 = ext4_xattr_credits_for_new_inode(dir, mode, encrypt);
|
|
|
|
if (ret2 < 0) {
|
|
|
|
err = ret2;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
nblocks += ret2;
|
|
|
|
}
|
|
|
|
|
2009-06-13 11:45:35 -04:00
|
|
|
if (!goal)
|
|
|
|
goal = sbi->s_inode_goal;
|
|
|
|
|
2009-07-05 23:45:11 -04:00
|
|
|
if (goal && goal <= le32_to_cpu(sbi->s_es->s_inodes_count)) {
|
2009-06-13 11:45:35 -04:00
|
|
|
group = (goal - 1) / EXT4_INODES_PER_GROUP(sb);
|
|
|
|
ino = (goal - 1) % EXT4_INODES_PER_GROUP(sb);
|
|
|
|
ret2 = 0;
|
|
|
|
goto got_group;
|
|
|
|
}
|
|
|
|
|
2011-10-08 14:34:47 -04:00
|
|
|
if (S_ISDIR(mode))
|
|
|
|
ret2 = find_group_orlov(sb, dir, &group, mode, qstr);
|
|
|
|
else
|
2009-03-12 12:18:34 -04:00
|
|
|
ret2 = find_group_other(sb, dir, &group, mode);
|
2006-10-11 01:20:50 -07:00
|
|
|
|
2008-07-11 19:27:31 -04:00
|
|
|
got_group:
|
2009-03-12 12:18:34 -04:00
|
|
|
EXT4_I(dir)->i_last_alloc_group = group;
|
2006-10-11 01:20:50 -07:00
|
|
|
err = -ENOSPC;
|
2008-01-28 23:58:27 -05:00
|
|
|
if (ret2 == -1)
|
2006-10-11 01:20:50 -07:00
|
|
|
goto out;
|
|
|
|
|
2012-02-06 20:12:03 -05:00
|
|
|
/*
|
|
|
|
* Normally we will only go through one pass of this loop,
|
|
|
|
* unless we get unlucky and it turns out the group we selected
|
|
|
|
* had its last inode grabbed by someone else.
|
|
|
|
*/
|
2009-06-13 11:45:35 -04:00
|
|
|
for (i = 0; i < ngroups; i++, ino = 0) {
|
2006-10-11 01:20:50 -07:00
|
|
|
err = -EIO;
|
|
|
|
|
2009-01-03 22:33:39 -05:00
|
|
|
gdp = ext4_get_group_desc(sb, group, &group_desc_bh);
|
2006-10-11 01:20:50 -07:00
|
|
|
if (!gdp)
|
2013-04-19 13:38:14 -04:00
|
|
|
goto out;
|
2006-10-11 01:20:50 -07:00
|
|
|
|
2012-09-23 23:16:03 -04:00
|
|
|
/*
|
|
|
|
* Check free inodes count before loading bitmap.
|
|
|
|
*/
|
2017-08-24 11:58:18 -04:00
|
|
|
if (ext4_free_inodes_count(sb, gdp) == 0)
|
|
|
|
goto next_group;
|
2012-09-23 23:16:03 -04:00
|
|
|
|
2020-10-15 13:37:59 -07:00
|
|
|
if (!(sbi->s_mount_state & EXT4_FC_REPLAY)) {
|
|
|
|
grp = ext4_get_group_info(sb, group);
|
|
|
|
/*
|
|
|
|
* Skip groups with already-known suspicious inode
|
|
|
|
* tables
|
|
|
|
*/
|
ext4: allow ext4_get_group_info() to fail
Previously, ext4_get_group_info() would treat an invalid group number
as BUG(), since in theory it should never happen. However, if a
malicious attacker (or fuzzer) modifies the superblock via the block
device while the file system is mounted, it is possible for
s_first_data_block to get set to a very large number. In that case,
calculating the block group of some block number (such as the
starting block of a preallocation region) could result in an
underflow and a very large block group number. Then the BUG_ON check in
ext4_get_group_info() would fire, resulting in a denial of service
attack that can be triggered by root or someone with write access to
the block device.
From a quality of implementation perspective, it's best that even if
the system administrator does something that they shouldn't, it
will not trigger a BUG. So instead of BUG'ing, ext4_get_group_info()
will call ext4_error and return NULL. We also add fallback code in
all of the callers of ext4_get_group_info() to handle a NULL return.
Also, since ext4_get_group_info() was already borderline to be an
inline function, un-inline it. This results in a net reduction of the
compiled text size of ext4 by roughly 2k.
Cc: stable@kernel.org
Link: https://lore.kernel.org/r/20230430154311.579720-2-tytso@mit.edu
Reported-by: syzbot+e2efa3efc15a1c9e95c3@syzkaller.appspotmail.com
Link: https://syzkaller.appspot.com/bug?id=69b28112e098b070f639efb356393af3ffec4220
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2023-04-29 00:06:28 -04:00
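The underflow described in the commit message above can be illustrated with a small userspace sketch. Everything here is a simplified, hypothetical stand-in (the `toy_*` names and the struct layout are assumptions for illustration, not the kernel's definitions); it only shows how an unsigned subtraction wraps into a huge group number and how a lookup can return NULL instead of aborting.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins; these are NOT the kernel structures. */
struct toy_group_info {
	int dummy;
};

struct toy_fs {
	uint64_t first_data_block;      /* corrupted in the scenario above */
	uint64_t blocks_per_group;
	uint64_t ngroups;
	struct toy_group_info *groups;  /* array of ngroups entries */
};

/*
 * Map a block number to its block group. The subtraction is unsigned,
 * so it wraps around when first_data_block is larger than block.
 */
static uint64_t toy_block_to_group(const struct toy_fs *fs, uint64_t block)
{
	assert(fs->blocks_per_group != 0);
	return (block - fs->first_data_block) / fs->blocks_per_group;
}

/*
 * Look up per-group info. An out-of-range group yields NULL so the
 * caller can skip the group, rather than crashing the whole system.
 */
static struct toy_group_info *toy_get_group_info(struct toy_fs *fs,
						 uint64_t group)
{
	if (group >= fs->ngroups)
		return NULL;
	return &fs->groups[group];
}
```

A caller would test the result for NULL before dereferencing it, which is the same shape as the `!grp ||` check in the allocation loop.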
|
|
|
if (!grp || EXT4_MB_GRP_IBITMAP_CORRUPT(grp))
|
2020-10-15 13:37:59 -07:00
|
|
|
goto next_group;
|
|
|
|
}
|
2013-08-28 18:32:58 -04:00
|
|
|
|
2009-01-03 22:33:39 -05:00
|
|
|
brelse(inode_bitmap_bh);
|
|
|
|
inode_bitmap_bh = ext4_read_inode_bitmap(sb, group);
|
2013-08-28 18:32:58 -04:00
|
|
|
/* Skip groups with suspicious inode tables */
|
2024-08-20 21:22:29 +08:00
|
|
|
if (IS_ERR(inode_bitmap_bh)) {
|
2015-10-17 21:33:24 -04:00
|
|
|
inode_bitmap_bh = NULL;
|
2017-08-24 11:58:18 -04:00
|
|
|
goto next_group;
|
2013-08-28 18:32:58 -04:00
|
|
|
}
|
2024-08-20 21:22:29 +08:00
|
|
|
if (!(sbi->s_mount_state & EXT4_FC_REPLAY) &&
|
|
|
|
EXT4_MB_GRP_IBITMAP_CORRUPT(grp))
|
|
|
|
goto next_group;
|
2006-10-11 01:20:50 -07:00
|
|
|
|
ext4: reduce lock contention in __ext4_new_inode
While running a number of file-creating threads concurrently,
we found heavy lock contention on the group spinlock:
FUNC                            TOTAL_TIME(us)  COUNT      AVG(us)
ext4_create                     1707443399      1440000    1185.72
_raw_spin_lock                  1317641501      180899929  7.28
jbd2__journal_start             287821030       1453950    197.96
jbd2_journal_get_write_access   33441470        73077185   0.46
ext4_add_nondir                 29435963        1440000    20.44
ext4_add_entry                  26015166        1440049    18.07
ext4_dx_add_entry               25729337        1432814    17.96
ext4_mark_inode_dirty           12302433        5774407    2.13
Most of the CPU time is attributed to _raw_spin_lock. Here are some
test numbers with and without the patch.
Test environment:
Server : SuperMicro Server (2 x E5-2690 v3@2.60GHz, 128GB 2133MHz
DDR4 Memory, 8GbFC)
Storage : 2 x RAID1 (DDN SFA7700X, 4 x Toshiba PX02SMU020 200GB
Read Intensive SSD)
format command:
mkfs.ext4 -J size=4096
test command:
mpirun -np 48 mdtest -n 30000 -d /ext4/mdtest.out -F -C \
-r -i 1 -v -p 10 -u #first run to load inode
mpirun -np 48 mdtest -n 30000 -d /ext4/mdtest.out -F -C \
-r -i 3 -v -p 10 -u
Kernel version: 4.13.0-rc3
Test 1,440,000 files with 48 directories by 48 processes:
Without patch:
File Creation   File removal
79,033          289,569       ops per second
81,463          285,359
79,875          288,475
With patch:
File Creation   File removal
810669          301694
812805          302711
813965          297670
Creation performance improves by more than 10X with a large
journal size. The main problem is that we test the bitmap,
then do some checks and journal operations that may sleep,
and only then test and set the bit with the lock held. This
is racy, and the inode can be stolen by another process.
However, after the first try we know the handle has been
started and the inode bitmap has been journaled, so we can
find and set the bit directly with the lock held, which
will mostly guarantee success on the second try.
Tested-by: Shuichi Ihara <sihara@ddn.com>
Signed-off-by: Wang Shilong <wshilong@ddn.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2017-08-24 12:56:35 -04:00
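The retry scheme described in the commit message above can be sketched in userspace as follows. This is only an illustration under stated assumptions: the `toy_*` names, the 64-inode group size, and the C11 `atomic_flag` spinlock are all hypothetical stand-ins for the group lock and the inode bitmap helpers. The candidate is picked without the lock, and only if it has been taken in the meantime is the search repeated while the lock is held.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TOY_INODES 64 /* hypothetical group size, for illustration only */

struct toy_group {
	atomic_flag lock;                     /* stand-in for the group spinlock */
	unsigned char bitmap[TOY_INODES / 8]; /* one bit per inode, 1 = in use */
};

static void toy_group_init(struct toy_group *g)
{
	atomic_flag_clear(&g->lock);
	memset(g->bitmap, 0, sizeof(g->bitmap));
}

static void toy_lock(struct toy_group *g)
{
	while (atomic_flag_test_and_set(&g->lock))
		; /* spin until the flag was previously clear */
}

static void toy_unlock(struct toy_group *g)
{
	atomic_flag_clear(&g->lock);
}

static bool toy_test_bit(const unsigned char *map, size_t bit)
{
	return (map[bit / 8] >> (bit % 8)) & 1u;
}

static void toy_set_bit(unsigned char *map, size_t bit)
{
	map[bit / 8] |= (unsigned char)(1u << (bit % 8));
}

/* Find the first clear bit at or after start; false if none is left. */
static bool toy_find_free(const unsigned char *map, size_t start, size_t *out)
{
	for (size_t bit = start; bit < TOY_INODES; bit++) {
		if (!toy_test_bit(map, bit)) {
			*out = bit;
			return true;
		}
	}
	return false;
}

/* Phase 1: unlocked search. The result may be stale by the time we claim it. */
static bool toy_pick_candidate(const struct toy_group *g, size_t *ino)
{
	return toy_find_free(g->bitmap, 0, ino);
}

/*
 * Phase 2: claim under the lock. If the candidate was taken while we were
 * busy (in the kernel: starting the handle, getting write access), repeat
 * the search with the lock held so the second try cannot be raced.
 */
static bool toy_claim(struct toy_group *g, size_t *ino)
{
	bool ok = true;

	assert(*ino < TOY_INODES);
	toy_lock(g);
	if (toy_test_bit(g->bitmap, *ino))
		ok = toy_find_free(g->bitmap, *ino, ino);
	if (ok)
		toy_set_bit(g->bitmap, *ino);
	toy_unlock(g);
	return ok;
}
```

Splitting the work into a pick step and a claim step keeps the potentially slow work outside the lock, so the lock is held only for a short bitmap operation.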
|
|
|
ret2 = find_inode_bit(sb, group, inode_bitmap_bh, &ino);
|
|
|
|
if (!ret2)
|
2013-07-26 15:15:46 -04:00
|
|
|
goto next_group;
|
2017-08-24 12:56:35 -04:00
|
|
|
|
|
|
|
if (group == 0 && (ino + 1) < EXT4_FIRST_INO(sb)) {
|
2012-02-06 20:12:03 -05:00
|
|
|
ext4_error(sb, "reserved inode found cleared - "
|
|
|
|
"inode=%lu", ino + 1);
|
2018-05-12 12:15:21 -04:00
|
|
|
ext4_mark_group_bitmap_corrupted(sb, group,
|
|
|
|
EXT4_GROUP_INFO_IBITMAP_CORRUPT);
|
2017-08-24 11:58:18 -04:00
|
|
|
goto next_group;
|
2012-02-06 20:12:03 -05:00
|
|
|
}
|
2017-08-24 12:56:35 -04:00
|
|
|
|
2020-10-15 13:37:59 -07:00
|
|
|
if ((!(sbi->s_mount_state & EXT4_FC_REPLAY)) && !handle) {
|
2013-02-09 16:27:09 -05:00
|
|
|
BUG_ON(nblocks <= 0);
|
2022-10-08 20:05:18 +08:00
|
|
|
handle = __ext4_journal_start_sb(NULL, dir->i_sb,
|
|
|
|
line_no, handle_type, nblocks, 0,
|
ext4: reserve revoke credits in __ext4_new_inode
It's possible that __ext4_new_inode will release the xattr block, which
triggers a warning because the revoke credits will be 0 if
handle == NULL. The scripts below can reproduce it easily.
------------[ cut here ]------------
WARNING: CPU: 0 PID: 3861 at fs/jbd2/revoke.c:374 jbd2_journal_revoke+0x30e/0x540 fs/jbd2/revoke.c:374
...
__ext4_forget+0x1d7/0x800 fs/ext4/ext4_jbd2.c:248
ext4_free_blocks+0x213/0x1d60 fs/ext4/mballoc.c:4743
ext4_xattr_release_block+0x55b/0x780 fs/ext4/xattr.c:1254
ext4_xattr_block_set+0x1c2c/0x2c40 fs/ext4/xattr.c:2112
ext4_xattr_set_handle+0xa7e/0x1090 fs/ext4/xattr.c:2384
__ext4_set_acl+0x54d/0x6c0 fs/ext4/acl.c:214
ext4_init_acl+0x218/0x2e0 fs/ext4/acl.c:293
__ext4_new_inode+0x352a/0x42b0 fs/ext4/ialloc.c:1151
ext4_mkdir+0x2e9/0xbd0 fs/ext4/namei.c:2774
vfs_mkdir+0x386/0x5f0 fs/namei.c:3811
do_mkdirat+0x11c/0x210 fs/namei.c:3834
do_syscall_64+0xa1/0x530 arch/x86/entry/common.c:294
...
-------------------------------------
scripts:
mkfs.ext4 /dev/vdb
mount /dev/vdb /mnt
cd /mnt && mkdir dir && for i in {1..8}; do setfacl -dm "u:user_"$i":rx" dir; done
mkdir dir/dir1 && mv dir/dir1 ./
sh repro.sh && add some user
[root@localhost ~]# cat repro.sh
while [ 1 -eq 1 ]; do
rm -rf dir
rm -rf dir1/dir1
mkdir dir
for i in {1..8}; do setfacl -dm "u:test"$i":rx" dir; done
setfacl -m "u:user_9:rx" dir &
mkdir dir1/dir1 &
done
Before repro.sh runs, dir1 has inherited the default acl from dir, but
the xattr blocks of dir1 and dir are not the same, so the h_refcount of
each dir's xattr block will be 1. repro.sh can then trigger the warning
in the situation shown below. The last h_refcount can be cleared
by mkdir, and __ext4_new_inode has not reserved revoke credits, so
the warning fires. Fix it by reserving revoke credits in
__ext4_new_inode.
Thread 1                               Thread 2
mkdir dir
set default acl(will create
a xattr block blk1 and the
refcount of ext4_xattr_header
will be 1)
                                       ...
                                       mkdir dir1/dir1
                                       ->....->ext4_init_acl
                                       ->__ext4_set_acl(set default acl,
                                         will reuse blk1, and h_refcount
                                         will be 2)
setfacl->ext4_set_acl->...
->ext4_xattr_block_set(will create
  new block blk2 to store xattr)
                                       ->__ext4_set_acl(set access acl, since
                                         h_refcount of blk1 is 2, will create
                                         blk3 to store xattr)
->ext4_xattr_release_block(dec
  h_refcount of blk1 to 1)
                                       ->ext4_xattr_release_block(dec
                                         h_refcount and since it is 0,
                                         will release the block and trigger
                                         the warning)
Link: https://lore.kernel.org/r/20191213014900.47228-1-yangerkun@huawei.com
Reported-by: Hulk Robot <hulkci@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-12-13 09:49:00 +08:00
|
|
|
ext4_trans_default_revoke_credits(sb));
|
2013-02-09 16:27:09 -05:00
|
|
|
if (IS_ERR(handle)) {
|
|
|
|
err = PTR_ERR(handle);
|
2013-04-19 13:38:14 -04:00
|
|
|
ext4_std_error(sb, err);
|
|
|
|
goto out;
|
2013-02-09 16:27:09 -05:00
|
|
|
}
|
|
|
|
}
|
2012-10-28 22:24:57 -04:00
|
|
|
BUFFER_TRACE(inode_bitmap_bh, "get_write_access");
|
2021-08-16 11:57:04 +02:00
|
|
|
err = ext4_journal_get_write_access(handle, sb, inode_bitmap_bh,
|
|
|
|
EXT4_JTR_NONE);
|
2013-04-19 13:38:14 -04:00
|
|
|
if (err) {
|
|
|
|
ext4_std_error(sb, err);
|
|
|
|
goto out;
|
|
|
|
}
|
2012-02-06 20:12:03 -05:00
|
|
|
ext4_lock_group(sb, group);
|
|
|
|
ret2 = ext4_test_and_set_bit(ino, inode_bitmap_bh->b_data);
|
2017-08-24 12:56:35 -04:00
|
|
|
if (ret2) {
|
|
|
|
/* Someone already took the bit. Repeat the search
|
|
|
|
* with lock held.
|
|
|
|
*/
|
|
|
|
ret2 = find_inode_bit(sb, group, inode_bitmap_bh, &ino);
|
|
|
|
if (ret2) {
|
|
|
|
ext4_set_bit(ino, inode_bitmap_bh->b_data);
|
|
|
|
ret2 = 0;
|
|
|
|
} else {
|
|
|
|
ret2 = 1; /* we didn't grab the inode */
|
|
|
|
}
|
|
|
|
}
|
2012-02-06 20:12:03 -05:00
|
|
|
ext4_unlock_group(sb, group);
|
|
|
|
ino++; /* the inode bitmap is zero-based */
|
|
|
|
if (!ret2)
|
|
|
|
goto got; /* we grabbed the inode! */
|
2017-08-24 12:56:35 -04:00
|
|
|
|
2013-07-26 15:15:46 -04:00
|
|
|
next_group:
|
|
|
|
if (++group == ngroups)
|
|
|
|
group = 0;
|
2006-10-11 01:20:50 -07:00
|
|
|
}
|
|
|
|
err = -ENOSPC;
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
got:
|
2012-10-28 22:24:57 -04:00
|
|
|
BUFFER_TRACE(inode_bitmap_bh, "call ext4_handle_dirty_metadata");
|
|
|
|
err = ext4_handle_dirty_metadata(handle, NULL, inode_bitmap_bh);
|
2013-04-19 13:38:14 -04:00
|
|
|
if (err) {
|
|
|
|
ext4_std_error(sb, err);
|
|
|
|
goto out;
|
|
|
|
}
|
2012-10-28 22:24:57 -04:00
|
|
|
|
2014-07-05 16:28:35 -04:00
|
|
|
BUFFER_TRACE(group_desc_bh, "get_write_access");
|
2021-08-16 11:57:04 +02:00
|
|
|
err = ext4_journal_get_write_access(handle, sb, group_desc_bh,
|
|
|
|
EXT4_JTR_NONE);
|
2014-07-05 16:28:35 -04:00
|
|
|
if (err) {
|
|
|
|
ext4_std_error(sb, err);
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
Ext4: Uninitialized Block Groups
In pass1 of e2fsck, every inode table in the filesystem is scanned and checked,
regardless of whether it is in use. This is the most time consuming part
of the filesystem check. The uninitialized block group feature can greatly
reduce e2fsck time by eliminating checking of uninitialized inodes.
With this feature, there is a high water mark of used inodes for each block
group. Block and inode bitmaps can be uninitialized on disk via a flag in the
group descriptor to avoid reading or scanning them at e2fsck time. A checksum
of each group descriptor is used to ensure that corruption in the group
descriptor's bit flags does not cause incorrect operation.
The feature is enabled through a mkfs option
mke2fs /dev/ -O uninit_groups
A patch adding support for uninitialized block groups to e2fsprogs tools has
been posted to the linux-ext4 mailing list.
The patches have been stress tested with fsstress and fsx. In performance
tests testing e2fsck time, we have seen that e2fsck time on ext3 grows
linearly with the total number of inodes in the filesystem. In ext4 with the
uninitialized block groups feature, the e2fsck time is constant, based
solely on the number of used inodes rather than the total inode count.
Since typical ext4 filesystems only use 1-10% of their inodes, this feature can
greatly reduce e2fsck time for users, with a performance improvement of 2-20
times, depending on how full the filesystem is.
The attached graph shows the major improvements in e2fsck times in filesystems
with a large total inode count, but few inodes in use.
In each group descriptor if we have
EXT4_BG_INODE_UNINIT set in bg_flags:
  Inode table is not initialized/used in this group. So we can skip
  the consistency check during fsck.
EXT4_BG_BLOCK_UNINIT set in bg_flags:
  No block in the group is used. So we can skip the block bitmap
  verification for this group.
We also add two new fields to group descriptor as a part of
uninitialized group patch.
  __le16 bg_itable_unused; /* Unused inodes count */
  __le16 bg_checksum; /* crc16(sb_uuid+group+desc) */
bg_itable_unused:
  If EXT4_BG_INODE_UNINIT is not set in bg_flags,
  then bg_itable_unused gives the offset within
  the inode table up to which the inodes are used. This can be
  used by fsck to skip the list of inodes that are marked unused.
bg_checksum:
  Now that we depend on bg_flags and bg_itable_unused to determine
  the block and inode usage, we need to make sure the group descriptor
  is not corrupt. We add a checksum to the group descriptor to
  detect corruption. If the descriptor is found to be corrupt, we
  mark all the blocks and inodes in the group used.
Signed-off-by: Avantika Mathur <mathur@us.ibm.com>
Signed-off-by: Andreas Dilger <adilger@clusterfs.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
2007-10-16 18:38:25 -04:00
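The flag and checksum rules in the commit message above can be sketched as a small decision helper. This is a minimal illustration, not the e2fsck or kernel logic: the `toy_*` names and the struct are hypothetical, the checksum verification is reduced to a boolean supplied by the caller, and `bg_itable_unused` is treated as a count of unused inodes at the end of the table, following the field comment in the message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Flag values mirror EXT4_BG_INODE_UNINIT and EXT4_BG_BLOCK_UNINIT. */
#define TOY_BG_INODE_UNINIT 0x0001
#define TOY_BG_BLOCK_UNINIT 0x0002

/* Hypothetical, CPU-endian stand-in for the on-disk group descriptor. */
struct toy_group_desc {
	uint16_t bg_flags;
	uint16_t bg_itable_unused; /* unused inodes count */
};

/*
 * How many inodes a checker would have to scan in this group.
 * A descriptor with a bad checksum is not trusted: everything is
 * treated as in use.
 */
static uint32_t toy_inodes_to_scan(const struct toy_group_desc *gd,
				   uint32_t inodes_per_group,
				   bool checksum_ok)
{
	assert(gd != NULL);
	if (!checksum_ok)
		return inodes_per_group;
	if (gd->bg_flags & TOY_BG_INODE_UNINIT)
		return 0;
	if (gd->bg_itable_unused > inodes_per_group)
		return inodes_per_group; /* implausible count, scan everything */
	return inodes_per_group - gd->bg_itable_unused;
}

/* Whether the block bitmap of this group needs to be verified. */
static bool toy_must_check_block_bitmap(const struct toy_group_desc *gd,
					bool checksum_ok)
{
	assert(gd != NULL);
	return !checksum_ok || !(gd->bg_flags & TOY_BG_BLOCK_UNINIT);
}
```

Falling back to "scan everything" on a checksum failure errs on the side of extra work rather than missed corruption, which matches the rule that a corrupt descriptor marks all blocks and inodes as used.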
|
|
|
/* We may have to initialize the block bitmap if it isn't already */
|
2012-04-29 18:45:10 -04:00
|
|
|
if (ext4_has_group_desc_csum(sb) &&
|
2007-10-16 18:38:25 -04:00
|
|
|
gdp->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)) {
|
2009-01-03 22:33:39 -05:00
|
|
|
struct buffer_head *block_bitmap_bh;
|
2007-10-16 18:38:25 -04:00
|
|
|
|
2009-01-03 22:33:39 -05:00
|
|
|
block_bitmap_bh = ext4_read_block_bitmap(sb, group);
|
2015-10-17 21:33:24 -04:00
|
|
|
if (IS_ERR(block_bitmap_bh)) {
|
|
|
|
err = PTR_ERR(block_bitmap_bh);
|
2014-10-30 10:53:16 -04:00
|
|
|
goto out;
|
|
|
|
}
|
2009-01-03 22:33:39 -05:00
|
|
|
BUFFER_TRACE(block_bitmap_bh, "get block bitmap access");
|
2021-08-16 11:57:04 +02:00
|
|
|
err = ext4_journal_get_write_access(handle, sb, block_bitmap_bh,
|
|
|
|
EXT4_JTR_NONE);
|
Ext4: Uninitialized Block Groups
In pass1 of e2fsck, every inode table in the fileystem is scanned and checked,
regardless of whether it is in use. This is this the most time consuming part
of the filesystem check. The unintialized block group feature can greatly
reduce e2fsck time by eliminating checking of uninitialized inodes.
With this feature, there is a a high water mark of used inodes for each block
group. Block and inode bitmaps can be uninitialized on disk via a flag in the
group descriptor to avoid reading or scanning them at e2fsck time. A checksum
of each group descriptor is used to ensure that corruption in the group
descriptor's bit flags does not cause incorrect operation.
The feature is enabled through a mkfs option
mke2fs /dev/ -O uninit_groups
A patch adding support for uninitialized block groups to e2fsprogs tools has
been posted to the linux-ext4 mailing list.
The patches have been stress tested with fsstress and fsx. In performance
tests measuring e2fsck time, we have seen that e2fsck time on ext3 grows
linearly with the total number of inodes in the filesystem. In ext4 with the
uninitialized block groups feature, the e2fsck time is independent of the
total inode count and depends solely on the number of used inodes.
Since typical ext4 filesystems only use 1-10% of their inodes, this feature can
greatly reduce e2fsck time for users, with a performance improvement of 2-20
times, depending on how full the filesystem is.
The attached graph shows the major improvements in e2fsck times in filesystems
with a large total inode count, but few inodes in use.
In each group descriptor if we have
EXT4_BG_INODE_UNINIT set in bg_flags:
Inode table is not initialized/used in this group. So we can skip
the consistency check during fsck.
EXT4_BG_BLOCK_UNINIT set in bg_flags:
No block in the group is used. So we can skip the block bitmap
verification for this group.
We also add two new fields to the group descriptor as part of the
uninitialized group patch.
__le16 bg_itable_unused; /* Unused inodes count */
__le16 bg_checksum; /* crc16(sb_uuid+group+desc) */
bg_itable_unused:
If EXT4_BG_INODE_UNINIT is not set in bg_flags,
bg_itable_unused locates the point in the inode table
up to which inodes are in use. fsck can use it
to skip the inodes that are marked unused.
bg_checksum:
Now that we depend on bg_flags and bg_itable_unused to determine
block and inode usage, we need to make sure the group descriptor
is not corrupt. We add a checksum to the group descriptor to
detect corruption. If the descriptor is found to be corrupt, we
mark all the blocks and inodes in the group as used.
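As an illustration only (this is not code from the patch), the way a checker
could combine these fields might be sketched as below. The struct
bg_desc_view, its csum_ok member and inodes_to_scan() are hypothetical names
introduced for the example; the two flag values are the ones defined in
fs/ext4/ext4.h.

```c
#include <stdint.h>

/* Flag values as defined in the kernel's fs/ext4/ext4.h. */
#define EXT4_BG_INODE_UNINIT 0x0001 /* inode table/bitmap not in use */
#define EXT4_BG_BLOCK_UNINIT 0x0002 /* block bitmap not in use */

/* Hypothetical CPU-endian view of the descriptor fields discussed above. */
struct bg_desc_view {
	uint16_t bg_flags;
	uint16_t bg_itable_unused; /* unused inodes at the end of the table */
	int      csum_ok;          /* assumed result of a checksum check */
};

/* Number of inodes in this group that a checker would have to scan. */
static uint32_t inodes_to_scan(const struct bg_desc_view *gd,
			       uint32_t inodes_per_group)
{
	/* Corrupt descriptor: trust nothing, treat every inode as used. */
	if (!gd->csum_ok)
		return inodes_per_group;
	/* Whole inode table is uninitialized: nothing to scan. */
	if (gd->bg_flags & EXT4_BG_INODE_UNINIT)
		return 0;
	/* Implausible count: fall back to a full scan. */
	if (gd->bg_itable_unused > inodes_per_group)
		return inodes_per_group;
	/* Scan only up to the high water mark. */
	return inodes_per_group - gd->bg_itable_unused;
}
```

With 8192 inodes per group and bg_itable_unused = 8000, only the first 192
inodes need to be looked at under these assumptions.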
Signed-off-by: Avantika Mathur <mathur@us.ibm.com>
Signed-off-by: Andreas Dilger <adilger@clusterfs.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
2007-10-16 18:38:25 -04:00
		if (err) {
2009-01-03 22:33:39 -05:00
			brelse(block_bitmap_bh);
2013-04-19 13:38:14 -04:00
			ext4_std_error(sb, err);
			goto out;
2007-10-16 18:38:25 -04:00
		}

2011-09-09 18:42:51 -04:00
		BUFFER_TRACE(block_bitmap_bh, "dirty block bitmap");
		err = ext4_handle_dirty_metadata(handle, NULL, block_bitmap_bh);

2007-10-16 18:38:25 -04:00
		/* recheck and clear flag under lock if we still need to */
2011-09-09 18:42:51 -04:00
		ext4_lock_group(sb, group);
2018-06-14 00:58:00 -04:00
		if (ext4_has_group_desc_csum(sb) &&
		    (gdp->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT))) {
2009-01-03 22:33:39 -05:00
			gdp->bg_flags &= cpu_to_le16(~EXT4_BG_BLOCK_UNINIT);
2011-09-09 19:08:51 -04:00
			ext4_free_group_clusters_set(sb, gdp,
2011-09-09 19:12:51 -04:00
				ext4_free_clusters_after_init(sb, group, gdp));
2023-02-22 04:30:27 +08:00
			ext4_block_bitmap_csum_set(sb, gdp, block_bitmap_bh);
2012-04-29 18:45:10 -04:00
			ext4_group_desc_csum_set(sb, group, gdp);
2007-10-16 18:38:25 -04:00
		}
2009-05-02 20:35:09 -04:00
		ext4_unlock_group(sb, group);
2012-11-29 21:21:22 -05:00
		brelse(block_bitmap_bh);
2007-10-16 18:38:25 -04:00

2013-04-19 13:38:14 -04:00
		if (err) {
			ext4_std_error(sb, err);
			goto out;
		}
2007-10-16 18:38:25 -04:00
	}
2012-02-06 20:12:03 -05:00

	/* Update the relevant bg descriptor fields */
2012-04-29 18:33:10 -04:00
	if (ext4_has_group_desc_csum(sb)) {
2012-02-06 20:12:03 -05:00
		int free;
2020-10-15 13:37:59 -07:00
		struct ext4_group_info *grp = NULL;

		if (!(sbi->s_mount_state & EXT4_FC_REPLAY)) {
			grp = ext4_get_group_info(sb, group);
ext4: allow ext4_get_group_info() to fail
Previously, ext4_get_group_info() would treat an invalid group number
as BUG(), since in theory it should never happen. However, if a
malicious attacker (or fuzzer) modifies the superblock via the block
device while the file system is mounted, it is possible for
s_first_data_block to get set to a very large number. In that case,
calculating the block group of some block number (such as the
starting block of a preallocation region) could result in an
underflow and a very large block group number. Then the BUG_ON check in
ext4_get_group_info() would fire, resulting in a denial of service
attack that can be triggered by root or someone with write access to
the block device.
From a quality of implementation perspective, it's best that even if
the system administrator does something that they shouldn't, it
will not trigger a BUG. So instead of BUG'ing, ext4_get_group_info()
will call ext4_error and return NULL. We also add fallback code in
all of the callers of ext4_get_group_info() in case it returns NULL.
Also, since ext4_get_group_info() was already borderline to be an
inline function, un-inline it. This results in a net reduction of the
compiled text size of ext4 by roughly 2k.
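A minimal userspace sketch of the failure mode and of the "return NULL and
let the caller cope" pattern described above. All names here (struct fs_view,
struct group_info, block_to_group(), get_group_info(), free_blocks_of()) are
hypothetical stand-ins, not the kernel's types or helpers, and -1 stands in
for an error such as -EFSCORRUPTED.

```c
#include <stddef.h>
#include <stdint.h>

struct group_info { uint32_t free_blocks; };   /* hypothetical */

struct fs_view {                               /* hypothetical superblock view */
	uint32_t first_data_block;
	uint32_t blocks_per_group;
	uint32_t ngroups;
	struct group_info *groups;
};

/*
 * Group that holds a block. If first_data_block is larger than the block
 * number, the unsigned subtraction wraps and yields a huge group number.
 */
static uint32_t block_to_group(const struct fs_view *fs, uint32_t block)
{
	return (block - fs->first_data_block) / fs->blocks_per_group;
}

/* Report an out-of-range group by returning NULL rather than aborting. */
static struct group_info *get_group_info(const struct fs_view *fs,
					 uint32_t group)
{
	if (group >= fs->ngroups)
		return NULL;
	return &fs->groups[group];
}

/* Caller with fallback: 0 on success, -1 if the group lookup failed. */
static int free_blocks_of(const struct fs_view *fs, uint32_t block,
			  uint32_t *out)
{
	struct group_info *grp = get_group_info(fs, block_to_group(fs, block));

	if (!grp)
		return -1;
	*out = grp->free_blocks;
	return 0;
}
```

The point of the design is that a corrupted superblock value turns into an
error return in the caller instead of taking the whole machine down.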
Cc: stable@kernel.org
Link: https://lore.kernel.org/r/20230430154311.579720-2-tytso@mit.edu
Reported-by: syzbot+e2efa3efc15a1c9e95c3@syzkaller.appspotmail.com
Link: https://syzkaller.appspot.com/bug?id=69b28112e098b070f639efb356393af3ffec4220
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2023-04-29 00:06:28 -04:00
			if (!grp) {
				err = -EFSCORRUPTED;
				goto out;
			}
2020-10-15 13:37:59 -07:00
			down_read(&grp->alloc_sem); /*
						     * protect vs itable
						     * lazyinit
						     */
		}
2012-02-06 20:12:03 -05:00
		ext4_lock_group(sb, group); /* while we modify the bg desc */
		free = EXT4_INODES_PER_GROUP(sb) -
			ext4_itable_unused_count(sb, gdp);
		if (gdp->bg_flags & cpu_to_le16(EXT4_BG_INODE_UNINIT)) {
			gdp->bg_flags &= cpu_to_le16(~EXT4_BG_INODE_UNINIT);
			free = 0;
		}
		/*
		 * Check the relative inode number against the last used
		 * relative inode number in this group. if it is greater
		 * we need to update the bg_itable_unused count
		 */
		if (ino > free)
			ext4_itable_unused_set(sb, gdp,
					(EXT4_INODES_PER_GROUP(sb) - ino));
2020-10-15 13:37:59 -07:00
		if (!(sbi->s_mount_state & EXT4_FC_REPLAY))
			up_read(&grp->alloc_sem);
ext4: protect group inode free counting with group lock
Currently, when we set the group inode free count, we don't hold the
group lock, so multiple threads may decrease the inode free
count at the same time. e2fsck will then complain with something like:
Free inodes count wrong for group #1 (1, counted=0).
Fix? no
Free inodes count wrong for group #2 (3, counted=0).
Fix? no
Directories count wrong for group #2 (780, counted=779).
Fix? no
Free inodes count wrong for group #3 (2272, counted=2273).
Fix? no
So this patch protects it with ext4_lock_group.
This was found by xfstests test case 269 on a volume created by
mkfs with the parameter
"-O ^resize_inode,^uninit_bg,extent,meta_bg,flex_bg,ext_attr".
With the fix I have run it 100 times and the error in e2fsck doesn't
show up again.
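The race and its fix can be sketched in userspace as follows. This is an
analogy under stated assumptions, not the kernel code: a pthread mutex plays
the role of ext4_lock_group(), and struct group_counts, alloc_worker() and
run_allocators() are hypothetical names. Without the lock, the decrement is
an unprotected read-modify-write and concurrent updates can be lost, which is
the kind of mismatch e2fsck reported above.

```c
#include <pthread.h>
#include <stdint.h>

#define MAX_WORKERS 16

struct group_counts {             /* hypothetical stand-in for a group descriptor */
	pthread_mutex_t lock;     /* plays the role of the per-group lock */
	uint32_t free_inodes;
};

struct worker_arg {
	struct group_counts *g;
	int n;                    /* inodes this worker allocates */
};

static void *alloc_worker(void *p)
{
	struct worker_arg *a = p;

	for (int i = 0; i < a->n; i++) {
		pthread_mutex_lock(&a->g->lock);
		a->g->free_inodes--;   /* read-modify-write done under the lock */
		pthread_mutex_unlock(&a->g->lock);
	}
	return NULL;
}

/* Run up to MAX_WORKERS workers, each taking n inodes; return what is left. */
static uint32_t run_allocators(uint32_t start, int nthreads, int n)
{
	struct group_counts g = { PTHREAD_MUTEX_INITIALIZER, start };
	struct worker_arg arg = { &g, n };
	pthread_t tid[MAX_WORKERS];
	int started = 0;

	if (nthreads > MAX_WORKERS)
		nthreads = MAX_WORKERS;
	for (int i = 0; i < nthreads; i++) {
		if (pthread_create(&tid[started], NULL, alloc_worker, &arg) != 0)
			break;
		started++;
	}
	for (int i = 0; i < started; i++)
		pthread_join(tid[i], NULL);
	return g.free_inodes;
}
```

Because every decrement happens while the lock is held, the final count is
exactly the starting value minus the total number of allocations.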
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-05-28 18:20:59 -04:00
	} else {
		ext4_lock_group(sb, group);
2012-02-06 20:12:03 -05:00
	}
2012-05-28 18:20:59 -04:00

2012-02-06 20:12:03 -05:00
	ext4_free_inodes_set(sb, gdp, ext4_free_inodes_count(sb, gdp) - 1);
	if (S_ISDIR(mode)) {
		ext4_used_dirs_set(sb, gdp, ext4_used_dirs_count(sb, gdp) + 1);
		if (sbi->s_log_groups_per_flex) {
			ext4_group_t f = ext4_flex_group(sbi, group);

2020-02-18 19:08:51 -08:00
			atomic_inc(&sbi_array_rcu_deref(sbi, s_flex_groups,
							f)->used_dirs);
2012-02-06 20:12:03 -05:00
		}
	}
2012-04-29 18:33:10 -04:00
	if (ext4_has_group_desc_csum(sb)) {
2024-08-20 21:22:32 +08:00
		ext4_inode_bitmap_csum_set(sb, gdp, inode_bitmap_bh);
2012-04-29 18:45:10 -04:00
		ext4_group_desc_csum_set(sb, group, gdp);
2012-02-06 20:12:03 -05:00
	}
2012-05-28 18:20:59 -04:00
	ext4_unlock_group(sb, group);
2012-02-06 20:12:03 -05:00

2009-01-03 22:33:39 -05:00
	BUFFER_TRACE(group_desc_bh, "call ext4_handle_dirty_metadata");
	err = ext4_handle_dirty_metadata(handle, NULL, group_desc_bh);
2013-04-19 13:38:14 -04:00
	if (err) {
		ext4_std_error(sb, err);
		goto out;
	}
2006-10-11 01:20:50 -07:00

	percpu_counter_dec(&sbi->s_freeinodes_counter);
	if (S_ISDIR(mode))
		percpu_counter_inc(&sbi->s_dirs_counter);

2008-07-11 19:27:31 -04:00
	if (sbi->s_log_groups_per_flex) {
		flex_group = ext4_flex_group(sbi, group);
2020-02-18 19:08:51 -08:00
		atomic_dec(&sbi_array_rcu_deref(sbi, s_flex_groups,
						flex_group)->free_inodes);
2008-07-11 19:27:31 -04:00
	}
2006-10-11 01:20:50 -07:00

2007-10-16 18:38:25 -04:00
	inode->i_ino = ino + group * EXT4_INODES_PER_GROUP(sb);
2006-10-11 01:20:50 -07:00
	/* This is the optimal IO size (for stat), not the fs block size */
	inode->i_blocks = 0;
2023-10-04 14:52:20 -04:00
	simple_inode_init_ts(inode);
	ei->i_crtime = inode_get_mtime(inode);
2006-10-11 01:20:50 -07:00

	memset(ei->i_data, 0, sizeof(ei->i_data));
	ei->i_dir_start_lookup = 0;
	ei->i_disksize = 0;

2011-10-31 18:21:29 -04:00
	/* Don't inherit extent flag from directory, amongst others. */
2009-02-15 18:09:20 -05:00
	ei->i_flags =
		ext4_mask_flags(mode, EXT4_I(dir)->i_flags & EXT4_FL_INHERITED);
2017-06-21 21:21:39 -04:00
	ei->i_flags |= i_flags;
2006-10-11 01:20:50 -07:00
	ei->i_file_acl = 0;
	ei->i_dtime = 0;
	ei->i_block_group = group;
2009-03-12 12:18:34 -04:00
	ei->i_last_alloc_group = ~0;
2006-10-11 01:20:50 -07:00

2020-05-28 07:59:59 -07:00
	ext4_set_inode_flags(inode, true);
2006-10-11 01:20:50 -07:00
	if (IS_DIRSYNC(inode))
2009-01-07 00:06:22 -05:00
		ext4_handle_sync(handle);
2008-12-30 02:03:31 -05:00
	if (insert_inode_locked(inode) < 0) {
2011-12-18 17:37:02 -05:00
		/*
		 * Likely a bitmap corruption causing inode to be allocated
		 * twice.
		 */
		err = -EIO;
2013-04-19 13:38:14 -04:00
		ext4_error(sb, "failed to insert inode %lu: doubly allocated?",
			   inode->i_ino);
2018-05-12 12:15:21 -04:00
		ext4_mark_group_bitmap_corrupted(sb, group,
					EXT4_GROUP_INFO_IBITMAP_CORRUPT);
2013-04-19 13:38:14 -04:00
		goto out;
2008-12-30 02:03:31 -05:00
	}
2022-10-05 17:43:22 +02:00
	inode->i_generation = get_random_u32();
2006-10-11 01:20:50 -07:00

2012-04-29 18:31:10 -04:00
	/* Precompute checksum seed for inode metadata */
2014-10-13 03:36:16 -04:00
	if (ext4_has_metadata_csum(sb)) {
2012-04-29 18:31:10 -04:00
		__u32 csum;
		__le32 inum = cpu_to_le32(inode->i_ino);
		__le32 gen = cpu_to_le32(inode->i_generation);
		csum = ext4_chksum(sbi, sbi->s_csum_seed, (__u8 *)&inum,
				   sizeof(inum));
		ei->i_csum_seed = ext4_chksum(sbi, csum, (__u8 *)&gen,
					      sizeof(gen));
	}

2011-01-10 12:18:25 -05:00
	ext4_clear_state_flags(ei); /* Only relevant on 32-bit archs */
2010-01-24 14:34:07 -05:00
	ext4_set_inode_state(inode, EXT4_STATE_NEW);
2007-07-18 09:15:20 -04:00

2018-01-11 13:17:49 -05:00
	ei->i_extra_isize = sbi->s_want_extra_isize;
2012-12-10 14:06:03 -05:00
	ei->i_inline_off = 0;
2021-04-12 17:19:00 -04:00
	if (ext4_has_feature_inline_data(sb) &&
	    (!(ei->i_flags & EXT4_DAX_FL) || S_ISDIR(mode)))
2012-12-10 14:06:03 -05:00
		ext4_set_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA);
2006-10-11 01:20:50 -07:00
	ret = inode;
2010-03-03 09:05:01 -05:00
	err = dquot_alloc_inode(inode);
	if (err)
2006-10-11 01:20:50 -07:00
		goto fail_drop;

ext4: inherit encryption xattr before other xattrs
When using both encryption and SELinux (or another feature that requires
an xattr per file) on a filesystem with 256-byte inodes, each file's
xattrs usually spill into an external xattr block. Currently, the
xattrs are inherited in the order ACL, security, then encryption.
Therefore, if spillage occurs, the encryption xattr will always end up
in the external block. This is not ideal because the encryption xattrs
contain a nonce, so they will always be unique and will prevent the
external xattr blocks from being deduplicated.
To improve the situation, change the inheritance order to encryption,
ACL, then security. This gives the encryption xattr a better chance to
be stored in-inode, allowing the other xattr(s) to be deduplicated.
Note that it may be better for userspace to format the filesystem with
512-byte inodes in this case. However, it's not the default.
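Why the order matters can be shown with a toy model. This is a deliberate
simplification, not ext4's real xattr placement logic: it assumes each xattr
is placed in creation order and goes in-inode if it still fits, otherwise to
the external block. The struct xattr layout, place_xattrs() and all sizes
used with it are made up for illustration.

```c
#include <stddef.h>

struct xattr {                 /* hypothetical model of one xattr */
	const char *name;
	size_t size;           /* bytes needed, made-up numbers */
	int unique;            /* 1 if the value differs for every file */
	int in_inode;          /* set by place_xattrs() */
};

/*
 * Place xattrs in creation order: in-inode while space remains, otherwise
 * in the external block. Returns 1 if the external block ends up holding
 * no per-file-unique xattr, i.e. it could be shared between files.
 */
static int place_xattrs(struct xattr *x, size_t n, size_t inode_space)
{
	int shareable = 1;

	for (size_t i = 0; i < n; i++) {
		if (x[i].size <= inode_space) {
			x[i].in_inode = 1;
			inode_space -= x[i].size;
		} else {
			x[i].in_inode = 0;
			if (x[i].unique)
				shareable = 0;
		}
	}
	return shareable;
}
```

With 100 bytes of in-inode space and sizes of 40 (ACL), 50 (security) and
40 (encryption), the old order pushes the unique encryption xattr into the
external block, while the new order keeps it in the inode and spills only
the security xattr, which other files may share.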
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-05-02 00:49:54 -04:00
	/*
	 * Since the encryption xattr will always be unique, create it first so
	 * that it's less likely to end up in an external xattr block and
	 * prevent its deduplication.
	 */
	if (encrypt) {
2020-09-16 21:11:26 -07:00
		err = fscrypt_set_context(inode, handle);
2017-05-02 00:49:54 -04:00
|
|
|
if (err)
|
|
|
|
goto fail_free_drop;
|
|
|
|
}
|
|
|
|
|
2017-06-21 21:21:39 -04:00
|
|
|
if (!(ei->i_flags & EXT4_EA_INODE_FL)) {
|
|
|
|
err = ext4_init_acl(handle, inode, dir);
|
|
|
|
if (err)
|
|
|
|
goto fail_free_drop;
|
2006-10-11 01:20:50 -07:00
|
|
|
|
2017-07-06 00:00:59 -04:00
|
|
|
err = ext4_init_security(handle, inode, dir, qstr);
|
|
|
|
if (err)
|
|
|
|
goto fail_free_drop;
|
|
|
|
}
|
2006-10-11 01:20:50 -07:00
|
|
|
|
2015-10-17 16:18:43 -04:00
|
|
|
if (ext4_has_feature_extents(sb)) {
|
2008-07-11 19:27:31 -04:00
|
|
|
/* set extent flag only for directory, file and normal symlink*/
|
2008-04-29 08:11:12 -04:00
|
|
|
if (S_ISDIR(mode) || S_ISREG(mode) || S_ISLNK(mode)) {
|
2010-05-16 22:00:00 -04:00
|
|
|
ext4_set_inode_flag(inode, EXT4_INODE_EXTENTS);
|
2008-02-25 16:38:03 -05:00
|
|
|
ext4_ext_tree_init(handle, inode);
|
|
|
|
}
|
2006-10-11 01:21:03 -07:00
|
|
|
}
|
2006-10-11 01:20:50 -07:00
|
|
|
|
2024-05-27 17:14:47 +01:00
|
|
|
ext4_update_inode_fsync_trans(handle, inode, 1);
|
2011-03-16 17:16:31 -04:00
|
|
|
|
2008-04-29 22:00:36 -04:00
|
|
|
err = ext4_mark_inode_dirty(handle, inode);
|
|
|
|
if (err) {
|
|
|
|
ext4_std_error(sb, err);
|
|
|
|
goto fail_free_drop;
|
|
|
|
}
|
|
|
|
|
2006-10-11 01:20:53 -07:00
|
|
|
ext4_debug("allocating inode %lu\n", inode->i_ino);
|
2009-06-17 11:48:11 -04:00
|
|
|
trace_ext4_allocate_inode(inode, dir, mode);
|
2009-01-03 22:33:39 -05:00
|
|
|
brelse(inode_bitmap_bh);
|
2006-10-11 01:20:50 -07:00
|
|
|
return ret;
|
|
|
|
|
|
|
|
fail_free_drop:
|
2010-03-03 09:05:01 -05:00
|
|
|
dquot_free_inode(inode);
|
2006-10-11 01:20:50 -07:00
|
|
|
fail_drop:
|
2011-10-28 14:13:28 +02:00
|
|
|
clear_nlink(inode);
|
2008-12-30 02:03:31 -05:00
|
|
|
unlock_new_inode(inode);
|
2013-04-19 13:38:14 -04:00
|
|
|
out:
|
|
|
|
dquot_drop(inode);
|
|
|
|
inode->i_flags |= S_NOQUOTA;
|
2006-10-11 01:20:50 -07:00
|
|
|
iput(inode);
|
2009-01-03 22:33:39 -05:00
|
|
|
brelse(inode_bitmap_bh);
|
2006-10-11 01:20:50 -07:00
|
|
|
return ERR_PTR(err);
|
|
|
|
}
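The ordering argument in the "ext4: inherit encryption xattr before other xattrs" commit message can be illustrated with a deliberately simplified model. This is a sketch, not the ext4 xattr code: `struct xattr_sketch`, `place_xattrs()`, `stored_in_inode()`, and all byte sizes are hypothetical, and the model assumes xattrs are placed first come, first served into a fixed amount of in-inode space.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical xattr descriptor: name, size in bytes, and where it landed. */
struct xattr_sketch {
	const char *name;
	size_t size;
	int in_inode;	/* 1 if stored in the inode body, 0 if spilled */
};

/*
 * Place xattrs in array order into `space` bytes of in-inode room.
 * Anything that does not fit is marked as spilled to the external block.
 * Returns the number of xattrs that spilled.
 */
static int place_xattrs(struct xattr_sketch *xs, size_t n, size_t space)
{
	int spilled = 0;

	assert(xs != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (xs[i].size <= space) {
			xs[i].in_inode = 1;
			space -= xs[i].size;
		} else {
			xs[i].in_inode = 0;
			spilled++;
		}
	}
	return spilled;
}

/* Returns 1 if the named xattr ended up in the inode body, else 0. */
static int stored_in_inode(const struct xattr_sketch *xs, size_t n,
			   const char *name)
{
	for (size_t i = 0; i < n; i++)
		if (strcmp(xs[i].name, name) == 0)
			return xs[i].in_inode;
	return 0;
}
```

With made-up sizes of 40, 50, and 40 bytes and 96 bytes of room, the order ACL, security, encryption spills the encryption xattr, while the order encryption, ACL, security keeps it in the inode and spills a shareable xattr instead.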
|
|
|
|
|
|
|
|
/* Verify that we are loading a valid orphan from disk */
|
2006-10-11 01:20:53 -07:00
|
|
|
struct inode *ext4_orphan_get(struct super_block *sb, unsigned long ino)
|
2006-10-11 01:20:50 -07:00
|
|
|
{
|
2006-10-11 01:20:53 -07:00
|
|
|
unsigned long max_ino = le32_to_cpu(EXT4_SB(sb)->s_es->s_inodes_count);
|
2008-01-28 23:58:27 -05:00
|
|
|
ext4_group_t block_group;
|
2006-10-11 01:20:50 -07:00
|
|
|
int bit;
|
2016-04-30 00:49:54 -04:00
|
|
|
struct buffer_head *bitmap_bh = NULL;
|
2006-10-11 01:20:50 -07:00
|
|
|
struct inode *inode = NULL;
|
2016-04-30 00:49:54 -04:00
|
|
|
int err = -EFSCORRUPTED;
|
2006-10-11 01:20:50 -07:00
|
|
|
|
2016-04-30 00:49:54 -04:00
|
|
|
if (ino < EXT4_FIRST_INO(sb) || ino > max_ino)
|
|
|
|
goto bad_orphan;
|
2006-10-11 01:20:50 -07:00
|
|
|
|
2006-10-11 01:20:53 -07:00
|
|
|
block_group = (ino - 1) / EXT4_INODES_PER_GROUP(sb);
|
|
|
|
bit = (ino - 1) % EXT4_INODES_PER_GROUP(sb);
|
2008-08-02 21:21:02 -04:00
|
|
|
bitmap_bh = ext4_read_inode_bitmap(sb, block_group);
|
2018-05-12 12:15:21 -04:00
|
|
|
if (IS_ERR(bitmap_bh))
|
2018-10-10 16:41:40 -04:00
|
|
|
return ERR_CAST(bitmap_bh);
|
2006-10-11 01:20:50 -07:00
|
|
|
|
|
|
|
/* Having the inode bit set should be a 100% indicator that this
|
|
|
|
* is a valid orphan (no e2fsck run on fs). Orphans also include
|
|
|
|
* inodes that were being truncated, so we can't check i_nlink==0.
|
|
|
|
*/
|
2008-02-07 00:15:37 -08:00
|
|
|
if (!ext4_test_bit(bit, bitmap_bh->b_data))
|
|
|
|
goto bad_orphan;
|
|
|
|
|
ext4: avoid declaring fs inconsistent due to invalid file handles
If we receive a file handle, either from NFS or open_by_handle_at(2),
and it points at an inode which has not been initialized, and the file
system has metadata checksums enabled, we shouldn't try to get the
inode, discover the checksum is invalid, and then declare the file
system as being inconsistent.
This can be reproduced by creating a test file system via "mke2fs -t
ext4 -O metadata_csum /tmp/foo.img 8M", mounting it, cd'ing into that
directory, and then running the following program.
	#define _GNU_SOURCE
	#include <fcntl.h>

	struct handle {
		struct file_handle fh;
		unsigned char fid[MAX_HANDLE_SZ];
	};

	int main(int argc, char **argv)
	{
		struct handle h = {{8, 1 }, { 12, }};

		open_by_handle_at(AT_FDCWD, &h.fh, O_RDONLY);
		return 0;
	}
Google-Bug-Id: 120690101
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
2018-12-19 12:29:13 -05:00
|
|
|
inode = ext4_iget(sb, ino, EXT4_IGET_NORMAL);
|
2016-04-30 00:49:54 -04:00
|
|
|
if (IS_ERR(inode)) {
|
|
|
|
err = PTR_ERR(inode);
|
2020-03-28 19:33:43 -04:00
|
|
|
ext4_error_err(sb, -err,
|
|
|
|
"couldn't read orphan inode %lu (err %d)",
|
|
|
|
ino, err);
|
2020-04-23 13:09:27 +08:00
|
|
|
brelse(bitmap_bh);
|
2016-04-30 00:49:54 -04:00
|
|
|
return inode;
|
|
|
|
}
|
2008-02-07 00:15:37 -08:00
|
|
|
|
2008-07-11 19:27:31 -04:00
|
|
|
/*
|
2016-04-30 00:48:54 -04:00
|
|
|
* If the orphan has i_nlink > 0 then it should be able to
|
|
|
|
* be truncated, otherwise it won't be removed from the orphan
|
|
|
|
* list during processing and an infinite loop will result.
|
|
|
|
* Similarly, it must not be a bad inode.
|
2008-07-11 19:27:31 -04:00
|
|
|
*/
|
2016-04-30 00:48:54 -04:00
|
|
|
if ((inode->i_nlink && !ext4_can_truncate(inode)) ||
|
|
|
|
is_bad_inode(inode))
|
2008-07-11 19:27:31 -04:00
|
|
|
goto bad_orphan;
|
|
|
|
|
2008-02-07 00:15:37 -08:00
|
|
|
if (NEXT_ORPHAN(inode) > max_ino)
|
|
|
|
goto bad_orphan;
|
|
|
|
brelse(bitmap_bh);
|
|
|
|
return inode;
|
|
|
|
|
|
|
|
bad_orphan:
|
2016-04-30 00:49:54 -04:00
|
|
|
ext4_error(sb, "bad orphan inode %lu", ino);
|
|
|
|
if (bitmap_bh)
|
|
|
|
printk(KERN_ERR "ext4_test_bit(bit=%d, block=%llu) = %d\n",
|
|
|
|
bit, (unsigned long long)bitmap_bh->b_blocknr,
|
|
|
|
ext4_test_bit(bit, bitmap_bh->b_data));
|
2008-02-07 00:15:37 -08:00
|
|
|
if (inode) {
|
2016-04-30 00:49:54 -04:00
|
|
|
printk(KERN_ERR "is_bad_inode(inode)=%d\n",
|
2008-02-07 00:15:37 -08:00
|
|
|
is_bad_inode(inode));
|
2016-04-30 00:49:54 -04:00
|
|
|
printk(KERN_ERR "NEXT_ORPHAN(inode)=%u\n",
|
2008-02-07 00:15:37 -08:00
|
|
|
NEXT_ORPHAN(inode));
|
2016-04-30 00:49:54 -04:00
|
|
|
printk(KERN_ERR "max_ino=%lu\n", max_ino);
|
|
|
|
printk(KERN_ERR "i_nlink=%u\n", inode->i_nlink);
|
2006-10-11 01:20:50 -07:00
|
|
|
/* Avoid freeing blocks if we got a bad deleted inode */
|
2008-02-07 00:15:37 -08:00
|
|
|
if (inode->i_nlink == 0)
|
2006-10-11 01:20:50 -07:00
|
|
|
inode->i_blocks = 0;
|
|
|
|
iput(inode);
|
|
|
|
}
|
|
|
|
brelse(bitmap_bh);
|
2008-02-07 00:15:37 -08:00
|
|
|
return ERR_PTR(err);
|
2006-10-11 01:20:50 -07:00
|
|
|
}
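The range check and the group/bit arithmetic at the top of ext4_orphan_get() can be tried outside the kernel. The following is a minimal userspace sketch: `struct ino_location` and `locate_inode()` are hypothetical names, and the parameters stand in for EXT4_FIRST_INO(sb), s_inodes_count, and EXT4_INODES_PER_GROUP(sb).

```c
#include <assert.h>

/* Hypothetical result type: block group index and bit within its bitmap. */
struct ino_location {
	unsigned long group;
	unsigned long bit;
};

/*
 * Map a 1-based inode number to (group, bit), mirroring the
 * (ino - 1) / per_group and (ino - 1) % per_group arithmetic.
 * Returns 0 on success, -1 if ino is outside [first_ino, max_ino].
 */
static int locate_inode(unsigned long ino, unsigned long first_ino,
			unsigned long max_ino,
			unsigned long inodes_per_group,
			struct ino_location *loc)
{
	assert(inodes_per_group > 0);
	if (ino < first_ino || ino > max_ino)
		return -1;
	loc->group = (ino - 1) / inodes_per_group;
	loc->bit = (ino - 1) % inodes_per_group;
	return 0;
}
```

Because inode numbers start at 1, the last inode of group 0 maps to the highest bit of that group's bitmap and the next inode maps to bit 0 of group 1.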
|
|
|
|
|
2008-09-08 22:25:24 -04:00
|
|
|
unsigned long ext4_count_free_inodes(struct super_block *sb)
|
2006-10-11 01:20:50 -07:00
|
|
|
{
|
|
|
|
unsigned long desc_count;
|
2006-10-11 01:20:53 -07:00
|
|
|
struct ext4_group_desc *gdp;
|
2009-05-01 08:50:38 -04:00
|
|
|
ext4_group_t i, ngroups = ext4_get_groups_count(sb);
|
2006-10-11 01:20:53 -07:00
|
|
|
#ifdef EXT4FS_DEBUG
|
|
|
|
struct ext4_super_block *es;
|
2006-10-11 01:20:50 -07:00
|
|
|
unsigned long bitmap_count, x;
|
|
|
|
struct buffer_head *bitmap_bh = NULL;
|
|
|
|
|
2006-10-11 01:20:53 -07:00
|
|
|
es = EXT4_SB(sb)->s_es;
|
2006-10-11 01:20:50 -07:00
|
|
|
desc_count = 0;
|
|
|
|
bitmap_count = 0;
|
|
|
|
gdp = NULL;
|
2009-05-01 08:50:38 -04:00
|
|
|
for (i = 0; i < ngroups; i++) {
|
2008-09-08 22:25:24 -04:00
|
|
|
gdp = ext4_get_group_desc(sb, i, NULL);
|
2006-10-11 01:20:50 -07:00
|
|
|
if (!gdp)
|
|
|
|
continue;
|
2009-01-05 22:20:24 -05:00
|
|
|
desc_count += ext4_free_inodes_count(sb, gdp);
|
2006-10-11 01:20:50 -07:00
|
|
|
brelse(bitmap_bh);
|
2008-08-02 21:21:02 -04:00
|
|
|
bitmap_bh = ext4_read_inode_bitmap(sb, i);
|
2015-10-17 21:33:24 -04:00
|
|
|
if (IS_ERR(bitmap_bh)) {
|
|
|
|
bitmap_bh = NULL;
|
2006-10-11 01:20:50 -07:00
|
|
|
continue;
|
2015-10-17 21:33:24 -04:00
|
|
|
}
|
2006-10-11 01:20:50 -07:00
|
|
|
|
2012-06-30 19:14:57 -04:00
|
|
|
x = ext4_count_free(bitmap_bh->b_data,
|
|
|
|
EXT4_INODES_PER_GROUP(sb) / 8);
|
2008-01-28 23:58:27 -05:00
|
|
|
printk(KERN_DEBUG "group %lu: stored = %d, counted = %lu\n",
|
2009-07-27 21:44:40 -04:00
|
|
|
(unsigned long) i, ext4_free_inodes_count(sb, gdp), x);
|
2006-10-11 01:20:50 -07:00
|
|
|
bitmap_count += x;
|
|
|
|
}
|
|
|
|
brelse(bitmap_bh);
|
2008-09-08 23:00:52 -04:00
|
|
|
printk(KERN_DEBUG "ext4_count_free_inodes: "
|
|
|
|
"stored = %u, computed = %lu, %lu\n",
|
|
|
|
le32_to_cpu(es->s_free_inodes_count), desc_count, bitmap_count);
|
2006-10-11 01:20:50 -07:00
|
|
|
return desc_count;
|
|
|
|
#else
|
|
|
|
desc_count = 0;
|
2009-05-01 08:50:38 -04:00
|
|
|
for (i = 0; i < ngroups; i++) {
|
2008-09-08 22:25:24 -04:00
|
|
|
gdp = ext4_get_group_desc(sb, i, NULL);
|
2006-10-11 01:20:50 -07:00
|
|
|
if (!gdp)
|
|
|
|
continue;
|
2009-01-05 22:20:24 -05:00
|
|
|
desc_count += ext4_free_inodes_count(sb, gdp);
|
2006-10-11 01:20:50 -07:00
|
|
|
cond_resched();
|
|
|
|
}
|
|
|
|
return desc_count;
|
|
|
|
#endif
|
|
|
|
}
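In the EXT4FS_DEBUG branch above, the per-group free count stored in the descriptor is compared with a count derived from the inode bitmap. A minimal sketch of that bitmap count follows, assuming a free inode is a clear bit and that only the first `nbytes` bytes are examined. `count_free_bits()` is a hypothetical stand-in, not the kernel's ext4_count_free().

```c
#include <assert.h>
#include <stddef.h>

/*
 * Count the zero bits in the first nbytes bytes of a bitmap.
 * A zero bit is treated as a free inode in this sketch.
 */
static unsigned long count_free_bits(const unsigned char *bitmap,
				     unsigned long nbytes)
{
	unsigned long set = 0;

	assert(bitmap != NULL || nbytes == 0);
	for (unsigned long i = 0; i < nbytes; i++) {
		unsigned char b = bitmap[i];

		/* Tally the set bits of this byte one at a time. */
		while (b) {
			set += b & 1u;
			b >>= 1;
		}
	}
	return nbytes * 8 - set;
}
```

Passing EXT4_INODES_PER_GROUP(sb) / 8 as the byte count, as the debug code does, limits the scan to the bits that correspond to real inodes in the group.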
|
|
|
|
|
|
|
|
/* Called at mount-time, super-block is locked */
|
2008-09-08 22:25:24 -04:00
|
|
|
unsigned long ext4_count_dirs(struct super_block * sb)
|
2006-10-11 01:20:50 -07:00
|
|
|
{
|
|
|
|
unsigned long count = 0;
|
2009-05-01 08:50:38 -04:00
|
|
|
ext4_group_t i, ngroups = ext4_get_groups_count(sb);
|
2006-10-11 01:20:50 -07:00
|
|
|
|
2009-05-01 08:50:38 -04:00
|
|
|
for (i = 0; i < ngroups; i++) {
|
2008-09-08 22:25:24 -04:00
|
|
|
struct ext4_group_desc *gdp = ext4_get_group_desc(sb, i, NULL);
|
2006-10-11 01:20:50 -07:00
|
|
|
if (!gdp)
|
|
|
|
continue;
|
2009-01-05 22:20:24 -05:00
|
|
|
count += ext4_used_dirs_count(sb, gdp);
|
2006-10-11 01:20:50 -07:00
|
|
|
}
|
|
|
|
return count;
|
|
|
|
}
|
ext4: add support for lazy inode table initialization
When the lazy_itable_init extended option is passed to mke2fs, it
considerably speeds up filesystem creation because inode tables are
not zeroed out. The fact that parts of the inode table are
uninitialized is not a problem so long as the block group descriptors,
which contain information regarding how much of the inode table has
been initialized, have not been corrupted. However, if the block group
checksums are not valid, e2fsck must scan the entire inode table, and
the old, uninitialized data could potentially cause e2fsck to
report false problems.
Hence, it is important for the inode tables to be initialized as soon
as possible. This commit adds this feature so that mke2fs can safely
use the lazy inode table initialization feature to speed up formatting
file systems.
This is done via a new kernel thread called ext4lazyinit, which is
created on demand and destroyed when it is no longer needed. There
is only one thread for all ext4 filesystems in the system. When the
first filesystem with the init_itable mount option is mounted, the
ext4lazyinit thread is created, then the filesystem can register its
request in the request list.
This thread then walks through the list of requests picking up
scheduled requests and invoking ext4_init_inode_table(). The next schedule
time for the request is computed by multiplying the time it took to
zero out the last inode table by the wait multiplier, which can be set with
the (init_itable=n) mount option (default is 10). We are doing
this so we do not take the whole I/O bandwidth. When the thread is no
longer necessary (request list is empty) it frees the appropriate
structures and exits (and can be created later by another
filesystem).
We do not disturb regular inode allocations in any way; the allocator does
not care whether the inode table is or is not zeroed. But when zeroing, we
have to skip used inodes, obviously. Also we should prevent new inode
allocations from the group while zeroing is under way. For that we
take write alloc_sem lock in ext4_init_inode_table() and read alloc_sem
in the ext4_claim_inode, so when we are unlucky and allocator hits the
group which is currently being zeroed, it just has to wait.
This can be suppressed using the mount option no_init_itable.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:05 -04:00
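The throttling rule described in the commit message above (wait for the last zeroing time multiplied by the wait multiplier) can be sketched as plain arithmetic. The names `DEFAULT_WAIT_MULT` and `next_schedule()` are hypothetical, the time unit is arbitrary, and treating the result as an offset from the current time is an assumption of this sketch rather than something the message states.

```c
#include <assert.h>

/* Default multiplier quoted in the commit message (init_itable=n). */
#define DEFAULT_WAIT_MULT 10u

/*
 * Compute when the next zeroing request may run: the longer the last
 * inode table took to zero, the longer the thread backs off.
 */
static unsigned long next_schedule(unsigned long now,
				   unsigned long zeroing_time,
				   unsigned int wait_mult)
{
	assert(wait_mult > 0);
	return now + zeroing_time * wait_mult;
}
```

Under this model a multiplier of 10 means the thread is busy for roughly one part in eleven of wall-clock time, which is how the scheme avoids taking the whole I/O bandwidth.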
|
|
|
|
|
|
|
/*
|
|
|
|
* Zeroes not yet zeroed inode table - just write zeroes through the whole
|
|
|
|
* inode table. Must be called without any spinlock held. The only place
|
|
|
|
* where it is called from on active part of filesystem is ext4lazyinit
|
|
|
|
* thread, so we do not need any special locks, however we have to prevent
|
|
|
|
* inode allocation from the current group, so we take alloc_sem lock, to
|
2012-02-06 20:12:03 -05:00
|
|
|
* block ext4_new_inode() until we are finished.
|
2010-10-27 21:30:05 -04:00
|
|
|
*/
|
2011-10-18 10:57:51 -04:00
|
|
|
int ext4_init_inode_table(struct super_block *sb, ext4_group_t group,
|
2010-10-27 21:30:05 -04:00
|
|
|
int barrier)
|
|
|
|
{
|
|
|
|
struct ext4_group_info *grp = ext4_get_group_info(sb, group);
|
|
|
|
struct ext4_sb_info *sbi = EXT4_SB(sb);
|
|
|
|
struct ext4_group_desc *gdp = NULL;
|
|
|
|
struct buffer_head *group_desc_bh;
|
|
|
|
handle_t *handle;
|
|
|
|
ext4_fsblk_t blk;
|
|
|
|
int num, ret = 0, used_blks = 0;
|
2021-03-31 20:15:16 +08:00
|
|
|
unsigned long used_inos = 0;
|
2010-10-27 21:30:05 -04:00
|
|
|
|
|
|
|
gdp = ext4_get_group_desc(sb, group, &group_desc_bh);
|
ext4: allow ext4_get_group_info() to fail
Previously, ext4_get_group_info() would treat an invalid group number
as BUG(), since in theory it should never happen. However, if a
malicious attacker (or fuzzer) modifies the superblock via the block
device while the file system is mounted, it is possible for
s_first_data_block to get set to a very large number. In that case,
calculating the block group of some block number (such as the
starting block of a preallocation region) could result in an
underflow and a very large block group number. Then the BUG_ON check in
ext4_get_group_info() would fire, resulting in a denial of service
attack that can be triggered by root or someone with write access to
the block device.
From a quality of implementation perspective, it's best that even if
the system administrator does something that they shouldn't, it
will not trigger a BUG. So instead of BUG'ing, ext4_get_group_info()
will call ext4_error and return NULL. We also add fallback code in
all of the callers of ext4_get_group_info() for the case where it
returns NULL.
Also, since ext4_get_group_info() was already borderline to be an
inline function, un-inline it. This results in a net reduction of the
compiled text size of ext4 by roughly 2k.
Cc: stable@kernel.org
Link: https://lore.kernel.org/r/20230430154311.579720-2-tytso@mit.edu
Reported-by: syzbot+e2efa3efc15a1c9e95c3@syzkaller.appspotmail.com
Link: https://syzkaller.appspot.com/bug?id=69b28112e098b070f639efb356393af3ffec4220
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2023-04-29 00:06:28 -04:00
|
|
|
if (!gdp || !grp)
|
2010-10-27 21:30:05 -04:00
|
|
|
goto out;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We do not need to lock this, because we are the only one
|
|
|
|
* handling this flag.
|
|
|
|
*/
|
|
|
|
if (gdp->bg_flags & cpu_to_le16(EXT4_BG_INODE_ZEROED))
|
|
|
|
goto out;
|
|
|
|
|
2013-02-08 21:59:22 -05:00
|
|
|
handle = ext4_journal_start_sb(sb, EXT4_HT_MISC, 1);
|
2010-10-27 21:30:05 -04:00
|
|
|
if (IS_ERR(handle)) {
|
|
|
|
ret = PTR_ERR(handle);
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
down_write(&grp->alloc_sem);
|
|
|
|
/*
|
|
|
|
* If inode bitmap was already initialized there may be some
|
|
|
|
* used inodes so we need to skip blocks with used inodes in
|
|
|
|
* inode table.
|
|
|
|
*/
|
2021-03-31 20:15:16 +08:00
|
|
|
if (!(gdp->bg_flags & cpu_to_le16(EXT4_BG_INODE_UNINIT))) {
|
|
|
|
used_inos = EXT4_INODES_PER_GROUP(sb) -
|
|
|
|
ext4_itable_unused_count(sb, gdp);
|
|
|
|
used_blks = DIV_ROUND_UP(used_inos, sbi->s_inodes_per_block);
|
|
|
|
|
|
|
|
/* Bogus inode unused count? */
|
|
|
|
if (used_blks < 0 || used_blks > sbi->s_itb_per_group) {
|
|
|
|
ext4_error(sb, "Something is wrong with group %u: "
|
|
|
|
"used itable blocks: %d; "
|
|
|
|
"itable unused count: %u",
|
|
|
|
group, used_blks,
|
|
|
|
ext4_itable_unused_count(sb, gdp));
|
|
|
|
ret = 1;
|
|
|
|
goto err_out;
|
|
|
|
}
|
|
|
|
|
|
|
|
used_inos += group * EXT4_INODES_PER_GROUP(sb);
|
|
|
|
/*
|
|
|
|
* Are there some uninitialized inodes in the inode table
|
|
|
|
* before the first normal inode?
|
|
|
|
*/
|
|
|
|
if ((used_blks != sbi->s_itb_per_group) &&
|
|
|
|
(used_inos < EXT4_FIRST_INO(sb))) {
|
|
|
|
ext4_error(sb, "Something is wrong with group %u: "
|
|
|
|
"itable unused count: %u; "
|
|
|
|
"itables initialized count: %ld",
|
|
|
|
group, ext4_itable_unused_count(sb, gdp),
|
|
|
|
used_inos);
|
|
|
|
ret = 1;
|
|
|
|
goto err_out;
|
|
|
|
}
|
2010-10-27 21:30:05 -04:00
|
|
|
}
|
|
|
|
|
ext4: add support for lazy inode table initialization
When the lazy_itable_init extended option is passed to mke2fs, it
considerably speeds up filesystem creation because inode tables are
not zeroed out. The fact that parts of the inode table are
uninitialized is not a problem so long as the block group descriptors,
which contain information regarding how much of the inode table has
been initialized, has not been corrupted However, if the block group
checksums are not valid, e2fsck must scan the entire inode table, and
the the old, uninitialized data could potentially cause e2fsck to
report false problems.
Hence, it is important for the inode tables to be initialized as soon
as possble. This commit adds this feature so that mke2fs can safely
use the lazy inode table initialization feature to speed up formatting
file systems.
This is done via a new new kernel thread called ext4lazyinit, which is
created on demand and destroyed, when it is no longer needed. There
is only one thread for all ext4 filesystems in the system. When the
first filesystem with inititable mount option is mounted, ext4lazyinit
thread is created, then the filesystem can register its request in the
request list.
This thread then walks through the list of requests picking up
scheduled requests and invoking ext4_init_inode_table(). Next schedule
time for the request is computed by multiplying the time it took to
zero out last inode table with wait multiplier, which can be set with
the (init_itable=n) mount option (default is 10). We are doing
this so we do not take the whole I/O bandwidth. When the thread is no
longer necessary (request list is empty) it frees the appropriate
structures and exits (and can be created later later by another
filesystem).
We do not disturb regular inode allocations in any way; the allocator
simply does not care whether the inode table is zeroed or not. But when
zeroing, we have to skip used inodes, obviously. We should also prevent
new inode allocations from the group while zeroing is under way. For
that we take alloc_sem for writing in ext4_init_inode_table() and for
reading in ext4_claim_inode(), so when we are unlucky and the allocator
hits the group which is currently being zeroed, it just has to wait.
Lazy initialization can be suppressed using the mount option no_init_itable.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:05 -04:00
	blk = ext4_inode_table(sb, gdp) + used_blks;
	num = sbi->s_itb_per_group - used_blks;

	BUFFER_TRACE(group_desc_bh, "get_write_access");
2021-08-16 11:57:04 +02:00
	ret = ext4_journal_get_write_access(handle, sb, group_desc_bh,
					    EXT4_JTR_NONE);
2010-10-27 21:30:05 -04:00
	if (ret)
		goto err_out;

	/*
	 * Skip zeroout if the inode table is full. But we set the ZEROED
	 * flag anyway, because obviously, when it is full it does not need
	 * further zeroing.
	 */
	if (unlikely(num == 0))
		goto skip_zeroout;

	ext4_debug("going to zero out inode table in group %d\n",
		   group);
2010-10-27 23:44:47 -04:00
	ret = sb_issue_zeroout(sb, blk, num, GFP_NOFS);
2010-10-27 21:30:05 -04:00
	if (ret < 0)
		goto err_out;
2010-10-27 23:44:47 -04:00
	if (barrier)
2021-01-26 15:52:35 +01:00
		blkdev_issue_flush(sb->s_bdev);
2010-10-27 21:30:05 -04:00

skip_zeroout:
	ext4_lock_group(sb, group);
	gdp->bg_flags |= cpu_to_le16(EXT4_BG_INODE_ZEROED);
2012-04-29 18:45:10 -04:00
	ext4_group_desc_csum_set(sb, group, gdp);
2010-10-27 21:30:05 -04:00
	ext4_unlock_group(sb, group);

	BUFFER_TRACE(group_desc_bh,
		     "call ext4_handle_dirty_metadata");
	ret = ext4_handle_dirty_metadata(handle, NULL,
					 group_desc_bh);

err_out:
	up_write(&grp->alloc_sem);
	ext4_journal_stop(handle);
out:
	return ret;
}
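The range computation and the "full table" shortcut in the function
above can be illustrated with a small standalone model. This is a
sketch, not the kernel code: struct zero_plan, plan_zeroout() and
mark_zeroed() are hypothetical names, the flag value 0x0004 is assumed
to match EXT4_BG_INODE_ZEROED, flags are kept in host byte order
(the cpu_to_le16() conversion is left out), and the caller is assumed
to guarantee used_blks <= itb_per_group.

```c
#include <stdint.h>

/* Assumed to match the kernel's EXT4_BG_INODE_ZEROED flag value. */
#define BG_INODE_ZEROED 0x0004

/* Hypothetical description of the block range still to be zeroed. */
struct zero_plan {
	uint64_t first_blk;	/* first inode table block to zero */
	uint32_t count;		/* number of blocks; 0 means skip */
};

/*
 * Mirror of the blk/num computation: blocks already holding used
 * inodes are skipped, the rest of the group's inode table is zeroed.
 */
struct zero_plan plan_zeroout(uint64_t itable_start,
			      uint32_t itb_per_group,
			      uint32_t used_blks)
{
	struct zero_plan p;

	p.first_blk = itable_start + used_blks;
	p.count = itb_per_group - used_blks;
	return p;
}

/* The flag is set whether or not any blocks had to be zeroed. */
uint16_t mark_zeroed(uint16_t bg_flags)
{
	return bg_flags | BG_INODE_ZEROED;
}
```

A plan with a count of 0 corresponds to the skip_zeroout path: no I/O
is issued, but the group is still marked as zeroed so it is not
visited again.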