[ Upstream commit 128630e1db ]
Update this log message since cached fids may represent things other
than the root of a mount.
Fixes: e4029e0726 ("cifs: find and use the dentry for cached non-root directories also")
Signed-off-by: Paul Aurich <paul@darkrain42.org>
Reviewed-by: Bharath SM <bharathsm@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit adc77b19f6 ]
Syzbot has reported the following KMSAN splat:
BUG: KMSAN: uninit-value in ocfs2_file_read_iter+0x9a4/0xf80
ocfs2_file_read_iter+0x9a4/0xf80
__io_read+0x8d4/0x20f0
io_read+0x3e/0xf0
io_issue_sqe+0x42b/0x22c0
io_wq_submit_work+0xaf9/0xdc0
io_worker_handle_work+0xd13/0x2110
io_wq_worker+0x447/0x1410
ret_from_fork+0x6f/0x90
ret_from_fork_asm+0x1a/0x30
Uninit was created at:
__alloc_pages_noprof+0x9a7/0xe00
alloc_pages_mpol_noprof+0x299/0x990
alloc_pages_noprof+0x1bf/0x1e0
allocate_slab+0x33a/0x1250
___slab_alloc+0x12ef/0x35e0
kmem_cache_alloc_bulk_noprof+0x486/0x1330
__io_alloc_req_refill+0x84/0x560
io_submit_sqes+0x172f/0x2f30
__se_sys_io_uring_enter+0x406/0x41c0
__x64_sys_io_uring_enter+0x11f/0x1a0
x64_sys_call+0x2b54/0x3ba0
do_syscall_64+0xcd/0x1e0
entry_SYSCALL_64_after_hwframe+0x77/0x7f
Since an instance of 'struct kiocb' may be passed from the block layer
with 'private' field uninitialized, introduce 'ocfs2_iocb_init_rw_locked()'
and use it from where 'ocfs2_dio_end_io()' might take care, i.e. in
'ocfs2_file_read_iter()' and 'ocfs2_file_write_iter()'.
Link: https://lkml.kernel.org/r/20241029091736.1501946-1-dmantipov@yandex.ru
Fixes: 7cdfc3a1c3 ("ocfs2: Remember rw lock level during direct io")
Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru>
Reported-by: syzbot+a73e253cca4f0230a5a5@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=a73e253cca4f0230a5a5
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Jun Piao <piaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 82e33f249f ]
Coccinelle complains about the nested reuse of the pointer `iter' with
different pointer type:
./fs/proc/kcore.c:515:26-30: ERROR: invalid reference to the index variable of the iterator on line 499
./fs/proc/kcore.c:534:23-27: ERROR: invalid reference to the index variable of the iterator on line 499
./fs/proc/kcore.c:550:40-44: ERROR: invalid reference to the index variable of the iterator on line 499
./fs/proc/kcore.c:568:27-31: ERROR: invalid reference to the index variable of the iterator on line 499
./fs/proc/kcore.c:581:28-32: ERROR: invalid reference to the index variable of the iterator on line 499
./fs/proc/kcore.c:599:27-31: ERROR: invalid reference to the index variable of the iterator on line 499
./fs/proc/kcore.c:607:38-42: ERROR: invalid reference to the index variable of the iterator on line 499
./fs/proc/kcore.c:614:26-30: ERROR: invalid reference to the index variable of the iterator on line 499
Replacing `struct kcore_list *iter' with `struct kcore_list *tmp' doesn't change the
scope and the functionality is the same and coccinelle seems happy.
NOTE: There was an issue with using `struct kcore_list *pos' as the nested iterator.
The build did not work!
[akpm@linux-foundation.org: s/tmp/pos/]
Link: https://lkml.kernel.org/r/20241029054651.86356-2-mtodorovac69@gmail.com
Link: https://lore.kernel.org/all/CAHk-=wgRr_D8CB-D9Kg-c=EHreAsk5SqXPwr9Y7k9sA6cWXJ6w@mail.gmail.com/ [1]
Link: https://lkml.kernel.org/r/20220331223700.902556-1-jakobkoschel@gmail.com
Fixes: 04d168c6d4 ("fs/proc/kcore.c: remove check of list iterator against head past the loop body")
Signed-off-by: Jakob Koschel <jakobkoschel@gmail.com>
Signed-off-by: Mirsad Todorovac <mtodorovac69@gmail.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: "Brian Johannesmeyer" <bjohannesmeyer@gmail.com>
Cc: Cristiano Giuffrida <c.giuffrida@vu.nl>
Cc: "Bos, H.J." <h.j.bos@vu.nl>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Yang Li <yang.lee@linux.alibaba.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Hari Bathini <hbathini@linux.ibm.com>
Cc: Yan Zhen <yanzhen@vivo.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 156bb2c569 ]
utf8_load() requests the symbol "utf8_data_table" and then checks if the
requested UTF-8 version is supported. If it's unsupported, it tries to
put the data table using symbol_put(). If an unsupported version is
requested, symbol_put() fails like this:
kernel BUG at kernel/module/main.c:786!
RIP: 0010:__symbol_put+0x93/0xb0
Call Trace:
<TASK>
? __die_body.cold+0x19/0x27
? die+0x2e/0x50
? do_trap+0xca/0x110
? do_error_trap+0x65/0x80
? __symbol_put+0x93/0xb0
? exc_invalid_op+0x51/0x70
? __symbol_put+0x93/0xb0
? asm_exc_invalid_op+0x1a/0x20
? __pfx_cmp_name+0x10/0x10
? __symbol_put+0x93/0xb0
? __symbol_put+0x62/0xb0
utf8_load+0xf8/0x150
That happens because symbol_put() expects the unique string that
identify the symbol, instead of a pointer to the loaded symbol. Fix that
by using such string.
Fixes: 2b3d047870 ("unicode: Add utf8-data module")
Signed-off-by: André Almeida <andrealmeid@igalia.com>
Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Link: https://lore.kernel.org/r/20240902225511.757831-2-andrealmeid@igalia.com
Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 200b977ebb ]
If dlm_recover_members() fails we don't drop the references of the
previous created root_list that holds and keep all rsbs alive during the
recovery. It might be not an unlikely event because ping_members() could
run into an -EINTR if another recovery progress was triggered again.
Fixes: 3a747f4a2e ("dlm: move rsb root_list to ls_recover() stack")
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit bae0854160 ]
Adjust sb->s_blocksize{,_bits} directly for file-backed
mounts when the fs block size is smaller than PAGE_SIZE.
Previously, EROFS used sb_set_blocksize(), which caused
a panic if bdev-backed mounts is not used.
Fixes: fb17675026 ("erofs: add file-backed mount support")
Signed-off-by: Hongzhen Luo <hongzhen@linux.alibaba.com>
Link: https://lore.kernel.org/r/20241015103836.3757438-1-hongzhen@linux.alibaba.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 0b5bbeee4d ]
A memleak was found as below:
unreferenced object 0xffff0000d10164d8 (size 8):
comm "pool-udisksd", pid 108217, jiffies 4295408555
hex dump (first 8 bytes):
75 74 66 38 00 cc cc cc utf8....
backtrace (crc de430d31):
[<ffff800081046e6c>] kmemleak_alloc+0xb8/0xc8
[<ffff8000803e6c3c>] __kmalloc_node_track_caller_noprof+0x380/0x474
[<ffff800080363b74>] kstrdup+0x70/0xfc
[<ffff80007bb3c6a4>] isofs_parse_param+0x228/0x2c0 [isofs]
[<ffff8000804d7f68>] vfs_parse_fs_param+0xf4/0x164
[<ffff8000804d8064>] vfs_parse_fs_string+0x8c/0xd4
[<ffff8000804d815c>] vfs_parse_monolithic_sep+0xb0/0xfc
[<ffff8000804d81d8>] generic_parse_monolithic+0x30/0x3c
[<ffff8000804d8bfc>] parse_monolithic_mount_data+0x40/0x4c
[<ffff8000804b6a64>] path_mount+0x6c4/0x9ec
[<ffff8000804b6e38>] do_mount+0xac/0xc4
[<ffff8000804b7494>] __arm64_sys_mount+0x16c/0x2b0
[<ffff80008002b8dc>] invoke_syscall+0x7c/0x104
[<ffff80008002ba44>] el0_svc_common.constprop.1+0xe0/0x104
[<ffff80008002ba94>] do_el0_svc+0x2c/0x38
[<ffff800081041108>] el0_svc+0x3c/0x1b8
The opt->iocharset is freed inside the isofs_fill_super function,
But there may be situations where it's not possible to
enter this function.
For example, in the get_tree_bdev_flags function,when
encountering the situation where "Can't mount, would change RO state,"
In such a case, isofs_fill_super will not have the opportunity
to be called,which means that opt->iocharset will not have the chance
to be freed,ultimately leading to a memory leak.
Let's move the memory freeing of opt->iocharset into
isofs_free_fc function.
Fixes: 1b17a46c92 ("isofs: convert isofs to use the new mount API")
Signed-off-by: Hao Ge <gehao@kylinos.cn>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20241106082841.51773-1-hao.ge@linux.dev
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 6d59f2fbfb ]
The arguments got swapped by commit 986ae3c2a8 ("dlm: fix race between
final callback and remove") fixing this now.
Fixes: 986ae3c2a8 ("dlm: fix race between final callback and remove")
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 2f3d93e210 ]
When I enabled ext4 debug for fault injection testing, I encountered the
following warning:
EXT4-fs error (device sda): ext4_read_inode_bitmap:201: comm fsstress:
Cannot read inode bitmap - block_group = 8, inode_bitmap = 1051
WARNING: CPU: 0 PID: 511 at fs/buffer.c:1181 mark_buffer_dirty+0x1b3/0x1d0
The root cause of the issue lies in the improper implementation of ext4's
buffer_head read fault injection. The actual completion of buffer_head
read and the buffer_head fault injection are not atomic, which can lead
to the uptodate flag being cleared on normally used buffer_heads in race
conditions.
[CPU0] [CPU1] [CPU2]
ext4_read_inode_bitmap
ext4_read_bh()
<bh read complete>
ext4_read_inode_bitmap
if (buffer_uptodate(bh))
return bh
jbd2_journal_commit_transaction
__jbd2_journal_refile_buffer
__jbd2_journal_unfile_buffer
__jbd2_journal_temp_unlink_buffer
ext4_simulate_fail_bh()
clear_buffer_uptodate
mark_buffer_dirty
<report warning>
WARN_ON_ONCE(!buffer_uptodate(bh))
The best approach would be to perform fault injection in the IO completion
callback function, rather than after IO completion. However, the IO
completion callback function cannot get the fault injection code in sb.
Fix it by passing the result of fault injection into the bh read function,
we simulate faults within the bh read function itself. This requires adding
an extra parameter to the bh read functions that need fault injection.
Fixes: 46f870d690 ("ext4: simulate various I/O and checksum errors when reading metadata")
Signed-off-by: Long Li <leo.lilong@huawei.com>
Link: https://patch.msgid.link/20240906091746.510163-1-leo.lilong@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 22f9400a6f ]
In fscache_create_volume(), there is a missing memory barrier between the
bit-clearing operation and the wake-up operation. This may cause a
situation where, after a wake-up, the bit-clearing operation hasn't been
detected yet, leading to an indefinite wait. The triggering process is as
follows:
[cookie1] [cookie2] [volume_work]
fscache_perform_lookup
fscache_create_volume
fscache_perform_lookup
fscache_create_volume
fscache_create_volume_work
cachefiles_acquire_volume
clear_and_wake_up_bit
test_and_set_bit
test_and_set_bit
goto maybe_wait
goto no_wait
In the above process, cookie1 and cookie2 has the same volume. When cookie1
enters the -no_wait- process, it will clear the bit and wake up the waiting
process. If a barrier is missing, it may cause cookie2 to remain in the
-wait- process indefinitely.
In commit 3288666c72 ("fscache: Use clear_and_wake_up_bit() in
fscache_create_volume_work()"), barriers were added to similar operations
in fscache_create_volume_work(), but fscache_create_volume() was missed.
By combining the clear and wake operations into clear_and_wake_up_bit() to
fix this issue.
Fixes: bfa22da3ed ("fscache: Provide and use cache methods to lookup/create/free a volume")
Signed-off-by: Zizhi Wo <wozizhi@huawei.com>
Link: https://lore.kernel.org/r/20241107110649.3980193-6-wozizhi@huawei.com
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 31ad74b202 ]
At present, the object->file has the NULL pointer dereference problem in
ondemand-mode. The root cause is that the allocated fd and object->file
lifetime are inconsistent, and the user-space invocation to anon_fd uses
object->file. Following is the process that triggers the issue:
[write fd] [umount]
cachefiles_ondemand_fd_write_iter
fscache_cookie_state_machine
cachefiles_withdraw_cookie
if (!file) return -ENOBUFS
cachefiles_clean_up_object
cachefiles_unmark_inode_in_use
fput(object->file)
object->file = NULL
// file NULL pointer dereference!
__cachefiles_write(..., file, ...)
Fix this issue by add an additional reference count to the object->file
before write/llseek, and decrement after it finished.
Fixes: c838305450 ("cachefiles: notify the user daemon when looking up cookie")
Signed-off-by: Zizhi Wo <wozizhi@huawei.com>
Link: https://lore.kernel.org/r/20241107110649.3980193-5-wozizhi@huawei.com
Reviewed-by: David Howells <dhowells@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 56f4856b42 ]
In the erofs on-demand loading scenario, read and write operations are
usually delivered through "off" and "len" contained in read req in user
mode. Naturally, pwrite is used to specify a specific offset to complete
write operations.
However, if the write(not pwrite) syscall is called multiple times in the
read-ahead scenario, we need to manually update ki_pos after each write
operation to update file->f_pos.
This step is currently missing from the cachefiles_ondemand_fd_write_iter
function, added to address this issue.
Fixes: c838305450 ("cachefiles: notify the user daemon when looking up cookie")
Signed-off-by: Zizhi Wo <wozizhi@huawei.com>
Link: https://lore.kernel.org/r/20241107110649.3980193-3-wozizhi@huawei.com
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 10c35abd35 ]
cachefiles_ondemand_fd_write_iter() function first aligns "pos" and "len"
to block boundaries. When calling __cachefiles_write(), the aligned "pos"
is passed in, but "len" is the original unaligned value(iter->count).
Additionally, the returned length of the write operation is the modified
"len" aligned by block size, which is unreasonable.
The alignment of "pos" and "len" is intended only to check whether the
cache has enough space. But the modified len should not be used as the
return value of cachefiles_ondemand_fd_write_iter() because the length we
passed to __cachefiles_write() is the previous "len". Doing so would result
in a mismatch in the data written on-demand. For example, if the length of
the user state passed in is not aligned to the block size (the preread
scene/DIO writes only need 512 alignment/Fault injection), the length of
the write will differ from the actual length of the return.
To solve this issue, since the __cachefiles_prepare_write() modifies the
size of "len", we pass "aligned_len" to __cachefiles_prepare_write() to
calculate the free blocks and use the original "len" as the return value of
cachefiles_ondemand_fd_write_iter().
Fixes: c838305450 ("cachefiles: notify the user daemon when looking up cookie")
Signed-off-by: Zizhi Wo <wozizhi@huawei.com>
Link: https://lore.kernel.org/r/20241107110649.3980193-2-wozizhi@huawei.com
Reviewed-by: David Howells <dhowells@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 76486b1041 ]
When we remount filesystem with 'abort' mount option while changing
other mount options as well (as is LTP test doing), we can return error
from the system call after commit d3476f3dad ("ext4: don't set
SB_RDONLY after filesystem errors") because the application of mount
option changes detects shutdown filesystem and refuses to do anything.
The behavior of application of other mount options in presence of
'abort' mount option is currently rather arbitary as some mount option
changes are handled before 'abort' and some after it.
Move aborting of the filesystem to the end of remount handling so all
requested changes are properly applied before the filesystem is shutdown
to have a reasonably consistent behavior.
Fixes: d3476f3dad ("ext4: don't set SB_RDONLY after filesystem errors")
Reported-by: Jan Stancek <jstancek@redhat.com>
Link: https://lore.kernel.org/all/Zvp6L+oFnfASaoHl@t14s
Signed-off-by: Jan Kara <jack@suse.cz>
Tested-by: Jan Stancek <jstancek@redhat.com>
Link: https://patch.msgid.link/20241004221556.19222-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit c3be7ebbbc ]
Currently FMODE_CAN_ATOMIC_WRITE is set if the bdev can atomic write and
the file is open for direct IO. This does not work if the file is not
opened for direct IO, yet fcntl(O_DIRECT) is used on the fd later.
Change to check for direct IO on a per-IO basis in
generic_atomic_write_valid(). Since we want to report -EOPNOTSUPP for
non-direct IO for an atomic write, change to return an error code.
Relocate the block fops atomic write checks to the common write path, as to
catch non-direct IO.
Fixes: c34fc6f26a ("fs: Initial atomic write support")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Link: https://lore.kernel.org/r/20241019125113.369994-3-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 51ceeb1a81 ]
Commit 39a6c668e4 (efs: convert efs to use the new mount api)
did not include anything from v2 and v3 that were also submitted.
Fix this by bringing in those changes that were proposed in v2 and
v3.
Fixes: 39a6c668e4 efs: convert efs to use the new mount api.
Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
Link: https://lore.kernel.org/r/20241014190241.4093825-1-bodonnel@redhat.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
changelogs for details.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZzkr6AAKCRDdBJ7gKXxA
jsb2AP9HCOI4w9rQTmBdnaefXytS7fiiPq+LVNpjJ0NGXX2FSgD/e1NM0wi8KevQ
npcvlqTcXtRSJvYNF904aTNyDn+Kuw0=
=KFGY
-----END PGP SIGNATURE-----
Merge tag 'mm-hotfixes-stable-2024-11-16-15-33' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull hotfixes from Andrew Morton:
"10 hotfixes, 7 of which are cc:stable. All singletons, please see the
changelogs for details"
* tag 'mm-hotfixes-stable-2024-11-16-15-33' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
mm: revert "mm: shmem: fix data-race in shmem_getattr()"
ocfs2: uncache inode which has failed entering the group
mm: fix NULL pointer dereference in alloc_pages_bulk_noprof
mm, doc: update read_ahead_kb for MADV_HUGEPAGE
fs/proc/task_mmu: prevent integer overflow in pagemap_scan_get_args()
sched/task_stack: fix object_is_on_stack() for KASAN tagged pointers
crash, powerpc: default to CRASH_DUMP=n on PPC_BOOK3S_32
mm/mremap: fix address wraparound in move_page_tables()
tools/mm: fix compile error
mm, swap: fix allocation and scanning race with swapoff
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmc2aGIACgkQxWXV+ddt
WDu0AQ/9FLfC/e3X2GjZ0auna/7A/rF8MPoATUdAyHn75Md6Hc8PXpi1YvMph+ba
pvoufOqrU66/g0UNeacgsp6rF4rKJHg0q9Id+7wueLnDr41g9paXjsLYItq4j26w
GusDiZvUDwuDmb70vlTXrgAfnjooIdSwqJTlIzJxvl4wrNzOiUlUJtTMzmUrwn/9
Lf/iByWlGcPKKBc+1ZzFz4HlVOZZSt9YePeJw2/Aul2OMtuI3RTTAL/NtjaFIlYc
pb+NHVqFrrfgC+xo68hLBmnsBfS41EGR58rYRjEuQo0+hARa8WbxL3DNA/E/Vi5X
dsq/wQVlD7IVIWCoF9J94/iyDdwlDOGFMoL6FUrJwDtPGN/v/xxtA6ruvuC7k5zy
bHCR8ZVrJWVaxE7u0Gtl+hFPpDTwNTR7SfvK69gxPfci1cN0m2wCNK02SEUJwV09
N82N2ENGGwyWS+nOl/ERB+7A0QxViMr3JpUrPzSYqsmn8bwDvovSjK2fFouJoSey
bpAzbFWj+OS0O9nnRqabTJDM/Tk9O0s0Ye76aUS+Vfk9d5EuVfAg6pHiOBcFDhsK
UEG9QbPltfh6LPDHCdV93HOOsC0uNxCTCSpbQ9LFGKBICQsPIX/vZeV45fNFJDLX
j5kEtHFVU3snU+jA97nvYXPRANDnnNx/EzXv7zo0Ye8L+plecBs=
=ssYj
-----END PGP SIGNATURE-----
Merge tag 'for-6.12-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fix from David Sterba:
"One more fix that seems urgent and good to have in 6.12 final.
It could potentially lead to unexpected transaction aborts, due to
wrong comparison and order of processing of delayed refs"
* tag 'for-6.12-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix incorrect comparison for delayed refs
Syzbot has reported the following BUG:
kernel BUG at fs/ocfs2/uptodate.c:509!
...
Call Trace:
<TASK>
? __die_body+0x5f/0xb0
? die+0x9e/0xc0
? do_trap+0x15a/0x3a0
? ocfs2_set_new_buffer_uptodate+0x145/0x160
? do_error_trap+0x1dc/0x2c0
? ocfs2_set_new_buffer_uptodate+0x145/0x160
? __pfx_do_error_trap+0x10/0x10
? handle_invalid_op+0x34/0x40
? ocfs2_set_new_buffer_uptodate+0x145/0x160
? exc_invalid_op+0x38/0x50
? asm_exc_invalid_op+0x1a/0x20
? ocfs2_set_new_buffer_uptodate+0x2e/0x160
? ocfs2_set_new_buffer_uptodate+0x144/0x160
? ocfs2_set_new_buffer_uptodate+0x145/0x160
ocfs2_group_add+0x39f/0x15a0
? __pfx_ocfs2_group_add+0x10/0x10
? __pfx_lock_acquire+0x10/0x10
? mnt_get_write_access+0x68/0x2b0
? __pfx_lock_release+0x10/0x10
? rcu_read_lock_any_held+0xb7/0x160
? __pfx_rcu_read_lock_any_held+0x10/0x10
? smack_log+0x123/0x540
? mnt_get_write_access+0x68/0x2b0
? mnt_get_write_access+0x68/0x2b0
? mnt_get_write_access+0x226/0x2b0
ocfs2_ioctl+0x65e/0x7d0
? __pfx_ocfs2_ioctl+0x10/0x10
? smack_file_ioctl+0x29e/0x3a0
? __pfx_smack_file_ioctl+0x10/0x10
? lockdep_hardirqs_on_prepare+0x43d/0x780
? __pfx_lockdep_hardirqs_on_prepare+0x10/0x10
? __pfx_ocfs2_ioctl+0x10/0x10
__se_sys_ioctl+0xfb/0x170
do_syscall_64+0xf3/0x230
entry_SYSCALL_64_after_hwframe+0x77/0x7f
...
</TASK>
When 'ioctl(OCFS2_IOC_GROUP_ADD, ...)' has failed for the particular
inode in 'ocfs2_verify_group_and_input()', corresponding buffer head
remains cached and subsequent call to the same 'ioctl()' for the same
inode issues the BUG() in 'ocfs2_set_new_buffer_uptodate()' (trying
to cache the same buffer head of that inode). Fix this by uncaching
the buffer head with 'ocfs2_remove_from_cache()' on error path in
'ocfs2_group_add()'.
Link: https://lkml.kernel.org/r/20241114043844.111847-1-dmantipov@yandex.ru
Fixes: 7909f2bf83 ("[PATCH 2/2] ocfs2: Implement group add for online resize")
Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru>
Reported-by: syzbot+453873f1588c2d75b447@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=453873f1588c2d75b447
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Dmitry Antipov <dmantipov@yandex.ru>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The "arg->vec_len" variable is a u64 that comes from the user at the start
of the function. The "arg->vec_len * sizeof(struct page_region))"
multiplication can lead to integer wrapping. Use size_mul() to avoid
that.
Also the size_add/mul() functions work on unsigned long so for 32bit
systems we need to ensure that "arg->vec_len" fits in an unsigned long.
Link: https://lkml.kernel.org/r/39d41335-dd4d-48ed-8a7f-402c57d8ea84@stanley.mountain
Fixes: 52526ca7fd ("fs/proc/task_mmu: implement IOCTL to get and optionally clear info about PTEs")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Cc: Andrei Vagin <avagin@google.com>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: David Hildenbrand <david@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Cc: Muhammad Usama Anjum <usama.anjum@collabora.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
- Assorted tiny syzbot fixes
- Shutdown path fix: "bch2_btree_write_buffer_flush_going_ro()"
The shutdown path wasn't flushing the btree write buffer, leading to
shutting down while we still had operations in flight. This fixes a
whole slew of syzbot bugs, and undoubtedly other strange heisenbugs.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmc1ceQACgkQE6szbY3K
bna56w//V4On/z8q8r4vyj44NXfFNXL87QNB7e5BAP8lmow8S1KnqpHlpUq+MaJm
/LI1OPPaDJITokrQ4eUKyfRLQaH9PlDk7WafgzixnYmaXKVIraqGzWfISSbNhLdC
Hw/EjYdqmuQMAxTgj1upyx2UijvBRaT0dvjeaTX2OgQiqIlBUWGdUDFHWzc1sRgO
8PrJf9CYscdS+P+XrDBKLLCiKGn73Z93lSsFTwVzDictfrqUT8X5J5/a+ofCwp8W
sEiL/13bp05ABb3YjVrBupApyY6szC6YrKNwJ8RoHxBGo0zl5L/gd9FnBJX0T1wc
10dDMv4gZWf9mF2KePDH5EYBBV62N8qEwRAHrjkGu/87mNBRf6sUZ3fbTiMMmVds
nSTJUHH6t6uiwtezPPzDKtPYQBcGsP95DBdvL4un97jTlUtszHLU8OqVErojTYzZ
sZLilUUMz+CK/qfoJjUcSJwslevHeiri5Hto4jOvf8M7uH0Xx29LvmwqWLdZP2uT
RNjslFLvqzZoWGsbJEfe2YihJwJd9kkiE/0xQK60169bzoNEI6oWNmOK2Ts1Dr0X
tEkdHmWqNI65+nbjomGOa3u1wysrACuaRwu6fq0562+IvlDtVXLUWbmlS58+bqz2
KwUmZp4Y2OibeKFT+y0iqxpw9WBPQb7dwj9xwpwk3xgph1uVwbY=
=L6kv
-----END PGP SIGNATURE-----
Merge tag 'bcachefs-2024-11-13' of git://evilpiepirate.org/bcachefs
Pull bcachefs fixes from Kent Overstreet:
"This fixes one minor regression from the btree cache fixes (in the
scan_for_btree_nodes repair path) - and the shutdown path fix is the
big one here, in terms of bugs closed:
- Assorted tiny syzbot fixes
- Shutdown path fix: "bch2_btree_write_buffer_flush_going_ro()"
The shutdown path wasn't flushing the btree write buffer, leading
to shutting down while we still had operations in flight. This
fixes a whole slew of syzbot bugs, and undoubtedly other strange
heisenbugs.
* tag 'bcachefs-2024-11-13' of git://evilpiepirate.org/bcachefs:
bcachefs: Fix assertion pop in bch2_ptr_swab()
bcachefs: Fix journal_entry_dev_usage_to_text() overrun
bcachefs: Allow for unknown key types in backpointers fsck
bcachefs: Fix assertion pop in topology repair
bcachefs: Fix hidden btree errors when reading roots
bcachefs: Fix validate_bset() repair path
bcachefs: Fix missing validation for bch_backpointer.level
bcachefs: Fix bch_member.btree_bitmap_shift validation
bcachefs: bch2_btree_write_buffer_flush_going_ro()
When I reworked delayed ref comparison in cf4f04325b ("btrfs: move
->parent and ->ref_root into btrfs_delayed_ref_node"), I made a mistake
and returned -1 for the case where ref1->ref_root was > than
ref2->ref_root. This is a subtle bug that can result in improper
delayed ref running order, which can result in transaction aborts.
Fixes: cf4f04325b ("btrfs: move ->parent and ->ref_root into btrfs_delayed_ref_node")
CC: stable@vger.kernel.org # 6.10+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
singletons.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZzP1ZAAKCRDdBJ7gKXxA
jmBUAP9n2zTKoNeF/WpS0aSg+SpG78mtyMIwSUW2PPfGObYTBwD/bncG9U3fnno1
v6Sey0OjAKwGdV+gTd+5ymWJKPSQbgA=
=HxTA
-----END PGP SIGNATURE-----
Merge tag 'mm-hotfixes-stable-2024-11-12-16-39' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"10 hotfixes, 7 of which are cc:stable. 7 are MM, 3 are not. All
singletons"
* tag 'mm-hotfixes-stable-2024-11-12-16-39' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
mm: swapfile: fix cluster reclaim work crash on rotational devices
selftests: hugetlb_dio: fixup check for initial conditions to skip in the start
mm/thp: fix deferred split queue not partially_mapped: fix
mm/gup: avoid an unnecessary allocation call for FOLL_LONGTERM cases
nommu: pass NULL argument to vma_iter_prealloc()
ocfs2: fix UBSAN warning in ocfs2_verify_volume()
nilfs2: fix null-ptr-deref in block_dirty_buffer tracepoint
nilfs2: fix null-ptr-deref in block_touch_buffer tracepoint
mm: page_alloc: move mlocked flag clearance into free_pages_prepare()
mm: count zeromap read and set for swapout and swapin
This runs on extents that haven't yet been validated, so we don't want
to assert that we have a valid entry type.
Reported-by: syzbot+4f29c3f12f864d8a8d17@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
If the jset_entry_dev_usage is malformed, and too small, our nr_entries
calculation will be incorrect - just bail out.
Reported-by: syzbot+05d7520be047c9be86e0@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
When using the "block:block_dirty_buffer" tracepoint, mark_buffer_dirty()
may cause a NULL pointer dereference, or a general protection fault when
KASAN is enabled.
This happens because, since the tracepoint was added in
mark_buffer_dirty(), it references the dev_t member bh->b_bdev->bd_dev
regardless of whether the buffer head has a pointer to a block_device
structure.
In the current implementation, nilfs_grab_buffer(), which grabs a buffer
to read (or create) a block of metadata, including b-tree node blocks,
does not set the block device, but instead does so only if the buffer is
not in the "uptodate" state for each of its caller block reading
functions. However, if the uptodate flag is set on a folio/page, and the
buffer heads are detached from it by try_to_free_buffers(), and new buffer
heads are then attached by create_empty_buffers(), the uptodate flag may
be restored to each buffer without the block device being set to
bh->b_bdev, and mark_buffer_dirty() may be called later in that state,
resulting in the bug mentioned above.
Fix this issue by making nilfs_grab_buffer() always set the block device
of the super block structure to the buffer head, regardless of the state
of the buffer's uptodate flag.
Link: https://lkml.kernel.org/r/20241106160811.3316-3-konishi.ryusuke@gmail.com
Fixes: 5305cb8308 ("block: add block_{touch|dirty}_buffer tracepoint")
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Ubisectech Sirius <bugreport@valiantsec.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "nilfs2: fix null-ptr-deref bugs on block tracepoints".
This series fixes null pointer dereference bugs that occur when using
nilfs2 and two block-related tracepoints.
This patch (of 2):
It has been reported that when using "block:block_touch_buffer"
tracepoint, touch_buffer() called from __nilfs_get_folio_block() causes a
NULL pointer dereference, or a general protection fault when KASAN is
enabled.
This happens because since the tracepoint was added in touch_buffer(), it
references the dev_t member bh->b_bdev->bd_dev regardless of whether the
buffer head has a pointer to a block_device structure. In the current
implementation, the block_device structure is set after the function
returns to the caller.
Here, touch_buffer() is used to mark the folio/page that owns the buffer
head as accessed, but the common search helper for folio/page used by the
caller function was optimized to mark the folio/page as accessed when it
was reimplemented a long time ago, eliminating the need to call
touch_buffer() here in the first place.
So this solves the issue by eliminating the touch_buffer() call itself.
Link: https://lkml.kernel.org/r/20241106160811.3316-1-konishi.ryusuke@gmail.com
Link: https://lkml.kernel.org/r/20241106160811.3316-2-konishi.ryusuke@gmail.com
Fixes: 5305cb8308 ("block: add block_{touch|dirty}_buffer tracepoint")
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Reported-by: Ubisectech Sirius <bugreport@valiantsec.com>
Closes: https://lkml.kernel.org/r/86bd3013-887e-4e38-960f-ca45c657f032.bugreport@valiantsec.com
Reported-by: syzbot+9982fb8d18eba905abe2@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=9982fb8d18eba905abe2
Tested-by: syzbot+9982fb8d18eba905abe2@syzkaller.appspotmail.com
Cc: Tejun Heo <tj@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
We can't assume that btrees only contain keys of a given type - even if
they only have a single key type listed in the allowed key types for
that btree; this is a forwards compatibility issue.
Reported-by: syzbot+a27c3aaa3640dd3e1dfb@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Three affect DAMON. Lorenzo's five-patch series to address the
mmap_region error handling is here also.
Apart from that, various singletons.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZzBVmAAKCRDdBJ7gKXxA
ju42AQD0EEnzW+zFyI+E7x5FwCmLL6ofmzM8Sw9YrKjaeShdZgEAhcyS2Rc/AaJq
Uty2ZvVMDF2a9p9gqHfKKARBXEbN2w0=
=n+lO
-----END PGP SIGNATURE-----
Merge tag 'mm-hotfixes-stable-2024-11-09-22-40' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"20 hotfixes, 14 of which are cc:stable.
Three affect DAMON. Lorenzo's five-patch series to address the
mmap_region error handling is here also.
Apart from that, various singletons"
* tag 'mm-hotfixes-stable-2024-11-09-22-40' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
mailmap: add entry for Thorsten Blum
ocfs2: remove entry once instead of null-ptr-dereference in ocfs2_xa_remove()
signal: restore the override_rlimit logic
fs/proc: fix compile warning about variable 'vmcore_mmap_ops'
ucounts: fix counter leak in inc_rlimit_get_ucounts()
selftests: hugetlb_dio: check for initial conditions to skip in the start
mm: fix docs for the kernel parameter ``thp_anon=``
mm/damon/core: avoid overflow in damon_feed_loop_next_input()
mm/damon/core: handle zero schemes apply interval
mm/damon/core: handle zero {aggregation,ops_update} intervals
mm/mlock: set the correct prev on failure
objpool: fix to make percpu slot allocation more robust
mm/page_alloc: keep track of free highatomic
mm: resolve faulty mmap_region() error path behaviour
mm: refactor arch_calc_vm_flag_bits() and arm64 MTE handling
mm: refactor map_deny_write_exec()
mm: unconditionally close VMAs on error
mm: avoid unsafe VMA hook invocation when error arises on mmap hook
mm/thp: fix deferred split unqueue naming and locking
mm/thp: fix deferred split queue not partially_mapped
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmcuWXMACgkQiiy9cAdy
T1Eu8gv+LUAmrvvv8PDoLUT50QZb6aAY2SeulgTdeG8OzImXH5VUSjptRYwP46Dk
KNLh85A4C39w/guxm3FX2qjeesZZD5DDubSJNATLy75jorq7z+1uTNg8oUZGpvJS
airmcv/0mcDZqVayCmiT7wPyhUSYa+VTvHrkFpsI20BrlyDybe5HGps77iCOJ5K0
uTRgM6VNxkKx+Z5NietpDyaUl2A5b6Yx/9J8vMq4ytBfEcSGi+ndpZNvG7kKg8gQ
3i/ND4O2+eScwvYclVP5mJbF71LW0Z/ljS4mEVH5UuRgLH2Ji35B9xaDFDSixI3x
EHFwnAX0QeGHIlIuFhRDdtR2gFqREAJOYxkDxfo7PXO5gOXLWZXru9F7v6lWsydN
varqSseBBucHOLn8NylvgJWwqYs+sIKQycYKsX3ZUnQfejaUwfV2H/ADJzccjFF8
PUzVQFyOZtUK3fdkoqvULr/zvwninhtLJYLIsPcUgSPCcxGxMApvtkCaJVV3JGfB
2acZPdMu
=ZzcZ
-----END PGP SIGNATURE-----
Merge tag 'v6.12-rc6-smb3-client-fix' of git://git.samba.org/sfrench/cifs-2.6
Pull smb client fix from Steve French:
"Fix net namespace refcount use after free issue"
* tag 'v6.12-rc6-smb3-client-fix' of git://git.samba.org/sfrench/cifs-2.6:
smb: client: Fix use-after-free of network namespace.
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmcsXSUACgkQiiy9cAdy
T1FyOgv+Ks1lfl+6D/G89zFl5XOtCm8njsedJu9y3jR7hzophX2osfmodACMVX6B
0VLu0jzquvUo18VNlL+wF7YFH+Mc6zrevEnjBay9Xa05YyRqK5c7qjpiWEgXPN7/
ROQfC2slCAFjymhw+9qY+PGZYg3x0fyGdJC/gBNSFnu2ufag367Li+0fTKQTXFwz
F24S5eI+M9OWNgMnMYoNt+77f0n0JkKbQznq9nTEvUsbTWZFSEfmVczfSY0ltdOH
RER9zoyTU3zbPuMZqK+Jb7c2247ahsLzDEBAUG0Wn77wSaiWXU5dmVD5bWsDTp25
5p9uLpkr3irDWwJGkCrkpm2Tva/50IHPEFQ4kllVlm6ffoao/dxBCwFf/MEvJXzI
OgU+HpXyZdq6NF1hcB4xUlcbHvGCa6pEcYkcM7PwLml+6SKIwEsEGpnJ23kxGR3+
MGYMCITatRuvZstfEDolNyrO2+gPMd3ODnLhfjfjT47Kh38e7yxrLr4cmxbPAA+s
EVdm2N08
=zTn6
-----END PGP SIGNATURE-----
Merge tag 'v6.12-rc6-ksmbd-fixes' of git://git.samba.org/ksmbd
Pull smb server fixes from Steve French:
"Four fixes, all also marked for stable:
- fix two potential use after free issues
- fix OOM issue with many simultaneous requests
- fix missing error check in RPC pipe handling"
* tag 'v6.12-rc6-ksmbd-fixes' of git://git.samba.org/ksmbd:
ksmbd: check outstanding simultaneous SMB operations
ksmbd: fix slab-use-after-free in smb3_preauth_hash_rsp
ksmbd: fix slab-use-after-free in ksmbd_smb2_session_create
ksmbd: Fix the missing xa_store error check
We silence btree errors in btree_node_scan, since it's probing and
errors are expected: add a fake pass so that btree_node_scan is no
longer recovery pass 0, and we don't think we're in btree node scan when
reading btree roots.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
When we truncate a bset (due to it extending past the end of the btree
node), we can't skip the rest of the validation for e.g. the packed
format (if it's the first bset in the node).
Reported-by: syzbot+4d722d3c539d77c7bc82@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmct+40ACgkQxWXV+ddt
WDvCtRAAp0rheEu14hpVvWE2//+6u9Gx7Wfjzbj0+o4zBRWdg7BigFxfeb6JsH/E
2TjuWdcoP/OMV9ghCBQAQxySAPtsxH7skkyNy2UcMk5byBIrNvhw9auP5GXXlrhK
jSKDD4yfOMb++8LhrLevgTrijNyjLqaKXruw9a1Pmc3gxpdNmnMEySsQaF62o2Sm
YC3jwi0KpNAhu2qyJ6TnPgd5zf3BTM0JAeuB019IZW4WoeRTOdcPe7S7gqqJwZ+e
lL0D2/lfIE1lKvLE266Fab4FAQiJV07rozYj25XHiDpqThCxnJVOZCEHasOQ1PRy
d6j3RmGPqJYAYfQL1L+FH2hsS1BVZfVyCV1V7A/cN+lAffBfnROnf13C3gJ15Nbx
3lTyjBPQQw2WpfdmeyF3ikbrjZ8AfahChQO+mMnLN7oAWdIwWX5MRB+cwfWTxzA/
P8upz6HSTpSwy8nXdq264q1KkyCjx0Wv+8iyU7LirN2fCcEchA12HAIaOBeHedgh
PrGZDqrkZccQQxAvU5H7hQv0hZkGK8qba381oYHO09g72VM6ysuBU7tGrPZrlZYB
CvYTCwNZ/lqI8ikrcHOyUO1SPR9SaaWej1mWgBJ69ZIfg+ZuMtOMl171DU4S/i2V
iYgYoN8eCqTQWdaX5kk+3LWmK8fSU7F/KSDtJtT1KxkaSwCacfY=
=TQzP
-----END PGP SIGNATURE-----
Merge tag 'for-6.12-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"A few more one-liners that fix some user visible problems:
- use correct range when clearing qgroup reservations after COW
- properly reset freed delayed ref list head
- fix ro/rw subvolume mounts to be backward compatible with old and
new mount API"
* tag 'for-6.12-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix the length of reserved qgroup to free
btrfs: reinitialize delayed ref list after deleting it from the list
btrfs: fix per-subvolume RO/RW flags with new mount API
Some trivial syzbot fixes, two more serious btree fixes found by looping
single_devices.ktest small_nodes:
- Topology error on split after merge, where we accidentaly picked the
node being deleted for the pivot, resulting in an assertion pop
- New nodes being preallocated were left on the freedlist, unlocked,
resulting in them sometimes being accidentally freed: this dated from
pre-cycle detector, when we could leave them locked. This should have
resulted in more explosions and fireworks, but turned out to be
surprisingly hard to hit because the preallocated nodes were being
used right away.
the fix for this is bigger than we'd like - reworking btree list
handling was a bit invasive - but we've now got more assertions and
it's well tested.
- Also another mishandled transaction restart fix (in
btree_node_prefetch) - we're almost done with those.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmctRYoACgkQE6szbY3K
bnbkxRAArtqV9/qsKbSYAaa/+GaL7YdapuYbi/pmC9X96F9qbTdEJzW5rs66iiGE
zbkfFqo2I85nacSTk3b12E3QUXj+CEmSIWOPQtamYw/0AkmVsKepgGsXLazZ0rYi
X8UDVc6fuFkoO1aC/9V2NJEFG9QXIj8ru0m2kyUE9ZM6rgskugVN/ec9ipNQNZhY
4L8U7Z6Y9AX4vs/BeV3i6cLrTaMroUFYSM0hJalBJ24KZsZ1bWflC39C0dXSvy/O
gCmBCobZTT5aDEQai1kdyFr4GZZUgCJg4YEUDfyOdpPmhbcP4iwX/cJqJHXqxXVt
nMyLz5nLs0nYO791UlLHZuUUUe99nl+tC09b034n20peLnQwWW/obTrhn86SDDka
2eQv1Rk5C5i8r5b0k8UYjy5ogfiVlC/X1OwmLKkarKnC/wd0eFQI71Qq9s8KpXbo
VVASENYFV3hrIV8ZcxiqiJ18g6o7++jtTAmIfRljQrO6B8tU5g5uWCTZli+wciii
qWnt1k7P92er8lBzUnQGh9CEwLVbe9ZyBJv+fYVwTOxPES/TbJS7n5fb+1f1rF9w
j5llXVUiaLucXoCpBjEDflvhBTRQHEkKk3gJgy86NKgRjEjPhQT8D2dksT4kgyHb
RqgOSUN+oVqi/i+7RKf9x/jG4id0uvMH5xT7qiXTUiQXtUD+J9g=
=cn3u
-----END PGP SIGNATURE-----
Merge tag 'bcachefs-2024-11-07' of git://evilpiepirate.org/bcachefs
Pull bcachefs fixes from Kent Overstreet:
"Some trivial syzbot fixes, two more serious btree fixes found by
looping single_devices.ktest small_nodes:
- Topology error on split after merge, where we accidentaly picked
the node being deleted for the pivot, resulting in an assertion pop
- New nodes being preallocated were left on the freedlist, unlocked,
resulting in them sometimes being accidentally freed: this dated
from pre-cycle detector, when we could leave them locked. This
should have resulted in more explosions and fireworks, but turned
out to be surprisingly hard to hit because the preallocated nodes
were being used right away.
The fix for this is bigger than we'd like - reworking btree list
handling was a bit invasive - but we've now got more assertions and
it's well tested.
- Also another mishandled transaction restart fix (in
btree_node_prefetch) - we're almost done with those"
* tag 'bcachefs-2024-11-07' of git://evilpiepirate.org/bcachefs:
bcachefs: Fix UAF in __promote_alloc() error path
bcachefs: Change OPT_STR max to be 1 less than the size of choices array
bcachefs: btree_cache.freeable list fixes
bcachefs: check the invalid parameter for perf test
bcachefs: add check NULL return of bio_kmalloc in journal_read_bucket
bcachefs: Ensure BCH_FS_may_go_rw is set before exiting recovery
bcachefs: Fix topology errors on split after merge
bcachefs: Ancient versions with bad bkey_formats are no longer supported
bcachefs: Fix error handling in bch2_btree_node_prefetch()
bcachefs: Fix null ptr deref in bucket_gen_get()
This fixes an assertion pop where we try to navigate to the target of
the backpointer, and the path level isn't what we expect.
Reported-by: syzbot+b17df21b4d370f2dc330@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The write buffer needs to be specifically flushed when going RO: keys in
the journal that haven't yet been moved to the write buffer don't have a
journal pin yet.
This fixes numerous syzbot bugs, all with symptoms of still doing writes
after we've got RO.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
When build with !CONFIG_MMU, the variable 'vmcore_mmap_ops'
is defined but not used:
>> fs/proc/vmcore.c:458:42: warning: unused variable 'vmcore_mmap_ops'
458 | static const struct vm_operations_struct vmcore_mmap_ops = {
Fix this by only defining it when CONFIG_MMU is enabled.
Link: https://lkml.kernel.org/r/20241101034803.9298-1-xiqi2@huawei.com
Fixes: 9cb218131d ("vmcore: introduce remap_oldmem_pfn_range()")
Signed-off-by: Qi Xi <xiqi2@huawei.com>
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/lkml/202410301936.GcE8yUos-lkp@intel.com/
Cc: Baoquan He <bhe@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Wang ShaoBo <bobo.shaobowang@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
If we error in data_update_init() after adding to the rhashtable of
outstanding promotes, kfree_rcu() is required.
Reported-by: Reed Riley <reed@riley.engineer>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Change OPT_STR max value to be 1 less than the "ARRAY_SIZE" of "_choices"
array. As a result, remove -1 from (opt->max-1) in bch2_opt_to_text.
The "_choices" array is a null-terminated array, so computing the maximum
using "ARRAY_SIZE" without subtracting 1 yields an incorrect result. Since
bch2_opt_validate don't subtract 1, as bch2_opt_to_text does, values
bigger than the actual maximum would pass through option validation.
Reported-by: syzbot+bee87a0c3291c06aa8c6@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=bee87a0c3291c06aa8c6
Fixes: 63c4b25453 ("bcachefs: Better superblock opt validation")
Suggested-by: Kent Overstreet <kent.overstreet@linux.dev>
Signed-off-by: Piotr Zalewski <pZ010001011111@proton.me>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
When allocating new btree nodes, we were leaving them on the freeable
list - unlocked - allowing them to be reclaimed: ouch.
Additionally, bch2_btree_node_free_never_used() ->
bch2_btree_node_hash_remove was putting it on the freelist, while
bch2_btree_node_free_never_used() was putting it back on the btree
update reserve list - ouch.
Originally, the code was written to always keep btree nodes on a list -
live or freeable - and this worked when new nodes were kept locked.
But now with the cycle detector, we can't keep nodes locked that aren't
tracked by the cycle detector; and this is fine as long as they're not
reachable.
We also have better and more robust leak detection now, with memory
allocation profiling, so the original justification no longer applies.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The perf_test does not check the number of iterations and threads
when it is zero. If nr_thread is 0, the perf test will keep
waiting for wakekup. If iteration is 0, it will cause exception
of division by zero. This can be reproduced by:
echo "rand_insert 0 1" > /sys/fs/bcachefs/${uuid}/perf_test
or
echo "rand_insert 1 0" > /sys/fs/bcachefs/${uuid}/perf_test
Fixes: 1c6fdbd8f2 ("bcachefs: Initial commit")
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
bio_kmalloc may return NULL, will cause NULL pointer dereference.
Add check NULL return for bio_kmalloc in journal_read_bucket.
Signed-off-by: Pei Xiao <xiaopei01@kylinos.cn>
Fixes: ac10a9611d ("bcachefs: Some fixes for building in userspace")
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
If BCH_FS_may_go_rw is not yet set, it indicates to the transaction
commit path that updates should be done via the list of journal replay
keys.
This must be set before multithreaded use commences.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
If a btree split picks a pivot that's being deleted by a btree node
merge, we're going to have problems.
Fix this by checking if the pivot is being deleted, the same as we check
for deletions in journal replay keys.
Found by single_devic.ktest small_nodes.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Syzbot found an assertion pop, by generating an ancient filesystem
version with an invalid bkey_format (with fields that can overflow) as
well as packed keys that aren't representable unpacked.
This breaks key comparisons in all sorts of painful ways.
Filesystems have been automatically rewriting nodes with such invalid
formats for years; we can safely drop support for them.
Reported-by: syzbot+8a0109511de9d4b61217@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
bucket_gen() checks if we're lookup up a valid bucket and returns NULL
otherwise, but bucket_gen_get() was failing to check; other callers were
correct.
Also do a bit of cleanup on callers.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
seq_printf is costy, on a system with n CPUs, reading /proc/softirqs
would yield 10*n decimal values, and the extra cost parsing format string
grows linearly with number of cpus. Replace seq_printf with
seq_put_decimal_ull_width have significant performance improvement.
On an 8CPUs system, reading /proc/softirqs show ~40% performance
gain with this patch.
Signed-off-by: David Wang <00107082@163.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
I noticed that recently, simple operations like "make" started
failing on NFSv3 mounts of ext4 exports. Network capture shows that
READDIRPLUS operated correctly but READDIR failed with
NFS3ERR_INVAL. The vfs_llseek() call returned EINVAL when it is
passed a non-zero starting directory cookie.
I bisected to commit c689bdd3bf ("nfsd: further centralize
protocol version checks.").
Turns out that nfsd3_proc_readdir() does not call fh_verify() before
it calls nfsd_readdir(), so the new fhp->fh_64bit_cookies boolean is
not set properly. This leaves the NFSD_MAY_64BIT_COOKIE unset when
the directory is opened.
For ext4, this causes the wrong "max file size" value to be used
when sanity checking the incoming directory cookie (which is a seek
offset value).
The fhp->fh_64bit_cookies boolean is /always/ properly initialized
after nfsd_open() returns. There doesn't seem to be a reason for the
generic NFSD open helper to handle the f_mode fix-up for
directories, so just move that to the one caller that tries to open
an S_IFDIR with NFSD_MAY_64BIT_COOKIE.
Suggested-by: NeilBrown <neilb@suse.de>
Fixes: c689bdd3bf ("nfsd: further centralize protocol version checks.")
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
The dealloc flag may be cleared and the extent won't reach the disk in
cow_file_range when errors path. The reserved qgroup space is freed in
commit 30479f31d4 ("btrfs: fix qgroup reserve leaks in
cow_file_range"). However, the length of untouched region to free needs
to be adjusted with the correct remaining region size.
Fixes: 30479f31d4 ("btrfs: fix qgroup reserve leaks in cow_file_range")
CC: stable@vger.kernel.org # 6.11+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Haisu Wang <haisuwang@tencent.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At insert_delayed_ref() if we need to update the action of an existing
ref to BTRFS_DROP_DELAYED_REF, we delete the ref from its ref head's
ref_add_list using list_del(), which leaves the ref's add_list member
not reinitialized, as list_del() sets the next and prev members of the
list to LIST_POISON1 and LIST_POISON2, respectively.
If later we end up calling drop_delayed_ref() against the ref, which can
happen during merging or when destroying delayed refs due to a transaction
abort, we can trigger a crash since at drop_delayed_ref() we call
list_empty() against the ref's add_list, which returns false since
the list was not reinitialized after the list_del() and as a consequence
we call list_del() again at drop_delayed_ref(). This results in an
invalid list access since the next and prev members are set to poison
pointers, resulting in a splat if CONFIG_LIST_HARDENED and
CONFIG_DEBUG_LIST are set or invalid poison pointer dereferences
otherwise.
So fix this by deleting from the list with list_del_init() instead.
Fixes: 1d57ee9416 ("btrfs: improve delayed refs iterations")
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
With util-linux 2.40.2, the 'mount' utility is already utilizing the new
mount API. e.g:
# strace mount -o subvol=subv1,ro /dev/test/scratch1 /mnt/test/
...
fsconfig(3, FSCONFIG_SET_STRING, "source", "/dev/mapper/test-scratch1", 0) = 0
fsconfig(3, FSCONFIG_SET_STRING, "subvol", "subv1", 0) = 0
fsconfig(3, FSCONFIG_SET_FLAG, "ro", NULL, 0) = 0
fsconfig(3, FSCONFIG_CMD_CREATE, NULL, NULL, 0) = 0
fsmount(3, FSMOUNT_CLOEXEC, 0) = 4
mount_setattr(4, "", AT_EMPTY_PATH, {attr_set=MOUNT_ATTR_RDONLY, attr_clr=0, propagation=0 /* MS_??? */, userns_fd=0}, 32) = 0
move_mount(4, "", AT_FDCWD, "/mnt/test", MOVE_MOUNT_F_EMPTY_PATH) = 0
But this leads to a new problem, that per-subvolume RO/RW mount no
longer works, if the initial mount is RO:
# mount -o subvol=subv1,ro /dev/test/scratch1 /mnt/test
# mount -o rw,subvol=subv2 /dev/test/scratch1 /mnt/scratch
# mount | grep mnt
/dev/mapper/test-scratch1 on /mnt/test type btrfs (ro,relatime,discard=async,space_cache=v2,subvolid=256,subvol=/subv1)
/dev/mapper/test-scratch1 on /mnt/scratch type btrfs (ro,relatime,discard=async,space_cache=v2,subvolid=257,subvol=/subv2)
# touch /mnt/scratch/foobar
touch: cannot touch '/mnt/scratch/foobar': Read-only file system
This is a common use cases on distros.
[CAUSE]
We have a workaround for remount to handle the RO->RW change, but if the
mount is using the new mount API, we do not do that, and rely on the
mount tool NOT to set the ro flag.
But that's not how the mount tool is doing for the new API:
fsconfig(3, FSCONFIG_SET_STRING, "source", "/dev/mapper/test-scratch1", 0) = 0
fsconfig(3, FSCONFIG_SET_STRING, "subvol", "subv1", 0) = 0
fsconfig(3, FSCONFIG_SET_FLAG, "ro", NULL, 0) = 0 <<<< Setting RO flag for super block
fsconfig(3, FSCONFIG_CMD_CREATE, NULL, NULL, 0) = 0
fsmount(3, FSMOUNT_CLOEXEC, 0) = 4
mount_setattr(4, "", AT_EMPTY_PATH, {attr_set=MOUNT_ATTR_RDONLY, attr_clr=0, propagation=0 /* MS_??? */, userns_fd=0}, 32) = 0
move_mount(4, "", AT_FDCWD, "/mnt/test", MOVE_MOUNT_F_EMPTY_PATH) = 0
This means we will set the super block RO at the first mount.
Later RW mount will not try to reconfigure the fs to RW because the
mount tool is already using the new API.
This totally breaks the per-subvolume RO/RW mount behavior.
[FIX]
Do not skip the reconfiguration even if using the new API. The old
comments are just expecting any mount tool to properly skip the RO flag
set even if we specify "ro", which is not the reality.
Update the comments regarding the backward compatibility on the kernel
level so it works with old and new mount utilities.
CC: stable@vger.kernel.org # 6.8+
Fixes: f044b31867 ("btrfs: handle the ro->rw transition for mounting different subvolumes")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Stable Fixes:
* Fix KMSAN warning in decode_getfattr_attrs()
Other Bugfixes:
* Handle -ENOTCONN in xs_tcp_setup_socked()
* NFSv3: only use NFS timeout for MOUNT when protocols are compatible
* Fix attribute delegation behavior on exclusive create and a/mtime changes
* Fix localio to cope with racing nfs_local_probe()
* Avoid i_lock contention in fs_clear_invalid_mapping()
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAmcr1HUACgkQ18tUv7Cl
QOtdYBAA0YohWDHflcHPbltJu0UyCyDDtowvpVacSDJwZwVEXnLQRTTqrdUnWVxx
Bc2Ae8tGsfcwo10yZ6LUIPjcyEqLQeYvKoKv2Awf0j7eubjRYZrQVypIKtmy8aC2
H5ETCyrbIubE06jX8EPO8LFxQ+T6nGD7kC8qJZL8z/aNVXGA2nRRCi7AzdE4o6Ht
0t6fC+W5vxJ4hQHYKb59nGvREMwpKSLg2U4wo1lyFvkDxEJ06DobGOKEtD333cI8
Mou/1UlSZ6RzgfwJNIPMMpCepIp2spaDeet0XVN+zqzxg55Jmk7LqpxP5pswTjLb
WsxErV9ZRXtwutCCf+IDoMCv/YS4g4ZG7CLKXQ4felKJVYIuiS4z0n659xRqLyyi
nW71vrRUdOBE3rCXUW6crZYwX/fHDvl6bsq9/h7cy2ZPnbGkVvXx+LIm0dJRenfb
MaxVM3CyrMnzL3UUk/caK/rVCOHrDD5q/dAtSNfizMWnqoX+gXby3ho6Zwn0Wj89
NiUZJIRI/s4V1WzMw4g+Daz7LUUwGblODTtphH2nnKRDfTiYXeT/r/waU6zUOVcS
7Jd285DF/tkQp2SJ3nvsM/ni7TD2UuG2BsKA3Urlht9i32lwyENeS3nNcx6aHo3i
blNpD+9mp3vZfWWZNVvLM/JldcIqEvd30+P6GWwS/Td8Zz4PYIM=
=9mwu
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-6.12-3' of git://git.linux-nfs.org/projects/anna/linux-nfs
Pull NFS client fixes from Anna Schumaker:
"These are mostly fixes that came up during the nfs bakeathon the other
week.
Stable Fixes:
- Fix KMSAN warning in decode_getfattr_attrs()
Other Bugfixes:
- Handle -ENOTCONN in xs_tcp_setup_socked()
- NFSv3: only use NFS timeout for MOUNT when protocols are compatible
- Fix attribute delegation behavior on exclusive create and a/mtime
changes
- Fix localio to cope with racing nfs_local_probe()
- Avoid i_lock contention in fs_clear_invalid_mapping()"
* tag 'nfs-for-6.12-3' of git://git.linux-nfs.org/projects/anna/linux-nfs:
nfs: avoid i_lock contention in nfs_clear_invalid_mapping
nfs_common: fix localio to cope with racing nfs_local_probe()
NFS: Further fixes to attribute delegation a/mtime changes
NFS: Fix attribute delegation behaviour on exclusive create
nfs: Fix KMSAN warning in decode_getfattr_attrs()
NFSv3: only use NFS timeout for MOUNT when protocols are compatible
sunrpc: handle -ENOTCONN in xs_tcp_setup_socket()
The commit 78ff640819 ("vfs: Convert tracefs to use the new mount API")
broke the gid setting when set by fstab or other mount utility.
It is ignored when it is set. Fix the code so that it recognises the
option again and will honor the settings on mount at boot up.
Update the internal documentation and create a selftest to make sure
it doesn't break again in the future.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZyuidRQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qsgQAQDuV0x4RLpCrrowDS/ITQw/eb/WjhR7
lhkXVROLN6RK6wD+JWmbaCP82q2S4A2Vx0Rjc72gUMmTzDb1HQflhQiLhwU=
=0dZF
-----END PGP SIGNATURE-----
Merge tag 'tracefs-v6.12-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracefs fixes from Steven Rostedt:
"Fix tracefs mount options.
Commit 78ff640819 ("vfs: Convert tracefs to use the new mount API")
broke the gid setting when set by fstab or other mount utility. It is
ignored when it is set. Fix the code so that it recognises the option
again and will honor the settings on mount at boot up.
Update the internal documentation and create a selftest to make sure
it doesn't break again in the future"
* tag 'tracefs-v6.12-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing/selftests: Add tracefs mount options test
tracing: Document tracefs gid mount option
tracing: Fix tracefs mount options
If Client send simultaneous SMB operations to ksmbd, It exhausts too much
memory through the "ksmbd_work_cache”. It will cause OOM issue.
ksmbd has a credit mechanism but it can't handle this problem. This patch
add the check if it exceeds max credits to prevent this problem by assuming
that one smb request consumes at least one credit.
Cc: stable@vger.kernel.org # v5.15+
Reported-by: Norbert Szetei <norbert@doyensec.com>
Tested-by: Norbert Szetei <norbert@doyensec.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
ksmbd_user_session_put should be called under smb3_preauth_hash_rsp().
It will avoid freeing session before calling smb3_preauth_hash_rsp().
Cc: stable@vger.kernel.org # v5.15+
Reported-by: Norbert Szetei <norbert@doyensec.com>
Tested-by: Norbert Szetei <norbert@doyensec.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
There is a race condition between ksmbd_smb2_session_create and
ksmbd_expire_session. This patch add missing sessions_table_lock
while adding/deleting session from global session table.
Cc: stable@vger.kernel.org # v5.15+
Reported-by: Norbert Szetei <norbert@doyensec.com>
Tested-by: Norbert Szetei <norbert@doyensec.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Multi-threaded buffered reads to the same file exposed significant
inode spinlock contention in nfs_clear_invalid_mapping().
Eliminate this spinlock contention by checking flags without locking,
instead using smp_rmb and smp_load_acquire accordingly, but then take
spinlock and double-check these inode flags.
Also refactor nfs_set_cache_invalid() slightly to use
smp_store_release() to pair with nfs_clear_invalid_mapping()'s
smp_load_acquire().
While this fix is beneficial for all multi-threaded buffered reads
issued by an NFS client, this issue was identified in the context of
surprisingly low LOCALIO performance with 4K multi-threaded buffered
read IO. This fix dramatically speeds up LOCALIO performance:
before: read: IOPS=1583k, BW=6182MiB/s (6482MB/s)(121GiB/20002msec)
after: read: IOPS=3046k, BW=11.6GiB/s (12.5GB/s)(232GiB/20001msec)
Fixes: 17dfeb9113 ("NFS: Fix races in nfs_revalidate_mapping")
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
Fix the possibility of racing nfs_local_probe() resulting in:
list_add double add: new=ffff8b99707f9f58, prev=ffff8b99707f9f58, next=ffffffffc0f30000.
------------[ cut here ]------------
kernel BUG at lib/list_debug.c:35!
Add nfs_uuid_init() to properly initialize all nfs_uuid_t members
(particularly its list_head).
Switch to returning bool from nfs_uuid_begin(), returns false if
nfs_uuid_t is already in-use (its list_head is on a list). Update
nfs_local_probe() to return early if the nfs_client's cl_uuid
(nfs_uuid_t) is in-use.
Also, switch nfs_uuid_begin() from using list_add_tail_rcu() to
list_add_tail() -- rculist was used in an earlier version of the
localio code that had a lockless nfs_uuid_lookup interface.
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
When asked to set both an atime and an mtime to the current system time,
ensure that the setting is atomic by calling inode_update_timestamps()
only once with the appropriate flags.
Fixes: e12912d941 ("NFSv4: Add support for delegated atime and mtime attributes")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
When the client does an exclusive create and the server decides to store
the verifier in the timestamps, a SETATTR is subsequently sent to fix up
those timestamps. When that is the case, suppress the exceptions for
attribute delegations in nfs4_bitmap_copy_adjust().
Fixes: 32215c1f89 ("NFSv4: Don't request atime/mtime/size if they are delegated to us")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
If a timeout is specified in the mount options, it currently applies to
both the NFS protocol and (with v3) the MOUNT protocol. This is
sensible when they both use the same underlying protocol, or those
protocols are compatible w.r.t timeouts as RDMA and TCP are.
However if, for example, NFS is using TCP and MOUNT is using UDP then
using the same timeout doesn't make much sense.
If you
mount -o vers=3,proto=tcp,mountproto=udp,timeo=600,retrans=5 \
server:/path /mountpoint
then the timeo=600 which was intended for the NFS/TCP request will
apply to the MOUNT/UDP requests with the result that there will only be
one request sent (because UDP has a maximum timeout of 60 seconds).
This is not what a reasonable person might expect.
This patch disables the sharing of timeout information in cases where
the underlying protocols are not compatible.
Fixes: c9301cb35b ("nfs: hornor timeo and retrans option when mounting NFSv3")
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
Recently, we got a customer report that CIFS triggers oops while
reconnecting to a server. [0]
The workload runs on Kubernetes, and some pods mount CIFS servers
in non-root network namespaces. The problem rarely happened, but
it was always while the pod was dying.
The root cause is wrong reference counting for network namespace.
CIFS uses kernel sockets, which do not hold refcnt of the netns that
the socket belongs to. That means CIFS must ensure the socket is
always freed before its netns; otherwise, use-after-free happens.
The repro steps are roughly:
1. mount CIFS in a non-root netns
2. drop packets from the netns
3. destroy the netns
4. unmount CIFS
We can reproduce the issue quickly with the script [1] below and see
the splat [2] if CONFIG_NET_NS_REFCNT_TRACKER is enabled.
When the socket is TCP, it is hard to guarantee the netns lifetime
without holding refcnt due to async timers.
Let's hold netns refcnt for each socket as done for SMC in commit
9744d2bf19 ("smc: Fix use-after-free in tcp_write_timer_handler().").
Note that we need to move put_net() from cifs_put_tcp_session() to
clean_demultiplex_info(); otherwise, __sock_create() still could touch a
freed netns while cifsd tries to reconnect from cifs_demultiplex_thread().
Also, maybe_get_net() cannot be put just before __sock_create() because
the code is not under RCU and there is a small chance that the same
address happened to be reallocated to another netns.
[0]:
CIFS: VFS: \\XXXXXXXXXXX has not responded in 15 seconds. Reconnecting...
CIFS: Serverclose failed 4 times, giving up
Unable to handle kernel paging request at virtual address 14de99e461f84a07
Mem abort info:
ESR = 0x0000000096000004
EC = 0x25: DABT (current EL), IL = 32 bits
SET = 0, FnV = 0
EA = 0, S1PTW = 0
FSC = 0x04: level 0 translation fault
Data abort info:
ISV = 0, ISS = 0x00000004
CM = 0, WnR = 0
[14de99e461f84a07] address between user and kernel address ranges
Internal error: Oops: 0000000096000004 [#1] SMP
Modules linked in: cls_bpf sch_ingress nls_utf8 cifs cifs_arc4 cifs_md4 dns_resolver tcp_diag inet_diag veth xt_state xt_connmark nf_conntrack_netlink xt_nat xt_statistic xt_MASQUERADE xt_mark xt_addrtype ipt_REJECT nf_reject_ipv4 nft_chain_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_comment nft_compat nf_tables nfnetlink overlay nls_ascii nls_cp437 sunrpc vfat fat aes_ce_blk aes_ce_cipher ghash_ce sm4_ce_cipher sm4 sm3_ce sm3 sha3_ce sha512_ce sha512_arm64 sha1_ce ena button sch_fq_codel loop fuse configfs dmi_sysfs sha2_ce sha256_arm64 dm_mirror dm_region_hash dm_log dm_mod dax efivarfs
CPU: 5 PID: 2690970 Comm: cifsd Not tainted 6.1.103-109.184.amzn2023.aarch64 #1
Hardware name: Amazon EC2 r7g.4xlarge/, BIOS 1.0 11/1/2018
pstate: 00400005 (nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : fib_rules_lookup+0x44/0x238
lr : __fib_lookup+0x64/0xbc
sp : ffff8000265db790
x29: ffff8000265db790 x28: 0000000000000000 x27: 000000000000bd01
x26: 0000000000000000 x25: ffff000b4baf8000 x24: ffff00047b5e4580
x23: ffff8000265db7e0 x22: 0000000000000000 x21: ffff00047b5e4500
x20: ffff0010e3f694f8 x19: 14de99e461f849f7 x18: 0000000000000000
x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
x14: 0000000000000000 x13: 0000000000000000 x12: 3f92800abd010002
x11: 0000000000000001 x10: ffff0010e3f69420 x9 : ffff800008a6f294
x8 : 0000000000000000 x7 : 0000000000000006 x6 : 0000000000000000
x5 : 0000000000000001 x4 : ffff001924354280 x3 : ffff8000265db7e0
x2 : 0000000000000000 x1 : ffff0010e3f694f8 x0 : ffff00047b5e4500
Call trace:
fib_rules_lookup+0x44/0x238
__fib_lookup+0x64/0xbc
ip_route_output_key_hash_rcu+0x2c4/0x398
ip_route_output_key_hash+0x60/0x8c
tcp_v4_connect+0x290/0x488
__inet_stream_connect+0x108/0x3d0
inet_stream_connect+0x50/0x78
kernel_connect+0x6c/0xac
generic_ip_connect+0x10c/0x6c8 [cifs]
__reconnect_target_unlocked+0xa0/0x214 [cifs]
reconnect_dfs_server+0x144/0x460 [cifs]
cifs_reconnect+0x88/0x148 [cifs]
cifs_readv_from_socket+0x230/0x430 [cifs]
cifs_read_from_socket+0x74/0xa8 [cifs]
cifs_demultiplex_thread+0xf8/0x704 [cifs]
kthread+0xd0/0xd4
Code: aa0003f8 f8480f13 eb18027f 540006c0 (b9401264)
[1]:
CIFS_CRED="/root/cred.cifs"
CIFS_USER="Administrator"
CIFS_PASS="Password"
CIFS_IP="X.X.X.X"
CIFS_PATH="//${CIFS_IP}/Users/Administrator/Desktop/CIFS_TEST"
CIFS_MNT="/mnt/smb"
DEV="enp0s3"
cat <<EOF > ${CIFS_CRED}
username=${CIFS_USER}
password=${CIFS_PASS}
domain=EXAMPLE.COM
EOF
unshare -n bash -c "
mkdir -p ${CIFS_MNT}
ip netns attach root 1
ip link add eth0 type veth peer veth0 netns root
ip link set eth0 up
ip -n root link set veth0 up
ip addr add 192.168.0.2/24 dev eth0
ip -n root addr add 192.168.0.1/24 dev veth0
ip route add default via 192.168.0.1 dev eth0
ip netns exec root sysctl net.ipv4.ip_forward=1
ip netns exec root iptables -t nat -A POSTROUTING -s 192.168.0.2 -o ${DEV} -j MASQUERADE
mount -t cifs ${CIFS_PATH} ${CIFS_MNT} -o vers=3.0,sec=ntlmssp,credentials=${CIFS_CRED},rsize=65536,wsize=65536,cache=none,echo_interval=1
touch ${CIFS_MNT}/a.txt
ip netns exec root iptables -t nat -D POSTROUTING -s 192.168.0.2 -o ${DEV} -j MASQUERADE
"
umount ${CIFS_MNT}
[2]:
ref_tracker: net notrefcnt@000000004bbc008d has 1/1 users at
sk_alloc (./include/net/net_namespace.h:339 net/core/sock.c:2227)
inet_create (net/ipv4/af_inet.c:326 net/ipv4/af_inet.c:252)
__sock_create (net/socket.c:1576)
generic_ip_connect (fs/smb/client/connect.c:3075)
cifs_get_tcp_session.part.0 (fs/smb/client/connect.c:3160 fs/smb/client/connect.c:1798)
cifs_mount_get_session (fs/smb/client/trace.h:959 fs/smb/client/connect.c:3366)
dfs_mount_share (fs/smb/client/dfs.c:63 fs/smb/client/dfs.c:285)
cifs_mount (fs/smb/client/connect.c:3622)
cifs_smb3_do_mount (fs/smb/client/cifsfs.c:949)
smb3_get_tree (fs/smb/client/fs_context.c:784 fs/smb/client/fs_context.c:802 fs/smb/client/fs_context.c:794)
vfs_get_tree (fs/super.c:1800)
path_mount (fs/namespace.c:3508 fs/namespace.c:3834)
__x64_sys_mount (fs/namespace.c:3848 fs/namespace.c:4057 fs/namespace.c:4034 fs/namespace.c:4034)
do_syscall_64 (arch/x86/entry/common.c:52 arch/x86/entry/common.c:83)
entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)
Fixes: 26abe14379 ("net: Modify sk_alloc to not reference count the netns of kernel sockets.")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Tom Talpey <tom@talpey.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
The usual collection of singletons - please see the changelogs.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZyfGDAAKCRDdBJ7gKXxA
jr19AQD6bfDF/6L2Alq1QG26pgrgccEbKzDSzR6pBajwCbdrNQD/XPhiv3zRJfGf
lgt0Qkqwe/ApBhVYUnL8y1CePv3EDgA=
=W5W0
-----END PGP SIGNATURE-----
Merge tag 'mm-hotfixes-stable-2024-11-03-10-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"17 hotfixes. 9 are cc:stable. 13 are MM and 4 are non-MM.
The usual collection of singletons - please see the changelogs"
* tag 'mm-hotfixes-stable-2024-11-03-10-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
mm: multi-gen LRU: use {ptep,pmdp}_clear_young_notify()
mm: multi-gen LRU: remove MM_LEAF_OLD and MM_NONLEAF_TOTAL stats
mm, mmap: limit THP alignment of anonymous mappings to PMD-aligned sizes
mm: shrinker: avoid memleak in alloc_shrinker_info
.mailmap: update e-mail address for Eugen Hristev
vmscan,migrate: fix page count imbalance on node stats when demoting pages
mailmap: update Jarkko's email addresses
mm: allow set/clear page_type again
nilfs2: fix potential deadlock with newly created symlinks
Squashfs: fix variable overflow in squashfs_readpage_block
kasan: remove vmalloc_percpu test
tools/mm: -Werror fixes in page-types/slabinfo
mm, swap: avoid over reclaim of full clusters
mm: fix PSWPIN counter for large folios swap-in
mm: avoid VM_BUG_ON when try to map an anon large folio to zero page.
mm/codetag: fix null pointer check logic for ref and tag
mm/gup: stop leaking pinned pages in low memory conditions
* fix a sysbot reported crash on filestreams
* Reduce cpu time spent searching for extents in
a very fragmented FS
* Check for delayed allocations before setting extsize
Signed-off-by: Carlos Maiolino <cem@kernel.org>
-----BEGIN PGP SIGNATURE-----
iJUEABMJAB0WIQQMHYkcUKcy4GgPe2RGdaER5QtfpgUCZyIMDwAKCRBGdaER5Qtf
pllxAYCkk+mtDTD5xBfOVGZWO5MMFz8HqYcro5wrSCzgL8HDmW29kXTBYFviGn3R
3l/H6BEBgOk0EkI5qGOzijpzbsWyJeLzPzZtxQFPD8zFBdxSERCtbpqFDLLvLQWG
M+TLhUNkPQ==
=kKX4
-----END PGP SIGNATURE-----
Merge tag 'xfs-6.12-fixes-6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fixes from Carlos Maiolino:
- fix a sysbot reported crash on filestreams
- Reduce cpu time spent searching for extents in a very fragmented FS
- Check for delayed allocations before setting extsize
* tag 'xfs-6.12-fixes-6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: streamline xfs_filestream_pick_ag
xfs: fix finding a last resort AG in xfs_filestream_pick_ag
xfs: Reduce unnecessary searches when searching for the best extents
xfs: Check for delayed allocations before setting extsize
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZyTGVAAKCRCRxhvAZXjc
oltEAP9r8cWa3Tdv8DzMNWu/jezTUXoW/mX5Qe+c1L6faqj0WQD/dIVtBtG37Tfq
3Ci9F/GEWjKijtCQ5lwMGUq27jQJ1gk=
=/0iA
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.12-rc6.iomap' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs
Pull iomap fixes from Christian Brauner:
"Fixes for iomap to prevent data corruption bugs in the fallocate
unshare range implementation of fsdax and a small cleanup to turn
iomap_want_unshare_iter() into an inline function"
* tag 'vfs-6.12-rc6.iomap' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs:
iomap: turn iomap_want_unshare_iter into an inline function
fsdax: dax_unshare_iter needs to copy entire blocks
fsdax: remove zeroing code from dax_unshare_iter
iomap: share iomap_unshare_iter predicate code with fsdax
xfs: don't allocate COW extents when unsharing a hole
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZyTGAQAKCRCRxhvAZXjc
opd6AQCal4omyfS8FYe4VRRZ/0XHouagq99I0U0TAmKkvoKAsgD/XrdE+pSTEkPX
Pv4T9phh1cZRxcyKVu77UoYkuHJEDAg=
=Lu9R
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.12-rc6.fixes' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs
Pull filesystem fixes from Christian Brauner:
"VFS:
- Fix copy_page_from_iter_atomic() if KMAP_LOCAL_FORCE_MAP=y is set
- Add a get_tree_bdev_flags() helper that allows to modify e.g.,
whether errors are logged into the filesystem context during
superblock creation. This is used by erofs to fix a userspace
regression where an error is currently logged when its used on a
regular file which is an new allowed mode in erofs.
netfs:
- Fix the sysfs debug path in the documentation.
- Fix iov_iter_get_pages*() for folio queues by skipping the page
extracation if we're at the end of a folio.
afs:
- Fix moving subdirectories to different parent directory.
autofs:
- Fix handling of AUTOFS_DEV_IOCTL_TIMEOUT_CMD ioctl in
validate_dev_ioctl(). The actual ioctl number, not the ioctl
command needs to be checked for autofs"
* tag 'vfs-6.12-rc6.fixes' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs:
iov_iter: fix copy_page_from_iter_atomic() if KMAP_LOCAL_FORCE_MAP
autofs: fix thinko in validate_dev_ioctl()
iov_iter: Fix iov_iter_get_pages*() for folio_queue
afs: Fix missing subdir edit when renamed between parent dirs
doc: correcting the debug path for cachefiles
erofs: use get_tree_bdev_flags() to avoid misleading messages
fs/super.c: introduce get_tree_bdev_flags()
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmck8eQACgkQxWXV+ddt
WDu05g/6AwrnvPkivC4iVOv4Wkzrpk4gm76smx91Y9B8tSDLI1pHaS27CvJz9iWl
vBKXPN3PQVQHwo6SPn+NjsFOSMkXlbBOVKpPU+MlZwH9Tuw66qcC+EnUCK2wEuAy
3TN7cUGIA4r/j+SkhgIz+Irlr5pjdb1KkPIMBEVGcVFqDIuvDaTEGBqTn2i/V5aa
dMn+gK+9rfngTOJ68t/pEFaX7SEWCvgMIcBpBB4/vs1gHm3ve2bcc1sBAdMxb1Se
SrxgZfq+Rc5tkMn540JaWGwkb0rLzwXlurK6ygTKDKCpH0IMX+pBvDkexh9Zj0ux
jejlRxiuDzTx3z2a7FjHDyp2sdZWMpq3sPsowpJ1Dsgi5EtSxTy4irmQuSAZY1Uj
/uo6YwV9aTGeiNDwZeKqKc/wOuAttaMZLr14s37pro9KxndFJ/XZBxeyB+euUCOw
B8AvAQVVIJAYQLyWINWruNKppqlgiO2RaN15RvvT2pX01d0TOx1KX1XFQku7YFxb
M/8ZNXzJ96XtkeyHL3euo3zj7N5jWtnCvPINugUG1ADQa+bc8aX336gld1neD6fs
QqIFIgzZG0l4N95viJilACrI6tW9zFnBqMyNFRhucKiX9aP9glOvhSfxfjcpDuQ/
i/LIyxVLwp8M3hPNvv8tC345+1C2ug9AD0OyhWjjIYPuiOxtTWs=
=alpB
-----END PGP SIGNATURE-----
Merge tag 'for-6.12-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"A few more stability fixes. There's one patch adding export of MIPS
cmpxchg helper, used in the error propagation fix.
- fix error propagation from split bios to the original btrfs bio
- fix merging of adjacent extents (normal operation, defragmentation)
- fix potential use after free after freeing btrfs device structures"
* tag 'for-6.12-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix defrag not merging contiguous extents due to merged extent maps
btrfs: fix extent map merging not happening for adjacent extents
btrfs: fix use-after-free of block device file in __btrfs_free_extra_devids()
btrfs: fix error propagation of split bios
MIPS: export __cmpxchg_small()
Various syzbot fixes, and the more notable ones:
- Fix for pointers in an extent overflowing the max (16) on a filesystem
with many devices: we were creating too many cached copies when moving
data around. Now, we only create at most one cached copy if there's a
promote target set.
Caching will be a bit broken for reflinked data until 6.13: I have
larger series queued up which significantly improves the plumbing for
data options down into the extent (bch_extent_rebalance) to fix this.
- Fix for deadlock on -ENOSPC on tiny filesystems
Allocation from the partial open_bucket list wasn't correctly
accounting partial open_buckets as free: this fixes the main cause of
tests timing out in the automated tests.
-----BEGIN PGP SIGNATURE-----
iQJOBAABCAA4FiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmckUrIaHGtlbnQub3Zl
cnN0cmVldEBsaW51eC5kZXYACgkQE6szbY3KbnaFYA//Qd9SD8v+ypavnaogWhqk
3bufCO8YJDV5DRQVuX/z36fia8zOKzWQGRAYvq0vF0mmgagwBE+AcBh6vvfDCxqZ
m1937IXcv/hHh2FFQau9gWEItTH9dGwyQjeDjB3xaTL5ZTGsAdA9558ygf8GAVOe
wD+W8Z8Qj09hAErnNS7y50t/PGbZDuG7AV2Dy2unp+fp6U0FVrZ3Z0bhFuhxcR7/
e3j49DoW4EZL7Gu1svn7nzehjWK4wx1wX7QhynPgSOVIhdj2Fc3XG76b3mBsuZF6
A/cBRKmSZsYL9MBK0vferqizqeuwlIJsvwpo/6zzukpyf8QOl+0IqPuAXFoz8vg3
vrdp9cdvzWvQNexTD2+7PYosCKoUswOvo0oIy8Iopkg4VGSreZib1sZeCPzw2FBK
AZcQaQSBLKojWpYsn9Dl2AlqEHHTvnopjr5wRXiimqKe/OcA3ugIvebUw2UE2ACp
/Z2ZQu615BtRYQM+dRIJJQ2CAy0F3EZxIXEXwc/yrH7kL2VBay8QCKp/k/9YYy4e
Nlxxw7alb/XGgT8GQgu24tho3yMKT621dLFOaAZ7x2HtLP8T56zL/L/wKWsocW/V
R8Kwqot6F1EVb3Q0LECUJottYQ+5I1Et7ZpVyOPxfqF1y7KOsuxKOmZFLO7i3Spc
fg0gOt/fyKrAF3zuSmWXne8=
=hzm/
-----END PGP SIGNATURE-----
Merge tag 'bcachefs-2024-10-31' of git://evilpiepirate.org/bcachefs
Pull bcachefs fixes from Kent Overstreet:
"Various syzbot fixes, and the more notable ones:
- Fix for pointers in an extent overflowing the max (16) on a
filesystem with many devices: we were creating too many cached
copies when moving data around. Now, we only create at most one
cached copy if there's a promote target set.
Caching will be a bit broken for reflinked data until 6.13: I have
larger series queued up which significantly improves the plumbing
for data options down into the extent (bch_extent_rebalance) to fix
this.
- Fix for deadlock on -ENOSPC on tiny filesystems
Allocation from the partial open_bucket list wasn't correctly
accounting partial open_buckets as free: this fixes the main cause
of tests timing out in the automated tests"
* tag 'bcachefs-2024-10-31' of git://evilpiepirate.org/bcachefs:
bcachefs: Fix NULL ptr dereference in btree_node_iter_and_journal_peek
bcachefs: fix possible null-ptr-deref in __bch2_ec_stripe_head_get()
bcachefs: Fix deadlock on -ENOSPC w.r.t. partial open buckets
bcachefs: Don't filter partial list buckets in open_buckets_to_text()
bcachefs: Don't keep tons of cached pointers around
bcachefs: init freespace inited bits to 0 in bch2_fs_initialize
bcachefs: Fix unhandled transaction restart in fallocate
bcachefs: Fix UAF in bch2_reconstruct_alloc()
bcachefs: fix null-ptr-deref in have_stripes()
bcachefs: fix shift oob in alloc_lru_idx_fragmentation
bcachefs: Fix invalid shift in validate_sb_layout()
Commit 78ff640819 ("vfs: Convert tracefs to use the new mount API")
converted tracefs to use the new mount APIs caused mount options
(e.g. gid=<gid>) to not take effect.
The tracefs superblock can be updated from multiple paths:
- on fs_initcall() to init_trace_printk_function_export()
- from a work queue to initialize eventfs
tracer_init_tracefs_work_func()
- fsconfig() syscall to mount or remount of tracefs
The tracefs superblock root inode gets created early on in
init_trace_printk_function_export().
With the new mount API, tracefs effectively uses get_tree_single() instead
of the old API mount_single().
Previously, mount_single() ensured that the options are always applied to
the superblock root inode:
(1) If the root inode didn't exist, call fill_super() to create it
and apply the options.
(2) If the root inode exists, call reconfigure_single() which
effectively calls tracefs_apply_options() to parse and apply
options to the subperblock's fs_info and inode and remount
eventfs (if necessary)
On the other hand, get_tree_single() effectively calls vfs_get_super()
which:
(3) If the root inode doesn't exists, calls fill_super() to create it
and apply the options.
(4) If the root inode already exists, updates the fs_context root
with the superblock's root inode.
(4) above is always the case for tracefs mounts, since the super block's
root inode will already be created by init_trace_printk_function_export().
This means that the mount options get ignored:
- Since it isn't applied to the superblock's root inode, it doesn't
get inherited by the children.
- Since eventfs is initialized from a separate work queue and
before call to mount with the options, and it doesn't get remounted
for mount.
Ensure that the mount options are applied to the super block and eventfs
is remounted to respect the mount options.
To understand this better, if fstab has the following:
tracefs /sys/kernel/tracing tracefs nosuid,nodev,noexec,gid=tracing 0 0
On boot up, permissions look like:
# ls -l /sys/kernel/tracing/trace
-rw-r----- 1 root root 0 Nov 1 08:37 /sys/kernel/tracing/trace
When it should look like:
# ls -l /sys/kernel/tracing/trace
-rw-r----- 1 root tracing 0 Nov 1 08:37 /sys/kernel/tracing/trace
Link: https://lore.kernel.org/r/536e99d3-345c-448b-adee-a21389d7ab4b@redhat.com/
Cc: Eric Sandeen <sandeen@redhat.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Ali Zahraee <ahzahraee@gmail.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: stable@vger.kernel.org
Fixes: 78ff640819 ("vfs: Convert tracefs to use the new mount API")
Link: https://lore.kernel.org/20241030171928.4168869-2-kaleshsingh@google.com
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
When running defrag (manual defrag) against a file that has extents that
are contiguous and we already have the respective extent maps loaded and
merged, we end up not defragging the range covered by those contiguous
extents. This happens when we have an extent map that was the result of
merging multiple extent maps for contiguous extents and the length of the
merged extent map is greater than or equals to the defrag threshold
length.
The script below reproduces this scenario:
$ cat test.sh
#!/bin/bash
DEV=/dev/sdi
MNT=/mnt/sdi
mkfs.btrfs -f $DEV
mount $DEV $MNT
# Create a 256K file with 4 extents of 64K each.
xfs_io -f -c "falloc 0 64K" \
-c "pwrite 0 64K" \
-c "falloc 64K 64K" \
-c "pwrite 64K 64K" \
-c "falloc 128K 64K" \
-c "pwrite 128K 64K" \
-c "falloc 192K 64K" \
-c "pwrite 192K 64K" \
$MNT/foo
umount $MNT
echo -n "Initial number of file extent items: "
btrfs inspect-internal dump-tree -t 5 $DEV | grep EXTENT_DATA | wc -l
mount $DEV $MNT
# Read the whole file in order to load and merge extent maps.
cat $MNT/foo > /dev/null
btrfs filesystem defragment -t 128K $MNT/foo
umount $MNT
echo -n "Number of file extent items after defrag with 128K threshold: "
btrfs inspect-internal dump-tree -t 5 $DEV | grep EXTENT_DATA | wc -l
mount $DEV $MNT
# Read the whole file in order to load and merge extent maps.
cat $MNT/foo > /dev/null
btrfs filesystem defragment -t 256K $MNT/foo
umount $MNT
echo -n "Number of file extent items after defrag with 256K threshold: "
btrfs inspect-internal dump-tree -t 5 $DEV | grep EXTENT_DATA | wc -l
Running it:
$ ./test.sh
Initial number of file extent items: 4
Number of file extent items after defrag with 128K threshold: 4
Number of file extent items after defrag with 256K threshold: 4
The 4 extents don't get merged because we have an extent map with a size
of 256K that is the result of merging the individual extent maps for each
of the four 64K extents and at defrag_lookup_extent() we have a value of
zero for the generation threshold ('newer_than' argument) since this is a
manual defrag. As a consequence we don't call defrag_get_extent() to get
an extent map representing a single file extent item in the inode's
subvolume tree, so we end up using the merged extent map at
defrag_collect_targets() and decide not to defrag.
Fix this by updating defrag_lookup_extent() to always discard extent maps
that were merged and call defrag_get_extent() regardless of the minimum
generation threshold ('newer_than' argument).
A test case for fstests will be sent along soon.
CC: stable@vger.kernel.org # 6.1+
Fixes: 199257a78b ("btrfs: defrag: don't use merged extent map for their generation check")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we have 3 or more adjacent extents in a file, that is, consecutive file
extent items pointing to adjacent extents, within a contiguous file range
and compatible flags, we end up not merging all the extents into a single
extent map.
For example:
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt/sdc
$ xfs_io -f -d -c "pwrite -b 64K 0 64K" \
-c "pwrite -b 64K 64K 64K" \
-c "pwrite -b 64K 128K 64K" \
-c "pwrite -b 64K 192K 64K" \
/mnt/sdc/foo
After all the ordered extents complete we unpin the extent maps and try
to merge them, but instead of getting a single extent map we get two
because:
1) When the first ordered extent completes (file range [0, 64K)) we
unpin its extent map and attempt to merge it with the extent map for
the range [64K, 128K), but we can't because that extent map is still
pinned;
2) When the second ordered extent completes (file range [64K, 128K)), we
unpin its extent map and merge it with the previous extent map, for
file range [0, 64K), but we can't merge with the next extent map, for
the file range [128K, 192K), because this one is still pinned.
The merged extent map for the file range [0, 128K) gets the flag
EXTENT_MAP_MERGED set;
3) When the third ordered extent completes (file range [128K, 192K)), we
unpin its extent map and attempt to merge it with the previous extent
map, for file range [0, 128K), but we can't because that extent map
has the flag EXTENT_MAP_MERGED set (mergeable_maps() returns false
due to different flags) while the extent map for the range [128K, 192K)
doesn't have that flag set.
We also can't merge it with the next extent map, for file range
[192K, 256K), because that one is still pinned.
At this moment we have 3 extent maps:
One for file range [0, 128K), with the flag EXTENT_MAP_MERGED set.
One for file range [128K, 192K).
One for file range [192K, 256K) which is still pinned;
4) When the fourth and final extent completes (file range [192K, 256K)),
we unpin its extent map and attempt to merge it with the previous
extent map, for file range [128K, 192K), which succeeds since none
of these extent maps have the EXTENT_MAP_MERGED flag set.
So we end up with 2 extent maps:
One for file range [0, 128K), with the flag EXTENT_MAP_MERGED set.
One for file range [128K, 256K), with the flag EXTENT_MAP_MERGED set.
Since after merging extent maps we don't attempt to merge again, that
is, merge the resulting extent map with the one that is now preceding
it (and the one following it), we end up with those two extent maps,
when we could have had a single extent map to represent the whole file.
Fix this by making mergeable_maps() ignore the EXTENT_MAP_MERGED flag.
While this doesn't present any functional issue, it prevents the merging
of extent maps which allows to save memory, and can make defrag not
merging extents too (that will be addressed in the next patch).
Fixes: 199257a78b ("btrfs: defrag: don't use merged extent map for their generation check")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Syzbot reported that page_symlink(), called by nilfs_symlink(), triggers
memory reclamation involving the filesystem layer, which can result in
circular lock dependencies among the reader/writer semaphore
nilfs->ns_segctor_sem, s_writers percpu_rwsem (intwrite) and the
fs_reclaim pseudo lock.
This is because after commit 21fc61c73c ("don't put symlink bodies in
pagecache into highmem"), the gfp flags of the page cache for symbolic
links are overwritten to GFP_KERNEL via inode_nohighmem().
This is not a problem for symlinks read from the backing device, because
the __GFP_FS flag is dropped after inode_nohighmem() is called. However,
when a new symlink is created with nilfs_symlink(), the gfp flags remain
overwritten to GFP_KERNEL. Then, memory allocation called from
page_symlink() etc. triggers memory reclamation including the FS layer,
which may call nilfs_evict_inode() or nilfs_dirty_inode(). And these can
cause a deadlock if they are called while nilfs->ns_segctor_sem is held:
Fix this issue by dropping the __GFP_FS flag from the page cache GFP flags
of newly created symlinks in the same way that nilfs_new_inode() and
__nilfs_read_inode() do, as a workaround until we adopt nofs allocation
scope consistently or improve the locking constraints.
Link: https://lkml.kernel.org/r/20241020050003.4308-1-konishi.ryusuke@gmail.com
Fixes: 21fc61c73c ("don't put symlink bodies in pagecache into highmem")
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Reported-by: syzbot+9ef37ac20608f4836256@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=9ef37ac20608f4836256
Tested-by: syzbot+9ef37ac20608f4836256@syzkaller.appspotmail.com
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Syzbot reports a slab out of bounds access in squashfs_readpage_block().
This is caused by an attempt to read page index 0x2000000000. This value
(start_index) is stored in an integer loop variable which overflows
producing a value of 0. This causes a loop which iterates over pages
start_index -> end_index to iterate over 0 -> end_index, which ultimately
causes an out of bounds page array access.
Fix by changing variable to a loff_t, and rename to index to make it
clearer it is a page index, and not a loop count.
Link: https://lkml.kernel.org/r/20241020232200.837231-1-phillip@squashfs.org.uk
Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk>
Reported-by: "Lai, Yi" <yi1.lai@linux.intel.com>
Closes: https://lore.kernel.org/all/ZwzcnCAosIPqQ9Ie@ly-workstation/
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The error flow in nfsd4_copy() calls cleanup_async_copy(), which
already decrements nn->pending_async_copies.
Reported-by: Olga Kornievskaia <okorniev@redhat.com>
Fixes: aadc3bbea1 ("NFSD: Limit the number of concurrent async COPY operations")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Directly return the error from xfs_bmap_longest_free_extent instead
of breaking from the loop and handling it there, and use a done
label to directly jump to the exist when we found a suitable perag
structure to reduce the indentation level and pag/max_pag check
complexity in the tail of the function.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
When the main loop in xfs_filestream_pick_ag fails to find a suitable
AG it tries to just pick the online AG. But the loop for that uses
args->pag as loop iterator while the later code expects pag to be
set. Fix this by reusing the max_pag case for this last resort, and
also add a check for impossible case of no AG just to make sure that
the uninitialized pag doesn't even escape in theory.
Reported-by: syzbot+4125a3c514e3436a02e6@syzkaller.appspotmail.com
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: syzbot+4125a3c514e3436a02e6@syzkaller.appspotmail.com
Fixes: f8f1ed1ab3 ("xfs: return a referenced perag from filestreams allocator")
Cc: <stable@vger.kernel.org> # v6.3
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Recently, we found that the CPU spent a lot of time in
xfs_alloc_ag_vextent_size when the filesystem has millions of fragmented
spaces.
The reason is that we conducted much extra searching for extents that
could not yield a better result, and these searches would cost a lot of
time when there were millions of extents to search through. Even if we
get the same result length, we don't switch our choice to the new one,
so we can definitely terminate the search early.
Since the result length cannot exceed the found length, when the found
length equals the best result length we already have, we can conclude
the search.
We did a test in that filesystem:
[root@localhost ~]# xfs_db -c freesp /dev/vdb
from to extents blocks pct
1 1 215 215 0.01
2 3 994476 1988952 99.99
Before this patch:
0) | xfs_alloc_ag_vextent_size [xfs]() {
0) * 15597.94 us | }
After this patch:
0) | xfs_alloc_ag_vextent_size [xfs]() {
0) 19.176 us | }
Signed-off-by: Chi Zhiling <chizhiling@kylinos.cn>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Extsize should only be allowed to be set on files with no data in it.
For this, we check if the files have extents but miss to check if
delayed extents are present. This patch adds that check.
While we are at it, also refactor this check into a helper since
it's used in some other places as well like xfs_inactive() or
xfs_ioctl_setattr_xflags()
**Without the patch (SUCCEEDS)**
$ xfs_io -c 'open -f testfile' -c 'pwrite 0 1024' -c 'extsize 65536'
wrote 1024/1024 bytes at offset 0
1 KiB, 1 ops; 0.0002 sec (4.628 MiB/sec and 4739.3365 ops/sec)
**With the patch (FAILS as expected)**
$ xfs_io -c 'open -f testfile' -c 'pwrite 0 1024' -c 'extsize 65536'
wrote 1024/1024 bytes at offset 0
1 KiB, 1 ops; 0.0002 sec (4.628 MiB/sec and 4739.3365 ops/sec)
xfs_io: FS_IOC_FSSETXATTR testfile: Invalid argument
Fixes: e94af02a9c ("[XFS] fix old xfs_setattr mis-merge from irix; mostly harmless esp if not using xfs rt")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Mounting btrfs from two images (which have the same one fsid and two
different dev_uuids) in certain executing order may trigger an UAF for
variable 'device->bdev_file' in __btrfs_free_extra_devids(). And
following are the details:
1. Attach image_1 to loop0, attach image_2 to loop1, and scan btrfs
devices by ioctl(BTRFS_IOC_SCAN_DEV):
/ btrfs_device_1 → loop0
fs_device
\ btrfs_device_2 → loop1
2. mount /dev/loop0 /mnt
btrfs_open_devices
btrfs_device_1->bdev_file = btrfs_get_bdev_and_sb(loop0)
btrfs_device_2->bdev_file = btrfs_get_bdev_and_sb(loop1)
btrfs_fill_super
open_ctree
fail: btrfs_close_devices // -ENOMEM
btrfs_close_bdev(btrfs_device_1)
fput(btrfs_device_1->bdev_file)
// btrfs_device_1->bdev_file is freed
btrfs_close_bdev(btrfs_device_2)
fput(btrfs_device_2->bdev_file)
3. mount /dev/loop1 /mnt
btrfs_open_devices
btrfs_get_bdev_and_sb(&bdev_file)
// EIO, btrfs_device_1->bdev_file is not assigned,
// which points to a freed memory area
btrfs_device_2->bdev_file = btrfs_get_bdev_and_sb(loop1)
btrfs_fill_super
open_ctree
btrfs_free_extra_devids
if (btrfs_device_1->bdev_file)
fput(btrfs_device_1->bdev_file) // UAF !
Fix it by setting 'device->bdev_file' as 'NULL' after closing the
btrfs_device in btrfs_close_one_device().
Fixes: 1423881941 ("btrfs: do not background blkdev_put()")
CC: stable@vger.kernel.org # 4.19+
Link: https://bugzilla.kernel.org/show_bug.cgi?id=219408
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Ensure the refcount and async_copies fields are initialized early.
cleanup_async_copy() will reference these fields if an error occurs
in nfsd4_copy(). If they are not correctly initialized, at the very
least, a refcount underflow occurs.
Reported-by: Olga Kornievskaia <okorniev@redhat.com>
Fixes: aadc3bbea1 ("NFSD: Limit the number of concurrent async COPY operations")
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Tested-by: Olga Kornievskaia <okorniev@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Add NULL check for key returned from bch2_btree_and_journal_iter_peek in
btree_node_iter_and_journal_peek to avoid NULL ptr dereference in
bch2_bkey_buf_reassemble.
When key returned from bch2_btree_and_journal_iter_peek is NULL it means
that btree topology needs repair. Print topology error message with
position at which node wasn't found, its parent node information and
btree_id with level.
Return error code returned by bch2_topology_error to ensure that topology
error is handled properly by recovery.
Reported-by: syzbot+005ef9aa519f30d97657@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=005ef9aa519f30d97657
Fixes: 5222a4607c ("bcachefs: BTREE_ITER_WITH_JOURNAL")
Suggested-by: Alan Huang <mmpgouride@gmail.com>
Suggested-by: Kent Overstreet <kent.overstreet@linux.dev>
Signed-off-by: Piotr Zalewski <pZ010001011111@proton.me>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The function ec_new_stripe_head_alloc() returns nullptr if kzalloc()
fails. It is crucial to verify its return value before dereferencing
it to avoid a potential nullptr dereference.
Fixes: 035d72f72c ("bcachefs: bch2_ec_stripe_head_get() now checks for change in rw devices")
Signed-off-by: Gaosheng Cui <cuigaosheng1@huawei.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Open buckets on the partial list should not count as allocated when
we're trying to allocate from the partial list.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We had a bug report where the data update path was creating an extent
that failed to validate because it had too many pointers; almost all of
them were cached.
To fix this, we have:
- want_cached_ptr(), a new helper that checks if we even want a cached
pointer (is on appropriate target, device is readable).
- bch2_extent_set_ptr_cached() now only sets a pointer cached if we want
it.
- bch2_extent_normalize_by_opts() now ensures that we only have a single
cached pointer that we want.
While working on this, it was noticed that this doesn't work well with
reflinked data and per-file options. Another patch series is coming that
plumbs through additional io path options through bch_extent_rebalance,
with improved option handling.
Reported-by: Reed Riley <reed@riley.engineer>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Initialize freespace_initialized bits to 0 in member's flags and update
member's cached version for each device in bch2_fs_initialize.
It's possible for the bits to be set to 1 before fs is initialized and if
call to bch2_trans_mark_dev_sbs (just before bch2_fs_freespace_init) fails
bits remain to be 1 which can later indirectly trigger BUG condition in
bch2_bucket_alloc_freelist during shutdown.
Reported-by: syzbot+2b6a17991a6af64f9489@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=2b6a17991a6af64f9489
Fixes: bbe682c767 ("bcachefs: Ensure devices are always correctly initialized")
Suggested-by: Kent Overstreet <kent.overstreet@linux.dev>
Signed-off-by: Piotr Zalewski <pZ010001011111@proton.me>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Syzbot reported that in directory operations after nilfs2 detects
filesystem corruption and degrades to read-only,
__block_write_begin_int(), which is called to prepare block writes, may
fail the BUG_ON check for accesses exceeding the folio/page size,
triggering a kernel bug.
This was found to be because the "checked" flag of a page/folio was not
cleared when it was discarded by nilfs2's own routine, which causes the
sanity check of directory entries to be skipped when the directory
page/folio is reloaded. So, fix that.
This was necessary when the use of nilfs2's own page discard routine was
applied to more than just metadata files.
Link: https://lkml.kernel.org/r/20241017193359.5051-1-konishi.ryusuke@gmail.com
Fixes: 8c26c4e269 ("nilfs2: fix issue with flush kernel thread after remount in RO mode because of driver's internal error or metadata corruption")
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Reported-by: syzbot+d6ca2daf692c7a82f959@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=d6ca2daf692c7a82f959
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Syzbot reported a kernel BUG in ocfs2_truncate_inline. There are two
reasons for this: first, the parameter value passed is greater than
ocfs2_max_inline_data_with_xattr, second, the start and end parameters of
ocfs2_truncate_inline are "unsigned int".
So, we need to add a sanity check for byte_start and byte_len right before
ocfs2_truncate_inline() in ocfs2_remove_inode_range(), if they are greater
than ocfs2_max_inline_data_with_xattr return -EINVAL.
Link: https://lkml.kernel.org/r/tencent_D48DB5122ADDAEDDD11918CFB68D93258C07@qq.com
Fixes: 1afc32b952 ("ocfs2: Write support for inline data")
Signed-off-by: Edward Adam Davis <eadavis@qq.com>
Reported-by: syzbot+81092778aac03460d6b7@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=81092778aac03460d6b7
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "fork: do not expose incomplete mm on fork".
During fork we may place the virtual memory address space into an
inconsistent state before the fork operation is complete.
In addition, we may encounter an error during the fork operation that
indicates that the virtual memory address space is invalidated.
As a result, we should not be exposing it in any way to external machinery
that might interact with the mm or VMAs, machinery that is not designed to
deal with incomplete state.
We specifically update the fork logic to defer khugepaged and ksm to the
end of the operation and only to be invoked if no error arose, and
disallow uffd from observing fork events should an error have occurred.
This patch (of 2):
Currently on fork we expose the virtual address space of a process to
userland unconditionally if uffd is registered in VMAs, regardless of
whether an error arose in the fork.
This is performed in dup_userfaultfd_complete() which is invoked
unconditionally, and performs two duties - invoking registered handlers
for the UFFD_EVENT_FORK event via dup_fctx(), and clearing down
userfaultfd_fork_ctx objects established in dup_userfaultfd().
This is problematic, because the virtual address space may not yet be
correctly initialised if an error arose.
The change in commit d240629148 ("fork: use __mt_dup() to duplicate
maple tree in dup_mmap()") makes this more pertinent as we may be in a
state where entries in the maple tree are not yet consistent.
We address this by, on fork error, ensuring that we roll back state that
we would otherwise expect to clean up through the event being handled by
userland and perform the memory freeing duty otherwise performed by
dup_userfaultfd_complete().
We do this by implementing a new function, dup_userfaultfd_fail(), which
performs the same loop, only decrementing reference counts.
Note that we perform mmgrab() on the parent and child mm's, however
userfaultfd_ctx_put() will mmdrop() this once the reference count drops to
zero, so we will avoid memory leaks correctly here.
Link: https://lkml.kernel.org/r/cover.1729014377.git.lorenzo.stoakes@oracle.com
Link: https://lkml.kernel.org/r/d3691d58bb58712b6fb3df2be441d175bd3cdf07.1729014377.git.lorenzo.stoakes@oracle.com
Fixes: d240629148 ("fork: use __mt_dup() to duplicate maple tree in dup_mmap()")
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reported-by: Jann Horn <jannh@google.com>
Reviewed-by: Jann Horn <jannh@google.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Linus Torvalds <torvalds@linuxfoundation.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
I was so sure the per-dentry expire timeout patch worked ok but my
testing was flawed.
In validate_dev_ioctl() the check for ioctl AUTOFS_DEV_IOCTL_TIMEOUT_CMD
should use the ioctl number not the passed in ioctl command.
Fixes: 433f9d76a0 ("autofs: add per dentry expire timeout")
Cc: <stable@vger.kernel.org> # mainline only
Signed-off-by: Ian Kent <raven@themaw.net>
Link: https://lore.kernel.org/r/20241027224732.5507-1-raven@themaw.net
Signed-off-by: Christian Brauner <brauner@kernel.org>
xa_store() can fail, it return xa_err(-EINVAL) if the entry cannot
be stored in an XArray, or xa_err(-ENOMEM) if memory allocation failed,
so check error for xa_store() to fix it.
Cc: stable@vger.kernel.org
Fixes: b685757c7b ("ksmbd: Implements sess->rpc_handle_list as xarray")
Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
* fix recovery of allocator ops after a growfs
* Do not fail repairs on metadata files with no attr fork
Signed-off-by: Carlos Maiolino <cem@kernel.org>
-----BEGIN PGP SIGNATURE-----
iJUEABMJAB0WIQQMHYkcUKcy4GgPe2RGdaER5QtfpgUCZxjo7gAKCRBGdaER5Qtf
pr26AYCUc9+Vlg5iReesrghYHJgeCaMYZm2i4WdNdI+BO8d+5+AA1oUO55ib3xWd
fX8A0MEBf32eeMR0E+K0NeKsmHnbGHXyWRg/27IlNRniL4/yldssEFB8X3b7Gkw5
/geUVdz99A==
=+NGs
-----END PGP SIGNATURE-----
Merge tag 'xfs-6.12-fixes-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fixes from Carlos Maiolino:
- Fix recovery of allocator ops after a growfs
- Do not fail repairs on metadata files with no attr fork
* tag 'xfs-6.12-fixes-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: update the pag for the last AG at recovery time
xfs: don't use __GFP_RETRY_MAYFAIL in xfs_initialize_perag
xfs: error out when a superblock buffer update reduces the agcount
xfs: update the file system geometry after recoverying superblock buffers
xfs: merge the perag freeing helpers
xfs: pass the exact range to initialize to xfs_initialize_perag
xfs: don't fail repairs on metadata files with no attr fork
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE/IPbcYBuWt0zoYhOq06b7GqY5nAFAmccC7UACgkQq06b7GqY
5nBYbA/7BP/Me80+ofClszxd0F3nlXCGJPe31vC36EeFsap5p1TrIM7jvj0iR0Zr
HXLxyTwocrubFgSE28rUrSS2nhmk016g3EUEVeoaO+hYSbvtdrRhoTSCubEwb4N5
68vUz1ebGqZ35oT9oVZeAd+xdwxpKPktqrN6zCJsyWbSiZevE8YMkV+dgYrGde+Y
j7BMypcxtQ2+u8donJ1PnGz04k0RNhqcqGGED0VuNw1Sp6quY8n0EWvHEeYjrb+N
hFYB1lSeA245fVm8bmnUFHLJ2w47bxfJZ8hANbO9C0zLI98vlmH6xItXQnGofc4P
1VtCHxoZnIFMMjK/iTPjvi4JsanNr0mS1Aa5w8HAHqyHYeIoBCpn78XFf79IhFAl
pO3RV6nxE8fELXBM39Yyq7I3rQfyFmNBeob0UWIbazQ7cC54xBHpu0RDMj2gD6xy
TBbZL2euRfGEswKWfrdR32U1jzSCv+jTAWew9sdx+z0RX9inRyaNBHFVsjOnVFZS
iLJn20cotgw0WU2OhghWvgsV4o/Ckx4eJAC9oNxZ9pT1s7BqSvW1qg/Zlpn9wYgf
UGQGXrCdXGY6XZ3W53c2joS+QWtmU6N8mAC6jCeqQm8pCQQeiTHE1QycyNc0xPGd
ELhQYzCA1hEiNDki3e/vqzbPScpAer9dc/6hMKUT55SVfjcCBoc=
=O6o1
-----END PGP SIGNATURE-----
Merge tag '9p-for-6.12-rc5' of https://github.com/martinetd/linux
Pull more 9p reverts from Dominique Martinet:
"Revert patches causing inode collision problems.
The code simplification introduced significant regressions on servers
that do not remap inode numbers when exporting multiple underlying
filesystems with colliding inodes. See the top-most revert (commit
be2ca38253) for details.
This problem had been ignored for too long and the reverts will also
head to stable (6.9+).
I'm confident this set of patches gets us back to previous behaviour
(another related patch had already been reverted back in April and
we're almost back to square 1, and the rest didn't touch inode
lifecycle)"
* tag '9p-for-6.12-rc5' of https://github.com/martinetd/linux:
Revert "fs/9p: simplify iget to remove unnecessary paths"
Revert "fs/9p: fix uaf in in v9fs_stat2inode_dotl"
Revert "fs/9p: remove redundant pointer v9ses"
Revert " fs/9p: mitigate inode collisions"
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCZxu02AAKCRDh3BK/laaZ
PLDHAPwMzz4c+wbqz8Qo2IEo3lxvgPjgzMNXQetCgZFKvxKRlwD+PaIeRixGwwmB
ON3IsScZjROphzb+ofroUpj7lLEM7Ag=
=9rYN
-----END PGP SIGNATURE-----
Merge tag 'fuse-fixes-6.12-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
Pull fuse fixes from Miklos Szeredi:
- Fix cached size after passthrough writes
This fix needed a trivial change in the backing-file API, which
resulted in some non-fuse files being touched.
- Revert a commit meant as a cleanup but which triggered a WARNING
- Remove a stray debug line left-over
* tag 'fuse-fixes-6.12-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
fuse: remove stray debug line
Revert "fuse: move initialization of fuse_file to fuse_writepages() instead of in callback"
fuse: update inode size after extending passthrough write
fs: pass offset and result to backing_file end_write() callback
- Fix a couple of use-after-free bugs
-----BEGIN PGP SIGNATURE-----
iQIyBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmcbp9YACgkQM2qzM29m
f5dzAg/4lCFRbsebia7qktW88ZDBbAoZyKolk2DHfjZXCuq5DHb/5I/Hk1rGyTYs
VaJmCU59ZdpyBFSdQhOYKf2xvgNPvJG02U8il5KWtMAY5cStXFjeU0FSDoC5O4Dl
9IoaVbtAWGMCjxWJ1WEGpU82JoM7moSVB4G718LlxF+4cUS7idq5se0uK31WQvft
DmJsOfmnch1Y/7+RRcDwbBu0HwP2ZQHS8zMYMQ2JGXPDZJFenTibezVb36YyzyZn
WQfvaW6MmdiVL9omZxvURL9WuBKA2L+Ly/92PyHaflcAXSngcpfu28IIQbzp/m7K
JT/3ad32lB7F3LrDP5I4gwVh8oGLYiI5r7RBWo0e98LPvAR/89gBVdZHhjasstCh
nAL7Kk6P/jQdbM/KR9T+yS7xTVScI5Wp4Xcitz2mlHgU4br67GO9gpo1e/tqXenm
Gasapkg5qCduz+ksj2vwpeFXKQi+qwJgfVGKMxELoV8qTazyr09Dfgouqe045ztl
/0khkOrLkw0bYLDNJKhj/XG0ZEV5V10c/0PEnivC2BHVQioBIDRQAaH2S2G/8vQH
MdWayGhNlTV0g4DdtMVPxf5uN+uQmfMsj5BIe17NIUYxiJnw0CSM/bzHUcJ15+DH
nTblaek6sa5BFpZ2fed8by4il4/tme3NOJEeHnSqe/3QKyZgOQ==
=1YOr
-----END PGP SIGNATURE-----
Merge tag 'nfsd-6.12-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Pull nfsd fixes from Chuck Lever:
- Fix a couple of use-after-free bugs
* tag 'nfsd-6.12-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
nfsd: cancel nfsd_shrinker_work using sync mode in nfs4_state_shutdown_net
nfsd: fix race between laundromat and free_stateid
c->btree_roots_known[i].b can be NULL. In this case, a NULL pointer dereference
occurs, so you need to add code to check the variable.
Reported-by: syzbot+b468b9fef56949c3b528@syzkaller.appspotmail.com
Fixes: 7773df19c3 ("bcachefs: metadata version bucket_stripe_sectors")
Signed-off-by: Jeongjun Park <aha310510@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The size of a.data_type is set abnormally large, causing shift-out-of-bounds.
To fix this, we need to add validation on a.data_type in
alloc_lru_idx_fragmentation().
Reported-by: syzbot+7f45fa9805c40db3f108@syzkaller.appspotmail.com
Fixes: 260af1562e ("bcachefs: Kill alloc_v4.fragmentation_lru")
Signed-off-by: Jeongjun Park <aha310510@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This reverts commit 724a08450f.
This code simplification introduced significant regressions on servers
that do not remap inode numbers when exporting multiple underlying
filesystems with colliding inodes, as can be illustrated with simple
tmpfs exports in qemu with remapping disabled:
```
# host side
cd /tmp/linux-test
mkdir m1 m2
mount -t tmpfs tmpfs m1
mount -t tmpfs tmpfs m2
mkdir m1/dir m2/dir
echo foo > m1/dir/foo
echo bar > m2/dir/bar
# guest side
# started with -virtfs local,path=/tmp/linux-test,mount_tag=tmp,security_model=mapped-file
mount -t 9p -o trans=virtio,debug=1 tmp /mnt/t
ls /mnt/t/m1/dir
# foo
ls /mnt/t/m2/dir
# bar (works ok if directry isn't open)
# cd to keep first dir's inode alive
cd /mnt/t/m1/dir
ls /mnt/t/m2/dir
# foo (should be bar)
```
Other examples can be crafted with regular files with fscache enabled,
in which case I/Os just happen to the wrong file leading to
corruptions, or guest failing to boot with:
| VFS: Lookup of 'com.android.runtime' in 9p 9p would have caused loop
In theory, we'd want the servers to be smart enough and ensure they
never send us two different files with the same 'qid.path', but while
qemu has an option to remap that is recommended (and qemu prints a
warning if this case happens), there are many other servers which do
not (kvmtool, nfs-ganesha, probably diod...), we should at least ensure
we don't cause regressions on this:
- assume servers can't be trusted and operations that should get a 'new'
inode properly do so. commit d05dcfdf5e (" fs/9p: mitigate inode
collisions") attempted to do this, but v9fs_fid_iget_dotl() was not
called so some higher level of caching got in the way; this needs to be
fixed properly before we can re-apply the patches.
- if we ever want to really simplify this code, we will need to add some
negotiation with the server at mount time where the server could claim
they handle this properly, at which point we could optimize this out.
(but that might not be needed at all if we properly handle the 'new'
check?)
Fixes: 724a08450f ("fs/9p: simplify iget to remove unnecessary paths")
Reported-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/all/20240408141436.GA17022@redhat.com/
Link: https://lkml.kernel.org/r/20240923100508.GA32066@willie-the-truck
Cc: stable@vger.kernel.org # v6.9+
Message-ID: <20241024-revert_iget-v1-4-4cac63d25f72@codewreck.org>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
This reverts commit 11763a8598.
This is a requirement to revert commit 724a08450f ("fs/9p: simplify
iget to remove unnecessary paths"), see that revert for details.
Fixes: 724a08450f ("fs/9p: simplify iget to remove unnecessary paths")
Reported-by: Will Deacon <will@kernel.org>
Link: https://lkml.kernel.org/r/20240923100508.GA32066@willie-the-truck
Cc: stable@vger.kernel.org # v6.9+
Message-ID: <20241024-revert_iget-v1-3-4cac63d25f72@codewreck.org>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
This reverts commit 10211b4a23.
This is a requirement to revert commit 724a08450f ("fs/9p: simplify
iget to remove unnecessary paths"), see that revert for details.
Fixes: 724a08450f ("fs/9p: simplify iget to remove unnecessary paths")
Reported-by: Will Deacon <will@kernel.org>
Link: https://lkml.kernel.org/r/20240923100508.GA32066@willie-the-truck
Cc: stable@vger.kernel.org # v6.9+
Message-ID: <20241024-revert_iget-v1-2-4cac63d25f72@codewreck.org>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
This reverts commit d05dcfdf5e.
This is a requirement to revert commit 724a08450f ("fs/9p: simplify
iget to remove unnecessary paths"), see that revert for details.
Fixes: 724a08450f ("fs/9p: simplify iget to remove unnecessary paths")
Reported-by: Will Deacon <will@kernel.org>
Link: https://lkml.kernel.org/r/20240923100508.GA32066@willie-the-truck
Cc: stable@vger.kernel.org # v6.9+
Message-ID: <20241024-revert_iget-v1-1-4cac63d25f72@codewreck.org>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmcZGqsACgkQxWXV+ddt
WDsZTg//dSLAAswT3dTAvWDt7rGxPhJC1V+gnzWdj/0Q3CUimNaw7zHp1QHtJDFT
S+3TVvJJ2JDvOnbi3u24s/bL5YdkWcvIyy87oVE4trJMbQPc/E45pkYFqF3TRQAH
wEYAVEnO9f9WUY+ekxX2XBzhmKp3xol93j9BBHDXOF9kHsDC5lI0D5YVYEhRY8Qu
D4GcqAhqUj5Hj6I6ppiqO47NCJBNzw1Se9QsgruPpmItRbB0/LYJFhUfevwosTKg
xVpkRVntFqQjkIdIdBfBv/ZGWfxJyM7K4M49QwLUqUQfxugu7BiGYuEjkBkiy07a
pZDEOF9s8wUZsxvRVohqKlhL0zHBF+/pAANowYuhKNW1sqKt4GVCdN3V34AbH8ST
JIbPvC2g1tUzIc2wckE8GO/NnsNR4r3k6iPB53MdCHrIo3jnENOeb9wF0GiVDb6s
OrhCa3ph2ps80YC1aCnc4Jr/yV2ONebSivnqvHCIUQEZfpjtAc07atm0i946/7nx
eiBE+9zSVJZB0LoooIVz2I3HX3lRnm3Wwi4nj8U/sNL/IbGaHNCurIETR23NivYP
yWVql2njwE+yMc8q9YZXs4MBdKsSGP6eGJW3ZwKi6ru2PJdf5eIib+ffvR4bLqXD
UUDfMyC1esGsQB24sc8wppk97wmmMrdqQUj9WnmNlTP8FUFggvQ=
=sYxX
-----END PGP SIGNATURE-----
Merge tag 'for-6.12-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- mount option fixes:
- fix handling of compression mount options on remount
- reject rw remount in case there are options that don't work
in read-write mode (like rescue options)
- fix zone accounting of unusable space
- fix in-memory corruption when merging extent maps
- fix delalloc range locking for sector < page
- use more convenient default value of drop subtree threshold, clean
more subvolumes without the fallback to marking quotas inconsistent
- fix smatch warning about incorrect value passed to ERR_PTR
* tag 'for-6.12-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix passing 0 to ERR_PTR in btrfs_search_dir_index_item()
btrfs: reject ro->rw reconfiguration if there are hard ro requirements
btrfs: fix read corruption due to race with extent map merging
btrfs: fix the delalloc range locking if sector size < page size
btrfs: qgroup: set a more sane default value for subtree drop threshold
btrfs: clear force-compress on remount when compress mount option is given
btrfs: zoned: fix zone unusable accounting for freed reserved extent
Lots of hotfixes:
- transaction restart injection has been shaking out a few things
- fix a data corruption in the buffered write path on -ENOSPC, found by
xfstests generic/299
- Some small show_options fixes
- Repair mismatches in inode hash type, seed: different snapshot
versions of an inode must have the same hash/type seed, used for
directory entries and xattrs. We were checking the hash seed, but not
the type, and a user contributed a filesystem where the hash type on
one inode had somehow been flipped; these fixes allow his filesystem
to repair.
Additionally, the hash type flip made some directory entries
invisible, which were then recreated by userspace; so the hash check
code now checks for duplicate non dangling dirents, and renames one of
them if necessary.
- Don't use wait_event_interruptible() in recovery: this fixes some
filesystems failing to mount with -ERESTARTSYS
- Workaround for kvmalloc not supporting > INT_MAX allocations, causing
an -ENOMEM when allocating the sorted array of journal keys: this
allows a 75 TB filesystem to mount
- Make sure bch_inode_unpacked.bi_snapshot is set in the old inode
compat path: this alllows Marcin's filesystem (in use since before
6.7) to repair and mount.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmcX4vYACgkQE6szbY3K
bnbywxAArBfIJfshWq5Wk9WztenzUmyUmV2HIgntT/iN4ty4eIpZ26VSvHcGvgkU
j3wx+OuxMTPBGc3fjUS+gALf/BGcQEgh6oPZCV+6M3kasTzNzG2jYOCkLqKbpcO1
V5n/Le/SM1X2grkgTm/H+TulGHNgG9gJ2U4kjihroJrTbTesZhzcW/qlz6RWo7U1
02NvLop4WE9M6WaW9RzsHK2llRUAl2Z3oRMuwNz3IIijCpm98STGD4gyvGoMV2b8
qNsXjy7b2lkYObKI29yWF0caRzWK1LRz79afRlnNVSJb6DK1QB83ms5Qa8rprCU4
uOq0wsGWyg6lzwQ19X+2TvUYABopVk2HXLlzTO/lJrWeMTuYJVPZ7KZi3l6ubw5T
GIsAD5qMdCm8E5nXX8hG//0rOIl6QK288+zMQyRCvAkCL+iN2k0TU8qKAEEC44de
vj6ZyNqbuLR39LLz9K09ZhzIZGk09ELpxOJ2Wwwj4ZFriwphWDtFgBtBUpNo/KWA
inBfq2lZJsmNjfns9vCqOmNOStOJxXnyMOR25sTv7wM69QPGkl41dPY3oeuG8lRk
cU/qJQKlpTKJbFeXiEKWKDnMzWxOnovqLFC0tKu2qAYM6vAz+AtwTXgthVFGh21U
QoUDbsnQCCixMkS2AksCo7nivLrxmV/EeYm5pgeiU38VdA5ofBM=
=OpYN
-----END PGP SIGNATURE-----
Merge tag 'bcachefs-2024-10-22' of https://github.com/koverstreet/bcachefs
Pull bcachefs fixes from Kent Overstreet:
"Lots of hotfixes:
- transaction restart injection has been shaking out a few things
- fix a data corruption in the buffered write path on -ENOSPC, found
by xfstests generic/299
- Some small show_options fixes
- Repair mismatches in inode hash type, seed: different snapshot
versions of an inode must have the same hash/type seed, used for
directory entries and xattrs. We were checking the hash seed, but
not the type, and a user contributed a filesystem where the hash
type on one inode had somehow been flipped; these fixes allow his
filesystem to repair.
Additionally, the hash type flip made some directory entries
invisible, which were then recreated by userspace; so the hash
check code now checks for duplicate non dangling dirents, and
renames one of them if necessary.
- Don't use wait_event_interruptible() in recovery: this fixes some
filesystems failing to mount with -ERESTARTSYS
- Workaround for kvmalloc not supporting > INT_MAX allocations,
causing an -ENOMEM when allocating the sorted array of journal
keys: this allows a 75 TB filesystem to mount
- Make sure bch_inode_unpacked.bi_snapshot is set in the old inode
compat path: this alllows Marcin's filesystem (in use since before
6.7) to repair and mount"
* tag 'bcachefs-2024-10-22' of https://github.com/koverstreet/bcachefs: (26 commits)
bcachefs: Set bch_inode_unpacked.bi_snapshot in old inode path
bcachefs: Mark more errors as AUTOFIX
bcachefs: Workaround for kvmalloc() not supporting > INT_MAX allocations
bcachefs: Don't use wait_event_interruptible() in recovery
bcachefs: Fix __bch2_fsck_err() warning
bcachefs: fsck: Improve hash_check_key()
bcachefs: bch2_hash_set_or_get_in_snapshot()
bcachefs: Repair mismatches in inode hash seed, type
bcachefs: Add hash seed, type to inode_to_text()
bcachefs: INODE_STR_HASH() for bch_inode_unpacked
bcachefs: Run in-kernel offline fsck without ratelimit errors
bcachefs: skip mount option handle for empty string.
bcachefs: fix incorrect show_options results
bcachefs: Fix data corruption on -ENOSPC in buffered write path
bcachefs: bch2_folio_reservation_get_partial() is now better behaved
bcachefs: fix disk reservation accounting in bch2_folio_reservation_get()
bcachefS: ec: fix data type on stripe deletion
bcachefs: Don't use commit_do() unnecessarily
bcachefs: handle restarts in bch2_bucket_io_time_reset()
bcachefs: fix restart handling in __bch2_resume_logged_op_finsert()
...
This reverts commit 1325e4a91a.
using multipage folios apparently break some madvise operations like
MADV_PAGEOUT which do not reliably unload the specified page anymore,
Revert the patch until that is figured out.
Reported-by: Andrii Nakryiko <andrii@kernel.org>
Fixes: 1325e4a91a ("9p: Enable multipage folios")
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When rename moves an AFS subdirectory between parent directories, the
subdir also needs a bit of editing: the ".." entry needs updating to point
to the new parent (though I don't make use of the info) and the DV needs
incrementing by 1 to reflect the change of content. The server also sends
a callback break notification on the subdirectory if we have one, but we
can take care of recovering the promise next time we access the subdir.
This can be triggered by something like:
mount -t afs %example.com:xfstest.test20 /xfstest.test/
mkdir /xfstest.test/{aaa,bbb,aaa/ccc}
touch /xfstest.test/bbb/ccc/d
mv /xfstest.test/{aaa/ccc,bbb/ccc}
touch /xfstest.test/bbb/ccc/e
When the pathwalk for the second touch hits "ccc", kafs spots that the DV
is incorrect and downloads it again (so the fix is not critical).
Fix this, if the rename target is a directory and the old and new
parents are different, by:
(1) Incrementing the DV number of the target locally.
(2) Editing the ".." entry in the target to refer to its new parent's
vnode ID and uniquifier.
Link: https://lore.kernel.org/r/3340431.1729680010@warthog.procyon.org.uk
Fixes: 63a4681ff3 ("afs: Locally edit directory data for mkdir/create/unlink/...")
cc: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
The purpose of btrfs_bbio_propagate_error() shall be propagating an error
of split bio to its original btrfs_bio, and tell the error to the upper
layer. However, it's not working well on some cases.
* Case 1. Immediate (or quick) end_bio with an error
When btrfs sends btrfs_bio to mirrored devices, btrfs calls
btrfs_bio_end_io() when all the mirroring bios are completed. If that
btrfs_bio was split, it is from btrfs_clone_bioset and its end_io function
is btrfs_orig_write_end_io. For this case, btrfs_bbio_propagate_error()
accesses the orig_bbio's bio context to increase the error count.
That works well in most cases. However, if the end_io is called enough
fast, orig_bbio's (remaining part after split) bio context may not be
properly set at that time. Since the bio context is set when the orig_bbio
(the last btrfs_bio) is sent to devices, that might be too late for earlier
split btrfs_bio's completion. That will result in NULL pointer
dereference.
That bug is easily reproducible by running btrfs/146 on zoned devices [1]
and it shows the following trace.
[1] You need raid-stripe-tree feature as it create "-d raid0 -m raid1" FS.
BUG: kernel NULL pointer dereference, address: 0000000000000020
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: Oops: 0000 [#1] PREEMPT SMP PTI
CPU: 1 UID: 0 PID: 13 Comm: kworker/u32:1 Not tainted 6.11.0-rc7-BTRFS-ZNS+ #474
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
Workqueue: writeback wb_workfn (flush-btrfs-5)
RIP: 0010:btrfs_bio_end_io+0xae/0xc0 [btrfs]
BTRFS error (device dm-0): bdev /dev/mapper/error-test errs: wr 2, rd 0, flush 0, corrupt 0, gen 0
RSP: 0018:ffffc9000006f248 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff888005a7f080 RCX: ffffc9000006f1dc
RDX: 0000000000000000 RSI: 000000000000000a RDI: ffff888005a7f080
RBP: ffff888011dfc540 R08: 0000000000000000 R09: 0000000000000001
R10: ffffffff82e508e0 R11: 0000000000000005 R12: ffff88800ddfbe58
R13: ffff888005a7f080 R14: ffff888005a7f158 R15: ffff888005a7f158
FS: 0000000000000000(0000) GS:ffff88803ea80000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000020 CR3: 0000000002e22006 CR4: 0000000000370ef0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
? __die_body.cold+0x19/0x26
? page_fault_oops+0x13e/0x2b0
? _printk+0x58/0x73
? do_user_addr_fault+0x5f/0x750
? exc_page_fault+0x76/0x240
? asm_exc_page_fault+0x22/0x30
? btrfs_bio_end_io+0xae/0xc0 [btrfs]
? btrfs_log_dev_io_error+0x7f/0x90 [btrfs]
btrfs_orig_write_end_io+0x51/0x90 [btrfs]
dm_submit_bio+0x5c2/0xa50 [dm_mod]
? find_held_lock+0x2b/0x80
? blk_try_enter_queue+0x90/0x1e0
__submit_bio+0xe0/0x130
? ktime_get+0x10a/0x160
? lockdep_hardirqs_on+0x74/0x100
submit_bio_noacct_nocheck+0x199/0x410
btrfs_submit_bio+0x7d/0x150 [btrfs]
btrfs_submit_chunk+0x1a1/0x6d0 [btrfs]
? lockdep_hardirqs_on+0x74/0x100
? __folio_start_writeback+0x10/0x2c0
btrfs_submit_bbio+0x1c/0x40 [btrfs]
submit_one_bio+0x44/0x60 [btrfs]
submit_extent_folio+0x13f/0x330 [btrfs]
? btrfs_set_range_writeback+0xa3/0xd0 [btrfs]
extent_writepage_io+0x18b/0x360 [btrfs]
extent_write_locked_range+0x17c/0x340 [btrfs]
? __pfx_end_bbio_data_write+0x10/0x10 [btrfs]
run_delalloc_cow+0x71/0xd0 [btrfs]
btrfs_run_delalloc_range+0x176/0x500 [btrfs]
? find_lock_delalloc_range+0x119/0x260 [btrfs]
writepage_delalloc+0x2ab/0x480 [btrfs]
extent_write_cache_pages+0x236/0x7d0 [btrfs]
btrfs_writepages+0x72/0x130 [btrfs]
do_writepages+0xd4/0x240
? find_held_lock+0x2b/0x80
? wbc_attach_and_unlock_inode+0x12c/0x290
? wbc_attach_and_unlock_inode+0x12c/0x290
__writeback_single_inode+0x5c/0x4c0
? do_raw_spin_unlock+0x49/0xb0
writeback_sb_inodes+0x22c/0x560
__writeback_inodes_wb+0x4c/0xe0
wb_writeback+0x1d6/0x3f0
wb_workfn+0x334/0x520
process_one_work+0x1ee/0x570
? lock_is_held_type+0xc6/0x130
worker_thread+0x1d1/0x3b0
? __pfx_worker_thread+0x10/0x10
kthread+0xee/0x120
? __pfx_kthread+0x10/0x10
ret_from_fork+0x30/0x50
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1a/0x30
</TASK>
Modules linked in: dm_mod btrfs blake2b_generic xor raid6_pq rapl
CR2: 0000000000000020
* Case 2. Earlier completion of orig_bbio for mirrored btrfs_bios
btrfs_bbio_propagate_error() assumes the end_io function for orig_bbio is
called last among split bios. In that case, btrfs_orig_write_end_io() sets
the bio->bi_status to BLK_STS_IOERR by seeing the bioc->error [2].
Otherwise, the increased orig_bio's bioc->error is not checked by anyone
and return BLK_STS_OK to the upper layer.
[2] Actually, this is not true. Because we only increases orig_bioc->errors
by max_errors, the condition "atomic_read(&bioc->error) > bioc->max_errors"
is still not met if only one split btrfs_bio fails.
* Case 3. Later completion of orig_bbio for un-mirrored btrfs_bios
In contrast to the above case, btrfs_bbio_propagate_error() is not working
well if un-mirrored orig_bbio is completed last. It sets
orig_bbio->bio.bi_status to the btrfs_bio's error. But, that is easily
over-written by orig_bbio's completion status. If the status is BLK_STS_OK,
the upper layer would not know the failure.
* Solution
Considering the above cases, we can only save the error status in the
orig_bbio (remaining part after split) itself as it is always
available. Also, the saved error status should be propagated when all the
split btrfs_bios are finished (i.e, bbio->pending_ios == 0).
This commit introduces "status" to btrfs_bbio and saves the first error of
split bios to original btrfs_bio's "status" variable. When all the split
bios are finished, the saved status is loaded into original btrfs_bio's
status.
With this commit, btrfs/146 on zoned devices does not hit the NULL pointer
dereference anymore.
Fixes: 852eee62d3 ("btrfs: allow btrfs_submit_bio to split bios")
CC: stable@vger.kernel.org # 6.6+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's a issue as follows:
WARNING: CPU: 1 PID: 27826 at mm/slub.c:4698 free_large_kmalloc+0xac/0xe0
RIP: 0010:free_large_kmalloc+0xac/0xe0
Call Trace:
<TASK>
? __warn+0xea/0x330
mempool_destroy+0x13f/0x1d0
init_cifs+0xa50/0xff0 [cifs]
do_one_initcall+0xdc/0x550
do_init_module+0x22d/0x6b0
load_module+0x4e96/0x5ff0
init_module_from_file+0xcd/0x130
idempotent_init_module+0x330/0x620
__x64_sys_finit_module+0xb3/0x110
do_syscall_64+0xc1/0x1d0
entry_SYSCALL_64_after_hwframe+0x77/0x7f
Obviously, 'cifs_io_request_pool' is not created by mempool_create().
So just use mempool_exit() to revert 'cifs_io_request_pool'.
Fixes: edea94a697 ("cifs: Add mempools for cifs_io_request and cifs_io_subrequest structs")
Signed-off-by: Ye Bin <yebin10@huawei.com>
Acked-by: David Howells <dhowells@redhat.com
Signed-off-by: Steve French <stfrench@microsoft.com>
In smb3_reconfigure(), after duplicating ctx->password and
ctx->password2 with kstrdup(), we need to check for allocation
failures.
If ses->password allocation fails, return -ENOMEM.
If ses->password2 allocation fails, free ses->password, set it
to NULL, and return -ENOMEM.
Fixes: c1eb537bf4 ("cifs: allow changing password during remount")
Reviewed-by: David Howells <dhowells@redhat.com
Signed-off-by: Haoxiang Li <make24@iscas.ac.cn>
Signed-off-by: Henrique Carvalho <henrique.carvalho@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
MAXAG is a legitimate value for bmp->db_numag
Fixes: e63866a475 ("jfs: fix out-of-bounds in dbNextAG() and diAlloc()")
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
The ret may be zero in btrfs_search_dir_index_item() and should not
passed to ERR_PTR(). Now btrfs_unlink_subvol() is the only caller to
this, reconstructed it to check ERR_PTR(-ENOENT) while ret >= 0.
This fixes smatch warnings:
fs/btrfs/dir-item.c:353
btrfs_search_dir_index_item() warn: passing zero to 'ERR_PTR'
Fixes: 9dcbe16fcc ("btrfs: use btrfs_for_each_slot in btrfs_search_dir_index_item")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
Syzbot reports the following crash:
BTRFS info (device loop0 state MCS): disabling free space tree
BTRFS info (device loop0 state MCS): clearing compat-ro feature flag for FREE_SPACE_TREE (0x1)
BTRFS info (device loop0 state MCS): clearing compat-ro feature flag for FREE_SPACE_TREE_VALID (0x2)
Oops: general protection fault, probably for non-canonical address 0xdffffc0000000003: 0000 [#1] PREEMPT SMP KASAN NOPTI
KASAN: null-ptr-deref in range [0x0000000000000018-0x000000000000001f]
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014
RIP: 0010:backup_super_roots fs/btrfs/disk-io.c:1691 [inline]
RIP: 0010:write_all_supers+0x97a/0x40f0 fs/btrfs/disk-io.c:4041
Call Trace:
<TASK>
btrfs_commit_transaction+0x1eae/0x3740 fs/btrfs/transaction.c:2530
btrfs_delete_free_space_tree+0x383/0x730 fs/btrfs/free-space-tree.c:1312
btrfs_start_pre_rw_mount+0xf28/0x1300 fs/btrfs/disk-io.c:3012
btrfs_remount_rw fs/btrfs/super.c:1309 [inline]
btrfs_reconfigure+0xae6/0x2d40 fs/btrfs/super.c:1534
btrfs_reconfigure_for_mount fs/btrfs/super.c:2020 [inline]
btrfs_get_tree_subvol fs/btrfs/super.c:2079 [inline]
btrfs_get_tree+0x918/0x1920 fs/btrfs/super.c:2115
vfs_get_tree+0x90/0x2b0 fs/super.c:1800
do_new_mount+0x2be/0xb40 fs/namespace.c:3472
do_mount fs/namespace.c:3812 [inline]
__do_sys_mount fs/namespace.c:4020 [inline]
__se_sys_mount+0x2d6/0x3c0 fs/namespace.c:3997
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
[CAUSE]
To support mounting different subvolume with different RO/RW flags for
the new mount APIs, btrfs introduced two workaround to support this feature:
- Skip mount option/feature checks if we are mounting a different
subvolume
- Reconfigure the fs to RW if the initial mount is RO
Combining these two, we can have the following sequence:
- Mount the fs ro,rescue=all,clear_cache,space_cache=v1
rescue=all will mark the fs as hard read-only, so no v2 cache clearing
will happen.
- Mount a subvolume rw of the same fs.
We go into btrfs_get_tree_subvol(), but fc_mount() returns EBUSY
because our new fc is RW, different from the original fs.
Now we enter btrfs_reconfigure_for_mount(), which switches the RO flag
first so that we can grab the existing fs_info.
Then we reconfigure the fs to RW.
- During reconfiguration, option/features check is skipped
This means we will restart the v2 cache clearing, and convert back to
v1 cache.
This will trigger fs writes, and since the original fs has "rescue=all"
option, it skips the csum tree read.
And eventually causing NULL pointer dereference in super block
writeback.
[FIX]
For reconfiguration caused by different subvolume RO/RW flags, ensure we
always run btrfs_check_options() to ensure we have proper hard RO
requirements met.
In fact the function btrfs_check_options() doesn't really do many
complex checks, but hard RO requirement and some feature dependency
checks, thus there is no special reason not to do the check for mount
reconfiguration.
Reported-by: syzbot+56360f93efa90ff15870@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/0000000000008c5d090621cb2770@google.com/
Fixes: f044b31867 ("btrfs: handle the ro->rw transition for mounting different subvolumes")
CC: stable@vger.kernel.org # 6.8+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In debugging some corrupt squashfs files, we observed symptoms of
corrupt page cache pages but correct on-disk contents. Further
investigation revealed that the exact symptom was a correct page
followed by an incorrect, duplicate, page. This got us thinking about
extent maps.
commit ac05ca913e ("Btrfs: fix race between using extent maps and merging them")
enforces a reference count on the primary `em` extent_map being merged,
as that one gets modified.
However, since,
commit 3d2ac99224 ("btrfs: introduce new members for extent_map")
both 'em' and 'merge' get modified, which started modifying 'merge'
and thus introduced the same race.
We were able to reproduce this by looping the affected squashfs workload
in parallel on a bunch of separate btrfs-es while also dropping caches.
We are still working on a simple enough reproducer to make into an fstest.
The simplest fix is to stop modifying 'merge', which is not essential,
as it is dropped immediately after the merge. This behavior is simply
a consequence of the order of the two extent maps being important in
computing the new values. Modify merge_ondisk_extents to take prev and
next by const* and also take a third merged parameter that it puts the
results in. Note that this introduces the rather odd behavior of passing
'em' to merge_ondisk_extents as a const * and as a regular ptr.
Fixes: 3d2ac99224 ("btrfs: introduce new members for extent_map")
CC: stable@vger.kernel.org # 6.11+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
Inside lock_delalloc_folios(), there are several problems related to
sector size < page size handling:
- Set the writer locks without checking if the folio is still valid
We call btrfs_folio_start_writer_lock() just like it's folio_lock().
But since the folio may not even be the folio of the current mapping,
we can easily screw up the folio->private.
- The range is not clamped inside the page
This means we can over write other bitmaps if the start/len is not
properly handled, and trigger the btrfs_subpage_assert().
- @processed_end is always rounded up to page end
If the delalloc range is not page aligned, and we need to retry
(returning -EAGAIN), then we will unlock to the page end.
Thankfully this is not a huge problem, as now
btrfs_folio_end_writer_lock() can handle range larger than the locked
range, and only unlock what is already locked.
Fix all these problems by:
- Lock and check the folio first, then call
btrfs_folio_set_writer_lock()
So that if we got a folio not belonging to the inode, we won't
touch folio->private.
- Properly truncate the range inside the page
- Update @processed_end to the locked range end
Fixes: 1e1de38792 ("btrfs: make process_one_page() to handle subpage locking")
CC: stable@vger.kernel.org # 6.1+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since commit 011b46c304 ("btrfs: skip subtree scan if it's too high to
avoid low stall in btrfs_commit_transaction()"), btrfs qgroup can
automatically skip large subtree scan at the cost of marking qgroup
inconsistent.
It's designed to address the final performance problem of snapshot drop
with qgroup enabled, but to be safe the default value is
BTRFS_MAX_LEVEL, requiring a user space daemon to set a different value
to make it work.
I'd say it's not a good idea to rely on user space tool to set this
default value, especially when some operations (snapshot dropping) can
be triggered immediately after mount, leaving a very small window to
that that sysfs interface.
So instead of disabling this new feature by default, enable it with a
low threshold (3), so that large subvolume tree drop at mount time won't
cause huge qgroup workload.
CC: stable@vger.kernel.org # 6.1
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
After the migration to use fs context for processing mount options we had
a slight change in the semantics for remounting a filesystem that was
mounted with compress-force. Before we could clear compress-force by
passing only "-o compress[=algo]" during a remount, but after that change
that does not work anymore, force-compress is still present and one needs
to pass "-o compress-force=no,compress[=algo]" to the mount command.
Example, when running on a kernel 6.8+:
$ mount -o compress-force=zlib:9 /dev/sdi /mnt/sdi
$ mount | grep sdi
/dev/sdi on /mnt/sdi type btrfs (rw,relatime,compress-force=zlib:9,discard=async,space_cache=v2,subvolid=5,subvol=/)
$ mount -o remount,compress=zlib:5 /mnt/sdi
$ mount | grep sdi
/dev/sdi on /mnt/sdi type btrfs (rw,relatime,compress-force=zlib:5,discard=async,space_cache=v2,subvolid=5,subvol=/)
On a 6.7 kernel (or older):
$ mount -o compress-force=zlib:9 /dev/sdi /mnt/sdi
$ mount | grep sdi
/dev/sdi on /mnt/sdi type btrfs (rw,relatime,compress-force=zlib:9,discard=async,space_cache=v2,subvolid=5,subvol=/)
$ mount -o remount,compress=zlib:5 /mnt/sdi
$ mount | grep sdi
/dev/sdi on /mnt/sdi type btrfs (rw,relatime,compress=zlib:5,discard=async,space_cache=v2,subvolid=5,subvol=/)
So update btrfs_parse_param() to clear "compress-force" when "compress" is
given, providing the same semantics as kernel 6.7 and older.
Reported-by: Roman Mamedov <rm@romanrm.net>
Link: https://lore.kernel.org/linux-btrfs/20241014182416.13d0f8b0@nvm/
CC: stable@vger.kernel.org # 6.8+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently log recovery never updates the in-core perag values for the
last allocation group when they were grown by growfs. This leads to
btree record validation failures for the alloc, ialloc or finotbt
trees if a transaction references this new space.
Found by Brian's new growfs recovery stress test.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
__GFP_RETRY_MAYFAIL increases the likelyhood of allocations to fail,
which isn't really helpful during log recovery. Remove the flag and
stick to the default GFP_KERNEL policies.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
XFS currently does not support reducing the agcount, so error out if
a logged sb buffer tries to shrink the agcount.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Primary superblock buffers that change the file system geometry after a
growfs operation can affect the operation of later CIL checkpoints that
make use of the newly added space and allocation groups.
Apply the changes to the in-memory structures as part of recovery pass 2,
to ensure recovery works fine for such cases.
In the future we should apply the logic to other updates such as features
bits as well.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
There is no good reason to have two different routines for freeing perag
structures for the unmount and error cases. Add two arguments to specify
the range of AGs to free to xfs_free_perag, and use that to replace
xfs_free_unused_perag_range.
The addition RCU grace period for the error case is harmless, and the
extra check for the AG to actually exist is not required now that the
callers pass the exact known allocated range.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Currently only the new agcount is passed to xfs_initialize_perag, which
requires lookups of existing AGs to skip them and complicates error
handling. Also pass the previous agcount so that the range that
xfs_initialize_perag operates on is exactly defined. That way the
extra lookups can be avoided, and error handling can clean up the
exact range from the old count to the last added perag structure.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Fix a minor bug where we fail repairs on metadata files that do not have
attr forks because xrep_metadata_inode_subtype doesn't filter ENOENT.
Cc: stable@vger.kernel.org # v6.8
Fixes: 5a8e07e799 ("xfs: repair the inode core and forks of a metadata inode")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZxY6XAAKCRCRxhvAZXjc
opmUAQCu4KhzBBdZmFw3AfZFNJvYb1onT4FiU0pnyGgfvzEdEwD6AlnlgQ7DL3ZN
WBqBzUl+DpGYJfzhkqoEGH89Fagx7QM=
=mm68
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.12-rc5.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs fixes from Christian Brauner:
"afs:
- Fix a lock recursion in afs_wake_up_async_call() on ->notify_lock
netfs:
- Drop the references to a folio immediately after the folio has been
extracted to prevent races with future I/O collection
- Fix a documenation build error
- Downgrade the i_rwsem for buffered writes to fix a cifs reported
performance regression when switching to netfslib
vfs:
- Explicitly return -E2BIG from openat2() if the specified size is
unexpectedly large. This aligns openat2() with other extensible
struct based system calls
- When copying a mount namespace ensure that we only try to remove
the new copy from the mount namespace rbtree if it has already been
added to it
nilfs:
- Clear the buffer delay flag when clearing the buffer state clags
when a buffer head is discarded to prevent a kernel OOPs
ocfs2:
- Fix an unitialized value warning in ocfs2_setattr()
proc:
- Fix a kernel doc warning"
* tag 'vfs-6.12-rc5.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
proc: Fix W=1 build kernel-doc warning
afs: Fix lock recursion
fs: Fix uninitialized value issue in from_kuid and from_kgid
fs: don't try and remove empty rbtree node
netfs: Downgrade i_rwsem for a buffered write
nilfs2: fix kernel bug due to missing clearing of buffer delay flag
openat2: explicitly return -E2BIG for (usize > PAGE_SIZE)
netfs: fix documentation build error
netfs: In readahead, put the folio refs as soon extracted
iomap_want_unshare_iter currently sits in fs/iomap/buffered-io.c, which
depends on CONFIG_BLOCK. It is also in used in fs/dax.c whіch has no
such dependency. Given that it is a trivial check turn it into an inline
in include/linux/iomap.h to fix the DAX && !BLOCK build.
Fixes: 6ef6a0e821 ("iomap: share iomap_unshare_iter predicate code with fsdax")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20241015041350.118403-1-hch@lst.de
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
In the normal case, when we excute `echo 0 > /proc/fs/nfsd/threads`, the
function `nfs4_state_destroy_net` in `nfs4_state_shutdown_net` will
release all resources related to the hashed `nfs4_client`. If the
`nfsd_client_shrinker` is running concurrently, the `expire_client`
function will first unhash this client and then destroy it. This can
lead to the following warning. Additionally, numerous use-after-free
errors may occur as well.
nfsd_client_shrinker echo 0 > /proc/fs/nfsd/threads
expire_client nfsd_shutdown_net
unhash_client ...
nfs4_state_shutdown_net
/* won't wait shrinker exit */
/* cancel_work(&nn->nfsd_shrinker_work)
* nfsd_file for this /* won't destroy unhashed client1 */
* client1 still alive nfs4_state_destroy_net
*/
nfsd_file_cache_shutdown
/* trigger warning */
kmem_cache_destroy(nfsd_file_slab)
kmem_cache_destroy(nfsd_file_mark_slab)
/* release nfsd_file and mark */
__destroy_client
====================================================================
BUG nfsd_file (Not tainted): Objects remaining in nfsd_file on
__kmem_cache_shutdown()
--------------------------------------------------------------------
CPU: 4 UID: 0 PID: 764 Comm: sh Not tainted 6.12.0-rc3+ #1
dump_stack_lvl+0x53/0x70
slab_err+0xb0/0xf0
__kmem_cache_shutdown+0x15c/0x310
kmem_cache_destroy+0x66/0x160
nfsd_file_cache_shutdown+0xac/0x210 [nfsd]
nfsd_destroy_serv+0x251/0x2a0 [nfsd]
nfsd_svc+0x125/0x1e0 [nfsd]
write_threads+0x16a/0x2a0 [nfsd]
nfsctl_transaction_write+0x74/0xa0 [nfsd]
vfs_write+0x1a5/0x6d0
ksys_write+0xc1/0x160
do_syscall_64+0x5f/0x170
entry_SYSCALL_64_after_hwframe+0x76/0x7e
====================================================================
BUG nfsd_file_mark (Tainted: G B W ): Objects remaining
nfsd_file_mark on __kmem_cache_shutdown()
--------------------------------------------------------------------
dump_stack_lvl+0x53/0x70
slab_err+0xb0/0xf0
__kmem_cache_shutdown+0x15c/0x310
kmem_cache_destroy+0x66/0x160
nfsd_file_cache_shutdown+0xc8/0x210 [nfsd]
nfsd_destroy_serv+0x251/0x2a0 [nfsd]
nfsd_svc+0x125/0x1e0 [nfsd]
write_threads+0x16a/0x2a0 [nfsd]
nfsctl_transaction_write+0x74/0xa0 [nfsd]
vfs_write+0x1a5/0x6d0
ksys_write+0xc1/0x160
do_syscall_64+0x5f/0x170
entry_SYSCALL_64_after_hwframe+0x76/0x7e
To resolve this issue, cancel `nfsd_shrinker_work` using synchronous
mode in nfs4_state_shutdown_net.
Fixes: 7c24fa2250 ("NFSD: replace delayed_work with work_struct for nfsd_client_shrinker")
Signed-off-by: Yang Erkun <yangerkun@huaweicloud.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
As Allison reported [1], currently get_tree_bdev() will store
"Can't lookup blockdev" error message. Although it makes sense for
pure bdev-based fses, this message may mislead users who try to use
EROFS file-backed mounts since get_tree_nodev() is used as a fallback
then.
Add get_tree_bdev_flags() to specify extensible flags [2] and
GET_TREE_BDEV_QUIET_LOOKUP to silence "Can't lookup blockdev" message
since it's misleading to EROFS file-backed mounts now.
[1] https://lore.kernel.org/r/CAOYeF9VQ8jKVmpy5Zy9DNhO6xmWSKMB-DO8yvBB0XvBE7=3Ugg@mail.gmail.com
[2] https://lore.kernel.org/r/ZwUkJEtwIpUA4qMz@infradead.org
Suggested-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20241009033151.2334888-1-hsiangkao@linux.alibaba.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
This fixes a fsck bug on a very old filesystem (pre mainline merge).
Fixes: 72350ee0ea ("bcachefs: Kill snapshot arg to fsck_write_inode()")
Reported-by: Marcin Mirosław <marcin@mejor.pl>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
kvmalloc() doesn't support allocations > INT_MAX, but vmalloc() does -
the limit should be lifted, but we can work around this for now.
A user with a 75 TB filesystem reported the following journal replay
error:
https://github.com/koverstreet/bcachefs/issues/769
In journal replay we have to sort and dedup all the keys from the
journal, which means we need a large contiguous allocation. Given that
the user has 128GB of ram, the 2GB limit on allocation size has become
far too small.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Fix a bug where mount was failing with -ERESTARTSYS:
https://github.com/koverstreet/bcachefs/issues/741
We only want the interruptible wait when called from fsync.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
- fix for multiple slabs created with the same name
- enable multipage folios
- theorical fix to also look for opened fids by inode if none
was found by dentry
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE/IPbcYBuWt0zoYhOq06b7GqY5nAFAmcS81AACgkQq06b7GqY
5nACpBAAtXOGRjg+dushCwUVKBlnI3oTwE2G+ywnphNZg2A0emlMOxos7x1OTiM3
Fu0b10MCUWHIXo4jD6ALVPWITJTfjiXR8s90Q/ozypcIXXhkDDShhV31b2h6Iplr
YyKyjEehDFRiS7rqWC2a9mce99sOpwdQRmnssnWbjYvpJ4imFbl+50Z1I5Nc/Omu
j2y02eMuikiWF/shKj0Dx1mmpZ4InSv3kvlM+V2D2YdWKNonGZe/xFZhid95LXmr
Upt55R8k9qR2pn4VU22eKP6c34DIZGDlrcQdPUCNP5QuaAdGZov3TjNQdjE1bJmF
E2QdxvUNfvvHqlvaRrlWa27uMgXMcy7QV3LEKwmo3tmaYVw2PDMRbFXc9zQdxy91
zqXjjGasnwzE8ca36y79vZjFTHAyY5VK/3cHCL3ai+ysu4UL3k2QgmVegREG/xKk
G8Nz4UO/R6s8Wc2VqxKJdZS5NMLlADS+Aes0PG+9AxQz7iR9Ktgwrw39KDxMi+Lm
PeH3Gz2rP9+EPoa3usoBQtvvvmJKM/Wb9qdPW9vTtRbRJ7bVclJoizFoLMA/TiW1
Jru+HYGBO75s8RynwEDLMiJhkjZWHfVgDjPsY6YsGVH8W2gOcJ7egQ2J2EsuurN3
tzKz4uQilV+VeDuWs8pWKrX/c3Y3KpSYV+oayg7Je7LoTlQBmU8=
=VG4t
-----END PGP SIGNATURE-----
Merge tag '9p-for-6.12-rc4' of https://github.com/martinetd/linux
Pull 9p fixes from Dominique Martinet:
"Mashed-up update that I sat on too long:
- fix for multiple slabs created with the same name
- enable multipage folios
- theorical fix to also look for opened fids by inode if none was
found by dentry"
[ Enabling multi-page folios should have been done during the merge
window, but it's a one-liner, and the actual meat of the enablement
is in netfs and already in use for other filesystems... - Linus ]
* tag '9p-for-6.12-rc4' of https://github.com/martinetd/linux:
9p: Avoid creating multiple slab caches with the same name
9p: Enable multipage folios
9p: v9fs_fid_find: also lookup by inode if not found dentry
There is a race between laundromat handling of revoked delegations
and a client sending free_stateid operation. Laundromat thread
finds that delegation has expired and needs to be revoked so it
marks the delegation stid revoked and it puts it on a reaper list
but then it unlock the state lock and the actual delegation revocation
happens without the lock. Once the stid is marked revoked a racing
free_stateid processing thread does the following (1) it calls
list_del_init() which removes it from the reaper list and (2) frees
the delegation stid structure. The laundromat thread ends up not
calling the revoke_delegation() function for this particular delegation
but that means it will no release the lock lease that exists on
the file.
Now, a new open for this file comes in and ends up finding that
lease list isn't empty and calls nfsd_breaker_owns_lease() which ends
up trying to derefence a freed delegation stateid. Leading to the
followint use-after-free KASAN warning:
kernel: ==================================================================
kernel: BUG: KASAN: slab-use-after-free in nfsd_breaker_owns_lease+0x140/0x160 [nfsd]
kernel: Read of size 8 at addr ffff0000e73cd0c8 by task nfsd/6205
kernel:
kernel: CPU: 2 UID: 0 PID: 6205 Comm: nfsd Kdump: loaded Not tainted 6.11.0-rc7+ #9
kernel: Hardware name: Apple Inc. Apple Virtualization Generic Platform, BIOS 2069.0.0.0.0 08/03/2024
kernel: Call trace:
kernel: dump_backtrace+0x98/0x120
kernel: show_stack+0x1c/0x30
kernel: dump_stack_lvl+0x80/0xe8
kernel: print_address_description.constprop.0+0x84/0x390
kernel: print_report+0xa4/0x268
kernel: kasan_report+0xb4/0xf8
kernel: __asan_report_load8_noabort+0x1c/0x28
kernel: nfsd_breaker_owns_lease+0x140/0x160 [nfsd]
kernel: nfsd_file_do_acquire+0xb3c/0x11d0 [nfsd]
kernel: nfsd_file_acquire_opened+0x84/0x110 [nfsd]
kernel: nfs4_get_vfs_file+0x634/0x958 [nfsd]
kernel: nfsd4_process_open2+0xa40/0x1a40 [nfsd]
kernel: nfsd4_open+0xa08/0xe80 [nfsd]
kernel: nfsd4_proc_compound+0xb8c/0x2130 [nfsd]
kernel: nfsd_dispatch+0x22c/0x718 [nfsd]
kernel: svc_process_common+0x8e8/0x1960 [sunrpc]
kernel: svc_process+0x3d4/0x7e0 [sunrpc]
kernel: svc_handle_xprt+0x828/0xe10 [sunrpc]
kernel: svc_recv+0x2cc/0x6a8 [sunrpc]
kernel: nfsd+0x270/0x400 [nfsd]
kernel: kthread+0x288/0x310
kernel: ret_from_fork+0x10/0x20
This patch proposes a fixed that's based on adding 2 new additional
stid's sc_status values that help coordinate between the laundromat
and other operations (nfsd4_free_stateid() and nfsd4_delegreturn()).
First to make sure, that once the stid is marked revoked, it is not
removed by the nfsd4_free_stateid(), the laundromat take a reference
on the stateid. Then, coordinating whether the stid has been put
on the cl_revoked list or we are processing FREE_STATEID and need to
make sure to remove it from the list, each check that state and act
accordingly. If laundromat has added to the cl_revoke list before
the arrival of FREE_STATEID, then nfsd4_free_stateid() knows to remove
it from the list. If nfsd4_free_stateid() finds that operations arrived
before laundromat has placed it on cl_revoke list, it marks the state
freed and then laundromat will no longer add it to the list.
Also, for nfsd4_delegreturn() when looking for the specified stid,
we need to access stid that are marked removed or freeable, it means
the laundromat has started processing it but hasn't finished and this
delegreturn needs to return nfserr_deleg_revoked and not
nfserr_bad_stateid. The latter will not trigger a FREE_STATEID and the
lack of it will leave this stid on the cl_revoked list indefinitely.
Fixes: 2d4a532d38 ("nfsd: ensure that clp->cl_revoked list is protected by clp->cl_lock")
CC: stable@vger.kernel.org
Signed-off-by: Olga Kornievskaia <okorniev@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmcSdmYACgkQiiy9cAdy
T1EnnAwAoNbY+odLB9atHIuaBftpyINrhzRrzpwTfYNtPKUPGxxGk2fiP29YqMLb
OF4jnC87E3P/xhydoZHXXe3kKBQFVMAkJZKHiZBvJd+brk/EadfQnNmIio1pwOGh
zFNxSujFtsM/1HU/ZoI2kaHzrqj5KxWKWFytZ6umd8C3NyKK9Lo/lcqUBKv8MpJy
XXkMBh+7HGKRfDQlU+n6NQ5+dqFL5xDjTXlm9dM8LXuInKy5oKTGnRhLA7OA8lt7
EenFo8joy0IpXUByHt+ksQ8P88NCnU2h9kGp1UrGrBPh90+MokRr9GAcH8twK8jt
/bpL4yzAwuk1TAg+L9mSLT2OtWYsDpsQZmsBMbxBZGr2qmtjwgbxSgjf6DNiJZgn
jz15nFsuEsU5AbX4EAE67fwRWAo9AmQFyOOcYgkiIWOFHaRU6D/2NzCxCDZ+mfpy
Z5f7dF/sA158iY4wmB5BrQpFamxzpLADz6Qy4NA9hXjEKsbyFAuf22EjE64ruxZ4
8nMB3buh
=peum
-----END PGP SIGNATURE-----
Merge tag 'v6.12-rc3-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull smb client fixes from Steve French:
- Fix possible double free setting xattrs
- Fix slab out of bounds with large ioctl payload
- Remove three unused functions, and an unused variable that could be
confusing
* tag 'v6.12-rc3-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
cifs: Remove unused functions
smb/client: Fix logically dead code
smb: client: fix OOBs when building SMB2_IOCTL request
smb: client: fix possible double free in smb2_set_ea()
* Fix integer overflow in xrep_bmap
* Fix stale dealloc punching for COW IO
Signed-off-by: Carlos Maiolino <cem@kernel.org>
-----BEGIN PGP SIGNATURE-----
iJUEABMJAB0WIQQMHYkcUKcy4GgPe2RGdaER5QtfpgUCZw5LIwAKCRBGdaER5Qtf
puRlAYDezbvs1dDSkKIGOt3inGdLptNAu4qniXBUkbYI9BzmtIVDueWP4Wo0dV3d
gu3xrWQBfjFXdmEuBlwLuAFrp07AN18BVMj+DWCiEShsPHSoSPcF/IrDiz4BHvGv
MKYq9CywFw==
=Gj9b
-----END PGP SIGNATURE-----
Merge tag 'xfs-6.12-fixes-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fixes from Carlos Maiolino:
- Fix integer overflow in xrep_bmap
- Fix stale dealloc punching for COW IO
* tag 'xfs-6.12-fixes-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: punch delalloc extents from the COW fork for COW writes
xfs: set IOMAP_F_SHARED for all COW fork allocations
xfs: share more code in xfs_buffered_write_iomap_begin
xfs: support the COW fork in xfs_bmap_punch_delalloc_range
xfs: IOMAP_ZERO and IOMAP_UNSHARE already hold invalidate_lock
xfs: take XFS_MMAPLOCK_EXCL xfs_file_write_zero_eof
xfs: factor out a xfs_file_write_zero_eof helper
iomap: move locking out of iomap_write_delalloc_release
iomap: remove iomap_file_buffered_write_punch_delalloc
iomap: factor out a iomap_last_written_block helper
xfs: fix integer overflow in xrep_bmap
Building the kernel with W=1 generates the following warning:
fs/proc/fd.c:81: warning: This comment starts with '/**',
but isn't a kernel-doc comment.
Use a normal comment for the helper function proc_fdinfo_permission().
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Link: https://lore.kernel.org/r/20241018102705.92237-2-thorsten.blum@linux.dev
Signed-off-by: Christian Brauner <brauner@kernel.org>
hash_check_key() checks and repairs the hash table btrees: dirents and
xattrs are open addressing hash tables.
We recently had a corruption reported where the hash type on an inode
somehow got flipped, which made the existing dirents invisible and
allowed new ones to be created with the same name.
Now, hash_check_key() can repair duplicates: it will delete one of them,
if it has an xattr or dangling dirent, but if it has two valid dirents
one of them gets renamed.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Different versions of the same inode (same inode number, different
snapshot ID) must have the same hash seed and type - lookups require
this, since they see keys from different snapshots simultaneously.
To repair we only need to make the inodes consistent, hash_check_key()
will do the rest.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This helped with discovering some filesystem corruption fsck has having
trouble with: the str_hash type had gotten flipped on one snapshot's
version of an inode.
All versions of a given inode number have the same hash seed and hash
type, since lookups will be done with a single hash/seed and type and
see dirents/xattrs from multiple snapshots.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The options parse in get_tree will split the options buffer, it will
get the empty string for last one by strsep(). After commit
ea0eeb89b1d5 ("bcachefs: reject unknown mount options") is merged,
unknown mount options is not allowed (here is empty string), and this
causes this errors. This can be reproduced just by the following steps:
bcachefs format /dev/loop
mount -t bcachefs -o metadata_target=loop1 /dev/loop1 /mnt/bcachefs/
Fixes: ea0eeb89b1d5 ("bcachefs: reject unknown mount options")
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
When call show_options in bcachefs, the options buffer is appeneded
to the seq variable. In fact, it requires an additional comma to be
appended first. This will affect the remount process when reading
existing mount options.
Fixes: 9305cf91d05e ("bcachefs: bch2_opts_to_text()")
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Found by generic/299: When we have to truncate a write due to -ENOSPC,
we may have to read in the folio we're writing to if we're now no longer
doing a complete write to a !uptodate folio.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
bch2_folio_reservation_get_partial(), on partial success, will now
return a reservation that's aligned to the filesystem blocksize.
This is a partial fix for fstests generic/299 - fio verify is badly
behaved in the presence of short writes that aren't aligned to its
blocksize.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
bch2_disk_reservation_put() zeroes out the reservation - oops.
This fixes a disk reservation leak when getting a quota reservation
returned an error.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Using commit_do() to call alloc_sectors_start_trans() breaks when we're
randomly injecting transaction restarts - the restart in the commit
causes us to leak the lock that alloc_sectorS_start_trans() takes.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
bch2_bucket_io_time_reset() doesn't need to succeed, which is why it
didn't previously retry on transaction restart - but we're now treating
these as errors.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This is ugly:
We may discover in alloc_write_key that the data type we calculated is
wrong, because BCH_DATA_need_discard is checked/set elsewhere, and the
disk accounting counters we calculated need to be updated.
But bch2_alloc_key_to_dev_counters(..., BTREE_TRIGGER_gc) is not safe
w.r.t. transaction restarts, so we need to propagate the fixup back to
our gc state in case we take a transaction restart.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
this one is fairly harmless since the invalidate worker will just run
again later if it needs to, but still worth fixing
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This should be impossible to hit in practice; the first lookup within a
transaction won't return a restart due to lock ordering, but we're
adding fault injection for transaction restarts and shaking out bugs.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
It is the usual shower of unrelated singletons - please see the individual
changelogs for details.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZxGY5wAKCRDdBJ7gKXxA
js6RAQC16zQ7WRV091i79cEi1C5648NbZjMCU626hZjuyfbzKgEA2v8PYtjj9w2e
UGLxMY+PYZki2XNEh75Sikdkiyl9Vgg=
=xcWT
-----END PGP SIGNATURE-----
Merge tag 'mm-hotfixes-stable-2024-10-17-16-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"28 hotfixes. 13 are cc:stable. 23 are MM.
It is the usual shower of unrelated singletons - please see the
individual changelogs for details"
* tag 'mm-hotfixes-stable-2024-10-17-16-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (28 commits)
maple_tree: add regression test for spanning store bug
maple_tree: correct tree corruption on spanning store
mm/mglru: only clear kswapd_failures if reclaimable
mm/swapfile: skip HugeTLB pages for unuse_vma
selftests: mm: fix the incorrect usage() info of khugepaged
MAINTAINERS: add Jann as memory mapping/VMA reviewer
mm: swap: prevent possible data-race in __try_to_reclaim_swap
mm: khugepaged: fix the incorrect statistics when collapsing large file folios
MAINTAINERS: kasan, kcov: add bugzilla links
mm: don't install PMD mappings when THPs are disabled by the hw/process/vma
mm: huge_memory: add vma_thp_disabled() and thp_disabled_by_hw()
Docs/damon/maintainer-profile: update deprecated awslabs GitHub URLs
Docs/damon/maintainer-profile: add missing '_' suffixes for external web links
maple_tree: check for MA_STATE_BULK on setting wr_rebalance
mm: khugepaged: fix the arguments order in khugepaged_collapse_file trace point
mm/damon/tests/sysfs-kunit.h: fix memory leak in damon_sysfs_test_add_targets()
mm: remove unused stub for can_swapin_thp()
mailmap: add an entry for Andy Chiu
MAINTAINERS: add memory mapping/VMA co-maintainers
fs/proc: fix build with GCC 15 due to -Werror=unterminated-string-initialization
...
When btrfs reserves an extent and does not use it (e.g, by an error), it
calls btrfs_free_reserved_extent() to free the reserved extent. In the
process, it calls btrfs_add_free_space() and then it accounts the region
bytes as block_group->zone_unusable.
However, it leaves the space_info->bytes_zone_unusable side not updated. As
a result, ENOSPC can happen while a space_info reservation succeeded. The
reservation is fine because the freed region is not added in
space_info->bytes_zone_unusable, leaving that space as "free". OTOH,
corresponding block group counts it as zone_unusable and its allocation
pointer is not rewound, we cannot allocate an extent from that block group.
That will also negate space_info's async/sync reclaim process, and cause an
ENOSPC error from the extent allocation process.
Fix that by returning the space to space_info->bytes_zone_unusable.
Ideally, since a bio is not submitted for this reserved region, we should
return the space to free space and rewind the allocation pointer. But, it
needs rework on extent allocation handling, so let it work in this way for
now.
Fixes: 169e0da91a ("btrfs: zoned: track unusable bytes for zones")
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
afs_wake_up_async_call() can incur lock recursion. The problem is that it
is called from AF_RXRPC whilst holding the ->notify_lock, but it tries to
take a ref on the afs_call struct in order to pass it to a work queue - but
if the afs_call is already queued, we then have an extraneous ref that must
be put... calling afs_put_call() may call back down into AF_RXRPC through
rxrpc_kernel_shutdown_call(), however, which might try taking the
->notify_lock again.
This case isn't very common, however, so defer it to a workqueue. The oops
looks something like:
BUG: spinlock recursion on CPU#0, krxrpcio/7001/1646
lock: 0xffff888141399b30, .magic: dead4ead, .owner: krxrpcio/7001/1646, .owner_cpu: 0
CPU: 0 UID: 0 PID: 1646 Comm: krxrpcio/7001 Not tainted 6.12.0-rc2-build3+ #4351
Hardware name: ASUS All Series/H97-PLUS, BIOS 2306 10/09/2014
Call Trace:
<TASK>
dump_stack_lvl+0x47/0x70
do_raw_spin_lock+0x3c/0x90
rxrpc_kernel_shutdown_call+0x83/0xb0
afs_put_call+0xd7/0x180
rxrpc_notify_socket+0xa0/0x190
rxrpc_input_split_jumbo+0x198/0x1d0
rxrpc_input_data+0x14b/0x1e0
? rxrpc_input_call_packet+0xc2/0x1f0
rxrpc_input_call_event+0xad/0x6b0
rxrpc_input_packet_on_conn+0x1e1/0x210
rxrpc_input_packet+0x3f2/0x4d0
rxrpc_io_thread+0x243/0x410
? __pfx_rxrpc_io_thread+0x10/0x10
kthread+0xcf/0xe0
? __pfx_kthread+0x10/0x10
ret_from_fork+0x24/0x40
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1a/0x30
</TASK>
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/1394602.1729162732@warthog.procyon.org.uk
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
When copying a namespace we won't have added the new copy into the
namespace rbtree until after the copy succeeded. Calling free_mnt_ns()
will try to remove the copy from the rbtree which is invalid. Simply
free the namespace skeleton directly.
Link: https://lore.kernel.org/r/20241016-adapter-seilwinde-83c508a7bde1@brauner
Fixes: 1901c92497 ("fs: keep an index of current mount namespaces")
Tested-by: Brad Spengler <spender@grsecurity.net>
Cc: stable@vger.kernel.org # v6.11+
Reported-by: Brad Spengler <spender@grsecurity.net>
Suggested-by: Brad Spengler <spender@grsecurity.net>
Signed-off-by: Christian Brauner <brauner@kernel.org>
In the I/O locking code borrowed from NFS into netfslib, i_rwsem is held
locked across a buffered write - but this causes a performance regression
in cifs as it excludes buffered reads for the duration (cifs didn't use any
locking for buffered reads).
Mitigate this somewhat by downgrading the i_rwsem to a read lock across the
buffered write. This at least allows parallel reads to occur whilst
excluding other writes, DIO, truncate and setattr.
Note that this shouldn't be a problem for a buffered write as a read
through an mmap can circumvent i_rwsem anyway.
Also note that we might want to make this change in NFS also.
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/1317958.1729096113@warthog.procyon.org.uk
cc: Steve French <sfrench@samba.org>
cc: Paulo Alcantara <pc@manguebit.com>
cc: Trond Myklebust <trondmy@kernel.org>
cc: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: linux-cifs@vger.kernel.org
cc: linux-nfs@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
show show_smap_vma_flags() has been a using misspelled initializer in
mnemonics[] - it needed to initialize 2 element array of char and it used
NUL-padded 2 character string literals (i.e. 3-element initializer).
This has been spotted by gcc-15[*]; prior to that gcc quietly dropped the
3rd eleemnt of initializers. To fix this we are increasing the size of
mnemonics[] (from mnemonics[BITS_PER_LONG][2] to
mnemonics[BITS_PER_LONG][3]) to accomodate the NUL-padded string literals.
This also helps us in simplyfying the logic for printing of the flags as
instead of printing each character from the mnemonics[], we can just print
the mnemonics[] using seq_printf.
[*]: fs/proc/task_mmu.c:917:49: error: initializer-string for array of `char' is too long [-Werror=unterminate d-string-initialization]
917 | [0 ... (BITS_PER_LONG-1)] = "??",
| ^~~~
fs/proc/task_mmu.c:917:49: error: initializer-string for array of `char' is too long [-Werror=unterminate d-string-initialization]
fs/proc/task_mmu.c:917:49: error: initializer-string for array of `char' is too long [-Werror=unterminate d-string-initialization]
fs/proc/task_mmu.c:917:49: error: initializer-string for array of `char' is too long [-Werror=unterminate d-string-initialization]
fs/proc/task_mmu.c:917:49: error: initializer-string for array of `char' is too long [-Werror=unterminate d-string-initialization]
fs/proc/task_mmu.c:917:49: error: initializer-string for array of `char' is too long [-Werror=unterminate d-string-initialization]
...
Stephen pointed out:
: The C standard explicitly allows for a string initializer to be too long
: due to the NUL byte at the end ... so this warning may be overzealous.
but let's make the warning go away anwyay.
Link: https://lkml.kernel.org/r/20241005063700.2241027-1-brahmajit.xyz@gmail.com
Link: https://lkml.kernel.org/r/20241003093040.47c08382@canb.auug.org.au
Signed-off-by: Brahmajit Das <brahmajit.xyz@gmail.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Syzbot reported that a task hang occurs in vcs_open() during a fuzzing
test for nilfs2.
The root cause of this problem is that in nilfs_find_entry(), which
searches for directory entries, ignores errors when loading a directory
page/folio via nilfs_get_folio() fails.
If the filesystem images is corrupted, and the i_size of the directory
inode is large, and the directory page/folio is successfully read but
fails the sanity check, for example when it is zero-filled,
nilfs_check_folio() may continue to spit out error messages in bursts.
Fix this issue by propagating the error to the callers when loading a
page/folio fails in nilfs_find_entry().
The current interface of nilfs_find_entry() and its callers is outdated
and cannot propagate error codes such as -EIO and -ENOMEM returned via
nilfs_find_entry(), so fix it together.
Link: https://lkml.kernel.org/r/20241004033640.6841-1-konishi.ryusuke@gmail.com
Fixes: 2ba466d74e ("nilfs2: directory entry operations")
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Reported-by: Lizhi Xu <lizhi.xu@windriver.com>
Closes: https://lkml.kernel.org/r/20240927013806.3577931-1-lizhi.xu@windriver.com
Reported-by: syzbot+8a192e8d090fa9a31135@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=8a192e8d090fa9a31135
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmcPxtAACgkQxWXV+ddt
WDu9lA//WfB88fwEKnqBYDRo6aiSMIAzLDuXkJ9i8d7rcjZO1OIZkEnMOsxhvTcZ
KxgjNjkgzoTyUwoAUlG+ZpvMeSNMhBdr2NFkXmYzN9oanFE4zplpZiWx6tGSApRU
0ilngjXBsr8p03HmB88Yb05DVYQ2elMP6Jx3VETDBa0CNyp4//tGKzusNhZdA7KM
XLZmkKRk3ZKabNo+p2J5t8UGJCl2L18U0o/EphfSkODKadUnsBbAPZUt2EGQCZwv
uZhDFAUkgTFBkeRO7JwTfDrNi51M4zwmh+kEduzg4Ny4TdFb1UapU7K1N330WMru
4Qa953Met9I4NB/kKI+fZP1lN4NGuD2qEU6yoZVSy4UiqRp1gEg8kOUfVGFbNJa1
VFYcwdrBad0I4PjnQc5bpZVjzqJT5wWiZxjlWrB7VyIfdmnvQxe5h4DBwBhN5FJr
+MEtuY2QNFygjDAZ5z0Ss8hegqI+FYi562Cjy9QRLhb3qGD8STF2BIChaILIn3oA
UVJUlUP6CUmCu1RZRMFB4/WkeHO46FmZxJErGfFeXqJInThf0/rdSZOQgIP0JsUq
N8FINEgXFAMCkK1PT7MNvAYkSP0tR7B0JjGKcSlGS3v3F0URCNGvHSiqbLedAtXT
lc1MdXTZxub8h6xhIvgY1j7HRAFGrunn7LD6MIKRWX1SZPWwAGI=
=DEUA
-----END PGP SIGNATURE-----
Merge tag 'for-6.12-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- regression fix: dirty extents tracked in xarray for qgroups must be
adjusted for 32bit platforms
- fix potentially freeing uninitialized name in fscrypt structure
- fix warning about unneeded variable in a send callback
* tag 'for-6.12-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix uninitialized pointer free on read_alloc_one_name() error
btrfs: send: cleanup unneeded return variable in changed_verity()
btrfs: fix uninitialized pointer free in add_inode_ref()
btrfs: use sector numbers as keys for the dirty extents xarray
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmcPUXEACgkQiiy9cAdy
T1Gm7QwAlPW5//Cb4B0gpjzRcUws51IZ4yFhp4IQWmsd0RqdjZ4TxSCOPF3u3HR3
0OPxyLdbUn6h5g0S2ayzqomHx2VBOQTjgyuMtaTWzokToMNu8kqvxK1MTslkBior
9YEHUz9+5f0OJ+JBGNUzjfy4Plygr5y09udaLfqIknuY8+SeuooxNNUNfkIvrP7C
JsSAWJznN9VMpKJmszYc4ntyTiz1XVXyyjJmjhRQ27ah8LUghqZ0mamgigTS5UFa
U7eYBDfs6+9i5Lvkd4bJPdGyov9g/EPViLURZMfNaz3+p0TfosN8s2UZuhHC+zuv
BDQ+wHGRqzmteZspLanrGBt9y9svHXp1CD7MwqWeGR3GhKsfsxCMJpE931fBhsxM
vlJdd/xCs128fv48AvNyHA9abN0U1FpskOJhOzjDgvhKqDoIQ4TCC7QFDEttsPRv
ZiQmyOCPyZZY28EmfoltU4CFcMIwKQ81nPUSOJFgKmHBbSpc+Qtnv5QgRHZCzj7n
StJfaIMv
=WhJj
-----END PGP SIGNATURE-----
Merge tag 'v6.12-rc3-ksmbd-fixes' of git://git.samba.org/ksmbd
Pull smb server fixes from Steve French:
- fix race between session setup and session logoff
- add supplementary group support
* tag 'v6.12-rc3-ksmbd-fixes' of git://git.samba.org/ksmbd:
ksmbd: add support for supplementary groups
ksmbd: fix user-after-free from session log off
Syzbot reported that after nilfs2 reads a corrupted file system image
and degrades to read-only, the BUG_ON check for the buffer delay flag
in submit_bh_wbc() may fail, causing a kernel bug.
This is because the buffer delay flag is not cleared when clearing the
buffer state flags to discard a page/folio or a buffer head. So, fix
this.
This became necessary when the use of nilfs2's own page clear routine
was expanded. This state inconsistency does not occur if the buffer
is written normally by log writing.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Link: https://lore.kernel.org/r/20241015213300.7114-1-konishi.ryusuke@gmail.com
Fixes: 8c26c4e269 ("nilfs2: fix issue with flush kernel thread after remount in RO mode because of driver's internal error or metadata corruption")
Reported-by: syzbot+985ada84bf055a575c07@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=985ada84bf055a575c07
Cc: stable@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
yangyun reported that libfuse test test_copy_file_range() copies zero
bytes from a newly written file when fuse passthrough is enabled.
The reason is that extending passthrough write is not updating the fuse
inode size and when vfs_copy_file_range() observes a zero size inode,
it returns without calling the filesystem copy_file_range() method.
Fix this by adjusting the fuse inode size after an extending passthrough
write.
This does not provide cache coherency of fuse inode attributes and
backing inode attributes, but it should prevent situations where fuse
inode size is too small, causing read/copy to be wrongly shortened.
Reported-by: yangyun <yangyun50@huawei.com>
Closes: https://github.com/libfuse/libfuse/issues/1048
Fixes: 57e1176e60 ("fuse: implement read/write passthrough")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
cifs_ses_find_chan() has been unused since commit
f486ef8e20 ("cifs: use the chans_need_reconnect bitmap for reconnect status")
cifs_read_page_from_socket() has been unused since commit
d08089f649 ("cifs: Change the I/O paths to use an iterator rather than a page list")
cifs_chan_in_reconnect() has been unused since commit
bc962159e8 ("cifs: avoid race conditions with parallel reconnects")
Remove them.
Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
The if condition in collect_sample: can never be satisfied
because of a logical contradiction. The indicated dead code
may have performed some action; that action will never occur.
Fixes: 94ae8c3fee ("smb: client: compress: LZ77 code improvements cleanup")
Signed-off-by: Advait Dhamorikar <advaitdhamorikar@gmail.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
When using encryption, either enforced by the server or when using
'seal' mount option, the client will squash all compound request buffers
down for encryption into a single iov in smb2_set_next_command().
SMB2_ioctl_init() allocates a small buffer (448 bytes) to hold the
SMB2_IOCTL request in the first iov, and if the user passes an input
buffer that is greater than 328 bytes, smb2_set_next_command() will
end up writing off the end of @rqst->iov[0].iov_base as shown below:
mount.cifs //srv/share /mnt -o ...,seal
ln -s $(perl -e "print('a')for 1..1024") /mnt/link
BUG: KASAN: slab-out-of-bounds in
smb2_set_next_command.cold+0x1d6/0x24c [cifs]
Write of size 4116 at addr ffff8881148fcab8 by task ln/859
CPU: 1 UID: 0 PID: 859 Comm: ln Not tainted 6.12.0-rc3 #1
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS
1.16.3-2.fc40 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x5d/0x80
? smb2_set_next_command.cold+0x1d6/0x24c [cifs]
print_report+0x156/0x4d9
? smb2_set_next_command.cold+0x1d6/0x24c [cifs]
? __virt_addr_valid+0x145/0x310
? __phys_addr+0x46/0x90
? smb2_set_next_command.cold+0x1d6/0x24c [cifs]
kasan_report+0xda/0x110
? smb2_set_next_command.cold+0x1d6/0x24c [cifs]
kasan_check_range+0x10f/0x1f0
__asan_memcpy+0x3c/0x60
smb2_set_next_command.cold+0x1d6/0x24c [cifs]
smb2_compound_op+0x238c/0x3840 [cifs]
? kasan_save_track+0x14/0x30
? kasan_save_free_info+0x3b/0x70
? vfs_symlink+0x1a1/0x2c0
? do_symlinkat+0x108/0x1c0
? __pfx_smb2_compound_op+0x10/0x10 [cifs]
? kmem_cache_free+0x118/0x3e0
? cifs_get_writable_path+0xeb/0x1a0 [cifs]
smb2_get_reparse_inode+0x423/0x540 [cifs]
? __pfx_smb2_get_reparse_inode+0x10/0x10 [cifs]
? rcu_is_watching+0x20/0x50
? __kmalloc_noprof+0x37c/0x480
? smb2_create_reparse_symlink+0x257/0x490 [cifs]
? smb2_create_reparse_symlink+0x38f/0x490 [cifs]
smb2_create_reparse_symlink+0x38f/0x490 [cifs]
? __pfx_smb2_create_reparse_symlink+0x10/0x10 [cifs]
? find_held_lock+0x8a/0xa0
? hlock_class+0x32/0xb0
? __build_path_from_dentry_optional_prefix+0x19d/0x2e0 [cifs]
cifs_symlink+0x24f/0x960 [cifs]
? __pfx_make_vfsuid+0x10/0x10
? __pfx_cifs_symlink+0x10/0x10 [cifs]
? make_vfsgid+0x6b/0xc0
? generic_permission+0x96/0x2d0
vfs_symlink+0x1a1/0x2c0
do_symlinkat+0x108/0x1c0
? __pfx_do_symlinkat+0x10/0x10
? strncpy_from_user+0xaa/0x160
__x64_sys_symlinkat+0xb9/0xf0
do_syscall_64+0xbb/0x1d0
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f08d75c13bb
Reported-by: David Howells <dhowells@redhat.com>
Fixes: e77fe73c7e ("cifs: we can not use small padding iovs together with encryption")
Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Clang static checker(scan-build) warning:
fs/smb/client/smb2ops.c:1304:2: Attempt to free released memory.
1304 | kfree(ea);
| ^~~~~~~~~
There is a double free in such case:
'ea is initialized to NULL' -> 'first successful memory allocation for
ea' -> 'something failed, goto sea_exit' -> 'first memory release for ea'
-> 'goto replay_again' -> 'second goto sea_exit before allocate memory
for ea' -> 'second memory release for ea resulted in double free'.
Re-initialie 'ea' to NULL near to the replay_again label, it can fix this
double free problem.
Fixes: 4f1fffa237 ("cifs: commands that are retried should have replay flag set")
Reviewed-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Su Hui <suhui@nfschina.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
- New metadata version inode_has_child_snapshots
This fixes bugs with handling of unlinked inodes + snapshots, in
particular when an inode is reattached after taking a snapshot;
deleted inodes now get correctly cleaned up across snapshots.
- Disk accounting rewrite fixes
- validation fixes for when a device has been removed
- fix journal replay failing with "journal_reclaim_would_deadlock"
- Some more small fixes for erasure coding + device removal
- Assorted small syzbot fixes
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmcNw4UACgkQE6szbY3K
bnbSzBAAmSCCQCqRwnFSp4OdNSlBK9q1e5WsbKOqHgtoXZU/mOUBe/5bnPPqm6Mg
GkTc7FqVOs/95/rEDKXw2LneFgxRrt8MriJCUdXZvV5fC2R4Kdl0TkwABtMtm2Ae
wp37n6iQO81j4uZHfOj67RzC2NRo7dMdun5HnQPRBTKzyuDaZXqwjMmF2LmaeODh
oiBFUvD5nFBo5XvXPABBin6xpdquHO+6ZWf6SFD4+iRe11NrJAOAIS/crJvxsFfr
I/X152Z+gzKPE+NhANKMxlHyNnVGo7iHUqhUjVuI4SSaXb9Ap6k4sXgfoIzncR17
GA5qWtaNS1W72+awT3R2EaF9Tqi+Vng2RVfxxQ04giImnBq0eziOjlZ26enOE0LU
0ZZrBFzqpItqYbNnzPissHuKb1mAQGPWy6kxoGIrqDKbichA7lzyWDz2lgEE85Sx
E1mvHwYbKhUuLC4c4460hueGVUgMWmjqM3E8oex+oNDpauPB+/bnYkcgZEG2RBla
+ZlDL28fg4fxtqlUrOQeonQ1RecGNdRMJz7xiGnkYU9rQpUuv8QwFiBZGAbLP6zn
6fbFZGxS/pO95sY7GmAtKz7ZgKxJQCzII4s+Oht5AgOvoBlPjAiol1UbwYadYQxz
HKF+WBaPC9z/L6JjP+gx+uUzTWRIfBmhHylhWbKr4vLGfx3Jc1g=
=Rkq2
-----END PGP SIGNATURE-----
Merge tag 'bcachefs-2024-10-14' of git://evilpiepirate.org/bcachefs
Pull bcachefs fixes from Kent Overstreet:
- New metadata version inode_has_child_snapshots
This fixes bugs with handling of unlinked inodes + snapshots, in
particular when an inode is reattached after taking a snapshot;
deleted inodes now get correctly cleaned up across snapshots.
- Disk accounting rewrite fixes
- validation fixes for when a device has been removed
- fix journal replay failing with "journal_reclaim_would_deadlock"
- Some more small fixes for erasure coding + device removal
- Assorted small syzbot fixes
* tag 'bcachefs-2024-10-14' of git://evilpiepirate.org/bcachefs: (27 commits)
bcachefs: Fix sysfs warning in fstests generic/730,731
bcachefs: Handle race between stripe reuse, invalidate_stripe_to_dev
bcachefs: Fix kasan splat in new_stripe_alloc_buckets()
bcachefs: Add missing validation for bch_stripe.csum_granularity_bits
bcachefs: Fix missing bounds checks in bch2_alloc_read()
bcachefs: fix uaf in bch2_dio_write_done()
bcachefs: Improve check_snapshot_exists()
bcachefs: Fix bkey_nocow_lock()
bcachefs: Fix accounting replay flags
bcachefs: Fix invalid shift in member_to_text()
bcachefs: Fix bch2_have_enough_devs() for BCH_SB_MEMBER_INVALID
bcachefs: __wait_for_freeing_inode: Switch to wait_bit_queue_entry
bcachefs: Check if stuck in journal_res_get()
closures: Add closure_wait_event_timeout()
bcachefs: Fix state lock involved deadlock
bcachefs: Fix NULL pointer dereference in bch2_opt_to_text
bcachefs: Release transaction before wake up
bcachefs: add check for btree id against max in try read node
bcachefs: Disk accounting device validation fixes
bcachefs: bch2_inode_or_descendents_is_open()
...
When ->iomap_end is called on a short write to the COW fork it needs to
punch stale delalloc data from the COW fork and not the data fork.
Ensure that IOMAP_F_NEW is set for new COW fork allocations in
xfs_buffered_write_iomap_begin, and then use the IOMAP_F_SHARED flag
in xfs_buffered_write_delalloc_punch to decide which fork to punch.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Change to always set xfs_buffered_write_iomap_begin for COW fork
allocations even if they don't overlap existing data fork extents,
which will allow the iomap_end callback to detect if it has to punch
stale delalloc blocks from the COW fork instead of the data fork. It
also means we sample the sequence counter for both the data and the COW
fork when writing to the COW fork, which ensures we properly revalidate
when only COW fork changes happens.
This is essentially a revert of commit 72a048c105 ("xfs: only set
IOMAP_F_SHARED when providing a srcmap to a write"). This is fine because
the problem that the commit fixed has now been dealt with in iomap by
only looking at the actual srcmap and not the fallback to the write
iomap.
Note that the direct I/O path was never changed and has always set
IOMAP_F_SHARED for all COW fork allocations.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Introduce a local iomap_flags variable so that the code allocating new
delalloc blocks in the data fork can fall through to the found_imap
label and reuse the code to unlock and fill the iomap.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>