linux-stable/fs
Filipe Manana 7d6872ccbd btrfs: fix deadlock between transaction commits and extent locks
When running a workload with fsstress and duperemove (generic/561) we can
hit a deadlock related to transaction commits and locking extent ranges,
as described below.

Task A hanging during a transaction commit, waiting for all other writers
to complete:

  [178317.334817] INFO: task fsstress:555623 blocked for more than 120 seconds.
  [178317.335693]       Not tainted 6.12.0-rc6-btrfs-next-179+ #1
  [178317.336528] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [178317.337673] task:fsstress        state:D stack:0     pid:555623 tgid:555623 ppid:555620 flags:0x00004002
  [178317.337679] Call Trace:
  [178317.337681]  <TASK>
  [178317.337685]  __schedule+0x364/0xbe0
  [178317.337691]  schedule+0x26/0xa0
  [178317.337695]  btrfs_commit_transaction+0x5c5/0x1050 [btrfs]
  [178317.337769]  ? start_transaction+0xc4/0x800 [btrfs]
  [178317.337815]  ? __pfx_autoremove_wake_function+0x10/0x10
  [178317.337819]  btrfs_mksubvol+0x381/0x640 [btrfs]
  [178317.337878]  btrfs_mksnapshot+0x7a/0xb0 [btrfs]
  [178317.337935]  __btrfs_ioctl_snap_create+0x1bb/0x1d0 [btrfs]
  [178317.337995]  btrfs_ioctl_snap_create_v2+0x103/0x130 [btrfs]
  [178317.338053]  btrfs_ioctl+0x29b/0x2a90 [btrfs]
  [178317.338118]  ? kmem_cache_alloc_noprof+0x5f/0x2c0
  [178317.338126]  ? getname_flags+0x45/0x1f0
  [178317.338133]  ? _raw_spin_unlock+0x15/0x30
  [178317.338145]  ? __x64_sys_ioctl+0x88/0xc0
  [178317.338149]  __x64_sys_ioctl+0x88/0xc0
  [178317.338152]  do_syscall_64+0x4a/0x110
  [178317.338160]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
  [178317.338190] RIP: 0033:0x7f13c28e271b

Which corresponds to line 2361 of transaction.c:

  $ cat -n fs/btrfs/transaction.c
  (...)
  2162  int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
  2163  {
  (...)
  2349          spin_lock(&fs_info->trans_lock);
  2350          add_pending_snapshot(trans);
  2351          cur_trans->state = TRANS_STATE_COMMIT_DOING;
  2352          spin_unlock(&fs_info->trans_lock);
  2353
  2354          /*
  2355           * The thread has started/joined the transaction thus it holds the
  2356           * lockdep map as a reader. It has to release it before acquiring the
  2357           * lockdep map as a writer.
  2358           */
  2359          btrfs_lockdep_release(fs_info, btrfs_trans_num_writers);
  2360          btrfs_might_wait_for_event(fs_info, btrfs_trans_num_writers);
  2361          wait_event(cur_trans->writer_wait,
  2362                     atomic_read(&cur_trans->num_writers) == 1);
  (...)

The transaction is in the TRANS_STATE_COMMIT_DOING state and so it's
waiting for all other existing writers to complete and release their
transaction handle.

Task B is running ordered extent completion and blocked waiting to lock an
extent range in an inode's io tree:

  [178317.327411] INFO: task kworker/u48:8:554545 blocked for more than 120 seconds.
  [178317.328630]       Not tainted 6.12.0-rc6-btrfs-next-179+ #1
  [178317.329635] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [178317.330872] task:kworker/u48:8   state:D stack:0     pid:554545 tgid:554545 ppid:2      flags:0x00004000
  [178317.330878] Workqueue: btrfs-endio-write btrfs_work_helper [btrfs]
  [178317.330944] Call Trace:
  [178317.330945]  <TASK>
  [178317.330947]  __schedule+0x364/0xbe0
  [178317.330952]  schedule+0x26/0xa0
  [178317.330955]  __lock_extent+0x337/0x3a0 [btrfs]
  [178317.331014]  ? __pfx_autoremove_wake_function+0x10/0x10
  [178317.331017]  btrfs_finish_one_ordered+0x47a/0xaa0 [btrfs]
  [178317.331074]  ? psi_group_change+0x132/0x2d0
  [178317.331078]  btrfs_work_helper+0xbd/0x370 [btrfs]
  [178317.331140]  process_scheduled_works+0xd3/0x460
  [178317.331144]  ? __pfx_worker_thread+0x10/0x10
  [178317.331146]  worker_thread+0x121/0x250
  [178317.331149]  ? __pfx_worker_thread+0x10/0x10
  [178317.331151]  kthread+0xe9/0x120
  [178317.331154]  ? __pfx_kthread+0x10/0x10
  [178317.331157]  ret_from_fork+0x2d/0x50
  [178317.331159]  ? __pfx_kthread+0x10/0x10
  [178317.331162]  ret_from_fork_asm+0x1a/0x30

This extent range locking happens after joining the current transaction,
so task A is waiting for task B to release its transaction handle
(decrementing the transaction's num_writers counter).

Task C while doing a fiemap it tries to join the current transaction:

  [242682.812815] task:pool            state:D stack:0     pid:560767 tgid:560724 ppid:555622 flags:0x00004006
  [242682.812827] Call Trace:
  [242682.812856]  <TASK>
  [242682.812864]  __schedule+0x364/0xbe0
  [242682.812879]  ? _raw_spin_unlock_irqrestore+0x23/0x40
  [242682.812897]  schedule+0x26/0xa0
  [242682.812909]  wait_current_trans+0xd6/0x130 [btrfs]
  [242682.813148]  ? __pfx_autoremove_wake_function+0x10/0x10
  [242682.813162]  start_transaction+0x3d4/0x800 [btrfs]
  [242682.813399]  btrfs_is_data_extent_shared+0xd2/0x440 [btrfs]
  [242682.813723]  fiemap_process_hole+0x2a2/0x300 [btrfs]
  [242682.813995]  extent_fiemap+0x9b8/0xb80 [btrfs]
  [242682.814249]  btrfs_fiemap+0x78/0xc0 [btrfs]
  [242682.814501]  do_vfs_ioctl+0x2db/0xa50
  [242682.814519]  __x64_sys_ioctl+0x6a/0xc0
  [242682.814531]  do_syscall_64+0x4a/0x110
  [242682.814544]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
  [242682.814556] RIP: 0033:0x7efff595e71b

It tries to join the current transaction, but it can't because the
transaction is in the TRANS_STATE_COMMIT_DOING state, so
join_transaction() returns -EBUSY to start_transaction() and makes it
wait for the current transaction to complete. And while it's waiting
for the transaction to complete, it's holding an extent range locked
in the same inode that task B is operating, which causes a deadlock
between these 3 tasks. The extent range for the inode was locked at
the start of the fiemap operation, early at extent_fiemap().

In short these tasks deadlock because:

1) Task A is waiting for task B to release its transaction handle;

2) Task B is waiting to lock an extent range for an inode while holding a
   transaction handle open;

3) Task C is waiting for the current transaction to complete (for task A
   to finish the transaction commit) while holding the extent range for
   the inode locked, so task B can't progress and release its transaction
   handle.

This results in an ABBA deadlock involving transaction commits and extent
locks. Extent locks are higher level locks, like inode VFS locks, and
should always be acquired before joining or starting a transaction, but
recently commit 2206265f41 ("btrfs: remove code duplication in ordered
extent finishing") accidentally changed btrfs_finish_one_ordered() to do
the transaction join before locking the extent range.

Fix this by making sure that btrfs_finish_one_ordered() always locks the
extent before joining a transaction and add an explicit comment about the
need for this order.

Fixes: 2206265f41 ("btrfs: remove code duplication in ordered extent finishing")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-28 20:46:40 +01:00
..
9p Revert patches causing inode collision problems 2024-10-25 15:25:02 -07:00
adfs move asm/unaligned.h to linux/unaligned.h 2024-10-02 17:23:23 -04:00
affs affs-for-6.12-tag 2024-09-16 13:07:59 +02:00
afs vfs-6.12-rc6.fixes 2024-11-01 07:37:10 -10:00
autofs autofs: fix thinko in validate_dev_ioctl() 2024-10-28 13:16:56 +01:00
bcachefs bcachefs: Fix UAF in __promote_alloc() error path 2024-11-07 16:48:21 -05:00
befs befs: Convert befs_symlink_read_folio() to use folio_end_read() 2024-05-31 12:31:39 +02:00
bfs fs: Convert aops->write_begin to take a folio 2024-08-07 11:33:21 +02:00
btrfs btrfs: fix deadlock between transaction commits and extent locks 2024-11-28 20:46:40 +01:00
cachefiles cachefiles: fix dentry leak in cachefiles_open_file() 2024-09-27 18:29:19 +02:00
ceph A fix from Patrick for a variety of CephFS lockup scenarios caused by 2024-10-04 10:10:23 -07:00
coda coda: use param->file for FSCONFIG_SET_FD 2024-08-19 13:45:03 +02:00
configfs fs/configfs: Add a callback to determine attribute visibility 2024-06-17 20:42:57 +02:00
cramfs vfs-6.11.module.description 2024-07-15 11:14:59 -07:00
crypto move asm/unaligned.h to linux/unaligned.h 2024-10-02 17:23:23 -04:00
debugfs [tree-wide] finally take no_llseek out 2024-09-27 08:18:43 -07:00
devpts
dlm [tree-wide] finally take no_llseek out 2024-09-27 08:18:43 -07:00
ecryptfs move asm/unaligned.h to linux/unaligned.h 2024-10-02 17:23:23 -04:00
efivarfs [tree-wide] finally take no_llseek out 2024-09-27 08:18:43 -07:00
efs vfs-6.11.module.description 2024-07-15 11:14:59 -07:00
erofs erofs: use get_tree_bdev_flags() to avoid misleading messages 2024-10-21 14:30:27 +02:00
exfat move asm/unaligned.h to linux/unaligned.h 2024-10-02 17:23:23 -04:00
exportfs fhandle: relax open_by_handle_at() permission checks 2024-05-28 15:57:23 +02:00
ext2 vfs-6.12.file 2024-09-16 09:14:02 +02:00
ext4 ext4: fix off by one issue in alloc_flex_gd() 2024-10-04 17:36:28 -04:00
f2fs f2fs: allow parallel DIO reads 2024-10-11 15:12:07 +00:00
fat fat: fix uninitialized variable 2024-10-17 00:28:06 -07:00
freevxfs freevxfs: Convert freevxfs to the new mount API. 2024-03-26 09:04:53 +01:00
fuse fuse: remove stray debug line 2024-10-25 17:05:49 +02:00
gfs2 gfs2 changes 2024-09-23 11:55:17 -07:00
hfs fs: Convert aops->write_begin to take a folio 2024-08-07 11:33:21 +02:00
hfsplus move asm/unaligned.h to linux/unaligned.h 2024-10-02 17:23:23 -04:00
hostfs fs: Convert aops->write_begin to take a folio 2024-08-07 11:33:21 +02:00
hpfs move asm/unaligned.h to linux/unaligned.h 2024-10-02 17:23:23 -04:00
hugetlbfs fs: Convert aops->write_begin to take a folio 2024-08-07 11:33:21 +02:00
iomap vfs-6.12-rc6.iomap 2024-11-01 07:45:00 -10:00
isofs move asm/unaligned.h to linux/unaligned.h 2024-10-02 17:23:23 -04:00
jbd2 jbd2: remove unneeded check of ret in jbd2_fc_get_buf 2024-08-26 23:49:15 -04:00
jffs2 jffs2: Use a folio in jffs2_garbage_collect_dnode() 2024-08-19 13:40:00 +02:00
jfs jfs: Fix sanity check in dbMount 2024-10-22 09:40:37 -05:00
kernfs kernfs: mount: Remove unnecessary ‘NULL’ values from knparent 2024-05-04 19:02:39 +02:00
lockd move asm/unaligned.h to linux/unaligned.h 2024-10-02 17:23:23 -04:00
minix buffer: Convert __block_write_begin() to take a folio 2024-08-07 11:33:36 +02:00
netfs netfs: Downgrade i_rwsem for a buffered write 2024-10-17 15:33:42 +02:00
nfs nfs: avoid i_lock contention in nfs_clear_invalid_mapping 2024-11-04 10:24:19 -05:00
nfs_common nfs_common: fix localio to cope with racing nfs_local_probe() 2024-11-04 10:24:19 -05:00
nfsd nfsd-6.12 fixes: 2024-11-09 13:18:07 -08:00
nilfs2 nilfs2: fix potential deadlock with newly created symlinks 2024-10-30 20:14:12 -07:00
nls move asm/unaligned.h to linux/unaligned.h 2024-10-02 17:23:23 -04:00
notify inotify: Fix possible deadlock in fsnotify_destroy_mark 2024-10-02 15:14:29 +02:00
ntfs3 Changes for 6.12-rc3 2024-10-08 10:53:06 -07:00
ocfs2 ocfs2: remove entry once instead of null-ptr-dereference in ocfs2_xa_remove() 2024-11-07 14:14:59 -08:00
omfs fs: Convert aops->write_begin to take a folio 2024-08-07 11:33:21 +02:00
openpromfs openpromfs: add missing MODULE_DESCRIPTION() macro 2024-06-20 09:46:01 +02:00
orangefs move asm/unaligned.h to linux/unaligned.h 2024-10-02 17:23:23 -04:00
overlayfs fs: pass offset and result to backing_file end_write() callback 2024-10-16 13:17:45 +02:00
proc 20 hotfixes, 14 of which are cc:stable. 2024-11-10 09:04:27 -08:00
pstore drm next for 6.12-rc1 2024-09-19 10:18:15 +02:00
qnx4 qnx4: add MODULE_DESCRIPTION() 2024-05-28 11:52:53 +02:00
qnx6 qnx6: Convert directory handling to use kmap_local 2024-08-07 11:31:56 +02:00
quota \n 2024-09-23 10:49:28 -07:00
ramfs mm: switch mm->get_unmapped_area() to a flag 2024-04-25 20:56:25 -07:00
reiserfs move asm/unaligned.h to linux/unaligned.h 2024-10-02 17:23:23 -04:00
romfs romfs: fix romfs_read_folio() 2024-08-21 22:32:58 +02:00
smb fix net namespace refcount issue 2024-11-09 12:58:23 -08:00
squashfs Squashfs: fix variable overflow in squashfs_readpage_block 2024-10-30 20:14:12 -07:00
sysfs Merge 6.9-rc5 into driver-core-next 2024-04-23 13:27:43 +02:00
sysv buffer: Convert __block_write_begin() to take a folio 2024-08-07 11:33:36 +02:00
tests execve: Move KUnit tests to tests/ subdirectory 2024-07-22 18:25:47 -07:00
tracefs tracing: Fix tracefs mount options 2024-11-01 08:38:14 -04:00
ubifs [tree-wide] finally take no_llseek out 2024-09-27 08:18:43 -07:00
udf udf: fix uninit-value use in udf_get_fileshortad 2024-10-02 14:32:37 +02:00
ufs ufs_rename(): fix bogus argument of folio_release_kmap() 2024-10-02 00:05:09 -04:00
unicode unicode: Don't special case ignorable code points 2024-10-09 13:34:01 -04:00
vboxsf fs: Convert aops->write_end to take a folio 2024-08-07 11:32:02 +02:00
verity fsverity: expose verified fsverity built-in signatures to LSMs 2024-08-20 14:03:18 -04:00
xfs XFS bug fies for 6.12-rc6 2024-11-02 09:22:16 -10:00
zonefs zonefs fixes for 6.12-rc2 2024-10-02 12:02:15 -07:00
aio.c fs/aio: Fix __percpu annotation of *cpu pointer in struct kioctx 2024-08-19 13:45:03 +02:00
anon_inodes.c fs: Create anon_inode_getfile_fmode() 2024-04-26 10:33:05 +02:00
attr.c nfsd-6.11 fixes: 2024-08-29 06:20:44 +12:00
backing-file.c fs: pass offset and result to backing_file end_write() callback 2024-10-16 13:17:45 +02:00
bad_inode.c
binfmt_elf_fdpic.c binfmt_elf_fdpic: fix AUXV size calculation when ELF_HWCAP2 is defined 2024-08-26 13:00:38 -07:00
binfmt_elf.c Revert "binfmt_elf, coredump: Log the reason of the failed core dumps" 2024-09-26 11:39:02 -07:00
binfmt_flat.c move asm/unaligned.h to linux/unaligned.h 2024-10-02 17:23:23 -04:00
binfmt_misc.c vfs-6.11.module.description 2024-07-15 11:14:59 -07:00
binfmt_script.c fs: binfmt: add missing MODULE_DESCRIPTION() macros 2024-05-28 12:06:51 +02:00
bpf_fs_kfuncs.c bpf: Add kfunc bpf_get_dentry_xattr() to read xattr from dentry 2024-08-07 11:26:54 -07:00
buffer.c vfs-6.12.folio 2024-09-16 08:54:30 +02:00
char_dev.c
compat_binfmt_elf.c
coredump.c Revert "binfmt_elf, coredump: Log the reason of the failed core dumps" 2024-09-26 11:39:02 -07:00
d_path.c
dax.c fsdax: dax_unshare_iter needs to copy entire blocks 2024-10-07 13:51:47 +02:00
dcache.c vfs-6.12.misc 2024-09-16 08:35:09 +02:00
direct-io.c fs/direct-io: Remove linux/prefetch.h include 2024-08-19 13:45:02 +02:00
drop_caches.c sysctl: treewide: constify the ctl_table argument of proc_handlers 2024-07-24 20:59:29 +02:00
eventfd.c introduce fd_file(), convert all accessors to it. 2024-08-12 22:00:43 -04:00
eventpoll.c struct fd layout change (and conversion to accessor helpers) 2024-09-23 09:35:36 -07:00
exec.c ALong with the usual shower of singleton patches, notable patch series in 2024-09-21 07:29:05 -07:00
fcntl.c struct fd layout change (and conversion to accessor helpers) 2024-09-23 09:35:36 -07:00
fhandle.c struct fd layout change (and conversion to accessor helpers) 2024-09-23 09:35:36 -07:00
file_table.c slab updates for 6.12 2024-09-18 08:53:53 +02:00
file.c close_range(): fix the logics in descriptor table trimming 2024-09-29 21:52:29 -04:00
filesystems.c
fs_context.c
fs_parser.c fs_parse: add uid & gid option option parsing helpers 2024-07-02 06:20:49 +02:00
fs_pin.c
fs_struct.c
fs_types.c
fs-writeback.c inode: port __I_SYNC to var event 2024-08-30 08:22:39 +02:00
fsopen.c [tree-wide] finally take no_llseek out 2024-09-27 08:18:43 -07:00
init.c
inode.c bcachefs: do not use PF_MEMALLOC_NORECLAIM 2024-10-09 12:47:18 -07:00
internal.h file: reclaim 24 bytes from f_owner 2024-08-28 13:05:39 +02:00
ioctl.c introduce fd_file(), convert all accessors to it. 2024-08-12 22:00:43 -04:00
Kconfig nfs_common: fix Kconfig for NFS_COMMON_LOCALIO_SUPPORT 2024-10-03 16:19:51 -04:00
Kconfig.binfmt exec: Add KUnit test for bprm_stack_limits() 2024-06-19 13:13:55 -07:00
kernel_read_file.c introduce fd_file(), convert all accessors to it. 2024-08-12 22:00:43 -04:00
libfs.c vfs-6.12.folio 2024-09-16 08:54:30 +02:00
locks.c struct fd layout change (and conversion to accessor helpers) 2024-09-23 09:35:36 -07:00
Makefile bpf: introduce new VFS based BPF kfuncs 2024-08-06 09:01:41 -07:00
mbcache.c
mnt_idmapping.c fuse update for 6.12 2024-09-24 15:29:42 -07:00
mount.h vfs-6.12.mount 2024-09-16 11:15:26 +02:00
mpage.c buffer: Remove calls to set and clear the folio error flag 2024-05-31 12:31:43 +02:00
namei.c struct fd layout change (and conversion to accessor helpers) 2024-09-23 09:35:36 -07:00
namespace.c fs: don't try and remove empty rbtree node 2024-10-17 15:33:43 +02:00
nsfs.c [tree-wide] finally take no_llseek out 2024-09-27 08:18:43 -07:00
open.c openat2: explicitly return -E2BIG for (usize > PAGE_SIZE) 2024-10-10 12:09:03 +02:00
pidfs.c pidfs: check for valid pid namespace 2024-09-27 18:29:19 +02:00
pipe.c [tree-wide] finally take no_llseek out 2024-09-27 08:18:43 -07:00
pnode.c
pnode.h
posix_acl.c fs: Use in_group_or_capable() helper to simplify the code 2024-08-30 08:22:37 +02:00
proc_namespace.c fs: rename show_mnt_opts -> show_vfsmnt_opts 2024-06-28 14:36:43 +02:00
read_write.c struct fd layout change (and conversion to accessor helpers) 2024-09-23 09:35:36 -07:00
readdir.c introduce fd_file(), convert all accessors to it. 2024-08-12 22:00:43 -04:00
remap_range.c introduce fd_file(), convert all accessors to it. 2024-08-12 22:00:43 -04:00
select.c struct fd layout change (and conversion to accessor helpers) 2024-09-23 09:35:36 -07:00
seq_file.c seq_file: Simplify __seq_puts() 2024-05-02 16:28:20 +02:00
signalfd.c struct fd layout change (and conversion to accessor helpers) 2024-09-23 09:35:36 -07:00
splice.c introduce fd_file(), convert all accessors to it. 2024-08-12 22:00:43 -04:00
stack.c
stat.c introduce fd_file(), convert all accessors to it. 2024-08-12 22:00:43 -04:00
statfs.c introduce fd_file(), convert all accessors to it. 2024-08-12 22:00:43 -04:00
super.c fs/super.c: introduce get_tree_bdev_flags() 2024-10-21 14:30:26 +02:00
sync.c introduce fd_file(), convert all accessors to it. 2024-08-12 22:00:43 -04:00
sysctls.c
timerfd.c introduce fd_file(), convert all accessors to it. 2024-08-12 22:00:43 -04:00
userfaultfd.c fork: do not invoke uffd on fork if error occurs 2024-10-28 21:40:38 -07:00
utimes.c introduce fd_file(), convert all accessors to it. 2024-08-12 22:00:43 -04:00
xattr.c introduce fd_file(), convert all accessors to it. 2024-08-12 22:00:43 -04:00