7776 Commits

Author SHA1 Message Date
Demi Marie Obenour
a85f1a9de9 dm ioctl: Refuse to create device named "control"
Typical userspace setups create a symlink under /dev/mapper with the
name of the device, but /dev/mapper/control is reserved for DM's control
device.  Therefore, trying to create such a device is almost certain to
be a userspace bug.

Signed-off-by: Demi Marie Obenour <demi@invisiblethingslab.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-06-23 10:31:51 -04:00
Demi Marie Obenour
249bed821b dm ioctl: Avoid double-fetch of version
The version is fetched once in check_version(), which then does some
validation and then overwrites the version in userspace with the API
version supported by the kernel.  copy_params() then fetches the version
from userspace *again*, and this time no validation is done.  The result
is that the kernel's version number is completely controllable by
userspace, provided that userspace can win a race condition.

Fix this flaw by not copying the version back to the kernel the second
time.  This is not exploitable as the version is not further used in the
kernel.  However, it could become a problem if future patches start
relying on the version field.

Cc: stable@vger.kernel.org
Signed-off-by: Demi Marie Obenour <demi@invisiblethingslab.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-06-23 10:31:51 -04:00
Demi Marie Obenour
10655c7a48 dm ioctl: structs and parameter strings must not overlap
The NUL terminator for each target parameter string must precede the
following 'struct dm_target_spec'.  Otherwise, dm_split_args() might
corrupt this struct.  Furthermore, the first 'struct dm_target_spec'
must come after the 'struct dm_ioctl', as if it overlaps too much
dm_split_args() could corrupt the 'struct dm_ioctl'.

Signed-off-by: Demi Marie Obenour <demi@invisiblethingslab.com>
Reviewed-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-06-23 10:31:51 -04:00
Demi Marie Obenour
13f4a697f8 dm ioctl: Avoid pointer arithmetic overflow
Especially on 32-bit systems, it is possible for the pointer
arithmetic to overflow and cause a userspace pointer to be
dereferenced in the kernel.

Signed-off-by: Demi Marie Obenour <demi@invisiblethingslab.com>
Reviewed-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-06-23 10:31:51 -04:00
Demi Marie Obenour
b60528d9e6 dm ioctl: Check dm_target_spec is sufficiently aligned
Otherwise subsequent code, if given malformed input, could dereference
a misaligned 'struct dm_target_spec *'.

Signed-off-by: Demi Marie Obenour <demi@invisiblethingslab.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> # use %zu
Reviewed-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-06-23 10:31:49 -04:00
Andy Shevchenko
25c9a4ab4d dm integrity: Use %*ph for printing hexdump of a small buffer
The kernel already has a helper to print a hexdump of a small
buffer via pointer extension. Use that instead of open coded
variant.

In long term it helps to kill pr_cont() or at least narrow down
its use.

Note, the format is slightly changed, i.e. the trailing space is
always printed. Also the IV dump is limited by 64 bytes which seems
fine.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-06-22 18:39:50 -04:00
Linus Torvalds
8ba90f5cc7 19 hotfixes. 8 of these are cc:stable.
This includes a wholesale reversion of the post-6.4 series "make slab shrink
 lockless".  After input from Dave Chinner it has been decided that we
 should go a different way.  Thread starts at
 https://lkml.kernel.org/r/ZH6K0McWBeCjaf16@dread.disaster.area.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZJH/qAAKCRDdBJ7gKXxA
 jq7uAP9AtDGHfvOuW5jlHdYfpUBnbfuQDKjiik71UuIxyhtwQQEAqpOBv7UDuhHj
 NbNIGTIi/xM5vkpjV6CBo9ymR7qTKwo=
 =uGuc
 -----END PGP SIGNATURE-----

Merge tag 'mm-hotfixes-stable-2023-06-20-12-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull hotfixes from Andrew Morton:
 "19 hotfixes.  8 of these are cc:stable.

  This includes a wholesale reversion of the post-6.4 series 'make slab
  shrink lockless'. After input from Dave Chinner it has been decided
  that we should go a different way [1]"

Link: https://lkml.kernel.org/r/ZH6K0McWBeCjaf16@dread.disaster.area [1]

* tag 'mm-hotfixes-stable-2023-06-20-12-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  selftests/mm: fix cross compilation with LLVM
  mailmap: add entries for Ben Dooks
  nilfs2: prevent general protection fault in nilfs_clear_dirty_page()
  Revert "mm: vmscan: make global slab shrink lockless"
  Revert "mm: vmscan: make memcg slab shrink lockless"
  Revert "mm: vmscan: add shrinker_srcu_generation"
  Revert "mm: shrinkers: make count and scan in shrinker debugfs lockless"
  Revert "mm: vmscan: hold write lock to reparent shrinker nr_deferred"
  Revert "mm: vmscan: remove shrinker_rwsem from synchronize_shrinkers()"
  Revert "mm: shrinkers: convert shrinker_rwsem to mutex"
  nilfs2: fix buffer corruption due to concurrent device reads
  scripts/gdb: fix SB_* constants parsing
  scripts: fix the gfp flags header path in gfp-translate
  udmabuf: revert 'Add support for mapping hugepages (v4)'
  mm/khugepaged: fix iteration in collapse_file
  memfd: check for non-NULL file_seals in memfd_create() syscall
  mm/vmalloc: do not output a spurious warning when huge vmalloc() fails
  mm/mprotect: fix do_mprotect_pkey() limit check
  writeback: fix dereferencing NULL mapping->host on writeback_page_template
2023-06-20 17:20:22 -07:00
Catalin Marinas
7bc757140f dm-crypt: use ARCH_DMA_MINALIGN instead of ARCH_KMALLOC_MINALIGN
ARCH_DMA_MINALIGN represents the minimum (static) alignment for safe DMA
operations while ARCH_KMALLOC_MINALIGN is the minimum kmalloc() objects
alignment.

Link: https://lkml.kernel.org/r/20230612153201.554742-10-catalin.marinas@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Tested-by: Isaac J. Manjarres <isaacmanjarres@google.com>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Mike Snitzer <snitzer@kernel.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Jerry Snitselaar <jsnitsel@redhat.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Jonathan Cameron <jic23@kernel.org>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Lars-Peter Clausen <lars@metafoo.de>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Saravana Kannan <saravanak@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-19 16:19:21 -07:00
Qi Zheng
47a7c01c3e Revert "mm: shrinkers: convert shrinker_rwsem to mutex"
Patch series "revert shrinker_srcu related changes".


This patch (of 7):

This reverts commit cf2e309ebca7bb0916771839f9b580b06c778530.

Kernel test robot reports -88.8% regression in stress-ng.ramfs.ops_per_sec
test case [1], which is caused by commit f95bdb700bc6 ("mm: vmscan: make
global slab shrink lockless").  The root cause is that SRCU has to be
careful to not frequently check for SRCU read-side critical section exits.
Therefore, even if no one is currently in the SRCU read-side critical
section, synchronize_srcu() cannot return quickly.  That's why
unregister_shrinker() has become slower.

After discussion, we will try to use the refcount+RCU method [2] proposed
by Dave Chinner to continue to re-implement the lockless slab shrink.  So
revert the shrinker_mutex back to shrinker_rwsem first.

[1]. https://lore.kernel.org/lkml/202305230837.db2c233f-yujie.liu@intel.com/
[2]. https://lore.kernel.org/lkml/ZIJhou1d55d4H1s0@dread.disaster.area/

Link: https://lkml.kernel.org/r/20230609081518.3039120-1-qi.zheng@linux.dev
Link: https://lkml.kernel.org/r/20230609081518.3039120-2-qi.zheng@linux.dev
Reported-by: kernel test robot <yujie.liu@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202305230837.db2c233f-yujie.liu@intel.com
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Kirill Tkhai <tkhai@ya.ru>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Yujie Liu <yujie.liu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-19 13:19:33 -07:00
Mike Snitzer
fa37564624 dm thin: disable discards for thin-pool if no_discard_passdown
Also rename disable_passdown_if_not_supported to
disable_discard_passdown_if_not_supported.

And fold passdown_enabled() into only caller.

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-06-16 18:24:14 -04:00
Mike Snitzer
862c6663c1 dm: remove stale/redundant dm_internal_{suspend,resume} prototypes in dm.h
dm_internal_suspend() no longer exists.

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-06-16 18:24:14 -04:00
Mike Snitzer
c4f512d255 dm: skip dm-stats work in alloc_io() unless needed
Don't dm_stats_record_start() if dm_stats_used() is false.

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-06-16 18:24:13 -04:00
Mike Snitzer
06eed768ea dm: avoid needless dm_io access if all IO accounting is disabled
Update dm_io_acct() to eliminate most dm_io struct accesses if both
block core's IO stats and dm-stats are disabled.

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-06-16 18:24:13 -04:00
Li Nan
526d10061b dm: support turning off block-core's io stats accounting
Commit bc58ba9468d9 ("block: add sysfs file for controlling io stats
accounting") allowed users to turn off disk stat accounting completely
by checking if queue flag QUEUE_FLAG_IO_STAT is set. In dm, this flag
is neither set nor checked: so block-core's io stats are continuously
counted and cannot be turned off.

Add support for turning off block-core's io stats accounting for dm.
Set QUEUE_FLAG_IO_STAT for dm's request_queue. If QUEUE_FLAG_IO_STAT
is set when an io starts, record the need for block core's io stats by
setting the DM_IO_BLK_STAT dm_io flag to avoid io stats being disabled
in the middle of the io.

DM statistics (dm-stats) is independent of block-core's io stats and
remains unchanged.

Signed-off-by: Li Nan <linan122@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-06-16 18:24:13 -04:00
Christophe JAILLET
e118029cb7 dm zone: Use the bitmap API to allocate bitmaps
Use bitmap_zalloc()/bitmap_free() instead of hand-writing them.
It is less verbose and it improves the semantic.

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-06-16 18:24:13 -04:00
Li Lingfeng
d483001206 dm thin metadata: Fix ABBA deadlock by resetting dm_bufio_client
As described in commit 8111964f1b85 ("dm thin: Fix ABBA deadlock between
shrink_slab and dm_pool_abort_metadata"), ABBA deadlocks will be
triggered because shrinker_rwsem currently needs to held by
dm_pool_abort_metadata() as a side-effect of thin-pool metadata
operation failure.

The following three problem scenarios have been noticed:

1) Described by commit 8111964f1b85 ("dm thin: Fix ABBA deadlock between
   shrink_slab and dm_pool_abort_metadata")

2) shrinker_rwsem and throttle->lock
          P1(drop cache)                        P2(kworker)
drop_caches_sysctl_handler
 drop_slab
  shrink_slab
   down_read(&shrinker_rwsem)  - LOCK A
   do_shrink_slab
    super_cache_scan
     prune_icache_sb
      dispose_list
       evict
        ext4_evict_inode
         ext4_clear_inode
          ext4_discard_preallocations
           ext4_mb_load_buddy_gfp
            ext4_mb_init_cache
             ext4_wait_block_bitmap
              __ext4_error
               ext4_handle_error
                ext4_commit_super
                 ...
                 dm_submit_bio
                                     do_worker
                                      throttle_work_update
                                       down_write(&t->lock) -- LOCK B
                                      process_deferred_bios
                                       commit
                                        metadata_operation_failed
                                         dm_pool_abort_metadata
                                          dm_block_manager_create
                                           dm_bufio_client_create
                                            register_shrinker
                                             down_write(&shrinker_rwsem)
                                             -- LOCK A
                 thin_map
                  thin_bio_map
                   thin_defer_bio_with_throttle
                    throttle_lock
                     down_read(&t->lock)  - LOCK B

3) shrinker_rwsem and wait_on_buffer
          P1(drop cache)                            P2(kworker)
drop_caches_sysctl_handler
 drop_slab
  shrink_slab
   down_read(&shrinker_rwsem)  - LOCK A
   do_shrink_slab
   ...
    ext4_wait_block_bitmap
     __ext4_error
      ext4_handle_error
       jbd2_journal_abort
        jbd2_journal_update_sb_errno
         jbd2_write_superblock
          submit_bh
           // LOCK B
           // RELEASE B
                             do_worker
                              throttle_work_update
                               down_write(&t->lock) - LOCK B
                              process_deferred_bios
                               process_bio
                               commit
                                metadata_operation_failed
                                 dm_pool_abort_metadata
                                  dm_block_manager_create
                                   dm_bufio_client_create
                                    register_shrinker
                                     register_shrinker_prepared
                                      down_write(&shrinker_rwsem)  - LOCK A
                               bio_endio
      wait_on_buffer
       __wait_on_buffer

Fix these by resetting dm_bufio_client without holding shrinker_rwsem.

Fixes: 8111964f1b85 ("dm thin: Fix ABBA deadlock between shrink_slab and dm_pool_abort_metadata")
Cc: stable@vger.kernel.org
Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-06-16 18:24:13 -04:00
Mikulas Patocka
2a32897c84 dm crypt: fix crypt_ctr_cipher_new return value on invalid AEAD cipher
If the user specifies invalid AEAD cipher, dm-crypt should return the
error returned from crypt_ctr_auth_spec, not -ENOMEM.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-06-16 18:24:13 -04:00
Mike Snitzer
ef6953fb68 dm thin: update .io_hints methods to not require handling discards last
Removes assumptions about what might follow the discard setup code
(previously the code would return early if discards not enabled).

Makes it possible to add more capabilites to the end of each .io_hints
method (which is the natural thing to do when adding new features).

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-06-16 18:24:13 -04:00
Mike Snitzer
c0a7a0ac07 dm thin: remove return code variable in pool_map
Always returns DM_MAPIO_REMAPPED so no need for variable.

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-06-16 18:24:13 -04:00
Mikulas Patocka
4c2c845bdc dm flakey: introduce random_read_corrupt and random_write_corrupt options
The random_read_corrupt and random_write_corrupt options corrupt a
random byte in a bio with the provided probability. The corruption
only happens in the "down" interval.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-06-16 18:24:13 -04:00
Mikulas Patocka
1d9a943898 dm flakey: clone pages on write bio before corrupting them
dm-flakey has an option to corrupt write bios. It corrupts the memory that
is being written. This can cause system crashes or security bugs - for
example, if the user writes a shared library code with O_DIRECT flag to a
dm-flakey device, it corrupts the memory for all users that have the
shared library mapped.

Fix this bug by cloning the bio and corrupting the clone rather than
the original.

Also drop the test for ZERO_PAGE(0) - it can't happen because we write
the cloned pages.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-06-16 18:24:13 -04:00
Mikulas Patocka
5054e778fc dm crypt: allocate compound pages if possible
It was reported that allocating pages for the write buffer in dm-crypt
causes measurable overhead [1].

Change dm-crypt to allocate compound pages if they are available. If
not, fall back to the mempool.

[1] https://listman.redhat.com/archives/dm-devel/2023-February/053284.html

Suggested-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-06-16 18:24:13 -04:00
Linus Torvalds
0e306952d7 - Fix DM thinp discard performance regression introduced during 6.4
merge; where DM core was splitting large discards every 128K
   (max_sectors_kb) rather than every 64M (discard_max_bytes).
 
 - Extend DM core LOCKFS fix, made during 6.4 merge, to also fix race
   between do_mount and dm's do_suspend (in addition to the earlier
   fix's do_mount race with dm's do_resume).
 
 - Fix DM thin metadata operations to first check if the thin-pool is
   in "fail_io" mode; otherwise UAF can occur.
 
 - Fix DM thinp's call to __blkdev_issue_discard to use GFP_NOIO rather
   than GFP_NOWAIT (__blkdev_issue_discard cannot handle NULL return
   from bio_alloc).
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEJfWUX4UqZ4x1O2wixSPxCi2dA1oFAmSLTh0ACgkQxSPxCi2d
 A1rmqggAildPKBjT8nqZmU86lpsy60E03OwvBnGPkMF5pjkOUmTjkb5EWVSAmeuO
 ojj0pWlC+1ZvVkiDfkWxt0NL/4ETD4q+5oy1ARBcOawPX6bj0eXLoBr6m10b+KOb
 mKAoXgYrESEzQ2qPBe4a4Lj3zIBXzXpMpW9TtF23z4HnDpnwpED5xNPWBgiWc3O/
 /6MF1ASLp0DWldoL+gmIp9hEzyQzbzgM4uBOGC4UAYk3U1I55qwX6bWDZ9cQNGMh
 AqCSrphuKHvbsb31yb1X3hB3g1XbAeSvvcizgFY0g9ZpncddKm5gx0BWVDO7qGBG
 UxLIec19kQ2CIEx/QJZIhEjneLlJ/g==
 =mME4
 -----END PGP SIGNATURE-----

Merge tag 'for-6.4/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper fixes from Mike Snitzer:

 - Fix DM thinp discard performance regression introduced during this
   merge window where DM core was splitting large discards every 128K
   (max_sectors_kb) rather than every 64M (discard_max_bytes).

 - Extend DM core LOCKFS fix, made during 6.4 merge, to also fix race
   between do_mount and dm's do_suspend (in addition to the earlier
   fix's do_mount race with dm's do_resume).

 - Fix DM thin metadata operations to first check if the thin-pool is in
   "fail_io" mode; otherwise UAF can occur.

 - Fix DM thinp's call to __blkdev_issue_discard to use GFP_NOIO rather
   than GFP_NOWAIT (__blkdev_issue_discard cannot handle NULL return
   from bio_alloc).

* tag 'for-6.4/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm: use op specific max_sectors when splitting abnormal io
  dm thin: fix issue_discard to pass GFP_NOIO to __blkdev_issue_discard
  dm thin metadata: check fail_io before using data_sm
  dm: don't lock fs when the map is NULL during suspend or resume
2023-06-15 20:19:21 -07:00
Mike Snitzer
be04c14a1b dm: use op specific max_sectors when splitting abnormal io
Split abnormal IO in terms of the corresponding operation specific
max_sectors (max_discard_sectors, max_secure_erase_sectors or
max_write_zeroes_sectors).

This fixes a significant dm-thinp discard performance regression that
was introduced with commit e2dd8aca2d76 ("dm bio prison v1: improve
concurrent IO performance"). Relative to discard: max_discard_sectors
is used instead of max_sectors; which fixes excessive discard splitting
(e.g. max_sectors=128K vs max_discard_sectors=64M).

Tested by discarding an 1 Petabyte dm-thin device:
lvcreate -V 1125899906842624B -T test/pool -n thin
time blkdiscard /dev/test/thin

Before this fix (splitting discards every 128K): ~116m
 After this fix (splitting discards every 64M) : 0m33.460s

Reported-by: Zorro Lang <zlang@redhat.com>
Fixes: 06961c487a33 ("dm: split discards further if target sets max_discard_granularity")
Requires: 13f6facf3fae ("dm: allow targets to require splitting WRITE_ZEROES and SECURE_ERASE")
Fixes: e2dd8aca2d76 ("dm bio prison v1: improve concurrent IO performance")
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-06-15 12:47:16 -04:00
Mike Snitzer
722d908223 dm thin: fix issue_discard to pass GFP_NOIO to __blkdev_issue_discard
issue_discard() passes GFP_NOWAIT to __blkdev_issue_discard() despite
its code assuming bio_alloc() always succeeds.

Commit 3dba53a958a75 ("dm thin: use __blkdev_issue_discard for async
discard support") clearly shows where things went bad:

Before commit 3dba53a958a75, dm-thin.c's open-coded
__blkdev_issue_discard_async() properly handled using GFP_NOWAIT.
Unfortunately __blkdev_issue_discard() doesn't and it was missed
during review.

Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-06-15 12:46:59 -04:00
Li Lingfeng
cb65b282c9 dm thin metadata: check fail_io before using data_sm
Must check pmd->fail_io before using pmd->data_sm since
pmd->data_sm may be destroyed by other processes.

       P1(kworker)                             P2(message)
do_worker
 process_prepared
  process_prepared_discard_passdown_pt2
   dm_pool_dec_data_range
                                    pool_message
                                     commit
                                      dm_pool_commit_metadata
                                        ↓
                                       // commit failed
                                      metadata_operation_failed
                                       abort_transaction
                                        dm_pool_abort_metadata
                                         __open_or_format_metadata
                                           ↓
                                          dm_sm_disk_open
                                            ↓
                                           // open failed
                                           // pmd->data_sm is NULL
    dm_sm_dec_blocks
      ↓
     // try to access pmd->data_sm --> UAF

As shown above, if dm_pool_commit_metadata() and
dm_pool_abort_metadata() fail in pool_message process, kworker may
trigger UAF.

Fixes: be500ed721a6 ("dm space maps: improve performance with inc/dec on ranges of blocks")
Cc: stable@vger.kernel.org
Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-06-15 12:46:59 -04:00
Li Lingfeng
2760904d89 dm: don't lock fs when the map is NULL during suspend or resume
As described in commit 38d11da522aa ("dm: don't lock fs when the map is
NULL in process of resume"), a deadlock may be triggered between
do_resume() and do_mount().

This commit preserves the fix from commit 38d11da522aa but moves it to
where it also serves to fix a similar deadlock between do_suspend()
and do_mount().  It does so, if the active map is NULL, by clearing
DM_SUSPEND_LOCKFS_FLAG in dm_suspend() which is called by both
do_suspend() and do_resume().

Fixes: 38d11da522aa ("dm: don't lock fs when the map is NULL in process of resume")
Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-06-15 12:46:53 -04:00
Mingzhe Zou
f0854489fc bcache: fixup btree_cache_wait list damage
We get a kernel crash about "list_add corruption. next->prev should be
prev (ffff9c801bc01210), but was ffff9c77b688237c.
(next=ffffae586d8afe68)."

crash> struct list_head 0xffff9c801bc01210
struct list_head {
  next = 0xffffae586d8afe68,
  prev = 0xffffae586d8afe68
}
crash> struct list_head 0xffff9c77b688237c
struct list_head {
  next = 0x0,
  prev = 0x0
}
crash> struct list_head 0xffffae586d8afe68
struct list_head struct: invalid kernel virtual address: ffffae586d8afe68  type: "gdb_readmem_callback"
Cannot access memory at address 0xffffae586d8afe68

[230469.019492] Call Trace:
[230469.032041]  prepare_to_wait+0x8a/0xb0
[230469.044363]  ? bch_btree_keys_free+0x6c/0xc0 [escache]
[230469.056533]  mca_cannibalize_lock+0x72/0x90 [escache]
[230469.068788]  mca_alloc+0x2ae/0x450 [escache]
[230469.080790]  bch_btree_node_get+0x136/0x2d0 [escache]
[230469.092681]  bch_btree_check_thread+0x1e1/0x260 [escache]
[230469.104382]  ? finish_wait+0x80/0x80
[230469.115884]  ? bch_btree_check_recurse+0x1a0/0x1a0 [escache]
[230469.127259]  kthread+0x112/0x130
[230469.138448]  ? kthread_flush_work_fn+0x10/0x10
[230469.149477]  ret_from_fork+0x35/0x40

bch_btree_check_thread() and bch_dirty_init_thread() may call
mca_cannibalize() to cannibalize other cached btree nodes. Only one thread
can do it at a time, so the op of other threads will be added to the
btree_cache_wait list.

We must call finish_wait() to remove op from btree_cache_wait before free
it's memory address. Otherwise, the list will be damaged. Also should call
bch_cannibalize_unlock() to release the btree_cache_alloc_lock and wake_up
other waiters.

Fixes: 8e7102273f59 ("bcache: make bch_btree_check() to be multithreaded")
Fixes: b144e45fc576 ("bcache: make bch_sectors_dirty_init() to be multithreaded")
Cc: stable@vger.kernel.org
Signed-off-by: Mingzhe Zou <mingzhe.zou@easystack.cn>
Signed-off-by: Coly Li <colyli@suse.de>
Link: https://lore.kernel.org/r/20230615121223.22502-7-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-15 07:32:55 -06:00
Zheng Wang
80fca8a10b bcache: Fix __bch_btree_node_alloc to make the failure behavior consistent
In some specific situations, the return value of __bch_btree_node_alloc
may be NULL. This may lead to a potential NULL pointer dereference in
caller function like a calling chain :
btree_split->bch_btree_node_alloc->__bch_btree_node_alloc.

Fix it by initializing the return value in __bch_btree_node_alloc.

Fixes: cafe56359144 ("bcache: A block layer cache")
Cc: stable@vger.kernel.org
Signed-off-by: Zheng Wang <zyytlz.wz@163.com>
Signed-off-by: Coly Li <colyli@suse.de>
Link: https://lore.kernel.org/r/20230615121223.22502-6-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-15 07:32:00 -06:00
Zheng Wang
028ddcac47 bcache: Remove unnecessary NULL point check in node allocations
Due to the previous fix of __bch_btree_node_alloc, the return value will
never be a NULL pointer. So IS_ERR is enough to handle the failure
situation. Fix it by replacing IS_ERR_OR_NULL check by an IS_ERR check.

Fixes: cafe56359144 ("bcache: A block layer cache")
Cc: stable@vger.kernel.org
Signed-off-by: Zheng Wang <zyytlz.wz@163.com>
Signed-off-by: Coly Li <colyli@suse.de>
Link: https://lore.kernel.org/r/20230615121223.22502-5-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-15 07:30:43 -06:00
Andrea Tomassetti
ccb8c3bd6d bcache: Remove dead references to cache_readaheads
The cache_readaheads stat counter is not used anymore and should be
removed.

Signed-off-by: Andrea Tomassetti <andrea.tomassetti-opensource@devo.com>
Signed-off-by: Coly Li <colyli@suse.de>
Link: https://lore.kernel.org/r/20230615121223.22502-4-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-15 07:30:11 -06:00
Thomas Weißschuh
b98dd0b0a5 bcache: make kobj_type structures constant
Since commit ee6d3dd4ed48 ("driver core: make kobj_type constant.")
the driver core allows the usage of const struct kobj_type.

Take advantage of this to constify the structure definitions to prevent
modification at runtime.

Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Signed-off-by: Coly Li <colyli@suse.de>
Link: https://lore.kernel.org/r/20230615121223.22502-3-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-15 07:30:11 -06:00
ye xingchen
a301b2deb6 bcache: Convert to use sysfs_emit()/sysfs_emit_at() APIs
Follow the advice of the Documentation/filesystems/sysfs.rst and show()
should only use sysfs_emit() or sysfs_emit_at() when formatting the
value to be returned to user space.

Signed-off-by: ye xingchen <ye.xingchen@zte.com.cn>
Signed-off-by: Coly Li <colyli@suse.de>
Link: https://lore.kernel.org/r/20230615121223.22502-2-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-15 07:30:11 -06:00
Yu Kuai
460af1f9d9 md/raid1-10: limit the number of plugged bio
bio can be added to plug infinitely, and following writeback test can
trigger huge amount of plugged bio:

Test script:
modprobe brd rd_nr=4 rd_size=10485760
mdadm -CR /dev/md0 -l10 -n4 /dev/ram[0123] --assume-clean --bitmap=internal
echo 0 > /proc/sys/vm/dirty_background_ratio
fio -filename=/dev/md0 -ioengine=libaio -rw=write -bs=4k -numjobs=1 -iodepth=128 -name=test

Test result:
Monitor /sys/block/md0/inflight will found that inflight keep increasing
until fio finish writing, after running for about 2 minutes:

[root@fedora ~]# cat /sys/block/md0/inflight
       0  4474191

Fix the problem by limiting the number of plugged bio based on the number
of copies for original bio.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230529131106.2123367-8-yukuai1@huaweicloud.com
2023-06-13 15:25:44 -07:00
Yu Kuai
9efcc2c3df md/raid1-10: don't handle pluged bio by daemon thread
current->bio_list will be set under submit_bio() context, in this case
bitmap io will be added to the list and wait for current io submission to
finish, while current io submission must wait for bitmap io to be done.
commit 874807a83139 ("md/raid1{,0}: fix deadlock in bitmap_unplug.") fix
the deadlock by handling plugged bio by daemon thread.

On the one hand, the deadlock won't exist after commit a214b949d8e3
("blk-mq: only flush requests from the plug in blk_mq_submit_bio"). On
the other hand, current solution makes it impossible to flush plugged bio
in raid1/10_make_request(), because this will cause that all the writes
will goto daemon thread.

In order to limit the number of plugged bio, commit 874807a83139
("md/raid1{,0}: fix deadlock in bitmap_unplug.") is reverted, and the
deadlock is fixed by handling bitmap io asynchronously.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230529131106.2123367-7-yukuai1@huaweicloud.com
2023-06-13 15:25:44 -07:00
Yu Kuai
a022325ab9 md/md-bitmap: add a new helper to unplug bitmap asynchrously
If bitmap is enabled, bitmap must update before submitting write io, this
is why unplug callback must move these io to 'conf->pending_io_list' if
'current->bio_list' is not empty, which will suffer performance
degradation.

A new helper md_bitmap_unplug_async() is introduced to submit bitmap io
in a kworker, so that submit bitmap io in raid10_unplug() doesn't require
that 'current->bio_list' is empty.

This patch prepare to limit the number of plugged bio.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230529131106.2123367-6-yukuai1@huaweicloud.com
2023-06-13 15:25:44 -07:00
Yu Kuai
7db922bae3 md/raid1-10: submit write io directly if bitmap is not enabled
Commit 6cce3b23f6f8 ("[PATCH] md: write intent bitmap support for raid10")
add bitmap support, and it changed that write io is submitted through
daemon thread because bitmap need to be updated before write io. And
later, plug is used to fix performance regression because all the write io
will go to demon thread, which means io can't be issued concurrently.

However, if bitmap is not enabled, the write io should not go to daemon
thread in the first place, and plug is not needed as well.

Fixes: 6cce3b23f6f8 ("[PATCH] md: write intent bitmap support for raid10")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230529131106.2123367-5-yukuai1@huaweicloud.com
2023-06-13 15:25:44 -07:00
Yu Kuai
8295efbe68 md/raid1-10: factor out a helper to submit normal write
There are multiple places to do the same thing, factor out a helper to
prevent redundant code, and the helper will be used in following patch
as well.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230529131106.2123367-4-yukuai1@huaweicloud.com
2023-06-13 15:25:43 -07:00
Yu Kuai
5ec6ca140a md/raid1-10: factor out a helper to add bio to plug
The code in raid1 and raid10 is identical, prepare to limit the number
of plugged bios.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230529131106.2123367-3-yukuai1@huaweicloud.com
2023-06-13 15:25:43 -07:00
Yu Kuai
010444623e md/raid10: prevent soft lockup while flush writes
Currently, there is no limit for raid1/raid10 plugged bio. While flushing
writes, raid1 has cond_resched() while raid10 doesn't, and too many
writes can cause soft lockup.

Follow up soft lockup can be triggered easily with writeback test for
raid10 with ramdisks:

watchdog: BUG: soft lockup - CPU#10 stuck for 27s! [md0_raid10:1293]
Call Trace:
 <TASK>
 call_rcu+0x16/0x20
 put_object+0x41/0x80
 __delete_object+0x50/0x90
 delete_object_full+0x2b/0x40
 kmemleak_free+0x46/0xa0
 slab_free_freelist_hook.constprop.0+0xed/0x1a0
 kmem_cache_free+0xfd/0x300
 mempool_free_slab+0x1f/0x30
 mempool_free+0x3a/0x100
 bio_free+0x59/0x80
 bio_put+0xcf/0x2c0
 free_r10bio+0xbf/0xf0
 raid_end_bio_io+0x78/0xb0
 one_write_done+0x8a/0xa0
 raid10_end_write_request+0x1b4/0x430
 bio_endio+0x175/0x320
 brd_submit_bio+0x3b9/0x9b7 [brd]
 __submit_bio+0x69/0xe0
 submit_bio_noacct_nocheck+0x1e6/0x5a0
 submit_bio_noacct+0x38c/0x7e0
 flush_pending_writes+0xf0/0x240
 raid10d+0xac/0x1ed0

Fix the problem by adding cond_resched() to raid10 like what raid1 did.

Note that unlimited plugged bio still need to be optimized, for example,
in the case of lots of dirty pages writeback, this will take lots of
memory and io will spend a long time in plug, hence io latency is bad.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230529131106.2123367-2-yukuai1@huaweicloud.com
2023-06-13 15:25:43 -07:00
Li Nan
2ae6aaf769 md/raid10: fix io loss while replacement replace rdev
When removing a disk with replacement, the replacement will be used to
replace rdev. During this process, there is a brief window in which both
rdev and replacement are read as NULL in raid10_write_request(). This
will result in io not being submitted but it should be.

  //remove				//write
  raid10_remove_disk			raid10_write_request
   mirror->rdev = NULL
					 read rdev -> NULL
   mirror->rdev = mirror->replacement
   mirror->replacement = NULL
					 read replacement -> NULL

Fix it by reading replacement first and rdev later, meanwhile, use smp_mb()
to prevent memory reordering.

Fixes: 475b0321a4df ("md/raid10: writes should get directed to replacement as well as original.")
Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230602091839.743798-3-linan666@huaweicloud.com
2023-06-13 15:25:43 -07:00
Li Nan
8d355a46c1 md/raid10: Do not add spare disk when recovery fails
In raid10_sync_request(), if data cannot be read from any disk for
recovery, it will go to 'giveup' and let 'chunks_skipped' + 1. After
multiple 'giveup', when 'chunks_skipped >= geo.raid_disks', it will
return 'max_sector', indicating that the recovery has been completed.
However, the recovery is just aborted and the data remains inconsistent.

Fix it by setting mirror->recovery_disabled, which will prevent the spare
disk from being added to this mirror. The same issue also exists during
resync, it will be fixed afterwards.

Signed-off-by: Li Nan <linan122@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230602091839.743798-2-linan666@huaweicloud.com
2023-06-13 15:25:43 -07:00
Li Nan
4d8a5754a6 md/raid10: clean up md_add_new_disk()
Commit 1a855a060665 ("md: fix bug with re-adding of partially recovered
device.") only add device which is set to In_sync. But it let devices
without metadata cannot be added when they should be.

Commit bf572541ab44 ("md: fix regression with re-adding devices to arrays
with no metadata") fix the above issue, it set device without metadata to
In_sync when add new disk.

However, after commit f466722ca614 ("md: Change handling of save_raid_disk
and metadata update during recovery.") deletes changes of the first patch,
setting In_sync for devcie without metadata is meanless because the flag
will be cleared soon and will not be used during this period. Clean it up.

Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230527101851.3266500-2-linan666@huaweicloud.com
2023-06-13 15:25:42 -07:00
Li Nan
6090368abc md/raid10: prioritize adding disk to 'removed' mirror
When add a new disk to raid10, it will traverse conf->mirror from start
and find one of the following mirror to add:
  1. mirror->rdev is set to WantReplacement and it have no replacement,
     set new disk to mirror->replacement.
  2. no mirror->rdev, set new disk to mirror->rdev.

There is a array as below (sda is set to WantReplacement):

    Number   Major   Minor   RaidDevice State
       0       8        0        0      active sync set-A   /dev/sda
       -       0        0        1      removed
       2       8       32        2      active sync set-A   /dev/sdc
       3       8       48        3      active sync set-B   /dev/sdd

Use 'mdadm --add' to add a new disk to this array, the new disk will
become sda's replacement instead of add to removed position, which is
confusing for users. Meanwhile, after new disk recovery success, sda
will be set to Faulty.

Prioritize adding disk to 'removed' mirror is a better choice. In the
above scenario, the behavior is the same as before, except sda will not
be deleted. Before other disks are added, continued use sda is more
reliable.

Signed-off-by: Li Nan <linan122@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230527092007.3008856-1-linan666@huaweicloud.com
2023-06-13 15:25:42 -07:00
Li Nan
59f8f0b54c md/raid10: improve code of mrdev in raid10_sync_request
'need_recover' and 'mrdev' are equivalent in raid10_sync_request(), and
inc mrdev->nr_pending is unreasonable if don't need recovery. Replace
'need_recover' with 'mrdev', and only inc nr_pending when needed.

Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230527072218.2365857-3-linan666@huaweicloud.com
2023-06-13 15:25:42 -07:00
Li Nan
34817a2441 md/raid10: fix null-ptr-deref of mreplace in raid10_sync_request
There are two check of 'mreplace' in raid10_sync_request(). In the first
check, 'need_replace' will be set and 'mreplace' will be used later if
no-Faulty 'mreplace' exists, In the second check, 'mreplace' will be
set to NULL if it is Faulty, but 'need_replace' will not be changed
accordingly. null-ptr-deref occurs if Faulty is set between two check.

Fix it by merging two checks into one. And replace 'need_replace' with
'mreplace' because their values are always the same.

Fixes: ee37d7314a32 ("md/raid10: Fix raid10 replace hang when new added disk faulty")
Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230527072218.2365857-2-linan666@huaweicloud.com
2023-06-13 15:25:42 -07:00
Yu Kuai
75aa7a1b8f md/raid5: don't start reshape when recovery or replace is in progress
When recovery is interrupted (reboot, etc.) check for MD_RECOVERY_RUNNING
is not enough to tell recovery is in progress. Also check recovery_cp
before starting reshape.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230529133410.2125914-1-yukuai1@huaweicloud.com
2023-06-13 15:25:41 -07:00
Yu Kuai
4469315439 md: protect md_thread with rcu
Currently, there are many places that md_thread can be accessed without
protection, following are known scenarios that can cause
null-ptr-dereference or uaf:

1) sync_thread that is allocated and started from md_start_sync()
2) mddev->thread can be accessed directly from timeout_store() and
   md_bitmap_daemon_work()
3) md_unregister_thread() from action_store().

Currently, a global spinlock 'pers_lock' is borrowed to protect
'mddev->thread' in some places, this problem can be fixed likewise,
however, use a global lock for all the cases is not good.

Fix this problem by protecting all md_thread with rcu.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230523021017.3048783-6-yukuai1@huaweicloud.com
2023-06-13 15:25:39 -07:00
Yu Kuai
4eeb6535cd md/bitmap: factor out a helper to set timeout
Register/unregister 'mddev->thread' are both under 'reconfig_mutex',
however, some context didn't hold the mutex to access mddev->thread,
which can cause null-ptr-deference:

1) md_bitmap_daemon_work() can be called from md_check_recovery() where
'reconfig_mutex' is not held, deference 'mddev->thread' might cause
null-ptr-deference, because md_unregister_thread() reset the pointer
before stopping the thread.

2) timeout_store() access 'mddev->thread' multiple times,
null-ptr-deference can be triggered if 'mddev->thread' is reset in the
middle.

This patch factor out a helper to set timeout, the new helper always
check if 'mddev->thread' is null first, so that problem 1 can be fixed.

Now that this helper only access 'mddev->thread' once, but it's possible
that 'mddev->thread' can be freed while this helper is still in progress,
hence the problem is not fixed yet. Follow up patches will fix this by
protecting md_thread with rcu.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230523021017.3048783-5-yukuai1@huaweicloud.com
2023-06-13 15:25:13 -07:00
Yu Kuai
c333673a78 md/bitmap: always wake up md_thread in timeout_store
md_wakeup_thread() can handle the case that pass in md_thread is NULL,
the only difference is that md_wakeup_thread() will be called when
current timeout is 'MAX_SCHEDULE_TIMEOUT', this should not matter
because timeout_store() is not hot path, and the daemon process is
woke up more than demand from other context already.

Prepare to factor out a helper to set timeout.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230523021017.3048783-4-yukuai1@huaweicloud.com
2023-06-13 15:25:13 -07:00