1198803 Commits

Author SHA1 Message Date
Li Zhijian
6548fce058 drivers/rnbd: restore sysfs interface to rnbd-client
Commit 137380c0ec40 renamed 'rnbd-client' to 'rnbd_client', this changed
sysfs interface to /sys/devices/virtual/rnbd_client/ctl/map_device
from /sys/devices/virtual/rnbd-client/ctl/map_device.

CC: Ivan Orlov <ivan.orlov0322@gmail.com>
CC: "Md. Haris Iqbal" <haris.iqbal@ionos.com>
CC: Jack Wang <jinpu.wang@ionos.com>
Fixes: 137380c0ec40 ("block/rnbd: make all 'class' structures const")
Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
Acked-by: Jack Wang <jinpu.wang@ionos.com>
Link: https://lore.kernel.org/r/20230816022210.2501228-1-lizhijian@fujitsu.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-18 15:00:39 -06:00
Ming Lei
a7a7dabb5d nvme: core: don't hold rcu read lock in nvme_ns_chr_uring_cmd_iopoll
Now nvme_ns_chr_uring_cmd_iopoll() has switched to request based io
polling, and the associated NS is guaranteed to be live in case of
io polling, so request is guaranteed to be valid because blk-mq uses
pre-allocated request pool.

Remove the rcu read lock in nvme_ns_chr_uring_cmd_iopoll(), which
isn't needed any more after switching to request based io polling.

Fix "BUG: sleeping function called from invalid context" because
set_page_dirty_lock() from blk_rq_unmap_user() may sleep.

Fixes: 585079b6e425 ("nvme: wire up async polling for io passthrough commands")
Reported-by: Guangwu Zhang <guazhang@redhat.com>
Cc: Kanchan Joshi <joshi.k@samsung.com>
Cc: Anuj Gupta <anuj20.g@samsung.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Guangwu Zhang <guazhang@redhat.com>
Link: https://lore.kernel.org/r/20230809020440.174682-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-11 08:12:32 -06:00
Chengming Zhou
f099a108ca blk-iocost: fix queue stats accounting
The q->stats->accounting is not only used by iocost, but iocost only
increase this counter, never decrease it. So queue stats accounting
will always enabled after using iocost once.

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230804070609.31623-1-chengming.zhou@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-09 16:04:14 -06:00
Jens Axboe
2bc0576925 block: don't make REQ_POLLED imply REQ_NOWAIT
Normally these two flags do go together, as the issuer of polled IO
generally cannot wait for resources that will get freed as part of IO
completion. This is because that very task is the one that will complete
the request and free those resources, hence that would introduce a
deadlock.

But it is possible to have someone else issue the polled IO, eg via
io_uring if the request is punted to io-wq. For that case, it's fine to
have the task block on IO submission, as it is not the same task that
will be completing the IO.

It's completely up to the caller to ask for both polled and nowait IO
separately! If we don't allow polled IO where IOCB_NOWAIT isn't set in
the kiocb, then we can run into repeated -EAGAIN submissions and not
make any progress.

Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-09 16:04:07 -06:00
Jens Axboe
d74f714896 block: get rid of unused plug->nowait flag
This was introduced to add a plug based way of signaling nowait issues,
but we have since moved on from that. Kill the old dead code, nobody is
setting it anymore.

Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-08 15:50:37 -06:00
Christoph Hellwig
95848dcb9d zram: take device and not only bvec offset into account
Commit af8b04c63708 ("zram: simplify bvec iteration in
__zram_make_request") changed the bio iteration in zram to rely on the
implicit capping to page boundaries in bio_for_each_segment.  But it
failed to care for the fact zram not only care about the page alignment
of the bio payload, but also the page alignment into the device.  For
buffered I/O and swap those are the same, but for direct I/O or kernel
internal I/O like XFS log buffer writes they can differ.

Fix this by open coding bio_for_each_segment and limiting the bvec len
so that it never crosses over a page alignment boundary in the device
in addition to the payload boundary already taken care of by
bio_iter_iovec.

Cc: stable@vger.kernel.org
Fixes: af8b04c63708 ("zram: simplify bvec iteration in __zram_make_request")
Reported-by: Dusty Mabe <dusty@dustymabe.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Link: https://lore.kernel.org/r/20230805055537.147835-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-05 16:13:15 -06:00
Jens Axboe
a592ab6171 nvme fixes for Linux 6.5
- Fixes for request_queue state (Ming)
  - Another uuid quirk (August)
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE3Fbyvv+648XNRdHTPe3zGtjzRgkFAmTLvkMACgkQPe3zGtjz
 RglPmhAAkj5xv1UO4nga4i+GHgjl07Eohi8C4tYWiRae6K2bzl1sQsKRbtaDKBY3
 RmAZ2kZIJteDso6RTYqzAffSLHLRav3gs4gxoEgETAVlqSqhNGp7xoi8YdajK5As
 h1dl6P1Whs4p0CH118GNtPvYcKo+p9SVpR8bTqSlWUGNu0L/auf/B8/E5SaEScbe
 ovqElpcgvAmBFCo9LiHpEph8bDMeQRsJRIomEFJU98BwM35VMe0H3lbmw9pgBEny
 cZOTXf+AznBJWUwZ+gQtvGPO8r8QVlwgHtT3SVBDMRZjkQLnKXZpgU9btlOMBCNv
 IaLCVTOfIKTihM5wxRByx1OJx4DHVM2oGn/06RR38+TkQUgaIUHqWY9Po4fY21C0
 gHe28iFNJHMKyznjepNnKHDS28ZTMk2taD8EbR2ke8Kba8HWv4hNRIHJ/PvJHwJW
 mWuCwQA0hvUrL2HrnlU1sVJt5+zu3mE9dfp2YbhCE/vVLp5+SYnQkRm6TVtob6D4
 7Qmj/I4QkxuZs7XIgXaA8bHH5NPH4Ga2j3WFxQA3Pvp7ZXTnRbbVUWHrD+N7i1cM
 cNLfSGww5SZAvgzBM17UcA/MxKHENfZM0O9A/4hlZwH544n21aQAefv19yqTLfBE
 e0CE0+htpxgBJL1DmakMl7FJIpl4eQfhwH3C24sBaG5u/w/mMcE=
 =YQ2k
 -----END PGP SIGNATURE-----

Merge tag 'nvme-6.5-2023-08-02' of git://git.infradead.org/nvme into block-6.5

Pull NVMe fixes from Keith:

"nvme fixes for Linux 6.5

 - Fixes for request_queue state (Ming)
 - Another uuid quirk (August)"

* tag 'nvme-6.5-2023-08-02' of git://git.infradead.org/nvme:
  nvme-pci: add NVME_QUIRK_BOGUS_NID for Samsung PM9B1 256G and 512G
  nvme-rdma: fix potential unbalanced freeze & unfreeze
  nvme-tcp: fix potential unbalanced freeze & unfreeze
  nvme: fix possible hang when removing a controller during error recovery
2023-08-03 20:03:42 -06:00
August Wikerfors
688b419c57 nvme-pci: add NVME_QUIRK_BOGUS_NID for Samsung PM9B1 256G and 512G
The Samsung PM9B1 512G SSD found in some Lenovo Yoga 7 14ARB7 laptop units
reports eui as 0001000200030004 when resuming from s2idle, causing the
device to be removed with this error in dmesg:

nvme nvme0: identifiers changed for nsid 1

To fix this, add a quirk to ignore namespace identifiers for this device.

Signed-off-by: August Wikerfors <git@augustwikerfors.se>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-08-01 13:28:29 -07:00
Ming Lei
3e9dce80db ublk: return -EINTR if breaking from waiting for existed users in DEL_DEV
If user interrupts wait_event_interruptible() in ublk_ctrl_del_dev(),
return -EINTR and let user know what happens.

Fixes: 0abe39dec065 ("block: ublk: improve handling device deletion")
Reported-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Link: https://lore.kernel.org/r/20230726144502.566785-4-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-27 07:17:36 -06:00
Ming Lei
0c0cbd4ebc ublk: fail to recover device if queue setup is interrupted
In ublk_ctrl_end_recovery(), if wait_for_completion_interruptible() is
interrupted by signal, queues aren't setup successfully yet, so we
have to fail UBLK_CMD_END_USER_RECOVERY, otherwise kernel oops can be
triggered.

Fixes: c732a852b419 ("ublk_drv: add START_USER_RECOVERY and END_USER_RECOVERY support")
Reported-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Link: https://lore.kernel.org/r/20230726144502.566785-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-27 07:17:36 -06:00
Ming Lei
53e7d08f6d ublk: fail to start device if queue setup is interrupted
In ublk_ctrl_start_dev(), if wait_for_completion_interruptible() is
interrupted by signal, queues aren't setup successfully yet, so we
have to fail UBLK_CMD_START_DEV, otherwise kernel oops can be triggered.

Reported by German when working on qemu-storage-deamon which requires
single thread ublk daemon.

Fixes: 71f28f3136af ("ublk_drv: add io_uring based userspace block driver")
Reported-by: German Maglione <gmaglione@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230726144502.566785-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-27 07:17:36 -06:00
Bart Van Assche
e0933b526f block: Fix a source code comment in include/uapi/linux/blkzoned.h
Fix the symbolic names for zone conditions in the blkzoned.h header
file.

Cc: Hannes Reinecke <hare@suse.de>
Cc: Damien Le Moal <dlemoal@kernel.org>
Fixes: 6a0cb1bc106f ("block: Implement support for zoned block devices")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20230706201422.3987341-1-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-24 20:11:54 -06:00
Stefan Haberland
856d8e3c63 s390/dasd: print copy pair message only for the correct error
The DASD driver has certain types of requests that might be rejected by
the storage server or z/VM because they are not supported. Since the
missing support of the command is not a real issue there is no user
visible kernel error message for this.

For copy pair setups  there is a specific error that IO is not allowed on
secondary devices. This error case is explicitly handled and an error
message is printed.

The code checking for the error did use a bitwise 'and' that is used to
check for specific bits. But in this case the whole sense byte has to
match.

This leads to the problem that the copy pair related error message is
erroneously printed for other error cases that are usually not reported.
This might heavily confuse users and lead to follow on actions that might
disrupt application processing.

Fix by checking the sense byte for the exact value and not single bits.

Cc: stable@vger.kernel.org # 6.1+
Fixes: 1fca631a1185 ("s390/dasd: suppress generic error messages for PPRC secondary devices")
Signed-off-by: Stefan Haberland <sth@linux.ibm.com>
Reviewed-by: Jan Hoeppner <hoeppner@linux.ibm.com>
Link: https://lore.kernel.org/r/20230721193647.3889634-5-sth@linux.ibm.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-24 13:05:29 -06:00
Stefan Haberland
8a2278ce9c s390/dasd: fix hanging device after request requeue
The DASD device driver has a function to requeue requests to the
blocklayer.
This function is used in various cases when basic settings for the device
have to be changed like High Performance Ficon related parameters or copy
pair settings.

The functions iterates over the device->ccw_queue and also removes the
requests from the block->ccw_queue.
In case the device is started on an alias device instead of the base
device it might be removed from the block->ccw_queue without having it
canceled properly before. This might lead to a hanging device since the
request is no longer on a queue and can not be handled properly.

Fix by iterating over the block->ccw_queue instead of the
device->ccw_queue. This will take care of all blocklayer related requests
and handle them on all associated DASD devices.

Signed-off-by: Stefan Haberland <sth@linux.ibm.com>
Reviewed-by: Jan Hoeppner <hoeppner@linux.ibm.com>
Link: https://lore.kernel.org/r/20230721193647.3889634-4-sth@linux.ibm.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-24 13:05:29 -06:00
Stefan Haberland
acea28a6b7 s390/dasd: use correct number of retries for ERP requests
If a DASD request fails an error recovery procedure (ERP) request might
be built as a copy of the original request to do error recovery.

The ERP request gets a number of retries assigned.
This number is always 256 no matter what other value might have been set
for the original request. This is not what is expected when a user
specifies a certain amount of retries for the device via sysfs.

Correctly use the number of retries of the original request for ERP
requests.

Signed-off-by: Stefan Haberland <sth@linux.ibm.com>
Reviewed-by: Jan Hoeppner <hoeppner@linux.ibm.com>
Link: https://lore.kernel.org/r/20230721193647.3889634-3-sth@linux.ibm.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-24 13:05:29 -06:00
Stefan Haberland
05f1d8ed03 s390/dasd: fix hanging device after quiesce/resume
Quiesce and resume are functions that tell the DASD driver to stop/resume
issuing I/Os to a specific DASD.

On resume dasd_schedule_block_bh() is called to kick handling of IO
requests again. This does unfortunately not cover internal requests which
are used for path verification for example.

This could lead to a hanging device when a path event or anything else
that triggers internal requests occurs on a quiesced device.

Fix by also calling dasd_schedule_device_bh() which triggers handling of
internal requests on resume.

Fixes: 8e09f21574ea ("[S390] dasd: add hyper PAV support to DASD device driver, part 1")

Cc: stable@vger.kernel.org
Signed-off-by: Stefan Haberland <sth@linux.ibm.com>
Reviewed-by: Jan Hoeppner <hoeppner@linux.ibm.com>
Link: https://lore.kernel.org/r/20230721193647.3889634-2-sth@linux.ibm.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-24 13:05:29 -06:00
Mauricio Faria de Oliveira
bb5faa99f0 loop: do not enforce max_loop hard limit by (new) default
Problem:

The max_loop parameter is used for 2 different purposes:

1) initial number of loop devices to pre-create on init
2) maximum number of loop devices to add on access/open()

Historically, its default value (zero) caused 1) to create non-zero
number of devices (CONFIG_BLK_DEV_LOOP_MIN_COUNT), and no hard limit on
2) to add devices with autoloading.

However, the default value changed in commit 85c50197716c ("loop: Fix
the max_loop commandline argument treatment when it is set to 0") to
CONFIG_BLK_DEV_LOOP_MIN_COUNT, for max_loop=0 not to pre-create devices.

That does improve 1), but unfortunately it breaks 2), as the default
behavior changed from no-limit to hard-limit.

Example:

For example, this userspace code broke for N >= CONFIG, if the user
relied on the default value 0 for max_loop:

    mknod("/dev/loopN");
    open("/dev/loopN");  // now fails with ENXIO

Though affected users may "fix" it with (loop.)max_loop=0, this means to
require a kernel parameter change on stable kernel update (that commit
Fixes: an old commit in stable).

Solution:

The original semantics for the default value in 2) can be applied if the
parameter is not set (ie, default behavior).

This still keeps the intended function in 1) and 2) if set, and that
commit's intended improvement in 1) if max_loop=0.

Before 85c50197716c:
  - default:     1) CONFIG devices   2) no limit
  - max_loop=0:  1) CONFIG devices   2) no limit
  - max_loop=X:  1) X devices        2) X limit

After 85c50197716c:
  - default:     1) CONFIG devices   2) CONFIG limit (*)
  - max_loop=0:  1) 0 devices (*)    2) no limit
  - max_loop=X:  1) X devices        2) X limit

This commit:
  - default:     1) CONFIG devices   2) no limit (*)
  - max_loop=0:  1) 0 devices        2) no limit
  - max_loop=X:  1) X devices        2) X limit

Future:

The issue/regression from that commit only affects code under the
CONFIG_BLOCK_LEGACY_AUTOLOAD deprecation guard, thus the fix too is
contained under it.

Once that deprecated functionality/code is removed, the purpose 2) of
max_loop (hard limit) is no longer in use, so the module parameter
description can be changed then.

Tests:

Linux 6.4-rc7
CONFIG_BLK_DEV_LOOP_MIN_COUNT=8
CONFIG_BLOCK_LEGACY_AUTOLOAD=y

- default (original)

	# ls -1 /dev/loop*
	/dev/loop-control
	/dev/loop0
	...
	/dev/loop7

	# ./test-loop
	open: /dev/loop8: No such device or address

- default (patched)

	# ls -1 /dev/loop*
	/dev/loop-control
	/dev/loop0
	...
	/dev/loop7

	# ./test-loop
	#

- max_loop=0 (original & patched):

	# ls -1 /dev/loop*
	/dev/loop-control

	# ./test-loop
	#

- max_loop=8 (original & patched):

	# ls -1 /dev/loop*
	/dev/loop-control
	/dev/loop0
	...
	/dev/loop7

	# ./test-loop
	open: /dev/loop8: No such device or address

- max_loop=0 (patched; CONFIG_BLOCK_LEGACY_AUTOLOAD is not set)

	# ls -1 /dev/loop*
	/dev/loop-control

	# ./test-loop
	open: /dev/loop8: No such device or address

Fixes: 85c50197716c ("loop: Fix the max_loop commandline argument treatment when it is set to 0")
Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230720143033.841001-3-mfo@canonical.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-21 13:20:57 -06:00
Mauricio Faria de Oliveira
23881aec85 loop: deprecate autoloading callback loop_probe()
The 'probe' callback in __register_blkdev() is only used under the
CONFIG_BLOCK_LEGACY_AUTOLOAD deprecation guard.

The loop_probe() function is only used for that callback, so guard it
too, accordingly.

See commit fbdee71bb5d8 ("block: deprecate autoloading based on dev_t").

Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230720143033.841001-2-mfo@canonical.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-21 13:20:48 -06:00
David Jeffery
106397376c sbitmap: fix batching wakeup
Current code supposes that it is enough to provide forward progress by
just waking up one wait queue after one completion batch is done.

Unfortunately this way isn't enough, cause waiter can be added to wait
queue just after it is woken up.

Follows one example(64 depth, wake_batch is 8)

1) all 64 tags are active

2) in each wait queue, there is only one single waiter

3) each time one completion batch(8 completions) wakes up just one
   waiter in each wait queue, then immediately one new sleeper is added
   to this wait queue

4) after 64 completions, 8 waiters are wakeup, and there are still 8
   waiters in each wait queue

5) after another 8 active tags are completed, only one waiter can be
   wakeup, and the other 7 can't be waken up anymore.

Turns out it isn't easy to fix this problem, so simply wakeup enough
waiters for single batch.

Cc: Kemeng Shi <shikemeng@huaweicloud.com>
Cc: Chengming Zhou <zhouchengming@bytedance.com>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: David Jeffery <djeffery@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20230721095715.232728-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-21 11:40:20 -06:00
Ming Lei
29b434d1e4 nvme-rdma: fix potential unbalanced freeze & unfreeze
Move start_freeze into nvme_rdma_configure_io_queues(), and there is
at least two benefits:

1) fix unbalanced freeze and unfreeze, since re-connection work may
fail or be broken by removal

2) IO during error recovery can be failfast quickly because nvme fabrics
unquiesces queues after teardown.

One side-effect is that !mpath request may timeout during connecting
because of queue topo change, but that looks not one big deal:

1) same problem exists with current code base

2) compared with !mpath, mpath use case is dominant

Fixes: 9f98772ba307 ("nvme-rdma: fix controller reset hang during traffic")
Cc: stable@vger.kernel.org
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-07-21 00:53:33 -07:00
Ming Lei
99dc264014 nvme-tcp: fix potential unbalanced freeze & unfreeze
Move start_freeze into nvme_tcp_configure_io_queues(), and there is
at least two benefits:

1) fix unbalanced freeze and unfreeze, since re-connection work may
fail or be broken by removal

2) IO during error recovery can be failfast quickly because nvme fabrics
unquiesces queues after teardown.

One side-effect is that !mpath request may timeout during connecting
because of queue topo change, but that looks not one big deal:

1) same problem exists with current code base

2) compared with !mpath, mpath use case is dominant

Fixes: 2875b0aecabe ("nvme-tcp: fix controller reset hang during traffic")
Cc: stable@vger.kernel.org
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-07-21 00:53:29 -07:00
Ming Lei
1b95e81791 nvme: fix possible hang when removing a controller during error recovery
Error recovery can be interrupted by controller removal, then the
controller is left as quiesced, and IO hang can be caused.

Fix the issue by unquiescing controller unconditionally when removing
namespaces.

This way is reasonable and safe given forward progress can be made
when removing namespaces.

Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reported-by: Chunguang Xu <brookxu.cn@gmail.com>
Closes: https://lore.kernel.org/linux-nvme/cover.1685350577.git.chunguang.xu@shopee.com/
Cc: stable@vger.kernel.org
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-07-21 00:53:26 -07:00
Chengming Zhou
013adcbef1 blk-iocost: skip empty flush bio in iocost
The flush bio may have data, may have no data (empty flush), we couldn't
calculate cost for empty flush bio. So we'd better just skip it for now.

Another side effect is that empty flush bio's bio_end_sector() is 0, cause
iocg->cursor reset to 0, may break the cost calculation of other bios.

This isn't good enough, since flush bio still consume the device bandwidth,
but flush request is special, can be merged randomly in the flush state
machine, we don't know how to calculate cost for it for now.

Its completion time also has flaws, which may include the pre-flush or
post-flush completion time, but I don't know if we need to fix that and
how to fix it.

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230720121441.1408522-1-chengming.zhou@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-20 14:02:17 -06:00
Chengming Zhou
3641c90c4e blk-mq: delete dead struct blk_mq_hw_ctx->queued field
This counter is not used anywhere, so delete it.

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20230720095512.1403123-1-chengming.zhou@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-20 13:29:25 -06:00
Ross Lagerwall
7090426351 blk-mq: Fix stall due to recursive flush plug
We have seen rare IO stalls as follows:

* blk_mq_plug_issue_direct() is entered with an mq_list containing two
requests.
* For the first request, it sets last == false and enters the driver's
queue_rq callback.
* The driver queue_rq callback indirectly calls schedule() which calls
blk_flush_plug(). This may happen if the driver has the
BLK_MQ_F_BLOCKING flag set and is allowed to sleep in ->queue_rq.
* blk_flush_plug() handles the remaining request in the mq_list. mq_list
is now empty.
* The original call to queue_rq resumes (with last == false).
* The loop in blk_mq_plug_issue_direct() terminates because there are no
remaining requests in mq_list.

The IO is now stalled because the last request submitted to the driver
had last == false and there was no subsequent call to commit_rqs().

Fix this by returning early in blk_mq_flush_plug_list() if rq_count is 0
which it will be in the recursive case, rather than checking if the
mq_list is empty. At the same time, adjust one of the callers to skip
the mq_list empty check as it is not necessary.

Fixes: dc5fc361d891 ("block: attempt direct issue of plug list")
Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20230714101106.3635611-1-ross.lagerwall@citrix.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-14 13:44:24 -06:00
Christoph Hellwig
9f87fc4d72 block: queue data commands from the flush state machine at the head
We used to insert the data commands following a pre-flush to the head
of the queue until commit 1e82fadfc6b ("blk-mq: do not do head insertions
post-pre-flush commands").  Not doing this seems to cause hangs of
such commands on NFS workloads when exported from file systems with
SATA SSDs.  I have no idea why this would starve these workloads,
but doing a semantic revert of this patch (which looks quite different
due to various other changes) fixes the hangs.

Fixes: 1e82fadfc6b ("blk-mq: do not do head insertions post-pre-flush commands")
Reported-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Chuck Lever <chuck.lever@oracle.com>
Link: https://lore.kernel.org/r/20230714143014.11879-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-14 08:42:58 -06:00
Jens Axboe
90b4622954 nvme fixes for Linux 6.5
- Don't require quirk to use duplicate namespace identifiers
    (Christoph, Sagi)
  - One more BOGUS_NID quirk (Pankaj)
  - IO timeout and error hanlding fixes for PCI (Keith)
  - Enhanced metadata format mask fix (Ankit)
  - Association race condition fix for fibre channel (Michael)
  - Correct debugfs error checks (Minjie)
  - Use PAGE_SECTORS_SHIFT where needed (Damien)
  - Reduce kernel logs for legacy nguid attribute (Keith)
  - Use correct dma direction when unmapping metadata (Ming)
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE3Fbyvv+648XNRdHTPe3zGtjzRgkFAmSwW9cACgkQPe3zGtjz
 RglgZBAAiW5lUqiYwSUZNPN5EzV5eKhy7qrQ34lH7t8lobECp5CEueuIKn1Fs7KW
 kcaJTJJpBqHAd/UwbYLVRTJnBVWJ9a5VgHtoNlsNT8IAqz/k9cR+kirGqPRdhxXf
 TgKgP2g1kLS3DYoh7sOBjJiqAurUcV44XKRg7QypiRKaE9uA19zfVic9cI+o9wmE
 d0cZsl04WcTIce1OQp0g0ZDmng3DHh9EYG3uYrpMMiTO+7QQ5eZzPuxHPCewllxE
 ovUyZsSgYSVaxSaRE5Zj9P19z3Nt0xrjKwtoJGic6gY4oCZK66HF+U74jbeJL2Ux
 Wnl0XHZNoEFvCcPMwLeylJzHndmCUn9BVVO0LGBIDjW62JnhkvM63FWhuYdDXxuV
 xOAIhA4DahDbJrQXjEmsPMhrKy+y4vb25vqoxL/UF9z72IKUqDilq3bvSohhxzVK
 OOoHPB8uMaoCFqfu4C8wLZZeC/9Xb1OxZAypma+vNHa7s4GqSN1oYLR6cQcMb0Ct
 W78zx99/rZrlg0joo2iv5JBjWeusVQivb+bNbqlNJ8FRaF16u88ASkvQGo3MuX9M
 QzM9DWeWJOVAxmniCq4jUlB/nG+qzVshOfrfbbexVOHEHeiRjIe4MiEhSwjLgAg9
 d6Q+Btfe0lS0A5+naKPrFWcSgpw9WO9C9TJcO7rHkf9d5alPwUc=
 =focK
 -----END PGP SIGNATURE-----

Merge tag 'nvme-6.5-2023-07-13' of git://git.infradead.org/nvme into block-6.5

Pull NVMe fixes from Keith:

"nvme fixes for Linux 6.5

 - Don't require quirk to use duplicate namespace identifiers
   (Christoph, Sagi)
 - One more BOGUS_NID quirk (Pankaj)
 - IO timeout and error hanlding fixes for PCI (Keith)
 - Enhanced metadata format mask fix (Ankit)
 - Association race condition fix for fibre channel (Michael)
 - Correct debugfs error checks (Minjie)
 - Use PAGE_SECTORS_SHIFT where needed (Damien)
 - Reduce kernel logs for legacy nguid attribute (Keith)
 - Use correct dma direction when unmapping metadata (Ming)"

* tag 'nvme-6.5-2023-07-13' of git://git.infradead.org/nvme:
  nvme-pci: fix DMA direction of unmapping integrity data
  nvme: don't reject probe due to duplicate IDs for single-ported PCIe devices
  nvme: ensure disabling pairs with unquiesce
  nvme-fc: fix race between error recovery and creating association
  nvme-fc: return non-zero status code when fails to create association
  nvme: fix parameter check in nvme_fault_inject_init()
  nvme: warn only once for legacy uuid attribute
  nvme: fix the NVME_ID_NS_NVM_STS_MASK definition
  nvmet: use PAGE_SECTORS_SHIFT
  nvme: add BOGUS_NID quirk for Samsung SM953
2023-07-13 14:26:38 -06:00
Chengming Zhou
5c17f45e91 blk-mq: fix start_time_ns and alloc_time_ns for pre-allocated rq
The iocost rely on rq start_time_ns and alloc_time_ns to tell saturation
state of the block device. Most of the time request is allocated after
rq_qos_throttle() and its alloc_time_ns or start_time_ns won't be affected.

But for plug batched allocation introduced by the commit 47c122e35d7e
("block: pre-allocate requests if plug is started and is a batch"), we can
rq_qos_throttle() after the allocation of the request. This is what the
blk_mq_get_cached_request() does.

In this case, the cached request alloc_time_ns or start_time_ns is much
ahead if blocked in any qos ->throttle().

Fix it by setting alloc_time_ns and start_time_ns to now when the allocated
request is actually used.

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230710105516.2053478-1-chengming.zhou@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-13 12:30:57 -06:00
Ming Lei
b8f6446b68 nvme-pci: fix DMA direction of unmapping integrity data
DMA direction should be taken in dma_unmap_page() for unmapping integrity
data.

Fix this DMA direction, and reported in Guangwu's test.

Reported-by: Guangwu Zhang <guazhang@redhat.com>
Fixes: 4aedb705437f ("nvme-pci: split metadata handling from nvme_map_data / nvme_unmap_data")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-07-13 08:16:07 -07:00
Christoph Hellwig
ac522fc6c3 nvme: don't reject probe due to duplicate IDs for single-ported PCIe devices
While duplicate IDs are still very harmful, including the potential to easily
see changing devices in /dev/disk/by-id, it turn out they are extremely
common for cheap end user NVMe devices.

Relax our check for them for so that it doesn't reject the probe on
single-ported PCIe devices, but prints a big warning instead.  In doubt
we'd still like to see quirk entries to disable the potential for
changing supposed stable device identifier links, but this will at least
allow users how have two (or more) of these devices to use them without
having to manually add a new PCI ID entry with the quirk through sysfs or
by patching the kernel.

Fixes: 2079f41ec6ff ("nvme: check that EUI/GUID/UUID are globally unique")
Cc: stable@vger.kernel.org # 6.0+
Co-developed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-07-13 08:12:50 -07:00
Bart Van Assche
f673b4f5bd block/mq-deadline: Fix a bug in deadline_from_pos()
A bug was introduced in deadline_from_pos() while implementing the
suggestion to use round_down() in the following code:

	pos -= bdev_offset_from_zone_start(rq->q->disk->part0, pos);

This patch makes deadline_from_pos() use round_down() such that 'pos' is
rounded down.

Reported-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Closes: https://lore.kernel.org/all/5zthzi3lppvcdp4nemum6qck4gpqbdhvgy4k3qwguhgzxc4quj@amulvgycq67h/
Cc: Christoph Hellwig <hch@lst.de>
Cc: Damien Le Moal <dlemoal@kernel.org>
Fixes: 0effb390c4ba ("block: mq-deadline: Handle requeued requests correctly")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20230712173344.2994513-1-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-12 11:37:41 -06:00
Keith Busch
71a5bb153b nvme: ensure disabling pairs with unquiesce
If any error handling that disables the controller fails to queue the
reset work, like if the state changed to disconnected inbetween, then
the failed teardown needs to unquiesce the queues since it's no longer
paired with reset_work. Just make sure that the controller can be put
into a resetting state prior to starting the disable so that no other
handling can change the queue states while recovery is happening.

Reported-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-07-12 09:32:58 -07:00
Michael Liang
ee6fdc5055 nvme-fc: fix race between error recovery and creating association
There is a small race window between nvme-fc association creation and error
recovery. Fix this race condition by protecting accessing to controller
state and ASSOC_FAILED flag under nvme-fc controller lock.

Signed-off-by: Michael Liang <mliang@purestorage.com>
Reviewed-by: Caleb Sander <csander@purestorage.com>
Reviewed-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-07-12 09:29:51 -07:00
Michael Liang
60e445bdfc nvme-fc: return non-zero status code when fails to create association
Return non-zero status code(-EIO) when needed, so re-connecting or
deleting controller will be triggered properly.

Signed-off-by: Michael Liang <mliang@purestorage.com>
Reviewed-by: Caleb Sander <csander@purestorage.com>
Reviewed-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-07-12 09:29:48 -07:00
Minjie Du
733ba346d9 nvme: fix parameter check in nvme_fault_inject_init()
Make IS_ERR() judge the debugfs_create_dir() function return.

Signed-off-by: Minjie Du <duminjie@vivo.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-07-12 08:48:33 -07:00
Keith Busch
b718ae835b nvme: warn only once for legacy uuid attribute
Report the legacy fallback behavior for uuid attributes just once
instead of logging repeated warnings for the same condition every time
the attribute is read. The old behavior is too spamy on the kernel logs.

Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reported-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-07-12 08:48:19 -07:00
Jens Axboe
dc8cbb65dc block: remove dead struc request->completion_data field
It's no longer used. While in there, also update the comment as to why
it can coexist with the rb_node.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-10 10:23:05 -06:00
Ankit Kumar
b938e66036 nvme: fix the NVME_ID_NS_NVM_STS_MASK definition
As per NVMe command set specification 1.0c Storage tag size is 7 bits.

Fixes: 4020aad85c67 ("nvme: add support for enhanced metadata")
Signed-off-by: Ankit Kumar <ankit.kumar@samsung.com>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-07-10 09:12:31 -07:00
Damien Le Moal
9bcf156f58 nvmet: use PAGE_SECTORS_SHIFT
Replace occurences of the pattern "PAGE_SHIFT - 9" in the passthru and
loop targets with PAGE_SECTORS_SHIFT.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-07-10 08:46:30 -07:00
Pankaj Raghav
e5bb0988a5 nvme: add BOGUS_NID quirk for Samsung SM953
Add the quirk as SM953 is reporting bogus namespace ID.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=217593
Reported-by: Clemens Springsguth <cspringsguth@gmail.com>
Tested-by: Clemens Springsguth <cspringsguth@gmail.com>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-07-10 08:44:36 -07:00
Eric Biggers
2fb48d88e7 blk-crypto: use dynamic lock class for blk_crypto_profile::lock
When a device-mapper device is passing through the inline encryption
support of an underlying device, calls to blk_crypto_evict_key() take
the blk_crypto_profile::lock of the device-mapper device, then take the
blk_crypto_profile::lock of the underlying device (nested).  This isn't
a real deadlock, but it causes a lockdep report because there is only
one lock class for all instances of this lock.

Lockdep subclasses don't really work here because the hierarchy of block
devices is dynamic and could have more than 2 levels.

Instead, register a dynamic lock class for each blk_crypto_profile, and
associate that with the lock.

This avoids false-positive lockdep reports like the following:

    ============================================
    WARNING: possible recursive locking detected
    6.4.0-rc5 #2 Not tainted
    --------------------------------------------
    fscryptctl/1421 is trying to acquire lock:
    ffffff80829ca418 (&profile->lock){++++}-{3:3}, at: __blk_crypto_evict_key+0x44/0x1c0

                   but task is already holding lock:
    ffffff8086b68ca8 (&profile->lock){++++}-{3:3}, at: __blk_crypto_evict_key+0xc8/0x1c0

                   other info that might help us debug this:
     Possible unsafe locking scenario:

           CPU0
           ----
      lock(&profile->lock);
      lock(&profile->lock);

                    *** DEADLOCK ***

     May be due to missing lock nesting notation

Fixes: 1b2628397058 ("block: Keyslot Manager for Inline Encryption")
Reported-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20230610061139.212085-1-ebiggers@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-05 16:36:12 -06:00
Michael Schmitz
7eb1e47696 block/partition: fix signedness issue for Amiga partitions
Making 'blk' sector_t (i.e. 64 bit if LBD support is active) fails the
'blk>0' test in the partition block loop if a value of (signed int) -1 is
used to mark the end of the partition block list.

Explicitly cast 'blk' to signed int to allow use of -1 to terminate the
partition block linked list.

Fixes: b6f3f28f604b ("block: add overflow checks for Amiga partition support")
Reported-by: Christian Zigotzky <chzigotzky@xenosoft.de>
Link: https://lore.kernel.org/r/024ce4fa-cc6d-50a2-9aae-3701d0ebf668@xenosoft.de
Signed-off-by: Michael Schmitz <schmitzmic@gmail.com>
Reviewed-by: Martin Steigerwald <martin@lichtvoll.de>
Tested-by: Christian Zigotzky <chzigotzky@xenosoft.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-05 16:34:56 -06:00
Linus Torvalds
6cd06ab12d gup: make the stack expansion warning a bit more targeted
I added a warning about about GUP no longer expanding the stack in
commit a425ac5365f6 ("gup: add warning if some caller would seem to want
stack expansion"), but didn't really expect anybody to hit it.

And it's true that nobody seems to have hit a _real_ case yet, but we
certainly have a number of reports of false positives.  Which not only
causes extra noise in itself, but might also end up hiding any real
cases if they do exist.

So let's tighten up the warning condition, and replace the simplistic

	vma = find_vma(mm, start);
	if (vma && (start < vma->vm_start)) {
		WARN_ON_ONCE(vma->vm_flags & VM_GROWSDOWN);

with a

	vma = gup_vma_lookup(mm, start);

helper function which works otherwise like just "vma_lookup()", but with
some heuristics for when to warn about gup no longer causing stack
expansion.

In particular, don't just warn for "below the stack", but warn if it's
_just_ below the stack (with "just below" arbitrarily defined as 64kB,
because why not?).  And rate-limit it to at most once per hour, which
means that any false positives shouldn't completely hide subsequent
reports, but we won't be flooding the logs about it either.

The previous code triggered when some GUP user (chromium crashpad)
accessing past the end of the previous vma, for example.  That has never
expanded the stack, it just causes GUP to return early, and as such we
shouldn't be warning about it.

This is still going trigger the randomized testers, but to mitigate the
noise from that, use "dump_stack()" instead of "WARN_ON_ONCE()" to get
the kernel call chain.  We'll get the relevant information, but syzbot
shouldn't get too upset about it.

Also, don't even bother with the GROWSUP case, which would be using
different heuristics entirely, but only happens on parisc.

Reported-by: kernel test robot <oliver.sang@intel.com>
Reported-by: John Hubbard <jhubbard@nvidia.com>
Reported-by: syzbot+6cf44e127903fdf9d929@syzkaller.appspotmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-07-05 09:33:31 -07:00
Linus Torvalds
d528014517 Revert ".gitignore: ignore *.cover and *.mbx"
This reverts commit 534066a983df0935847061c844eb178f8a53a9e7.

It's actively detrimental in that it hides files that shouldn't be
hidden.

If I have some b4 mbx file in my git directory, it either was already
applied with "git am" and is now stale, or maybe it's waiting for that
to happen.  In neither case is "ignore it" the right option.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-07-04 15:05:12 -07:00
Linus Torvalds
04f2933d37 Scope-based Resource Management infrastructure
These are the first few patches in the Scope-based Resource Management
 series that introduce the infrastructure but not any conversions as of
 yet.
 
 Adding the infrastructure now allows multiple people to start using them.
 
 Of note is that Sparse will need some work since it doesn't yet
 understand this attribute and might have decl-after-stmt issues -- but I
 think that's being worked on.
 -----BEGIN PGP SIGNATURE-----
 
 iQJJBAABCgAzFiEEv3OU3/byMaA0LqWJdkfhpEvA5LoFAmSZehYVHHBldGVyekBp
 bmZyYWRlYWQub3JnAAoJEHZH4aRLwOS6uC0P/0pduumvCE+osm869VOroHeVqz6B
 B36oqJlD1uqIA8JWiNW8TbpUUqTvU0FyW1jU1YnUdol5Kb3XH7APQIayGj2HM+Mc
 kMyHc4/ICxBzRI7STiBtCEbHIwYAfuibCAfuq7QP4GDw4vHKvUC7BUUaUdbQSey0
 IyUsvgPqLRL1KW3tb2z7Zfz6Oo/f0A/+viTiqOKNxSctXaD7TOEUpIZ/RhIgWbIV
 as65sgzLD/87HV9LbtTyOQCxjR5lpDlTaITapMfgwC9bQ8vQwjmaX3WETIlUmufl
 8QG3wiPIDynmqjmpEeY+Cyjcu/Gbz/2XENPgHBd6g2yJSFLSyhSGcDBAZ/lFKlnB
 eIJqwIFUvpMbAskNxHooNz1zHzqHhOYcCSUITRk6PG8AS1HOD/aNij4sWHMzJSWU
 R3ZUOEf5k6FkER3eAKxEEsJ2HtIurOVfvFXwN1e0iCKvNiQ1QdP74LddcJqsrTY6
 ihXsfVbtJarM9dakuWgk3fOj79cx3BQEm82002vvNzhRRUB5Fx7v47lyB5mAcYEl
 Al9gsLg4mKkVNiRh/tcelchKPp9WOKcXFEKiqom75VNwZH4oa+KSaa4GdDUQZzNs
 73/Zj6gPzFBP1iLwqicvg5VJIzacKuH59WRRQr1IudTxYgWE67yz1OVicplD0WaR
 l0tF9LFAdBHUy8m/
 =adZ8
 -----END PGP SIGNATURE-----

Merge tag 'core_guards_for_6.5_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue

Pull scope-based resource management infrastructure from Peter Zijlstra:
 "These are the first few patches in the Scope-based Resource Management
  series that introduce the infrastructure but not any conversions as of
  yet.

  Adding the infrastructure now allows multiple people to start using
  them.

  Of note is that Sparse will need some work since it doesn't yet
  understand this attribute and might have decl-after-stmt issues"

* tag 'core_guards_for_6.5_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue:
  kbuild: Drop -Wdeclaration-after-statement
  locking: Introduce __cleanup() based infrastructure
  apparmor: Free up __cleanup() name
  dmaengine: ioat: Free up __cleanup() name
2023-07-04 13:50:38 -07:00
David Howells
03275585ca afs: Fix accidental truncation when storing data
When an AFS FS.StoreData RPC call is made, amongst other things it is
given the resultant file size to be.  On the server, this is processed
by truncating the file to new size and then writing the data.

Now, kafs has a lock (vnode->io_lock) that serves to serialise
operations against a specific vnode (ie.  inode), but the parameters for
the op are set before the lock is taken.  This allows two writebacks
(say sync and kswapd) to race - and if writes are ongoing the writeback
for a later write could occur before the writeback for an earlier one if
the latter gets interrupted.

Note that afs_writepages() cannot take i_mutex and only takes a shared
lock on vnode->validate_lock.

Also note that the server does the truncation and the write inside a
lock, so there's no problem at that end.

Fix this by moving the calculation for the proposed new i_size inside
the vnode->io_lock.  Also reset the iterator (which we might have read
from) and update the mtime setting there.

Fixes: bd80d8a80e12 ("afs: Use ITER_XARRAY for writing")
Reported-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeffrey Altman <jaltman@auristor.com>
Reviewed-by: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/3526895.1687960024@warthog.procyon.org.uk/
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-07-04 12:24:32 -07:00
Linus Torvalds
538140ca60 overlayfs update for 6.5 - part 2
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE9zuTYTs0RXF+Ke33EVvVyTe/1WoFAmSkQkYACgkQEVvVyTe/
 1Wr21A/+LLrN6fmY8wCnZqdk1i3Qlob8CpqeGEL0zV3yMVeE+d16AxI3uvm8GCAZ
 RAkTWUVwiLqnv8Ua28lylvWiM4Rmr1gwxoCCd19NlHrU2Suy599dXeE530aUxpuo
 PG74XR+3RHQ+tg1PJmLUiOtSqOwgBeWHaESfTyVhdNM7ywQGU4PCejxD8WAfOAuu
 rzEfYsX1deQ0vidRvc/Y2XFdW91woML2FoMmyqhS52pinRJ04clIheSe8poczgQ4
 xIg+GMc6XdIYW42XdEKPGE/sCEq3nFLAQZgMJDXUvt6fAHxeVhirslNKYwzBS35w
 lqlUTTcf7oo0yJr2vf8Ut9yNMZPYaxsOr9UR4+23IkJj9NP8jkowwih0nc85R1yr
 8eYLu3IwNlmP6azg4dHtzw5UZzJ6eabG63gNVJCyl7wqC+hkcOvt3IQDv2nklX38
 oDMX9VO3IUNyl5katAjGhPYYmUXTB/YJgnbgQGu77ZVizKQHgonyFglNQb8USFwP
 gHkLE3RWuTd6BaNOcKqDJSr63S6ylLPX7xfvYvrncTnoeGCcumKFhAYx3fHX3kIo
 42ZyEiVq0hhmqfRH5AQgE9MhBzN7c6jmrSyQHM6GaMvmLCDRtKBVQH6Y4iC5VoJ1
 JSXLFAy9aVu0DdcCdTsjgOb1t8FOyC0kFtZsdZLnDQKcZngRXoo=
 =YILi
 -----END PGP SIGNATURE-----

Merge tag 'ovl-update-6.5-2' of git://git.kernel.org/pub/scm/linux/kernel/git/overlayfs/vfs

Pull more overlayfs updates from Amir Goldstein:
 "This is a small 'move code around' followup by Christian to his work
  on porting overlayfs to the new mount api for 6.5. It makes things a
  bit cleaner and simpler for the next development cycle when I hand
  overlayfs back over to Miklos"

* tag 'ovl-update-6.5-2' of git://git.kernel.org/pub/scm/linux/kernel/git/overlayfs/vfs:
  ovl: move all parameter handling into params.{c,h}
2023-07-04 11:52:54 -07:00
Linus Torvalds
94c76955e8 gfs2 fixes
- Move the freeze/thaw logic from glock callback context to process /
   worker thread context to prevent deadlocks.
 
 - Fix a quota reference couting bug in do_qc().
 
 - Carry on deallocating inodes even when gfs2_rindex_update() fails.
 
 - Retry filesystem-internal reads when they are interruped by a signal.
 
 - Eliminate kmap_atomic() in favor of kmap_local_page() /
   memcpy_{from,to}_page().
 
 - Get rid of noop_direct_IO.
 
 - And a few more minor fixes and cleanups.
 -----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCAAyFiEEJZs3krPW0xkhLMTc1b+f6wMTZToFAmSkEWIUHGFncnVlbmJh
 QHJlZGhhdC5jb20ACgkQ1b+f6wMTZTqTSQ/+OKFnr1O1pVFP2k4CA5bGuJX4sfOt
 TT045OAfy7VuE79q1Xi6HZ880kdP7fg4XUbcrwm45fEt5Q5kbbe6pjP34fCGGVSE
 vLmb5BG/msb6fDLRN17S6uzyqKwFv9tGofTAtCtWTs7gmtezpBByHNHAu6duG89c
 5LOhUbz3BhIOPrHUs6RB3h/OpmofVNwWCKQmfCEzYsHMzoWz1XuZJMWRb5Ho6BII
 mOF0tWlEfr5gdavyWZ9UkuHqe3HI/1PfUNFVAD9S/V3kp2XPnc+HM3yP+S8df4p8
 HT5VqVjH3JQL7sf6CnUXo9LP1veB+hHuvAaOyrHIKqHbIHwgxlWIIYcEYWKEGG8p
 1CMjm6bI/hLQuFBUrpD1z3ppLavPZdl16Z3kCAVx8Ixl3livqMuiiZGBRdtBGdBr
 RT66+SX1GurLp1EcWncqEXdJ5jgYKqVeArZKdh3thXSVaO+b8yq3IeuDRHseLnLA
 egyzO3yLocvze3YFiZI4Y0V4ako9NO+2GNPd5+O9Bh9L7RepjiMCloDNpfDPQsTR
 fJdW4t9vMts2eCzRASsXBUdUUV0k1Qvxdm+9BbDHFW9YvH7BQzEoi6qTZFsamdGT
 NQqKdhrfiKsSEA8HFCgePYPzOGMAyH7fiaQUaXUGwkZy6AVmK+piKlJn+csNKbHQ
 3dUo4upAzoeMwsA=
 =GKxj
 -----END PGP SIGNATURE-----

Merge tag 'gfs2-v6.4-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2

Pull gfs2 updates from Andreas Gruenbacher:

 - Move the freeze/thaw logic from glock callback context to process /
   worker thread context to prevent deadlocks

 - Fix a quota reference couting bug in do_qc()

 - Carry on deallocating inodes even when gfs2_rindex_update() fails

 - Retry filesystem-internal reads when they are interruped by a signal

 - Eliminate kmap_atomic() in favor of kmap_local_page() /
   memcpy_{from,to}_page()

 - Get rid of noop_direct_IO

 - And a few more minor fixes and cleanups

* tag 'gfs2-v6.4-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2: (23 commits)
  gfs2: Add quota_change type
  gfs2: Use memcpy_{from,to}_page where appropriate
  gfs2: Convert remaining kmap_atomic calls to kmap_local_page
  gfs2: Replace deprecated kmap_atomic with kmap_local_page
  gfs: Get rid of unnucessary locking in inode_go_dump
  gfs2: gfs2_freeze_lock_shared cleanup
  gfs2: Replace sd_freeze_state with SDF_FROZEN flag
  gfs2: Rework freeze / thaw logic
  gfs2: Rename SDF_{FS_FROZEN => FREEZE_INITIATOR}
  gfs2: Reconfiguring frozen filesystem already rejected
  gfs2: Rename gfs2_freeze_lock{ => _shared }
  gfs2: Rename the {freeze,thaw}_super callbacks
  gfs2: Rename remaining "transaction" glock references
  gfs2: retry interrupted internal reads
  gfs2: Fix possible data races in gfs2_show_options()
  gfs2: Fix duplicate should_fault_in_pages() call
  gfs2: set FMODE_CAN_ODIRECT instead of a dummy direct_IO method
  gfs2: Don't remember delete unless it's successful
  gfs2: Update rl_unlinked before releasing rgrp lock
  gfs2: Fix gfs2_qa_get imbalance in gfs2_quota_hold
  ...
2023-07-04 11:45:16 -07:00
Linus Torvalds
ccf46d8531 More power management updates for 6.5-rc1
- Add missing __init annotation to one function in the intel_idle
    drvier (Rafael Wysocki).
 
  - Make intel_pstate use a correct scaling factor when mapping HWP
    performance levels to frequency values on hybrid-capable systems
    with disabled E-cores (Srinivas Pandruvada).
 
  - Fix Kconfig dependencies of the cpufreq-dt-platform driver (Viresh
    Kumar).
 
  - Add support to build cpufreq-dt-platdev as a module (Zhipeng Wang).
 
  - Don't allocate Sparc's cpufreq_driver dynamically (Viresh Kumar).
 
  - Add support for TI's AM62A7 platform (Vibhore Vardhan).
 
  - Add support for Armada's ap807 platform (Russell King (Oracle)).
 
  - Add support for StarFive JH7110 SoC (Mason Huo).
 
  - Fix voltage selection for Mediatek Socs (Daniel Golle).
 
  - Fix error handling in Tegra's cpufreq driver (Christophe JAILLET).
 
  - Document Qualcomm's IPQ8074 in DT bindings (Robert Marko).
 
  - Don't warn for disabling a non-existing frequency for imx6q cpufreq
    driver (Christoph Niedermaier).
 
  - Use dev_err_probe() in Qualcomm's cpufreq driver (Andrew Halaney).
 
  - Simplify performance state related logic in the OPP core (Viresh
    Kumar).
 
  - Fix use-after-free and improve locking around lazy_opp_tables (Viresh
    Kumar, Stephan Gerhold).
 
  - Minor cleanups - using dev_err_probe() and rate-limiting debug
    messages (Andrew Halaney, Adrián Larumbe).
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAmSkTGwSHHJqd0Byand5
 c29ja2kubmV0AAoJEILEb/54YlRxbxAQAJ92EqcE2n2Eyts6KU7n3sKu8jEIiNPV
 3bdFgTfjsc26Sm+ppiISRkk0CCNcyljCM4I5wRtEe4+hfTszAdrFd2SNYUHjHCZl
 xIz9MVtZlh3ODm2d8ClFvQEL3So+oO6UG5V5IS27/34sfgQAS+77AWtjdh/j8d+0
 ZM4jikbR87eACmpmgAjp17LIH2qdKx4Sri5VUZSNVde4es/m306gkhT5hZFDrRai
 S+s/Mxfp4sxSNNiZC6nXp10bvU/cT7ZWsOcaaOltZmt/Y9yWiLFEmNKjR3ljvzOA
 H6EPMcfhdBUiIhdMpjqMttVNOOgaU2ma/JKQF5JLDkFCnT/BGVi3HsvA8uT8tIgA
 doSZ0+ODQ7ar/v55Rv9ENwshfb5oaPDV729EhQyCrNCc77yvjwA1sXoeDEuS2GXa
 sYXHIsuAs/aNzUdIQZ2G5pjG0zs3xu6oMODfbYoPGg3tj9ITT6TqxnBkjBhsfTKl
 bv0bqh/nMPVcLZaOEXvM538XiEfKmGz9wWH5d8WydyDkCGTG97m/bQaZOlEry7RY
 R70vPOi8ArVRwJlmW5Gl8NENy2DsCMj+BXmxMU2EwhwL1KLvDWQ+Qf6MFU7o0iym
 W/5RQGbbH/clVJL+zXSTsqmfMJanrIM4DY2wKaRz0o2xxiOh/cNRuzm0ZjfAmbml
 MjEhdmcNX76P
 =7lBb
 -----END PGP SIGNATURE-----

Merge tag 'pm-6.5-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull more power management updates from Rafael Wysocki:
 "These add support for new hardware (ap807 and AM62A7), fix several
  issues in cpufreq drivers and in the operating performance points
  (OPP) framework, fix up intel_idle after recent changes and add
  documentation.

  Specifics:

   - Add missing __init annotation to one function in the intel_idle
     drvier (Rafael Wysocki)

   - Make intel_pstate use a correct scaling factor when mapping HWP
     performance levels to frequency values on hybrid-capable systems
     with disabled E-cores (Srinivas Pandruvada)

   - Fix Kconfig dependencies of the cpufreq-dt-platform driver (Viresh
     Kumar)

   - Add support to build cpufreq-dt-platdev as a module (Zhipeng Wang)

   - Don't allocate Sparc's cpufreq_driver dynamically (Viresh Kumar)

   - Add support for TI's AM62A7 platform (Vibhore Vardhan)

   - Add support for Armada's ap807 platform (Russell King (Oracle))

   - Add support for StarFive JH7110 SoC (Mason Huo)

   - Fix voltage selection for Mediatek Socs (Daniel Golle)

   - Fix error handling in Tegra's cpufreq driver (Christophe JAILLET)

   - Document Qualcomm's IPQ8074 in DT bindings (Robert Marko)

   - Don't warn for disabling a non-existing frequency for imx6q cpufreq
     driver (Christoph Niedermaier)

   - Use dev_err_probe() in Qualcomm's cpufreq driver (Andrew Halaney)

   - Simplify performance state related logic in the OPP core (Viresh
     Kumar)

   - Fix use-after-free and improve locking around lazy_opp_tables
     (Viresh Kumar, Stephan Gerhold)

   - Minor cleanups - using dev_err_probe() and rate-limiting debug
     messages (Andrew Halaney, Adrián Larumbe)"

* tag 'pm-6.5-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (23 commits)
  cpufreq: intel_pstate: Fix scaling for hybrid-capable systems with disabled E-cores
  cpufreq: Make CONFIG_CPUFREQ_DT_PLATDEV depend on OF
  intel_idle: Add __init annotation to matchup_vm_state_with_baremetal()
  OPP: Properly propagate error along when failing to get icc_path
  OPP: Use dev_err_probe() when failing to get icc_path
  cpufreq: qcom-cpufreq-hw: Use dev_err_probe() when failing to get icc paths
  cpufreq: mediatek: correct voltages for MT7622 and MT7623
  cpufreq: armada-8k: add ap807 support
  OPP: Simplify the over-designed pstate <-> level dance
  OPP: pstate is only valid for genpd OPP tables
  OPP: don't drop performance constraint on OPP table removal
  OPP: Protect `lazy_opp_tables` list with `opp_table_lock`
  OPP: Staticize `lazy_opp_tables` in of.c
  cpufreq: dt-platdev: Support building as module
  opp: Fix use-after-free in lazy_opp_tables after probe deferral
  dt-bindings: cpufreq: qcom-cpufreq-nvmem: document IPQ8074
  cpufreq: dt-platdev: Blacklist ti,am62a7 SoC
  cpufreq: ti-cpufreq: Add support for AM62A7
  OPP: rate-limit debug messages when no change in OPP is required
  cpufreq: imx6q: don't warn for disabling a non-existing frequency
  ...
2023-07-04 11:22:50 -07:00
Linus Torvalds
b869e9f499 Another set of clk driver updates and fixes for the merge window. The
driver updates needed more time to bake in linux-next.
 
 Updates:
  - Support for more clk controllers in Qualcomm SoCs such as SM8350,
    SM8450, SDX75, SC8280XP, and IPQ9574
  - Runtime PM enablement of some more Qualcomm clk controllers
  - Various fixes to Qualcomm clk driver data to use correct clk_ops
    and to check halt bits properly
  - AT91 updates to modernize with clk_parent_data structures
 
 Fixes:
  - Remove "syscon" from dt binding fix for ti,j721e-system-controller
  - Fix determine rate in the Tegra driver that got wrecked by the
    refactorting of muxes this merge window
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCAAvFiEE9L57QeeUxqYDyoaDrQKIl8bklSUFAmSkRq8RHHNib3lkQGtl
 cm5lbC5vcmcACgkQrQKIl8bklSUgHBAAkXU0agruD22fecxI/RliNIwyRY4F8CNW
 aKuz5XSGLTZAgaBDjFHPb2VSph9NXbFsLxxmxdrf7bDCAw6ww90vK0G9+JS/b/9D
 fGhqvl+d8oFeFAislr7eSDhKCMlKd5I+v6mUUk0WrPqKddHtIvq2Wh8j7+j9OZmh
 2lov68N95xYTKQPs4+mb0geDktCnFKJsIt2z9DnyPi5ixS5d6qJARXmOruJt7mbq
 1bd9iJLmOZPWm703hJcreLgyo9zhTxLY+BUrHM4S8J72BTFHN/OyLyKai25mSXll
 Odw4/G42VuykV03kvyAZNS2ebhw1DJ+px9jrfU5FH6TqIgfZG2tBnP0RkPbamNcA
 Y9ZBNtaJMfIk/Nqg5q8cyO72y2sfc2QUNt2qn0i02BxOSfSLLxQDuKiWPbilC7ju
 9zYDGIZUhZWLpfK3a+QWAU0VNGak+sq/ZbAQijI8q8MriilwhonQsXQerQC0Jzcw
 zKdMoVPCedK411UZh9fMmIBwEsAFEtYo/qDynpv+w0zwkpDY2t5Sh7lGlQFOiynZ
 YVFbUD7EEf3GJmftdEWTHQpEd+GlkWRoGHO3L9hfyxxUANXQGXhpI4YvpfldntmK
 UlGguQwRqqDzdCm+tBiFlt+j4Vh5A8ywTSG8HLzxy7Gsu7BRkqcafRrlxNmMK8Wv
 aCixlFKAkdA=
 =cF5c
 -----END PGP SIGNATURE-----

Merge tag 'clk-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux

Pull more clk updates from Stephen Boyd:
 "Another set of clk driver updates and fixes for the merge window. The
  driver updates needed more time to bake in linux-next.

  Updates:
   - Support for more clk controllers in Qualcomm SoCs such as SM8350,
     SM8450, SDX75, SC8280XP, and IPQ9574
   - Runtime PM enablement of some more Qualcomm clk controllers
   - Various fixes to Qualcomm clk driver data to use correct clk_ops
     and to check halt bits properly
   - AT91 updates to modernize with clk_parent_data structures

  Fixes:
   - Remove 'syscon' from dt binding fix for ti,j721e-system-controller
   - Fix determine rate in the Tegra driver that got wrecked by the
     refactorting of muxes this merge window"

* tag 'clk-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux: (69 commits)
  clk: tegra: Avoid calling an uninitialized function
  dt-bindings: mfd: ti,j721e-system-controller: Remove syscon from example
  clk: at91: sama7g5: s/ep_chg_chg_id/ep_chg_id
  clk: at91: sama7g5: switch to parent_hw and parent_data
  clk: at91: sckc: switch to parent_data/parent_hw
  clk: at91: clk-sam9x60-pll: add support for parent_hw
  clk: at91: clk-utmi: add support for parent_hw
  clk: at91: clk-system: add support for parent_hw
  clk: at91: clk-programmable: add support for parent_hw
  clk: at91: clk-peripheral: add support for parent_hw
  clk: at91: clk-master: add support for parent_hw
  clk: at91: clk-generated: add support for parent_hw
  clk: at91: clk-main: add support for parent_data/parent_hw
  clk: qcom: gcc-sc8280xp: Add runtime PM
  clk: qcom: gpucc-sc8280xp: Add runtime PM
  clk: qcom: mmcc-msm8974: fix MDSS_GDSC power flags
  clk: qcom: gpucc-sm6375: Enable runtime pm
  dt-bindings: clock: sm6375-gpucc: Add VDD_GX
  clk: qcom: gcc-sm6115: Add missing PLL config properties
  clk: qcom: clk-alpha-pll: Add a way to update some bits of test_ctl(_hi)
  ...
2023-07-04 11:07:45 -07:00