Commit Graph

2483 Commits

Author SHA1 Message Date
Ilya Dryomov
18d44c5d06 ceph: allocate sparse_ext map only for sparse reads
If mounted with sparseread option, ceph_direct_read_write() ends up
making an unnecessarily allocation for O_DIRECT writes.

Fixes: 03bc06c7b0 ("ceph: add new mount option to enable sparse reads")
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Alex Markuze <amarkuze@redhat.com>
2024-12-16 23:25:44 +01:00
Ilya Dryomov
66e0c4f914 ceph: fix memory leak in ceph_direct_read_write()
The bvecs array which is allocated in iter_get_bvecs_alloc() is leaked
and pages remain pinned if ceph_alloc_sparse_ext_map() fails.

There is no need to delay the allocation of sparse_ext map until after
the bvecs array is set up, so fix this by moving sparse_ext allocation
a bit earlier.  Also, make a similar adjustment in __ceph_sync_read()
for consistency (a leak of the same kind in __ceph_sync_read() has been
addressed differently).

Cc: stable@vger.kernel.org
Fixes: 03bc06c7b0 ("ceph: add new mount option to enable sparse reads")
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Alex Markuze <amarkuze@redhat.com>
2024-12-16 23:25:44 +01:00
Alex Markuze
9abee47580 ceph: improve error handling and short/overflow-read logic in __ceph_sync_read()
This patch refines the read logic in __ceph_sync_read() to ensure more
predictable and efficient behavior in various edge cases.

- Return early if the requested read length is zero or if the file size
  (`i_size`) is zero.
- Initialize the index variable (`idx`) where needed and reorder some
  code to ensure it is always set before use.
- Improve error handling by checking for negative return values earlier.
- Remove redundant encrypted file checks after failures. Only attempt
  filesystem-level decryption if the read succeeded.
- Simplify leftover calculations to correctly handle cases where the
  read extends beyond the end of the file or stops short.  This can be
  hit by continuously reading a file while, on another client, we keep
  truncating and writing new data into it.
- This resolves multiple issues caused by integer and consequent buffer
  overflow (`pages` array being accessed beyond `num_pages`):
  - https://tracker.ceph.com/issues/67524
  - https://tracker.ceph.com/issues/68980
  - https://tracker.ceph.com/issues/68981

Cc: stable@vger.kernel.org
Fixes: 1065da21e5 ("ceph: stop copying to iter at EOF on sync reads")
Reported-by: Luis Henriques (SUSE) <luis.henriques@linux.dev>
Signed-off-by: Alex Markuze <amarkuze@redhat.com>
Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-12-16 23:25:43 +01:00
Ilya Dryomov
12eb22a5a6 ceph: validate snapdirname option length when mounting
It becomes a path component, so it shouldn't exceed NAME_MAX
characters.  This was hardened in commit c152737be2 ("ceph: Use
strscpy() instead of strcpy() in __get_snap_name()"), but no actual
check was put in place.

Cc: stable@vger.kernel.org
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Alex Markuze <amarkuze@redhat.com>
2024-12-16 23:25:43 +01:00
Max Kellermann
550f7ca98e ceph: give up on paths longer than PATH_MAX
If the full path to be built by ceph_mdsc_build_path() happens to be
longer than PATH_MAX, then this function will enter an endless (retry)
loop, effectively blocking the whole task.  Most of the machine
becomes unusable, making this a very simple and effective DoS
vulnerability.

I cannot imagine why this retry was ever implemented, but it seems
rather useless and harmful to me.  Let's remove it and fail with
ENAMETOOLONG instead.

Cc: stable@vger.kernel.org
Reported-by: Dario Weißer <dario@cure53.de>
Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
Reviewed-by: Alex Markuze <amarkuze@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-12-16 23:25:43 +01:00
Max Kellermann
d6fd6f8280 ceph: fix memory leaks in __ceph_sync_read()
In two `break` statements, the call to ceph_release_page_vector() was
missing, leaking the allocation from ceph_alloc_page_vector().

Instead of adding the missing ceph_release_page_vector() calls, the
Ceph maintainers preferred to transfer page ownership to the
`ceph_osd_request` by passing `own_pages=true` to
osd_req_op_extent_osd_data_pages().  This requires postponing the
ceph_osdc_put_request() call until after the block that accesses the
`pages`.

Cc: stable@vger.kernel.org
Fixes: 03bc06c7b0 ("ceph: add new mount option to enable sparse reads")
Fixes: f0fe1e54cf ("ceph: plumb in decryption during reads")
Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-12-16 23:25:43 +01:00
Linus Torvalds
9d0ad04553 A fix for the mount "device" string parser from Patrick and two cred
reference counting fixups from Max, marked for stable.  Also included
 a number of cleanups and a tweak to MAINTAINERS to avoid unnecessarily
 CCing netdev list.
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAmdJ/sMTHGlkcnlvbW92
 QGdtYWlsLmNvbQAKCRBKf944AhHzi8EOB/9Jhq1nOe0dN7aAWN1owZH85TXmOLuX
 eSS79AJp63lmJgx+mF0CbLLN6Vwjvm1vqz10Uhe5VCmqtxKy1/F4QxEOwk+zEMwT
 iGkM+6nUtMMqnxqItJpFC19YQONwgidsNcbi7v8nDEHqH8FXEC4R0pi0990bUQSj
 E8zVzq44TNFQrhWjDJHPnXsxbH9SijRuwu1O4KEZ2HK0QQ8LfpPptozJawH0p2Hs
 Wc6V6ppt7o9F9MHW137I4BOG9xm18aAQa9Ztd5GHhip63SuLpdQNwoM9JhC8J/bt
 bcci5CoCa34P57g9/1kh9ov/NXf9XgjSCFlDzk9zZH0IxaX5CYgTqYL5
 =xf5D
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-6.13-rc1' of https://github.com/ceph/ceph-client

Pull ceph updates from Ilya Dryomov:
 "A fix for the mount "device" string parser from Patrick and two cred
  reference counting fixups from Max, marked for stable.

  Also included a number of cleanups and a tweak to MAINTAINERS to avoid
  unnecessarily CCing netdev list"

* tag 'ceph-for-6.13-rc1' of https://github.com/ceph/ceph-client:
  ceph: fix cred leak in ceph_mds_check_access()
  ceph: pass cred pointer to ceph_mds_auth_match()
  ceph: improve caps debugging output
  ceph: correct ceph_mds_cap_peer field name
  ceph: correct ceph_mds_cap_item field name
  ceph: miscellaneous spelling fixes
  ceph: Use strscpy() instead of strcpy() in __get_snap_name()
  ceph: Use str_true_false() helper in status_show()
  ceph: requalify some char pointers as const
  ceph: extract entity name from device id
  MAINTAINERS: exclude net/ceph from networking
  ceph: Remove fs/ceph deadcode
  libceph: Remove unused ceph_crypto_key_encode
  libceph: Remove unused ceph_osdc_watch_check
  libceph: Remove unused pagevec functions
  libceph: Remove unused ceph_pagelist functions
2024-11-30 10:22:38 -08:00
Max Kellermann
c5cf420303 ceph: fix cred leak in ceph_mds_check_access()
get_current_cred() increments the reference counter, but the
put_cred() call was missing.

Cc: stable@vger.kernel.org
Fixes: 596afb0b89 ("ceph: add ceph_mds_check_access() helper")
Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-11-25 15:51:13 +01:00
Max Kellermann
23426309a4 ceph: pass cred pointer to ceph_mds_auth_match()
This eliminates a redundant get_current_cred() call, because
ceph_mds_check_access() has already obtained this pointer.

As a side effect, this also fixes a reference leak in
ceph_mds_auth_match(): by omitting the get_current_cred() call, no
additional cred reference is taken.

Cc: stable@vger.kernel.org
Fixes: 596afb0b89 ("ceph: add ceph_mds_check_access() helper")
Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-11-25 15:51:13 +01:00
Patrick Donnelly
8ea412e181 ceph: improve caps debugging output
This improves uniformity and exposes important sequence numbers.

Now looks like:

    <7>[   73.749563] ceph:           caps.c:4465 : [c9653bca-110b-4f70-9f84-5a195b205e9a 15290]  caps mds2 op export ino 20000000000.fffffffffffffffe inode 0000000008d2e5ea seq 0 iseq 0 mseq 0
    ...
    <7>[   73.749574] ceph:           caps.c:4102 : [c9653bca-110b-4f70-9f84-5a195b205e9a 15290]  cap 20000000000.fffffffffffffffe export to peer 1 piseq 1 pmseq 1
    ...
    <7>[   73.749645] ceph:           caps.c:4465 : [c9653bca-110b-4f70-9f84-5a195b205e9a 15290]  caps mds1 op import ino 20000000000.fffffffffffffffe inode 0000000008d2e5ea seq 1 iseq 1 mseq 1
    ...
    <7>[   73.749681] ceph:           caps.c:4244 : [c9653bca-110b-4f70-9f84-5a195b205e9a 15290]  cap 20000000000.fffffffffffffffe import from peer 2 piseq 686 pmseq 0
    ...
    <7>[  248.645596] ceph:           caps.c:4465 : [c9653bca-110b-4f70-9f84-5a195b205e9a 15290]  caps mds1 op revoke ino 20000000000.fffffffffffffffe inode 0000000008d2e5ea seq 2538 iseq 1 mseq 1

See also ceph.git commit cb4ff28af09f ("mds: add issue_seq to all cap
messages").

Link: https://tracker.ceph.com/issues/66704
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-11-19 11:47:16 +01:00
Patrick Donnelly
8b41ac43c7 ceph: correct ceph_mds_cap_peer field name
The peer seq is used as the issue_seq. Use that name for consistency.
See also ceph.git commit 1da6ef237fc7 ("include/ceph_fs: correct
ceph_mds_cap_peer field name").

Link: https://tracker.ceph.com/issues/66704
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-11-19 11:08:17 +01:00
Patrick Donnelly
50f42c4895 ceph: correct ceph_mds_cap_item field name
The issue_seq is sent with bulk cap releases, not the current sequence
number. See also ceph.git commit 655cddb7c9f3 ("include/ceph_fs: correct
ceph_mds_cap_item field name").

Link: https://tracker.ceph.com/issues/66704
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-11-19 11:08:17 +01:00
Linus Torvalds
56be9aaf98 vfs-6.13.pagecache
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZzcUQAAKCRCRxhvAZXjc
 onEpAQCUdwIBHpwmSIFvJFA9aNGpbLzi0dDSEIxuWYtp5qVuogD+ImccwqpG3kEi
 Zq9vokdPpB1zbahxKl1mkvBG4G0GFQE=
 =LbP6
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.13.pagecache' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs pagecache updates from Christian Brauner:
 "Cleanup filesystem page flag usage: This continues the work to make
  the mappedtodisk/owner_2 flag available to filesystems which don't use
  buffer heads. Further patches remove uses of Private2. This brings us
  very close to being rid of it entirely"

* tag 'vfs-6.13.pagecache' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  migrate: Remove references to Private2
  ceph: Remove call to PagePrivate2()
  btrfs: Switch from using the private_2 flag to owner_2
  mm: Remove PageMappedToDisk
  nilfs2: Convert nilfs_copy_buffer() to use folios
  fs: Move clearing of mappedtodisk to buffer.c
2024-11-18 09:54:32 -08:00
Dmitry Antipov
3500000bb1 ceph: miscellaneous spelling fixes
Correct spelling here and there as suggested by codespell.

Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-11-18 17:34:36 +01:00
Abdul Rahim
c152737be2 ceph: Use strscpy() instead of strcpy() in __get_snap_name()
strcpy() performs no bounds checking on the destination buffer. This
could result in linear overflows beyond the end of the buffer, leading
to all kinds of misbehaviors [1].

This fixes checkpatch warning:
    WARNING: Prefer strscpy over strcpy

[1] https://www.kernel.org/doc/html/latest/process/deprecated.html#strcpy

[ idryomov: formatting ]

Signed-off-by: Abdul Rahim <abdul.rahim@myyahoo.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-11-18 17:34:36 +01:00
Thorsten Blum
e50f960bea ceph: Use str_true_false() helper in status_show()
Remove hard-coded strings by using the str_true_false() helper function.

Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-11-18 17:34:36 +01:00
Patrick Donnelly
64cf95d0b1 ceph: requalify some char pointers as const
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-11-18 17:34:36 +01:00
Patrick Donnelly
955710afcb ceph: extract entity name from device id
Previously, the "name" in the new device syntax "<name>@<fsid>.<fsname>"
was ignored because (presumably) tests were done using mount.ceph which
also passed the entity name using "-o name=foo". If mounting is done
without the mount.ceph helper, the new device id syntax fails to set
the name properly.

Cc: stable@vger.kernel.org
Link: https://tracker.ceph.com/issues/68516
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-11-18 17:34:35 +01:00
Dr. David Alan Gilbert
6025b482e4 ceph: Remove fs/ceph deadcode
ceph_caps_revoking() has been unused since 2017's commit
3fb99d483e ("ceph: nuke startsync op")

ceph_mdsc_open_export_target_sessions() has been unused since 2013's
commit 11df2dfb61 ("ceph: add imported caps when handling cap export message")

Remove them.

Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-11-18 17:34:35 +01:00
Linus Torvalds
a3a37691e6 A fix from Patrick for a variety of CephFS lockup scenarios caused by
a regression in cap handling which sneaked in through the netfs helper
 library in 5.18 (marked for stable) and an unrelated one-line cleanup.
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAmcACf8THGlkcnlvbW92
 QGdtYWlsLmNvbQAKCRBKf944AhHzi3DiCACDP9ZubGLZ2GKv3fEyXEURnOyrSBAd
 M4Ci3cc31W0Q1rWtQndGxW9QF/9smcp4kN/xgelWfSZc6FJ1HLR1RHUPjMJpBPx2
 dQs+RaaBTIe0coVaMmiDnBQ1wnfuNhEZH8opnLonjHygANnoR2aYZpyIxpgYD7eZ
 KHfk1F/2rmE+QzYGB+tE7sCpRs3bhbtL2xa/2yQJyw+EWfgyJUjPW1cZe0liiWHi
 dXTMxJqfZLLUW2otLdGnaaZS+7yF3ItcoAWh6cUoiZ9cpqDu43B0RGXOV7Kegui1
 2e8W0sK3QIIYbUKoHo/4VPqrmN4puSjeNteumDvpZE4SgG8BfKWzQnmh
 =3OKU
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-6.12-rc2' of https://github.com/ceph/ceph-client

Pull ceph fixes from Ilya Dryomov:
 "A fix from Patrick for a variety of CephFS lockup scenarios caused by
  a regression in cap handling which sneaked in through the netfs helper
  library in 5.18 (marked for stable) and an unrelated one-line cleanup"

* tag 'ceph-for-6.12-rc2' of https://github.com/ceph/ceph-client:
  ceph: fix cap ref leak via netfs init_request
  ceph: use struct_size() helper in __ceph_pool_perm_get()
2024-10-04 10:10:23 -07:00
Matthew Wilcox (Oracle)
fd15ba4cb0
ceph: Remove call to PagePrivate2()
Use the folio that we already have to call folio_test_private_2()
instead.  This is the last call to PagePrivate2(), so replace its
PAGEFLAG() definition with FOLIO_FLAG().

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://lore.kernel.org/r/20241002040111.1023018-6-willy@infradead.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-10-04 09:24:25 +02:00
Patrick Donnelly
ccda9910d8 ceph: fix cap ref leak via netfs init_request
Log recovered from a user's cluster:

    <7>[ 5413.970692] ceph:  get_cap_refs 00000000958c114b ret 1 got Fr
    <7>[ 5413.970695] ceph:  start_read 00000000958c114b, no cache cap
    ...
    <7>[ 5473.934609] ceph:   my wanted = Fr, used = Fr, dirty -
    <7>[ 5473.934616] ceph:  revocation: pAsLsXsFr -> pAsLsXs (revoking Fr)
    <7>[ 5473.934632] ceph:  __ceph_caps_issued 00000000958c114b cap 00000000f7784259 issued pAsLsXs
    <7>[ 5473.934638] ceph:  check_caps 10000000e68.fffffffffffffffe file_want - used Fr dirty - flushing - issued pAsLsXs revoking Fr retain pAsLsXsFsr  AUTHONLY NOINVAL FLUSH_FORCE

The MDS subsequently complains that the kernel client is late releasing
caps.

Approximately, a series of changes to this code by commits 4987005600
("ceph: convert ceph_readpages to ceph_readahead"), 2de1604173
("netfs: Change ->init_request() to return an error code") and
a5c9dc4451 ("ceph: Make ceph_init_request() check caps on readahead")
resulted in subtle resource cleanup to be missed. The main culprit is
the change in error handling in 2de1604173 which meant that a failure
in init_request() would no longer cause cleanup to be called. That
would prevent the ceph_put_cap_refs() call which would cleanup the
leaked cap ref.

Cc: stable@vger.kernel.org
Fixes: a5c9dc4451 ("ceph: Make ceph_init_request() check caps on readahead")
Link: https://tracker.ceph.com/issues/67008
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-10-03 09:31:08 +02:00
Thorsten Blum
7264745d55 ceph: use struct_size() helper in __ceph_pool_perm_get()
Use struct_size() to calculate the number of bytes to be allocated.

Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-10-03 09:31:08 +02:00
Al Viro
5f60d5f6bb move asm/unaligned.h to linux/unaligned.h
asm/unaligned.h is always an include of asm-generic/unaligned.h;
might as well move that thing to linux/unaligned.h and include
that - there's nothing arch-specific in that header.

auto-generated by the following:

for i in `git grep -l -w asm/unaligned.h`; do
	sed -i -e "s/asm\/unaligned.h/linux\/unaligned.h/" $i
done
for i in `git grep -l -w asm-generic/unaligned.h`; do
	sed -i -e "s/asm-generic\/unaligned.h/linux\/unaligned.h/" $i
done
git mv include/asm-generic/unaligned.h include/linux/unaligned.h
git mv tools/include/asm-generic/unaligned.h tools/include/linux/unaligned.h
sed -i -e "/unaligned.h/d" include/asm-generic/Kbuild
sed -i -e "s/__ASM_GENERIC/__LINUX/" include/linux/unaligned.h tools/include/linux/unaligned.h
2024-10-02 17:23:23 -04:00
Linus Torvalds
894b3c35d1 Three CephFS fixes from Xiubo and Luis and a bunch of assorted
cleanups.
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAmb3JroTHGlkcnlvbW92
 QGdtYWlsLmNvbQAKCRBKf944AhHzizDiB/0elHQQaFxXMjuJRY1IzohozAHi0cHK
 gwgE4nEbECE8vRYK/QvyvZ3S+ep+N+r6jOIiIDyqhjtlY3//oSyyxL7RjMJlVFBq
 Ie37w8r4q1aL1mn9QDQ4iQxcRYyU+JxcUcPR1UUUvLiKgWaRixmq27zby/WQSrkA
 ke2ScBRDtEAYVtdxvxmUJK/DrPr3skwJAGY52KesjwgVhXSL8KG9X1zMUbWdJYDV
 THbQzLZsu4NVh7LlAsS/mh+z0EIZsXxQYU5IY3dIVEYcuLK93lXRGZb+7whtmUef
 wsDtYIe/w30QVxFdrN28qAQp8daUJhp+3t0EZSyecRcq5OPey6ICx1P4
 =+bdB
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-6.12-rc1' of https://github.com/ceph/ceph-client

Pull ceph updates from Ilya Dryomov:
 "Three CephFS fixes from Xiubo and Luis and a bunch of assorted
  cleanups"

* tag 'ceph-for-6.12-rc1' of https://github.com/ceph/ceph-client:
  ceph: remove the incorrect Fw reference check when dirtying pages
  ceph: Remove empty definition in header file
  ceph: Fix typo in the comment
  ceph: fix a memory leak on cap_auths in MDS client
  ceph: flush all caps releases when syncing the whole filesystem
  ceph: rename ceph_flush_cap_releases() to ceph_flush_session_cap_releases()
  libceph: use min() to simplify code in ceph_dns_resolve_name()
  ceph: Convert to use jiffies macro
  ceph: Remove unused declarations
2024-09-28 08:40:36 -07:00
Xiubo Li
c08dfb1b49 ceph: remove the incorrect Fw reference check when dirtying pages
When doing the direct-io reads it will also try to mark pages dirty,
but for the read path it won't hold the Fw caps and there is case
will it get the Fw reference.

Fixes: 5dda377cf0 ("ceph: set i_head_snapc when getting CEPH_CAP_FILE_WR reference")
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Patrick Donnelly <pdonnell@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-09-24 22:51:33 +02:00
Zhang Zekun
74249188f3 ceph: Remove empty definition in header file
The real definition of ceph_acl_chmod() has been removed since
commit 4db658ea0c ("ceph: Fix up after semantic merge conflict"),
remain the empty definition untouched in the header files. Let's
remove the empty definition.

Signed-off-by: Zhang Zekun <zhangzekun11@huawei.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-09-24 22:51:33 +02:00
Yan Zhen
0039aebfe8 ceph: Fix typo in the comment
Correctly spelled comments make it easier for the reader to understand
the code.

replace 'tagert' with 'target' in the comment &
replace 'vaild' with 'valid' in the comment &
replace 'carefull' with 'careful' in the comment &
replace 'trsaverse' with 'traverse' in the comment.

Signed-off-by: Yan Zhen <yanzhen@vivo.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-09-24 22:51:33 +02:00
Luis Henriques (SUSE)
d97079e97e ceph: fix a memory leak on cap_auths in MDS client
The cap_auths that are allocated during an MDS session opening are never
released, causing a memory leak detected by kmemleak.  Fix this by freeing
the memory allocated when shutting down the MDS client.

Fixes: 1d17de9534 ("ceph: save cap_auths in MDS client when session is opened")
Signed-off-by: Luis Henriques (SUSE) <luis.henriques@linux.dev>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-09-24 22:51:33 +02:00
Xiubo Li
adc5246176 ceph: flush all caps releases when syncing the whole filesystem
We have hit a race between cap releases and cap revoke request
that will cause the check_caps() to miss sending a cap revoke ack
to MDS. And the client will depend on the cap release to release
that revoking caps, which could be delayed for some unknown reasons.

In Kclient we have figured out the RCA about race and we need
a way to explictly trigger this manually could help to get rid
of the caps revoke stuck issue.

Link: https://tracker.ceph.com/issues/67221
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-09-24 22:51:28 +02:00
Xiubo Li
c085f6ca95 ceph: rename ceph_flush_cap_releases() to ceph_flush_session_cap_releases()
Prepare for adding a helper to flush the cap releases for all
sessions.

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-09-24 22:51:19 +02:00
Linus Torvalds
35219bc5c7 vfs-6.12.netfs
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZuQEvgAKCRCRxhvAZXjc
 onQWAQD6IxAKPU0zom2FoWNilvSzPs7WglTtvddX9pu/lT1RNAD/YC/wOLW8mvAv
 9oTAmigQDQQhEWdJA9RgLZBiw7k+DAw=
 =zWFb
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.12.netfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull netfs updates from Christian Brauner:
 "This contains the work to improve read/write performance for the new
  netfs library.

  The main performance enhancing changes are:

   - Define a structure, struct folio_queue, and a new iterator type,
     ITER_FOLIOQ, to hold a buffer as a replacement for ITER_XARRAY. See
     that patch for questions about naming and form.

     ITER_FOLIOQ is provided as a replacement for ITER_XARRAY. The
     problem with an xarray is that accessing it requires the use of a
     lock (typically the RCU read lock) - and this means that we can't
     supply iterate_and_advance() with a step function that might sleep
     (crypto for example) without having to drop the lock between pages.
     ITER_FOLIOQ is the iterator for a chain of folio_queue structs,
     where each folio_queue holds a small list of folios. A folio_queue
     struct is a simpler structure than xarray and is not subject to
     concurrent manipulation by the VM. folio_queue is used rather than
     a bvec[] as it can form lists of indefinite size, adding to one end
     and removing from the other on the fly.

   - Provide a copy_folio_from_iter() wrapper.

   - Make cifs RDMA support ITER_FOLIOQ.

   - Use folio queues in the write-side helpers instead of xarrays.

   - Add a function to reset the iterator in a subrequest.

   - Simplify the write-side helpers to use sheaves to skip gaps rather
     than trying to work out where gaps are.

   - In afs, make the read subrequests asynchronous, putting them into
     work items to allow the next patch to do progressive
     unlocking/reading.

   - Overhaul the read-side helpers to improve performance.

   - Fix the caching of a partial block at the end of a file.

   - Allow a store to be cancelled.

  Then some changes for cifs to make it use folio queues instead of
  xarrays for crypto bufferage:

   - Use raw iteration functions rather than manually coding iteration
     when hashing data.

   - Switch to using folio_queue for crypto buffers.

   - Remove the xarray bits.

  Make some adjustments to the /proc/fs/netfs/stats file such that:

   - All the netfs stats lines begin 'Netfs:' but change this to
     something a bit more useful.

   - Add a couple of stats counters to track the numbers of skips and
     waits on the per-inode writeback serialisation lock to make it
     easier to check for this as a source of performance loss.

  Miscellaneous work:

   - Ensure that the sb_writers lock is taken around
     vfs_{set,remove}xattr() in the cachefiles code.

   - Reduce the number of conditional branches in netfs_perform_write().

   - Move the CIFS_INO_MODIFIED_ATTR flag to the netfs_inode struct and
     remove cifs_post_modify().

   - Move the max_len/max_nr_segs members from netfs_io_subrequest to
     netfs_io_request as they're only needed for one subreq at a time.

   - Add an 'unknown' source value for tracing purposes.

   - Remove NETFS_COPY_TO_CACHE as it's no longer used.

   - Set the request work function up front at allocation time.

   - Use bh-disabling spinlocks for rreq->lock as cachefiles completion
     may be run from block-filesystem DIO completion in softirq context.

   - Remove fs/netfs/io.c"

* tag 'vfs-6.12.netfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (25 commits)
  docs: filesystems: corrected grammar of netfs page
  cifs: Don't support ITER_XARRAY
  cifs: Switch crypto buffer to use a folio_queue rather than an xarray
  cifs: Use iterate_and_advance*() routines directly for hashing
  netfs: Cancel dirty folios that have no storage destination
  cachefiles, netfs: Fix write to partial block at EOF
  netfs: Remove fs/netfs/io.c
  netfs: Speed up buffered reading
  afs: Make read subreqs async
  netfs: Simplify the writeback code
  netfs: Provide an iterator-reset function
  netfs: Use new folio_queue data type and iterator instead of xarray iter
  cifs: Provide the capability to extract from ITER_FOLIOQ to RDMA SGEs
  iov_iter: Provide copy_folio_from_iter()
  mm: Define struct folio_queue and ITER_FOLIOQ to handle a sequence of folios
  netfs: Use bh-disabling spinlocks for rreq->lock
  netfs: Set the request work function upon allocation
  netfs: Remove NETFS_COPY_TO_CACHE
  netfs: Reserve netfs_sreq_source 0 as unset/unknown
  netfs: Move max_len/max_nr_segs from netfs_io_subrequest to netfs_io_stream
  ...
2024-09-16 12:13:31 +02:00
Linus Torvalds
3352633ce6 vfs-6.12.file
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZuQEwAAKCRCRxhvAZXjc
 osS0AQCgIpvey9oW5DMyMw6Bv0hFMRv95gbNQZfHy09iK+NMNAD9GALhb/4cMIVB
 7YrZGXEz454lpgcs8AnrOVjVNfctOQg=
 =e9s9
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.12.file' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs file updates from Christian Brauner:
 "This is the work to cleanup and shrink struct file significantly.

  Right now, (focusing on x86) struct file is 232 bytes. After this
  series struct file will be 184 bytes aka 3 cacheline and a spare 8
  bytes for future extensions at the end of the struct.

  With struct file being as ubiquitous as it is this should make a
  difference for file heavy workloads and allow further optimizations in
  the future.

   - struct fown_struct was embedded into struct file letting it take up
     32 bytes in total when really it shouldn't even be embedded in
     struct file in the first place. Instead, actual users of struct
     fown_struct now allocate the struct on demand. This frees up 24
     bytes.

   - Move struct file_ra_state into the union containg the cleanup hooks
     and move f_iocb_flags out of the union. This closes a 4 byte hole
     we created earlier and brings struct file to 192 bytes. Which means
     struct file is 3 cachelines and we managed to shrink it by 40
     bytes.

   - Reorder struct file so that nothing crosses a cacheline.

     I suspect that in the future we will end up reordering some members
     to mitigate false sharing issues or just because someone does
     actually provide really good perf data.

   - Shrinking struct file to 192 bytes is only part of the work.

     Files use a slab that is SLAB_TYPESAFE_BY_RCU and when a kmem cache
     is created with SLAB_TYPESAFE_BY_RCU the free pointer must be
     located outside of the object because the cache doesn't know what
     part of the memory can safely be overwritten as it may be needed to
     prevent object recycling.

     That has the consequence that SLAB_TYPESAFE_BY_RCU may end up
     adding a new cacheline.

     So this also contains work to add a new kmem_cache_create_rcu()
     function that allows the caller to specify an offset where the
     freelist pointer is supposed to be placed. Thus avoiding the
     implicit addition of a fourth cacheline.

   - And finally this removes the f_version member in struct file.

     The f_version member isn't particularly well-defined. It is mainly
     used as a cookie to detect concurrent seeks when iterating
     directories. But it is also abused by some subsystems for
     completely unrelated things.

     It is mostly a directory and filesystem specific thing that doesn't
     really need to live in struct file and with its wonky semantics it
     really lacks a specific function.

     For pipes, f_version is (ab)used to defer poll notifications until
     a write has happened. And struct pipe_inode_info is used by
     multiple struct files in their ->private_data so there's no chance
     of pushing that down into file->private_data without introducing
     another pointer indirection.

     But pipes don't rely on f_pos_lock so this adds a union into struct
     file encompassing f_pos_lock and a pipe specific f_pipe member that
     pipes can use. This union of course can be extended to other file
     types and is similar to what we do in struct inode already"

* tag 'vfs-6.12.file' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (26 commits)
  fs: remove f_version
  pipe: use f_pipe
  fs: add f_pipe
  ubifs: store cookie in private data
  ufs: store cookie in private data
  udf: store cookie in private data
  proc: store cookie in private data
  ocfs2: store cookie in private data
  input: remove f_version abuse
  ext4: store cookie in private data
  ext2: store cookie in private data
  affs: store cookie in private data
  fs: add generic_llseek_cookie()
  fs: use must_set_pos()
  fs: add must_set_pos()
  fs: add vfs_setpos_cookie()
  s390: remove unused f_version
  ceph: remove unused f_version
  adi: remove unused f_version
  mm: Removed @freeptr_offset to prevent doc warning
  ...
2024-09-16 09:14:02 +02:00
Linus Torvalds
2775df6e5e vfs-6.12.folio
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZuQEvgAKCRCRxhvAZXjc
 ou77AQD3U1KjbdgzbUi6kaUmiiWOPhfYTlm8mho8dBjqvTCB+AD/XTWSFCWWhHB4
 KyQZTbjRD81xmVNbKjASazp0EA6Ahwc=
 =gIsD
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.12.folio' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs

Pull vfs folio updates from Christian Brauner:
 "This contains work to port write_begin and write_end to rely on folios
  for various filesystems.

  This converts ocfs2, vboxfs, orangefs, jffs2, hostfs, fuse, f2fs,
  ecryptfs, ntfs3, nilfs2, reiserfs, minixfs, qnx6, sysv, ufs, and
  squashfs.

  After this series lands a bunch of the filesystems in this list do not
  mention struct page anymore"

* tag 'vfs-6.12.folio' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: (61 commits)
  Squashfs: Ensure all readahead pages have been used
  Squashfs: Rewrite and update squashfs_readahead_fragment() to not use page->index
  Squashfs: Update squashfs_readpage_block() to not use page->index
  Squashfs: Update squashfs_readahead() to not use page->index
  Squashfs: Update page_actor to not use page->index
  jffs2: Use a folio in jffs2_garbage_collect_dnode()
  jffs2: Convert jffs2_do_readpage_nolock to take a folio
  buffer: Convert __block_write_begin() to take a folio
  ocfs2: Convert ocfs2_write_zero_page to use a folio
  fs: Convert aops->write_begin to take a folio
  fs: Convert aops->write_end to take a folio
  vboxsf: Use a folio in vboxsf_write_end()
  orangefs: Convert orangefs_write_begin() to use a folio
  orangefs: Convert orangefs_write_end() to use a folio
  jffs2: Convert jffs2_write_begin() to use a folio
  jffs2: Convert jffs2_write_end() to use a folio
  hostfs: Convert hostfs_write_end() to use a folio
  fuse: Convert fuse_write_begin() to use a folio
  fuse: Convert fuse_write_end() to use a folio
  f2fs: Convert f2fs_write_begin() to use a folio
  ...
2024-09-16 08:54:30 +02:00
David Howells
ee4cdf7ba8
netfs: Speed up buffered reading
Improve the efficiency of buffered reads in a number of ways:

 (1) Overhaul the algorithm in general so that it's a lot more compact and
     split the read submission code between buffered and unbuffered
     versions.  The unbuffered version can be vastly simplified.

 (2) Read-result collection is handed off to a work queue rather than being
     done in the I/O thread.  Multiple subrequests can be processes
     simultaneously.

 (3) When a subrequest is collected, any folios it fully spans are
     collected and "spare" data on either side is donated to either the
     previous or the next subrequest in the sequence.

Notes:

 (*) Readahead expansion is massively slows down fio, presumably because it
     causes a load of extra allocations, both folio and xarray, up front
     before RPC requests can be transmitted.

 (*) RDMA with cifs does appear to work, both with SIW and RXE.

 (*) PG_private_2-based reading and copy-to-cache is split out into its own
     file and altered to use folio_queue.  Note that the copy to the cache
     now creates a new write transaction against the cache and adds the
     folios to be copied into it.  This allows it to use part of the
     writeback I/O code.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/20240814203850.2240469-20-dhowells@redhat.com/ # v2
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-09-12 12:20:41 +02:00
Christian Brauner
387b499b78
ceph: remove unused f_version
It's not used for ceph so don't bother with it at all.

Link: https://lore.kernel.org/r/20240830-vfs-file-f_version-v1-3-6d3e4816aa7b@kernel.org
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-09-09 11:58:06 +02:00
Chen Yufan
2015716adb ceph: Convert to use jiffies macro
Use time_after_eq macro instead of using
jiffies directly to handle wraparound.

[ xiubli: adjust the header files order ]

Signed-off-by: Chen Yufan <chenyufan@vivo.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-08-27 09:30:09 +02:00
Yue Haibing
9a948c0c8e ceph: Remove unused declarations
These functions is never implemented and used.

Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-08-27 09:28:48 +02:00
David Howells
92764e8822
netfs, ceph: Partially revert "netfs: Replace PG_fscache by setting folio->private and marking dirty"
This partially reverts commit 2ff1e97587.

In addition to reverting the removal of PG_private_2 wrangling from the
buffered read code[1][2], the removal of the waits for PG_private_2 from
netfs_release_folio() and netfs_invalidate_folio() need reverting too.

It also adds a wait into ceph_evict_inode() to wait for netfs read and
copy-to-cache ops to complete.

Fixes: 2ff1e97587 ("netfs: Replace PG_fscache by setting folio->private and marking dirty")
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/3575457.1722355300@warthog.procyon.org.uk [1]
Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8e5ced7804cb9184c4a23f8054551240562a8eda [2]
Link: https://lore.kernel.org/r/20240814203850.2240469-2-dhowells@redhat.com
cc: Max Kellermann <max.kellermann@ionos.com>
cc: Ilya Dryomov <idryomov@gmail.com>
cc: Xiubo Li <xiubli@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: Matthew Wilcox <willy@infradead.org>
cc: ceph-devel@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-21 22:32:58 +02:00
Linus Torvalds
4ac0f08f44 vfs-6.11-rc4.fixes
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZrym4AAKCRCRxhvAZXjc
 oqT3AP9ydoUNavaZcRayH8r3ybvz9+aJGJ6Q7NznFVCk71vn0gD/buLzmq96Muns
 M5DWHbft2AFwK0Rz2nx8j5OXUeHwrQg=
 =HZBL
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.11-rc4.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs fixes from Christian Brauner:
 "VFS:

   - Fix the name of file lease slab cache. When file leases were split
     out of file locks the name of the file lock slab cache was used for
     the file leases slab cache as well.

   - Fix a type in take_fd() helper.

   - Fix infinite directory iteration for stable offsets in tmpfs.

   - When the icache is pruned all reclaimable inodes are marked with
     I_FREEING and other processes that try to lookup such inodes will
     block.

     But some filesystems like ext4 can trigger lookups in their inode
     evict callback causing deadlocks. Ext4 does such lookups if the
     ea_inode feature is used whereby a separate inode may be used to
     store xattrs.

     Introduce I_LRU_ISOLATING which pins the inode while its pages are
     reclaimed. This avoids inode deletion during inode_lru_isolate()
     avoiding the deadlock and evict is made to wait until
     I_LRU_ISOLATING is done.

  netfs:

   - Fault in smaller chunks for non-large folio mappings for
     filesystems that haven't been converted to large folios yet.

   - Fix the CONFIG_NETFS_DEBUG config option. The config option was
     renamed a short while ago and that introduced two minor issues.
     First, it depended on CONFIG_NETFS whereas it wants to depend on
     CONFIG_NETFS_SUPPORT. The former doesn't exist, while the latter
     does. Second, the documentation for the config option wasn't fixed
     up.

   - Revert the removal of the PG_private_2 writeback flag as ceph is
     using it and fix how that flag is handled in netfs.

   - Fix DIO reads on 9p. A program watching a file on a 9p mount
     wouldn't see any changes in the size of the file being exported by
     the server if the file was changed directly in the source
     filesystem. Fix this by attempting to read the full size specified
     when a DIO read is requested.

   - Fix a NULL pointer dereference bug due to a data race where a
     cachefiles cookies was retired even though it was still in use.
     Check the cookie's n_accesses counter before discarding it.

  nsfs:

   - Fix ioctl declaration for NS_GET_MNTNS_ID from _IO() to _IOR() as
     the kernel is writing to userspace.

  pidfs:

   - Prevent the creation of pidfds for kthreads until we have a
     use-case for it and we know the semantics we want. It also confuses
     userspace why they can get pidfds for kthreads.

  squashfs:

   - Fix an unitialized value bug reported by KMSAN caused by a
     corrupted symbolic link size read from disk. Check that the
     symbolic link size is not larger than expected"

* tag 'vfs-6.11-rc4.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  Squashfs: sanity check symbolic link size
  9p: Fix DIO read through netfs
  vfs: Don't evict inode under the inode lru traversing context
  netfs: Fix handling of USE_PGPRIV2 and WRITE_TO_CACHE flags
  netfs, ceph: Revert "netfs: Remove deprecated use of PG_private_2 as a second writeback flag"
  file: fix typo in take_fd() comment
  pidfd: prevent creation of pidfds for kthreads
  netfs: clean up after renaming FSCACHE_DEBUG config
  libfs: fix infinite directory reads for offset dir
  nsfs: fix ioctl declaration
  fs/netfs/fscache_cookie: add missing "n_accesses" check
  filelock: fix name of file_lease slab cache
  netfs: Fault in smaller chunks for non-large folio mappings
2024-08-14 09:06:28 -07:00
Dominique Martinet
e3786b29c5
9p: Fix DIO read through netfs
If a program is watching a file on a 9p mount, it won't see any change in
size if the file being exported by the server is changed directly in the
source filesystem, presumably because 9p doesn't have change notifications,
and because netfs skips the reads if the file is empty.

Fix this by attempting to read the full size specified when a DIO read is
requested (such as when 9p is operating in unbuffered mode) and dealing
with a short read if the EOF was less than the expected read.

To make this work, filesystems using netfslib must not set
NETFS_SREQ_CLEAR_TAIL if performing a DIO read where that read hit the EOF.
I don't want to mandatorily clear this flag in netfslib for DIO because,
say, ceph might make a read from an object that is not completely filled,
but does not reside at the end of file - and so we need to clear the
excess.

This can be tested by watching an empty file over 9p within a VM (such as
in the ktest framework):

        while true; do read content; if [ -n "$content" ]; then echo $content; break; fi; done < /host/tmp/foo

then writing something into the empty file.  The watcher should immediately
display the file content and break out of the loop.  Without this fix, it
remains in the loop indefinitely.

Fixes: 80105ed2fd ("9p: Use netfslib read/write_iter")
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218916
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/1229195.1723211769@warthog.procyon.org.uk
cc: Eric Van Hensbergen <ericvh@kernel.org>
cc: Latchesar Ionkov <lucho@ionkov.net>
cc: Christian Schoenebeck <linux_oss@crudebyte.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Ilya Dryomov <idryomov@gmail.com>
cc: Steve French <sfrench@samba.org>
cc: Paulo Alcantara <pc@manguebit.com>
cc: Trond Myklebust <trond.myklebust@hammerspace.com>
cc: v9fs@lists.linux.dev
cc: linux-afs@lists.infradead.org
cc: ceph-devel@vger.kernel.org
cc: linux-cifs@vger.kernel.org
cc: linux-nfs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-13 13:53:09 +02:00
David Howells
7b589a9b45
netfs: Fix handling of USE_PGPRIV2 and WRITE_TO_CACHE flags
The NETFS_RREQ_USE_PGPRIV2 and NETFS_RREQ_WRITE_TO_CACHE flags aren't used
correctly.  The problem is that we try to set them up in the request
initialisation, but we the cache may be in the process of setting up still,
and so the state may not be correct.  Further, we secondarily sample the
cache state and make contradictory decisions later.

The issue arises because we set up the cache resources, which allows the
cache's ->prepare_read() to switch on NETFS_SREQ_COPY_TO_CACHE - which
triggers cache writing even if we didn't set the flags when allocating.

Fix this in the following way:

 (1) Drop NETFS_ICTX_USE_PGPRIV2 and instead set NETFS_RREQ_USE_PGPRIV2 in
     ->init_request() rather than trying to juggle that in
     netfs_alloc_request().

 (2) Repurpose NETFS_RREQ_USE_PGPRIV2 to merely indicate that if caching is
     to be done, then PG_private_2 is to be used rather than only setting
     it if we decide to cache and then having netfs_rreq_unlock_folios()
     set the non-PG_private_2 writeback-to-cache if it wasn't set.

 (3) Split netfs_rreq_unlock_folios() into two functions, one of which
     contains the deprecated code for using PG_private_2 to avoid
     accidentally doing the writeback path - and always use it if
     USE_PGPRIV2 is set.

 (4) As NETFS_ICTX_USE_PGPRIV2 is removed, make netfs_write_begin() always
     wait for PG_private_2.  This function is deprecated and only used by
     ceph anyway, and so label it so.

 (5) Drop the NETFS_RREQ_WRITE_TO_CACHE flag and use
     fscache_operation_valid() on the cache_resources instead.  This has
     the advantage of picking up the result of netfs_begin_cache_read() and
     fscache_begin_write_operation() - which are called after the object is
     initialised and will wait for the cache to come to a usable state.

Just reverting ae678317b95e[1] isn't a sufficient fix, so this need to be
applied on top of that.  Without this as well, things like:

 rcu: INFO: rcu_sched detected expedited stalls on CPUs/tasks: {

and:

 WARNING: CPU: 13 PID: 3621 at fs/ceph/caps.c:3386

may happen, along with some UAFs due to PG_private_2 not getting used to
wait on writeback completion.

Fixes: 2ff1e97587 ("netfs: Replace PG_fscache by setting folio->private and marking dirty")
Reported-by: Max Kellermann <max.kellermann@ionos.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Ilya Dryomov <idryomov@gmail.com>
cc: Xiubo Li <xiubli@redhat.com>
cc: Hristo Venev <hristo@venev.name>
cc: Jeff Layton <jlayton@kernel.org>
cc: Matthew Wilcox <willy@infradead.org>
cc: ceph-devel@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
Link: https://lore.kernel.org/r/3575457.1722355300@warthog.procyon.org.uk/ [1]
Link: https://lore.kernel.org/r/1173209.1723152682@warthog.procyon.org.uk
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-12 22:03:27 +02:00
David Howells
8e5ced7804
netfs, ceph: Revert "netfs: Remove deprecated use of PG_private_2 as a second writeback flag"
This reverts commit ae678317b9.

Revert the patch that removes the deprecated use of PG_private_2 in
netfslib for the moment as Ceph is actually still using this to track
data copied to the cache.

Fixes: ae678317b9 ("netfs: Remove deprecated use of PG_private_2 as a second writeback flag")
Reported-by: Max Kellermann <max.kellermann@ionos.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Ilya Dryomov <idryomov@gmail.com>
cc: Xiubo Li <xiubli@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: Matthew Wilcox <willy@infradead.org>
cc: ceph-devel@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
https: //lore.kernel.org/r/3575457.1722355300@warthog.procyon.org.uk
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-12 22:03:27 +02:00
Matthew Wilcox (Oracle)
1da86618bd
fs: Convert aops->write_begin to take a folio
Convert all callers from working on a page to working on one page
of a folio (support for working on an entire folio can come later).
Removes a lot of folio->page->folio conversions.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-07 11:33:21 +02:00
Matthew Wilcox (Oracle)
a225800f32
fs: Convert aops->write_end to take a folio
Most callers have a folio, and most implementations operate on a folio,
so remove the conversion from folio->page->folio to fit through this
interface.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-07 11:32:02 +02:00
Xiubo Li
31634d7597 ceph: force sending a cap update msg back to MDS for revoke op
If a client sends out a cap update dropping caps with the prior 'seq'
just before an incoming cap revoke request, then the client may drop
the revoke because it believes it's already released the requested
capabilities.

This causes the MDS to wait indefinitely for the client to respond
to the revoke. It's therefore always a good idea to ack the cap
revoke request with the bumped up 'seq'.

Currently if the cap->issued equals to the newcaps the check_caps()
will do nothing, we should force flush the caps.

Cc: stable@vger.kernel.org
Link: https://tracker.ceph.com/issues/61782
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Venky Shankar <vshankar@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-08-01 13:14:28 +02:00
Linus Torvalds
6467dfdfc9 A small patchset to address bogus I/O errors and ultimately an
assertion failure in the face of watch errors with -o exclusive
 mappings in RBD marked for stable and some assorted CephFS fixes.
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAmajtIUTHGlkcnlvbW92
 QGdtYWlsLmNvbQAKCRBKf944AhHzi1fwB/4/CsFZLQC+ybWSqoZVYD01qND0muol
 44Nr6NyKdW302jQhAXchecB6s6L+N0azhVlAKKB8sO9XCifKA8RzuW75Y0+8z78B
 pgB6K7ZOzAIuPIG9mmNbutUHEd24CzGXNA28lEQkrNT8D6UZTENXQRJb1dS2GzQ7
 T2PyjFFoyF0z1bDZ85URHxeyMetEe/TzWUlG1P2QI98V+RP8nK+mGYmdXNGKhH87
 Ltf2pxjsiomtoH4QRm4LX7vwOUY1Ljf4HpSS1p+c5Fova2UTtTDbVfTFbh+ZjuQV
 hlh1ypNLM+igifu3nVeJ/Ga2f71zVFM66tnmpjcY3DxZAp70e1W2HMFD
 =Zfcy
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-6.11-rc1' of https://github.com/ceph/ceph-client

Pull ceph updates from Ilya Dryomov:
 "A small patchset to address bogus I/O errors and ultimately an
  assertion failure in the face of watch errors with -o exclusive
  mappings in RBD marked for stable and some assorted CephFS fixes"

* tag 'ceph-for-6.11-rc1' of https://github.com/ceph/ceph-client:
  rbd: don't assume rbd_is_lock_owner() for exclusive mappings
  rbd: don't assume RBD_LOCK_STATE_LOCKED for exclusive mappings
  rbd: rename RBD_LOCK_STATE_RELEASING and releasing_wait
  ceph: fix incorrect kmalloc size of pagevec mempool
  ceph: periodically flush the cap releases
  ceph: convert comma to semicolon in __ceph_dentry_dir_lease_touch()
  ceph: use cap_wait_list only if debugfs is enabled
2024-07-26 10:34:42 -07:00
ethanwu
03230edb0b ceph: fix incorrect kmalloc size of pagevec mempool
The kmalloc size of pagevec mempool is incorrectly calculated.
It misses the size of page pointer and only accounts the number for the array.

Fixes: a0102bda5b ("ceph: move sb->wb_pagevec_pool to be a global mempool")
Signed-off-by: ethanwu <ethanwu@synology.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-07-23 10:01:57 +02:00
Xiubo Li
578eb54c4a ceph: periodically flush the cap releases
The MDS could be waiting the caps releases infinitely in some corner
case and then reporting the caps revoke stuck warning. To fix this
we should periodically flush the cap releases.

Link: https://tracker.ceph.com/issues/57244
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Venky Shankar <vshankar@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-07-23 10:01:57 +02:00
Chen Ni
77bb4a501a ceph: convert comma to semicolon in __ceph_dentry_dir_lease_touch()
Replace a comma between expression statements by a semicolon.

Signed-off-by: Chen Ni <nichen@iscas.ac.cn>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-07-23 10:01:57 +02:00