Commit Graph

83733 Commits

Author SHA1 Message Date
Xiubo Li
ce0d5bd3a6 ceph: make num_fwd and num_retry to __u32
The num_fwd in MClientRequestForward is int32_t, while the num_fwd
in ceph_mds_request_head is __u8. This is buggy when the num_fwd
is larger than 256 it will always be truncate to 0 again. But the
client couldn't recoginize this.

This will make them to __u32 instead. Because the old cephs will
directly copy the raw memories when decoding the reqeust's head,
so we need to make sure this kclient will be compatible with old
cephs. For newer cephs they will decode the requests depending
the version, which will be much simpler and easier to extend new
members.

Link: https://tracker.ceph.com/issues/62145
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-08-31 14:56:27 +02:00
Christoph Hellwig
5069ba84b5 NFS: switch back to using kill_anon_super
NFS switch to open coding kill_anon_super in 7b14a21389
("nfs: don't call bdi_unregister") to avoid the extra bdi_unregister
call.  At that point bdi_destroy was called in nfs_free_server and
thus it required a later freeing of the anon dev_t.  But since
0db10944a7 ("nfs: Convert to separately allocated bdi") the bdi has
been free implicitly by the sb destruction, so this isn't needed
anymore.

By not open coding kill_anon_super, nfs now inherits the fix in
dc3216b141 ("super: ensure valid info"), and we remove the only
open coded version of kill_anon_super.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Message-Id: <20230831052940.256193-1-hch@lst.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-31 12:47:16 +02:00
Christian Brauner
69881be3d9 fs: export sget_dev()
They will be used for mtd devices as well.

Acked-by: Richard Weinberger <richard@nod.at>
Reviewed-by: Jan Kara <jack@suse.cz>
Message-Id: <20230829-vfs-super-mtd-v1-1-fecb572e5df3@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-31 12:47:15 +02:00
Katya Orlova
efc0b0bcff smb: propagate error code of extract_sharename()
In addition to the EINVAL, there may be an ENOMEM.

Found by Linux Verification Center (linuxtesting.org) with SVACE.

Fixes: 70431bfd82 ("cifs: Support fscache indexing rewrite")
Signed-off-by: Katya Orlova <e.orlova@ispras.ru>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-08-30 23:38:49 -05:00
Linus Torvalds
b97d64c722 22 smb3/cifs client fixes and two related changes (for unicode mapping)
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmTvk44ACgkQiiy9cAdy
 T1GT+wwAkiM+BFVoW0QhLK9ztPptYofGiTKr1AySV09AWos9Fwdhv0aS4LDKhauW
 PVsORfnFcLdyAHtgX2DlhJHMpLWDz3Z51KWiUSo7AAZjIp/4K0yEarg4WKPtUPN0
 PMET2OuqAfIfYLCxSZYFjiGK6xgSJEz+xIhX0qJPRZsyJp50WlFlyZRUfFa+6hXt
 pguatCVw4qhP9hkdcklCY8rwlFDdWEHj9wD/PB2Qschw4gzxDUMwOJjDgT6PNxjA
 SAC6J+NQVtMcnASd5pn0+Mbc+vNfKZ0PM+KZcDrJphcBz+arY6Hu57v3/yu2y++L
 DqRI6QtEwVmHzytM51x5JaWFE0Asj/NsH69LVm4bXVIkkcXBut6lLhrd/KVSP+xN
 LY4EcYEoufAAaecrQrMO4x2Tm10f+GMi1Fh9NvpLZVRrUXy4rdxLP2aC+q+i3uY3
 34FaAbpjQ7NJq2yZTL8xDOdCvi8E3t58DsBv4jA9Y/SYGWYao8Kw0vxhCt0SVZPc
 HaoMfkxl
 =CtQO
 -----END PGP SIGNATURE-----

Merge tag '6.6-rc-smb3-client-fixes-part1' of git://git.samba.org/sfrench/cifs-2.6

Pull smb client updates from Steve French:

 - fixes for excessive stack usage

 - multichannel reconnect improvements

 - DFS fix and cleanup patches

 - move UCS-2 conversion code to fs/nls and update cifs and jfs to use
   them

 - cleanup patch for compounding, one to fix confusing function name

 - inode number collision fix

 - reparse point fixes (including avoiding an extra unneeded query on
   symlinks) and a minor cleanup

 - directory lease (caching) improvement

* tag '6.6-rc-smb3-client-fixes-part1' of git://git.samba.org/sfrench/cifs-2.6: (24 commits)
  fs/jfs: Use common ucs2 upper case table
  fs/smb/client: Use common code in client
  fs/smb: Swing unicode common code from smb->NLS
  fs/smb: Remove unicode 'lower' tables
  SMB3: rename macro CIFS_SERVER_IS_CHAN to avoid confusion
  [SMB3] send channel sequence number in SMB3 requests after reconnects
  cifs: update desired access while requesting for directory lease
  smb: client: reduce stack usage in smb2_query_reparse_point()
  smb: client: reduce stack usage in smb2_query_info_compound()
  smb: client: reduce stack usage in smb2_set_ea()
  smb: client: reduce stack usage in smb_send_rqst()
  smb: client: reduce stack usage in cifs_demultiplex_thread()
  smb: client: reduce stack usage in cifs_try_adding_channels()
  smb: cilent: set reparse mount points as automounts
  smb: client: query reparse points in older dialects
  smb: client: do not query reparse points twice on symlinks
  smb: client: parse reparse point flag in create response
  smb: client: get rid of dfs code dep in namespace.c
  smb: client: get rid of dfs naming in automount code
  smb: client: rename cifs_dfs_ref.c to namespace.c
  ...
2023-08-30 21:01:40 -07:00
Linus Torvalds
53ea7f624f New code for 6.6:
* Chandan Babu will be taking over as the XFS release manager.  He has
    reviewed all the patches that are in this branch, though I'm signing
    the branch one last time since I'm still technically maintainer. :P
  * Create a maintainer entry profile for XFS in which we lay out the
    various roles that I have played for many years.  Aside from release
    manager, the remaining roles are as yet unfilled.
  * Start merging online repair -- we now have in-memory pageable memory
    for staging btrees, a bunch of pending fixes, and we've started the
    process of refactoring the scrub support code to support more of
    repair.  In particular, reaping of old blocks from damaged structures.
  * Scrub the realtime summary file.
  * Fix a bug where scrub's quota iteration only ever returned the root
    dquot.  Oooops.
  * Fix some typos.
 
 Signed-off-by: Darrick J. Wong <djwong@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZOQE2AAKCRBKO3ySh0YR
 pvmZAQDe+KceaVx6Dv2f9ihckeS2dILSpDTo1bh9BeXnt005VwD/ceHTaJxEl8lp
 u/dixFDkRgp9RYtoTAK2WNiUxYetsAc=
 =oZN6
 -----END PGP SIGNATURE-----

Merge tag 'xfs-6.6-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs updates from Chandan Babu:

 - Chandan Babu will be taking over as the XFS release manager. He has
   reviewed all the patches that are in this branch, though I'm signing
   the branch one last time since I'm still technically maintainer. :P

 - Create a maintainer entry profile for XFS in which we lay out the
   various roles that I have played for many years.  Aside from release
   manager, the remaining roles are as yet unfilled.

 - Start merging online repair -- we now have in-memory pageable memory
   for staging btrees, a bunch of pending fixes, and we've started the
   process of refactoring the scrub support code to support more of
   repair.  In particular, reaping of old blocks from damaged structures.

 - Scrub the realtime summary file.

 - Fix a bug where scrub's quota iteration only ever returned the root
   dquot.  Oooops.

 - Fix some typos.

[ Pull request from Chandan Babu, but signed tag and description from
  Darrick Wong, thus the first person singular above is Darrick, not
  Chandan ]

* tag 'xfs-6.6-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (37 commits)
  fs/xfs: Fix typos in comments
  xfs: fix dqiterate thinko
  xfs: don't check reflink iflag state when checking cow fork
  xfs: simplify returns in xchk_bmap
  xfs: rewrite xchk_inode_is_allocated to work properly
  xfs: hide xfs_inode_is_allocated in scrub common code
  xfs: fix agf_fllast when repairing an empty AGFL
  xfs: allow userspace to rebuild metadata structures
  xfs: clear pagf_agflreset when repairing the AGFL
  xfs: allow the user to cancel repairs before we start writing
  xfs: don't complain about unfixed metadata when repairs were injected
  xfs: implement online scrubbing of rtsummary info
  xfs: always rescan allegedly healthy per-ag metadata after repair
  xfs: move the realtime summary file scrubber to a separate source file
  xfs: wrap ilock/iunlock operations on sc->ip
  xfs: get our own reference to inodes that we want to scrub
  xfs: track usage statistics of online fsck
  xfs: improve xfarray quicksort pivot
  xfs: create scaffolding for creating debugfs entries
  xfs: cache pages used for xfarray quicksort convergence
  ...
2023-08-30 12:34:12 -07:00
Linus Torvalds
1500e7e072 \n
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAmTvFssACgkQnJ2qBz9k
 QNl7HggAuY154urYJdh7M+mbKDSywhcK0YT5pNNkcXVpv/t2c073Ce57+ObDCBaS
 xetyFgH2XlvuAJ4dWmRDwBEzJ0jquKzvYJEMiXAexgy47ctnNPx5kLPsXpt3g+2q
 pro7sK1b5BmX/zrgOontbJ8/YAwX85XToD4Cv5XyNSx/ex6/zsd5FProfdiY/HAt
 qAcv7NkNTBbJBEBHhBNQSL2wOj3LzQV1U8v0XEcsBvTUxlX2jH8J4CsuFIotXqCF
 37SNvZPk2c04HbaLgyU4Ura69qD0fn4vTMocuCoaf0CN2PL5jblRAwsAO2bfSqJE
 AxZFq3afI0YV3Y9OrVlzHtSALuiZMQ==
 =QPEQ
 -----END PGP SIGNATURE-----

Merge tag 'for_v6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs

Pull ext2, quota, and udf updates from Jan Kara:

 - fixes for possible use-after-free issues with quota when racing with
   chown

 - fixes for ext2 crashing when xattr allocation races with another
   block allocation to the same file from page writeback code

 - fix for block number overflow in ext2

 - marking of reiserfs as obsolete in MAINTAINERS

 - assorted minor cleanups

* tag 'for_v6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  ext2: Fix kernel-doc warnings
  ext2: improve consistency of ext2_fsblk_t datatype usage
  ext2: dump current reservation window info
  ext2: fix race between setxattr and write back
  ext2: introduce new flags argument for ext2_new_blocks()
  ext2: remove ext2_new_block()
  ext2: fix datatype of block number in ext2_xattr_set2()
  udf: Drop pointless aops assignment
  quota: use lockdep_assert_held_write in dquot_load_quota_sb
  MAINTAINERS: change reiserfs status to obsolete
  udf: Fix -Wstringop-overflow warnings
  quota: simplify drop_dquot_ref()
  quota: fix dqput() to follow the guarantees dquot_srcu should provide
  quota: add new helper dquot_active()
  quota: rename dquot_active() to inode_quota_active()
  quota: factor out dquot_write_dquot()
  ext2: remove redundant assignment to variable desc and variable best_desc
2023-08-30 12:10:50 -07:00
Linus Torvalds
63580f669d overlayfs update for 6.6
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE9zuTYTs0RXF+Ke33EVvVyTe/1WoFAmTu0QoACgkQEVvVyTe/
 1WpbzBAAjIZXzhn8KldDpG0muw9JKaSOxM45uhZE1s/2uKsVCyp4k3lubTbxxYO1
 S9rUjhF2gSJFOfuSOK/XXEKXyu4MGT7iy7pKswu0k8+AHDDRBksPXJKA/AkhLPUr
 vX1pU6aWw2OSn1xdhIgY+F4DveyzYQL/CEoUzFyRPxSB0G/yjktRAjdZ2HL4cAvN
 eVXPyTj0bd4LVj1ITla4uj8DbgivrqmRJbZ9bKnSRE8GXWBriJhV//M2Q3QRno+W
 04TtAvyh+klQeqZFVOQ0reZUFZzYBBZZTmqoFiUzTny7oljWl5F0+JfJOHhRGknG
 LYZCia34+T6TZPhOnZzT/szTDoXVvNJhEf+vBQCqhaCugqJc/2uJdw9CW8ZcDvA9
 ZNOMxEbXE4VgGjJ0HM6MoDMUoIEUiNWEnXWEaKyCAfOPqgYwPy+QeDO4JtBPQpRn
 fwZx7Xpc1FLpTc9feHxzox9o81S8rPRMycUBg2c3KZB6TFnYNDxWIIo365naMCzz
 A8IDVGf+gd+S4NaZvh9FUijciIslYfyFgqwQERZmJnpDk1d1NyeUC7Nn7EkmUpyp
 guRaC+rUcqYP4CpuSHTCPle94qHqiAkbsKSJWebZ2M1j9fjZ+okPw0k83Nih79vu
 vRhs70Ah51v1lpBb0mlDjsV3vKm3Apv8nMJKZvVuC+Cw6Qiob5s=
 =F4Hi
 -----END PGP SIGNATURE-----

Merge tag 'ovl-update-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/overlayfs/vfs

Pull overlayfs updates from Amir Goldstein:

 - add verification feature needed by composefs (Alexander Larsson)

 - improve integration of overlayfs and fanotify (Amir Goldstein)

 - fortify some overlayfs code (Andrea Righi)

* tag 'ovl-update-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/overlayfs/vfs:
  ovl: validate superblock in OVL_FS()
  ovl: make consistent use of OVL_FS()
  ovl: Kconfig: introduce CONFIG_OVERLAY_FS_DEBUG
  ovl: auto generate uuid for new overlay filesystems
  ovl: store persistent uuid/fsid with uuid=on
  ovl: add support for unique fsid per instance
  ovl: support encoding non-decodable file handles
  ovl: Handle verity during copy-up
  ovl: Validate verity xattr when resolving lowerdata
  ovl: Add versioned header for overlay.metacopy xattr
  ovl: Add framework for verity support
2023-08-30 11:54:09 -07:00
Anna Schumaker
c4a123d2e8 pNFS: Fix assignment of xprtdata.cred
The comma at the end of the line was leftover from an earlier refactor
of the _nfs4_pnfs_v3_ds_connect() function. This is technically valid C,
so the compilers didn't catch it, but if I'm understanding how it works
correctly it assigns the return value of rpc_clnt_add_xprtr() to
xprtdata.cred.

Reported-by: Olga Kornievskaia <kolga@netapp.com>
Fixes: a12f996d34 ("NFSv4/pNFS: Use connections to a DS that are all of the same protocol family")
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-08-30 14:31:31 -04:00
Olga Kornievskaia
5690eed941 NFSv4.2: fix handling of COPY ERR_OFFLOAD_NO_REQ
If the client sent a synchronous copy and the server replied with
ERR_OFFLOAD_NO_REQ indicating that it wants an asynchronous
copy instead, the client should retry with asynchronous copy.

Fixes: 539f57b3e0 ("NFS handle COPY ERR_OFFLOAD_NO_REQS")
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-08-30 11:08:27 -04:00
Benjamin Coddington
f67b55b658 NFS: Guard against READDIR loop when entry names exceed MAXNAMELEN
Commit 64cfca85ba asserts the only valid return values for
nfs2/3_decode_dirent should not include -ENAMETOOLONG, but for a server
that sends a filename3 which exceeds MAXNAMELEN in a READDIR response the
client's behavior will be to endlessly retry the operation.

We could map -ENAMETOOLONG into -EBADCOOKIE, but that would produce
truncated listings without any error.  The client should return an error
for this case to clearly assert that the server implementation must be
corrected.

Fixes: 64cfca85ba ("NFS: Return valid errors from nfs2/3_decode_dirent()")
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-08-30 11:08:27 -04:00
Dr. David Alan Gilbert
f3a9b3758e fs/jfs: Use common ucs2 upper case table
Use the UCS-2 upper case tables from nls, that are shared
with smb.

This code in JFS is hard to test, so we're only reusing the
same tables (which are identical), not trying to reuse the
rest of the helper functions.

Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Reviewed-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-08-30 08:55:52 -05:00
Dr. David Alan Gilbert
de54845290 fs/smb/client: Use common code in client
Now we've got the common code, use it for the client as well.
Note there's a change here where we're using the server version of
UniStrcat now which had different types (__le16 vs wchar_t) but
it's not interpreting the value other than checking for 0, however
we do need casts to keep sparse happy.

Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Reviewed-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-08-30 08:55:52 -05:00
Dr. David Alan Gilbert
089f7f5913 fs/smb: Swing unicode common code from smb->NLS
Swing most of the inline functions and unicode tables into nls
from the copy in smb/server.  This is UCS-2 rather than most
of the rest of the code in NLS, but it currently seems like the
best place for it.

The actual unicode.c implementations vary much more between server
and client so they're unmoved.

Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Reviewed-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-08-30 08:55:51 -05:00
Dr. David Alan Gilbert
9e74938954 fs/smb: Remove unicode 'lower' tables
The unicode glue in smb/*/..uniupr.h has a section guarded
by 'ifndef UNIUPR_NOLOWER' - but that's always
defined in smb/*/..unicode.h.  Nuke those tables.

Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Reviewed-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-08-30 08:55:51 -05:00
Steve French
b3773b19d4 SMB3: rename macro CIFS_SERVER_IS_CHAN to avoid confusion
Since older dialects such as CIFS do not support multichannel
the macro CIFS_SERVER_IS_CHAN can be confusing (it requires SMB 3
or later) so shorten its name to "SERVER_IS_CHAN"

Suggested-by: Tom Talpey <tom@talpey.com>
Acked-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-08-30 08:55:02 -05:00
Linus Torvalds
3d3dfeb3ae for-6.6/block-2023-08-28
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmTs08EQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpqa4EACu/zKE+omGXBV0Q7kEpVsChjp0ElGtSDIJ
 tJfTuvnWqQjrqRv4ksmZvGdx8SkqFuXri4/7oBXlsaqeUVbIQdWJUpLErBye6nxa
 lUb6nXOFWwyG94cMRYs71lN0loosjb7aiVw7oVLAIhntq3p3doFl/cyy3ndMZrUE
 pZbsrWSt4QiOKhcO0TtIjfAwsr31AN51qFiNNITEiZl3UjXfkGRCK81X0yM2N8zZ
 7Y0h1ldPBsZ/olNWeRyaW1uB64nKM0buR7/nDxCV/NI05nndJ34bIgo/JIj4xy0v
 SiBj2+y86+oMJZt17yYENwOQdtX3hbyESGuVm9dCrO0t9/byVQxkUk0OMm65BM/l
 l2d+gmMQZTbHziqfLlgq9i3i9+B4C2hsb7iBpuo7SW/FPbM45POgi3lpiZycaZyu
 krQo1qwL4KSGXzGN9CabEuKDcJcXqLxqMDOyEDA3R5Kz06V9tNuM+Di/mr4vuZHK
 sVHUfHuWBO9ionLlGPdc3fH/CuMqic8SHjumiAm2menBZV6cSzRDxpm6H4CyLt7y
 tWmw7BNU7dfHFGd+Jw0Ld49sAuEybszEXq6qYv5uYBVfJNqDvOvEeVoQp0RN2jJA
 AG30hymcZgxn9n7gkIgkPQDgIGUjnzUR8B2mE2UFU1CYVHXYXAXU55CCI5oeTkbs
 d0Y/zCZf1A==
 =p1bd
 -----END PGP SIGNATURE-----

Merge tag 'for-6.6/block-2023-08-28' of git://git.kernel.dk/linux

Pull block updates from Jens Axboe:
 "Pretty quiet round for this release. This contains:

   - Add support for zoned storage to ublk (Andreas, Ming)

   - Series improving performance for drivers that mark themselves as
     needing a blocking context for issue (Bart)

   - Cleanup the flush logic (Chengming)

   - sed opal keyring support (Greg)

   - Fixes and improvements to the integrity support (Jinyoung)

   - Add some exports for bcachefs that we can hopefully delete again in
     the future (Kent)

   - deadline throttling fix (Zhiguo)

   - Series allowing building the kernel without buffer_head support
     (Christoph)

   - Sanitize the bio page adding flow (Christoph)

   - Write back cache fixes (Christoph)

   - MD updates via Song:
      - Fix perf regression for raid0 large sequential writes (Jan)
      - Fix split bio iostat for raid0 (David)
      - Various raid1 fixes (Heinz, Xueshi)
      - raid6test build fixes (WANG)
      - Deprecate bitmap file support (Christoph)
      - Fix deadlock with md sync thread (Yu)
      - Refactor md io accounting (Yu)
      - Various non-urgent fixes (Li, Yu, Jack)

   - Various fixes and cleanups (Arnd, Azeem, Chengming, Damien, Li,
     Ming, Nitesh, Ruan, Tejun, Thomas, Xu)"

* tag 'for-6.6/block-2023-08-28' of git://git.kernel.dk/linux: (113 commits)
  block: use strscpy() to instead of strncpy()
  block: sed-opal: keyring support for SED keys
  block: sed-opal: Implement IOC_OPAL_REVERT_LSP
  block: sed-opal: Implement IOC_OPAL_DISCOVERY
  blk-mq: prealloc tags when increase tagset nr_hw_queues
  blk-mq: delete redundant tagset map update when fallback
  blk-mq: fix tags leak when shrink nr_hw_queues
  ublk: zoned: support REQ_OP_ZONE_RESET_ALL
  md: raid0: account for split bio in iostat accounting
  md/raid0: Fix performance regression for large sequential writes
  md/raid0: Factor out helper for mapping and submitting a bio
  md raid1: allow writebehind to work on any leg device set WriteMostly
  md/raid1: hold the barrier until handle_read_error() finishes
  md/raid1: free the r1bio before waiting for blocked rdev
  md/raid1: call free_r1bio() before allow_barrier() in raid_end_bio_io()
  blk-cgroup: Fix NULL deref caused by blkg_policy_data being installed before init
  drivers/rnbd: restore sysfs interface to rnbd-client
  md/raid5-cache: fix null-ptr-deref for r5l_flush_stripe_to_raid()
  raid6: test: only check for Altivec if building on powerpc hosts
  raid6: test: make sure all intermediate and artifact files are .gitignored
  ...
2023-08-29 20:21:42 -07:00
Linus Torvalds
adfd671676 sysctl-6.6-rc1
Long ago we set out to remove the kitchen sink on kernel/sysctl.c arrays and
 placings sysctls to their own sybsystem or file to help avoid merge conflicts.
 Matthew Wilcox pointed out though that if we're going to do that we might as
 well also *save* space while at it and try to remove the extra last sysctl
 entry added at the end of each array, a sentintel, instead of bloating the
 kernel by adding a new sentinel with each array moved.
 
 Doing that was not so trivial, and has required slowing down the moves of
 kernel/sysctl.c arrays and measuring the impact on size by each new move.
 
 The complex part of the effort to help reduce the size of each sysctl is being
 done by the patient work of el señor Don Joel Granados. A lot of this is truly
 painful code refactoring and testing and then trying to measure the savings of
 each move and removing the sentinels. Although Joel already has code which does
 most of this work, experience with sysctl moves in the past shows is we need to
 be careful due to the slew of odd build failures that are possible due to the
 amount of random Kconfig options sysctls use.
 
 To that end Joel's work is split by first addressing the major housekeeping
 needed to remove the sentinels, which is part of this merge request. The rest
 of the work to actually remove the sentinels will be done later in future
 kernel releases.
 
 At first I was only going to send his first 7 patches of his patch series,
 posted 1 month ago, but in retrospect due to the testing the changes have
 received in linux-next and the minor changes they make this goes with the
 entire set of patches Joel had planned: just sysctl house keeping. There are
 networking changes but these are part of the house keeping too.
 
 The preliminary math is showing this will all help reduce the overall build
 time size of the kernel and run time memory consumed by the kernel by about
 ~64 bytes per array where we are able to remove each sentinel in the future.
 That also means there is no more bloating the kernel with the extra ~64 bytes
 per array moved as no new sentinels are created.
 
 Most of this has been in linux-next for about a month, the last 7 patches took
 a minor refresh 2 week ago based on feedback.
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCgAwFiEENnNq2KuOejlQLZofziMdCjCSiKcFAmTuVnMSHG1jZ3JvZkBr
 ZXJuZWwub3JnAAoJEM4jHQowkoinIckP/imvRlfkO6L0IP7MmJBRPtwY01rsTAKO
 Q14dZ//bG4DVQeGl1FdzrF6hhuLgekU0qW1YDFIWiCXO7CbaxaNBPSUkeW6ReVoC
 R/VHNUPxSR1PWQy1OTJV2t4XKri2sB7ijmUsfsATtISwhei9bggTHEysShtP4tv+
 U87DzhoqMnbYIsfMo49KCqOa1Qm7TmjC1a7WAp6Fph3GJuXAzZR5pXpsd0NtOZ9x
 Ud5RT22icnQpMl7K+yPsqY6XcS5JkgBe/WbSzMAUkYZvBZFBq9t2D+OW5h9TZMhw
 piJWQ9X0Rm7qI2D15mJfXwaOhhyDhWci391hzdJmS6DI0prf6Ma2NFdAWOt/zomI
 uiRujS4bGeBUaK5F4TX2WQ1+jdMtAZ+0FncFnzt4U8q7dzUc91uVCm6iHW3gcfAb
 N7OEg2ZL0gkkgCZHqKxN8wpNQiC2KwnNk+HLAbnL2a/oJYfBtdopQmlxWfrN2hpF
 xxROiENqk483BRdMXDq6DR/gyDZmZWCobXIglSzlqCOjCOcLbDziIJ7pJk83ok09
 h/QnXTYHf9protBq9OIQesgh2pwNzBBLifK84KZLKcb7IbdIKjpQrW5STp04oNGf
 wcGJzEz8tXUe0UKyMM47AcHQGzIy6cdXNLjyF8a+m7rnZzr1ndnMqZyRStZzuQin
 AUg2VWHKPmW9
 =sq2p
 -----END PGP SIGNATURE-----

Merge tag 'sysctl-6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux

Pull sysctl updates from Luis Chamberlain:
 "Long ago we set out to remove the kitchen sink on kernel/sysctl.c
  arrays and placings sysctls to their own sybsystem or file to help
  avoid merge conflicts. Matthew Wilcox pointed out though that if we're
  going to do that we might as well also *save* space while at it and
  try to remove the extra last sysctl entry added at the end of each
  array, a sentintel, instead of bloating the kernel by adding a new
  sentinel with each array moved.

  Doing that was not so trivial, and has required slowing down the moves
  of kernel/sysctl.c arrays and measuring the impact on size by each new
  move.

  The complex part of the effort to help reduce the size of each sysctl
  is being done by the patient work of el señor Don Joel Granados. A lot
  of this is truly painful code refactoring and testing and then trying
  to measure the savings of each move and removing the sentinels.
  Although Joel already has code which does most of this work,
  experience with sysctl moves in the past shows is we need to be
  careful due to the slew of odd build failures that are possible due to
  the amount of random Kconfig options sysctls use.

  To that end Joel's work is split by first addressing the major
  housekeeping needed to remove the sentinels, which is part of this
  merge request. The rest of the work to actually remove the sentinels
  will be done later in future kernel releases.

  The preliminary math is showing this will all help reduce the overall
  build time size of the kernel and run time memory consumed by the
  kernel by about ~64 bytes per array where we are able to remove each
  sentinel in the future. That also means there is no more bloating the
  kernel with the extra ~64 bytes per array moved as no new sentinels
  are created"

* tag 'sysctl-6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux:
  sysctl: Use ctl_table_size as stopping criteria for list macro
  sysctl: SIZE_MAX->ARRAY_SIZE in register_net_sysctl
  vrf: Update to register_net_sysctl_sz
  networking: Update to register_net_sysctl_sz
  netfilter: Update to register_net_sysctl_sz
  ax.25: Update to register_net_sysctl_sz
  sysctl: Add size to register_net_sysctl function
  sysctl: Add size arg to __register_sysctl_init
  sysctl: Add size to register_sysctl
  sysctl: Add a size arg to __register_sysctl_table
  sysctl: Add size argument to init_header
  sysctl: Add ctl_table_size to ctl_table_header
  sysctl: Use ctl_table_header in list_for_each_table_entry
  sysctl: Prefer ctl_table_header in proc_sysctl
2023-08-29 17:39:15 -07:00
Linus Torvalds
d68b4b6f30 - An extensive rework of kexec and crash Kconfig from Eric DeVolder
("refactor Kconfig to consolidate KEXEC and CRASH options").
 
 - kernel.h slimming work from Andy Shevchenko ("kernel.h: Split out a
   couple of macros to args.h").
 
 - gdb feature work from Kuan-Ying Lee ("Add GDB memory helper
   commands").
 
 - vsprintf inclusion rationalization from Andy Shevchenko
   ("lib/vsprintf: Rework header inclusions").
 
 - Switch the handling of kdump from a udev scheme to in-kernel handling,
   by Eric DeVolder ("crash: Kernel handling of CPU and memory hot
   un/plug").
 
 - Many singleton patches to various parts of the tree
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZO2GpAAKCRDdBJ7gKXxA
 juW3AQD1moHzlSN6x9I3tjm5TWWNYFoFL8af7wXDJspp/DWH/AD/TO0XlWWhhbYy
 QHy7lL0Syha38kKLMXTM+bN6YQHi9AU=
 =WJQa
 -----END PGP SIGNATURE-----

Merge tag 'mm-nonmm-stable-2023-08-28-22-48' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull non-MM updates from Andrew Morton:

 - An extensive rework of kexec and crash Kconfig from Eric DeVolder
   ("refactor Kconfig to consolidate KEXEC and CRASH options")

 - kernel.h slimming work from Andy Shevchenko ("kernel.h: Split out a
   couple of macros to args.h")

 - gdb feature work from Kuan-Ying Lee ("Add GDB memory helper
   commands")

 - vsprintf inclusion rationalization from Andy Shevchenko
   ("lib/vsprintf: Rework header inclusions")

 - Switch the handling of kdump from a udev scheme to in-kernel
   handling, by Eric DeVolder ("crash: Kernel handling of CPU and memory
   hot un/plug")

 - Many singleton patches to various parts of the tree

* tag 'mm-nonmm-stable-2023-08-28-22-48' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (81 commits)
  document while_each_thread(), change first_tid() to use for_each_thread()
  drivers/char/mem.c: shrink character device's devlist[] array
  x86/crash: optimize CPU changes
  crash: change crash_prepare_elf64_headers() to for_each_possible_cpu()
  crash: hotplug support for kexec_load()
  x86/crash: add x86 crash hotplug support
  crash: memory and CPU hotplug sysfs attributes
  kexec: exclude elfcorehdr from the segment digest
  crash: add generic infrastructure for crash hotplug support
  crash: move a few code bits to setup support of crash hotplug
  kstrtox: consistently use _tolower()
  kill do_each_thread()
  nilfs2: fix WARNING in mark_buffer_dirty due to discarded buffer reuse
  scripts/bloat-o-meter: count weak symbol sizes
  treewide: drop CONFIG_EMBEDDED
  lockdep: fix static memory detection even more
  lib/vsprintf: declare no_hash_pointers in sprintf.h
  lib/vsprintf: split out sprintf() and friends
  kernel/fork: stop playing lockless games for exe_file replacement
  adfs: delete unused "union adfs_dirtail" definition
  ...
2023-08-29 14:53:51 -07:00
Chuck Lever
6372e2ee62 NFSD: da_addr_body field missing in some GETDEVICEINFO replies
The XDR specification in RFC 8881 looks like this:

struct device_addr4 {
	layouttype4	da_layout_type;
	opaque		da_addr_body<>;
};

struct GETDEVICEINFO4resok {
	device_addr4	gdir_device_addr;
	bitmap4		gdir_notification;
};

union GETDEVICEINFO4res switch (nfsstat4 gdir_status) {
case NFS4_OK:
	GETDEVICEINFO4resok gdir_resok4;
case NFS4ERR_TOOSMALL:
	count4		gdir_mincount;
default:
	void;
};

Looking at nfsd4_encode_getdeviceinfo() ....

When the client provides a zero gd_maxcount, then the Linux NFS
server implementation encodes the da_layout_type field and then
skips the da_addr_body field completely, proceeding directly to
encode gdir_notification field.

There does not appear to be an option in the specification to skip
encoding da_addr_body. Moreover, Section 18.40.3 says:

> If the client wants to just update or turn off notifications, it
> MAY send a GETDEVICEINFO operation with gdia_maxcount set to zero.
> In that event, if the device ID is valid, the reply's da_addr_body
> field of the gdir_device_addr field will be of zero length.

Since the layout drivers are responsible for encoding the
da_addr_body field, put this fix inside the ->encode_getdeviceinfo
methods.

Fixes: 9cf514ccfa ("nfsd: implement pNFS operations")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Tom Haynes <loghyr@gmail.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-08-29 17:45:22 -04:00
Chuck Lever
78c542f916 SUNRPC: Add enum svc_auth_status
In addition to the benefits of using an enum rather than a set of
macros, we now have a named type that can improve static type
checking of function return values.

As part of this change, I removed a stale comment from svcauth.h;
the return values from current implementations of the
auth_ops::release method are all zero/negative errno, not the SVC_OK
enum values as the old comment suggested.

Suggested-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-08-29 17:45:22 -04:00
NeilBrown
c743b4259c SUNRPC: remove timeout arg from svc_recv()
Most svc threads have no interest in a timeout.
nfsd sets it to 1 hour, but this is a wart of no significance.

lockd uses the timeout so that it can call nlmsvc_retry_blocked().
It also sometimes calls svc_wake_up() to ensure this is called.

So change lockd to be consistent and always use svc_wake_up() to trigger
nlmsvc_retry_blocked() - using a timer instead of a timeout to
svc_recv().

And change svc_recv() to not take a timeout arg.

This makes the sp_threads_timedout counter always zero.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-08-29 17:45:22 -04:00
NeilBrown
7b719e2bf3 SUNRPC: change svc_recv() to return void.
svc_recv() currently returns a 0 on success or one of two errors:
 - -EAGAIN means no message was successfully received
 - -EINTR means the thread has been told to stop

Previously nfsd would stop as the result of a signal as well as
following kthread_stop().  In that case the difference was useful: EINTR
means stop unconditionally.  EAGAIN means stop if kthread_should_stop(),
continue otherwise.

Now threads only exit when kthread_should_stop() so we don't need the
distinction.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-08-29 17:45:22 -04:00
NeilBrown
f78116d3bf SUNRPC: call svc_process() from svc_recv().
All callers of svc_recv() go on to call svc_process() on success.
Simplify callers by having svc_recv() do that for them.

This loses one call to validate_process_creds() in nfsd.  That was
debugging code added 14 years ago.  I don't think we need to keep it.

Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-08-29 17:45:22 -04:00
NeilBrown
9f28a971ee nfsd: separate nfsd_last_thread() from nfsd_put()
Now that the last nfsd thread is stopped by an explicit act of calling
svc_set_num_threads() with a count of zero, we only have a limited
number of places that can happen, and don't need to call
nfsd_last_thread() in nfsd_put()

So separate that out and call it at the two places where the number of
threads is set to zero.

Move the clearing of ->nfsd_serv and the call to svc_xprt_destroy_all()
into nfsd_last_thread(), as they are really part of the same action.

nfsd_put() is now a thin wrapper around svc_put(), so make it a static
inline.

nfsd_put() cannot be called after nfsd_last_thread(), so in a couple of
places we have to use svc_put() instead.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-08-29 17:45:22 -04:00
NeilBrown
18e4cf9155 nfsd: Simplify code around svc_exit_thread() call in nfsd()
Previously a thread could exit asynchronously (due to a signal) so some
care was needed to hold nfsd_mutex over the last svc_put() call.  Now a
thread can only exit when svc_set_num_threads() is called, and this is
always called under nfsd_mutex.  So no care is needed.

Not only is the mutex held when a thread exits now, but the svc refcount
is elevated, so the svc_put() in svc_exit_thread() will never be a final
put, so the mutex isn't even needed at this point in the code.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-08-29 17:45:22 -04:00
NeilBrown
3903902401 nfsd: don't allow nfsd threads to be signalled.
The original implementation of nfsd used signals to stop threads during
shutdown.
In Linux 2.3.46pre5 nfsd gained the ability to shutdown threads
internally it if was asked to run "0" threads.  After this user-space
transitioned to using "rpc.nfsd 0" to stop nfsd and sending signals to
threads was no longer an important part of the API.

In commit 3ebdbe5203 ("SUNRPC: discard svo_setup and rename
svc_set_num_threads_sync()") (v5.17-rc1~75^2~41) we finally removed the
use of signals for stopping threads, using kthread_stop() instead.

This patch makes the "obvious" next step and removes the ability to
signal nfsd threads - or any svc threads.  nfsd stops allowing signals
and we don't check for their delivery any more.

This will allow for some simplification in later patches.

A change worth noting is in nfsd4_ssc_setup_dul().  There was previously
a signal_pending() check which would only succeed when the thread was
being shut down.  It should really have tested kthread_should_stop() as
well.  Now it just does the latter, not the former.

Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-08-29 17:45:22 -04:00
NeilBrown
8db14cad28 lockd: remove SIGKILL handling
lockd allows SIGKILL and responds by dropping all locks and restarting
the grace period.  This functionality has been present since 2.1.32 when
lockd was added to Linux.

This functionality is undocumented and most likely added as a useful
debug aid.  When there is a need to drop locks, the better approach is
to use /proc/fs/nfsd/unlock_*.

This patch removes SIGKILL handling as part of preparation for removing
all signal handling from sunrpc service threads.

Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-08-29 17:45:22 -04:00
Su Hui
de8d38cf44 fs: lockd: avoid possible wrong NULL parameter
clang's static analysis warning: fs/lockd/mon.c: line 293, column 2:
Null pointer passed as 2nd argument to memory copy function.

Assuming 'hostname' is NULL and calling 'nsm_create_handle()', this will
pass NULL as 2nd argument to memory copy function 'memcpy()'. So return
NULL if 'hostname' is invalid.

Fixes: 77a3ef33e2 ("NSM: More clean up of nsm_get_handle()")
Signed-off-by: Su Hui <suhui@nfschina.com>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-08-29 17:45:22 -04:00
Zhu Wang
7afdc0c902 exportfs: remove kernel-doc warnings in exportfs
Remove kernel-doc warning in exportfs:

fs/exportfs/expfs.c:395: warning: Function parameter or member 'parent'
not described in 'exportfs_encode_inode_fh'

Signed-off-by: Zhu Wang <wangzhu9@huawei.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-08-29 17:45:22 -04:00
Jeff Layton
d424797032 nfsd: inherit required unset default acls from effective set
A well-formed NFSv4 ACL will always contain OWNER@/GROUP@/EVERYONE@
ACEs, but there is no requirement for inheritable entries for those
entities. POSIX ACLs must always have owner/group/other entries, even for a
default ACL.

nfsd builds the default ACL from inheritable ACEs, but the current code
just leaves any unspecified ACEs zeroed out. The result is that adding a
default user or group ACE to an inode can leave it with unwanted deny
entries.

For instance, a newly created directory with no acl will look something
like this:

	# NFSv4 translation by server
	A::OWNER@:rwaDxtTcCy
	A::GROUP@:rxtcy
	A::EVERYONE@:rxtcy

	# POSIX ACL of underlying file
	user::rwx
	group::r-x
	other::r-x

...if I then add new v4 ACE:

	nfs4_setfacl -a A:fd:1000:rwx /mnt/local/test

...I end up with a result like this today:

	user::rwx
	user:1000:rwx
	group::r-x
	mask::rwx
	other::r-x
	default:user::---
	default:user:1000:rwx
	default:group::---
	default😷:rwx
	default:other::---

	A::OWNER@:rwaDxtTcCy
	A::1000:rwaDxtcy
	A::GROUP@:rxtcy
	A::EVERYONE@:rxtcy
	D:fdi:OWNER@:rwaDx
	A:fdi:OWNER@:tTcCy
	A:fdi:1000:rwaDxtcy
	A:fdi:GROUP@:tcy
	A:fdi:EVERYONE@:tcy

...which is not at all expected. Adding a single inheritable allow ACE
should not result in everyone else losing access.

The setfacl command solves a silimar issue by copying owner/group/other
entries from the effective ACL when none of them are set:

    "If a Default ACL entry is created, and the  Default  ACL  contains  no
     owner,  owning group,  or  others  entry,  a  copy of the ACL owner,
     owning group, or others entry is added to the Default ACL.

Having nfsd do the same provides a more sane result (with no deny ACEs
in the resulting set):

	user::rwx
	user:1000:rwx
	group::r-x
	mask::rwx
	other::r-x
	default:user::rwx
	default:user:1000:rwx
	default:group::r-x
	default😷:rwx
	default:other::r-x

	A::OWNER@:rwaDxtTcCy
	A::1000:rwaDxtcy
	A::GROUP@:rxtcy
	A::EVERYONE@:rxtcy
	A:fdi:OWNER@:rwaDxtTcCy
	A:fdi:1000:rwaDxtcy
	A:fdi:GROUP@:rxtcy
	A:fdi:EVERYONE@:rxtcy

Reported-by: Ondrej Valousek <ondrej.valousek@diasemi.com>
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=2136452
Suggested-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-08-29 17:45:22 -04:00
Alexander Aring
be2be5f7f4 lockd: nlm_blocked list race fixes
This patch fixes races when lockd accesses the global nlm_blocked list.
It was mostly safe to access the list because everything was accessed
from the lockd kernel thread context but there exist cases like
nlmsvc_grant_deferred() that could manipulate the nlm_blocked list and
it can be called from any context.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-08-29 17:45:22 -04:00
Jeff Layton
f2b7019d2e nfsd: set missing after_change as before_change + 1
In the event that we can't fetch post_op_attr attributes, we still need
to set a value for the after_change. The operation has already happened,
so we're not able to return an error at that point, but we do want to
ensure that the client knows that its cache should be invalidated.

If we weren't able to fetch post-op attrs, then just set the
after_change to before_change + 1. The atomic flag should already be
clear in this case.

Suggested-by: Neil Brown <neilb@suse.de>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-08-29 17:45:22 -04:00
Jeff Layton
976626073a nfsd: remove unsafe BUG_ON from set_change_info
At one time, nfsd would scrape inode information directly out of struct
inode in order to populate the change_info4. At that time, the BUG_ON in
set_change_info made some sense, since having it unset meant a coding
error.

More recently, it calls vfs_getattr to get this information, which can
fail. If that fails, fh_pre_saved can end up not being set. While this
situation is unfortunate, we don't need to crash the box.

Move set_change_info to nfs4proc.c since all of the callers are there.
Revise the condition for setting "atomic" to also check for
fh_pre_saved. Drop the BUG_ON and just have it zero out both
change_attr4s when this occurs.

Reported-by: Boyang Xue <bxue@redhat.com>
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=2223560
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-08-29 17:45:22 -04:00
Jeff Layton
a332018a91 nfsd: handle failure to collect pre/post-op attrs more sanely
Collecting pre_op_attrs can fail, in which case it's probably best to
fail the whole operation.

Change fh_fill_pre_attrs and fh_fill_both_attrs to return __be32, and
have the callers check the return code and abort the operation if it's
not nfs_ok.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-08-29 17:45:22 -04:00
Jeff Layton
5865bafa19 nfsd: add a MODULE_DESCRIPTION
I got this today from modpost:

    WARNING: modpost: missing MODULE_DESCRIPTION() in fs/nfsd/nfsd.o

Add a module description.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-08-29 17:45:22 -04:00
Chuck Lever
e7421ce714 NFSD: Rename struct svc_cacherep
The svc_ prefix is identified with the SunRPC layer. Although the
duplicate reply cache caches RPC replies, it is only for the NFS
protocol. Rename the struct to better reflect its purpose.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-08-29 17:45:22 -04:00
Chuck Lever
cb18eca4b8 NFSD: Remove svc_rqst::rq_cacherep
Over time I'd like to see NFS-specific fields moved out of struct
svc_rqst, which is an RPC layer object. These fields are layering
violations.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-08-29 17:45:22 -04:00
Chuck Lever
c135e1269f NFSD: Refactor the duplicate reply cache shrinker
Avoid holding the bucket lock while freeing cache entries. This
change also caps the number of entries that are freed when the
shrinker calls to reduce the shrinker's impact on the cache's
effectiveness.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-08-29 17:45:22 -04:00
Chuck Lever
a9507f6af1 NFSD: Replace nfsd_prune_bucket()
Enable nfsd_prune_bucket() to drop the bucket lock while calling
kfree(). Use the same pattern that Jeff recently introduced in the
NFSD filecache.

A few percpu operations are moved outside the lock since they
temporarily disable local IRQs which is expensive and does not
need to be done while the lock is held.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-08-29 17:45:22 -04:00
Chuck Lever
ff0d169329 NFSD: Rename nfsd_reply_cache_alloc()
For readability, rename to match the other helpers.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-08-29 17:45:22 -04:00
Chuck Lever
35308e7f0f NFSD: Refactor nfsd_reply_cache_free_locked()
To reduce contention on the bucket locks, we must avoid calling
kfree() while each bucket lock is held.

Start by refactoring nfsd_reply_cache_free_locked() into a helper
that removes an entry from the bucket (and must therefore run under
the lock) and a second helper that frees the entry (which does not
need to hold the lock).

For readability, rename the helpers nfsd_cacherep_<verb>.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-08-29 17:45:22 -04:00
Dai Ngo
1d3dd1d56c NFSD: Enable write delegation support
This patch grants write delegations for OPEN with NFS4_SHARE_ACCESS_WRITE
if there is no conflict with other OPENs.

Write delegation conflicts with another OPEN, REMOVE, RENAME and SETATTR
are handled the same as read delegation using notify_change,
try_break_deleg.

The NFSv4.0 protocol does not enable a server to determine that a
conflicting GETATTR originated from the client holding the
delegation versus coming from some other client. With NFSv4.1 and
later, the SEQUENCE operation that begins each COMPOUND contains a
client ID, so delegation recall can be safely squelched in this case.

With NFSv4.0, however, the server must recall or send a CB_GETATTR
(per RFC 7530 Section 16.7.5) even when the GETATTR originates from
the client holding that delegation.

An NFSv4.0 client can trigger a pathological situation if it always
sends a DELEGRETURN preceded by a conflicting GETATTR in the same
COMPOUND. COMPOUND execution will always stop at the GETATTR and the
DELEGRETURN will never get executed. The server eventually revokes
the delegation, which can result in loss of open or lock state.

Tracepoint added to track whether read or write delegation is granted.

Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-08-29 17:45:22 -04:00
Chuck Lever
50bce06f0e NFSD: Report zero space limit for write delegations
Replace the -1 (no limit) with a zero (no reserved space).

This prevents certain non-determinant client behavior, such as
silly-renaming a file when the only open reference is a write
delegation. Such a rename can leave unexpected .nfs files in a
directory that is otherwise supposed to be empty.

Note that other server implementations that support write delegation
also set this field to zero.

Suggested-by: Dai Ngo <dai.ngo@oracle.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-08-29 17:45:22 -04:00
Dai Ngo
fd19ca36fd NFSD: handle GETATTR conflict with write delegation
If the GETATTR request on a file that has write delegation in effect and
the request attributes include the change info and size attribute then
the write delegation is recalled. If the delegation is returned within
30ms then the GETATTR is serviced as normal otherwise the NFS4ERR_DELAY
error is returned for the GETATTR.

Add counter for write delegation recall due to conflict GETATTR. This is
used to evaluate the need to implement CB_GETATTR to adoid recalling the
delegation with conflit GETATTR.

Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-08-29 17:45:22 -04:00
Dai Ngo
d67cd907cf locks: allow support for write delegation
Remove the check for F_WRLCK in generic_add_lease to allow file_lock
to be used for write delegation.

First consumer is NFSD.

Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-08-29 17:45:22 -04:00
Linus Torvalds
b96a3e9142 - Some swap cleanups from Ma Wupeng ("fix WARN_ON in add_to_avail_list")
- Peter Xu has a series (mm/gup: Unify hugetlb, speed up thp") which
   reduces the special-case code for handling hugetlb pages in GUP.  It
   also speeds up GUP handling of transparent hugepages.
 
 - Peng Zhang provides some maple tree speedups ("Optimize the fast path
   of mas_store()").
 
 - Sergey Senozhatsky has improved te performance of zsmalloc during
   compaction (zsmalloc: small compaction improvements").
 
 - Domenico Cerasuolo has developed additional selftest code for zswap
   ("selftests: cgroup: add zswap test program").
 
 - xu xin has doe some work on KSM's handling of zero pages.  These
   changes are mainly to enable the user to better understand the
   effectiveness of KSM's treatment of zero pages ("ksm: support tracking
   KSM-placed zero-pages").
 
 - Jeff Xu has fixes the behaviour of memfd's
   MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED sysctl ("mm/memfd: fix sysctl
   MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED").
 
 - David Howells has fixed an fscache optimization ("mm, netfs, fscache:
   Stop read optimisation when folio removed from pagecache").
 
 - Axel Rasmussen has given userfaultfd the ability to simulate memory
   poisoning ("add UFFDIO_POISON to simulate memory poisoning with UFFD").
 
 - Miaohe Lin has contributed some routine maintenance work on the
   memory-failure code ("mm: memory-failure: remove unneeded PageHuge()
   check").
 
 - Peng Zhang has contributed some maintenance work on the maple tree
   code ("Improve the validation for maple tree and some cleanup").
 
 - Hugh Dickins has optimized the collapsing of shmem or file pages into
   THPs ("mm: free retracted page table by RCU").
 
 - Jiaqi Yan has a patch series which permits us to use the healthy
   subpages within a hardware poisoned huge page for general purposes
   ("Improve hugetlbfs read on HWPOISON hugepages").
 
 - Kemeng Shi has done some maintenance work on the pagetable-check code
   ("Remove unused parameters in page_table_check").
 
 - More folioification work from Matthew Wilcox ("More filesystem folio
   conversions for 6.6"), ("Followup folio conversions for zswap").  And
   from ZhangPeng ("Convert several functions in page_io.c to use a
   folio").
 
 - page_ext cleanups from Kemeng Shi ("minor cleanups for page_ext").
 
 - Baoquan He has converted some architectures to use the GENERIC_IOREMAP
   ioremap()/iounmap() code ("mm: ioremap: Convert architectures to take
   GENERIC_IOREMAP way").
 
 - Anshuman Khandual has optimized arm64 tlb shootdown ("arm64: support
   batched/deferred tlb shootdown during page reclamation/migration").
 
 - Better maple tree lockdep checking from Liam Howlett ("More strict
   maple tree lockdep").  Liam also developed some efficiency improvements
   ("Reduce preallocations for maple tree").
 
 - Cleanup and optimization to the secondary IOMMU TLB invalidation, from
   Alistair Popple ("Invalidate secondary IOMMU TLB on permission
   upgrade").
 
 - Ryan Roberts fixes some arm64 MM selftest issues ("selftests/mm fixes
   for arm64").
 
 - Kemeng Shi provides some maintenance work on the compaction code ("Two
   minor cleanups for compaction").
 
 - Some reduction in mmap_lock pressure from Matthew Wilcox ("Handle most
   file-backed faults under the VMA lock").
 
 - Aneesh Kumar contributes code to use the vmemmap optimization for DAX
   on ppc64, under some circumstances ("Add support for DAX vmemmap
   optimization for ppc64").
 
 - page-ext cleanups from Kemeng Shi ("add page_ext_data to get client
   data in page_ext"), ("minor cleanups to page_ext header").
 
 - Some zswap cleanups from Johannes Weiner ("mm: zswap: three
   cleanups").
 
 - kmsan cleanups from ZhangPeng ("minor cleanups for kmsan").
 
 - VMA handling cleanups from Kefeng Wang ("mm: convert to
   vma_is_initial_heap/stack()").
 
 - DAMON feature work from SeongJae Park ("mm/damon/sysfs-schemes:
   implement DAMOS tried total bytes file"), ("Extend DAMOS filters for
   address ranges and DAMON monitoring targets").
 
 - Compaction work from Kemeng Shi ("Fixes and cleanups to compaction").
 
 - Liam Howlett has improved the maple tree node replacement code
   ("maple_tree: Change replacement strategy").
 
 - ZhangPeng has a general code cleanup - use the K() macro more widely
   ("cleanup with helper macro K()").
 
 - Aneesh Kumar brings memmap-on-memory to ppc64 ("Add support for memmap
   on memory feature on ppc64").
 
 - pagealloc cleanups from Kemeng Shi ("Two minor cleanups for pcp list
   in page_alloc"), ("Two minor cleanups for get pageblock migratetype").
 
 - Vishal Moola introduces a memory descriptor for page table tracking,
   "struct ptdesc" ("Split ptdesc from struct page").
 
 - memfd selftest maintenance work from Aleksa Sarai ("memfd: cleanups
   for vm.memfd_noexec").
 
 - MM include file rationalization from Hugh Dickins ("arch: include
   asm/cacheflush.h in asm/hugetlb.h").
 
 - THP debug output fixes from Hugh Dickins ("mm,thp: fix sloppy text
   output").
 
 - kmemleak improvements from Xiaolei Wang ("mm/kmemleak: use
   object_cache instead of kmemleak_initialized").
 
 - More folio-related cleanups from Matthew Wilcox ("Remove _folio_dtor
   and _folio_order").
 
 - A VMA locking scalability improvement from Suren Baghdasaryan
   ("Per-VMA lock support for swap and userfaults").
 
 - pagetable handling cleanups from Matthew Wilcox ("New page table range
   API").
 
 - A batch of swap/thp cleanups from David Hildenbrand ("mm/swap: stop
   using page->private on tail pages for THP_SWAP + cleanups").
 
 - Cleanups and speedups to the hugetlb fault handling from Matthew
   Wilcox ("Change calling convention for ->huge_fault").
 
 - Matthew Wilcox has also done some maintenance work on the MM subsystem
   documentation ("Improve mm documentation").
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZO1JUQAKCRDdBJ7gKXxA
 jrMwAP47r/fS8vAVT3zp/7fXmxaJYTK27CTAM881Gw1SDhFM/wEAv8o84mDenCg6
 Nfio7afS1ncD+hPYT8947UnLxTgn+ww=
 =Afws
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2023-08-28-18-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:

 - Some swap cleanups from Ma Wupeng ("fix WARN_ON in
   add_to_avail_list")

 - Peter Xu has a series (mm/gup: Unify hugetlb, speed up thp") which
   reduces the special-case code for handling hugetlb pages in GUP. It
   also speeds up GUP handling of transparent hugepages.

 - Peng Zhang provides some maple tree speedups ("Optimize the fast path
   of mas_store()").

 - Sergey Senozhatsky has improved te performance of zsmalloc during
   compaction (zsmalloc: small compaction improvements").

 - Domenico Cerasuolo has developed additional selftest code for zswap
   ("selftests: cgroup: add zswap test program").

 - xu xin has doe some work on KSM's handling of zero pages. These
   changes are mainly to enable the user to better understand the
   effectiveness of KSM's treatment of zero pages ("ksm: support
   tracking KSM-placed zero-pages").

 - Jeff Xu has fixes the behaviour of memfd's
   MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED sysctl ("mm/memfd: fix sysctl
   MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED").

 - David Howells has fixed an fscache optimization ("mm, netfs, fscache:
   Stop read optimisation when folio removed from pagecache").

 - Axel Rasmussen has given userfaultfd the ability to simulate memory
   poisoning ("add UFFDIO_POISON to simulate memory poisoning with
   UFFD").

 - Miaohe Lin has contributed some routine maintenance work on the
   memory-failure code ("mm: memory-failure: remove unneeded PageHuge()
   check").

 - Peng Zhang has contributed some maintenance work on the maple tree
   code ("Improve the validation for maple tree and some cleanup").

 - Hugh Dickins has optimized the collapsing of shmem or file pages into
   THPs ("mm: free retracted page table by RCU").

 - Jiaqi Yan has a patch series which permits us to use the healthy
   subpages within a hardware poisoned huge page for general purposes
   ("Improve hugetlbfs read on HWPOISON hugepages").

 - Kemeng Shi has done some maintenance work on the pagetable-check code
   ("Remove unused parameters in page_table_check").

 - More folioification work from Matthew Wilcox ("More filesystem folio
   conversions for 6.6"), ("Followup folio conversions for zswap"). And
   from ZhangPeng ("Convert several functions in page_io.c to use a
   folio").

 - page_ext cleanups from Kemeng Shi ("minor cleanups for page_ext").

 - Baoquan He has converted some architectures to use the
   GENERIC_IOREMAP ioremap()/iounmap() code ("mm: ioremap: Convert
   architectures to take GENERIC_IOREMAP way").

 - Anshuman Khandual has optimized arm64 tlb shootdown ("arm64: support
   batched/deferred tlb shootdown during page reclamation/migration").

 - Better maple tree lockdep checking from Liam Howlett ("More strict
   maple tree lockdep"). Liam also developed some efficiency
   improvements ("Reduce preallocations for maple tree").

 - Cleanup and optimization to the secondary IOMMU TLB invalidation,
   from Alistair Popple ("Invalidate secondary IOMMU TLB on permission
   upgrade").

 - Ryan Roberts fixes some arm64 MM selftest issues ("selftests/mm fixes
   for arm64").

 - Kemeng Shi provides some maintenance work on the compaction code
   ("Two minor cleanups for compaction").

 - Some reduction in mmap_lock pressure from Matthew Wilcox ("Handle
   most file-backed faults under the VMA lock").

 - Aneesh Kumar contributes code to use the vmemmap optimization for DAX
   on ppc64, under some circumstances ("Add support for DAX vmemmap
   optimization for ppc64").

 - page-ext cleanups from Kemeng Shi ("add page_ext_data to get client
   data in page_ext"), ("minor cleanups to page_ext header").

 - Some zswap cleanups from Johannes Weiner ("mm: zswap: three
   cleanups").

 - kmsan cleanups from ZhangPeng ("minor cleanups for kmsan").

 - VMA handling cleanups from Kefeng Wang ("mm: convert to
   vma_is_initial_heap/stack()").

 - DAMON feature work from SeongJae Park ("mm/damon/sysfs-schemes:
   implement DAMOS tried total bytes file"), ("Extend DAMOS filters for
   address ranges and DAMON monitoring targets").

 - Compaction work from Kemeng Shi ("Fixes and cleanups to compaction").

 - Liam Howlett has improved the maple tree node replacement code
   ("maple_tree: Change replacement strategy").

 - ZhangPeng has a general code cleanup - use the K() macro more widely
   ("cleanup with helper macro K()").

 - Aneesh Kumar brings memmap-on-memory to ppc64 ("Add support for
   memmap on memory feature on ppc64").

 - pagealloc cleanups from Kemeng Shi ("Two minor cleanups for pcp list
   in page_alloc"), ("Two minor cleanups for get pageblock
   migratetype").

 - Vishal Moola introduces a memory descriptor for page table tracking,
   "struct ptdesc" ("Split ptdesc from struct page").

 - memfd selftest maintenance work from Aleksa Sarai ("memfd: cleanups
   for vm.memfd_noexec").

 - MM include file rationalization from Hugh Dickins ("arch: include
   asm/cacheflush.h in asm/hugetlb.h").

 - THP debug output fixes from Hugh Dickins ("mm,thp: fix sloppy text
   output").

 - kmemleak improvements from Xiaolei Wang ("mm/kmemleak: use
   object_cache instead of kmemleak_initialized").

 - More folio-related cleanups from Matthew Wilcox ("Remove _folio_dtor
   and _folio_order").

 - A VMA locking scalability improvement from Suren Baghdasaryan
   ("Per-VMA lock support for swap and userfaults").

 - pagetable handling cleanups from Matthew Wilcox ("New page table
   range API").

 - A batch of swap/thp cleanups from David Hildenbrand ("mm/swap: stop
   using page->private on tail pages for THP_SWAP + cleanups").

 - Cleanups and speedups to the hugetlb fault handling from Matthew
   Wilcox ("Change calling convention for ->huge_fault").

 - Matthew Wilcox has also done some maintenance work on the MM
   subsystem documentation ("Improve mm documentation").

* tag 'mm-stable-2023-08-28-18-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (489 commits)
  maple_tree: shrink struct maple_tree
  maple_tree: clean up mas_wr_append()
  secretmem: convert page_is_secretmem() to folio_is_secretmem()
  nios2: fix flush_dcache_page() for usage from irq context
  hugetlb: add documentation for vma_kernel_pagesize()
  mm: add orphaned kernel-doc to the rst files.
  mm: fix clean_record_shared_mapping_range kernel-doc
  mm: fix get_mctgt_type() kernel-doc
  mm: fix kernel-doc warning from tlb_flush_rmaps()
  mm: remove enum page_entry_size
  mm: allow ->huge_fault() to be called without the mmap_lock held
  mm: move PMD_ORDER to pgtable.h
  mm: remove checks for pte_index
  memcg: remove duplication detection for mem_cgroup_uncharge_swap
  mm/huge_memory: work on folio->swap instead of page->private when splitting folio
  mm/swap: inline folio_set_swap_entry() and folio_swap_entry()
  mm/swap: use dedicated entry for swap in folio
  mm/swap: stop using page->private on tail pages for THP_SWAP
  selftests/mm: fix WARNING comparing pointer to 0
  selftests: cgroup: fix test_kmem_memcg_deletion kernel mem check
  ...
2023-08-29 14:25:26 -07:00
Linus Torvalds
9d6b14cd1e flexible-array transformations for 6.6-rc1
Hi Linus,
 
 Please, pull the following flexible-array transformations. These patches
 have been baking in linux-next for a while.
 
 Thanks
 --
 Gustavo
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEkmRahXBSurMIg1YvRwW0y0cG2zEFAmTtBEQACgkQRwW0y0cG
 2zHgeBAAm1D5H+4THFOuYVsl93igowMbcPwraoo2xxnNtLqQR4y6p9SiWcjh2vk3
 bmabPtID3jPZytjpV6QUYe6JlcOoWc2LsNTWqwnt2X2fhcovJfGMc/K3+7bRQzE8
 uBqooxkiaZvPiUATeo0zdBSN9aYWXvEpYrBR5J9EtJWyS93LiYb/nn2DY+0d1yrY
 89koxTV9/eOKJ7xi6UB0zWbyayor+JawX5RI2ysMFHO6UB2aVrNjqUCJP51dyywa
 2QKtwkY/8vQIo8Y8c7ReApHR0t+s0jmxRi9ip+4rh/0VNl8ON5SoL+EcRB5FLCTn
 E/VqQcebjIwo0AfeEzUQIc68E1hZJLYhGGJupF1Wsb9IvmbdTTMhhfLJmWStz5+E
 GnNpXx9gTjCoM5+FqJlMkHgxT9JpHYSV5LtnZKmM7wiexHND+5T/ukAxRUhFjLY3
 saMbXnJ85kHsNYLAxIDUfdRcXEDNBJl7XiVI2YRCI9fttD+GVjdFtATdNZFsFAge
 kHZCAFloFHwe0lTcksN5I/NVdDc6Jv7ABjrTKNezfrjfRn24zvR61V54/ebPp5Fs
 2JA5kVJpgGnUdIM2uot2lYLCmLDjhGNfXE95AjK0SdJXrZtzYqzTcKJg6w1gz3ZJ
 VHWK4snsu48KxTx66f+v38ovKXLtlA8Rk8kYOHWXYgB0ZXOOAy4=
 =L9kF
 -----END PGP SIGNATURE-----

Merge tag 'flex-array-transformations-6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux

Pull flexible-array updates from Gustavo A. R. Silva.

* tag 'flex-array-transformations-6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux:
  fs: omfs: Use flexible-array member in struct omfs_extent
  sparc: openpromio: Address -Warray-bounds warning
  reiserfs: Replace one-element array with flexible-array member
2023-08-29 12:48:12 -07:00
Linus Torvalds
468e28d4ac v6.6-vfs.super.fixes
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZO2teQAKCRCRxhvAZXjc
 omN9AP9F3rtueiMv0kfdwRhXl4GITY+o5OpiEpLjHdPC4nEalwEAvt8h4nvNmTg6
 B54+2wNX2vc3t5UVPuFlqSHtUxoKTgA=
 =DMxT
 -----END PGP SIGNATURE-----

Merge tag 'v6.6-vfs.super.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull superblock fixes from Christian Brauner:
 "Two follow-up fixes for the super work this cycle:

   - Move a misplaced lockep assertion before we potentially free the
     object containing the lock.

   - Ensure that filesystems which match superblocks in sget{_fc}()
     based on sb->s_fs_info are guaranteed to see a valid sb->s_fs_info
     as long as a superblock still appears on the filesystem type's
     superblock list.

     What we want as a proper solution for next cycle is to split
     sb->free_sb() out of sb->kill_sb() so that we can simply call
     kill_super_notify() after sb->kill_sb() but before sb->free_sb().

     Currently, this is lumped together in sb->kill_sb()"

* tag 'v6.6-vfs.super.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  super: ensure valid info
  super: move lockdep assert
2023-08-29 11:59:37 -07:00
Namjae Jeon
0e2378eaa2 ksmbd: add missing calling smb2_set_err_rsp() on error
If some error happen on smb2_sess_setup(), Need to call
smb2_set_err_rsp() to set error response.
This patch add missing calling smb2_set_err_rsp() on error.

Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-08-29 12:30:20 -05:00
Namjae Jeon
0ba5439d9a ksmbd: replace one-element array with flex-array member in struct smb2_ea_info
UBSAN complains about out-of-bounds array indexes on 1-element arrays in
struct smb2_ea_info.

UBSAN: array-index-out-of-bounds in fs/smb/server/smb2pdu.c:4335:15
index 1 is out of range for type 'char [1]'
CPU: 1 PID: 354 Comm: kworker/1:4 Not tainted 6.5.0-rc4 #1
Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop
Reference Platform, BIOS 6.00 07/22/2020
Workqueue: ksmbd-io handle_ksmbd_work [ksmbd]
Call Trace:
 <TASK>
 __dump_stack linux/lib/dump_stack.c:88
 dump_stack_lvl+0x48/0x70 linux/lib/dump_stack.c:106
 dump_stack+0x10/0x20 linux/lib/dump_stack.c:113
 ubsan_epilogue linux/lib/ubsan.c:217
 __ubsan_handle_out_of_bounds+0xc6/0x110 linux/lib/ubsan.c:348
 smb2_get_ea linux/fs/smb/server/smb2pdu.c:4335
 smb2_get_info_file linux/fs/smb/server/smb2pdu.c:4900
 smb2_query_info+0x63ae/0x6b20 linux/fs/smb/server/smb2pdu.c:5275
 __process_request linux/fs/smb/server/server.c:145
 __handle_ksmbd_work linux/fs/smb/server/server.c:213
 handle_ksmbd_work+0x348/0x10b0 linux/fs/smb/server/server.c:266
 process_one_work+0x85a/0x1500 linux/kernel/workqueue.c:2597
 worker_thread+0xf3/0x13a0 linux/kernel/workqueue.c:2748
 kthread+0x2b7/0x390 linux/kernel/kthread.c:389
 ret_from_fork+0x44/0x90 linux/arch/x86/kernel/process.c:145
 ret_from_fork_asm+0x1b/0x30 linux/arch/x86/entry/entry_64.S:304
 </TASK>

Cc: stable@vger.kernel.org
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-08-29 12:30:20 -05:00
Namjae Jeon
4b081ce0d8 ksmbd: fix slub overflow in ksmbd_decode_ntlmssp_auth_blob()
If authblob->SessionKey.Length is bigger than session key
size(CIFS_KEY_SIZE), slub overflow can happen in key exchange codes.
cifs_arc4_crypt copy to session key array from SessionKey from client.

Cc: stable@vger.kernel.org
Reported-by: zdi-disclosures@trendmicro.com # ZDI-CAN-21940
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-08-29 12:30:20 -05:00
Namjae Jeon
17d5b135bb ksmbd: fix wrong DataOffset validation of create context
If ->DataOffset of create context is 0, DataBuffer size is not correctly
validated. This patch change wrong validation code and consider tag
length in request.

Cc: stable@vger.kernel.org
Reported-by: zdi-disclosures@trendmicro.com # ZDI-CAN-21824
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-08-29 12:30:20 -05:00
Yang Li
bf26f1b4e0 ksmbd: Fix one kernel-doc comment
Fix one kernel-doc comment to silence the warning:
fs/smb/server/smb2pdu.c:4160: warning: Excess function parameter 'infoclass_size' description in 'buffer_check_err'

Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-08-29 12:30:20 -05:00
Namjae Jeon
e628bf939a ksmbd: reduce descriptor size if remaining bytes is less than request size
Create 3 kinds of files to reproduce this problem.

dd if=/dev/urandom of=127k.bin bs=1024 count=127
dd if=/dev/urandom of=128k.bin bs=1024 count=128
dd if=/dev/urandom of=129k.bin bs=1024 count=129

When copying files from ksmbd share to windows or cifs.ko, The following
error message happen from windows client.

"The file '129k.bin' is too large for the destination filesystem."

We can see the error logs from ksmbd debug prints

[48394.611537] ksmbd: RDMA r/w request 0x0: token 0x669d, length 0x20000
[48394.612054] ksmbd: smb_direct: RDMA write, len 0x20000, needed credits 0x1
[48394.612572] ksmbd: filename 129k.bin, offset 131072, len 131072
[48394.614189] ksmbd: nbytes 1024, offset 132096 mincount 0
[48394.614585] ksmbd: Failed to process 8 [-22]

And we can reproduce it with cifs.ko,
e.g. dd if=129k.bin of=/dev/null bs=128KB count=2

This problem is that ksmbd rdma return error if remaining bytes is less
than Length of Buffer Descriptor V1 Structure.

smb_direct_rdma_xmit()
...
     if (desc_buf_len == 0 || total_length > buf_len ||
           total_length > t->max_rdma_rw_size)
               return -EINVAL;

This patch reduce descriptor size with remaining bytes and remove the
check for total_length and buf_len.

Cc: stable@vger.kernel.org
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-08-29 12:30:20 -05:00
Atte Heikkilä
65656f5242 ksmbd: fix force create mode' and force directory mode'
`force create mode' and `force directory mode' should be bitwise ORed
with the perms after `create mask' and `directory mask' have been
applied, respectively.

Signed-off-by: Atte Heikkilä <atteh.mailbox@gmail.com>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-08-29 12:30:20 -05:00
Namjae Jeon
041bba4414 ksmbd: fix wrong interim response on compound
If smb2_lock or smb2_open request is compound, ksmbd could send wrong
interim response to client. ksmbd allocate new interim buffer instead of
using resonse buffer to support compound request.

Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-08-29 12:30:19 -05:00
Namjae Jeon
e2b76ab8b5 ksmbd: add support for read compound
MacOS sends a compound request including read to the server
(e.g. open-read-close). So far, ksmbd has not handled read as
a compound request. For compatibility between ksmbd and an OS that
supports SMB, This patch provides compound support for read requests.

Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-08-29 12:30:19 -05:00
Yang Yingliang
084ba46fc4 ksmbd: switch to use kmemdup_nul() helper
Use kmemdup_nul() helper instead of open-coding to
simplify the code.

Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-08-29 12:30:19 -05:00
Alexei Filippov
0225e10972 jfs: validate max amount of blocks before allocation.
The lack of checking bmp->db_max_freebud in extBalloc() can lead to
shift out of bounds, so this patch prevents undefined behavior, because
bmp->db_max_freebud == -1 only if there is no free space.

Signed-off-by: Aleksei Filippov <halip0503@gmail.com>
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Reported-and-tested-by: syzbot+5f088f29593e6b4c8db8@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?id=01abadbd6ae6a08b1f1987aa61554c6b3ac19ff2
2023-08-29 12:25:47 -05:00
Colin Ian King
87098a0d9e jfs: remove redundant initialization to pointer ip
The pointer ip is being initialized with a value that is never read, it
is being re-assigned later on. The assignment is redundant and can be
removed.  Cleans up clang scan warning:

fs/jfs/namei.c:886:16: warning: Value stored to 'ip' during its
initialization is never read [deadcode.DeadStores]

Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
2023-08-29 12:20:50 -05:00
Bernd Schubert
f73016b63b fuse: conditionally fill kstat in fuse_do_statx()
The code path

fuse_update_attributes
    fuse_update_get_attr
        fuse_do_statx

has the risk to use a NULL pointer for struct kstat *stat, although current
callers of fuse_update_attributes() only set request_mask to values that
will trigger the call of fuse_do_getattr(), which already handles the NULL
pointer.  Future updates might miss that fuse_do_statx() does not handle it
it is safer to add a condition already right now.

Signed-off-by: Bernd Schubert <bschubert@ddn.com>
Fixes: d3045530bd ("fuse: implement statx")
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2023-08-29 14:58:48 +02:00
Christian Brauner
dc3216b141
super: ensure valid info
For keyed filesystems that recycle superblocks based on s_fs_info or
information contained therein s_fs_info must be kept as long as the
superblock is on the filesystem type super list. This isn't guaranteed
as s_fs_info will be freed latest in sb->kill_sb().

The fix is simply to perform notification and list removal in
kill_anon_super(). Any filesystem needs to free s_fs_info after they
call the kill_*() helpers. If they don't they risk use-after-free right
now so fixing it here is guaranteed that s_fs_info remain valid.

For block backed filesystems notifying in pass sb->kill_sb() in
deactivate_locked_super() remains unproblematic and is required because
multiple other block devices can be shut down after kill_block_super()
has been called from a filesystem's sb->kill_sb() handler. For example,
ext4 and xfs close additional devices. Block based filesystems don't
depend on s_fs_info (btrfs does use s_fs_info but also uses
kill_anon_super() and not kill_block_super().).

Sorry for that braino. Goal should be to unify this behavior during this
cycle obviously. But let's please do a simple bugfix now.

Fixes: 2c18a63b76 ("super: wait until we passed kill super")
Fixes: syzbot+5b64180f8d9e39d3f061@syzkaller.appspotmail.com
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reported-by: syzbot+5b64180f8d9e39d3f061@syzkaller.appspotmail.com
Message-Id: <20230828-vfs-super-fixes-v1-2-b37a4a04a88f@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-29 10:13:04 +02:00
Christian Brauner
345a5c4a0b
super: move lockdep assert
Fix braino and move the lockdep assertion after put_super() otherwise we
risk a use-after-free.

Fixes: 2c18a63b76 ("super: wait until we passed kill super")
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Message-Id: <20230828-vfs-super-fixes-v1-1-b37a4a04a88f@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-29 10:13:04 +02:00
Linus Torvalds
5b07aaca18 pstore updates for v6.6-rc1
- Greatly simplify compression support (Ard Biesheuvel).
 
 - Avoid crashes for corrupted offsets when prz size is 0 (Enlin Mu).
 
 - Expand range of usable record sizes (Yuxiao Zhang).
 
 - Fix kernel-doc warning (Matthew Wilcox).
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAmTs5TUWHGtlZXNjb29r
 QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJkn8D/9mdBm32Wfx/if84YejxHJpzHmV
 nKPRgib89vNZdL5ORP02ZTonJBZn4NC7KtJfBHSfdoW1U+5GCC/cHOpECUHQui9Q
 CN22VFm37JdmBZq2+YmPug5y7z94wbFkD79otCR9VlMt5uwbNIGxUaI10fK2M97n
 3avg/RZzz6kI9Y6BChZfBDLKXXi6ytnIRQOa9ZqZyDylN1nTLi8vqrxf0P8Am0jE
 1s2GumYj54NuuNTdqvlz0XhTyCM5pk5omTqlq1VW9Trr0fLa2CLvEBWxWo8G7odC
 Yav5p8e0jX0GjDFM3NHPgRcXTcY0vkWGnJLdZGNyEkxPq96GH09j5rhFOIo9+KPz
 Y3fhYWzZyNWjy7YujWupDyL6lozWObhOcjBRnFmW7gJHjoO2G0GT2ufW2fb9cD4q
 fTGPiX2Fum1Zl6b0CXF+j4wDaazsBxGGAGzTqj7yp2Je0rPJPotd69q8LT2bbVcP
 ZahXJsFNn/YmVKv9MhNZjOuxGZoR4Cgco114V+sU5aYZMcZ68fQNzTzMydkbbdch
 SMapAV9a99H1D8ldT9dhm+HlKZFzIrOtBDrDoIbF4qQB8OWhjEK6Ot3oBbXvLl7w
 72i1niDVRj+v/hUSc/7XYfZkUG7NYJQqXbaJp20LWvEs5OALdWRC3T6vXnvh53Qd
 9ErztYLmF6k3W/h/xA==
 =bu02
 -----END PGP SIGNATURE-----

Merge tag 'pstore-v6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull pstore updates from Kees Cook:

 - Greatly simplify compression support (Ard Biesheuvel)

 - Avoid crashes for corrupted offsets when prz size is 0 (Enlin Mu)

 - Expand range of usable record sizes (Yuxiao Zhang)

 - Fix kernel-doc warning (Matthew Wilcox)

* tag 'pstore-v6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  pstore: Fix kernel-doc warning
  pstore: Support record sizes larger than kmalloc() limit
  pstore/ram: Check start of empty przs during init
  pstore: Replace crypto API compression with zlib_deflate library calls
  pstore: Remove worst-case compression size logic
2023-08-28 12:36:04 -07:00
Linus Torvalds
547635c6ac for-6.6-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmTskOwACgkQxWXV+ddt
 WDsNJw/8CCi41Z7e3LdJsQd2iy3/+oJZUvIGuT5YvshYxTLCbV7AL+diBPnSQs4Q
 /KFMGL7RZBgJzwVoSQtXnESXXgX8VOVfN1zY//k5g6z7BscCEQd73H/M0B8ciZy/
 aBygm9tJ7EtWbGZWNR8yad8YtOgl6xoClrPnJK/DCLwMGPy2o+fnKP3Y9FOKY5KM
 1Sl0Y4FlJ9dTJpxIwYbx4xmuyHrh2OivjU/KnS9SzQlHu0nl6zsIAE45eKem2/EG
 1figY5aFBYPpPYfopbLDalEBR3bQGiViZVJuNEop3AimdcMOXw9jBF3EZYUb5Tgn
 MleMDgmmjLGOE/txGhvTxKj9kci2aGX+fJn3jXbcIMksAA0OQFLPqzGvEQcrs6Ok
 HA0RsmAkS5fWNDCuuo4ZPXEyUPvluTQizkwyoulOfnK+UPJCWaRqbEBMTsvm6M6X
 wFT2czwLpaEU/W6loIZkISUhfbRqVoA3DfHy398QXNzRhSrg8fQJjma1f7mrHvTi
 CzU+OD5YSC2nXktVOnklyTr0XT+7HF69cumlDbr8TS8u1qu8n1keU/7M3MBB4xZk
 BZFJDz8pnsAqpwVA4T434E/w45MDnYlwBw5r+U8Xjyso8xlau+sYXKcim85vT2Q0
 yx/L91P6tdekR1y97p4aDdxw/PgTzdkNGMnsTBMVzgtCj+5pMmE=
 =N7Yn
 -----END PGP SIGNATURE-----

Merge tag 'for-6.6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs updates from David Sterba:
 "No new features, the bulk of the changes are fixes, refactoring and
  cleanups. The notable fix is the scrub performance restoration after
  rewrite in 6.4, though still only partial.

  Fixes:

   - scrub performance drop due to rewrite in 6.4 partially restored:
      - do IO grouping by blg_plug/blk_unplug again
      - avoid unnecessary tree searches when processing stripes, in
        extent and checksum trees
      - the drop is noticeable on fast PCIe devices, -66% and restored
        to -33% of the original
      - backports to 6.4 planned

   - handle more corner cases of transaction commit during orphan
     cleanup or delayed ref processing

   - use correct fsid/metadata_uuid when validating super block

   - copy directory permissions and time when creating a stub subvolume

  Core:

   - debugging feature integrity checker deprecated, to be removed in
     6.7

   - in zoned mode, zones are activated just before the write, making
     error handling easier, now the overcommit mechanism can be enabled
     again which improves performance by avoiding more frequent flushing

   - v0 extent handling completely removed, deprecated long time ago

   - error handling improvements

   - tests:
      - extent buffer bitmap tests
      - pinned extent splitting tests

   - cleanups and refactoring:
      - compression writeback
      - extent buffer bitmap
      - space flushing, ENOSPC handling"

* tag 'for-6.6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (110 commits)
  btrfs: zoned: skip splitting and logical rewriting on pre-alloc write
  btrfs: tests: test invalid splitting when skipping pinned drop extent_map
  btrfs: tests: add a test for btrfs_add_extent_mapping
  btrfs: tests: add extent_map tests for dropping with odd layouts
  btrfs: scrub: move write back of repaired sectors to scrub_stripe_read_repair_worker()
  btrfs: scrub: don't go ordered workqueue for dev-replace
  btrfs: scrub: fix grouping of read IO
  btrfs: scrub: avoid unnecessary csum tree search preparing stripes
  btrfs: scrub: avoid unnecessary extent tree search preparing stripes
  btrfs: copy dir permission and time when creating a stub subvolume
  btrfs: remove pointless empty list check when reading delayed dir indexes
  btrfs: drop redundant check to use fs_devices::metadata_uuid
  btrfs: compare the correct fsid/metadata_uuid in btrfs_validate_super
  btrfs: use the correct superblock to compare fsid in btrfs_validate_super
  btrfs: simplify memcpy either of metadata_uuid or fsid
  btrfs: add a helper to read the superblock metadata_uuid
  btrfs: remove v0 extent handling
  btrfs: output extra debug info if we failed to find an inline backref
  btrfs: move the !zoned assert into run_delalloc_cow
  btrfs: consolidate the error handling in run_delalloc_nocow
  ...
2023-08-28 12:26:57 -07:00
Linus Torvalds
f678c890c6 affs-for-6.6-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmTsk7MACgkQxWXV+ddt
 WDsKjRAAoKY1qtApAMWl7vX+fw4pyoexrZISvzsJpCmPDV0SlKZ8CqOfYGhyT845
 /pYUB8oxCm8UgpbswCTDVzppFANVqJBRIQEQc9d6yej4bA62Z/4Z561xROE1BEsO
 UR2UIKw4gtpAmSlCoLYljt4Eg0izRt0MNmYvIk9ccQTIzVO8u2E8keNW76OSbAW1
 J9AH0KgrxPgK9GPN/MlWg8THyuWgP3p/Qf4QS6JcDRCNFr4gmNwxP8Cj1kifYVT/
 iMxghN+Z2iFPN8gHIeebjYpyoJJ4ZHRVL9mUZ0SLwcbt6ZfV5DnYn2yepkxxHfyL
 CaUrRNJHE3Cb4vyJgtvVoAHE78Bqyk4WHKirNnrHly5j8K9ba3PIEpL+XjDVUckA
 597xlrBsaQhlvMHXn+sMOMAMs7EVkUfR/gzSoLirWWOlLJ9skl3H8vK/BunK1bo+
 +YYwMJwPtv3HmCdMbb00Rm9D2vVSVYM4iydEjmW6ILWBgecQII5O14CmkSH9zoG7
 sip4O5oDkxmnoq9TuKPjS6AEPi+6UsbCsv8Z8boXdj97w4+bJIZVGbDHXvdAIEkj
 2NmrYgqCqcbB46cSIc2VF1Fx0WO8Jh+GTFYGAPGFrabLFDVWeuwtdJahbYnSrmuL
 whMVNSH4sooZflCWjwANhV8H5wlIatnn+xh+lX3TWpKiBeqIGpw=
 =wPHe
 -----END PGP SIGNATURE-----

Merge tag 'affs-for-6.6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull affs updates from David Sterba:
 "Two minor updates for AFFS:

   - reimplement writepage() address space callback on top of
     migrate_folio()

   - fix a build warning, local parameters 'toupper' collide with the
     standard ctype.h name"

* tag 'affs-for-6.6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  affs: rename local toupper() to fn() to avoid confusion
  affs: remove writepage implementation
2023-08-28 12:18:26 -07:00
Linus Torvalds
3bb156a556 fsverity updates for 6.6
Several cleanups for fs/verity/, including two commits that make the
 builtin signature support more cleanly separated from the base feature.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQSacvsUNc7UX4ntmEPzXCl4vpKOKwUCZOwXSRQcZWJpZ2dlcnNA
 Z29vZ2xlLmNvbQAKCRDzXCl4vpKOK+kzAP9UsTtZjjQLvEDF6OYlysFLQppDrmk0
 zh9iH/Z7qT8BQwEA0azRHW5/FDLvkHZWsU0QCLjDpUPrNHc712aeM1pLpgc=
 =zOdL
 -----END PGP SIGNATURE-----

Merge tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fsverity/linux

Pull fsverity updates from Eric Biggers:
 "Several cleanups for fs/verity/, including two commits that make the
  builtin signature support more cleanly separated from the base
  feature"

* tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fsverity/linux:
  fsverity: skip PKCS#7 parser when keyring is empty
  fsverity: move sysctl registration out of signature.c
  fsverity: simplify handling of errors during initcall
  fsverity: explicitly check that there is no algorithm 0
2023-08-28 12:16:42 -07:00
Linus Torvalds
6016fc9162 New code for 6.6:
* Make large writes to the page cache fill sparse parts of the cache
    with large folios, then use large memcpy calls for the large folio.
  * Track the per-block dirty state of each large folio so that a
    buffered write to a single byte on a large folio does not result in a
    (potentially) multi-megabyte writeback IO.
  * Allow some directio completions to be performed in the initiating
    task's context instead of punting through a workqueue.  This will
    reduce latency for some io_uring requests.
 
 Signed-off-by: Darrick J. Wong <djwong@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZM0Z1AAKCRBKO3ySh0YR
 pp7BAQCzkKejCM0185tNIH/faHjzidSisNQkJ5HoB4Opq9U66AEA6IPuAdlPlM/J
 FPW1oPq33Yn7AV4wXjUNFfDLzVb/Fgg=
 =dFBU
 -----END PGP SIGNATURE-----

Merge tag 'iomap-6.6-merge-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull iomap updates from Darrick Wong:
 "We've got some big changes for this release -- I'm very happy to be
  landing willy's work to enable large folios for the page cache for
  general read and write IOs when the fs can make contiguous space
  allocations, and Ritesh's work to track sub-folio dirty state to
  eliminate the write amplification problems inherent in using large
  folios.

  As a bonus, io_uring can now process write completions in the caller's
  context instead of bouncing through a workqueue, which should reduce
  io latency dramatically. IOWs, XFS should see a nice performance bump
  for both IO paths.

  Summary:

   - Make large writes to the page cache fill sparse parts of the cache
     with large folios, then use large memcpy calls for the large folio.

   - Track the per-block dirty state of each large folio so that a
     buffered write to a single byte on a large folio does not result in
     a (potentially) multi-megabyte writeback IO.

   - Allow some directio completions to be performed in the initiating
     task's context instead of punting through a workqueue. This will
     reduce latency for some io_uring requests"

* tag 'iomap-6.6-merge-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (26 commits)
  iomap: support IOCB_DIO_CALLER_COMP
  io_uring/rw: add write support for IOCB_DIO_CALLER_COMP
  fs: add IOCB flags related to passing back dio completions
  iomap: add IOMAP_DIO_INLINE_COMP
  iomap: only set iocb->private for polled bio
  iomap: treat a write through cache the same as FUA
  iomap: use an unsigned type for IOMAP_DIO_* defines
  iomap: cleanup up iomap_dio_bio_end_io()
  iomap: Add per-block dirty state tracking to improve performance
  iomap: Allocate ifs in ->write_begin() early
  iomap: Refactor iomap_write_delalloc_punch() function out
  iomap: Use iomap_punch_t typedef
  iomap: Fix possible overflow condition in iomap_write_delalloc_scan
  iomap: Add some uptodate state handling helpers for ifs state bitmap
  iomap: Drop ifs argument from iomap_set_range_uptodate()
  iomap: Rename iomap_page to iomap_folio_state and others
  iomap: Copy larger chunks from userspace
  iomap: Create large folios in the buffered write path
  filemap: Allow __filemap_get_folio to allocate large folios
  filemap: Add fgf_t typedef
  ...
2023-08-28 11:59:52 -07:00
Linus Torvalds
dd2c0198a8 Changes since last update:
- Support xattr bloom filter to optimize negative xattr lookups;
 
  - Support DEFLATE compression algorithm as an alternative;
 
  - Fix a regression that ztailpacking pclusters don't release properly;
 
  - Avoid warning dedupe and fragments features anymore;
 
  - Some folio conversions and cleanups.
 -----BEGIN PGP SIGNATURE-----
 
 iIcEABYIAC8WIQThPAmQN9sSA0DVxtI5NzHcH7XmBAUCZOvhIBEceGlhbmdAa2Vy
 bmVsLm9yZwAKCRA5NzHcH7XmBFgqAP4/gcxH5vhgxMunxmgBkSxMFBQf/W7CfOiN
 QkGHjSKl8gEA78EBwAJ3vDJ1JgQRTb9/9UBrtW7n2hzj/eVS/LIyYQI=
 =o3Bx
 -----END PGP SIGNATURE-----

Merge tag 'erofs-for-6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs

Pull erofs updates from Gao Xiang:
 "In this cycle, a xattr bloom filter feature is introduced to speed up
  negative xattr lookups, which was originally suggested by Alexander
  for Composefs use cases.

  Additionally, the DEFLATE algorithm is now supported, which can be
  used together with hardware accelerators for our cloud workloads. Each
  supported compression algorithm can be selected on a per-file basis
  for specific access patterns too.

  There are also some random fixes and cleanups as usual:

   - Support xattr bloom filter to optimize negative xattr lookups

   - Support DEFLATE compression algorithm as an alternative

   - Fix a regression that ztailpacking pclusters don't release properly

   - Avoid warning dedupe and fragments features anymore

   - Some folio conversions and cleanups"

* tag 'erofs-for-6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
  erofs: release ztailpacking pclusters properly
  erofs: don't warn dedupe and fragments features anymore
  erofs: adapt folios for z_erofs_read_folio()
  erofs: adapt folios for z_erofs_readahead()
  erofs: get rid of fe->backmost for cache decompression
  erofs: drop z_erofs_page_mark_eio()
  erofs: tidy up z_erofs_do_read_page()
  erofs: move preparation logic into z_erofs_pcluster_begin()
  erofs: avoid obsolete {collector,collection} terms
  erofs: simplify z_erofs_read_fragment()
  erofs: remove redundant erofs_fs_type declaration in super.c
  erofs: add necessary kmem_cache_create flags for erofs inode cache
  erofs: clean up redundant comment and adjust code alignment
  erofs: refine warning messages for zdata I/Os
  erofs: boost negative xattr lookup with bloom filter
  erofs: update on-disk format for xattr name filter
  erofs: DEFLATE compression support
2023-08-28 11:52:10 -07:00
Linus Torvalds
f20ae9cf5b File locking fixes for v6.6
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEES8DXskRxsqGE6vXTAA5oQRlWghUFAmTngQgACgkQAA5oQRlW
 ghXfYg//cbUlOoba3RLBmKTeN68BWaLTIgeibJ17ioOSjVnxxMqyt4tLH3FD1pnT
 2QisXiR3qrJav+sbwClpC23tafCHDfTo4uGbdXlehgAu4OSJ1+8gF2k1SsdAd4Qe
 tyfP1I2LO29/Yfres2Oy86JQi99Ds4slxjj63z/pYBdo4LEDJc83m/9limgMv4ro
 Geop6f9lV3Z03rOZIixQinMtOQTXW0NJbiXMEGzFS1Y05xQ34kX+v33MBAvtnbY9
 nhbWuJKYlgZZhH5CghZUAvKfq2a4suvpZbnzmEZLJ0DdRE0h/4ctIxTSK8tnrq5d
 07cs34NOGWljlguJc7cYotrKv51lu6yGbSCHyNdAEdBotuIXJ4Q3krAUHBnuGzy3
 fa/a7nozRO7RDzshGTO4dBRerTEyS7bWUEnIhRp6BHR2ny9jhK6zMa1ys3eAMG4F
 5a61AH/3qS+i8gQQSF9HFRyD+D0qDVMg2YCE/lKveF5tNBeKqrfj/Wz3HnY2zB/L
 luQY0esX5idWTH8YYJlntxgmWEZBvNgmuWcbDVu2Jj/jXUGP6S3dnY0MZio2Sbdr
 zM7/HenAkvXTlbtJsLmdGGZSMnINuiJi4sT2aYn/8JG4HbHFlAU89ILhIGrF3ayP
 669VpO6g/DoEpVRyOnzC3MCHE09SWoiHcXo4VCsz/270uXOpcVs=
 =f9CF
 -----END PGP SIGNATURE-----

Merge tag 'filelock-v6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux

Pull file locking updates from Jeff Layton:

 - new functionality for F_OFD_GETLK: requesting a type of F_UNLCK will
   find info about whatever lock happens to be first in the given range,
   regardless of type.

 - an OFD lock selftest

 - bugfix involving a UAF in a tracepoint

 - comment typo fix

* tag 'filelock-v6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
  locks: fix KASAN: use-after-free in trace_event_raw_event_filelock_lock
  fs/locks: Fix typo
  selftests: add OFD lock tests
  fs/locks: F_UNLCK extension for F_OFD_GETLK
2023-08-28 11:47:24 -07:00
Linus Torvalds
b4a04f92a4 v6.6-fs.proc.uapi
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZOXT2QAKCRCRxhvAZXjc
 olkFAQCT4nRkRTpBvbiv4DgvCIy+URqLNfHGxCxdAX1B09o3UwEAyepf1tz7aFpB
 wB67V265JFDMWtvQkSx4ORNpAjZ9Kg0=
 =Opqi
 -----END PGP SIGNATURE-----

Merge tag 'v6.6-fs.proc.uapi' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull procfs fixes from Christian Brauner:
 "Mode changes to files under /proc/<pid>/ aren't supported ever since
  commit 6d76fa58b0 ("Don't allow chmod() on the /proc/<pid>/ files").

  Due to an oversight in commit 1b3044e39a ("procfs: fix pthread
  cross-thread naming if !PR_DUMPABLE") in switching from REG to NOD,
  mode changes on /proc/thread-self/comm were accidently allowed.

  Similar, mode changes for all files beneath /proc/<pid>/net/ are
  blocked but mode changes on /proc/<pid>/net itself were accidently
  allowed.

  Both issues come down to not using the generic proc_setattr() helper
  which blocks all mode changes. This is rectified with this pull
  request.

  This also removes a strange nolibc test that abused /proc/<pid>/net
  for testing mode changes. Using procfs for this test never made a lot
  of sense given procfs has special semantics for almost everything
  anway.

  Both changes are minor user-visible changes. It is however very
  unlikely that mode changes on proc/<pid>/net and
  /proc/thread-self/comm are something that userspace relies on"

* tag 'v6.6-fs.proc.uapi' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  procfs: block chmod on /proc/thread-self/comm
  proc: use generic setattr() for /proc/$PID/net
  selftests/nolibc: drop test chmod_net
2023-08-28 11:43:19 -07:00
Linus Torvalds
2e0afa7e78 v6.6-vfs.autofs
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZOXUDgAKCRCRxhvAZXjc
 ogplAQCYXt+zcfs1GMhCUtPFzyyCwNsraMNzVwTdFbMz4R1JuQD9HL82VKyvMwmZ
 uo6uGVd9xN6cEy61Lpz9K8dn59uVAQE=
 =851o
 -----END PGP SIGNATURE-----

Merge tag 'v6.6-vfs.autofs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull autofs fixes from Christian Brauner:
 "This fixes a memory leak in autofs reported by syzkaller and a missing
  conversion from uninterruptible to interruptible wake up when autofs
  is in catatonic mode"

* tag 'v6.6-vfs.autofs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  autofs: use wake_up() instead of wake_up_interruptible(()
  autofs: fix memory leak of waitqueues in autofs_catatonic_mode
2023-08-28 11:39:14 -07:00
Linus Torvalds
475d4df827 v6.6-vfs.fchmodat2
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZOXT7QAKCRCRxhvAZXjc
 ort3AP0VIK/oJk5skgjpinQrCfvtVz0XOtawuBtn0f1weIfb6AD9Hg1rqOKnQD5z
 dkvn3xaEr3gPOVzqU5SvFwVoCM0cMwA=
 =24Ha
 -----END PGP SIGNATURE-----

Merge tag 'v6.6-vfs.fchmodat2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull fchmodat2 system call from Christian Brauner:
 "This adds the fchmodat2() system call. It is a revised version of the
  fchmodat() system call, adding a missing flag argument. Support for
  both AT_SYMLINK_NOFOLLOW and AT_EMPTY_PATH are included.

  Adding this system call revision has been a longstanding request but
  so far has always fallen through the cracks. While the kernel
  implementation of fchmodat() does not have a flag argument the libc
  provided POSIX-compliant fchmodat(3) version does. Both glibc and musl
  have to implement a workaround in order to support AT_SYMLINK_NOFOLLOW
  (see [1] and [2]).

  The workaround is brittle because it relies not just on O_PATH and
  O_NOFOLLOW semantics and procfs magic links but also on our rather
  inconsistent symlink semantics.

  This gives userspace a proper fchmodat2() system call that libcs can
  use to properly implement fchmodat(3) and allows them to get rid of
  their hacks. In this case it will immediately benefit them as the
  current workaround is already defunct because of aformentioned
  inconsistencies.

  In addition to AT_SYMLINK_NOFOLLOW, give userspace the ability to use
  AT_EMPTY_PATH with fchmodat2(). This is already possible with
  fchownat() so there's no reason to not also support it for
  fchmodat2().

  The implementation is simple and comes with selftests. Implementation
  of the system call and wiring up the system call are done as separate
  patches even though they could arguably be one patch. But in case
  there are merge conflicts from other system call additions it can be
  beneficial to have separate patches"

Link: https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/fchmodat.c;h=17eca54051ee28ba1ec3f9aed170a62630959143;hb=a492b1e5ef7ab50c6fdd4e4e9879ea5569ab0a6c#l35 [1]
Link: https://git.musl-libc.org/cgit/musl/tree/src/stat/fchmodat.c?id=718f363bc2067b6487900eddc9180c84e7739f80#n28 [2]

* tag 'v6.6-vfs.fchmodat2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  selftests: fchmodat2: remove duplicate unneeded defines
  fchmodat2: add support for AT_EMPTY_PATH
  selftests: Add fchmodat2 selftest
  arch: Register fchmodat2, usually as syscall 452
  fs: Add fchmodat2()
  Non-functional cleanup of a "__user * filename"
2023-08-28 11:25:27 -07:00
Linus Torvalds
511fb5bafe v6.6-vfs.super
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZOXpbgAKCRCRxhvAZXjc
 oi8PAQCtXelGZHmTcmevsO8p4Qz7hFpkonZ/TnxKf+RdnlNgPgD+NWi+LoRBpaAj
 xk4z8SqJaTTP4WXrG5JZ6o7EQkUL8gE=
 =2e9I
 -----END PGP SIGNATURE-----

Merge tag 'v6.6-vfs.super' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull superblock updates from Christian Brauner:
 "This contains the super rework that was ready for this cycle. The
  first part changes the order of how we open block devices and allocate
  superblocks, contains various cleanups, simplifications, and a new
  mechanism to wait on superblock state changes.

  This unblocks work to ultimately limit the number of writers to a
  block device. Jan has already scheduled follow-up work that will be
  ready for v6.7 and allows us to restrict the number of writers to a
  given block device. That series builds on this work right here.

  The second part contains filesystem freezing updates.

  Overview:

  The generic superblock changes are rougly organized as follows
  (ignoring additional minor cleanups):

   (1) Removal of the bd_super member from struct block_device.

       This was a very odd back pointer to struct super_block with
       unclear rules. For all relevant places we have other means to get
       the same information so just get rid of this.

   (2) Simplify rules for superblock cleanup.

       Roughly, everything that is allocated during fs_context
       initialization and that's stored in fs_context->s_fs_info needs
       to be cleaned up by the fs_context->free() implementation before
       the superblock allocation function has been called successfully.

       After sget_fc() returned fs_context->s_fs_info has been
       transferred to sb->s_fs_info at which point sb->kill_sb() if
       fully responsible for cleanup. Adhering to these rules means that
       cleanup of sb->s_fs_info in fill_super() is to be avoided as it's
       brittle and inconsistent.

       Cleanup shouldn't be duplicated between sb->put_super() as
       sb->put_super() is only called if sb->s_root has been set aka
       when the filesystem has been successfully born (SB_BORN). That
       complexity should be avoided.

       This also means that block devices are to be closed in
       sb->kill_sb() instead of sb->put_super(). More details in the
       lower section.

   (3) Make it possible to lookup or create a superblock before opening
       block devices

       There's a subtle dependency on (2) as some filesystems did rely
       on fill_super() to be called in order to correctly clean up
       sb->s_fs_info. All these filesystems have been fixed.

   (4) Switch most filesystem to follow the same logic as the generic
       mount code now does as outlined in (3).

   (5) Use the superblock as the holder of the block device. We can now
       easily go back from block device to owning superblock.

   (6) Export and extend the generic fs_holder_ops and use them as
       holder ops everywhere and remove the filesystem specific holder
       ops.

   (7) Call from the block layer up into the filesystem layer when the
       block device is removed, allowing to shut down the filesystem
       without risk of deadlocks.

   (8) Get rid of get_super().

       We can now easily go back from the block device to owning
       superblock and can call up from the block layer into the
       filesystem layer when the device is removed. So no need to wade
       through all registered superblock to find the owning superblock
       anymore"

Link: https://lore.kernel.org/lkml/20230824-prall-intakt-95dbffdee4a0@brauner/

* tag 'v6.6-vfs.super' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (47 commits)
  super: use higher-level helper for {freeze,thaw}
  super: wait until we passed kill super
  super: wait for nascent superblocks
  super: make locking naming consistent
  super: use locking helpers
  fs: simplify invalidate_inodes
  fs: remove get_super
  block: call into the file system for ioctl BLKFLSBUF
  block: call into the file system for bdev_mark_dead
  block: consolidate __invalidate_device and fsync_bdev
  block: drop the "busy inodes on changed media" log message
  dasd: also call __invalidate_device when setting the device offline
  amiflop: don't call fsync_bdev in FDFMTBEG
  floppy: call disk_force_media_change when changing the format
  block: simplify the disk_force_media_change interface
  nbd: call blk_mark_disk_dead in nbd_clear_sock_ioctl
  xfs use fs_holder_ops for the log and RT devices
  xfs: drop s_umount over opening the log and RT devices
  ext4: use fs_holder_ops for the log device
  ext4: drop s_umount over opening the log device
  ...
2023-08-28 11:04:18 -07:00
Linus Torvalds
de16588a77 v6.6-vfs.misc
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZOXTxQAKCRCRxhvAZXjc
 okaVAP94WAlItvDRt/z2Wtzf0+RqPZeTXEdGTxua8+RxqCyYIQD+OO5nRfKQPHlV
 AqqGJMKItQMSMIYgB5ftqVhNWZfnHgM=
 =pSEW
 -----END PGP SIGNATURE-----

Merge tag 'v6.6-vfs.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull misc vfs updates from Christian Brauner:
 "This contains the usual miscellaneous features, cleanups, and fixes
  for vfs and individual filesystems.

  Features:

   - Block mode changes on symlinks and rectify our broken semantics

   - Report file modifications via fsnotify() for splice

   - Allow specifying an explicit timeout for the "rootwait" kernel
     command line option. This allows to timeout and reboot instead of
     always waiting indefinitely for the root device to show up

   - Use synchronous fput for the close system call

  Cleanups:

   - Get rid of open-coded lockdep workarounds for async io submitters
     and replace it all with a single consolidated helper

   - Simplify epoll allocation helper

   - Convert simple_write_begin and simple_write_end to use a folio

   - Convert page_cache_pipe_buf_confirm() to use a folio

   - Simplify __range_close to avoid pointless locking

   - Disable per-cpu buffer head cache for isolated cpus

   - Port ecryptfs to kmap_local_page() api

   - Remove redundant initialization of pointer buf in pipe code

   - Unexport the d_genocide() function which is only used within core
     vfs

   - Replace printk(KERN_ERR) and WARN_ON() with WARN()

  Fixes:

   - Fix various kernel-doc issues

   - Fix refcount underflow for eventfds when used as EFD_SEMAPHORE

   - Fix a mainly theoretical issue in devpts

   - Check the return value of __getblk() in reiserfs

   - Fix a racy assert in i_readcount_dec

   - Fix integer conversion issues in various functions

   - Fix LSM security context handling during automounts that prevented
     NFS superblock sharing"

* tag 'v6.6-vfs.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (39 commits)
  cachefiles: use kiocb_{start,end}_write() helpers
  ovl: use kiocb_{start,end}_write() helpers
  aio: use kiocb_{start,end}_write() helpers
  io_uring: use kiocb_{start,end}_write() helpers
  fs: create kiocb_{start,end}_write() helpers
  fs: add kerneldoc to file_{start,end}_write() helpers
  io_uring: rename kiocb_end_write() local helper
  splice: Convert page_cache_pipe_buf_confirm() to use a folio
  libfs: Convert simple_write_begin and simple_write_end to use a folio
  fs/dcache: Replace printk and WARN_ON by WARN
  fs/pipe: remove redundant initialization of pointer buf
  fs: Fix kernel-doc warnings
  devpts: Fix kernel-doc warnings
  doc: idmappings: fix an error and rephrase a paragraph
  init: Add support for rootwait timeout parameter
  vfs: fix up the assert in i_readcount_dec
  fs: Fix one kernel-doc comment
  docs: filesystems: idmappings: clarify from where idmappings are taken
  fs/buffer.c: disable per-CPU buffer_head cache for isolated CPUs
  vfs, security: Fix automount superblock LSM init problem, preventing NFS sb sharing
  ...
2023-08-28 10:17:14 -07:00
Linus Torvalds
ecd7db2047 v6.6-vfs.tmpfs
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZOXTkgAKCRCRxhvAZXjc
 ouZsAPwNBHB2aPKtzWURuKx5RX02vXTzHX+A/LpuDz5WBFe8zQD+NlaBa4j0MBtS
 rVYM+CjOXnjnsLc8W0euMnfYNvViKgQ=
 =L2+2
 -----END PGP SIGNATURE-----

Merge tag 'v6.6-vfs.tmpfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull libfs and tmpfs updates from Christian Brauner:
 "This cycle saw a lot of work for tmpfs that required changes to the
  vfs layer. Andrew, Hugh, and I decided to take tmpfs through vfs this
  cycle. Things will go back to mm next cycle.

  Features
  ========

   - By far the biggest work is the quota support for tmpfs. New tmpfs
     quota infrastructure is added to support it and a new QFMT_SHMEM
     uapi option is exposed.

     This offers user and group quotas to tmpfs (project quotas will be
     added later). Similar to other filesystems tmpfs quota are not
     supported within user namespaces yet.

   - Add support for user xattrs. While tmpfs already supports security
     xattrs (security.*) and POSIX ACLs for a long time it lacked
     support for user xattrs (user.*). With this pull request tmpfs will
     be able to support a limited number of user xattrs.

     This is accompanied by a fix (see below) to limit persistent simple
     xattr allocations.

   - Add support for stable directory offsets. Currently tmpfs relies on
     the libfs provided cursor-based mechanism for readdir. This causes
     issues when a tmpfs filesystem is exported via NFS.

     NFS clients do not open directories. Instead, each server-side
     readdir operation opens the directory, reads it, and then closes
     it. Since the cursor state for that directory is associated with
     the opened file it is discarded after each readdir operation. Such
     directory offsets are not just cached by NFS clients but also
     various userspace libraries based on these clients.

     As it stands there is no way to invalidate the caches when
     directory offsets have changed and the whole application depends on
     unchanging directory offsets.

     At LSFMM we discussed how to solve this problem and decided to
     support stable directory offsets. libfs now allows filesystems like
     tmpfs to use an xarrary to map a directory offset to a dentry. This
     mechanism is currently only used by tmpfs but can be supported by
     others as well.

  Fixes
  =====

   - Change persistent simple xattrs allocations in libfs from
     GFP_KERNEL to GPF_KERNEL_ACCOUNT so they're subject to memory
     cgroup limits. Since this is a change to libfs it affects both
     tmpfs and kernfs.

   - Correctly verify {g,u}id mount options.

     A new filesystem context is created via fsopen() which records the
     namespace that becomes the owning namespace of the superblock when
     fsconfig(FSCONFIG_CMD_CREATE) is called for filesystems that are
     mountable in namespaces. However, fsconfig() calls can occur in a
     namespace different from the namespace where fsopen() has been
     called.

     Currently, when fsconfig() is called to set {g,u}id mount options
     the requested {g,u}id is mapped into a k{g,u}id according to the
     namespace where fsconfig() was called from. The resulting k{g,u}id
     is not guaranteed to be resolvable in the namespace of the
     filesystem (the one that fsopen() was called in).

     This means it's possible for an unprivileged user to create files
     owned by any group in a tmpfs mount since it's possible to set the
     setid bits on the tmpfs directory.

     The contract for {g,u}id mount options and {g,u}id values in
     general set from userspace has always been that they are translated
     according to the caller's idmapping. In so far, tmpfs has been
     doing the correct thing. But since tmpfs is mountable in
     unprivileged contexts it is also necessary to verify that the
     resulting {k,g}uid is representable in the namespace of the
     superblock to avoid such bugs.

     The new mount api's cross-namespace delegation abilities are
     already widely used. Having talked to a bunch of userspace this is
     the most faithful solution with minimal regression risks"

* tag 'v6.6-vfs.tmpfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  tmpfs,xattr: GFP_KERNEL_ACCOUNT for simple xattrs
  mm: invalidation check mapping before folio_contains
  tmpfs: trivial support for direct IO
  tmpfs,xattr: enable limited user extended attributes
  tmpfs: track free_ispace instead of free_inodes
  xattr: simple_xattr_set() return old_xattr to be freed
  tmpfs: verify {g,u}id mount options correctly
  shmem: move spinlock into shmem_recalc_inode() to fix quota support
  libfs: Remove parent dentry locking in offset_iterate_dir()
  libfs: Add a lock class for the offset map's xa_lock
  shmem: stable directory offsets
  shmem: Refactor shmem_symlink()
  libfs: Add directory operations for stable offsets
  shmem: fix quota lock nesting in huge hole handling
  shmem: Add default quota limit mount options
  shmem: quota support
  shmem: prepare shmem quota infrastructure
  quota: Check presence of quota operation structures instead of ->quota_read and ->quota_write callbacks
  shmem: make shmem_get_inode() return ERR_PTR instead of NULL
  shmem: make shmem_inode_acct_block() return error
2023-08-28 09:55:25 -07:00
Linus Torvalds
615e95831e v6.6-vfs.ctime
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZOXTKAAKCRCRxhvAZXjc
 oifJAQCzi/p+AdQu8LA/0XvR7fTwaq64ZDCibU4BISuLGT2kEgEAuGbuoFZa0rs2
 XYD/s4+gi64p9Z01MmXm2XO1pu3GPg0=
 =eJz5
 -----END PGP SIGNATURE-----

Merge tag 'v6.6-vfs.ctime' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs timestamp updates from Christian Brauner:
 "This adds VFS support for multi-grain timestamps and converts tmpfs,
  xfs, ext4, and btrfs to use them. This carries acks from all relevant
  filesystems.

  The VFS always uses coarse-grained timestamps when updating the ctime
  and mtime after a change. This has the benefit of allowing filesystems
  to optimize away a lot of metadata updates, down to around 1 per
  jiffy, even when a file is under heavy writes.

  Unfortunately, this has always been an issue when we're exporting via
  NFSv3, which relies on timestamps to validate caches. A lot of changes
  can happen in a jiffy, so timestamps aren't sufficient to help the
  client decide to invalidate the cache.

  Even with NFSv4, a lot of exported filesystems don't properly support
  a change attribute and are subject to the same problems with timestamp
  granularity. Other applications have similar issues with timestamps
  (e.g., backup applications).

  If we were to always use fine-grained timestamps, that would improve
  the situation, but that becomes rather expensive, as the underlying
  filesystem would have to log a lot more metadata updates.

  This introduces fine-grained timestamps that are used when they are
  actively queried.

  This uses the 31st bit of the ctime tv_nsec field to indicate that
  something has queried the inode for the mtime or ctime. When this flag
  is set, on the next mtime or ctime update, the kernel will fetch a
  fine-grained timestamp instead of the usual coarse-grained one.

  As POSIX generally mandates that when the mtime changes, the ctime
  must also change the kernel always stores normalized ctime values, so
  only the first 30 bits of the tv_nsec field are ever used.

  Filesytems can opt into this behavior by setting the FS_MGTIME flag in
  the fstype. Filesystems that don't set this flag will continue to use
  coarse-grained timestamps.

  Various preparatory changes, fixes and cleanups are included:

   - Fixup all relevant places where POSIX requires updating ctime
     together with mtime. This is a wide-range of places and all
     maintainers provided necessary Acks.

   - Add new accessors for inode->i_ctime directly and change all
     callers to rely on them. Plain accesses to inode->i_ctime are now
     gone and it is accordingly rename to inode->__i_ctime and commented
     as requiring accessors.

   - Extend generic_fillattr() to pass in a request mask mirroring in a
     sense the statx() uapi. This allows callers to pass in a request
     mask to only get a subset of attributes filled in.

   - Rework timestamp updates so it's possible to drop the @now
     parameter the update_time() inode operation and associated helpers.

   - Add inode_update_timestamps() and convert all filesystems to it
     removing a bunch of open-coding"

* tag 'v6.6-vfs.ctime' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (107 commits)
  btrfs: convert to multigrain timestamps
  ext4: switch to multigrain timestamps
  xfs: switch to multigrain timestamps
  tmpfs: add support for multigrain timestamps
  fs: add infrastructure for multigrain timestamps
  fs: drop the timespec64 argument from update_time
  xfs: have xfs_vn_update_time gets its own timestamp
  fat: make fat_update_time get its own timestamp
  fat: remove i_version handling from fat_update_time
  ubifs: have ubifs_update_time use inode_update_timestamps
  btrfs: have it use inode_update_timestamps
  fs: drop the timespec64 arg from generic_update_time
  fs: pass the request_mask to generic_fillattr
  fs: remove silly warning from current_time
  gfs2: fix timestamp handling on quota inodes
  fs: rename i_ctime field to __i_ctime
  selinux: convert to ctime accessor functions
  security: convert to ctime accessor functions
  apparmor: convert to ctime accessor functions
  sunrpc: convert to ctime accessor functions
  ...
2023-08-28 09:31:32 -07:00
Linus Torvalds
84ab1277ce v6.6-vfs.fs_context
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZOXUHQAKCRCRxhvAZXjc
 opWuAQC5wYyKWMwpxc3GaGcHiC7nq0uyYCcVgzeebsw1eGzFvgD9FoYRphC2pqi1
 p8qUexEK2aOZmPquFWmRDTRMcZ23YAk=
 =UKnx
 -----END PGP SIGNATURE-----

Merge tag 'v6.6-vfs.fs_context' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull mount API updates from Christian Brauner:
 "This introduces FSCONFIG_CMD_CREATE_EXCL which allows userspace to
  implement something like

      $ mount -t ext4 --exclusive /dev/sda /B

  which fails if a superblock for the requested filesystem does already
  exist instead of silently reusing an existing superblock.

  Without it, in the sequence

      $ move-mount -f xfs -o       source=/dev/sda4 /A
      $ move-mount -f xfs -o noacl,source=/dev/sda4 /B

  the initial mounter will create a superblock. The second mounter will
  reuse the existing superblock, creating a bind-mount (see [1] for the
  source of the move-mount binary).

  The problem is that reusing an existing superblock means all mount
  options other than read-only and read-write will be silently ignored
  even if they are incompatible requests. For example, the second mount
  has requested no POSIX ACL support but since the existing superblock
  is reused POSIX ACL support will remain enabled.

  Such silent superblock reuse can easily become a security issue.

  After adding support for FSCONFIG_CMD_CREATE_EXCL to mount(8) in
  util-linux this can be fixed:

      $ move-mount -f xfs --exclusive -o       source=/dev/sda4 /A
      $ move-mount -f xfs --exclusive -o noacl,source=/dev/sda4 /B
      Device or resource busy | move-mount.c: 300: do_fsconfig: i xfs: reusing existing filesystem not allowed

  This requires the new mount api. With the old mount api it would be
  necessary to plumb this through every legacy filesystem's
  file_system_type->mount() method. If they want this feature they are
  most welcome to switch to the new mount api"

Link: https://github.com/brauner/move-mount-beneath [1]
Link: https://lore.kernel.org/linux-block/20230704-fasching-wertarbeit-7c6ffb01c83d@brauner
Link: https://lore.kernel.org/linux-block/20230705-pumpwerk-vielversprechend-a4b1fd947b65@brauner
Link: https://lore.kernel.org/linux-fsdevel/20230725-einnahmen-warnschilder-17779aec0a97@brauner
Link: https://lore.kernel.org/lkml/20230824-anzog-allheilmittel-e8c63e429a79@brauner/

* tag 'v6.6-vfs.fs_context' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  fs: add FSCONFIG_CMD_CREATE_EXCL
  fs: add vfs_cmd_reconfigure()
  fs: add vfs_cmd_create()
  super: remove get_tree_single_reconf()
2023-08-28 09:00:09 -07:00
Baokun Li
768d612f79 ext4: fix slab-use-after-free in ext4_es_insert_extent()
Yikebaer reported an issue:
==================================================================
BUG: KASAN: slab-use-after-free in ext4_es_insert_extent+0xc68/0xcb0
fs/ext4/extents_status.c:894
Read of size 4 at addr ffff888112ecc1a4 by task syz-executor/8438

CPU: 1 PID: 8438 Comm: syz-executor Not tainted 6.5.0-rc5 #1
Call Trace:
 [...]
 kasan_report+0xba/0xf0 mm/kasan/report.c:588
 ext4_es_insert_extent+0xc68/0xcb0 fs/ext4/extents_status.c:894
 ext4_map_blocks+0x92a/0x16f0 fs/ext4/inode.c:680
 ext4_alloc_file_blocks.isra.0+0x2df/0xb70 fs/ext4/extents.c:4462
 ext4_zero_range fs/ext4/extents.c:4622 [inline]
 ext4_fallocate+0x251c/0x3ce0 fs/ext4/extents.c:4721
 [...]

Allocated by task 8438:
 [...]
 kmem_cache_zalloc include/linux/slab.h:693 [inline]
 __es_alloc_extent fs/ext4/extents_status.c:469 [inline]
 ext4_es_insert_extent+0x672/0xcb0 fs/ext4/extents_status.c:873
 ext4_map_blocks+0x92a/0x16f0 fs/ext4/inode.c:680
 ext4_alloc_file_blocks.isra.0+0x2df/0xb70 fs/ext4/extents.c:4462
 ext4_zero_range fs/ext4/extents.c:4622 [inline]
 ext4_fallocate+0x251c/0x3ce0 fs/ext4/extents.c:4721
 [...]

Freed by task 8438:
 [...]
 kmem_cache_free+0xec/0x490 mm/slub.c:3823
 ext4_es_try_to_merge_right fs/ext4/extents_status.c:593 [inline]
 __es_insert_extent+0x9f4/0x1440 fs/ext4/extents_status.c:802
 ext4_es_insert_extent+0x2ca/0xcb0 fs/ext4/extents_status.c:882
 ext4_map_blocks+0x92a/0x16f0 fs/ext4/inode.c:680
 ext4_alloc_file_blocks.isra.0+0x2df/0xb70 fs/ext4/extents.c:4462
 ext4_zero_range fs/ext4/extents.c:4622 [inline]
 ext4_fallocate+0x251c/0x3ce0 fs/ext4/extents.c:4721
 [...]
==================================================================

The flow of issue triggering is as follows:
1. remove es
      raw es               es  removed  es1
|-------------------| -> |----|.......|------|

2. insert es
  es   insert   es1      merge with es  es1     merge with es and free es1
|----|.......|------| -> |------------|------| -> |-------------------|

es merges with newes, then merges with es1, frees es1, then determines
if es1->es_len is 0 and triggers a UAF.

The code flow is as follows:
ext4_es_insert_extent
  es1 = __es_alloc_extent(true);
  es2 = __es_alloc_extent(true);
  __es_remove_extent(inode, lblk, end, NULL, es1)
    __es_insert_extent(inode, &newes, es1) ---> insert es1 to es tree
  __es_insert_extent(inode, &newes, es2)
    ext4_es_try_to_merge_right
      ext4_es_free_extent(inode, es1) --->  es1 is freed
  if (es1 && !es1->es_len)
    // Trigger UAF by determining if es1 is used.

We determine whether es1 or es2 is used immediately after calling
__es_remove_extent() or __es_insert_extent() to avoid triggering a
UAF if es1 or es2 is freed.

Reported-by: Yikebaer Aizezi <yikebaer61@gmail.com>
Closes: https://lore.kernel.org/lkml/CALcu4raD4h9coiyEBL4Bm0zjDwxC2CyPiTwsP3zFuhot6y9Beg@mail.gmail.com
Fixes: 2a69c45008 ("ext4: using nofail preallocation in ext4_es_insert_extent()")
Cc: stable@kernel.org
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230815070808.3377171-1-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-08-27 11:27:13 -04:00
Eric Biggers
af494af385 libfs: remove redundant checks of s_encoding
Now that neither ext4 nor f2fs allows inodes with the casefold flag to
be instantiated when unsupported, it's unnecessary to repeatedly check
for support later on during random filesystem operations.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20230814182903.37267-4-ebiggers@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-08-27 11:27:13 -04:00
Eric Biggers
b814279395 ext4: remove redundant checks of s_encoding
Now that ext4 does not allow inodes with the casefold flag to be
instantiated when unsupported, it's unnecessary to repeatedly check for
support later on during random filesystem operations.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20230814182903.37267-3-ebiggers@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-08-27 11:27:13 -04:00
Eric Biggers
8216776ccf ext4: reject casefold inode flag without casefold feature
It is invalid for the casefold inode flag to be set without the casefold
superblock feature flag also being set.  e2fsck already considers this
case to be invalid and handles it by offering to clear the casefold flag
on the inode.  __ext4_iget() also already considered this to be invalid,
sort of, but it only got so far as logging an error message; it didn't
actually reject the inode.  Make it reject the inode so that other code
doesn't have to handle this case.  This matches what f2fs does.

Note: we could check 's_encoding != NULL' instead of
ext4_has_feature_casefold().  This would make the check robust against
the casefold feature being enabled by userspace writing to the page
cache of the mounted block device.  However, it's unsolvable in general
for filesystems to be robust against concurrent writes to the page cache
of the mounted block device.  Though this very particular scenario
involving the casefold feature is solvable, we should not pretend that
we can support this model, so let's just check the casefold feature.
tune2fs already forbids enabling casefold on a mounted filesystem.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20230814182903.37267-2-ebiggers@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-08-27 11:27:13 -04:00
Ruan Jinjie
0f6bc57971 ext4: use LIST_HEAD() to initialize the list_head in mballoc.c
Use LIST_HEAD() to initialize the list_head instead of open-coding it.

Signed-off-by: Ruan Jinjie <ruanjinjie@huawei.com>
Link: https://lore.kernel.org/r/20230812071839.3481909-1-ruanjinjie@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-08-27 11:27:13 -04:00
Liu Song
03de20bed2 ext4: do not mark inode dirty every time when appending using delalloc
In the delalloc append write scenario, if inode's i_size is extended due
to buffer write, there are delalloc writes pending in the range up to
i_size, and no need to touch i_disksize since writeback will push
i_disksize up to i_size eventually. Offers significant performance
improvement in high-frequency append write scenarios.

I conducted tests in my 32-core environment by launching 32 concurrent
threads to append write to the same file. Each write operation had a
length of 1024 bytes and was repeated 100000 times. Without using this
patch, the test was completed in 7705 ms. However, with this patch, the
test was completed in 5066 ms, resulting in a performance improvement of
34%.

Moreover, in test scenarios of Kafka version 2.6.2, using packet size of
2K, with this patch resulted in a 10% performance improvement.

Signed-off-by: Liu Song <liusong@linux.alibaba.com>
Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230810154333.84921-1-liusong@linux.alibaba.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-08-27 11:27:13 -04:00
Theodore Ts'o
bb15cea20f ext4: rename s_error_work to s_sb_upd_work
The most common use that s_error_work will get scheduled is now the
periodic update of the superblock.  So rename it to s_sb_upd_work.

Also rename the function flush_stashed_error_work() to
update_super_work().

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-08-27 11:27:12 -04:00
Vitaliy Kuznetsov
ff0722de89 ext4: add periodic superblock update check
This patch introduces a mechanism to periodically check and update
the superblock within the ext4 file system. The main purpose of this
patch is to keep the disk superblock up to date. The update will be
performed if more than one hour has passed since the last update, and
if more than 16MB of data have been written to disk.

This check and update is performed within the ext4_journal_commit_callback
function, ensuring that the superblock is written while the disk is
active, rather than based on a timer that may trigger during disk idle
periods.

Discussion https://www.spinics.net/lists/linux-ext4/msg85865.html

Signed-off-by: Vitaliy Kuznetsov <vk.en.mail@gmail.com>
Link: https://lore.kernel.org/r/20230810143852.40228-1-vk.en.mail@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-08-27 11:27:12 -04:00
Brian Foster
194505b55d ext4: drop dio overwrite only flag and associated warning
The commit referenced below opened up concurrent unaligned dio under
shared locking for pure overwrites. In doing so, it enabled use of
the IOMAP_DIO_OVERWRITE_ONLY flag and added a warning on unexpected
-EAGAIN returns as an extra precaution, since ext4 does not retry
writes in such cases. The flag itself is advisory in this case since
ext4 checks for unaligned I/Os and uses appropriate locking up
front, rather than on a retry in response to -EAGAIN.

As it turns out, the warning check is susceptible to false positives
because there are scenarios where -EAGAIN can be expected from lower
layers without necessarily having IOCB_NOWAIT set on the iocb. For
example, one instance of the warning has been seen where io_uring
sets IOCB_HIPRI, which in turn results in REQ_POLLED|REQ_NOWAIT on
the bio. This results in -EAGAIN if the block layer is unable to
allocate a request, etc. [Note that there is an outstanding patch to
untangle REQ_POLLED and REQ_NOWAIT such that the latter relies on
IOCB_NOWAIT, which would also address this instance of the warning.]

Another instance of the warning has been reproduced by syzbot. A dio
write is interrupted down in __get_user_pages_locked() waiting on
the mm lock and returns -EAGAIN up the stack. If the iomap dio
iteration layer has made no progress on the write to this point,
-EAGAIN returns up to the filesystem and triggers the warning.

This use of the overwrite flag in ext4 is precautionary and
half-baked. I.e., ext4 doesn't actually implement overwrite checking
in the iomap callbacks when the flag is set, so the only extra
verification it provides are i_size checks in the generic iomap dio
layer. Combined with the tendency for false positives, the added
verification is not worth the extra trouble. Remove the flag,
associated warning, and update the comments to document when
concurrent unaligned dio writes are allowed and why said flag is not
used.

Cc: stable@kernel.org
Reported-by: syzbot+5050ad0fb47527b1808a@syzkaller.appspotmail.com
Reported-by: Pengfei Xu <pengfei.xu@intel.com>
Fixes: 310ee0902b ("ext4: allow concurrent unaligned dio overwrites")
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230810165559.946222-1-bfoster@redhat.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-08-27 11:27:12 -04:00
Wang Jianjian
68228da51c ext4: add correct group descriptors and reserved GDT blocks to system zone
When setup_system_zone, flex_bg is not initialized so it is always 1.
Use a new helper function, ext4_num_base_meta_blocks() which does not
depend on sbi->s_log_groups_per_flex being initialized.

[ Squashed two patches in the Link URL's below together into a single
  commit, which is simpler to review/understand.  Also fix checkpatch
  warnings. --TYT ]

Cc: stable@kernel.org
Signed-off-by: Wang Jianjian <wangjianjian0@foxmail.com>
Link: https://lore.kernel.org/r/tencent_21AF0D446A9916ED5C51492CC6C9A0A77B05@qq.com
Link: https://lore.kernel.org/r/tencent_D744D1450CC169AEA77FCF0A64719909ED05@qq.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-08-27 11:27:12 -04:00
Cai Xinchen
b6c7d6dc8a ext4: remove unused function declaration
These functions do not have its function implementation.
So those function declaration is useless. Remove these

Signed-off-by: Cai Xinchen <caixinchen1@huawei.com>
Link: https://lore.kernel.org/r/20230802030025.173148-1-caixinchen1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-08-27 11:27:12 -04:00
Su Hui
a50bda1474 ext4: mballoc: avoid garbage value from err
clang's static analysis warning: fs/ext4/mballoc.c
line 4178, column 6, Branch condition evaluates to a garbage value.

err is uninitialized and will be judged when 'len <= 0' or
it first enters the loop while the condition "!ext4_sb_block_valid()"
is true. Although this can't make problems now, it's better to
correct it.

Signed-off-by: Su Hui <suhui@nfschina.com>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Link: https://lore.kernel.org/r/20230725043310.1227621-1-suhui@nfschina.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-08-27 11:27:12 -04:00
Lu Hongfei
79ebf48c44 ext4: use sbi instead of EXT4_SB(sb) in ext4_mb_new_blocks_simple()
Signed-off-by: Lu Hongfei <luhongfei@vivo.com>
Link: https://lore.kernel.org/r/20230707115907.26637-1-luhongfei@vivo.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-08-27 11:27:12 -04:00
Lu Hongfei
89cadf6e22 ext4: change the type of blocksize in ext4_mb_init_cache()
The return value type of i_blocksize() is 'unsigned int', so the
type of blocksize has been modified from 'int' to 'unsigned int'
to ensure data type consistency.

Signed-off-by: Lu Hongfei <luhongfei@vivo.com>
Link: https://lore.kernel.org/r/20230707105516.9156-1-luhongfei@vivo.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-08-27 11:27:12 -04:00
Zhihao Cheng
1524773425 ext4: fix unttached inode after power cut with orphan file feature enabled
Running generic/475(filesystem consistent tests after power cut) could
easily trigger unattached inode error while doing fsck:
  Unattached zero-length inode 39405.  Clear? no

  Unattached inode 39405
  Connect to /lost+found? no

Above inconsistence is caused by following process:
       P1                       P2
ext4_create
 inode = ext4_new_inode_start_handle  // itable records nlink=1
 ext4_add_nondir
   err = ext4_add_entry  // ENOSPC
    ext4_append
     ext4_bread
      ext4_getblk
       ext4_map_blocks // returns ENOSPC
   drop_nlink(inode) // won't be updated into disk inode
   ext4_orphan_add(handle, inode)
    ext4_orphan_file_add
 ext4_journal_stop(handle)
		      jbd2_journal_commit_transaction // commit success
              >> power cut <<
ext4_fill_super
 ext4_load_and_init_journal   // itable records nlink=1
 ext4_orphan_cleanup
  ext4_process_orphan
   if (inode->i_nlink)        // true, inode won't be deleted

Then, allocated inode will be reserved on disk and corresponds to no
dentries, so e2fsck reports 'unattached inode' problem.

The problem won't happen if orphan file feature is disabled, because
ext4_orphan_add() will update disk inode in orphan list mode. There
are several places not updating disk inode while putting inode into
orphan area, such as ext4_add_nondir(), ext4_symlink() and whiteout
in ext4_rename(). Fix it by updating inode into disk in all error
branches of these places.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=217605
Fixes: 02f310fcf4 ("ext4: Speedup ext4 orphan inode handling")
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230628132011.650383-1-chengzhihao1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-08-27 11:27:12 -04:00
Zhang Yi
2dfba3bb40 jbd2: correct the end of the journal recovery scan range
We got a filesystem inconsistency issue below while running generic/475
I/O failure pressure test with fast_commit feature enabled.

 Symlink /p3/d3/d1c/d6c/dd6/dce/l101 (inode #132605) is invalid.

If fast_commit feature is enabled, a special fast_commit journal area is
appended to the end of the normal journal area. The journal->j_last
point to the first unused block behind the normal journal area instead
of the whole log area, and the journal->j_fc_last point to the first
unused block behind the fast_commit journal area. While doing journal
recovery, do_one_pass(PASS_SCAN) should first scan the normal journal
area and turn around to the first block once it meet journal->j_last,
but the wrap() macro misuse the journal->j_fc_last, so the recovering
could not read the next magic block (commit block perhaps) and would end
early mistakenly and missing tN and every transaction after it in the
following example. Finally, it could lead to filesystem inconsistency.

 | normal journal area                             | fast commit area |
 +-------------------------------------------------+------------------+
 | tN(rere) | tN+1 |~| tN-x |...| tN-1 | tN(front) |       ....       |
 +-------------------------------------------------+------------------+
                     /                             /                  /
                start               journal->j_last journal->j_fc_last

This patch fix it by use the correct ending journal->j_last.

Fixes: 5b849b5f96 ("jbd2: fast commit recovery path")
Cc: stable@kernel.org
Reported-by: Theodore Ts'o <tytso@mit.edu>
Link: https://lore.kernel.org/linux-ext4/20230613043120.GB1584772@mit.edu/
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230626073322.3956567-1-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-08-27 11:27:12 -04:00
Zhang Yi
ee5c807137 ext4: ext4_get_{dev}_journal return proper error value
ext4_get_journal() and ext4_get_dev_journal() return NULL if they failed
to init journal, making them return proper error value instead, also
rename them to ext4_open_{inode,dev}_journal().

[ Folded fix to ext4_calculate_overhead() to check for an ERR_PTR
  instead of NULL. ]

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230811063610.2980059-13-yi.zhang@huaweicloud.com
Reported-by: syzbot+b3123e6d9842e526de39@syzkaller.appspotmail.com
Link: https://lore.kernel.org/r/20230826011029.2023140-1-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-08-27 11:13:39 -04:00
Linus Torvalds
6f0edbb833 18 hotfixes. 13 are cc:stable and the remainder pertain to post-6.4 issues
or aren't considered suitable for a -stable backport.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZOjuGgAKCRDdBJ7gKXxA
 jkLlAQDY9sYxhQZp1PFLirUIPeOBjEyifVy6L6gCfk9j0snLggEA2iK+EtuJt2Dc
 SlMfoTq29zyU/YgfKKwZEVKtPJZOHQU=
 =oTcj
 -----END PGP SIGNATURE-----

Merge tag 'mm-hotfixes-stable-2023-08-25-11-07' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull misc fixes from Andrew Morton:
 "18 hotfixes. 13 are cc:stable and the remainder pertain to post-6.4
  issues or aren't considered suitable for a -stable backport"

* tag 'mm-hotfixes-stable-2023-08-25-11-07' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  shmem: fix smaps BUG sleeping while atomic
  selftests: cachestat: catch failing fsync test on tmpfs
  selftests: cachestat: test for cachestat availability
  maple_tree: disable mas_wr_append() when other readers are possible
  madvise:madvise_free_pte_range(): don't use mapcount() against large folio for sharing check
  madvise:madvise_free_huge_pmd(): don't use mapcount() against large folio for sharing check
  madvise:madvise_cold_or_pageout_pte_range(): don't use mapcount() against large folio for sharing check
  mm: multi-gen LRU: don't spin during memcg release
  mm: memory-failure: fix unexpected return value in soft_offline_page()
  radix tree: remove unused variable
  mm: add a call to flush_cache_vmap() in vmap_pfn()
  selftests/mm: FOLL_LONGTERM need to be updated to 0x100
  nilfs2: fix general protection fault in nilfs_lookup_dirty_data_buffers()
  mm/gup: handle cont-PTE hugetlb pages correctly in gup_must_unshare() via GUP-fast
  selftests: cgroup: fix test_kmem_basic less than error
  mm: enable page walking API to lock vmas during the walk
  smaps: use vm_normal_page_pmd() instead of follow_trans_huge_pmd()
  mm/gup: reintroduce FOLL_NUMA as FOLL_HONOR_NUMA_FAULT
2023-08-25 11:44:43 -07:00
Daeho Jeong
3b71661214 f2fs: use finish zone command when closing a zone
Use the finish zone command first when a zone should be closed.

Signed-off-by: Daeho Jeong <daehojeong@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2023-08-25 10:30:37 -07:00
Alexander Aring
7c53e847ff dlm: fix plock lookup when using multiple lockspaces
All posix lock ops, for all lockspaces (gfs2 file systems) are
sent to userspace (dlm_controld) through a single misc device.
The dlm_controld daemon reads the ops from the misc device
and sends them to other cluster nodes using separate, per-lockspace
cluster api communication channels.  The ops for a single lockspace
are ordered at this level, so that the results are received in
the same sequence that the requests were sent.  When the results
are sent back to the kernel via the misc device, they are again
funneled through the single misc device for all lockspaces.  When
the dlm code in the kernel processes the results from the misc
device, these results will be returned in the same sequence that
the requests were sent, on a per-lockspace basis.  A recent change
in this request/reply matching code missed the "per-lockspace"
check (fsid comparison) when matching request and reply, so replies
could be incorrectly matched to requests from other lockspaces.

Cc: stable@vger.kernel.org
Reported-by: Barry Marson <bmarson@redhat.com>
Fixes: 57e2c2f2d9 ("fs: dlm: fix mismatch of plock results from userspace")
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2023-08-25 10:31:39 -05:00
Steve French
09ee7a3bf8 [SMB3] send channel sequence number in SMB3 requests after reconnects
The ChannelSequence field in the SMB3 header is supposed to be
increased after reconnect to allow the server to distinguish
requests from before and after the reconnect.  We had always
been setting it to zero.  There are cases where incrementing
ChannelSequence on requests after network reconnects can reduce
the chance of data corruptions.

See MS-SMB2 3.2.4.1 and 3.2.7.1

Signed-off-by: Steve French <stfrench@microsoft.com>
Cc: stable@vger.kernel.org # 5.16+
2023-08-24 23:37:06 -05:00
Oleg Nesterov
dce8f8ed1d document while_each_thread(), change first_tid() to use for_each_thread()
Add the comment to explain that while_each_thread(g,t) is not rcu-safe
unless g is stable (e.g.  current).  Even if g is a group leader and thus
can't exit before t, t or another sub-thread can exec and remove g from
the thread_group list.

The only lockless user of while_each_thread() is first_tid() and it is
fine in that it can't loop forever, yet for_each_thread() looks better and
I am going to change while_each_thread/next_thread.

Link: https://lkml.kernel.org/r/20230823170806.GA11724@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-24 16:25:15 -07:00
Matthew Wilcox (Oracle)
1d024e7a8d mm: remove enum page_entry_size
Remove the unnecessary encoding of page order into an enum and pass the
page order directly.  That lets us get rid of pe_order().

The switch constructs have to be changed to if/else constructs to prevent
GCC from warning on builds with 3-level page tables where PMD_ORDER and
PUD_ORDER have the same value.

If you are looking at this commit because your driver stopped compiling,
look at the previous commit as well and audit your driver to be sure it
doesn't depend on mmap_lock being held in its ->huge_fault method.

[willy@infradead.org: use "order %u" to match the (non dev_t) style]
  Link: https://lkml.kernel.org/r/ZOUYekbtTv+n8hYf@casper.infradead.org
Link: https://lkml.kernel.org/r/20230818202335.2739663-4-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-24 16:20:30 -07:00
Matthew Wilcox (Oracle)
051ddcfeb1 mm: move PMD_ORDER to pgtable.h
Patch series "Change calling convention for ->huge_fault", v2.

There are two unrelated changes to the calling convention for
->huge_fault.  I've bundled them together to help people notice the
change.  The first is to improve scalability of DAX page faults by
allowing them to be handled under the VMA lock.  The second is to remove
enum page_entry_size since it's really unnecessary.  The changelogs and
documentation updates hopefully work to that end.


This patch (of 3):

Allow this to be used in generic code.  Also add PUD_ORDER.

Link: https://lkml.kernel.org/r/20230818202335.2739663-1-willy@infradead.org
Link: https://lkml.kernel.org/r/20230818202335.2739663-2-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-24 16:20:29 -07:00
Jann Horn
004a9a38e2 mm: userfaultfd: remove stale comment about core dump locking
Since commit 7f3bfab52c ("mm/gup: take mmap_lock in get_dump_page()"),
which landed in v5.10, core dumping doesn't enter fault handling without
holding the mmap_lock anymore.  Remove the stale parts of the comments,
but leave the behavior as-is - letting core dumping block on userfault
handling would be a bad idea and could lead to deadlocks if the dumping
process was handling its own userfaults.

Link: https://lkml.kernel.org/r/20230815212216.264445-1-jannh@google.com
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-24 16:20:27 -07:00
Matthew Wilcox (Oracle)
f9bff0e318 minmax: add in_range() macro
Patch series "New page table range API", v6.

This patchset changes the API used by the MM to set up page table entries.
The four APIs are:

    set_ptes(mm, addr, ptep, pte, nr)
    update_mmu_cache_range(vma, addr, ptep, nr)
    flush_dcache_folio(folio) 
    flush_icache_pages(vma, page, nr)

flush_dcache_folio() isn't technically new, but no architecture
implemented it, so I've done that for them.  The old APIs remain around
but are mostly implemented by calling the new interfaces.

The new APIs are based around setting up N page table entries at once. 
The N entries belong to the same PMD, the same folio and the same VMA, so
ptep++ is a legitimate operation, and locking is taken care of for you. 
Some architectures can do a better job of it than just a loop, but I have
hesitated to make too deep a change to architectures I don't understand
well.

One thing I have changed in every architecture is that PG_arch_1 is now a
per-folio bit instead of a per-page bit when used for dcache clean/dirty
tracking.  This was something that would have to happen eventually, and it
makes sense to do it now rather than iterate over every page involved in a
cache flush and figure out if it needs to happen.

The point of all this is better performance, and Fengwei Yin has measured
improvement on x86.  I suspect you'll see improvement on your architecture
too.  Try the new will-it-scale test mentioned here:
https://lore.kernel.org/linux-mm/20230206140639.538867-5-fengwei.yin@intel.com/
You'll need to run it on an XFS filesystem and have
CONFIG_TRANSPARENT_HUGEPAGE set.

This patchset is the basis for much of the anonymous large folio work
being done by Ryan, so it's received quite a lot of testing over the last
few months.


This patch (of 38):

Determine if a value lies within a range more efficiently (subtraction +
comparison vs two comparisons and an AND).  It also has useful (under some
circumstances) behaviour if the range exceeds the maximum value of the
type.  Convert all the conflicting definitions of in_range() within the
kernel; some can use the generic definition while others need their own
definition.

Link: https://lkml.kernel.org/r/20230802151406.3735276-1-willy@infradead.org
Link: https://lkml.kernel.org/r/20230802151406.3735276-2-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-24 16:20:18 -07:00
Suren Baghdasaryan
29a22b9e08 mm: handle userfaults under VMA lock
Enable handle_userfault to operate under VMA lock by releasing VMA lock
instead of mmap_lock and retrying.  Note that FAULT_FLAG_RETRY_NOWAIT
should never be used when handling faults under per-VMA lock protection
because that would break the assumption that lock is dropped on retry.

[surenb@google.com: fix a lockdep issue in vma_assert_write_locked]
  Link: https://lkml.kernel.org/r/20230712195652.969194-1-surenb@google.com
Link: https://lkml.kernel.org/r/20230630211957.1341547-7-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Peter Xu <peterx@redhat.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Hillf Danton <hdanton@sina.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Laurent Dufour <ldufour@linux.ibm.com>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Michel Lespinasse <michel@lespinasse.org>
Cc: Minchan Kim <minchan@google.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Punit Agrawal <punit.agrawal@bytedance.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-24 16:20:17 -07:00
Linus Torvalds
f8d6ff4490 nfsd-6.5 fixes:
- Close race window when handling FREE_STATEID operations
 - Fix regression in /proc/fs/nfsd/v4_end_grace introduced in v6.5-rc
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmTntHEACgkQM2qzM29m
 f5cGgQ//ZvUW1Vvp+s86puw+EyEtwu15Ms19kTYNR+AKebpPz/c9K9iEF3nmZXEL
 bRn25fELtzXYi8rqavrv8fMj7dQhmkT3DE0WaJcTtCLD5N5bGDO3mQeoQ1fKGR1r
 rHITp0jC25Viur7kXhXU6qIcvu0VthK+feW/DMlkKsmSlQE5V4utxUGYZp8gfZGU
 7cbYRpCqF2J1bJSPxH/lKpg5ZHztpZW6aPXG7frHcg04qsfqrMRS0HqG8KYaAKXh
 BObBqSYDo8agOa3u361pBZoVZHF2/7gFXlZKIZdp+6F5/B1IjoN+7eWnI7hFxiH7
 zf5jLa9xlWrXr2vQTuPEJa9dCr756Ixzq7IJ7ZzIMOpVypixZ04jBLfnuhcnayu9
 8k/0CFqQwmfvIcXgJEpTJ+OKm0kDqI3n7WE9gkeYBkRewEvJQXaFZ/vqTYi7bp9H
 eWlwQ4bHE5touERBMp0HmDdct/ZdUn8dS6MDcdGFXrVf5m+Jt6hZCXTnpU3Ah+zF
 d0uK4IEwJ2yC9FhBqOYZ6+XBr1JA+40vdnHOBvKdpAzQnIdwnNa4rzR0Eab6+m4i
 fmhI63s9slPBcBMroRC0mhftcdkd7LjBhhWbsDu8nemKmmHcOKzwTda78EayQYnm
 /zJUVr8BqzkgaJG1PUn9y0g4IOfgTiokDmBdLu6bTAanRtekhVY=
 =BF07
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-6.5-5' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux

Pull nfsd fixes from Chuck Lever:
 "Two last-minute one-liners for v6.5-rc. One got lost in the shuffle,
  and the other was reported just this morning"

   - Close race window when handling FREE_STATEID operations

   - Fix regression in /proc/fs/nfsd/v4_end_grace introduced in v6.5-rc"

* tag 'nfsd-6.5-5' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
  NFSD: Fix a thinko introduced by recent trace point changes
  nfsd: Fix race to FREE_STATEID and cl_revoked
2023-08-24 14:30:47 -07:00
Olga Kornievskaia
51d674a5e4 NFSv4.1: use EXCHGID4_FLAG_USE_PNFS_DS for DS server
After receiving the location(s) of the DS server(s) in the
GETDEVINCEINFO, create the request for the clientid to such
server and indicate that the client is connecting to a DS.

Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-08-24 13:24:15 -04:00
Trond Myklebust
537935f72e NFS/pNFS: Set the connect timeout for the pNFS flexfiles driver
Ensure that the connect timeout for the pNFS flexfiles driver is of the
same order as the I/O timeout, so that we can fail over quickly when
trying to read from a data server that is down.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-08-24 13:24:15 -04:00
Trond Myklebust
88975a5596 NFS: Fix a potential data corruption
We must ensure that the subrequests are joined back into the head before
we can retransmit a request. If the head was not on the commit lists,
because the server wrote it synchronously, we still need to add it back
to the retransmission list.
Add a call that mirrors the effect of nfs_cancel_remove_inode() for
O_DIRECT.

Fixes: ed5d588fe4 ("NFS: Try to join page groups before an O_DIRECT retransmission")
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-08-24 13:24:15 -04:00
Kinglong Mee
14e7316a3c nfs: fix redundant readdir request after get eof
When a directory contains 17 files (except . and ..), nfs client sends
a redundant readdir request after get eof.

A simple reproduce,
At NFS server, create a directory with 17 files under exported directory.
 # mkdir test
 # cd test
 # for i in {0..16}  ; do touch $i; done

At NFS client, no matter mounting through nfsv3 or nfsv4,
does ls (or ll) at the created test directory.

A tshark output likes following (for nfsv4),

 # tshark -i eth0 tcp port 2049 -Tfields -e ip.src -e ip.dst -e nfs -e nfs.cookie4

srcip   dstip   SEQUENCE, PUTFH, READDIR        0
dstip   srcip   SEQUENCE PUTFH READDIR  909539109313539306,2108391201987888856,2305312124304486544,2566335452463141496,2978225129081509984,4263037479923412583,4304697173036510679,4666703455469210097,4759208201298769007,4776701232145978803,5338408478512081262,5949498658935544804,5971526429894832903,6294060338267709855,6528840566229532529,8600463293536422524,9223372036854775807
srcip   dstip
srcip   dstip   SEQUENCE, PUTFH, READDIR        9223372036854775807
dstip   srcip   SEQUENCE PUTFH READDIR

The READDIR with cookie 9223372036854775807(0x7FFFFFFFFFFFFFFF) is redundant.

Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-08-24 13:24:15 -04:00
Dan Carpenter
08b45fcb2d nfs/blocklayout: Use the passed in gfp flags
This allocation should use the passed in GFP_ flags instead of
GFP_KERNEL.  One places where this matters is in filelayout_pg_init_write()
which uses GFP_NOFS as the allocation flags.

Fixes: 5c83746a0c ("pnfs/blocklayout: in-kernel GETDEVICEINFO XDR parsing")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-08-24 13:24:15 -04:00
huzhi001@208suo.com
a841c9cb9b filemap: Fix errors in file.c
The following checkpatch errors are removed:
ERROR: "foo * bar" should be "foo *bar"
"foo * bar" should be "foo *bar"

Signed-off-by: ZhiHu <huzhi001@208suo.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-08-24 13:24:15 -04:00
Fedor Pchelkin
96562c45af NFSv4/pnfs: minor fix for cleanup path in nfs4_get_device_info
It is an almost improbable error case but when page allocating loop in
nfs4_get_device_info() fails then we should only free the already
allocated pages, as __free_page() can't deal with NULL arguments.

Found by Linux Verification Center (linuxtesting.org).

Cc: stable@vger.kernel.org
Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-08-24 13:24:15 -04:00
GUO Zihua
08be82ba0c NFS: Move common includes outside ifdef
module.h, clnt.h, addr.h and dns_resolve.h is always included whether
CONFIG_NFS_USE_KERNEL_DNS is set or not and their order does not seems
to matter.

Move them outside the ifdef to simplify code and avoid checkincludes
message.

Signed-off-by: GUO Zihua <guozihua@huawei.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-08-24 13:24:15 -04:00
Chuck Lever
8073a98e95 NFSD: Fix a thinko introduced by recent trace point changes
The fixed commit erroneously removed a call to nfsd_end_grace(),
which makes calls to write_v4_end_grace() a no-op.

Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202308241229.68396422-oliver.sang@intel.com
Fixes: 39d432fc76 ("NFSD: trace nfsctl operations")
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-08-24 10:56:28 -04:00
Will Shiu
74f6f59126 locks: fix KASAN: use-after-free in trace_event_raw_event_filelock_lock
As following backtrace, the struct file_lock request , in posix_lock_inode
is free before ftrace function using.
Replace the ftrace function ahead free flow could fix the use-after-free
issue.

[name:report&]===============================================
BUG:KASAN: use-after-free in trace_event_raw_event_filelock_lock+0x80/0x12c
[name:report&]Read at addr f6ffff8025622620 by task NativeThread/16753
[name:report_hw_tags&]Pointer tag: [f6], memory tag: [fe]
[name:report&]
BT:
Hardware name: MT6897 (DT)
Call trace:
 dump_backtrace+0xf8/0x148
 show_stack+0x18/0x24
 dump_stack_lvl+0x60/0x7c
 print_report+0x2c8/0xa08
 kasan_report+0xb0/0x120
 __do_kernel_fault+0xc8/0x248
 do_bad_area+0x30/0xdc
 do_tag_check_fault+0x1c/0x30
 do_mem_abort+0x58/0xbc
 el1_abort+0x3c/0x5c
 el1h_64_sync_handler+0x54/0x90
 el1h_64_sync+0x68/0x6c
 trace_event_raw_event_filelock_lock+0x80/0x12c
 posix_lock_inode+0xd0c/0xd60
 do_lock_file_wait+0xb8/0x190
 fcntl_setlk+0x2d8/0x440
...
[name:report&]
[name:report&]Allocated by task 16752:
...
 slab_post_alloc_hook+0x74/0x340
 kmem_cache_alloc+0x1b0/0x2f0
 posix_lock_inode+0xb0/0xd60
...
 [name:report&]
 [name:report&]Freed by task 16752:
...
  kmem_cache_free+0x274/0x5b0
  locks_dispose_list+0x3c/0x148
  posix_lock_inode+0xc40/0xd60
  do_lock_file_wait+0xb8/0x190
  fcntl_setlk+0x2d8/0x440
  do_fcntl+0x150/0xc18
...

Signed-off-by: Will Shiu <Will.Shiu@mediatek.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
2023-08-24 10:42:19 -04:00
Jakub Wilk
bd4c4680c0 fs/locks: Fix typo
Signed-off-by: Jakub Wilk <jwilk@jwilk.net>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
2023-08-24 10:42:19 -04:00
Luís Henriques
d9ae977d2d ceph: switch ceph_lookup/atomic_open() to use new fscrypt helper
Instead of setting the no-key dentry, use the new
fscrypt_prepare_lookup_partial() helper.  We still need to mark the
directory as incomplete if the directory was just unlocked.

In ceph_atomic_open() this fixes a bug where a dentry is incorrectly
set with DCACHE_NOKEY_NAME when 'dir' has been evicted but the key is
still available (for example, where there's a drop_caches).

Signed-off-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-08-24 11:24:37 +02:00
Xiubo Li
295fc4aa7d ceph: fix updating i_truncate_pagecache_size for fscrypt
When fscrypt is enabled we will align the truncate size up to the
CEPH_FSCRYPT_BLOCK_SIZE always, so if we truncate the size in the
same block more than once, the latter ones will be skipped being
invalidated from the page caches.

This will force invalidating the page caches by using the smaller
size than the real file size.

At the same time add more debug log and fix the debug log for
truncate code.

Link: https://tracker.ceph.com/issues/58834
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-08-24 11:24:36 +02:00
Xiubo Li
1464de9f81 ceph: wait for OSD requests' callbacks to finish when unmounting
The sync_filesystem() will flush all the dirty buffer and submit the
osd reqs to the osdc and then is blocked to wait for all the reqs to
finish. But the when the reqs' replies come, the reqs will be removed
from osdc just before the req->r_callback()s are called. Which means
the sync_filesystem() will be woke up by leaving the req->r_callback()s
are still running.

This will be buggy when the waiter require the req->r_callback()s to
release some resources before continuing. So we need to make sure the
req->r_callback()s are called before removing the reqs from the osdc.

WARNING: CPU: 4 PID: 168846 at fs/crypto/keyring.c:242 fscrypt_destroy_keyring+0x7e/0xd0
CPU: 4 PID: 168846 Comm: umount Tainted: G S  6.1.0-rc5-ceph-g72ead199864c #1
Hardware name: Supermicro SYS-5018R-WR/X10SRW-F, BIOS 2.0 12/17/2015
RIP: 0010:fscrypt_destroy_keyring+0x7e/0xd0
RSP: 0018:ffffc9000b277e28 EFLAGS: 00010202
RAX: 0000000000000002 RBX: ffff88810d52ac00 RCX: ffff88810b56aa00
RDX: 0000000080000000 RSI: ffffffff822f3a09 RDI: ffff888108f59000
RBP: ffff8881d394fb88 R08: 0000000000000028 R09: 0000000000000000
R10: 0000000000000001 R11: 11ff4fe6834fcd91 R12: ffff8881d394fc40
R13: ffff888108f59000 R14: ffff8881d394f800 R15: 0000000000000000
FS:  00007fd83f6f1080(0000) GS:ffff88885fd00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f918d417000 CR3: 000000017f89a005 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
generic_shutdown_super+0x47/0x120
kill_anon_super+0x14/0x30
ceph_kill_sb+0x36/0x90 [ceph]
deactivate_locked_super+0x29/0x60
cleanup_mnt+0xb8/0x140
task_work_run+0x67/0xb0
exit_to_user_mode_prepare+0x23d/0x240
syscall_exit_to_user_mode+0x25/0x60
do_syscall_64+0x40/0x80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7fd83dc39e9b

We need to increase the blocker counter to make sure all the osd
requests' callbacks have been finished just before calling the
kill_anon_super() when unmounting.

Link: https://tracker.ceph.com/issues/58126
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-08-24 11:24:36 +02:00
Xiubo Li
e3dfcab208 ceph: drop messages from MDS when unmounting
When unmounting all the dirty buffers will be flushed and after
the last osd request is finished the last reference of the i_count
will be released. Then it will flush the dirty cap/snap to MDSs,
and the unmounting won't wait the possible acks, which will ihold
the inodes when updating the metadata locally but makes no sense
any more, of this. This will make the evict_inodes() to skip these
inodes.

If encrypt is enabled the kernel generate a warning when removing
the encrypt keys when the skipped inodes still hold the keyring:

WARNING: CPU: 4 PID: 168846 at fs/crypto/keyring.c:242 fscrypt_destroy_keyring+0x7e/0xd0
CPU: 4 PID: 168846 Comm: umount Tainted: G S  6.1.0-rc5-ceph-g72ead199864c #1
Hardware name: Supermicro SYS-5018R-WR/X10SRW-F, BIOS 2.0 12/17/2015
RIP: 0010:fscrypt_destroy_keyring+0x7e/0xd0
RSP: 0018:ffffc9000b277e28 EFLAGS: 00010202
RAX: 0000000000000002 RBX: ffff88810d52ac00 RCX: ffff88810b56aa00
RDX: 0000000080000000 RSI: ffffffff822f3a09 RDI: ffff888108f59000
RBP: ffff8881d394fb88 R08: 0000000000000028 R09: 0000000000000000
R10: 0000000000000001 R11: 11ff4fe6834fcd91 R12: ffff8881d394fc40
R13: ffff888108f59000 R14: ffff8881d394f800 R15: 0000000000000000
FS:  00007fd83f6f1080(0000) GS:ffff88885fd00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f918d417000 CR3: 000000017f89a005 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
generic_shutdown_super+0x47/0x120
kill_anon_super+0x14/0x30
ceph_kill_sb+0x36/0x90 [ceph]
deactivate_locked_super+0x29/0x60
cleanup_mnt+0xb8/0x140
task_work_run+0x67/0xb0
exit_to_user_mode_prepare+0x23d/0x240
syscall_exit_to_user_mode+0x25/0x60
do_syscall_64+0x40/0x80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7fd83dc39e9b

Later the kernel will crash when iput() the inodes and dereferencing
the "sb->s_master_keys", which has been released by the
generic_shutdown_super().

Link: https://tracker.ceph.com/issues/59162
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-08-24 11:24:36 +02:00
Luís Henriques
abd4fc7758 ceph: prevent snapshot creation in encrypted locked directories
With snapshot names encryption we can not allow snapshots to be created in
locked directories because the names wouldn't be encrypted.  This patch
forces the directory to be unlocked to allow a snapshot to be created.

Signed-off-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-08-24 11:24:36 +02:00
Luís Henriques
dd66df0053 ceph: add support for encrypted snapshot names
Since filenames in encrypted directories are encrypted and shown as
a base64-encoded string when the directory is locked, make snapshot
names show a similar behaviour.

When creating a snapshot, .snap directories for every subdirectory will
show the snapshot name in the "long format":

  # mkdir .snap/my-snap
  # ls my-dir/.snap/
  _my-snap_1099511627782

Encrypted snapshots will need to be able to handle these by
encrypting/decrypting only the snapshot part of the string ('my-snap').

Also, since the MDS prevents snapshot names to be bigger than 240
characters it is necessary to adapt CEPH_NOHASH_NAME_MAX to accommodate
this extra limitation.

[ idryomov: drop const on !CONFIG_FS_ENCRYPTION branch too ]

Signed-off-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-08-24 11:24:36 +02:00
Luís Henriques
b422f11504 ceph: invalidate pages when doing direct/sync writes
When doing a direct/sync write, we need to invalidate the page cache in
the range being written to. If we don't do this, the cache will include
invalid data as we just did a write that avoided the page cache.

In the event that invalidation fails, just ignore the error. That likely
just means that we raced with another task doing a buffered write, in
which case we want to leave the page intact anyway.

[ jlayton: minor comment update ]

Signed-off-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-08-24 11:24:36 +02:00
Jeff Layton
f0fe1e54cf ceph: plumb in decryption during reads
Force the use of sparse reads when the inode is encrypted, and add the
appropriate code to decrypt the extent map after receiving.

Note that the crypto block may be smaller than a page, but the reverse
cannot be true.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-08-24 11:24:36 +02:00
Jeff Layton
d55207717d ceph: add encryption support to writepage and writepages
Allow writepage to issue encrypted writes. Extend out the requested size
and offset to cover complete blocks, and then encrypt and write them to
the OSDs.

Add the appropriate machinery to write back dirty data with encryption.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-08-24 11:24:36 +02:00
Jeff Layton
33a5f1709a ceph: add read/modify/write to ceph_sync_write
When doing a synchronous write on an encrypted inode, we have no
guarantee that the caller is writing crypto block-aligned data. When
that happens, we must do a read/modify/write cycle.

First, expand the range to cover complete blocks. If we had to change
the original pos or length, issue a read to fill the first and/or last
pages, and fetch the version of the object from the result.

We then copy data into the pages as usual, encrypt the result and issue
a write prefixed by an assertion that the version hasn't changed. If it has
changed then we restart the whole thing again.

If there is no object at that position in the file (-ENOENT), we prefix
the write on an exclusive create of the object instead.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-08-24 11:24:36 +02:00
Jeff Layton
b294fa295f ceph: align data in pages in ceph_sync_write
Encrypted files will need to be dealt with in block-sized chunks and
once we do that, the way that ceph_sync_write aligns the data in the
bounce buffer won't be acceptable.

Change it to align the data the same way it would be aligned in the
pagecache.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-08-24 11:24:36 +02:00
Jeff Layton
8cff8f5374 ceph: don't use special DIO path for encrypted inodes
Eventually I want to merge the synchronous and direct read codepaths,
possibly via new netfs infrastructure. For now, the direct path is not
crypto-enabled, so use the sync read/write paths instead.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-08-24 11:24:36 +02:00
Xiubo Li
5c64737d25 ceph: add truncate size handling support for fscrypt
This will transfer the encrypted last block contents to the MDS
along with the truncate request only when the new size is smaller
and not aligned to the fscrypt BLOCK size. When the last block is
located in the file hole, the truncate request will only contain
the header.

The MDS could fail to do the truncate if there has another client
or process has already updated the RADOS object which contains
the last block, and will return -EAGAIN, then the kclient needs
to retry it. The RMW will take around 50ms, and will let it retry
20 times for now.

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-08-24 11:24:35 +02:00
Xiubo Li
d4d5188715 ceph: add object version support for sync read
Turn the guts of ceph_sync_read into a new helper that takes an inode
and an offset instead of a kiocb struct, and make ceph_sync_read call
the helper as a wrapper.

Make the new helper always return the last object's version.

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-08-24 11:24:35 +02:00
Jeff Layton
77cdb7e17e ceph: add infrastructure for file encryption and decryption
...and allow test_dummy_encryption to bypass content encryption
if mounted with test_dummy_encryption=clear.

[ xiubli: remove test_dummy_encryption=clear support per Ilya ]

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-08-24 11:24:35 +02:00
Jeff Layton
0d91f0ad6a ceph: handle fscrypt fields in cap messages from MDS
Handle the new fscrypt_file and fscrypt_auth fields in cap messages. Use
them to populate new fields in cap_extra_info and update the inode with
those values.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-08-24 11:24:35 +02:00
Jeff Layton
16be62fc8a ceph: size handling in MClientRequest, cap updates and inode traces
For encrypted inodes, transmit a rounded-up size to the MDS as the
normal file size and send the real inode size in fscrypt_file field.
Also, fix up creates and truncates to also transmit fscrypt_file.

When we get an inode trace from the MDS, grab the fscrypt_file field if
the inode is encrypted, and use it to populate the i_size field instead
of the regular inode size field.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-08-24 11:24:35 +02:00
Luís Henriques
14e034a61c ceph: mark directory as non-complete after loading key
When setting a directory's crypt context, ceph_dir_clear_complete()
needs to be called otherwise if it was complete before, any existing
(old) dentry will still be valid.

This patch adds a wrapper around __fscrypt_prepare_readdir() which will
ensure a directory is marked as non-complete if key status changes.

[ xiubli: revise commit title per Milind ]

Signed-off-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-08-24 11:24:35 +02:00
Luís Henriques
e127e03009 ceph: allow encrypting a directory while not having Ax caps
If a client doesn't have Fx caps on a directory, it will get errors while
trying encrypt it:

ceph: handle_cap_grant: cap grant attempt to change fscrypt_auth on non-I_NEW inode (old len 0 new len 48)
fscrypt (ceph, inode 1099511627812): Error -105 getting encryption context

A simple way to reproduce this is to use two clients:

    client1 # mkdir /mnt/mydir

    client2 # ls /mnt/mydir

    client1 # fscrypt encrypt /mnt/mydir
    client1 # echo hello > /mnt/mydir/world

This happens because, in __ceph_setattr(), we only initialize
ci->fscrypt_auth if we have Ax and ceph_fill_inode() won't use the
fscrypt_auth received if the inode state isn't I_NEW.  Fix it by allowing
ceph_fill_inode() to also set ci->fscrypt_auth if the inode doesn't have
it set already.

Signed-off-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-08-24 11:24:35 +02:00
Jeff Layton
94af047092 ceph: add some fscrypt guardrails
Add the appropriate calls into fscrypt for various actions, including
link, rename, setattr, and the open codepaths.

Disable fallocate for encrypted inodes -- hopefully, just for now.

If we have an encrypted inode, then the client will need to re-encrypt
the contents of the new object. Disable copy offload to or from
encrypted inodes.

Set i_blkbits to crypto block size for encrypted inodes -- some of the
underlying infrastructure for fscrypt relies on i_blkbits being aligned
to crypto blocksize.

Report STATX_ATTR_ENCRYPTED on encrypted inodes.

[ lhenriques: forbid encryption with striped layouts ]

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-08-24 11:24:35 +02:00
Jeff Layton
79f2f6ad87 ceph: create symlinks with encrypted and base64-encoded targets
When creating symlinks in encrypted directories, encrypt and
base64-encode the target with the new inode's key before sending to the
MDS.

When filling a symlinked inode, base64-decode it into a buffer that
we'll keep in ci->i_symlink. When get_link is called, decrypt the buffer
into a new one that will hang off i_link.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-08-24 11:24:35 +02:00
Xiubo Li
af9ffa6df7 ceph: add support to readdir for encrypted names
To make it simpler to decrypt names in a readdir reply (i.e. before
we have a dentry), add a new ceph_encode_encrypted_fname()-like helper
that takes a qstr pointer instead of a dentry pointer.

Once we've decrypted the names in a readdir reply, we no longer need the
crypttext, so overwrite them in ceph_mds_reply_dir_entry with the
unencrypted names. Then in both ceph_readdir_prepopulate() and
ceph_readdir() we will use the dencrypted name directly.

[ jlayton: convert some BUG_ONs into error returns ]

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-08-24 11:24:34 +02:00
Xiubo Li
3859af9eba ceph: pass the request to parse_reply_info_readdir()
Instead of passing just the r_reply_info to the readdir reply parser,
pass the request pointer directly instead. This will facilitate
implementing readdir on fscrypted directories.

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-08-24 11:24:34 +02:00
Jeff Layton
855290962c ceph: make ceph_fill_trace and ceph_get_name decrypt names
When we get a dentry in a trace, decrypt the name so we can properly
instantiate the dentry or fill out ceph_get_name() buffer.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-08-24 11:24:34 +02:00
Jeff Layton
457117f077 ceph: add helpers for converting names for userland presentation
Define a new ceph_fname struct that we can use to carry information
about encrypted dentry names. Add helpers for working with these
objects, including ceph_fname_to_usr which formats an encrypted filename
for userland presentation.

[ xiubli: fix resulting name length check -- neither name_len nor
  ctext_len should exceed NAME_MAX ]

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-08-24 11:24:34 +02:00
Jeff Layton
c526760181 ceph: make d_revalidate call fscrypt revalidator for encrypted dentries
If we have a dentry which represents a no-key name, then we need to test
whether the parent directory's encryption key has since been added.  Do
that before we test anything else about the dentry.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-08-24 11:24:34 +02:00
Jeff Layton
cb3524a8bd ceph: set DCACHE_NOKEY_NAME flag in ceph_lookup/atomic_open()
This is required so that we know to invalidate these dentries when the
directory is unlocked.

Atomic open can act as a lookup if handed a dentry that is negative on
the MDS. Ensure that we set DCACHE_NOKEY_NAME on the dentry in
atomic_open, if we don't have the key for the parent. Otherwise, we can
end up validating the dentry inappropriately if someone later adds a
key.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-08-24 11:24:34 +02:00
Jeff Layton
4ac4c23eaa ceph: decode alternate_name in lease info
Ceph is a bit different from local filesystems, in that we don't want
to store filenames as raw binary data, since we may also be dealing
with clients that don't support fscrypt.

We could just base64-encode the encrypted filenames, but that could
leave us with filenames longer than NAME_MAX. It turns out that the
MDS doesn't care much about filename length, but the clients do.

To manage this, we've added a new "alternate name" field that can be
optionally added to any dentry that we'll use to store the binary
crypttext of the filename if its base64-encoded value will be longer
than NAME_MAX. When a dentry has one of these names attached, the MDS
will send it along in the lease info, which we can then store for
later usage.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-08-24 11:24:34 +02:00
Jeff Layton
24865e75c1 ceph: send alternate_name in MClientRequest
In the event that we have a filename longer than CEPH_NOHASH_NAME_MAX,
we'll need to hash the tail of the filename. The client however will
still need to know the full name of the file if it has a key.

To support this, the MClientRequest field has grown a new alternate_name
field that we populate with the full (binary) crypttext of the filename.
This is then transmitted to the clients in readdir or traces as part of
the dentry lease.

Add support for populating this field when the filenames are very long.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-08-24 11:24:34 +02:00
Jeff Layton
3fd945a79e ceph: encode encrypted name in ceph_mdsc_build_path and dentry release
Allow ceph_mdsc_build_path to encrypt and base64 encode the filename
when the parent is encrypted and we're sending the path to the MDS. In
a similar fashion, encode encrypted dentry names if including a dentry
release in a request.

In most cases, we just encrypt the filenames and base64 encode them,
but when the name is longer than CEPH_NOHASH_NAME_MAX, we use a similar
scheme to fscrypt proper, and hash the remaning bits with sha256.

When doing this, we then send along the full crypttext of the name in
the new alternate_name field of the MClientRequest. The MDS can then
send that along in readdir responses and traces.

[ idryomov: drop duplicate include reported by Abaci Robot ]

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-08-24 11:22:37 +02:00
Greg Ungerer
9549fb354e
riscv: support the elf-fdpic binfmt loader
Add support for enabling and using the binfmt_elf_fdpic program loader
on RISC-V platforms. The most important change is to setup registers
during program load to pass the mapping addresses to the new process.

One of the interesting features of the elf-fdpic loader is that it
also allows appropriately compiled ELF format binaries to be loaded on
nommu systems. Appropriate being those compiled with -pie.

Signed-off-by: Greg Ungerer <gerg@kernel.org>
Acked-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20230711130754.481209-3-gerg@kernel.org
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2023-08-23 14:17:43 -07:00
Greg Ungerer
b922bf04d2
binfmt_elf_fdpic: support 64-bit systems
The binfmt_flat_fdpic code has a number of 32-bit specific data
structures associated with it. Extend it to be able to support and
be used on 64-bit systems as well.

The new code defines a number of key 64-bit variants of the core
elf-fdpic data structures - along side the existing 32-bit sized ones.
A common set of generic named structures are defined to be either
the 32-bit or 64-bit ones as required at compile time. This is a
similar technique to that used in the ELF binfmt loader.

For example:

  elf_fdpic_loadseg is either elf32_fdpic_loadseg or elf64_fdpic_loadseg
  elf_fdpic_loadmap is either elf32_fdpic_loadmap or elf64_fdpic_loadmap

the choice based on ELFCLASS32 or ELFCLASS64.

Signed-off-by: Greg Ungerer <gerg@kernel.org>
Acked-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20230711130754.481209-2-gerg@kernel.org
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2023-08-23 14:17:42 -07:00