For backchannel requests that lookup the appropriate nfs_client, use the
state-management rpc_clnt's rpc_timeout parameters for the backchannel's
response. When the nfs_client cannot be found, fall back to using the
xprt's default timeout parameters.
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
The kfree() function was called in one case by
the bl_resolve_deviceid() function during error handling
even if the passed data structure member contained a null pointer.
This issue was detected by using the Coccinelle software.
Thus use an other label.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
NFS already has writepages and migrate_folio, so it does not need to
implement writepage. The writepage operation is deprecated as it leads
to worse performance under high memory pressure due to folios being
written out in LRU order rather than sequentially within a file.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Now that we're calculating how large a remaining IO should be based
on the current request's offset, we no longer need to track bytes_left on
each struct nfs_direct_req. Drop the field, and clean up the direct
request tracepoints.
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Instead of relying on the value of the 'bytes_left' field, we should
calculate the layout size based on the offset of the request that is
being written out.
Reported-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Fixes: 954998b60c ("NFS: Fix error handling for O_DIRECT write scheduling")
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Benjamin Coddington <bcodding@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
With this we can see the dentry -> inode linkage that's being
revalidated. A fileid of 0 means "negative dentry".
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
We do async renames in other cases besides sillyrenames now. This
tracepoint name is now misleading.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Add a call to the v4 d_revalidate entrypoint, just like the v3 one.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Once the client has processed the CB_LAYOUTRECALL, but has not yet
successfully returned the layout, the server is supposed to switch to
returning NFS4ERR_RETURNCONFLICT. This patch ensures that we handle
that return value correctly.
Fixes: 183d9e7b11 ("pnfs: rework LAYOUTGET retry handling")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
If the server is recalling a layout, and sends us a list of referring
calls that we can see are complete, then we should just trust that the
stateid argument is correct, even if the sequence id doesn't match the
one we hold.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
When the server gives us a set of referring calls, to tell us that the
NFSv4.1 callback needs to be ordered with respect to those calls, then
we may want to make that information available to the operations. In
certain cases, it may allow them to optimise their behaviour due to the
extra knowledge.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
The subjective cred (task->cred) can potentially be overridden and
subsquently freed in non-RCU context, which could lead to a panic if we
try to use it in cred_fscmp(). Use __task_cred(), which returns the
objective cred (task->real_cred) instead.
Fixes: 0eb43812c0 ("NFS: Clear the file access cache upon login")
Fixes: 5e9a7b9c2e ("NFS: Fix up a sparse warning")
Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Again we have claimed regressions for walking a directory tree, this time
with the "find" utility which always tries to optimize away asking for any
attributes until it has a complete list of entries. This behavior makes
the readdir plus heuristic do the wrong thing, which causes a storm of
GETATTRs to determine each entry's type in order to continue the walk.
For v4 add the type attribute to each READDIR request to include it no
matter the heuristic. This allows a simple `find` command to proceed
quickly through a directory tree.
Suggested-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
We noticed a SCSI device that refused to allow READ CAPACITY when the
device had a PR with exclusive access, registrants only. The result of
this situation is that the blocklayout driver adds a pnfs_block_dev of zero
length which always fails the offset_in_map tests. Instead of continuously
trying to do pNFS for this case, just mark the device as unavailable which
will allow the client to fallback to the MDS for the duration of
PNFS_DEVICE_RETRY_TIMEOUT.
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
The error path for blocklayout's device lookup is missing a reference drop
for the case where a lookup finds the device, but the device is marked with
NFS_DEVICEID_UNAVAILABLE.
Fixes: b3dce6a2f0 ("pnfs/blocklayout: handle transient devices")
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
This commit comes at the tail end of a greater effort to remove the
empty elements at the end of the ctl_table arrays (sentinels) which
will reduce the overall build time size of the kernel and run time
memory bloat by ~64 bytes per sentinel (further information Link :
https://lore.kernel.org/all/ZO5Yx5JFogGi%2FcBo@bombadil.infradead.org/)
Remove sentinel elements ctl_table struct. Special attention was placed in
making sure that an empty directory for fs/verity was created when
CONFIG_FS_VERITY_BUILTIN_SIGNATURES is not defined. In this case we use the
register sysctl call that expects a size.
Signed-off-by: Joel Granados <j.granados@samsung.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Acked-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Track the file position above which the server is not expected to have any
data (the "zero point") and preemptively assume that we can satisfy
requests by filling them with zeroes locally rather than attempting to
download them if they're over that line - even if we've written data back
to the server. Assume that any data that was written back above that
position is held in the local cache. Note that we have to split requests
that straddle the line.
Make use of this to optimise away some reads from the server. We need to
set the zero point in the following circumstances:
(1) When we see an extant remote inode and have no cache for it, we set
the zero_point to i_size.
(2) On local inode creation, we set zero_point to 0.
(3) On local truncation down, we reduce zero_point to the new i_size if
the new i_size is lower.
(4) On local truncation up, we don't change zero_point.
(5) On local modification, we don't change zero_point.
(6) On remote invalidation, we set zero_point to the new i_size.
(7) If stored data is discarded from the pagecache or culled from fscache,
we must set zero_point above that if the data also got written to the
server.
(8) If dirty data is written back to the server, but not fscache, we must
set zero_point above that.
(9) If a direct I/O write is made, set zero_point above that.
Assuming the above, any read from the server at or above the zero_point
position will return all zeroes.
The zero_point value can be stored in the cache, provided the above rules
are applied to it by any code that culls part of the local cache.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
Remove ->begin_cache_operation() in favour of just calling fscache directly.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: Christian Brauner <christian@brauner.io>
cc: linux-fsdevel@vger.kernel.org
cc: linux-cachefs@redhat.com
Patch series "workload-specific and memory pressure-driven zswap
writeback", v8.
There are currently several issues with zswap writeback:
1. There is only a single global LRU for zswap, making it impossible to
perform worload-specific shrinking - an memcg under memory pressure
cannot determine which pages in the pool it owns, and often ends up
writing pages from other memcgs. This issue has been previously
observed in practice and mitigated by simply disabling
memcg-initiated shrinking:
https://lore.kernel.org/all/20230530232435.3097106-1-nphamcs@gmail.com/T/#u
But this solution leaves a lot to be desired, as we still do not
have an avenue for an memcg to free up its own memory locked up in
the zswap pool.
2. We only shrink the zswap pool when the user-defined limit is hit.
This means that if we set the limit too high, cold data that are
unlikely to be used again will reside in the pool, wasting precious
memory. It is hard to predict how much zswap space will be needed
ahead of time, as this depends on the workload (specifically, on
factors such as memory access patterns and compressibility of the
memory pages).
This patch series solves these issues by separating the global zswap LRU
into per-memcg and per-NUMA LRUs, and performs workload-specific (i.e
memcg- and NUMA-aware) zswap writeback under memory pressure. The new
shrinker does not have any parameter that must be tuned by the user, and
can be opted in or out on a per-memcg basis.
As a proof of concept, we ran the following synthetic benchmark: build the
linux kernel in a memory-limited cgroup, and allocate some cold data in
tmpfs to see if the shrinker could write them out and improved the overall
performance. Depending on the amount of cold data generated, we observe
from 14% to 35% reduction in kernel CPU time used in the kernel builds.
This patch (of 6):
The interface of list_lru is based on the assumption that the list node
and the data it represents belong to the same allocated on the correct
node/memcg. While this assumption is valid for existing slab objects LRU
such as dentries and inodes, it is undocumented, and rather inflexible for
certain potential list_lru users (such as the upcoming zswap shrinker and
the THP shrinker). It has caused us a lot of issues during our
development.
This patch changes list_lru interface so that the caller must explicitly
specify numa node and memcg when adding and removing objects. The old
list_lru_add() and list_lru_del() are renamed to list_lru_add_obj() and
list_lru_del_obj(), respectively.
It also extends the list_lru API with a new function, list_lru_putback,
which undoes a previous list_lru_isolate call. Unlike list_lru_add, it
does not increment the LRU node count (as list_lru_isolate does not
decrement the node count). list_lru_putback also allows for explicit
memcg and NUMA node selection.
Link: https://lkml.kernel.org/r/20231130194023.4102148-1-nphamcs@gmail.com
Link: https://lkml.kernel.org/r/20231130194023.4102148-2-nphamcs@gmail.com
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Tested-by: Bagas Sanjaya <bagasdotme@gmail.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
generic_copy_file_range() is just a wrapper around splice_file_range(),
which caps the maximum copy length.
The only caller of splice_file_range(), namely __ceph_copy_file_range()
is already ready to cope with short copy.
Move the length capping into splice_file_range() and replace the exported
symbol generic_copy_file_range() with a simple inline helper.
Suggested-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/linux-fsdevel/20231204083849.GC32438@lst.de/
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Link: https://lore.kernel.org/r/20231212094440.250945-3-amir73il@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
There were already assertions that we were not passing a tail page to
error_remove_page(), so make the compiler enforce that by converting
everything to pass and use a folio.
Link: https://lkml.kernel.org/r/20231117161447.2461643-7-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
It is hard to find where mapping->private_lock, mapping->private_list and
mapping->private_data are used, due to private_XXX being a relatively
common name for variables and structure members in the kernel. To fit
with other members of struct address_space, rename them all to have an
i_ prefix. Tested with an allmodconfig build.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://lore.kernel.org/r/20231117215823.2821906-1-willy@infradead.org
Acked-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Highlights include:
Bugfixes:
- SUNRPC: A fix to re-probe the target RPC port after an ECONNRESET error
- SUNRPC: Handle allocation errors from rpcb_call_async()
- SUNRPC: Fix a use-after-free condition in rpc_pipefs
- SUNRPC: fix up various checks for timeouts
- NFSv4.1: Handle NFS4ERR_DELAY errors during session trunking
- NFSv4.1: fix SP4_MACH_CRED protection for pnfs IO
- NFSv4: Ensure that we test all delegations when the server notifies
us that it may have revoked some of them
Features:
- Allow knfsd processes to break out of NFS4ERR_DELAY loops when
re-exporting NFSv4.x by setting appropriate values for the
'delay_retrans' module parameter.
- nfs: Convert nfs_symlink() to use a folio
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEESQctxSBg8JpV8KqEZwvnipYKAPIFAmVL62IACgkQZwvnipYK
API4Ww/+I75hmEdI4i/6v8WnrLWLWzmCgybez2AfrKYtuYmEcDZTf2K4pNVEJxFG
PJ4TYRpaSgwxCXqoup5INOdL5gS8g3JbAlTqrDQ8nYGeBXETN9tN5n2Xj8liHk2l
OrnYzgTjC2pwpWDZq/W1LCQMatCbs9XhpmyBvqZH5r7tAOVGkrk1ICZ/r+/nGiN8
LAbm/I0M1Jp0iJisN/i/0CsEgLQMCfeQVrtEMCZGsoVS79Mr/W1hF3KiGognI/xz
FvEXnZKauw9npu7U7ckhHZcHd8oQxby0Q0Xny/IpgiO2Z1YqfKCvJSK1sjBt/lFu
7fe2HGMFfcTMx/bn/qdUJR351607rBi4h3t1OfK4KIxV1gSLUyS/RR7ayFx5A0gM
pFJcC0ZnKoiVr2vSNMMguennbyjScqNOC5ECb+bpMAyDV7suF9e/khK12CdYwqFm
cpeUUD35GZT+RjP3htI92Pj0bqBOHx+o7qEi+MEel3t90us8PPBrWywHk7Tw40vJ
eUxH0XQtZVswH6TvekNdoMXx8TXSAbx4I6Hw3fmNGWLRp494p3GVSmWQmctcvCi7
Y6E1P3YT8G1OeI9+fIQr5Wp2F9sOdFPrb3BrAojj9ndJ3ZqQAcx+gY5z2RVnvvur
PTsLInFxS8+WvvqPtQCZZ5UxnrcPFCk1js0rg6EqdjZmq6+OqJU=
=TNRl
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-6.7-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client updates from Trond Myklebust:
"Bugfixes:
- SUNRPC:
- re-probe the target RPC port after an ECONNRESET error
- handle allocation errors from rpcb_call_async()
- fix a use-after-free condition in rpc_pipefs
- fix up various checks for timeouts
- NFSv4.1:
- Handle NFS4ERR_DELAY errors during session trunking
- fix SP4_MACH_CRED protection for pnfs IO
- NFSv4:
- Ensure that we test all delegations when the server notifies
us that it may have revoked some of them
Features:
- Allow knfsd processes to break out of NFS4ERR_DELAY loops when
re-exporting NFSv4.x by setting appropriate values for the
'delay_retrans' module parameter
- nfs: Convert nfs_symlink() to use a folio"
* tag 'nfs-for-6.7-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
nfs: Convert nfs_symlink() to use a folio
SUNRPC: Fix RPC client cleaned up the freed pipefs dentries
NFSv4.1: fix SP4_MACH_CRED protection for pnfs IO
SUNRPC: Add an IS_ERR() check back to where it was
NFSv4.1: fix handling NFS4ERR_DELAY when testing for session trunking
nfs41: drop dependency between flexfiles layout driver and NFSv3 modules
NFSv4: fairly test all delegations on a SEQ4_ revocation
SUNRPC: SOFTCONN tasks should time out when on the sending list
SUNRPC: Force close the socket when a hard error is reported
SUNRPC: Don't skip timeout checks in call_connect_status()
SUNRPC: ECONNRESET might require a rebind
NFSv4/pnfs: Allow layoutget to return EAGAIN for softerr mounts
NFSv4: Add a parameter to limit the number of retries after NFS4ERR_DELAY
included in this merge do the following:
- Kemeng Shi has contributed some compation maintenance work in the
series "Fixes and cleanups to compaction".
- Joel Fernandes has a patchset ("Optimize mremap during mutual
alignment within PMD") which fixes an obscure issue with mremap()'s
pagetable handling during a subsequent exec(), based upon an
implementation which Linus suggested.
- More DAMON/DAMOS maintenance and feature work from SeongJae Park i the
following patch series:
mm/damon: misc fixups for documents, comments and its tracepoint
mm/damon: add a tracepoint for damos apply target regions
mm/damon: provide pseudo-moving sum based access rate
mm/damon: implement DAMOS apply intervals
mm/damon/core-test: Fix memory leaks in core-test
mm/damon/sysfs-schemes: Do DAMOS tried regions update for only one apply interval
- In the series "Do not try to access unaccepted memory" Adrian Hunter
provides some fixups for the recently-added "unaccepted memory' feature.
To increase the feature's checking coverage. "Plug a few gaps where
RAM is exposed without checking if it is unaccepted memory".
- In the series "cleanups for lockless slab shrink" Qi Zheng has done
some maintenance work which is preparation for the lockless slab
shrinking code.
- Qi Zheng has redone the earlier (and reverted) attempt to make slab
shrinking lockless in the series "use refcount+RCU method to implement
lockless slab shrink".
- David Hildenbrand contributes some maintenance work for the rmap code
in the series "Anon rmap cleanups".
- Kefeng Wang does more folio conversions and some maintenance work in
the migration code. Series "mm: migrate: more folio conversion and
unification".
- Matthew Wilcox has fixed an issue in the buffer_head code which was
causing long stalls under some heavy memory/IO loads. Some cleanups
were added on the way. Series "Add and use bdev_getblk()".
- In the series "Use nth_page() in place of direct struct page
manipulation" Zi Yan has fixed a potential issue with the direct
manipulation of hugetlb page frames.
- In the series "mm: hugetlb: Skip initialization of gigantic tail
struct pages if freed by HVO" has improved our handling of gigantic
pages in the hugetlb vmmemmep optimizaton code. This provides
significant boot time improvements when significant amounts of gigantic
pages are in use.
- Matthew Wilcox has sent the series "Small hugetlb cleanups" - code
rationalization and folio conversions in the hugetlb code.
- Yin Fengwei has improved mlock()'s handling of large folios in the
series "support large folio for mlock"
- In the series "Expose swapcache stat for memcg v1" Liu Shixin has
added statistics for memcg v1 users which are available (and useful)
under memcg v2.
- Florent Revest has enhanced the MDWE (Memory-Deny-Write-Executable)
prctl so that userspace may direct the kernel to not automatically
propagate the denial to child processes. The series is named "MDWE
without inheritance".
- Kefeng Wang has provided the series "mm: convert numa balancing
functions to use a folio" which does what it says.
- In the series "mm/ksm: add fork-exec support for prctl" Stefan Roesch
makes is possible for a process to propagate KSM treatment across
exec().
- Huang Ying has enhanced memory tiering's calculation of memory
distances. This is used to permit the dax/kmem driver to use "high
bandwidth memory" in addition to Optane Data Center Persistent Memory
Modules (DCPMM). The series is named "memory tiering: calculate
abstract distance based on ACPI HMAT"
- In the series "Smart scanning mode for KSM" Stefan Roesch has
optimized KSM by teaching it to retain and use some historical
information from previous scans.
- Yosry Ahmed has fixed some inconsistencies in memcg statistics in the
series "mm: memcg: fix tracking of pending stats updates values".
- In the series "Implement IOCTL to get and optionally clear info about
PTEs" Peter Xu has added an ioctl to /proc/<pid>/pagemap which permits
us to atomically read-then-clear page softdirty state. This is mainly
used by CRIU.
- Hugh Dickins contributed the series "shmem,tmpfs: general maintenance"
- a bunch of relatively minor maintenance tweaks to this code.
- Matthew Wilcox has increased the use of the VMA lock over file-backed
page faults in the series "Handle more faults under the VMA lock". Some
rationalizations of the fault path became possible as a result.
- In the series "mm/rmap: convert page_move_anon_rmap() to
folio_move_anon_rmap()" David Hildenbrand has implemented some cleanups
and folio conversions.
- In the series "various improvements to the GUP interface" Lorenzo
Stoakes has simplified and improved the GUP interface with an eye to
providing groundwork for future improvements.
- Andrey Konovalov has sent along the series "kasan: assorted fixes and
improvements" which does those things.
- Some page allocator maintenance work from Kemeng Shi in the series
"Two minor cleanups to break_down_buddy_pages".
- In thes series "New selftest for mm" Breno Leitao has developed
another MM self test which tickles a race we had between madvise() and
page faults.
- In the series "Add folio_end_read" Matthew Wilcox provides cleanups
and an optimization to the core pagecache code.
- Nhat Pham has added memcg accounting for hugetlb memory in the series
"hugetlb memcg accounting".
- Cleanups and rationalizations to the pagemap code from Lorenzo
Stoakes, in the series "Abstract vma_merge() and split_vma()".
- Audra Mitchell has fixed issues in the procfs page_owner code's new
timestamping feature which was causing some misbehaviours. In the
series "Fix page_owner's use of free timestamps".
- Lorenzo Stoakes has fixed the handling of new mappings of sealed files
in the series "permit write-sealed memfd read-only shared mappings".
- Mike Kravetz has optimized the hugetlb vmemmap optimization in the
series "Batch hugetlb vmemmap modification operations".
- Some buffer_head folio conversions and cleanups from Matthew Wilcox in
the series "Finish the create_empty_buffers() transition".
- As a page allocator performance optimization Huang Ying has added
automatic tuning to the allocator's per-cpu-pages feature, in the series
"mm: PCP high auto-tuning".
- Roman Gushchin has contributed the patchset "mm: improve performance
of accounted kernel memory allocations" which improves their performance
by ~30% as measured by a micro-benchmark.
- folio conversions from Kefeng Wang in the series "mm: convert page
cpupid functions to folios".
- Some kmemleak fixups in Liu Shixin's series "Some bugfix about
kmemleak".
- Qi Zheng has improved our handling of memoryless nodes by keeping them
off the allocation fallback list. This is done in the series "handle
memoryless nodes more appropriately".
- khugepaged conversions from Vishal Moola in the series "Some
khugepaged folio conversions".
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZULEMwAKCRDdBJ7gKXxA
jhQHAQCYpD3g849x69DmHnHWHm/EHQLvQmRMDeYZI+nx/sCJOwEAw4AKg0Oemv9y
FgeUPAD1oasg6CP+INZvCj34waNxwAc=
=E+Y4
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2023-11-01-14-33' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
"Many singleton patches against the MM code. The patch series which are
included in this merge do the following:
- Kemeng Shi has contributed some compation maintenance work in the
series 'Fixes and cleanups to compaction'
- Joel Fernandes has a patchset ('Optimize mremap during mutual
alignment within PMD') which fixes an obscure issue with mremap()'s
pagetable handling during a subsequent exec(), based upon an
implementation which Linus suggested
- More DAMON/DAMOS maintenance and feature work from SeongJae Park i
the following patch series:
mm/damon: misc fixups for documents, comments and its tracepoint
mm/damon: add a tracepoint for damos apply target regions
mm/damon: provide pseudo-moving sum based access rate
mm/damon: implement DAMOS apply intervals
mm/damon/core-test: Fix memory leaks in core-test
mm/damon/sysfs-schemes: Do DAMOS tried regions update for only one apply interval
- In the series 'Do not try to access unaccepted memory' Adrian
Hunter provides some fixups for the recently-added 'unaccepted
memory' feature. To increase the feature's checking coverage. 'Plug
a few gaps where RAM is exposed without checking if it is
unaccepted memory'
- In the series 'cleanups for lockless slab shrink' Qi Zheng has done
some maintenance work which is preparation for the lockless slab
shrinking code
- Qi Zheng has redone the earlier (and reverted) attempt to make slab
shrinking lockless in the series 'use refcount+RCU method to
implement lockless slab shrink'
- David Hildenbrand contributes some maintenance work for the rmap
code in the series 'Anon rmap cleanups'
- Kefeng Wang does more folio conversions and some maintenance work
in the migration code. Series 'mm: migrate: more folio conversion
and unification'
- Matthew Wilcox has fixed an issue in the buffer_head code which was
causing long stalls under some heavy memory/IO loads. Some cleanups
were added on the way. Series 'Add and use bdev_getblk()'
- In the series 'Use nth_page() in place of direct struct page
manipulation' Zi Yan has fixed a potential issue with the direct
manipulation of hugetlb page frames
- In the series 'mm: hugetlb: Skip initialization of gigantic tail
struct pages if freed by HVO' has improved our handling of gigantic
pages in the hugetlb vmmemmep optimizaton code. This provides
significant boot time improvements when significant amounts of
gigantic pages are in use
- Matthew Wilcox has sent the series 'Small hugetlb cleanups' - code
rationalization and folio conversions in the hugetlb code
- Yin Fengwei has improved mlock()'s handling of large folios in the
series 'support large folio for mlock'
- In the series 'Expose swapcache stat for memcg v1' Liu Shixin has
added statistics for memcg v1 users which are available (and
useful) under memcg v2
- Florent Revest has enhanced the MDWE (Memory-Deny-Write-Executable)
prctl so that userspace may direct the kernel to not automatically
propagate the denial to child processes. The series is named 'MDWE
without inheritance'
- Kefeng Wang has provided the series 'mm: convert numa balancing
functions to use a folio' which does what it says
- In the series 'mm/ksm: add fork-exec support for prctl' Stefan
Roesch makes is possible for a process to propagate KSM treatment
across exec()
- Huang Ying has enhanced memory tiering's calculation of memory
distances. This is used to permit the dax/kmem driver to use 'high
bandwidth memory' in addition to Optane Data Center Persistent
Memory Modules (DCPMM). The series is named 'memory tiering:
calculate abstract distance based on ACPI HMAT'
- In the series 'Smart scanning mode for KSM' Stefan Roesch has
optimized KSM by teaching it to retain and use some historical
information from previous scans
- Yosry Ahmed has fixed some inconsistencies in memcg statistics in
the series 'mm: memcg: fix tracking of pending stats updates
values'
- In the series 'Implement IOCTL to get and optionally clear info
about PTEs' Peter Xu has added an ioctl to /proc/<pid>/pagemap
which permits us to atomically read-then-clear page softdirty
state. This is mainly used by CRIU
- Hugh Dickins contributed the series 'shmem,tmpfs: general
maintenance', a bunch of relatively minor maintenance tweaks to
this code
- Matthew Wilcox has increased the use of the VMA lock over
file-backed page faults in the series 'Handle more faults under the
VMA lock'. Some rationalizations of the fault path became possible
as a result
- In the series 'mm/rmap: convert page_move_anon_rmap() to
folio_move_anon_rmap()' David Hildenbrand has implemented some
cleanups and folio conversions
- In the series 'various improvements to the GUP interface' Lorenzo
Stoakes has simplified and improved the GUP interface with an eye
to providing groundwork for future improvements
- Andrey Konovalov has sent along the series 'kasan: assorted fixes
and improvements' which does those things
- Some page allocator maintenance work from Kemeng Shi in the series
'Two minor cleanups to break_down_buddy_pages'
- In thes series 'New selftest for mm' Breno Leitao has developed
another MM self test which tickles a race we had between madvise()
and page faults
- In the series 'Add folio_end_read' Matthew Wilcox provides cleanups
and an optimization to the core pagecache code
- Nhat Pham has added memcg accounting for hugetlb memory in the
series 'hugetlb memcg accounting'
- Cleanups and rationalizations to the pagemap code from Lorenzo
Stoakes, in the series 'Abstract vma_merge() and split_vma()'
- Audra Mitchell has fixed issues in the procfs page_owner code's new
timestamping feature which was causing some misbehaviours. In the
series 'Fix page_owner's use of free timestamps'
- Lorenzo Stoakes has fixed the handling of new mappings of sealed
files in the series 'permit write-sealed memfd read-only shared
mappings'
- Mike Kravetz has optimized the hugetlb vmemmap optimization in the
series 'Batch hugetlb vmemmap modification operations'
- Some buffer_head folio conversions and cleanups from Matthew Wilcox
in the series 'Finish the create_empty_buffers() transition'
- As a page allocator performance optimization Huang Ying has added
automatic tuning to the allocator's per-cpu-pages feature, in the
series 'mm: PCP high auto-tuning'
- Roman Gushchin has contributed the patchset 'mm: improve
performance of accounted kernel memory allocations' which improves
their performance by ~30% as measured by a micro-benchmark
- folio conversions from Kefeng Wang in the series 'mm: convert page
cpupid functions to folios'
- Some kmemleak fixups in Liu Shixin's series 'Some bugfix about
kmemleak'
- Qi Zheng has improved our handling of memoryless nodes by keeping
them off the allocation fallback list. This is done in the series
'handle memoryless nodes more appropriately'
- khugepaged conversions from Vishal Moola in the series 'Some
khugepaged folio conversions'"
[ bcachefs conflicts with the dynamically allocated shrinkers have been
resolved as per Stephen Rothwell in
https://lore.kernel.org/all/20230913093553.4290421e@canb.auug.org.au/
with help from Qi Zheng.
The clone3 test filtering conflict was half-arsed by yours truly ]
* tag 'mm-stable-2023-11-01-14-33' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (406 commits)
mm/damon/sysfs: update monitoring target regions for online input commit
mm/damon/sysfs: remove requested targets when online-commit inputs
selftests: add a sanity check for zswap
Documentation: maple_tree: fix word spelling error
mm/vmalloc: fix the unchecked dereference warning in vread_iter()
zswap: export compression failure stats
Documentation: ubsan: drop "the" from article title
mempolicy: migration attempt to match interleave nodes
mempolicy: mmap_lock is not needed while migrating folios
mempolicy: alloc_pages_mpol() for NUMA policy without vma
mm: add page_rmappable_folio() wrapper
mempolicy: remove confusing MPOL_MF_LAZY dead code
mempolicy: mpol_shared_policy_init() without pseudo-vma
mempolicy trivia: use pgoff_t in shared mempolicy tree
mempolicy trivia: slightly more consistent naming
mempolicy trivia: delete those ancient pr_debug()s
mempolicy: fix migrate_pages(2) syscall return nr_failed
kernfs: drop shared NUMA mempolicy hooks
hugetlbfs: drop shared NUMA mempolicy pretence
mm/damon/sysfs-test: add a unit test for damon_sysfs_set_targets()
...
Use the folio APIs, saving about four calls to compound_head().
Convert back to a page in each of the individual protocol implementations.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
If the client is doing pnfs IO and Kerberos is configured and EXCHANGEID
successfully negotiated SP4_MACH_CRED and WRITE/COMMIT are on the
list of state protected operations, then we need to make sure to
choose the DS's rpc_client structure instead of the MDS's one.
Fixes: fb91fb0ee7 ("NFS: Move call to nfs4_state_protect_write() to nfs4_write_setup()")
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Currently when client sends an EXCHANGE_ID for a possible trunked
connection, for any error that happened, the trunk will be thrown
out. However, an NFS4ERR_DELAY is a transient error that should be
retried instead.
Fixes: e818bd085b ("NFSv4.1 remove xprt from xprt_switch if session trunking test fails")
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
The flexfiles layout driver depends on NFSv3 module as data servers
might be configure to provide nfsv3 only.
Disabling the nfsv3 protocol completely disables the flexfiles layout driver,
however, the data server still might support v4.1 protocol. Thus the strond
couling betwwen flexfiles and nfsv3 modules should be relaxed, as layout driver
will return UNSUPPORTED if not matching protocol is found.
Signed-off-by: Tigran Mkrtchyan <tigran.mkrtchyan@desy.de>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
When the client is required to use TEST_STATEID to discover which
delegation(s) have been revoked, it may continually test delegations at the
head of the list if the server continues to be unsatisfied and send
SEQ4_STATUS_RECALLABLE_STATE_REVOKED. For a large number of delegations
this behavior is prone to live-lock because the client may never be able to
test and free revoked state at the end of the list since the
SEQ4_STATUS_RECALLABLE_STATE_REVOKED will cause us to flag delegations at
the head of the list to be tested. This problem is further exacerbated by
the state manager's willingness to be scheduled out on a busy system while
testing the list of delegations.
Keep a generation counter for each attempt to test all delegations, and
skip delegations that have already been tested in the current pass.
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Torkil Svensgaard <torkil@drcmr.dk>
Tested-by: Ruben Vestergaard <rubenv@drcmr.dk>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
- Add LKDTM test for stuck CPUs (Mark Rutland)
- Improve LKDTM selftest behavior under UBSan (Ricardo Cañuelo)
- Refactor more 1-element arrays into flexible arrays (Gustavo A. R. Silva)
- Analyze and replace strlcpy and strncpy uses (Justin Stitt, Azeem Shaikh)
- Convert group_info.usage to refcount_t (Elena Reshetova)
- Add __counted_by annotations (Kees Cook, Gustavo A. R. Silva)
- Add Kconfig fragment for basic hardening options (Kees Cook, Lukas Bulwahn)
- Fix randstruct GCC plugin performance mode to stay in groups (Kees Cook)
- Fix strtomem() compile-time check for small sources (Kees Cook)
-----BEGIN PGP SIGNATURE-----
iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAmU/3cUWHGtlZXNjb29r
QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJsEoEACBGPSiOmfSWdH3TOnIG270PD24
jGjg8KFv7RC/JTOdYmpLl0okdlGT9LvjN/ToSSDEw3PIayxoXUdhkbYy0MYtiV3m
yz2ozDTzJuplQX/W2fPE+nXSzIwHao2zjPPFjHnT7lt8IIjhgjiOtLfZ2gGUkW99
Mdu2aWh3u0r4tC8OS23++yN5ibRc5l72efsjDWjZ0aPXnxE1bjmLMiIPiizpndIf
beasPuDBs98sJVYouemCwnsPXuXOPz3Q1Cpo/fTd+TMTJCLSemCQZCTuOBU0acI/
ZjLCgCaJU1yIYKBMtrIN4G9kITZniXX3/Nm4o6NQMVlcCqMeNaHuflomqWoqWfhE
UPbRo2eghZOaMNiCKLLvZDIqPrh1IcsiEl6Ef3W4hICc42GTK96IuGisIvDXwQ4N
/SzTOupJuN42noh3z1M3XuZy5RoXJ99IYDNY5CTKf9IdqvA0bbGkU3nb1gZH/xw9
BjTqKzR/7K1kTXuSgagDZ1Wceej9pZxhX7E3IHYsP8ZOvKug3EeL4yybVwQ3HRfq
Qnzcp/qPB9cOkLSQXveRTFTsj2mX28Gixct/iDuc1jIYwGQlY1gI6dcUcqby6ptM
BrQti7eR2NH2+T3aE2UVCIWsZVhx7NaSF+z8JxfAuu56jicc4xJVsi8zrNveWX5M
m2VXyBl3121BVtKi4w==
=0iVF
-----END PGP SIGNATURE-----
Merge tag 'hardening-v6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull hardening updates from Kees Cook:
"One of the more voluminous set of changes is for adding the new
__counted_by annotation[1] to gain run-time bounds checking of
dynamically sized arrays with UBSan.
- Add LKDTM test for stuck CPUs (Mark Rutland)
- Improve LKDTM selftest behavior under UBSan (Ricardo Cañuelo)
- Refactor more 1-element arrays into flexible arrays (Gustavo A. R.
Silva)
- Analyze and replace strlcpy and strncpy uses (Justin Stitt, Azeem
Shaikh)
- Convert group_info.usage to refcount_t (Elena Reshetova)
- Add __counted_by annotations (Kees Cook, Gustavo A. R. Silva)
- Add Kconfig fragment for basic hardening options (Kees Cook, Lukas
Bulwahn)
- Fix randstruct GCC plugin performance mode to stay in groups (Kees
Cook)
- Fix strtomem() compile-time check for small sources (Kees Cook)"
* tag 'hardening-v6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (56 commits)
hwmon: (acpi_power_meter) replace open-coded kmemdup_nul
reset: Annotate struct reset_control_array with __counted_by
kexec: Annotate struct crash_mem with __counted_by
virtio_console: Annotate struct port_buffer with __counted_by
ima: Add __counted_by for struct modsig and use struct_size()
MAINTAINERS: Include stackleak paths in hardening entry
string: Adjust strtomem() logic to allow for smaller sources
hardening: x86: drop reference to removed config AMD_IOMMU_V2
randstruct: Fix gcc-plugin performance mode to stay in group
mailbox: zynqmp: Annotate struct zynqmp_ipi_pdata with __counted_by
drivers: thermal: tsens: Annotate struct tsens_priv with __counted_by
irqchip/imx-intmux: Annotate struct intmux_data with __counted_by
KVM: Annotate struct kvm_irq_routing_table with __counted_by
virt: acrn: Annotate struct vm_memory_region_batch with __counted_by
hwmon: Annotate struct gsc_hwmon_platform_data with __counted_by
sparc: Annotate struct cpuinfo_tree with __counted_by
isdn: kcapi: replace deprecated strncpy with strscpy_pad
isdn: replace deprecated strncpy with strscpy
NFS/flexfiles: Annotate struct nfs4_ff_layout_segment with __counted_by
nfs41: Annotate struct nfs4_file_layout_dsaddr with __counted_by
...
This release completes the SunRPC thread scheduler work that was
begun in v6.6. The scheduler can now find an svc thread to wake in
constant time and without a list walk. Thanks again to Neil Brown
for this overhaul.
Lorenzo Bianconi contributed infrastructure for a netlink-based
NFSD control plane. The long-term plan is to provide the same
functionality as found in /proc/fs/nfsd, plus some interesting
additions, and then migrate the NFSD user space utilities to
netlink.
A long series to overhaul NFSD's NFSv4 operation encoding was
applied in this release. The goals are to bring this family of
encoding functions in line with the matching NFSv4 decoding
functions and with the NFSv2 and NFSv3 XDR functions, preparing
the way for better memory safety and maintainability.
A further improvement to NFSD's write delegation support was
contributed by Dai Ngo. This adds a CB_GETATTR callback,
enabling the server to retrieve cached size and mtime data from
clients holding write delegations. If the server can retrieve
this information, it does not have to recall the delegation in
some cases.
The usual panoply of bug fixes and minor improvements round out
this release. As always I am grateful to all contributors,
reviewers, and testers.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmU5IuoACgkQM2qzM29m
f5eVsg//bVp8S93ci/oDlKfzOwH2fO5e5rna91wrDpJxkd51h6KTx55dSRG5sjAZ
EywIVOann6xCtsixAPyff5Cweg2dWvzQRsy1ZnvWQ1qZBzD5KAJY5LPkeSFUCKBo
Zani/qTOYbxzgFMjZx+yDSXDPKG68WYZBQK59SI7mURu4SYdk8aRyNY8mjHfr0Vh
Aqrcny4oVtXV4sL5P5G/2FUW7WKT3olA3jSYlRRNMhbs2qpEemRCCrspOEMMad+b
t1+ZCg+U27PMranvOJnof4RU7peZbaxDWA0gyiUbivVXVtZn9uOs0ffhktkvechL
ePc33dqdp2ITdKIPA6JlaRv5WflKXQw0YYM9Kv5mcR4A2el7owL4f/pMlPhtbYwJ
IOJv15KdKVN979G2e6WMYiKK+iHfaUUguhMEXnfnGoAajHOZNQiUEo3iFQAD7LDc
DvMF8d9QqYmB9IW8FOYaRRfZGJOQHf3TL79Nd08z/bn5swvlvfj77leux9Sb+0/m
Luk2Xvz2AJVSXE31wzabaGHkizN+BtH+e4MMbXUHBPW5jE9v7XOnEUFr4UdZyr9P
Gl87A7NcrzNjJWT5TrnzM4sOslNsx46Aeg+VuNt2fSRn2dm6iBu2B8s0N4imx6dV
PX1y9VSLq5WRhjrFZ1qeiZdsuTaQtrEiNDoRIQR6nCJPAV80iFk=
=B4wJ
-----END PGP SIGNATURE-----
Merge tag 'nfsd-6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Pull nfsd updates from Chuck Lever:
"This release completes the SunRPC thread scheduler work that was begun
in v6.6. The scheduler can now find an svc thread to wake in constant
time and without a list walk. Thanks again to Neil Brown for this
overhaul.
Lorenzo Bianconi contributed infrastructure for a netlink-based NFSD
control plane. The long-term plan is to provide the same functionality
as found in /proc/fs/nfsd, plus some interesting additions, and then
migrate the NFSD user space utilities to netlink.
A long series to overhaul NFSD's NFSv4 operation encoding was applied
in this release. The goals are to bring this family of encoding
functions in line with the matching NFSv4 decoding functions and with
the NFSv2 and NFSv3 XDR functions, preparing the way for better memory
safety and maintainability.
A further improvement to NFSD's write delegation support was
contributed by Dai Ngo. This adds a CB_GETATTR callback, enabling the
server to retrieve cached size and mtime data from clients holding
write delegations. If the server can retrieve this information, it
does not have to recall the delegation in some cases.
The usual panoply of bug fixes and minor improvements round out this
release. As always I am grateful to all contributors, reviewers, and
testers"
* tag 'nfsd-6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (127 commits)
svcrdma: Fix tracepoint printk format
svcrdma: Drop connection after an RDMA Read error
NFSD: clean up alloc_init_deleg()
NFSD: Fix frame size warning in svc_export_parse()
NFSD: Rewrite synopsis of nfsd_percpu_counters_init()
nfsd: Clean up errors in nfs3proc.c
nfsd: Clean up errors in nfs4state.c
NFSD: Clean up errors in stats.c
NFSD: simplify error paths in nfsd_svc()
NFSD: Clean up nfsd4_encode_seek()
NFSD: Clean up nfsd4_encode_offset_status()
NFSD: Clean up nfsd4_encode_copy_notify()
NFSD: Clean up nfsd4_encode_copy()
NFSD: Clean up nfsd4_encode_test_stateid()
NFSD: Clean up nfsd4_encode_exchange_id()
NFSD: Clean up nfsd4_do_encode_secinfo()
NFSD: Clean up nfsd4_encode_access()
NFSD: Clean up nfsd4_encode_readdir()
NFSD: Clean up nfsd4_encode_entry4()
NFSD: Add an nfsd4_encode_nfs_cookie4() helper
...
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZTppYgAKCRCRxhvAZXjc
okIHAP9anLz1QDyMLH12ASuHjgBc0Of3jcB6NB97IWGpL4O21gEA46ohaD+vcJuC
YkBLU3lXqQ87nfu28ExFAzh10hG2jwM=
=m4pB
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.7.ctime' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs
Pull vfs inode time accessor updates from Christian Brauner:
"This finishes the conversion of all inode time fields to accessor
functions as discussed on list. Changing timestamps manually as we
used to do before is error prone. Using accessors function makes this
robust.
It does not contain the switch of the time fields to discrete 64 bit
integers to replace struct timespec and free up space in struct inode.
But after this, the switch can be trivially made and the patch should
only affect the vfs if we decide to do it"
* tag 'vfs-6.7.ctime' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: (86 commits)
fs: rename inode i_atime and i_mtime fields
security: convert to new timestamp accessors
selinux: convert to new timestamp accessors
apparmor: convert to new timestamp accessors
sunrpc: convert to new timestamp accessors
mm: convert to new timestamp accessors
bpf: convert to new timestamp accessors
ipc: convert to new timestamp accessors
linux: convert to new timestamp accessors
zonefs: convert to new timestamp accessors
xfs: convert to new timestamp accessors
vboxsf: convert to new timestamp accessors
ufs: convert to new timestamp accessors
udf: convert to new timestamp accessors
ubifs: convert to new timestamp accessors
tracefs: convert to new timestamp accessors
sysv: convert to new timestamp accessors
squashfs: convert to new timestamp accessors
server: convert to new timestamp accessors
client: convert to new timestamp accessors
...
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZTppWAAKCRCRxhvAZXjc
okB2AP4jjoRErJBwj245OIDJqzoj4m4UVOVd0MH2AkiSpANczwD/TToChdpusY2y
qAYg1fQoGMbDVlb7Txaj9qI9ieCf9w0=
=2PXg
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.7.xattr' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs
Pull vfs xattr updates from Christian Brauner:
"The 's_xattr' field of 'struct super_block' currently requires a
mutable table of 'struct xattr_handler' entries (although each handler
itself is const). However, no code in vfs actually modifies the
tables.
This changes the type of 's_xattr' to allow const tables, and modifies
existing file systems to move their tables to .rodata. This is
desirable because these tables contain entries with function pointers
in them; moving them to .rodata makes it considerably less likely to
be modified accidentally or maliciously at runtime"
* tag 'vfs-6.7.xattr' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: (30 commits)
const_structs.checkpatch: add xattr_handler
net: move sockfs_xattr_handlers to .rodata
shmem: move shmem_xattr_handlers to .rodata
overlayfs: move xattr tables to .rodata
xfs: move xfs_xattr_handlers to .rodata
ubifs: move ubifs_xattr_handlers to .rodata
squashfs: move squashfs_xattr_handlers to .rodata
smb: move cifs_xattr_handlers to .rodata
reiserfs: move reiserfs_xattr_handlers to .rodata
orangefs: move orangefs_xattr_handlers to .rodata
ocfs2: move ocfs2_xattr_handlers and ocfs2_xattr_handler_map to .rodata
ntfs3: move ntfs_xattr_handlers to .rodata
nfs: move nfs4_xattr_handlers to .rodata
kernfs: move kernfs_xattr_handlers to .rodata
jfs: move jfs_xattr_handlers to .rodata
jffs2: move jffs2_xattr_handlers to .rodata
hfsplus: move hfsplus_xattr_handlers to .rodata
hfs: move hfs_xattr_handlers to .rodata
gfs2: move gfs2_xattr_handlers_max to .rodata
fuse: move fuse_xattr_handlers to .rodata
...
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZTpoQAAKCRCRxhvAZXjc
ovFNAQDgIRjXfZ1Ku+USxsRRdqp8geJVaNc3PuMmYhOYhUenqgEAmC1m+p0y31dS
P6+HlL16Mqgu0tpLCcJK9BibpDZ0Ew4=
=7yD1
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.7.misc' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs
Pull misc vfs updates from Christian Brauner:
"This contains the usual miscellaneous features, cleanups, and fixes
for vfs and individual fses.
Features:
- Rename and export helpers that get write access to a mount. They
are used in overlayfs to get write access to the upper mount.
- Print the pretty name of the root device on boot failure. This
helps in scenarios where we would usually only print
"unknown-block(1,2)".
- Add an internal SB_I_NOUMASK flag. This is another part in the
endless POSIX ACL saga in a way.
When POSIX ACLs are enabled via SB_POSIXACL the vfs cannot strip
the umask because if the relevant inode has POSIX ACLs set it might
take the umask from there. But if the inode doesn't have any POSIX
ACLs set then we apply the umask in the filesytem itself. So we end
up with:
(1) no SB_POSIXACL -> strip umask in vfs
(2) SB_POSIXACL -> strip umask in filesystem
The umask semantics associated with SB_POSIXACL allowed filesystems
that don't even support POSIX ACLs at all to raise SB_POSIXACL
purely to avoid umask stripping. That specifically means NFS v4 and
Overlayfs. NFS v4 does it because it delegates this to the server
and Overlayfs because it needs to delegate umask stripping to the
upper filesystem, i.e., the filesystem used as the writable layer.
This went so far that SB_POSIXACL is raised eve on kernels that
don't even have POSIX ACL support at all.
Stop this blatant abuse and add SB_I_NOUMASK which is an internal
superblock flag that filesystems can raise to opt out of umask
handling. That should really only be the two mentioned above. It's
not that we want any filesystems to do this. Ideally we have all
umask handling always in the vfs.
- Make overlayfs use SB_I_NOUMASK too.
- Now that we have SB_I_NOUMASK, stop checking for SB_POSIXACL in
IS_POSIXACL() if the kernel doesn't have support for it. This is a
very old patch but it's only possible to do this now with the wider
cleanup that was done.
- Follow-up work on fake path handling from last cycle. Citing mostly
from Amir:
When overlayfs was first merged, overlayfs files of regular files
and directories, the ones that are installed in file table, had a
"fake" path, namely, f_path is the overlayfs path and f_inode is
the "real" inode on the underlying filesystem.
In v6.5, we took another small step by introducing of the
backing_file container and the file_real_path() helper. This change
allowed vfs and filesystem code to get the "real" path of an
overlayfs backing file. With this change, we were able to make
fsnotify work correctly and report events on the "real" filesystem
objects that were accessed via overlayfs.
This method works fine, but it still leaves the vfs vulnerable to
new code that is not aware of files with fake path. A recent
example is commit db1d1e8b98 ("IMA: use vfs_getattr_nosec to get
the i_version"). This commit uses direct referencing to f_path in
IMA code that otherwise uses file_inode() and file_dentry() to
reference the filesystem objects that it is measuring.
This contains work to switch things around: instead of having
filesystem code opt-in to get the "real" path, have generic code
opt-in for the "fake" path in the few places that it is needed.
Is it far more likely that new filesystems code that does not use
the file_dentry() and file_real_path() helpers will end up causing
crashes or averting LSM/audit rules if we keep the "fake" path
exposed by default.
This change already makes file_dentry() moot, but for now we did
not change this helper just added a WARN_ON() in ovl_d_real() to
catch if we have made any wrong assumptions.
After the dust settles on this change, we can make file_dentry() a
plain accessor and we can drop the inode argument to ->d_real().
- Switch struct file to SLAB_TYPESAFE_BY_RCU. This looks like a small
change but it really isn't and I would like to see everyone on
their tippie toes for any possible bugs from this work.
Essentially we've been doing most of what SLAB_TYPESAFE_BY_RCU for
files since a very long time because of the nasty interactions
between the SCM_RIGHTS file descriptor garbage collection. So
extending it makes a lot of sense but it is a subtle change. There
are almost no places that fiddle with file rcu semantics directly
and the ones that did mess around with struct file internal under
rcu have been made to stop doing that because it really was always
dodgy.
I forgot to put in the link tag for this change and the discussion
in the commit so adding it into the merge message:
https://lore.kernel.org/r/20230926162228.68666-1-mjguzik@gmail.com
Cleanups:
- Various smaller pipe cleanups including the removal of a spin lock
that was only used to protect against writes without pipe_lock()
from O_NOTIFICATION_PIPE aka watch queues. As that was never
implemented remove the additional locking from pipe_write().
- Annotate struct watch_filter with the new __counted_by attribute.
- Clarify do_unlinkat() cleanup so that it doesn't look like an extra
iput() is done that would cause issues.
- Simplify file cleanup when the file has never been opened.
- Use module helper instead of open-coding it.
- Predict error unlikely for stale retry.
- Use WRITE_ONCE() for mount expiry field instead of just commenting
that one hopes the compiler doesn't get smart.
Fixes:
- Fix readahead on block devices.
- Fix writeback when layztime is enabled and inodes whose timestamp
is the only thing that changed reside on wb->b_dirty_time. This
caused excessively large zombie memory cgroup when lazytime was
enabled as such inodes weren't handled fast enough.
- Convert BUG_ON() to WARN_ON_ONCE() in open_last_lookups()"
* tag 'vfs-6.7.misc' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: (26 commits)
file, i915: fix file reference for mmap_singleton()
vfs: Convert BUG_ON to WARN_ON_ONCE in open_last_lookups
writeback, cgroup: switch inodes with dirty timestamps to release dying cgwbs
chardev: Simplify usage of try_module_get()
ovl: rely on SB_I_NOUMASK
fs: fix umask on NFS with CONFIG_FS_POSIX_ACL=n
fs: store real path instead of fake path in backing file f_path
fs: create helper file_user_path() for user displayed mapped file path
fs: get mnt_writers count for an open backing file's real path
vfs: stop counting on gcc not messing with mnt_expiry_mark if not asked
vfs: predict the error in retry_estale as unlikely
backing file: free directly
vfs: fix readahead(2) on block devices
io_uring: use files_lookup_fd_locked()
file: convert to SLAB_TYPESAFE_BY_RCU
vfs: shave work on failed file open
fs: simplify misleading code to remove ambiguity regarding ihold()/iput()
watch_queue: Annotate struct watch_filter with __counted_by
fs/pipe: use spinlock in pipe_read() only if there is a watch_queue
fs/pipe: remove unnecessary spinlock from pipe_write()
...
Convert block device handling to use bdev_open_by_dev/path() and pass
the handle around.
CC: linux-nfs@vger.kernel.org
CC: Trond Myklebust <trond.myklebust@hammerspace.com>
CC: Anna Schumaker <anna@kernel.org>
Acked-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230927093442.25915-25-jack@suse.cz
Signed-off-by: Christian Brauner <brauner@kernel.org>
If we're using the 'softerr' mount option, we may want to allow
layoutget to return EAGAIN to allow knfsd server threads to return a
JUKEBOX/DELAY error to the client instead of busy waiting.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
When using a 'softerr' mount, the NFSv4 client can get stuck waiting
forever while the server just returns NFS4ERR_DELAY. Among other things,
this causes the knfsd server threads to busy wait.
Add a parameter that tells the NFSv4 client how many times to retry
before giving up.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
SB_POSIXACL must be set when a filesystem supports POSIX ACLs, but NFSv4
also sets this flag to prevent the VFS from applying the umask on
newly-created files. NFSv4 doesn't support POSIX ACLs however, which
causes confusion when other subsystems try to test for them.
Add a new SB_I_NOUMASK flag that allows filesystems to opt-in to umask
stripping without advertising support for POSIX ACLs. Set the new flag
on NFSv4 instead of SB_POSIXACL.
Also, move mode_strip_umask to namei.h and convert init_mknod and
init_mkdir to use it.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Message-Id: <20230911-acl-fix-v3-1-b25315333f6c@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
This patches fixes commit 51d674a5e4 "NFSv4.1: use
EXCHGID4_FLAG_USE_PNFS_DS for DS server", purpose of that
commit was to mark EXCHANGE_ID to the DS with the appropriate
flag.
However, connection to MDS can return both EXCHGID4_FLAG_USE_PNFS_DS
and EXCHGID4_FLAG_USE_PNFS_MDS set but previous patch would only
remember the USE_PNFS_DS and for the 2nd EXCHANGE_ID send that
to the MDS.
Instead, just mark the pnfs path exclusively.
Fixes: 51d674a5e4 ("NFSv4.1: use EXCHGID4_FLAG_USE_PNFS_DS for DS server")
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Ensure that we check the layout pointer and validity after dereferencing
it in ff_layout_mirror_prepare_stats.
Fixes: 08e2e5bc6c ("pNFS/flexfiles: Clean up layoutstats")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
We are not allowed to call pnfs_mark_matching_lsegs_return() without
also holding a reference to the layout header, since doing so could lead
to the reference count going to zero when we call
pnfs_layout_remove_lseg(). This again can lead to a hang when we get to
nfs4_evict_inode() and are unable to clear the layout pointer.
pnfs_layout_return_unused_byserver() is guilty of this behaviour, and
has been seen to trigger the refcount warning prior to a hang.
Fixes: b6d49ecd10 ("NFSv4: Fix a pNFS layout related use-after-free race when freeing the inode")
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
svc threads are currently stopped using kthread_stop(). This requires
identifying a specific thread. However we don't care which thread
stops, just as long as one does.
So instead, set a flag in the svc_pool to say that a thread needs to
die, and have each thread check this flag instead of calling
kthread_should_stop(). The first thread to find and clear this flag
then moves towards exiting.
This removes an explicit dependency on sp_all_threads which will make a
future patch simpler.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Using svc_recv() for (NFSv4.1) back-channel handling means we have just
one mechanism for waking threads.
Also change kthread_freezable_should_stop() in nfs4_callback_svc() to
kthread_should_stop() as used elsewhere.
kthread_freezable_should_stop() effectively adds a try_to_freeze() call,
and svc_recv() already contains that at an appropriate place.
Signed-off-by: NeilBrown <neilb@suse.de>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
The test robot complained that, in some build configurations, the
@error variable in bc_svc_process's only caller is set but never
used. This happens because dprintk() is the only consumer of that
value.
- Remove the dprintk() call sites in favor of the svc_process
tracepoint
- The @error variable and the return value of bc_svc_process() are
now unused, so get rid of them.
- The @serv parameter is set to rqstp->rq_serv by the only caller,
and bc_svc_process() then uses it only to set rqstp->rq_serv. It
can be removed.
- Rename bc_svc_process() according to the convention that
globally-visible RPC server functions have names that begin with
"svc_"; and because it is globally-visible, give it a proper
kdoc comment.
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202308121314.HA8Rq2XG-lkp@intel.com/
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Once a folio's private data has been cleared, it's possible for another
process to clear the folio->mapping (e.g. via invalidate_complete_folio2
or evict_mapping_folio), so it wouldn't be safe to call
nfs_page_to_inode() after that.
Fixes: 0c493b5cf1 ("NFS: Convert buffered writes to use folios")
Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Benjamin Coddington <bcodding@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
The Linux NFS server strips the SUID and SGID from the file mode
on ALLOCATE op.
Modify _nfs42_proc_fallocate to add NFS_INO_REVAL_FORCED to
nfs_set_cache_invalid's argument to force update of the file
mode suid/sgid bit.
Suggested-by: Trond Myklebust <trondmy@hammerspace.com>
Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Tested-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
This makes it harder for accidental or malicious changes to
nfs4_xattr_handlers at runtime.
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Anna Schumaker <anna@kernel.org>
Cc: linux-nfs@vger.kernel.org
Signed-off-by: Wedson Almeida Filho <walmeida@microsoft.com>
Link: https://lore.kernel.org/r/20230930050033.41174-19-wedsonaf@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
Prepare for the coming implementation by GCC and Clang of the __counted_by
attribute. Flexible array members annotated with __counted_by can have
their accesses bounds-checked at run-time checking via CONFIG_UBSAN_BOUNDS
(for array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family
functions).
As found with Coccinelle[1], add __counted_by for struct nfs4_ff_layout_segment.
[1] https://github.com/kees/kernel-tools/blob/trunk/coccinelle/examples/counted_by.cocci
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Anna Schumaker <anna@kernel.org>
Cc: linux-nfs@vger.kernel.org
Reviewed-by: "Gustavo A. R. Silva" <gustavoars@kernel.org>
Link: https://lore.kernel.org/r/20230915201434.never.346-kees@kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Prepare for the coming implementation by GCC and Clang of the __counted_by
attribute. Flexible array members annotated with __counted_by can have
their accesses bounds-checked at run-time checking via CONFIG_UBSAN_BOUNDS
(for array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family
functions).
As found with Coccinelle[1], add __counted_by for struct nfs4_file_layout_dsaddr.
[1] https://github.com/kees/kernel-tools/blob/trunk/coccinelle/examples/counted_by.cocci
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Anna Schumaker <anna@kernel.org>
Cc: "Gustavo A. R. Silva" <gustavoars@kernel.org>
Cc: linux-nfs@vger.kernel.org
Link: https://lore.kernel.org/r/20230915201427.never.771-kees@kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Commit 4dc73c6791 reintroduces the deadlock that was fixed by commit
aeabb3c961 ("NFSv4: Fix a NFSv4 state manager deadlock") because it
prevents the setup of new threads to handle reboot recovery, while the
older recovery thread is stuck returning delegations.
Fixes: 4dc73c6791 ("NFSv4: keep state manager thread active if swap is enabled")
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
If the NFS4CLNT_RUN_MANAGER flag got set just before we cleared
NFS4CLNT_MANAGER_RUNNING, then we might have won the race against
nfs4_schedule_state_manager(), and are responsible for handling the
recovery situation.
Fixes: aeabb3c961 ("NFSv4: Fix a NFSv4 state manager deadlock")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Currently, if the OPEN compound experiencing an error and needs to
get the file attributes separately, it will send a stand alone
GETATTR but it would use the filehandle from the results of
the OPEN compound. In case of the CLAIM_FH OPEN, nfs_openres's fh
is zero value. That generate a GETATTR that's sent with a zero
value filehandle, and results in the server returning an error.
Instead, for the CLAIM_FH OPEN, take the filehandle that was used
in the PUTFH of the OPEN compound.
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Currently, when GETDEVICEINFO returns multiple locations where each
is a different IP but the server's identity is same as MDS, then
nfs4_set_ds_client() finds the existing nfs_client structure which
has the MDS's max_connect value (and if it's 1), then the 1st IP
on the DS's list will get dropped due to MDS trunking rules. Other
IPs would be added as they fall under the pnfs trunking rules.
For the list of IPs the 1st goes thru calling nfs4_set_ds_client()
which will eventually call nfs4_add_trunk() and call into
rpc_clnt_test_and_add_xprt() which has the check for MDS trunking.
The other IPs (after the 1st one), would call rpc_clnt_add_xprt()
which doesn't go thru that check.
nfs4_add_trunk() is called when MDS trunking is happening and it
needs to enforce the usage of max_connect mount option of the
1st mount. However, this shouldn't be applied to pnfs flow.
Instead, this patch proposed to treat MDS=DS as DS trunking and
make sure that MDS's max_connect limit does not apply to the
1st IP returned in the GETDEVICEINFO list. It does so by
marking the newly created client with a new flag NFS_CS_PNFS
which then used to pass max_connect value to use into the
rpc_clnt_test_and_add_xprt() instead of the existing rpc
client's max_connect value set by the MDS connection.
For example, mount was done without max_connect value set
so MDS's rpc client has cl_max_connect=1. Upon calling into
rpc_clnt_test_and_add_xprt() and using rpc client's value,
the caller passes in max_connect value which is previously
been set in the pnfs path (as a part of handling
GETDEVICEINFO list of IPs) in nfs4_set_ds_client().
However, when NFS_CS_PNFS flag is not set and we know we
are doing MDS trunking, comparing a new IP of the same
server, we then set the max_connect value to the
existing MDS's value and pass that into
rpc_clnt_test_and_add_xprt().
Fixes: dc48e0abee ("SUNRPC enforce creation of no more than max_connect xprts")
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
With IPv6, connect() can occasionally return EINVAL if a route is
unavailable. If this happens during I/O to a data server, we want to
report it using LAYOUTERROR as an inability to connect.
Fixes: dd52128afd ("NFSv4.1/pnfs Ensure flexfiles reports all connection related errors")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Ensure that all requests are put back onto the commit list so that they
can be rescheduled.
Fixes: 4daaeba938 ("NFS: Fix nfs_direct_write_reschedule_io()")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Ensure that nfs_clear_request_commit() updates the correct counters when
it removes them from the commit list.
Fixes: ed5d588fe4 ("NFS: Try to join page groups before an O_DIRECT retransmission")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
If we hit a fatal error when retransmitting, we do need to record the
removal of the request from the count of written bytes.
Fixes: 031d73ed76 ("NFS: Fix O_DIRECT accounting of number of bytes read/written")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
The dreq fields are protected by the dreq->lock.
Fixes: 954998b60c ("NFS: Fix error handling for O_DIRECT write scheduling")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
If we fail to schedule a request for transmission, there are 2
possibilities:
1) Either we hit a fatal error, and we just want to drop the remaining
requests on the floor.
2) We were asked to try again, in which case we should allow the
outstanding RPC calls to complete, so that we can recoalesce requests
and try again.
Fixes: d600ad1f2b ("NFS41: pop some layoutget errors to application")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
New Features:
* Enable the NFS v4.2 READ_PLUS operation by default
Stable Fixes:
* NFSv4/pnfs: minor fix for cleanup path in nfs4_get_device_info
* NFS: Fix a potential data corruption
Bugfixes:
* Fix various READ_PLUS issues including:
* smatch warnings
* xdr size calculations
* scratch buffer handling
* 32bit / highmem xdr page handling
* Fix checkpatch errors in file.c
* Fix redundant readdir request after an EOF
* Fix handling of COPY ERR_OFFLOAD_NO_REQ
* Fix assignment of xprtdata.cred
Cleanups:
* Remove unused xprtrdma function declarations
* Clean up an integer overflow check to avoid a warning
* Clean up #includes in dns_resolve.c
* Clean up nfs4_get_device_info so we don't pass a NULL pointer to __free_page()
* Clean up sunrpc TCP socket timeout configuration
* Guard against READDIR loops when entry names are too long
* Use EXCHID4_FLAG_USE_PNFS_DS for DS servers
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAmTwzwYACgkQ18tUv7Cl
QOtIhBAA+BOh7MB6yjlyctFxABJiXz2x2Dehxy7Ox15LfnyStqQAUEpk35CXWvjC
iNxpZJ486+WrzM76WGEaRbECK9nTQLK1yacR3V1zpnDwHWIJA6VHN6qU4JAfSMu7
XbhWkHWry6d7PXhvqHlaiYvPX2pF39wUzfH+vLlzS2QLIkpT6LnG0zVRJTQvLCmq
zE5xD+NCQ1Dpo9VnouuzW7VVfm532hI7GQNrpo0E0vWKgeQD+/fOpDu23MW8A1Ua
ZgVMAc7vScgDZH8/20Ze5PH4jAEB4gwEIzjreQlIXr7Tf+mE7qn435lgOuvdMQCW
eHhdNriZ2X6HMLhNFFpup8bkRKGCCTooTHC1W66n9CuxIAuVT5DNwBbakpagHSZf
J4ho81hEgBfc5zppISVjV6eFK4brM0rF9AliaIw8r/qGcMmO1CILi9tLGiheiJcT
LuId7U2sE/vfIa6SiBt7rx37/MkrgLlAgjpk4dCRJQW+gKVBi09sMGnDlgaRvCZz
T0WCsK4DgI9q2rScpwJYJbNWbC2Q8qUtYWW9LSRvwhbNdm/VbRnEHWA7eOwqqm8r
KkkF4chyoTJqpnF3SjxT/lyFk6GwsD+wXafOmEeuFA6Si3dHDU9i3aUf+cCXhwRI
uUzCUHYcCKnv4QVGPuAbIdxMgueNCuLoeWgTClVlqidv7GRyz7Y=
=rjmq
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-6.6-1' of git://git.linux-nfs.org/projects/anna/linux-nfs
Pull NFS client updates from Anna Schumaker:
"New Features:
- Enable the NFS v4.2 READ_PLUS operation by default
Stable Fixes:
- NFSv4/pnfs: minor fix for cleanup path in nfs4_get_device_info
- NFS: Fix a potential data corruption
Bugfixes:
- Fix various READ_PLUS issues including:
- smatch warnings
- xdr size calculations
- scratch buffer handling
- 32bit / highmem xdr page handling
- Fix checkpatch errors in file.c
- Fix redundant readdir request after an EOF
- Fix handling of COPY ERR_OFFLOAD_NO_REQ
- Fix assignment of xprtdata.cred
Cleanups:
- Remove unused xprtrdma function declarations
- Clean up an integer overflow check to avoid a warning
- Clean up #includes in dns_resolve.c
- Clean up nfs4_get_device_info so we don't pass a NULL pointer
to __free_page()
- Clean up sunrpc TCP socket timeout configuration
- Guard against READDIR loops when entry names are too long
- Use EXCHID4_FLAG_USE_PNFS_DS for DS servers"
* tag 'nfs-for-6.6-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (22 commits)
pNFS: Fix assignment of xprtdata.cred
NFSv4.2: fix handling of COPY ERR_OFFLOAD_NO_REQ
NFS: Guard against READDIR loop when entry names exceed MAXNAMELEN
NFSv4.1: use EXCHGID4_FLAG_USE_PNFS_DS for DS server
NFS/pNFS: Set the connect timeout for the pNFS flexfiles driver
SUNRPC: Don't override connect timeouts in rpc_clnt_add_xprt()
SUNRPC: Allow specification of TCP client connect timeout at setup
SUNRPC: Refactor and simplify connect timeout
SUNRPC: Set the TCP_SYNCNT to match the socket timeout
NFS: Fix a potential data corruption
nfs: fix redundant readdir request after get eof
nfs/blocklayout: Use the passed in gfp flags
filemap: Fix errors in file.c
NFSv4/pnfs: minor fix for cleanup path in nfs4_get_device_info
NFS: Move common includes outside ifdef
SUNRPC: clean up integer overflow check
xprtrdma: Remove unused function declaration rpcrdma_bc_post_recv()
NFS: Enable the READ_PLUS operation by default
SUNRPC: kmap() the xdr pages during decode
NFSv4.2: Rework scratch handling for READ_PLUS (again)
...
I'm thrilled to announce that the Linux in-kernel NFS server now
offers NFSv4 write delegations. A write delegation enables a client
to cache data and metadata for a single file more aggressively,
reducing network round trips and server workload. Many thanks to Dai
Ngo for contributing this facility, and to Jeff Layton and Neil
Brown for reviewing and testing it.
This release also sees the removal of all support for DES- and
triple-DES-based Kerberos encryption types in the kernel's SunRPC
implementation. These encryption types have been deprecated by the
Internet community for years and are considered insecure. This
change affects both the in-kernel NFS client and server.
The server's UDP and TCP socket transports have now fully adopted
David Howells' new bio_vec iterator so that no more than one
sendmsg() call is needed to transmit each RPC message. In
particular, this helps kTLS optimize record boundaries when sending
RPC-with-TLS replies, and it takes the server a baby step closer to
handling file I/O via folios.
We've begun work on overhauling the SunRPC thread scheduler to
remove a costly linked-list walk when looking for an idle RPC
service thread to wake. The pre-requisites are included in this
release. Thanks to Neil Brown for his ongoing work on this
improvement.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmTwoa0ACgkQM2qzM29m
f5cZvw/8CmFVNC27aMrJEhRRhwwrXbLzUkWh9GCYkG98PHYiLxLTvZ6qELXAax/a
UjSgIDSRcWl4z8M/tyBQtgsw7NADr+7XWqEoXR7HZ5pEEC/KNGM0oQWQ92ojjKYy
JmHdB02uaDJfcd9ioFTU13cw7q2BQfoe2xLI8yqis2vcVSu92AM7aIw+cvJIpwQB
inA3TIIsYTV/gPByXSfEtvmYACadoFiMvfvYwaWhjFS9MdSzFmcVG0Dp3EFIig29
odmWEofcz6uIvUWvUswWEGdoSu7uOKIztSuAI4PlTwaofUaSKG6e5kmtpr3cLERD
Uhg2lm5JgqkXBd7QHObNimJ4DtQzFwHmhA08qo8rd/zba75mn/Hr5IF0q3Rxs99J
SRYHcAeP8afKn5Ge0yzoTgCNcqhfz8KLRfoCQX49mljr+muNxld4nMklD2KdUwJi
XEB512/q3E4nUgopXZiSJYQYAq1CfdR5WpGipZ9X0XK9HZBDF/qhXGtk1YQuNWyj
ZxJS3bfBza4oVIvP5/ehjCIQwOvqkcrC5zZGDIgDvw9Q6L3L1wqmVntsdCLCLRcJ
jB4IOsj+DECfJ6w2vP2SZ3GeMtFnyuTQjhUTkjPuAKTBBiKo4Tj0o/agpfDYbWZy
1l3a2yH5jqJgkm4MaVh3YHRJGc0ub0ccpIrs3QQ4jvjMLQ/3Gcs=
=XGHs
-----END PGP SIGNATURE-----
Merge tag 'nfsd-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Pull nfsd updates from Chuck Lever:
"I'm thrilled to announce that the Linux in-kernel NFS server now
offers NFSv4 write delegations. A write delegation enables a client to
cache data and metadata for a single file more aggressively, reducing
network round trips and server workload. Many thanks to Dai Ngo for
contributing this facility, and to Jeff Layton and Neil Brown for
reviewing and testing it.
This release also sees the removal of all support for DES- and
triple-DES-based Kerberos encryption types in the kernel's SunRPC
implementation. These encryption types have been deprecated by the
Internet community for years and are considered insecure. This change
affects both the in-kernel NFS client and server.
The server's UDP and TCP socket transports have now fully adopted
David Howells' new bio_vec iterator so that no more than one sendmsg()
call is needed to transmit each RPC message. In particular, this helps
kTLS optimize record boundaries when sending RPC-with-TLS replies, and
it takes the server a baby step closer to handling file I/O via
folios.
We've begun work on overhauling the SunRPC thread scheduler to remove
a costly linked-list walk when looking for an idle RPC service thread
to wake. The pre-requisites are included in this release. Thanks to
Neil Brown for his ongoing work on this improvement"
* tag 'nfsd-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (56 commits)
Documentation: Add missing documentation for EXPORT_OP flags
SUNRPC: Remove unused declaration rpc_modcount()
SUNRPC: Remove unused declarations
NFSD: da_addr_body field missing in some GETDEVICEINFO replies
SUNRPC: Remove return value of svc_pool_wake_idle_thread()
SUNRPC: make rqst_should_sleep() idempotent()
SUNRPC: Clean up svc_set_num_threads
SUNRPC: Count ingress RPC messages per svc_pool
SUNRPC: Deduplicate thread wake-up code
SUNRPC: Move trace_svc_xprt_enqueue
SUNRPC: Add enum svc_auth_status
SUNRPC: change svc_xprt::xpt_flags bits to enum
SUNRPC: change svc_rqst::rq_flags bits to enum
SUNRPC: change svc_pool::sp_flags bits to enum
SUNRPC: change cache_head.flags bits to enum
SUNRPC: remove timeout arg from svc_recv()
SUNRPC: change svc_recv() to return void.
SUNRPC: call svc_process() from svc_recv().
nfsd: separate nfsd_last_thread() from nfsd_put()
nfsd: Simplify code around svc_exit_thread() call in nfsd()
...
NFS switch to open coding kill_anon_super in 7b14a21389
("nfs: don't call bdi_unregister") to avoid the extra bdi_unregister
call. At that point bdi_destroy was called in nfs_free_server and
thus it required a later freeing of the anon dev_t. But since
0db10944a7 ("nfs: Convert to separately allocated bdi") the bdi has
been free implicitly by the sb destruction, so this isn't needed
anymore.
By not open coding kill_anon_super, nfs now inherits the fix in
dc3216b141 ("super: ensure valid info"), and we remove the only
open coded version of kill_anon_super.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Message-Id: <20230831052940.256193-1-hch@lst.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
The comma at the end of the line was leftover from an earlier refactor
of the _nfs4_pnfs_v3_ds_connect() function. This is technically valid C,
so the compilers didn't catch it, but if I'm understanding how it works
correctly it assigns the return value of rpc_clnt_add_xprtr() to
xprtdata.cred.
Reported-by: Olga Kornievskaia <kolga@netapp.com>
Fixes: a12f996d34 ("NFSv4/pNFS: Use connections to a DS that are all of the same protocol family")
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
If the client sent a synchronous copy and the server replied with
ERR_OFFLOAD_NO_REQ indicating that it wants an asynchronous
copy instead, the client should retry with asynchronous copy.
Fixes: 539f57b3e0 ("NFS handle COPY ERR_OFFLOAD_NO_REQS")
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Commit 64cfca85ba asserts the only valid return values for
nfs2/3_decode_dirent should not include -ENAMETOOLONG, but for a server
that sends a filename3 which exceeds MAXNAMELEN in a READDIR response the
client's behavior will be to endlessly retry the operation.
We could map -ENAMETOOLONG into -EBADCOOKIE, but that would produce
truncated listings without any error. The client should return an error
for this case to clearly assert that the server implementation must be
corrected.
Fixes: 64cfca85ba ("NFS: Return valid errors from nfs2/3_decode_dirent()")
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
In addition to the benefits of using an enum rather than a set of
macros, we now have a named type that can improve static type
checking of function return values.
As part of this change, I removed a stale comment from svcauth.h;
the return values from current implementations of the
auth_ops::release method are all zero/negative errno, not the SVC_OK
enum values as the old comment suggested.
Suggested-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Most svc threads have no interest in a timeout.
nfsd sets it to 1 hour, but this is a wart of no significance.
lockd uses the timeout so that it can call nlmsvc_retry_blocked().
It also sometimes calls svc_wake_up() to ensure this is called.
So change lockd to be consistent and always use svc_wake_up() to trigger
nlmsvc_retry_blocked() - using a timer instead of a timeout to
svc_recv().
And change svc_recv() to not take a timeout arg.
This makes the sp_threads_timedout counter always zero.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
svc_recv() currently returns a 0 on success or one of two errors:
- -EAGAIN means no message was successfully received
- -EINTR means the thread has been told to stop
Previously nfsd would stop as the result of a signal as well as
following kthread_stop(). In that case the difference was useful: EINTR
means stop unconditionally. EAGAIN means stop if kthread_should_stop(),
continue otherwise.
Now threads only exit when kthread_should_stop() so we don't need the
distinction.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
All callers of svc_recv() go on to call svc_process() on success.
Simplify callers by having svc_recv() do that for them.
This loses one call to validate_process_creds() in nfsd. That was
debugging code added 14 years ago. I don't think we need to keep it.
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
The original implementation of nfsd used signals to stop threads during
shutdown.
In Linux 2.3.46pre5 nfsd gained the ability to shutdown threads
internally it if was asked to run "0" threads. After this user-space
transitioned to using "rpc.nfsd 0" to stop nfsd and sending signals to
threads was no longer an important part of the API.
In commit 3ebdbe5203 ("SUNRPC: discard svo_setup and rename
svc_set_num_threads_sync()") (v5.17-rc1~75^2~41) we finally removed the
use of signals for stopping threads, using kthread_stop() instead.
This patch makes the "obvious" next step and removes the ability to
signal nfsd threads - or any svc threads. nfsd stops allowing signals
and we don't check for their delivery any more.
This will allow for some simplification in later patches.
A change worth noting is in nfsd4_ssc_setup_dul(). There was previously
a signal_pending() check which would only succeed when the thread was
being shut down. It should really have tested kthread_should_stop() as
well. Now it just does the latter, not the former.
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
- Peter Xu has a series (mm/gup: Unify hugetlb, speed up thp") which
reduces the special-case code for handling hugetlb pages in GUP. It
also speeds up GUP handling of transparent hugepages.
- Peng Zhang provides some maple tree speedups ("Optimize the fast path
of mas_store()").
- Sergey Senozhatsky has improved te performance of zsmalloc during
compaction (zsmalloc: small compaction improvements").
- Domenico Cerasuolo has developed additional selftest code for zswap
("selftests: cgroup: add zswap test program").
- xu xin has doe some work on KSM's handling of zero pages. These
changes are mainly to enable the user to better understand the
effectiveness of KSM's treatment of zero pages ("ksm: support tracking
KSM-placed zero-pages").
- Jeff Xu has fixes the behaviour of memfd's
MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED sysctl ("mm/memfd: fix sysctl
MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED").
- David Howells has fixed an fscache optimization ("mm, netfs, fscache:
Stop read optimisation when folio removed from pagecache").
- Axel Rasmussen has given userfaultfd the ability to simulate memory
poisoning ("add UFFDIO_POISON to simulate memory poisoning with UFFD").
- Miaohe Lin has contributed some routine maintenance work on the
memory-failure code ("mm: memory-failure: remove unneeded PageHuge()
check").
- Peng Zhang has contributed some maintenance work on the maple tree
code ("Improve the validation for maple tree and some cleanup").
- Hugh Dickins has optimized the collapsing of shmem or file pages into
THPs ("mm: free retracted page table by RCU").
- Jiaqi Yan has a patch series which permits us to use the healthy
subpages within a hardware poisoned huge page for general purposes
("Improve hugetlbfs read on HWPOISON hugepages").
- Kemeng Shi has done some maintenance work on the pagetable-check code
("Remove unused parameters in page_table_check").
- More folioification work from Matthew Wilcox ("More filesystem folio
conversions for 6.6"), ("Followup folio conversions for zswap"). And
from ZhangPeng ("Convert several functions in page_io.c to use a
folio").
- page_ext cleanups from Kemeng Shi ("minor cleanups for page_ext").
- Baoquan He has converted some architectures to use the GENERIC_IOREMAP
ioremap()/iounmap() code ("mm: ioremap: Convert architectures to take
GENERIC_IOREMAP way").
- Anshuman Khandual has optimized arm64 tlb shootdown ("arm64: support
batched/deferred tlb shootdown during page reclamation/migration").
- Better maple tree lockdep checking from Liam Howlett ("More strict
maple tree lockdep"). Liam also developed some efficiency improvements
("Reduce preallocations for maple tree").
- Cleanup and optimization to the secondary IOMMU TLB invalidation, from
Alistair Popple ("Invalidate secondary IOMMU TLB on permission
upgrade").
- Ryan Roberts fixes some arm64 MM selftest issues ("selftests/mm fixes
for arm64").
- Kemeng Shi provides some maintenance work on the compaction code ("Two
minor cleanups for compaction").
- Some reduction in mmap_lock pressure from Matthew Wilcox ("Handle most
file-backed faults under the VMA lock").
- Aneesh Kumar contributes code to use the vmemmap optimization for DAX
on ppc64, under some circumstances ("Add support for DAX vmemmap
optimization for ppc64").
- page-ext cleanups from Kemeng Shi ("add page_ext_data to get client
data in page_ext"), ("minor cleanups to page_ext header").
- Some zswap cleanups from Johannes Weiner ("mm: zswap: three
cleanups").
- kmsan cleanups from ZhangPeng ("minor cleanups for kmsan").
- VMA handling cleanups from Kefeng Wang ("mm: convert to
vma_is_initial_heap/stack()").
- DAMON feature work from SeongJae Park ("mm/damon/sysfs-schemes:
implement DAMOS tried total bytes file"), ("Extend DAMOS filters for
address ranges and DAMON monitoring targets").
- Compaction work from Kemeng Shi ("Fixes and cleanups to compaction").
- Liam Howlett has improved the maple tree node replacement code
("maple_tree: Change replacement strategy").
- ZhangPeng has a general code cleanup - use the K() macro more widely
("cleanup with helper macro K()").
- Aneesh Kumar brings memmap-on-memory to ppc64 ("Add support for memmap
on memory feature on ppc64").
- pagealloc cleanups from Kemeng Shi ("Two minor cleanups for pcp list
in page_alloc"), ("Two minor cleanups for get pageblock migratetype").
- Vishal Moola introduces a memory descriptor for page table tracking,
"struct ptdesc" ("Split ptdesc from struct page").
- memfd selftest maintenance work from Aleksa Sarai ("memfd: cleanups
for vm.memfd_noexec").
- MM include file rationalization from Hugh Dickins ("arch: include
asm/cacheflush.h in asm/hugetlb.h").
- THP debug output fixes from Hugh Dickins ("mm,thp: fix sloppy text
output").
- kmemleak improvements from Xiaolei Wang ("mm/kmemleak: use
object_cache instead of kmemleak_initialized").
- More folio-related cleanups from Matthew Wilcox ("Remove _folio_dtor
and _folio_order").
- A VMA locking scalability improvement from Suren Baghdasaryan
("Per-VMA lock support for swap and userfaults").
- pagetable handling cleanups from Matthew Wilcox ("New page table range
API").
- A batch of swap/thp cleanups from David Hildenbrand ("mm/swap: stop
using page->private on tail pages for THP_SWAP + cleanups").
- Cleanups and speedups to the hugetlb fault handling from Matthew
Wilcox ("Change calling convention for ->huge_fault").
- Matthew Wilcox has also done some maintenance work on the MM subsystem
documentation ("Improve mm documentation").
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZO1JUQAKCRDdBJ7gKXxA
jrMwAP47r/fS8vAVT3zp/7fXmxaJYTK27CTAM881Gw1SDhFM/wEAv8o84mDenCg6
Nfio7afS1ncD+hPYT8947UnLxTgn+ww=
=Afws
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2023-08-28-18-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
- Some swap cleanups from Ma Wupeng ("fix WARN_ON in
add_to_avail_list")
- Peter Xu has a series (mm/gup: Unify hugetlb, speed up thp") which
reduces the special-case code for handling hugetlb pages in GUP. It
also speeds up GUP handling of transparent hugepages.
- Peng Zhang provides some maple tree speedups ("Optimize the fast path
of mas_store()").
- Sergey Senozhatsky has improved te performance of zsmalloc during
compaction (zsmalloc: small compaction improvements").
- Domenico Cerasuolo has developed additional selftest code for zswap
("selftests: cgroup: add zswap test program").
- xu xin has doe some work on KSM's handling of zero pages. These
changes are mainly to enable the user to better understand the
effectiveness of KSM's treatment of zero pages ("ksm: support
tracking KSM-placed zero-pages").
- Jeff Xu has fixes the behaviour of memfd's
MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED sysctl ("mm/memfd: fix sysctl
MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED").
- David Howells has fixed an fscache optimization ("mm, netfs, fscache:
Stop read optimisation when folio removed from pagecache").
- Axel Rasmussen has given userfaultfd the ability to simulate memory
poisoning ("add UFFDIO_POISON to simulate memory poisoning with
UFFD").
- Miaohe Lin has contributed some routine maintenance work on the
memory-failure code ("mm: memory-failure: remove unneeded PageHuge()
check").
- Peng Zhang has contributed some maintenance work on the maple tree
code ("Improve the validation for maple tree and some cleanup").
- Hugh Dickins has optimized the collapsing of shmem or file pages into
THPs ("mm: free retracted page table by RCU").
- Jiaqi Yan has a patch series which permits us to use the healthy
subpages within a hardware poisoned huge page for general purposes
("Improve hugetlbfs read on HWPOISON hugepages").
- Kemeng Shi has done some maintenance work on the pagetable-check code
("Remove unused parameters in page_table_check").
- More folioification work from Matthew Wilcox ("More filesystem folio
conversions for 6.6"), ("Followup folio conversions for zswap"). And
from ZhangPeng ("Convert several functions in page_io.c to use a
folio").
- page_ext cleanups from Kemeng Shi ("minor cleanups for page_ext").
- Baoquan He has converted some architectures to use the
GENERIC_IOREMAP ioremap()/iounmap() code ("mm: ioremap: Convert
architectures to take GENERIC_IOREMAP way").
- Anshuman Khandual has optimized arm64 tlb shootdown ("arm64: support
batched/deferred tlb shootdown during page reclamation/migration").
- Better maple tree lockdep checking from Liam Howlett ("More strict
maple tree lockdep"). Liam also developed some efficiency
improvements ("Reduce preallocations for maple tree").
- Cleanup and optimization to the secondary IOMMU TLB invalidation,
from Alistair Popple ("Invalidate secondary IOMMU TLB on permission
upgrade").
- Ryan Roberts fixes some arm64 MM selftest issues ("selftests/mm fixes
for arm64").
- Kemeng Shi provides some maintenance work on the compaction code
("Two minor cleanups for compaction").
- Some reduction in mmap_lock pressure from Matthew Wilcox ("Handle
most file-backed faults under the VMA lock").
- Aneesh Kumar contributes code to use the vmemmap optimization for DAX
on ppc64, under some circumstances ("Add support for DAX vmemmap
optimization for ppc64").
- page-ext cleanups from Kemeng Shi ("add page_ext_data to get client
data in page_ext"), ("minor cleanups to page_ext header").
- Some zswap cleanups from Johannes Weiner ("mm: zswap: three
cleanups").
- kmsan cleanups from ZhangPeng ("minor cleanups for kmsan").
- VMA handling cleanups from Kefeng Wang ("mm: convert to
vma_is_initial_heap/stack()").
- DAMON feature work from SeongJae Park ("mm/damon/sysfs-schemes:
implement DAMOS tried total bytes file"), ("Extend DAMOS filters for
address ranges and DAMON monitoring targets").
- Compaction work from Kemeng Shi ("Fixes and cleanups to compaction").
- Liam Howlett has improved the maple tree node replacement code
("maple_tree: Change replacement strategy").
- ZhangPeng has a general code cleanup - use the K() macro more widely
("cleanup with helper macro K()").
- Aneesh Kumar brings memmap-on-memory to ppc64 ("Add support for
memmap on memory feature on ppc64").
- pagealloc cleanups from Kemeng Shi ("Two minor cleanups for pcp list
in page_alloc"), ("Two minor cleanups for get pageblock
migratetype").
- Vishal Moola introduces a memory descriptor for page table tracking,
"struct ptdesc" ("Split ptdesc from struct page").
- memfd selftest maintenance work from Aleksa Sarai ("memfd: cleanups
for vm.memfd_noexec").
- MM include file rationalization from Hugh Dickins ("arch: include
asm/cacheflush.h in asm/hugetlb.h").
- THP debug output fixes from Hugh Dickins ("mm,thp: fix sloppy text
output").
- kmemleak improvements from Xiaolei Wang ("mm/kmemleak: use
object_cache instead of kmemleak_initialized").
- More folio-related cleanups from Matthew Wilcox ("Remove _folio_dtor
and _folio_order").
- A VMA locking scalability improvement from Suren Baghdasaryan
("Per-VMA lock support for swap and userfaults").
- pagetable handling cleanups from Matthew Wilcox ("New page table
range API").
- A batch of swap/thp cleanups from David Hildenbrand ("mm/swap: stop
using page->private on tail pages for THP_SWAP + cleanups").
- Cleanups and speedups to the hugetlb fault handling from Matthew
Wilcox ("Change calling convention for ->huge_fault").
- Matthew Wilcox has also done some maintenance work on the MM
subsystem documentation ("Improve mm documentation").
* tag 'mm-stable-2023-08-28-18-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (489 commits)
maple_tree: shrink struct maple_tree
maple_tree: clean up mas_wr_append()
secretmem: convert page_is_secretmem() to folio_is_secretmem()
nios2: fix flush_dcache_page() for usage from irq context
hugetlb: add documentation for vma_kernel_pagesize()
mm: add orphaned kernel-doc to the rst files.
mm: fix clean_record_shared_mapping_range kernel-doc
mm: fix get_mctgt_type() kernel-doc
mm: fix kernel-doc warning from tlb_flush_rmaps()
mm: remove enum page_entry_size
mm: allow ->huge_fault() to be called without the mmap_lock held
mm: move PMD_ORDER to pgtable.h
mm: remove checks for pte_index
memcg: remove duplication detection for mem_cgroup_uncharge_swap
mm/huge_memory: work on folio->swap instead of page->private when splitting folio
mm/swap: inline folio_set_swap_entry() and folio_swap_entry()
mm/swap: use dedicated entry for swap in folio
mm/swap: stop using page->private on tail pages for THP_SWAP
selftests/mm: fix WARNING comparing pointer to 0
selftests: cgroup: fix test_kmem_memcg_deletion kernel mem check
...
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZOXTxQAKCRCRxhvAZXjc
okaVAP94WAlItvDRt/z2Wtzf0+RqPZeTXEdGTxua8+RxqCyYIQD+OO5nRfKQPHlV
AqqGJMKItQMSMIYgB5ftqVhNWZfnHgM=
=pSEW
-----END PGP SIGNATURE-----
Merge tag 'v6.6-vfs.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull misc vfs updates from Christian Brauner:
"This contains the usual miscellaneous features, cleanups, and fixes
for vfs and individual filesystems.
Features:
- Block mode changes on symlinks and rectify our broken semantics
- Report file modifications via fsnotify() for splice
- Allow specifying an explicit timeout for the "rootwait" kernel
command line option. This allows to timeout and reboot instead of
always waiting indefinitely for the root device to show up
- Use synchronous fput for the close system call
Cleanups:
- Get rid of open-coded lockdep workarounds for async io submitters
and replace it all with a single consolidated helper
- Simplify epoll allocation helper
- Convert simple_write_begin and simple_write_end to use a folio
- Convert page_cache_pipe_buf_confirm() to use a folio
- Simplify __range_close to avoid pointless locking
- Disable per-cpu buffer head cache for isolated cpus
- Port ecryptfs to kmap_local_page() api
- Remove redundant initialization of pointer buf in pipe code
- Unexport the d_genocide() function which is only used within core
vfs
- Replace printk(KERN_ERR) and WARN_ON() with WARN()
Fixes:
- Fix various kernel-doc issues
- Fix refcount underflow for eventfds when used as EFD_SEMAPHORE
- Fix a mainly theoretical issue in devpts
- Check the return value of __getblk() in reiserfs
- Fix a racy assert in i_readcount_dec
- Fix integer conversion issues in various functions
- Fix LSM security context handling during automounts that prevented
NFS superblock sharing"
* tag 'v6.6-vfs.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (39 commits)
cachefiles: use kiocb_{start,end}_write() helpers
ovl: use kiocb_{start,end}_write() helpers
aio: use kiocb_{start,end}_write() helpers
io_uring: use kiocb_{start,end}_write() helpers
fs: create kiocb_{start,end}_write() helpers
fs: add kerneldoc to file_{start,end}_write() helpers
io_uring: rename kiocb_end_write() local helper
splice: Convert page_cache_pipe_buf_confirm() to use a folio
libfs: Convert simple_write_begin and simple_write_end to use a folio
fs/dcache: Replace printk and WARN_ON by WARN
fs/pipe: remove redundant initialization of pointer buf
fs: Fix kernel-doc warnings
devpts: Fix kernel-doc warnings
doc: idmappings: fix an error and rephrase a paragraph
init: Add support for rootwait timeout parameter
vfs: fix up the assert in i_readcount_dec
fs: Fix one kernel-doc comment
docs: filesystems: idmappings: clarify from where idmappings are taken
fs/buffer.c: disable per-CPU buffer_head cache for isolated CPUs
vfs, security: Fix automount superblock LSM init problem, preventing NFS sb sharing
...
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZOXTKAAKCRCRxhvAZXjc
oifJAQCzi/p+AdQu8LA/0XvR7fTwaq64ZDCibU4BISuLGT2kEgEAuGbuoFZa0rs2
XYD/s4+gi64p9Z01MmXm2XO1pu3GPg0=
=eJz5
-----END PGP SIGNATURE-----
Merge tag 'v6.6-vfs.ctime' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs timestamp updates from Christian Brauner:
"This adds VFS support for multi-grain timestamps and converts tmpfs,
xfs, ext4, and btrfs to use them. This carries acks from all relevant
filesystems.
The VFS always uses coarse-grained timestamps when updating the ctime
and mtime after a change. This has the benefit of allowing filesystems
to optimize away a lot of metadata updates, down to around 1 per
jiffy, even when a file is under heavy writes.
Unfortunately, this has always been an issue when we're exporting via
NFSv3, which relies on timestamps to validate caches. A lot of changes
can happen in a jiffy, so timestamps aren't sufficient to help the
client decide to invalidate the cache.
Even with NFSv4, a lot of exported filesystems don't properly support
a change attribute and are subject to the same problems with timestamp
granularity. Other applications have similar issues with timestamps
(e.g., backup applications).
If we were to always use fine-grained timestamps, that would improve
the situation, but that becomes rather expensive, as the underlying
filesystem would have to log a lot more metadata updates.
This introduces fine-grained timestamps that are used when they are
actively queried.
This uses the 31st bit of the ctime tv_nsec field to indicate that
something has queried the inode for the mtime or ctime. When this flag
is set, on the next mtime or ctime update, the kernel will fetch a
fine-grained timestamp instead of the usual coarse-grained one.
As POSIX generally mandates that when the mtime changes, the ctime
must also change the kernel always stores normalized ctime values, so
only the first 30 bits of the tv_nsec field are ever used.
Filesytems can opt into this behavior by setting the FS_MGTIME flag in
the fstype. Filesystems that don't set this flag will continue to use
coarse-grained timestamps.
Various preparatory changes, fixes and cleanups are included:
- Fixup all relevant places where POSIX requires updating ctime
together with mtime. This is a wide-range of places and all
maintainers provided necessary Acks.
- Add new accessors for inode->i_ctime directly and change all
callers to rely on them. Plain accesses to inode->i_ctime are now
gone and it is accordingly rename to inode->__i_ctime and commented
as requiring accessors.
- Extend generic_fillattr() to pass in a request mask mirroring in a
sense the statx() uapi. This allows callers to pass in a request
mask to only get a subset of attributes filled in.
- Rework timestamp updates so it's possible to drop the @now
parameter the update_time() inode operation and associated helpers.
- Add inode_update_timestamps() and convert all filesystems to it
removing a bunch of open-coding"
* tag 'v6.6-vfs.ctime' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (107 commits)
btrfs: convert to multigrain timestamps
ext4: switch to multigrain timestamps
xfs: switch to multigrain timestamps
tmpfs: add support for multigrain timestamps
fs: add infrastructure for multigrain timestamps
fs: drop the timespec64 argument from update_time
xfs: have xfs_vn_update_time gets its own timestamp
fat: make fat_update_time get its own timestamp
fat: remove i_version handling from fat_update_time
ubifs: have ubifs_update_time use inode_update_timestamps
btrfs: have it use inode_update_timestamps
fs: drop the timespec64 arg from generic_update_time
fs: pass the request_mask to generic_fillattr
fs: remove silly warning from current_time
gfs2: fix timestamp handling on quota inodes
fs: rename i_ctime field to __i_ctime
selinux: convert to ctime accessor functions
security: convert to ctime accessor functions
apparmor: convert to ctime accessor functions
sunrpc: convert to ctime accessor functions
...
After receiving the location(s) of the DS server(s) in the
GETDEVINCEINFO, create the request for the clientid to such
server and indicate that the client is connecting to a DS.
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Ensure that the connect timeout for the pNFS flexfiles driver is of the
same order as the I/O timeout, so that we can fail over quickly when
trying to read from a data server that is down.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
We must ensure that the subrequests are joined back into the head before
we can retransmit a request. If the head was not on the commit lists,
because the server wrote it synchronously, we still need to add it back
to the retransmission list.
Add a call that mirrors the effect of nfs_cancel_remove_inode() for
O_DIRECT.
Fixes: ed5d588fe4 ("NFS: Try to join page groups before an O_DIRECT retransmission")
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
When a directory contains 17 files (except . and ..), nfs client sends
a redundant readdir request after get eof.
A simple reproduce,
At NFS server, create a directory with 17 files under exported directory.
# mkdir test
# cd test
# for i in {0..16} ; do touch $i; done
At NFS client, no matter mounting through nfsv3 or nfsv4,
does ls (or ll) at the created test directory.
A tshark output likes following (for nfsv4),
# tshark -i eth0 tcp port 2049 -Tfields -e ip.src -e ip.dst -e nfs -e nfs.cookie4
srcip dstip SEQUENCE, PUTFH, READDIR 0
dstip srcip SEQUENCE PUTFH READDIR 909539109313539306,2108391201987888856,2305312124304486544,2566335452463141496,2978225129081509984,4263037479923412583,4304697173036510679,4666703455469210097,4759208201298769007,4776701232145978803,5338408478512081262,5949498658935544804,5971526429894832903,6294060338267709855,6528840566229532529,8600463293536422524,9223372036854775807
srcip dstip
srcip dstip SEQUENCE, PUTFH, READDIR 9223372036854775807
dstip srcip SEQUENCE PUTFH READDIR
The READDIR with cookie 9223372036854775807(0x7FFFFFFFFFFFFFFF) is redundant.
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
This allocation should use the passed in GFP_ flags instead of
GFP_KERNEL. One places where this matters is in filelayout_pg_init_write()
which uses GFP_NOFS as the allocation flags.
Fixes: 5c83746a0c ("pnfs/blocklayout: in-kernel GETDEVICEINFO XDR parsing")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
The following checkpatch errors are removed:
ERROR: "foo * bar" should be "foo *bar"
"foo * bar" should be "foo *bar"
Signed-off-by: ZhiHu <huzhi001@208suo.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
It is an almost improbable error case but when page allocating loop in
nfs4_get_device_info() fails then we should only free the already
allocated pages, as __free_page() can't deal with NULL arguments.
Found by Linux Verification Center (linuxtesting.org).
Cc: stable@vger.kernel.org
Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
module.h, clnt.h, addr.h and dns_resolve.h is always included whether
CONFIG_NFS_USE_KERNEL_DNS is set or not and their order does not seems
to matter.
Move them outside the ifdef to simplify code and avoid checkincludes
message.
Signed-off-by: GUO Zihua <guozihua@huawei.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Now that the remaining issues have been worked out, including some
unexpected 32 bit issues, we can safely enable the feature by default.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
I found that the read code might send multiple requests using the same
nfs_pgio_header, but nfs4_proc_read_setup() is only called once. This is
how we ended up occasionally double-freeing the scratch buffer, but also
means we set a NULL pointer but non-zero length to the xdr scratch
buffer. This results in an oops the first time decoding needs to copy
something to scratch, which frequently happens when decoding READ_PLUS
hole segments.
I fix this by moving scratch handling into the pageio read code. I
provide a function to allocate scratch space for decoding read replies,
and free the scratch buffer when the nfs_pgio_header is freed.
Fixes: fbd2a05f29 (NFSv4.2: Rework scratch handling for READ_PLUS)
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
I bump the decode_read_plus_maxsz to account for hole segments, but I
need to subtract out this increase when calling
rpc_prepare_reply_pages() so the common case of single data segment
replies can be directly placed into the xdr pages without needing to be
shifted around.
Reported-by: Chuck Lever <chuck.lever@oracle.com>
Fixes: d3b00a802c ("NFS: Replace the READ_PLUS decoding code")
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Smatch reports:
fs/nfs/nfs42xdr.c:1131 decode_read_plus() warn: missing error code? 'status'
Which Dan suggests to fix by doing a hardcoded "return 0" from the
"if (segments == 0)" check.
Additionally, smatch reports that the "status = -EIO" assignment is not
used. This patch addresses both these issues.
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <error27@gmail.com>
Closes: https://lore.kernel.org/r/202305222209.6l5VM2lL-lkp@intel.com/
Fixes: d3b00a802c ("NFS: Replace the READ_PLUS decoding code")
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Highlights include:
Stable fixes
- NFS: Fix a use after free in nfs_direct_join_group()
Bugfixes
- NFS: Fix a sysfs server name memory leak
- NFS: Fix a lock recovery hang in NFSv4.0
- NFS: Fix page free in the error path for nfs42_proc_getxattr
- NFS: Fix page free in the error path for __nfs4_get_acl_uncached
- SUNRPC/rdma: Fix receive buffer dma-mapping after a server disconnect
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEESQctxSBg8JpV8KqEZwvnipYKAPIFAmTjgEEACgkQZwvnipYK
APItFA//WzGcKbujlMXpiRdvUg6k6CfG/ikBRB1UwQEyZjK/tVZ96qt6UuHGNMbz
b8GaGls7NRYJKezAcMSW9QMMPYVyG0PLwxOW6BPwsZS61Zn6HMeM1YRboaZEid7f
JrUNhbUXHl6bVWrBNEtcr3IN/5ERU4sGCAa4A3uWdNxGyffD/avrK06/bfmE/SJi
+7LVPp0M9rM5X5Z1c407TbWfg+L81Q9t0tTz7II3Ba9i2BzQ0uhQhyVUQAGF767u
Vua4XWTRoqG1es+tA4iuwZ3KtaqXoaMRDWPLGTkmBrY+pAo+u4IPzY5LCwfUu6kI
vttkZU5b0b05+UomJ1d+Muzr8uEjRmBhIHZsP6lgVVmuNzqkDb0gCGkfix87J+RO
0QmDZ9D0ftJxsb8fSdp8iy8NqmqJ6X4FhsylRtANEuCrf8+zrkUlBJi47CCwpYDD
8gq6SoTfA8MmiSgzrBuYkJe2HSx7c2csDl3xp5KrJX2IHODjbzlHC05fNadTWc6W
0jQvq1cJ2xBYDNSxkG0Trsd3lTTao3rZC4M7imVVjTTOHS8X1LNCLkbZ7LVnA8rn
0F+lp/h1qs/daXSp0aMG5wyvZNkx5rsJ23o+InNCjiCh3cDvoi9mg6DN5bQK8Foy
Iqd2MTgxrMaF/FUbdGLdnFX4GQkgFPng8TpdX8sqqm1JHUprpqg=
=nd41
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-6.5-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client fixes from Trond Myklebust:
- fix a use after free in nfs_direct_join_group() (Cc: stable)
- fix sysfs server name memory leak
- fix lock recovery hang in NFSv4.0
- fix page free in the error path for nfs42_proc_getxattr() and
__nfs4_get_acl_uncached()
- SUNRPC/rdma: fix receive buffer dma-mapping after a server disconnect
* tag 'nfs-for-6.5-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
xprtrdma: Remap Receive buffers after a reconnect
NFSv4: fix out path in __nfs4_get_acl_uncached
NFSv4.2: fix error handling in nfs42_proc_getxattr
NFS: Fix sysfs server name memory leak
NFS: Fix a use after free in nfs_direct_join_group()
NFSv4: Fix dropped lock for racing OPEN and delegation return
Another highly rare error case when a page allocating loop (inside
__nfs4_get_acl_uncached, this time) is not properly unwound on error.
Since pages array is allocated being uninitialized, need to free only
lower array indices. NULL checks were useful before commit 62a1573fcf
("NFSv4 fix acl retrieval over krb5i/krb5p mounts") when the array had
been initialized to zero on stack.
Found by Linux Verification Center (linuxtesting.org).
Fixes: 62a1573fcf ("NFSv4 fix acl retrieval over krb5i/krb5p mounts")
Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
There is a slight issue with error handling code inside
nfs42_proc_getxattr(). If page allocating loop fails then we free the
failing page array element which is NULL but __free_page() can't deal with
NULL args.
Found by Linux Verification Center (linuxtesting.org).
Fixes: a1f26739cc ("NFSv4.2: improve page handling for GETXATTR")
Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Free the formatted server index string after it has been duplicated by
kobject_rename().
Fixes: 1c7251187d ("NFS: add superblock sysfs entries")
Reported-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Fscache has an optimisation by which reads from the cache are skipped
until we know that (a) there's data there to be read and (b) that data
isn't entirely covered by pages resident in the netfs pagecache. This is
done with two flags manipulated by fscache_note_page_release():
if (...
test_bit(FSCACHE_COOKIE_HAVE_DATA, &cookie->flags) &&
test_bit(FSCACHE_COOKIE_NO_DATA_TO_READ, &cookie->flags))
clear_bit(FSCACHE_COOKIE_NO_DATA_TO_READ, &cookie->flags);
where the NO_DATA_TO_READ flag causes cachefiles_prepare_read() to
indicate that netfslib should download from the server or clear the page
instead.
The fscache_note_page_release() function is intended to be called from
->releasepage() - but that only gets called if PG_private or PG_private_2
is set - and currently the former is at the discretion of the network
filesystem and the latter is only set whilst a page is being written to
the cache, so sometimes we miss clearing the optimisation.
Fix this by following Willy's suggestion[1] and adding an address_space
flag, AS_RELEASE_ALWAYS, that causes filemap_release_folio() to always call
->release_folio() if it's set, even if PG_private or PG_private_2 aren't
set.
Note that this would require folio_test_private() and page_has_private() to
become more complicated. To avoid that, in the places[*] where these are
used to conditionalise calls to filemap_release_folio() and
try_to_release_page(), the tests are removed the those functions just
jumped to unconditionally and the test is performed there.
[*] There are some exceptions in vmscan.c where the check guards more than
just a call to the releaser. I've added a function, folio_needs_release()
to wrap all the checks for that.
AS_RELEASE_ALWAYS should be set if a non-NULL cookie is obtained from
fscache and cleared in ->evict_inode() before truncate_inode_pages_final()
is called.
Additionally, the FSCACHE_COOKIE_NO_DATA_TO_READ flag needs to be cleared
and the optimisation cancelled if a cachefiles object already contains data
when we open it.
[dwysocha@redhat.com: call folio_mapping() inside folio_needs_release()]
Link: 902c990e31
Link: https://lkml.kernel.org/r/20230628104852.3391651-3-dhowells@redhat.com
Fixes: 1f67e6d0b1 ("fscache: Provide a function to note the release of a page")
Fixes: 047487c947 ("cachefiles: Implement the I/O routines")
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Reported-by: Rohith Surabattula <rohiths.msft@gmail.com>
Suggested-by: Matthew Wilcox <willy@infradead.org>
Tested-by: SeongJae Park <sj@kernel.org>
Cc: Daire Byrne <daire.byrne@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Steve French <sfrench@samba.org>
Cc: Shyam Prasad N <nspmangalore@gmail.com>
Cc: Rohith Surabattula <rohiths.msft@gmail.com>
Cc: Dave Wysochanski <dwysocha@redhat.com>
Cc: Dominique Martinet <asmadeus@codewreck.org>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Jingbo Xu <jefflexu@linux.alibaba.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Be more careful when tearing down the subrequests of an O_DIRECT write
as part of a retransmission.
Reported-by: Chris Mason <clm@fb.com>
Fixes: ed5d588fe4 ("NFS: Try to join page groups before an O_DIRECT retransmission")
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
generic_fillattr just fills in the entire stat struct indiscriminately
today, copying data from the inode. There is at least one attribute
(STATX_CHANGE_COOKIE) that can have side effects when it is reported,
and we're looking at adding more with the addition of multigrain
timestamps.
Add a request_mask argument to generic_fillattr and have most callers
just pass in the value that is passed to getattr. Have other callers
(e.g. ksmbd) just pass in STATX_BASIC_STATS. Also move the setting of
STATX_CHANGE_COOKIE into generic_fillattr.
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: "Paulo Alcantara (SUSE)" <pc@manguebit.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Message-Id: <20230807-mgctime-v7-2-d1dec143a704@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
In later patches, we're going to change how the inode's ctime field is
used. Switch to using accessor functions instead of raw accesses of
inode->i_ctime.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Message-Id: <20230705190309.579783-55-jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
The interface for fcntl expects the argument passed for the command
F_SETLEASE to be of type int. The current code wrongly treats it as
a long. In order to avoid access to undefined bits, we should explicitly
cast the argument to int.
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Anna Schumaker <anna@kernel.org>
Cc: Kevin Brodsky <Kevin.Brodsky@arm.com>
Cc: Vincenzo Frascino <Vincenzo.Frascino@arm.com>
Cc: Szabolcs Nagy <Szabolcs.Nagy@arm.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: David Laight <David.Laight@ACULAB.com>
Cc: Mark Rutland <Mark.Rutland@arm.com>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-cifs@vger.kernel.org
Cc: linux-nfs@vger.kernel.org
Cc: linux-morello@op-lists.linaro.org
Signed-off-by: Luca Vizzarro <Luca.Vizzarro@arm.com>
Message-Id: <20230414152459.816046-3-Luca.Vizzarro@arm.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Commmit f5ea16137a ("NFSv4: Retry LOCK on OLD_STATEID during delegation
return") attempted to solve this problem by using nfs4's generic async error
handling, but introduced a regression where v4.0 lock recovery would hang.
The additional complexity introduced by overloading that error handling is
not necessary for this case. This patch expects that commit to be
reverted.
The problem as originally explained in the above commit is:
There's a small window where a LOCK sent during a delegation return can
race with another OPEN on client, but the open stateid has not yet been
updated. In this case, the client doesn't handle the OLD_STATEID error
from the server and will lose this lock, emitting:
"NFS: nfs4_handle_delegation_recall_error: unhandled error -10024".
Fix this by using the old_stateid refresh helpers if the server replies
with OLD_STATEID.
Suggested-by: Trond Myklebust <trondmy@hammerspace.com>
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Highlights include:
Stable fixes and other bugfixes:
- nfs: don't report STATX_BTIME in ->getattr
- Revert "NFSv4: Retry LOCK on OLD_STATEID during delegation return"
since it breaks NFSv4 state recovery.
- NFSv4.1: freeze the session table upon receiving NFS4ERR_BADSESSION
- Fix the NFSv4.2 xattr cache shrinker_id
- Force a ctime update after a NFSv4.2 SETXATTR call
Features and cleanups:
- NFS and RPC over TLS client code from Chuck Lever.
- Support for use of abstract unix socket addresses with the rpcbind
daemon.
- Sysfs API to allow shutdown of the kernel RPC client and prevent
umount() hangs if the server is known to be permanently down.
- XDR cleanups from Anna.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEESQctxSBg8JpV8KqEZwvnipYKAPIFAmSgmmUACgkQZwvnipYK
APJwUA/+J6uEjJFoigSDU5dpCwQr4pHZPgUn3T2heplcyalGMxLo1VjDTVuFXb+a
NZqdUZF2ePmYqss/UYzJC7R6/z9OanVBcpiGqp66foJt9ncs9BSm5AzdV5Gvi4VX
6SrBM98nSqvD47l45LQ90bqIdR6WgMP9OiDC257PzYnaMZJcB0xObD4HWXh1zbIz
3xynJTSQnRGbv9I5EjJJGVIHDWLfSKY61NUXjrUcmMZ2L39ITNy0CRi8sIdj3oY/
A2Iz52IHtAhE77+EetThPskbTLa07raQSWRo3X6XJqCKiJIXa5giNDoG/zLq6sOT
hi1AV7Tdxaed2EYibeRWzsSVQIClBb7T/hdro5dWs5u/bxM6Bt+yY90ZWUMZVOAQ
/kGTYQXhI31vUgRaEN+2xci0wKDy9wqyAWcD8u8Gz01KaK09sfJSIvvYn+srSeaz
wEUQHZCdBGtNFVP2q18q4x8BN27uObh1DdMvNhrxrA7YraXSQvL/rIIsD0jmDInb
6olMm9g9nZSHgq62+CYs2v7J/AJKQzE7PsWrTMJDX1rso+/Lyc6x7oUGxv2IFt5H
VZVZNdstKeNzfcnNKsGG2ZbufhasKHqiHJxJTdNOuOi0YBi+ixtJVRpupId3+6aZ
ysng0IfzqiWSuiq5Axjreva+480IDSMW+7cqcw5urKEfYY5uVcc=
=leGh
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-6.5-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client updates from Trond Myklebust:
"Stable fixes and other bugfixes:
- nfs: don't report STATX_BTIME in ->getattr
- Revert 'NFSv4: Retry LOCK on OLD_STATEID during delegation return'
since it breaks NFSv4 state recovery.
- NFSv4.1: freeze the session table upon receiving NFS4ERR_BADSESSION
- Fix the NFSv4.2 xattr cache shrinker_id
- Force a ctime update after a NFSv4.2 SETXATTR call
Features and cleanups:
- NFS and RPC over TLS client code from Chuck Lever
- Support for use of abstract unix socket addresses with the rpcbind
daemon
- Sysfs API to allow shutdown of the kernel RPC client and prevent
umount() hangs if the server is known to be permanently down
- XDR cleanups from Anna"
* tag 'nfs-for-6.5-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (33 commits)
Revert "NFSv4: Retry LOCK on OLD_STATEID during delegation return"
NFS: Don't cleanup sysfs superblock entry if uninitialized
nfs: don't report STATX_BTIME in ->getattr
NFSv4.1: freeze the session table upon receiving NFS4ERR_BADSESSION
NFSv4.2: fix wrong shrinker_id
NFSv4: Clean up some shutdown loops
NFS: Cancel all existing RPC tasks when shutdown
NFS: add sysfs shutdown knob
NFS: add a sysfs link to the acl rpc_client
NFS: add a sysfs link to the lockd rpc_client
NFS: Add sysfs links to sunrpc clients for nfs_clients
NFS: add superblock sysfs entries
NFS: Make all of /sys/fs/nfs network-namespace unique
NFS: Open-code the nfs_kset kset_create_and_add()
NFS: rename nfs_client_kobj to nfs_net_kobj
NFS: rename nfs_client_kset to nfs_kset
NFS: Add an "xprtsec=" NFS mount option
NFS: Have struct nfs_client carry a TLS policy field
SUNRPC: Add a TCP-with-TLS RPC transport class
SUNRPC: Capture CMSG metadata on client-side receive
...
Olga Kornievskaia reports that this patch breaks NFSv4.0 state recovery.
It also introduces additional complexity in the error paths for cases not
related to the original problem. Let's revert it for now, and address the
original problem in another manner.
This reverts commit f5ea16137a.
Fixes: f5ea16137a ("NFSv4: Retry LOCK on OLD_STATEID during delegation return")
Reported-by: Kornievskaia, Olga <Olga.Kornievskaia@netapp.com>
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Its possible to end up in nfs_free_server() before the server's superblock
sysfs entry has been initialized, in which case calling kobject_put() will
emit a WARNING. Check if the kobject has been initialized before cleaning
it up.
Fixes: 1c7251187d ("NFS: add superblock sysfs entries")
Reported-by: Nathan Chancellor <nathan@kernel.org>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
- Yosry has also eliminated cgroup's atomic rstat flushing.
- Nhat Pham adds the new cachestat() syscall. It provides userspace
with the ability to query pagecache status - a similar concept to
mincore() but more powerful and with improved usability.
- Mel Gorman provides more optimizations for compaction, reducing the
prevalence of page rescanning.
- Lorenzo Stoakes has done some maintanance work on the get_user_pages()
interface.
- Liam Howlett continues with cleanups and maintenance work to the maple
tree code. Peng Zhang also does some work on maple tree.
- Johannes Weiner has done some cleanup work on the compaction code.
- David Hildenbrand has contributed additional selftests for
get_user_pages().
- Thomas Gleixner has contributed some maintenance and optimization work
for the vmalloc code.
- Baolin Wang has provided some compaction cleanups,
- SeongJae Park continues maintenance work on the DAMON code.
- Huang Ying has done some maintenance on the swap code's usage of
device refcounting.
- Christoph Hellwig has some cleanups for the filemap/directio code.
- Ryan Roberts provides two patch series which yield some
rationalization of the kernel's access to pte entries - use the provided
APIs rather than open-coding accesses.
- Lorenzo Stoakes has some fixes to the interaction between pagecache
and directio access to file mappings.
- John Hubbard has a series of fixes to the MM selftesting code.
- ZhangPeng continues the folio conversion campaign.
- Hugh Dickins has been working on the pagetable handling code, mainly
with a view to reducing the load on the mmap_lock.
- Catalin Marinas has reduced the arm64 kmalloc() minimum alignment from
128 to 8.
- Domenico Cerasuolo has improved the zswap reclaim mechanism by
reorganizing the LRU management.
- Matthew Wilcox provides some fixups to make gfs2 work better with the
buffer_head code.
- Vishal Moola also has done some folio conversion work.
- Matthew Wilcox has removed the remnants of the pagevec code - their
functionality is migrated over to struct folio_batch.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZJejewAKCRDdBJ7gKXxA
joggAPwKMfT9lvDBEUnJagY7dbDPky1cSYZdJKxxM2cApGa42gEA6Cl8HRAWqSOh
J0qXCzqaaN8+BuEyLGDVPaXur9KirwY=
=B7yQ
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2023-06-24-19-15' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull mm updates from Andrew Morton:
- Yosry Ahmed brought back some cgroup v1 stats in OOM logs
- Yosry has also eliminated cgroup's atomic rstat flushing
- Nhat Pham adds the new cachestat() syscall. It provides userspace
with the ability to query pagecache status - a similar concept to
mincore() but more powerful and with improved usability
- Mel Gorman provides more optimizations for compaction, reducing the
prevalence of page rescanning
- Lorenzo Stoakes has done some maintanance work on the
get_user_pages() interface
- Liam Howlett continues with cleanups and maintenance work to the
maple tree code. Peng Zhang also does some work on maple tree
- Johannes Weiner has done some cleanup work on the compaction code
- David Hildenbrand has contributed additional selftests for
get_user_pages()
- Thomas Gleixner has contributed some maintenance and optimization
work for the vmalloc code
- Baolin Wang has provided some compaction cleanups,
- SeongJae Park continues maintenance work on the DAMON code
- Huang Ying has done some maintenance on the swap code's usage of
device refcounting
- Christoph Hellwig has some cleanups for the filemap/directio code
- Ryan Roberts provides two patch series which yield some
rationalization of the kernel's access to pte entries - use the
provided APIs rather than open-coding accesses
- Lorenzo Stoakes has some fixes to the interaction between pagecache
and directio access to file mappings
- John Hubbard has a series of fixes to the MM selftesting code
- ZhangPeng continues the folio conversion campaign
- Hugh Dickins has been working on the pagetable handling code, mainly
with a view to reducing the load on the mmap_lock
- Catalin Marinas has reduced the arm64 kmalloc() minimum alignment
from 128 to 8
- Domenico Cerasuolo has improved the zswap reclaim mechanism by
reorganizing the LRU management
- Matthew Wilcox provides some fixups to make gfs2 work better with the
buffer_head code
- Vishal Moola also has done some folio conversion work
- Matthew Wilcox has removed the remnants of the pagevec code - their
functionality is migrated over to struct folio_batch
* tag 'mm-stable-2023-06-24-19-15' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (380 commits)
mm/hugetlb: remove hugetlb_set_page_subpool()
mm: nommu: correct the range of mmap_sem_read_lock in task_mem()
hugetlb: revert use of page_cache_next_miss()
Revert "page cache: fix page_cache_next/prev_miss off by one"
mm/vmscan: fix root proactive reclaim unthrottling unbalanced node
mm: memcg: rename and document global_reclaim()
mm: kill [add|del]_page_to_lru_list()
mm: compaction: convert to use a folio in isolate_migratepages_block()
mm: zswap: fix double invalidate with exclusive loads
mm: remove unnecessary pagevec includes
mm: remove references to pagevec
mm: rename invalidate_mapping_pagevec to mapping_try_invalidate
mm: remove struct pagevec
net: convert sunrpc from pagevec to folio_batch
i915: convert i915_gpu_error to use a folio_batch
pagevec: rename fbatch_count()
mm: remove check_move_unevictable_pages()
drm: convert drm_gem_put_pages() to use a folio_batch
i915: convert shmem_sg_free_table() to use a folio_batch
scatterlist: add sg_set_folio()
...
- Fix KMSAN vs FORTIFY in strlcpy/strlcat (Alexander Potapenko)
- Convert strreplace() to return string start (Andy Shevchenko)
- Flexible array conversions (Arnd Bergmann, Wyes Karny, Kees Cook)
- Add missing function prototypes seen with W=1 (Arnd Bergmann)
- Fix strscpy() kerndoc typo (Arne Welzel)
- Replace strlcpy() with strscpy() across many subsystems which were
either Acked by respective maintainers or were trivial changes that
went ignored for multiple weeks (Azeem Shaikh)
- Remove unneeded cc-option test for UBSAN_TRAP (Nick Desaulniers)
- Add KUnit tests for strcat()-family
- Enable KUnit tests of FORTIFY wrappers under UML
- Add more complete FORTIFY protections for strlcat()
- Add missed disabling of FORTIFY for all arch purgatories.
- Enable -fstrict-flex-arrays=3 globally
- Tightening UBSAN_BOUNDS when using GCC
- Improve checkpatch to check for strcpy, strncpy, and fake flex arrays
- Improve use of const variables in FORTIFY
- Add requested struct_size_t() helper for types not pointers
- Add __counted_by macro for annotating flexible array size members
-----BEGIN PGP SIGNATURE-----
iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAmSbftQWHGtlZXNjb29r
QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJj0MD/9X9jzJzCmsAU+yNldeoAzC84Sk
GVU3RBxGcTNysL1gZXynkIgigw7DWc4htMGeSABHHwQRVP65JCH1Kw/VqIkyumbx
9LdX6IklMJb4pRT4PVU3azebV4eNmSjlur2UxMeW54Czm91/6I8RHbJOyAPnOUmo
2oomGdP/hpEHtKR7hgy8Axc6w5ySwQixh2V5sVZG3VbvCS5WKTmTXbs6puuRT5hz
iHt7v+7VtEg/Qf1W7J2oxfoghvVBsaRrSLrExWT/oZYh1ZxM7DsCAAoG/IsDgHGA
9LBXiRECgAFThbHVxLvvKZQMXdVk0i8iXLX43XMKC0wTA+NTyH7wlcQQ4RWNMuo8
sfA9Qm9gMArXaf64aymr3Uwn20Zan0391HdlbhOJZAE6v3PPJbleUnM58AzD2d3r
5Lz6AIFBxDImy+3f9iDWgacCT5/PkeiXTHzk9QnKhJyKKtRA58XJxj4q2+rPnGJP
n4haXqoxD5FJbxdXiGKk31RS0U5HBug7wkOcUrTqDHUbc/QNU2b7dxTKUx+zYtCU
uV5emPzpF4H4z+91WpO47n9gkMAfwV0lt9S2dwS8pxsgqctbmIan+Jgip7rsqZ2G
OgLXBsb43eEs+6WgO8tVt/ZHYj9ivGMdrcNcsIfikzNs/xweUJ53k2xSEn2xEa5J
cwANDmkL6QQK7yfeeg==
=s0j1
-----END PGP SIGNATURE-----
Merge tag 'hardening-v6.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull hardening updates from Kees Cook:
"There are three areas of note:
A bunch of strlcpy()->strscpy() conversions ended up living in my tree
since they were either Acked by maintainers for me to carry, or got
ignored for multiple weeks (and were trivial changes).
The compiler option '-fstrict-flex-arrays=3' has been enabled
globally, and has been in -next for the entire devel cycle. This
changes compiler diagnostics (though mainly just -Warray-bounds which
is disabled) and potential UBSAN_BOUNDS and FORTIFY _warning_
coverage. In other words, there are no new restrictions, just
potentially new warnings. Any new FORTIFY warnings we've seen have
been fixed (usually in their respective subsystem trees). For more
details, see commit df8fc4e934.
The under-development compiler attribute __counted_by has been added
so that we can start annotating flexible array members with their
associated structure member that tracks the count of flexible array
elements at run-time. It is possible (likely?) that the exact syntax
of the attribute will change before it is finalized, but GCC and Clang
are working together to sort it out. Any changes can be made to the
macro while we continue to add annotations.
As an example of that last case, I have a treewide commit waiting with
such annotations found via Coccinelle:
https://git.kernel.org/linus/adc5b3cb48a049563dc673f348eab7b6beba8a9b
Also see commit dd06e72e68 for more details.
Summary:
- Fix KMSAN vs FORTIFY in strlcpy/strlcat (Alexander Potapenko)
- Convert strreplace() to return string start (Andy Shevchenko)
- Flexible array conversions (Arnd Bergmann, Wyes Karny, Kees Cook)
- Add missing function prototypes seen with W=1 (Arnd Bergmann)
- Fix strscpy() kerndoc typo (Arne Welzel)
- Replace strlcpy() with strscpy() across many subsystems which were
either Acked by respective maintainers or were trivial changes that
went ignored for multiple weeks (Azeem Shaikh)
- Remove unneeded cc-option test for UBSAN_TRAP (Nick Desaulniers)
- Add KUnit tests for strcat()-family
- Enable KUnit tests of FORTIFY wrappers under UML
- Add more complete FORTIFY protections for strlcat()
- Add missed disabling of FORTIFY for all arch purgatories.
- Enable -fstrict-flex-arrays=3 globally
- Tightening UBSAN_BOUNDS when using GCC
- Improve checkpatch to check for strcpy, strncpy, and fake flex
arrays
- Improve use of const variables in FORTIFY
- Add requested struct_size_t() helper for types not pointers
- Add __counted_by macro for annotating flexible array size members"
* tag 'hardening-v6.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (54 commits)
netfilter: ipset: Replace strlcpy with strscpy
uml: Replace strlcpy with strscpy
um: Use HOST_DIR for mrproper
kallsyms: Replace all non-returning strlcpy with strscpy
sh: Replace all non-returning strlcpy with strscpy
of/flattree: Replace all non-returning strlcpy with strscpy
sparc64: Replace all non-returning strlcpy with strscpy
Hexagon: Replace all non-returning strlcpy with strscpy
kobject: Use return value of strreplace()
lib/string_helpers: Change returned value of the strreplace()
jbd2: Avoid printing outside the boundary of the buffer
checkpatch: Check for 0-length and 1-element arrays
riscv/purgatory: Do not use fortified string functions
s390/purgatory: Do not use fortified string functions
x86/purgatory: Do not use fortified string functions
acpi: Replace struct acpi_table_slit 1-element array with flex-array
clocksource: Replace all non-returning strlcpy with strscpy
string: use __builtin_memcpy() in strlcpy/strlcat
staging: most: Replace all non-returning strlcpy with strscpy
drm/i2c: tda998x: Replace all non-returning strlcpy with strscpy
...
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmSV8dwQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpilGD/9Yys1oxIXJpRf00fzrylAlBthRxMjFQVWw
zAut106hAQiBHvU8IkmGA3MvEFVHxtzwYhHI7IR8K3aZBIqscweCqmVI9JyogJw9
U9Twnzel47VmuKdM94FeoN+hbj1fP8EWTjzmy67/zEEfFCdmHvNlMi3lSrGYIpFy
39LxTB99Y4UarM5PtWbes37GYYljzMSWKuo4AfBkvq1eQa+sZ0Vq2xAABKq3UM7f
apqhgHtkJooRePDP0eQp+kAyyVMgW2jIK+oIdJDxNF3CKTu2w40RzaYz6fp+jVSU
H4R/xS59GW4/xql+VBJDh/qJg9K62DPPYjlW8BmSR8+IjvfFpsyH3/MacE50CD3P
20fs/Mnj49H79fDrQEHJI53cOOb2EmUitbwLbvOcColNTPpt8loBtdQxjF2RMU8R
Nyort9DJPFclYCxky1LYg1CNEC2Ln4Zy/jD47wPvqRmOQphOoVlV/hPnOEqvjaZC
49Vn70W2DeE9cXvYI7ha+XIg6/oj+Gs3iusEbV08Ci7EAtXgI+ZUUsQ97K8UNiUh
h2lqSJtuI7lBpYP9sf+BeCch5UCC+xGYyTdoM5f58lehWBBPtbs0g7S9RyRyOYxe
n+yxEUo3dAGzJ/xsKAjinbZfeWIpr0b1TkAh4w3Cq/BKzRr9Bp8lBAxYuancbQ+Y
1ADPteUOTA==
=zP4Y
-----END PGP SIGNATURE-----
Merge tag 'for-6.5/block-2023-06-23' of git://git.kernel.dk/linux
Pull block updates from Jens Axboe:
- NVMe pull request via Keith:
- Various cleanups all around (Irvin, Chaitanya, Christophe)
- Better struct packing (Christophe JAILLET)
- Reduce controller error logs for optional commands (Keith)
- Support for >=64KiB block sizes (Daniel Gomez)
- Fabrics fixes and code organization (Max, Chaitanya, Daniel
Wagner)
- bcache updates via Coly:
- Fix a race at init time (Mingzhe Zou)
- Misc fixes and cleanups (Andrea, Thomas, Zheng, Ye)
- use page pinning in the block layer for dio (David)
- convert old block dio code to page pinning (David, Christoph)
- cleanups for pktcdvd (Andy)
- cleanups for rnbd (Guoqing)
- use the unchecked __bio_add_page() for the initial single page
additions (Johannes)
- fix overflows in the Amiga partition handling code (Michael)
- improve mq-deadline zoned device support (Bart)
- keep passthrough requests out of the IO schedulers (Christoph, Ming)
- improve support for flush requests, making them less special to deal
with (Christoph)
- add bdev holder ops and shutdown methods (Christoph)
- fix the name_to_dev_t() situation and use cases (Christoph)
- decouple the block open flags from fmode_t (Christoph)
- ublk updates and cleanups, including adding user copy support (Ming)
- BFQ sanity checking (Bart)
- convert brd from radix to xarray (Pankaj)
- constify various structures (Thomas, Ivan)
- more fine grained persistent reservation ioctl capability checks
(Jingbo)
- misc fixes and cleanups (Arnd, Azeem, Demi, Ed, Hengqi, Hou, Jan,
Jordy, Li, Min, Yu, Zhong, Waiman)
* tag 'for-6.5/block-2023-06-23' of git://git.kernel.dk/linux: (266 commits)
scsi/sg: don't grab scsi host module reference
ext4: Fix warning in blkdev_put()
block: don't return -EINVAL for not found names in devt_from_devname
cdrom: Fix spectre-v1 gadget
block: Improve kernel-doc headers
blk-mq: don't insert passthrough request into sw queue
bsg: make bsg_class a static const structure
ublk: make ublk_chr_class a static const structure
aoe: make aoe_class a static const structure
block/rnbd: make all 'class' structures const
block: fix the exclusive open mask in disk_scan_partitions
block: add overflow checks for Amiga partition support
block: change all __u32 annotations to __be32 in affs_hardblocks.h
block: fix signed int overflow in Amiga partition support
block: add capacity validation in bdev_add_partition()
block: fine-granular CAP_SYS_ADMIN for Persistent Reservation
block: disallow Persistent Reservation on partitions
reiserfs: fix blkdev_put() warning from release_journal_dev()
block: fix wrong mode for blkdev_get_by_dev() from disk_scan_partitions()
block: document the holder argument to blkdev_get_by_path
...
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmSV8QgQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpupIEADKEZvpxDyaxHjYZFFeoSJRkh+AEJHe0Xtr
J5vUL8t8zmAV3F7i8XaoAEcR0dC0VQcoTc8fAOty71+5hsc7gvtyyNjqU/YWRVqK
Xr+VJuSJ+OGx3MzpRWEkepagfPyqP5cyyCOK6gqIgqzc3IwqkR/3QHVRc6oR8YbY
AQd7tqm2fQXK9WDHEy5hcaQeqb9uKZjQQoZejpPPerpJM+9RMgKxpCGtnLLIUhr/
sgl7KyLIQPBmveO2vfOR+dmsJBqsLqneqkXDKMAIfpeVEEkHHAlCH4E5Ne1XUS+s
ie4If+reuyn1Ktt5Ry1t7w2wr8cX1fcay3K28tgwjE2Bvremc5YnYgb3pyUDW38f
tXXkpg/eTXd/Pn0Crpagoa9zJ927tt5JXIO1/PagPEP1XOqUuthshDFsrVqfqbs+
36gqX2JWB4NJTg9B9KBHA3+iVCJyZLjUqOqws7hOJOvhQytZVm/IwkGBg1Slhe1a
J5WemBlqX8lTgXz0nM7cOhPYTZeKe6hazCcb5VwxTUTj9SGyYtsMfqqTwRJO9kiF
j1VzbOAgExDYe+GvfqOFPh9VqZho66+DyOD/Xtca4eH7oYyHSmP66o8nhRyPBPZA
maBxQhUkPQn4/V/0fL2TwIdWYKsbj8bUyINKPZ2L35YfeICiaYIctTwNJxtRmItB
M3VxWD3GZQ==
=KhW4
-----END PGP SIGNATURE-----
Merge tag 'for-6.5/splice-2023-06-23' of git://git.kernel.dk/linux
Pull splice updates from Jens Axboe:
"This kills off ITER_PIPE to avoid a race between truncate,
iov_iter_revert() on the pipe and an as-yet incomplete DMA to a bio
with unpinned/unref'ed pages from an O_DIRECT splice read. This causes
memory corruption.
Instead, we either use (a) filemap_splice_read(), which invokes the
buffered file reading code and splices from the pagecache into the
pipe; (b) copy_splice_read(), which bulk-allocates a buffer, reads
into it and then pushes the filled pages into the pipe; or (c) handle
it in filesystem-specific code.
Summary:
- Rename direct_splice_read() to copy_splice_read()
- Simplify the calculations for the number of pages to be reclaimed
in copy_splice_read()
- Turn do_splice_to() into a helper, vfs_splice_read(), so that it
can be used by overlayfs and coda to perform the checks on the
lower fs
- Make vfs_splice_read() jump to copy_splice_read() to handle
direct-I/O and DAX
- Provide shmem with its own splice_read to handle non-existent pages
in the pagecache. We don't want a ->read_folio() as we don't want
to populate holes, but filemap_get_pages() requires it
- Provide overlayfs with its own splice_read to call down to a lower
layer as overlayfs doesn't provide ->read_folio()
- Provide coda with its own splice_read to call down to a lower layer
as coda doesn't provide ->read_folio()
- Direct ->splice_read to copy_splice_read() in tty, procfs, kernfs
and random files as they just copy to the output buffer and don't
splice pages
- Provide wrappers for afs, ceph, ecryptfs, ext4, f2fs, nfs, ntfs3,
ocfs2, orangefs, xfs and zonefs to do locking and/or revalidation
- Make cifs use filemap_splice_read()
- Replace pointers to generic_file_splice_read() with pointers to
filemap_splice_read() as DIO and DAX are handled in the caller;
filesystems can still provide their own alternate ->splice_read()
op
- Remove generic_file_splice_read()
- Remove ITER_PIPE and its paraphernalia as generic_file_splice_read
was the only user"
* tag 'for-6.5/splice-2023-06-23' of git://git.kernel.dk/linux: (31 commits)
splice: kdoc for filemap_splice_read() and copy_splice_read()
iov_iter: Kill ITER_PIPE
splice: Remove generic_file_splice_read()
splice: Use filemap_splice_read() instead of generic_file_splice_read()
cifs: Use filemap_splice_read()
trace: Convert trace/seq to use copy_splice_read()
zonefs: Provide a splice-read wrapper
xfs: Provide a splice-read wrapper
orangefs: Provide a splice-read wrapper
ocfs2: Provide a splice-read wrapper
ntfs3: Provide a splice-read wrapper
nfs: Provide a splice-read wrapper
f2fs: Provide a splice-read wrapper
ext4: Provide a splice-read wrapper
ecryptfs: Provide a splice-read wrapper
ceph: Provide a splice-read wrapper
afs: Provide a splice-read wrapper
9p: Add splice_read wrapper
net: Make sock_splice_read() use copy_splice_read() by default
tty, proc, kernfs, random: Use copy_splice_read()
...
NFS doesn't properly support reporting the btime in getattr (yet), but
61a968b4f0 mistakenly added it to the request_mask. This causes statx
for STATX_BTIME to report a zeroed out btime instead of properly
clearing the flag.
Cc: stable@vger.kernel.org # v6.3+
Fixes: 61a968b4f0 ("nfs: report the inode version in getattr if requested")
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://bugzilla.redhat.com/show_bug.cgi?id=2214134
Reported-by: Boyang Xue <bxue@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
When the client received NFS4ERR_BADSESSION, it schedules recovery
and start the state manager thread which in turn freezes the
session table and does not allow for any new requests to use the
no-longer valid session. However, it is possible that before
the state manager thread runs, a new operation would use the
released slot that received BADSESSION and was therefore not
updated its sequence number. Such re-use of the slot can lead
the application errors.
Fixes: 5c441544f0 ("NFSv4.x: Handle bad/dead sessions correctly in nfs41_sequence_process()")
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Currently, the list_lru::shrinker_id corresponding to the nfs4_xattr
shrinkers is wrong:
>>> prog["nfs4_xattr_cache_lru"].shrinker_id
(int)0
>>> prog["nfs4_xattr_entry_lru"].shrinker_id
(int)0
>>> prog["nfs4_xattr_large_entry_lru"].shrinker_id
(int)0
>>> prog["nfs4_xattr_cache_shrinker"].id
(int)18
>>> prog["nfs4_xattr_entry_shrinker"].id
(int)19
>>> prog["nfs4_xattr_large_entry_shrinker"].id
(int)20
This is not what we expect, which will cause these shrinkers
not to be found in shrink_slab_memcg().
We should assign shrinker::id before calling list_lru_init_memcg(),
so that the corresponding list_lru::shrinker_id will be assigned
the correct value like below:
>>> prog["nfs4_xattr_cache_lru"].shrinker_id
(int)16
>>> prog["nfs4_xattr_entry_lru"].shrinker_id
(int)17
>>> prog["nfs4_xattr_large_entry_lru"].shrinker_id
(int)18
>>> prog["nfs4_xattr_cache_shrinker"].id
(int)16
>>> prog["nfs4_xattr_entry_shrinker"].id
(int)17
>>> prog["nfs4_xattr_large_entry_shrinker"].id
(int)18
So just do it.
Fixes: 95ad37f90c ("NFSv4.2: add client side xattr caching.")
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
If a SEQUENCE call receives -EIO for a shutdown client, it will retry the
RPC call. Instead of doing that for a shutdown client, just bail out.
Likewise, if the state manager decides to perform recovery for a shutdown
client, it will continuously retry. As above, just bail out.
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Walk existing RPC tasks and cancel them with -EIO when the client is
shutdown.
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Within each nfs_server sysfs tree, add an entry named "shutdown". Writing
1 to this file will set the cl_shutdown bit on the rpc_clnt structs
associated with that mount. If cl_shutdown is set, the task scheduler
immediately returns -EIO for new tasks.
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
After lockd is started, add a symlink for lockd's rpc_client under
NFS' superblock sysfs.
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
For the general and state management nfs_client under each mount, create
symlinks to their respective rpc_client sysfs entries.
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Create a sysfs directory for each mount that corresponds to the mount's
nfs_server struct. As the mount is being constructed, use the name
"server-n", but rename it to the "MAJOR:MINOR" of the mount after assigning
a device_id. The rename approach allows us to populate the mount's directory
with links to the various rpc_client objects during the mount's
construction. The naming convention (MAJOR:MINOR) can be used to reference
a particular NFS mount's sysfs tree.
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Expand the NFS network-namespaced sysfs from /sys/fs/nfs/net down one level
into /sys/fs/nfs by moving the "net" kobject onto struct
nfs_netns_client and setting it up during network namespace init.
This prepares the way for superblock kobjects within /sys/fs/nfs that will
only be visible to matching network namespaces.
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
In preparation to make objects below /sys/fs/nfs namespace aware, we need
to define our own kobj_type for the nfs kset so that we can add the
.child_ns_type member in a following patch. No functional change here, only
the unrolling of kset_create_and_add().
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Match the variable names to the sysfs structure.
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Be brief and match the subsystem name. There's no need to distinguish this
kset variable from the server.
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
After some discussion, we decided that controlling transport layer
security policy should be separate from the setting for the user
authentication flavor. To accomplish this, add a new NFS mount
option to select a transport layer security policy for RPC
operations associated with the mount point.
xprtsec=none - Transport layer security is forced off.
xprtsec=tls - Establish an encryption-only TLS session. If
the initial handshake fails, the mount fails.
If TLS is not available on a reconnect, drop
the connection and try again.
xprtsec=mtls - Both sides authenticate and an encrypted
session is created. If the initial handshake
fails, the mount fails. If TLS is not available
on a reconnect, drop the connection and try
again.
To support client peer authentication (mtls), the handshake daemon
will have configurable default authentication material (certificate
or pre-shared key). In the future, mount options can be added that
can provide this material on a per-mount basis.
Updates to mount.nfs (to support xprtsec=auto) and nfs(5) will be
sent under separate cover.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
The new field is used to match struct nfs_clients that have the same
TLS policy setting.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Add some missing observability to the fs_context tracepoints
added by commit 33ce83ef0b ("NFS: Replace fs_context-related
dprintk() call sites with tracepoints").
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Otherwise, `stat` will report a stale value to users.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Fold them into the other NFS v4.2 operations in the right spots and
adjust spacing to keep the same style.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
I add commends above each function to match the style of the other
nfs4_xdr_dec_*() functions. I also remove the unnecessary #ifdef
CONFIG_NFS_V4_2 that was added around this code, since we are already in
a v4.2-only file.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
They should be in the nfs4_xdr_enc_*() section, and not at the bottom of
the file.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Move them out of the encode_*() section and into the decode_*() section
where it belongs.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Move the function to be with the other encode_*() functions, instead of
in the middle of the nfs4_xdr_enc_*() section.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
The only overlap between the block open flags mapped into the fmode_t and
other uses of fmode_t are FMODE_READ and FMODE_WRITE. Define a new
blk_mode_t instead for use in blkdev_get_by_{dev,path}, ->open and
->ioctl and stop abusing fmode_t.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Jack Wang <jinpu.wang@ionos.com> [rnbd]
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/r/20230608110258.189493-28-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The current interface for exclusive opens is rather confusing as it
requires both the FMODE_EXCL flag and a holder. Remove the need to pass
FMODE_EXCL and just key off the exclusive open off a non-NULL holder.
For blkdev_put this requires adding the holder argument, which provides
better debug checking that only the holder actually releases the hold,
but at the same time allows removing the now superfluous mode argument.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Christian Brauner <brauner@kernel.org>
Acked-by: David Sterba <dsterba@suse.com> [btrfs]
Acked-by: Jack Wang <jinpu.wang@ionos.com> [rnbd]
Link: https://lore.kernel.org/r/20230608110258.189493-16-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
All callers of generic_perform_write need to updated ki_pos, move it into
common code.
Link: https://lkml.kernel.org/r/20230601145904.1385409-4-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Theodore Ts'o <tytso@mit.edu>
Acked-by: Darrick J. Wong <djwong@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Anna Schumaker <anna@kernel.org>
Cc: Chao Yu <chao@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "cleanup the filemap / direct I/O interaction", v4.
This series cleans up some of the generic write helper calling conventions
and the page cache writeback / invalidation for direct I/O. This is a
spinoff from the no-bufferhead kernel project, for which we'll want to an
use iomap based buffered write path in the block layer.
This patch (of 12):
The last user of current->backing_dev_info disappeared in commit
b9b1335e64 ("remove bdi_congested() and wb_congested() and related
functions"). Remove the field and all assignments to it.
Link: https://lkml.kernel.org/r/20230601145904.1385409-1-hch@lst.de
Link: https://lkml.kernel.org/r/20230601145904.1385409-2-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Acked-by: Theodore Ts'o <tytso@mit.edu>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Anna Schumaker <anna@kernel.org>
Cc: Chao Yu <chao@kernel.org>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Add a new blk_holder_ops structure, which is passed to blkdev_get_by_* and
installed in the block_device for exclusive claims. It will be used to
allow the block layer to call back into the user of the block device for
thing like notification of a removed device or a device resize.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Link: https://lore.kernel.org/r/20230601094459.1350643-10-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
strlcpy() reads the entire source buffer first.
This read may exceed the destination size limit.
This is both inefficient and can lead to linear read
overflows if a source string is not NUL-terminated [1].
Check for strscpy()'s return value of -E2BIG on truncate for safe
replacement with strlcpy().
This is part of a tree-wide cleanup to remove the strlcpy() function
entirely from the kernel [2].
[1] https://www.kernel.org/doc/html/latest/process/deprecated.html#strlcpy
[2] https://github.com/KSPP/linux/issues/89
Signed-off-by: Azeem Shaikh <azeemshaikh38@gmail.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20230512155749.1356958-1-azeemshaikh38@gmail.com
kfree()-ing the scratch page isn't enough, we also need to set the pointer
back to NULL to avoid a double-free in the case of a resend.
Fixes: fbd2a05f29 (NFSv4.2: Rework scratch handling for READ_PLUS)
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
kmap_atomic() is deprecated in favor of kmap_local_{folio,page}().
Therefore, replace kmap_atomic() with kmap_local_folio() in
nfs_readdir_folio_array_append().
kmap_atomic() disables page-faults and preemption (the latter only for
!PREEMPT_RT kernels), However, the code within the mapping/un-mapping in
nfs_readdir_folio_array_append() does not depend on the above-mentioned
side effects.
Therefore, a mere replacement of the old API with the new one is all that
is required (i.e., there is no need to explicitly add any calls to
pagefault_disable() and/or preempt_disable()).
Tested with (x)fstests in a QEMU/KVM x86_32 VM, 6GB RAM, booting a kernel
with HIGHMEM64GB enabled.
Cc: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com>
Fixes: ec108d3cc7 ("NFS: Convert readdir page array functions to use a folio")
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Dan has been improving on the smatch error pointer checks, and pointed
at another case where the __filemap_get_folio() conversion to error
pointers had been overlooked. This time because it was hidden behind
the filemap_grab_folio() helper function that is a wrapper around it.
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Cc: Anna Schumaker <anna@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix another case of an incorrect check for the returned 'folio' value
from __filemap_get_folio().
The failure case used to return NULL, but was changed by commit
66dabbb65d ("mm: return an ERR_PTR from __filemap_get_folio").
But in the meantime, commit ec108d3cc7 ("NFS: Convert readdir page
array functions to use a folio") added a new user of that function.
And my merge of the two did not fix this up correctly.
The ext4 merge had the same issue, but that one had been caught in
linux-next and got properly fixed while merging.
Fixes: 0127f25b5d ("Merge tag 'nfs-for-6.4-1' of git://git.linux-nfs.org/projects/anna/linux-nfs")
Cc: Anna Schumaker <anna@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The big ticket item for this release is support for RPC-with-TLS
[RFC 9289] has been added to the Linux NFS server. The goal is to
provide a simple-to-deploy, low-overhead in-transit confidentiality
and peer authentication mechanism. It can supplement NFS Kerberos
and it can protect the use of legacy non-cryptographic user
authentication flavors such as AUTH_SYS. The TLS Record protocol is
handled entirely by kTLS, meaning it can use either software
encryption or offload encryption to smart NICs.
Work continues on improving NFSD's open file cache. Among the many
clean-ups in that area is a patch to convert the rhashtable to use
the list-hashing version of that data structure.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmRK/JMACgkQM2qzM29m
f5cF5A/+JZFRSPlfSYt0YHzUQQSDdYn5n/IG9TwJQd62xheu083WuKRaCOYYoOhg
06nZd6p7nuF1E0n2ZWOKSE6YkBSE6z4M6KrQlm6lCe/nmxYCR87IYfJCXuL+Yf0e
/LdL4OTvDHzY5ec1DreERldPIUJ8GFzwChH8/z4XwbNDR7qJkF/gf8YxpFr+8K+j
Cfyl8woZiEze/Nvxy1YtAqa7HMEpitt0aWJN55rHwTh9c3b0nmDzziYFcVqXgybJ
/qUHfHBak66ll8RqhcQ3BMuyfszwASERbPsaZ2a5F/RaxLL5ZWfFyhgQwm+PZWT+
J5DdSBwLEQYtKQGD41A1aorP6v/u2QelfWrl4S7/qjRpREp8Ba2IU4fYLjGb1499
Imk68BA7NwFp87tdMi/7en1VVgina4U/S3X71aUYWe+C0g48BfTrVwq4SVbQSAo4
1638vbZnrJbsJMr9OaaysKWfv4KZB020Ji1KVwuqmgy5F8kdfJCCQ2UR/fHuJ3DY
R0Zrd1Ryjwr83viP+Xj0ERiW405gPdCT0RJqoA7rznRPCqT5M42tf5z65uO7iZeE
C1udgDaoQOtioKlem6FcDXLkryf986slGA7V91lat/Jt8A5jLKQfjVe3Q+kaaqXP
ka1DQnYelzMzILQQs39cqW5pShrH8e3tfRZ7JhdBgrpxVXz9ZZM=
=lA2+
-----END PGP SIGNATURE-----
Merge tag 'nfsd-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Pull nfsd updates from Chuck Lever:
"The big ticket item for this release is that support for RPC-with-TLS
[RFC 9289] has been added to the Linux NFS server.
The goal is to provide a simple-to-deploy, low-overhead in-transit
confidentiality and peer authentication mechanism. It can supplement
NFS Kerberos and it can protect the use of legacy non-cryptographic
user authentication flavors such as AUTH_SYS. The TLS Record protocol
is handled entirely by kTLS, meaning it can use either software
encryption or offload encryption to smart NICs.
Aside from that, work continues on improving NFSD's open file cache.
Among the many clean-ups in that area is a patch to convert the
rhashtable to use the list-hashing version of that data structure"
* tag 'nfsd-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (31 commits)
NFSD: Handle new xprtsec= export option
SUNRPC: Support TLS handshake in the server-side TCP socket code
NFSD: Clean up xattr memory allocation flags
NFSD: Fix problem of COMMIT and NFS4ERR_DELAY in infinite loop
SUNRPC: Clear rq_xid when receiving a new RPC Call
SUNRPC: Recognize control messages in server-side TCP socket code
SUNRPC: Be even lazier about releasing pages
SUNRPC: Convert svc_xprt_release() to the release_pages() API
SUNRPC: Relocate svc_free_res_pages()
nfsd: simplify the delayed disposal list code
SUNRPC: Ignore return value of ->xpo_sendto
SUNRPC: Ensure server-side sockets have a sock->file
NFSD: Watch for rq_pages bounds checking errors in nfsd_splice_actor()
sunrpc: simplify two-level sysctl registration for svcrdma_parm_table
SUNRPC: return proper error from get_expiry()
lockd: add some client-side tracepoints
nfs: move nfs_fhandle_hash to common include file
lockd: server should unlock lock if client rejects the grant
lockd: fix races in client GRANTED_MSG wait logic
lockd: move struct nlm_wait to lockd.h
...
New Features:
* Convert the readdir path to use folios
* Convert the NFS fscache code to use netfs
Bugfixes and Cleanups:
* Always send a RECLAIM_COMPLETE after establishing a lease
* Simplify sysctl registrations and other cleanups
* Handle out-of-order write replies on NFS v3
* Have sunrpc call_bind_status use standard hard/soft task semantics
* Other minor cleanups
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAmRMI04ACgkQ18tUv7Cl
QOuCNQ//SkQm8aOM4DkYFeDIObye6xMzgtWrB25grYNG4a/DcYqb5kNcbmI5l1tE
Tus8KMZAWSpwa0m8ALctzp+pZQWQkY/svsqqHrKIGUHBI8F0OinVCqc2MzNN75WX
m/1wELW6ek9RBL5BoJtAPt+Qu8/jP6KD64Zot7snBeUrzreaZDcz0HM+EcQhi7X7
qd5XS0/cA2eLEBBQcQdFpRhHvgW12BMYM/zp3/ER5H52L2iAlZunGWw+Nqs8ueOR
D7K2+CF1sV1k6hYbLWNoaF2J6PZr5dRpc6gSq4fLP4WUKjqQwmQp8cm9iLpf6jGa
a+Y7t8aj7vup8jVCVGWYWZA2G2gi6jWmxxWudkJwfAa1E45t1B4/C0udwlxR20OO
XI2Bhe5YwTURgSOvOS9QTZJpQN4qfpEL0NoAmAT5fAHBQ2CXDrMlSIxPS7U6LO9q
YqwIHcAHvYVnbD45IUh2Zjbp65mRb1VkU6WzOyK1/sNHEyYpubIWXB/yLaA3oGge
V3xUgvlTzLVzzyQfwiRfzAD1P5/USaXE/B36c4itfCr5rJnAfsiBP3gk0o9yq18J
3Yb6olrmc9CzeA7PN88uEus4VZHbaE9OktRFIjJ22jlLQEY4xougdE5asY1XX8F+
OKLLLeeCrsbvrANB9XcLVsLqdMYvsd0VaCX9HtN3UP+7Lod5T10=
=gpBC
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-6.4-1' of git://git.linux-nfs.org/projects/anna/linux-nfs
Pull NFS client updates from Anna Schumaker:
"New Features:
- Convert the readdir path to use folios
- Convert the NFS fscache code to use netfs
Bugfixes and Cleanups:
- Always send a RECLAIM_COMPLETE after establishing a lease
- Simplify sysctl registrations and other cleanups
- Handle out-of-order write replies on NFS v3
- Have sunrpc call_bind_status use standard hard/soft task semantics
- Other minor cleanups"
* tag 'nfs-for-6.4-1' of git://git.linux-nfs.org/projects/anna/linux-nfs:
NFSv4.2: Rework scratch handling for READ_PLUS
NFS: Cleanup unused rpc_clnt variable
NFS: set varaiable nfs_netfs_debug_id storage-class-specifier to static
SUNRPC: remove the maximum number of retries in call_bind_status
NFS: Convert readdir page array functions to use a folio
NFS: Convert the readdir array-of-pages into an array-of-folios
NFSv3: handle out-of-order write replies.
NFS: Remove fscache specific trace points and NFS_INO_FSCACHE bit
NFS: Remove all NFSIOS_FSCACHE counters due to conversion to netfs API
NFS: Convert buffered read paths to use netfs when fscache is enabled
NFS: Configure support for netfs when NFS fscache is configured
NFS: Rename readpage_async_filler to nfs_read_add_folio
sunrpc: simplify one-level sysctl registration for debug_table
sunrpc: move sunrpc_table and proc routines above
sunrpc: simplify one-level sysctl registration for xs_tunables_table
sunrpc: simplify one-level sysctl registration for xr_tunables_table
nfs: simplify two-level sysctl registration for nfs_cb_sysctls
nfs: simplify two-level sysctl registration for nfs4_cb_sysctls
lockd: simplify two-level sysctl registration for nlm_sysctls
NFSv4.1: Always send a RECLAIM_COMPLETE after establishing lease
Instead of using a tiny, static scratch buffer, we should use a kmalloc()-ed
buffer that is allocated when checking for read plus usage. This lets us
use the buffer before decoding any part of the READ_PLUS operation
instead of setting it right before segment decoding, meaning it should
be a little more robust.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
- updates to scripts/gdb from Glenn Washburn
- kexec cleanups from Bjorn Helgaas
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZEr+6wAKCRDdBJ7gKXxA
jn4NAP4u/hj/kR2dxYehcVLuQqJspCRZZBZlAReFJyHNQO6voAEAk0NN9rtG2+/E
r0G29CJhK+YL0W6mOs8O1yo9J1rZnAM=
=2CUV
-----END PGP SIGNATURE-----
Merge tag 'mm-nonmm-stable-2023-04-27-16-01' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull non-MM updates from Andrew Morton:
"Mainly singleton patches all over the place.
Series of note are:
- updates to scripts/gdb from Glenn Washburn
- kexec cleanups from Bjorn Helgaas"
* tag 'mm-nonmm-stable-2023-04-27-16-01' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (50 commits)
mailmap: add entries for Paul Mackerras
libgcc: add forward declarations for generic library routines
mailmap: add entry for Oleksandr
ocfs2: reduce ioctl stack usage
fs/proc: add Kthread flag to /proc/$pid/status
ia64: fix an addr to taddr in huge_pte_offset()
checkpatch: introduce proper bindings license check
epoll: rename global epmutex
scripts/gdb: add GDB convenience functions $lx_dentry_name() and $lx_i_dentry()
scripts/gdb: create linux/vfs.py for VFS related GDB helpers
uapi/linux/const.h: prefer ISO-friendly __typeof__
delayacct: track delays from IRQ/SOFTIRQ
scripts/gdb: timerlist: convert int chunks to str
scripts/gdb: print interrupts
scripts/gdb: raise error with reduced debugging information
scripts/gdb: add a Radix Tree Parser
lib/rbtree: use '+' instead of '|' for setting color.
proc/stat: remove arch_idle_time()
checkpatch: check for misuse of the link tags
checkpatch: allow Closes tags with links
...
switching from a user process to a kernel thread.
- More folio conversions from Kefeng Wang, Zhang Peng and Pankaj Raghav.
- zsmalloc performance improvements from Sergey Senozhatsky.
- Yue Zhao has found and fixed some data race issues around the
alteration of memcg userspace tunables.
- VFS rationalizations from Christoph Hellwig:
- removal of most of the callers of write_one_page().
- make __filemap_get_folio()'s return value more useful
- Luis Chamberlain has changed tmpfs so it no longer requires swap
backing. Use `mount -o noswap'.
- Qi Zheng has made the slab shrinkers operate locklessly, providing
some scalability benefits.
- Keith Busch has improved dmapool's performance, making part of its
operations O(1) rather than O(n).
- Peter Xu adds the UFFD_FEATURE_WP_UNPOPULATED feature to userfaultd,
permitting userspace to wr-protect anon memory unpopulated ptes.
- Kirill Shutemov has changed MAX_ORDER's meaning to be inclusive rather
than exclusive, and has fixed a bunch of errors which were caused by its
unintuitive meaning.
- Axel Rasmussen give userfaultfd the UFFDIO_CONTINUE_MODE_WP feature,
which causes minor faults to install a write-protected pte.
- Vlastimil Babka has done some maintenance work on vma_merge():
cleanups to the kernel code and improvements to our userspace test
harness.
- Cleanups to do_fault_around() by Lorenzo Stoakes.
- Mike Rapoport has moved a lot of initialization code out of various
mm/ files and into mm/mm_init.c.
- Lorenzo Stoakes removd vmf_insert_mixed_prot(), which was added for
DRM, but DRM doesn't use it any more.
- Lorenzo has also coverted read_kcore() and vread() to use iterators
and has thereby removed the use of bounce buffers in some cases.
- Lorenzo has also contributed further cleanups of vma_merge().
- Chaitanya Prakash provides some fixes to the mmap selftesting code.
- Matthew Wilcox changes xfs and afs so they no longer take sleeping
locks in ->map_page(), a step towards RCUification of pagefaults.
- Suren Baghdasaryan has improved mmap_lock scalability by switching to
per-VMA locking.
- Frederic Weisbecker has reworked the percpu cache draining so that it
no longer causes latency glitches on cpu isolated workloads.
- Mike Rapoport cleans up and corrects the ARCH_FORCE_MAX_ORDER Kconfig
logic.
- Liu Shixin has changed zswap's initialization so we no longer waste a
chunk of memory if zswap is not being used.
- Yosry Ahmed has improved the performance of memcg statistics flushing.
- David Stevens has fixed several issues involving khugepaged,
userfaultfd and shmem.
- Christoph Hellwig has provided some cleanup work to zram's IO-related
code paths.
- David Hildenbrand has fixed up some issues in the selftest code's
testing of our pte state changing.
- Pankaj Raghav has made page_endio() unneeded and has removed it.
- Peter Xu contributed some rationalizations of the userfaultfd
selftests.
- Yosry Ahmed has fixed an issue around memcg's page recalim accounting.
- Chaitanya Prakash has fixed some arm-related issues in the
selftests/mm code.
- Longlong Xia has improved the way in which KSM handles hwpoisoned
pages.
- Peter Xu fixes a few issues with uffd-wp at fork() time.
- Stefan Roesch has changed KSM so that it may now be used on a
per-process and per-cgroup basis.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZEr3zQAKCRDdBJ7gKXxA
jlLoAP0fpQBipwFxED0Us4SKQfupV6z4caXNJGPeay7Aj11/kQD/aMRC2uPfgr96
eMG3kwn2pqkB9ST2QpkaRbxA//eMbQY=
=J+Dj
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2023-04-27-15-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
- Nick Piggin's "shoot lazy tlbs" series, to improve the peformance of
switching from a user process to a kernel thread.
- More folio conversions from Kefeng Wang, Zhang Peng and Pankaj
Raghav.
- zsmalloc performance improvements from Sergey Senozhatsky.
- Yue Zhao has found and fixed some data race issues around the
alteration of memcg userspace tunables.
- VFS rationalizations from Christoph Hellwig:
- removal of most of the callers of write_one_page()
- make __filemap_get_folio()'s return value more useful
- Luis Chamberlain has changed tmpfs so it no longer requires swap
backing. Use `mount -o noswap'.
- Qi Zheng has made the slab shrinkers operate locklessly, providing
some scalability benefits.
- Keith Busch has improved dmapool's performance, making part of its
operations O(1) rather than O(n).
- Peter Xu adds the UFFD_FEATURE_WP_UNPOPULATED feature to userfaultd,
permitting userspace to wr-protect anon memory unpopulated ptes.
- Kirill Shutemov has changed MAX_ORDER's meaning to be inclusive
rather than exclusive, and has fixed a bunch of errors which were
caused by its unintuitive meaning.
- Axel Rasmussen give userfaultfd the UFFDIO_CONTINUE_MODE_WP feature,
which causes minor faults to install a write-protected pte.
- Vlastimil Babka has done some maintenance work on vma_merge():
cleanups to the kernel code and improvements to our userspace test
harness.
- Cleanups to do_fault_around() by Lorenzo Stoakes.
- Mike Rapoport has moved a lot of initialization code out of various
mm/ files and into mm/mm_init.c.
- Lorenzo Stoakes removd vmf_insert_mixed_prot(), which was added for
DRM, but DRM doesn't use it any more.
- Lorenzo has also coverted read_kcore() and vread() to use iterators
and has thereby removed the use of bounce buffers in some cases.
- Lorenzo has also contributed further cleanups of vma_merge().
- Chaitanya Prakash provides some fixes to the mmap selftesting code.
- Matthew Wilcox changes xfs and afs so they no longer take sleeping
locks in ->map_page(), a step towards RCUification of pagefaults.
- Suren Baghdasaryan has improved mmap_lock scalability by switching to
per-VMA locking.
- Frederic Weisbecker has reworked the percpu cache draining so that it
no longer causes latency glitches on cpu isolated workloads.
- Mike Rapoport cleans up and corrects the ARCH_FORCE_MAX_ORDER Kconfig
logic.
- Liu Shixin has changed zswap's initialization so we no longer waste a
chunk of memory if zswap is not being used.
- Yosry Ahmed has improved the performance of memcg statistics
flushing.
- David Stevens has fixed several issues involving khugepaged,
userfaultfd and shmem.
- Christoph Hellwig has provided some cleanup work to zram's IO-related
code paths.
- David Hildenbrand has fixed up some issues in the selftest code's
testing of our pte state changing.
- Pankaj Raghav has made page_endio() unneeded and has removed it.
- Peter Xu contributed some rationalizations of the userfaultfd
selftests.
- Yosry Ahmed has fixed an issue around memcg's page recalim
accounting.
- Chaitanya Prakash has fixed some arm-related issues in the
selftests/mm code.
- Longlong Xia has improved the way in which KSM handles hwpoisoned
pages.
- Peter Xu fixes a few issues with uffd-wp at fork() time.
- Stefan Roesch has changed KSM so that it may now be used on a
per-process and per-cgroup basis.
* tag 'mm-stable-2023-04-27-15-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (369 commits)
mm,unmap: avoid flushing TLB in batch if PTE is inaccessible
shmem: restrict noswap option to initial user namespace
mm/khugepaged: fix conflicting mods to collapse_file()
sparse: remove unnecessary 0 values from rc
mm: move 'mmap_min_addr' logic from callers into vm_unmapped_area()
hugetlb: pte_alloc_huge() to replace huge pte_alloc_map()
maple_tree: fix allocation in mas_sparse_area()
mm: do not increment pgfault stats when page fault handler retries
zsmalloc: allow only one active pool compaction context
selftests/mm: add new selftests for KSM
mm: add new KSM process and sysfs knobs
mm: add new api to enable ksm per process
mm: shrinkers: fix debugfs file permissions
mm: don't check VMA write permissions if the PTE/PMD indicates write permissions
migrate_pages_batch: fix statistics for longterm pin retry
userfaultfd: use helper function range_in_vma()
lib/show_mem.c: use for_each_populated_zone() simplify code
mm: correct arg in reclaim_pages()/reclaim_clean_pages_from_list()
fs/buffer: convert create_page_buffers to folio_create_buffers
fs/buffer: add folio_create_empty_buffers helper
...
This pull request goes with only a few sysctl moves from the
kernel/sysctl.c file, the rest of the work has been put towards
deprecating two API calls which incur recursion and prevent us
from simplifying the registration process / saving memory per
move. Most of the changes have been soaking on linux-next since
v6.3-rc3.
I've slowed down the kernel/sysctl.c moves due to Matthew Wilcox's
feedback that we should see if we could *save* memory with these
moves instead of incurring more memory. We currently incur more
memory since when we move a syctl from kernel/sysclt.c out to its
own file we end up having to add a new empty sysctl used to register
it. To achieve saving memory we want to allow syctls to be passed
without requiring the end element being empty, and just have our
registration process rely on ARRAY_SIZE(). Without this, supporting
both styles of sysctls would make the sysctl registration pretty
brittle, hard to read and maintain as can be seen from Meng Tang's
efforts to do just this [0]. Fortunately, in order to use ARRAY_SIZE()
for all sysctl registrations also implies doing the work to deprecate
two API calls which use recursion in order to support sysctl
declarations with subdirectories.
And so during this development cycle quite a bit of effort went into
this deprecation effort. I've annotated the following two APIs are
deprecated and in few kernel releases we should be good to remove them:
* register_sysctl_table()
* register_sysctl_paths()
During this merge window we should be able to deprecate and unexport
register_sysctl_paths(), we can probably do that towards the end
of this merge window.
Deprecating register_sysctl_table() will take a bit more time but
this pull request goes with a few example of how to do this.
As it turns out each of the conversions to move away from either of
these two API calls *also* saves memory. And so long term, all these
changes *will* prove to have saved a bit of memory on boot.
The way I see it then is if remove a user of one deprecated call, it
gives us enough savings to move one kernel/sysctl.c out from the
generic arrays as we end up with about the same amount of bytes.
Since deprecating register_sysctl_table() and register_sysctl_paths()
does not require maintainer coordination except the final unexport
you'll see quite a bit of these changes from other pull requests, I've
just kept the stragglers after rc3.
Most of these changes have been soaking on linux-next since around rc3.
[0] https://lkml.kernel.org/r/ZAD+cpbrqlc5vmry@bombadil.infradead.org
-----BEGIN PGP SIGNATURE-----
iQJGBAABCgAwFiEENnNq2KuOejlQLZofziMdCjCSiKcFAmRHAjQSHG1jZ3JvZkBr
ZXJuZWwub3JnAAoJEM4jHQowkoinTzgQAI/uKHKi0VlUR1l2Psl0XbseUVueuyj3
ZDxSJpbVUmsoDf2MlLjzB8mYE3ricnNTDbLr7qOyA6pXdM1N0mY5LQmRVRu8/ffd
2T1hQ5pl7YnJdWP5dPhcF9Y+jnu1tjX1MW5DS4fzllwK7FnD86HuIruGq52RAPS/
/FH+BD9eodLWWXk6A/o2GFqoWxPKQI0GLxEYWa7Hg7yt8E/3PQL9QsRzn8i6U+HW
BrN/+G3YD1VCCzXu0UAeXnm+i1Z7CdvqNdZuSkvE3DObiZ5WpOS+/i7FrDB7zdiu
zAbHaifHnDPtcK3w2ZodbLAAwEWD/mG4iwIjE2kgIMVYxBv7TFDBRREXAWYAevIT
UUuZnWDQsGaWdjywrebaUycEfd6dytKyan0fTXgMFkcoWRjejhitfdM2iZDdQROg
q453p4HqOw4vTrhy4ov4zOX7J3EFiBzpZdl+SmLqcXk+jbLVb/Q9snUWz1AFtHBl
gHoP5bS82uVktGG3MsObjgTzYYMQjO9YGIrVuW1VP9uWs8WaoWx6M9FQJIIhtwE+
h6wG2s7CjuFWnS0/IxWmDOn91QyUn1w7ohiz9TuvYj/5GLSBpBDGCJHsNB5T2WS1
qbQRaZ2Kg3j9TeyWfXxdlxBx7bt3ni+J/IXDY0zom2sTpGHKl8D2g5AzmEXJDTpl
kd7Z3gsmwhDh
=0U0W
-----END PGP SIGNATURE-----
Merge tag 'sysctl-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux
Pull sysctl updates from Luis Chamberlain:
"This only does a few sysctl moves from the kernel/sysctl.c file, the
rest of the work has been put towards deprecating two API calls which
incur recursion and prevent us from simplifying the registration
process / saving memory per move. Most of the changes have been
soaking on linux-next since v6.3-rc3.
I've slowed down the kernel/sysctl.c moves due to Matthew Wilcox's
feedback that we should see if we could *save* memory with these moves
instead of incurring more memory. We currently incur more memory since
when we move a syctl from kernel/sysclt.c out to its own file we end
up having to add a new empty sysctl used to register it. To achieve
saving memory we want to allow syctls to be passed without requiring
the end element being empty, and just have our registration process
rely on ARRAY_SIZE(). Without this, supporting both styles of sysctls
would make the sysctl registration pretty brittle, hard to read and
maintain as can be seen from Meng Tang's efforts to do just this [0].
Fortunately, in order to use ARRAY_SIZE() for all sysctl registrations
also implies doing the work to deprecate two API calls which use
recursion in order to support sysctl declarations with subdirectories.
And so during this development cycle quite a bit of effort went into
this deprecation effort. I've annotated the following two APIs are
deprecated and in few kernel releases we should be good to remove
them:
- register_sysctl_table()
- register_sysctl_paths()
During this merge window we should be able to deprecate and unexport
register_sysctl_paths(), we can probably do that towards the end of
this merge window.
Deprecating register_sysctl_table() will take a bit more time but this
pull request goes with a few example of how to do this.
As it turns out each of the conversions to move away from either of
these two API calls *also* saves memory. And so long term, all these
changes *will* prove to have saved a bit of memory on boot.
The way I see it then is if remove a user of one deprecated call, it
gives us enough savings to move one kernel/sysctl.c out from the
generic arrays as we end up with about the same amount of bytes.
Since deprecating register_sysctl_table() and register_sysctl_paths()
does not require maintainer coordination except the final unexport
you'll see quite a bit of these changes from other pull requests, I've
just kept the stragglers after rc3"
Link: https://lkml.kernel.org/r/ZAD+cpbrqlc5vmry@bombadil.infradead.org [0]
* tag 'sysctl-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux: (29 commits)
fs: fix sysctls.c built
mm: compaction: remove incorrect #ifdef checks
mm: compaction: move compaction sysctl to its own file
mm: memory-failure: Move memory failure sysctls to its own file
arm: simplify two-level sysctl registration for ctl_isa_vars
ia64: simplify one-level sysctl registration for kdump_ctl_table
utsname: simplify one-level sysctl registration for uts_kern_table
ntfs: simplfy one-level sysctl registration for ntfs_sysctls
coda: simplify one-level sysctl registration for coda_table
fs/cachefiles: simplify one-level sysctl registration for cachefiles_sysctls
xfs: simplify two-level sysctl registration for xfs_table
nfs: simplify two-level sysctl registration for nfs_cb_sysctls
nfs: simplify two-level sysctl registration for nfs4_cb_sysctls
lockd: simplify two-level sysctl registration for nlm_sysctls
proc_sysctl: enhance documentation
xen: simplify sysctl registration for balloon
md: simplify sysctl registration
hv: simplify sysctl registration
scsi: simplify sysctl registration with register_sysctl()
csky: simplify alignment sysctl registration
...
* The data=journal writepath has been significantly cleaned up and
simplified, and reduces a large number of data=journal special cases
by Jan Kara.
* Ojaswin Muhoo has replaced linked list used to track extents that
have been used for inode preallocation with a red-black tree in the
multi-block allocator. This improves performance for workloads
which do a large number of random allocating writes.
* Thanks to Kemeng Shi for a lot of cleanup and bug fixes in the
multi-block allocator.
* Matthew wilcox has converted the code paths for reading and writing
ext4 pages to use folios.
* Jason Yan has continued to factor out ext4_fill_super() into smaller
functions for improve ease of maintenance and comprehension.
* Josh Triplett has created an uapi header for ext4 userspace API's.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAmRHS3IACgkQ8vlZVpUN
gaNN7AgAnFiWfk4UqKpBsUL5iQKJgf2K4tjlNXgPd6ghNns0IdFEyeWSHhr6KLv/
SQeoMMyiWaUcTvZs9DokD8U/9M1ELPUiE9W5c9GxJjM86SXp8BlLYSZTiRoNHzGJ
noQpvikj4qTRviK0rA3q5ICTP2eh1ECHMFJy2wcsZQgwnBelUejQHsTGtOwSvFWF
8wMdfuVtAFDZJjzOxzVKfHP22R5HVRWlAU7P1d97qKjBj4Se3+QchI+zdcIrmU9A
tTmCXj57NpTDyLjS9dIDmLygtTv93lOzOmZS8glw0BFonPcd3ObI4RHVxR+V9xu1
lN13YYgBrK6yfApn9L5XL/31PuLfbg==
=VLBx
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o:
"There are a number of major cleanups in ext4 this cycle:
- The data=journal writepath has been significantly cleaned up and
simplified, and reduces a large number of data=journal special
cases by Jan Kara.
- Ojaswin Muhoo has replaced linked list used to track extents that
have been used for inode preallocation with a red-black tree in the
multi-block allocator. This improves performance for workloads
which do a large number of random allocating writes.
- Thanks to Kemeng Shi for a lot of cleanup and bug fixes in the
multi-block allocator.
- Matthew wilcox has converted the code paths for reading and writing
ext4 pages to use folios.
- Jason Yan has continued to factor out ext4_fill_super() into
smaller functions for improve ease of maintenance and
comprehension.
- Josh Triplett has created an uapi header for ext4 userspace API's"
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (105 commits)
ext4: Add a uapi header for ext4 userspace APIs
ext4: remove useless conditional branch code
ext4: remove unneeded check of nr_to_submit
ext4: move dax and encrypt checking into ext4_check_feature_compatibility()
ext4: factor out ext4_block_group_meta_init()
ext4: move s_reserved_gdt_blocks and addressable checking into ext4_check_geometry()
ext4: rename two functions with 'check'
ext4: factor out ext4_flex_groups_free()
ext4: use ext4_group_desc_free() in ext4_put_super() to save some duplicated code
ext4: factor out ext4_percpu_param_init() and ext4_percpu_param_destroy()
ext4: factor out ext4_hash_info_init()
Revert "ext4: Fix warnings when freezing filesystem with journaled data"
ext4: Update comment in mpage_prepare_extent_to_map()
ext4: Simplify handling of journalled data in ext4_bmap()
ext4: Drop special handling of journalled data from ext4_quota_on()
ext4: Drop special handling of journalled data from ext4_evict_inode()
ext4: Fix special handling of journalled data from extent zeroing
ext4: Drop special handling of journalled data from extent shifting operations
ext4: Drop special handling of journalled data from ext4_sync_file()
ext4: Commit transaction before writing back pages in data=journal mode
...
lockd needs to be able to hash filehandles for tracepoints. Move the
nfs_fhandle_hash() helper to a common nfs include file.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
On most filesystems, there is no reason to delay reaping an nfsd_file
just because its underlying inode is still under writeback. nfsd just
relies on client activity or the local flusher threads to do writeback.
The main exception is NFS, which flushes all of its dirty data on last
close. Add a new EXPORT_OP_FLUSH_ON_CLOSE flag to allow filesystems to
signal that they do this, and only skip closing files under writeback on
such filesystems.
Also, remove a redundant NULL file pointer check in
nfsd_file_check_writeback, and clean up nfs's export op flag
definitions.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Acked-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZEJxiQAKCRCRxhvAZXjc
oj4jAP4pt/SuHdZQzv9k2j+UA4hpDmwt7UUWhLxtycQrt9oEoQD/U9hCwnYY9Ksj
5TSZiK/OGyoslmF1zfAC+1R2crgr3A0=
=PpBS
-----END PGP SIGNATURE-----
Merge tag 'v6.4/vfs.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull misc vfs updates from Christian Brauner:
"This contains a pile of various smaller fixes. Most of them aren't
very interesting so this just highlights things worth mentioning:
- Various filesystems contained the same little helper to convert
from the mode of a dentry to the DT_* type of that dentry.
They have now all been switched to rely on the generic
fs_umode_to_dtype() helper. All custom helpers are removed (Jeff)
- Fsnotify now reports ACCESS and MODIFY events for splice
(Chung-Chiang Cheng)
- After converting timerfd a long time ago to rely on
wait_event_interruptible_*() apis, convert eventfd as well. This
removes the complex open-coded wait code (Wen Yang)
- Simplify sysctl registration for devpts, avoiding the declaration
of two tables. Instead, just use a prefixed path with
register_sysctl() (Luis)
- The setattr_should_drop_sgid() helper is now exported so NFS can
use it. By switching NFS to this helper an NFS setgid inheritance
bug is fixed (me)"
* tag 'v6.4/vfs.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
fs: hfsplus: remove WARN_ON() from hfsplus_cat_{read,write}_inode()
pnode: pass mountpoint directly
eventfd: use wait_event_interruptible_locked_irq() helper
splice: report related fsnotify events
fs: consolidate duplicate dt_type helpers
nfs: use vfs setgid helper
Update relatime comments to include equality
fs/buffer: Remove redundant assignment to err
fs_context: drop the unused lsm_flags member
fs/namespace: fnic: Switch to use %ptTd
Documentation: update idmappings.rst
devpts: simplify two-level sysctl registration for pty_kern_table
eventpoll: align comment with nested epoll limitation
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZEEhwgAKCRCRxhvAZXjc
otwgAQDXHnKiPm/d76lITXbxdUNCtvZz+ig26EbOrD+vEszzIQEA81dru0QbCNCt
ctoZdcsmtKbt2VaYQF1CDOhlnNg5VQM=
=pER1
-----END PGP SIGNATURE-----
Merge tag 'v6.4/vfs.acl' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull acl updates from Christian Brauner:
"After finishing the introduction of the new posix acl api last cycle
the generic POSIX ACL xattr handlers are still around in the
filesystems xattr handlers for two reasons:
(1) Because a few filesystems rely on the ->list() method of the
generic POSIX ACL xattr handlers in their ->listxattr() inode
operation.
(2) POSIX ACLs are only available if IOP_XATTR is raised. The
IOP_XATTR flag is raised in inode_init_always() based on whether
the sb->s_xattr pointer is non-NULL. IOW, the registered xattr
handlers of the filesystem are used to raise IOP_XATTR. Removing
the generic POSIX ACL xattr handlers from all filesystems would
risk regressing filesystems that only implement POSIX ACL support
and no other xattrs (nfs3 comes to mind).
This contains the work to decouple POSIX ACLs from the IOP_XATTR flag
as they don't depend on xattr handlers anymore. So it's now possible
to remove the generic POSIX ACL xattr handlers from the sb->s_xattr
list of all filesystems. This is a crucial step as the generic POSIX
ACL xattr handlers aren't used for POSIX ACLs anymore and POSIX ACLs
don't depend on the xattr infrastructure anymore.
Adressing problem (1) will require more long-term work. It would be
best to get rid of the ->list() method of xattr handlers completely at
some point.
For erofs, ext{2,4}, f2fs, jffs2, ocfs2, and reiserfs the nop POSIX
ACL xattr handler is kept around so they can continue to use
array-based xattr handler indexing.
This update does simplify the ->listxattr() implementation of all
these filesystems however"
* tag 'v6.4/vfs.acl' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
acl: don't depend on IOP_XATTR
ovl: check for ->listxattr() support
reiserfs: rework priv inode handling
fs: rename generic posix acl handlers
reiserfs: rework ->listxattr() implementation
fs: simplify ->listxattr() implementation
fs: drop unused posix acl handlers
xattr: remove unused argument
xattr: add listxattr helper
xattr: simplify listxattr helpers
The root rpc_clnt is not used here, clean it up.
Fixes: 4dc73c6791 ("NFSv4: keep state manager thread active if swap is enabled")
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
smatch reports
fs/nfs/fscache.c:260:10: warning: symbol
'nfs_netfs_debug_id' was not declared. Should it be static?
This variable is only used in its defining file, so it should be static
Signed-off-by: Tom Rix <trix@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
There is no need to declare two tables to just create directories,
this can be easily be done with a prefix path with register_sysctl().
Simplify this registration.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
There is no need to declare two tables to just create directories,
this can be easily be done with a prefix path with register_sysctl().
Simplify this registration.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
This patch only converts the actual array, but doesn't touch the
individual nfs_cache_array pages and related functions (that will be
done in the next patch).
I also adjust the names of the fields in the nfs_readdir_descriptor to
say "folio" instead of "page".
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
NFSv3 includes pre/post wcc attributes which allow the client to
determine if all changes to the file have been made by the client
itself, or if any might have been made by some other client.
If there are gaps in the pre/post ctime sequence it must be assumed that
some other client changed the file in that gap and the local cache must
be suspect. The next time the file is opened the cache should be
invalidated.
Since Commit 1c341b7775 ("NFS: Add deferred cache invalidation for
close-to-open consistency violations") in linux 5.3 the Linux client has
been triggering this invalidation. The chunk in nfs_update_inode() in
particularly triggers.
Unfortunately Linux NFS assumes that all replies will be processed in
the order sent, and will arrive in the order processed. This is not
true in general. Consequently Linux NFS might ignore the wcc info in a
WRITE reply because the reply is in response to a WRITE that was sent
before some other request for which a reply has already been seen. This
is detected by Linux using the gencount tests in nfs_inode_attr_cmp().
Also, when the gencount tests pass it is still possible that the request
were processed on the server in a different order, and a gap seen in
the ctime sequence might be filled in by a subsequent reply, so gaps
should not immediately trigger delayed invalidation.
The net result is that writing to a server and then reading the file
back can result in going to the server for the read rather than serving
it from cache - all because a couple of replies arrived out-of-order.
This is a performance regression over kernels before 5.3, though the
change in 5.3 is a correctness improvement.
This has been seen with Linux writing to a Netapp server which
occasionally re-orders requests. In testing the majority of requests
were in-order, but a few (maybe 2 or three at a time) could be
re-ordered.
This patch addresses the problem by recording any gaps seen in the
pre/post ctime sequence and not triggering invalidation until either
there are too many gaps to fit in the table, or until there are no more
active writes and the remaining gaps cannot be resolved.
We allocate a table of 16 gaps on demand. If the allocation fails we
revert to current behaviour which is of little cost as we are unlikely
to be able to cache the writes anyway.
In the table we store "start->end" pair when iversion is updated and
"end<-start" pairs pre/post pairs reported by the server. Usually these
exactly cancel out and so nothing is stored. When there are
out-of-order replies we do store gaps and these will eventually be
cancelled against later replies when this client is the only writer.
If the final write is out-of-order there may be one gap remaining when
the file is closed. This will be noticed and if there is precisely on
gap and if the iversion can be advanced to match it, then we do so.
This patch makes no attempt to handle directories correctly. The same
problem potentially exists in the out-of-order replies to create/unlink
requests can cause future lookup requires to be sent to the server
unnecessarily. A similar scheme using the same primitives could be used
to notice and handle out-of-order replies.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
The NFS specific trace points are no longer needed as tracing is well
covered by netfs and fscache.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Tested-by: Daire Byrne <daire@dneg.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
The old NFSIOS_FSCACHE counters are no longer accurate or useful with
the conversion to the new netfs API. The new API does not have a page
based interface, and so the counters in nfs_stat_fscachecounters are
no longer obtainable. The new netfs the API has extensive statistics
inside /proc/fs/fscache/stats so we no longer need NFS specific fscache
stats.
Note this also removes the 'fsc:' line from /proc/self/mountstats so
it will be a user-visible change.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Tested-by: Daire Byrne <daire@dneg.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Convert the NFS buffered read code paths to corresponding netfs APIs,
but only when fscache is configured and enabled.
The netfs API defines struct netfs_request_ops which must be filled
in by the network filesystem. For NFS, we only need to define 5 of
the functions, the main one being the issue_read() function.
The issue_read() function is called by the netfs layer when a read
cannot be fulfilled locally, and must be sent to the server (either
the cache is not active, or it is active but the data is not available).
Once the read from the server is complete, netfs requires a call to
netfs_subreq_terminated() which conveys either how many bytes were read
successfully, or an error. Note that issue_read() is called with a
structure, netfs_io_subrequest, which defines the IO requested, and
contains a start and a length (both in bytes), and assumes the underlying
netfs will return a either an error on the whole region, or the number
of bytes successfully read.
The NFS IO path is page based and the main APIs are the pgio APIs defined
in pagelist.c. For the pgio APIs, there is no way for the caller to
know how many RPCs will be sent and how the pages will be broken up
into underlying RPCs, each of which will have their own completion and
return code. In contrast, netfs is subrequest based, a single
subrequest may contain multiple pages, and a single subrequest is
initiated with issue_read() and terminated with netfs_subreq_terminated().
Thus, to utilze the netfs APIs, NFS needs some way to accommodate
the netfs API requirement on the single response to the whole
subrequest, while also minimizing disruptive changes to the NFS
pgio layer.
The approach taken with this patch is to allocate a small structure
for each nfs_netfs_issue_read() call, store the final error and number
of bytes successfully transferred in the structure, and update these values
as each RPC completes. The refcount on the structure is used as a marker
for the last RPC completion, is incremented in nfs_netfs_read_initiate(),
and decremented inside nfs_netfs_read_completion(), when a nfs_pgio_header
contains a valid pointer to the data. On the final put (which signals
the final outstanding RPC is complete) in nfs_netfs_read_completion(),
call netfs_subreq_terminated() with either the final error value (if
one or more READs complete with an error) or the number of bytes
successfully transferred (if all RPCs complete successfully). Note
that when all RPCs complete successfully, the number of bytes transferred
is capped to the length of the subrequest. Capping the transferred length
to the subrequest length prevents "Subreq overread" warnings from netfs.
This is due to the "aligned_len" in nfs_pageio_add_page(), and the
corner case where NFS requests a full page at the end of the file,
even when i_size reflects only a partial page (NFS overread).
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Tested-by: Daire Byrne <daire@dneg.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
As first steps for support of the netfs library when NFS_FSCACHE is
configured, add NETFS_SUPPORT to Kconfig and add the required netfs_inode
into struct nfs_inode.
Using netfs requires we move the VFS inode structure to be stored
inside struct netfs_inode, along with the fscache_cookie.
Thus, if NFS_FSCACHE is configured, place netfs_inode inside an
anonymous union so the vfs_inode memory is the same and we do
not need to modify other non-fscache areas of NFS.
In addition, inside the NFS fscache code, use the new helpers,
netfs_inode() and netfs_i_cookie() helpers, and remove our own
helper, nfs_i_fscache().
Later patches will convert NFS fscache to fully use netfs.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Tested-by: Daire Byrne <daire@dneg.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Rename readpage_async_filler to nfs_read_add_folio to
better reflect what this function does (add a folio to
the nfs_pageio_descriptor), and simplify arguments to
this function by removing struct nfs_readdesc.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Tested-by: Daire Byrne <daire@dneg.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
There is no need to declare two tables to just create directories,
this can be easily be done with a prefix path with register_sysctl().
Simplify this registration.
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
There is no need to declare two tables to just create directories,
this can be easily be done with a prefix path with register_sysctl().
Simplify this registration.
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
The spec requires that we always at least send a RECLAIM_COMPLETE when
we're done establishing the lease and recovering any state.
Fixes: fce5c838e1 ("nfs41: RECLAIM_COMPLETE functionality")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Remove empty if statement from nfs3_prepare_get_acl and update comment to
follow the one from the referred fs/posix_acl.c:get_acl().
No functional change intended.
Link: https://lkml.kernel.org/r/20221130151231.3654-1-ubizjak@gmail.com
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Anna Schumaker <anna@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This particular combination of flags is used by most filesystems
in their ->write_begin method, although it does find use in a
few other places. Before folios, it warranted its own function
(grab_cache_page_write_begin()), but I think that just having specialised
flags is enough. It certainly helps the few places that have been
converted from grab_cache_page_write_begin() to __filemap_get_folio().
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://lore.kernel.org/r/20230324180129.1220691-2-willy@infradead.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Instead of returning NULL for all errors, distinguish between:
- no entry found and not asked to allocated (-ENOENT)
- failed to allocate memory (-ENOMEM)
- would block (-EAGAIN)
so that callers don't have to guess the error based on the passed in
flags.
Also pass through the error through the direct callers: filemap_get_folio,
filemap_lock_folio filemap_grab_folio and filemap_get_incore_folio.
[hch@lst.de: fix null-pointer deref]
Link: https://lkml.kernel.org/r/20230310070023.GA13563@lst.de
Link: https://lkml.kernel.org/r/20230310043137.GA1624890@u2004
Link: https://lkml.kernel.org/r/20230307143410.28031-8-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> [nilfs2]
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
- Fix a crash and a resource leak in NFSv4 COMPOUND processing
- Fix issues with AUTH_SYS credential handling
- Try again to address an NFS/NFSD/SUNRPC build dependency regression
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmQsLHMACgkQM2qzM29m
f5eNAA/8DokMQLQ+gN8zhQNqmw92sUdW3m41o0/DfETVXyE60sX8uOE7PktSGwfz
fWMMiQpvmnmw/lbO84XQ9i8E0hjh8cT26l1CJum4VgSFiBJJvIqNxb0Yro43R4Jc
1wU2AOpC9qzCokdHhHKszDXuOgsz3v5OMJSQz3mG50dlq8/+6KKrCnK6jakyrvxr
vKcVMsoENhxh2MnJfbsIQ70UM/rF6dmZWzGuBJH51Fkt+0FD9cnxXZKCkv1+D8JN
5Hr+8rv4I/VqBGDzv9QHoEowZr70e8UUi2UME/jwSArhdxwsfPEqV/qWwHq9Q133
RW40Gco7/E3JUjpAVTRXVGSB+LwU1EvhWQ9qSpSx5D2CPAHJ9hsOw4I54+Q0vD+j
2druOpqIITZOvI3K54ZJXa2LK6SpZ8NnncP5YkLWOwR0Wqohy1U8Sm5uOiMs+IJa
neTxL7f+u3MDQgaDCTuBkI4oKzSDF/ZiMTWh52iPyy9x03SRYXbW6UgqDiySIg0P
jvvaDFCvKvvL2qEmksMoQbWxSjVj8PqL+qJIxQNIZwHbows6paL+l0rdSPXc+l2O
97GBlqNPfHt+AjfJvGDscaIcLA+gu+ErzwxG6BLKvB9QcX9/F3A62Nh3txpe5Q1r
M5NyQwK3vVQcTejMqw34sBqp3EeI5iIJ9CjD/2tN+dUpeHyQld8=
=bk6S
-----END PGP SIGNATURE-----
Merge tag 'nfsd-6.3-5' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Pull nfsd fixes from Chuck Lever:
- Fix a crash and a resource leak in NFSv4 COMPOUND processing
- Fix issues with AUTH_SYS credential handling
- Try again to address an NFS/NFSD/SUNRPC build dependency regression
* tag 'nfsd-6.3-5' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
NFSD: callback request does not use correct credential for AUTH_SYS
NFS: Remove "select RPCSEC_GSS_KRB5
sunrpc: only free unix grouplist after RCU settles
nfsd: call op_release, even when op_func returns an error
NFSD: Avoid calling OPDESC() with ops->opnum == OP_ILLEGAL
If CONFIG_CRYPTO=n (e.g. arm/shmobile_defconfig):
WARNING: unmet direct dependencies detected for RPCSEC_GSS_KRB5
Depends on [n]: NETWORK_FILESYSTEMS [=y] && SUNRPC [=y] && CRYPTO [=n]
Selected by [y]:
- NFS_V4 [=y] && NETWORK_FILESYSTEMS [=y] && NFS_FS [=y]
As NFSv4 can work without crypto enabled, remove the RPCSEC_GSS_KRB5
dependency altogether.
Trond says:
> It is possible to use the NFSv4.1 client with just AUTH_SYS, and
> in fact there are plenty of people out there using only that. The
> fact that RFC5661 gets its knickers in a twist about RPCSEC_GSS
> support is largely irrelevant to those people.
>
> The other issue is that ’select’ enforces the strict dependency
> that if the NFS client is compiled into the kernel, then the
> RPCSEC_GSS and kerberos code needs to be compiled in as well: they
> cannot exist as modules.
Fixes: e57d065277 ("NFS & NFSD: Update GSS dependencies")
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Niklas Söderlund <niklas.soderlund@ragnatech.se>
Suggested-by: Trond Myklebust <trondmy@hammerspace.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Bugfixes:
* Fix shutdown of NFS TCP client sockets
* Fix hangs when recovering open state after a server reboot
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAmQnOw8ACgkQ18tUv7Cl
QOvhTBAAhgxWVQkGh9Uh+b7vwBUnMenXCl2LhE9WodDgxCw4+7nQTSgW79iOKmA/
DGQd7iBT8nksKs3hPtg8aIQLuj42F0MlVexU7mJNOeA3cTSkbLaEwRJuOo3W8eKm
A3/DMhUmYNue5ufhYmVV/9/j6pHRgib1JaBHgWFjqtr5vmehCxweW5TpfgJer5oF
aX5YJkPShmMAb5C7yfZ/cjooQbLi1d21wwpkLHkANOIN2JK0QNesJb4ttYFxLxPN
BNCSY3oe/E2pcbtwZeb0Eq17m4ToHOlwdtU9JvYyrLNdrJdNEgNGzD35dQfnZG+q
FfP6Pla3Sr8T/06t+Y4mo3EyChLsM1Kq+K6iqPhVElzGA5eFtNT3TsS0Hxa4JfP4
sW3P4zx6Rl0oU1wPkwRonc66GjHmklHyfaXA6TeBqXpSKOXoZU5BP0tKdQ8TXZmu
BeO37IzSfh+DZ/etl4xrVZDK1MYOPz/096RUXhMrkBRWobwSTrj5jmQtUlgGoLqv
M1II1a4btRZ04gTl/o+eMnwR+ULp7F/Yd1hEJq1fkNhuIc8Fh+GzU/21xh2dWmsb
83jjKURyDglKSIvP6LaqB9i1ilJS5YqnQjRKhZAWxzhHrkS9GnZZ0FV0685efvUH
aVgctskoVrltJ7jkvn3Z2WSGhkVo/k/qvnyqqqeR8EIRXTE/3J0=
=S4il
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-6.3-3' of git://git.linux-nfs.org/projects/anna/linux-nfs
Pull NFS client fixes from Anna Schumaker:
- Fix shutdown of NFS TCP client sockets
- Fix hangs when recovering open state after a server reboot
* tag 'nfs-for-6.3-3' of git://git.linux-nfs.org/projects/anna/linux-nfs:
SUNRPC: fix shutdown of NFS TCP client socket
NFSv4: Fix hangs when recovering open state after a server reboot
We've aligned setgid behavior over multiple kernel releases. The details
can be found in the following two merge messages:
cf619f8919 ("Merge tag 'fs.ovl.setgid.v6.2')
426b4ca2d6 ("Merge tag 'fs.setgid.v6.0')
Consistent setgid stripping behavior is now encapsulated in the
setattr_should_drop_sgid() helper which is used by all filesystems that
strip setgid bits outside of vfs proper. Switch nfs to rely on this
helper as well. Without this patch the setgid stripping tests in
xfstests will fail.
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Message-Id: <20230313-fs-nfs-setgid-v2-1-9a59f436cfc0@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
When we're using a cached open stateid or a delegation in order to avoid
sending a CLAIM_PREVIOUS open RPC call to the server, we don't have a
new open stateid to present to update_open_stateid().
Instead rely on nfs4_try_open_cached(), just as if we were doing a
normal open.
Fixes: d2bfda2e7a ("NFSv4: don't reprocess cached open CLAIM_PREVIOUS")
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
- Fix a crash during NFS READs from certain client implementations
- Address a minor kbuild regression in v6.3
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmQaEJgACgkQM2qzM29m
f5ceYBAAiGHz+ISIIirg2+B5DttDL5ssPoqbgfZ16JZfBgaDduOg8oK7pbILv1nf
4ufZQRcLpJDSnzhqNXxQE/L4Vsei8+tqkbwILSqFsCBZk5ei2AYSsZMu7Ptf8X8X
Y+KZ55Bko0uTYAsglnPiEaDEGeq0ZYwfUc5/bnw7ZzXdAHhlwgjoAads6Q25A9f1
6ZuKQ6nIis/ok9q88GdLf7x51IdKrXTrkC1iYAeVTXCgR3qXPezqmPOzW44rZs0t
TMzUgDuyPgvB5eT/N+pwXoI9Bli9v1PFoXZjm2NchhJ/Iy6Gt1eEEB93J/qRSU/x
ziJLkk0jnJEPWJ0Ht5li4+60b58t2OoWCTJTr3FzHCcF5Iqr7YjenSi2U8IT84ct
EeFdqR2nZpJ4Y1wbAhTjQ2w6rw5WpaXDLHCRRdE7OF/D121n3FkSNo8DtSn+fbR8
sFDpOruZn4tjgpf3AjCR1zbyKw9baC6clhnLrW6AE6T4ht9XwfE6eCUSy48FSMzg
4IZKLcofQh2UEvlVtCvGOjtcK151eFKocsq+p6/yhYE6BM20Z3dBqhw9iBzqOAPg
0qZk60AWl9W5h7M+ovkGAZpwXv4djsdWf7LbFuYOQ0kodzXOQ/5sFQm4zC5F3I1Z
8WPPA9SkYyNiNSBUt65tca4JNw5h6x9A3scPWtuHB7Lz3fuGTnE=
=frHb
-----END PGP SIGNATURE-----
Merge tag 'nfsd-6.3-3' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Pull nfsd fixes from Chuck Lever:
- Fix a crash during NFS READs from certain client implementations
- Address a minor kbuild regression in v6.3
* tag 'nfsd-6.3-3' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
nfsd: don't replace page in rq_pages if it's a continuation of last page
NFS & NFSD: Update GSS dependencies
This isn't ever used by VFS now, and it couldn't even work. Any FS that
uses the SECURITY_LSM_NATIVE_LABELS flag needs to also process the
value returned back from the LSM, so it needs to do its
security_sb_set_mnt_opts() call on its own anyway.
Signed-off-by: Ondrej Mosnacek <omosnace@redhat.com>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
When the user's login time is newer than the cache's timestamp,
the original entry in the RB-tree will be replaced by a new entry.
Currently, the timestamp is only set if the entry is not found in
the RB-tree, which can cause the timestamp to be undefined when
the entry exists. This may result in a significant increase in
ACCESS operations if the timestamp is set to zero.
Signed-off-by: Chengen Du <chengen.du@canonical.com>
Fixes: 0eb43812c0 ("NFS: Clear the file access cache upon login”)
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Prior to commit 8786fde842 ("Convert NFS from readpages to
readahead"), nfs_readpages() used the old mm interface read_cache_pages()
which called task_io_account_read() for each NFS page read. After
this commit, nfs_readpages() is converted to nfs_readahead(), which
now uses the new mm interface readahead_page(). The new interface
requires callers to call task_io_account_read() themselves.
In addition, to nfs_readahead() task_io_account_read() should also
be called from nfs_read_folio().
Fixes: 8786fde842 ("Convert NFS from readpages to readahead")
Link: https://lore.kernel.org/linux-nfs/CAPt2mGNEYUk5u8V4abe=5MM5msZqmvzCVrtCP4Qw1n=gCHCnww@mail.gmail.com/
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Geert reports that:
> On v6.2, "make ARCH=m68k defconfig" gives you
> CONFIG_RPCSEC_GSS_KRB5=m
> On v6.3, it became builtin, due to dropping the dependencies on
> the individual crypto modules.
>
> $ grep -E "CRYPTO_(MD5|DES|CBC|CTS|ECB|HMAC|SHA1|AES)" .config
> CONFIG_CRYPTO_AES=y
> CONFIG_CRYPTO_AES_TI=m
> CONFIG_CRYPTO_DES=m
> CONFIG_CRYPTO_CBC=m
> CONFIG_CRYPTO_CTS=m
> CONFIG_CRYPTO_ECB=m
> CONFIG_CRYPTO_HMAC=m
> CONFIG_CRYPTO_MD5=m
> CONFIG_CRYPTO_SHA1=m
This behavior is triggered by the "default y" in the definition of
RPCSEC_GSS.
The "default y" was added in 2010 by commit df486a2590 ("NFS: Fix
the selection of security flavours in Kconfig"). However,
svc_gss_principal was removed in 2012 by commit 03a4e1f6dd
("nfsd4: move principal name into svc_cred"), so the 2010 fix is
no longer necessary. We can safely change the NFS_V4 and NFSD_V4
dependencies back to RPCSEC_GSS_KRB5 to get the nicer v6.2
behavior back.
Selecting KRB5 symbolically represents the true requirement here:
that all spec-compliant NFSv4 implementations must have Kerberos
available to use.
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Fixes: dfe9a12345 ("SUNRPC: Enable rpcsec_gss_krb5.ko to be built without CRYPTO_DES")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Remove struct posix_acl_{access,default}_handler for all filesystems
that don't depend on the xattr handler in their inode->i_op->listxattr()
method in any way. There's nothing more to do than to simply remove the
handler. It's been effectively unused ever since we introduced the new
posix acl api.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
F_SEAL_EXEC") which permits the setting of the memfd execute bit at
memfd creation time, with the option of sealing the state of the X bit.
- Peter Xu adds a patch series ("mm/hugetlb: Make huge_pte_offset()
thread-safe for pmd unshare") which addresses a rare race condition
related to PMD unsharing.
- Several folioification patch serieses from Matthew Wilcox, Vishal
Moola, Sidhartha Kumar and Lorenzo Stoakes
- Johannes Weiner has a series ("mm: push down lock_page_memcg()") which
does perform some memcg maintenance and cleanup work.
- SeongJae Park has added DAMOS filtering to DAMON, with the series
"mm/damon/core: implement damos filter". These filters provide users
with finer-grained control over DAMOS's actions. SeongJae has also done
some DAMON cleanup work.
- Kairui Song adds a series ("Clean up and fixes for swap").
- Vernon Yang contributed the series "Clean up and refinement for maple
tree".
- Yu Zhao has contributed the "mm: multi-gen LRU: memcg LRU" series. It
adds to MGLRU an LRU of memcgs, to improve the scalability of global
reclaim.
- David Hildenbrand has added some userfaultfd cleanup work in the
series "mm: uffd-wp + change_protection() cleanups".
- Christoph Hellwig has removed the generic_writepages() library
function in the series "remove generic_writepages".
- Baolin Wang has performed some maintenance on the compaction code in
his series "Some small improvements for compaction".
- Sidhartha Kumar is doing some maintenance work on struct page in his
series "Get rid of tail page fields".
- David Hildenbrand contributed some cleanup, bugfixing and
generalization of pte management and of pte debugging in his series "mm:
support __HAVE_ARCH_PTE_SWP_EXCLUSIVE on all architectures with swap
PTEs".
- Mel Gorman and Neil Brown have removed the __GFP_ATOMIC allocation
flag in the series "Discard __GFP_ATOMIC".
- Sergey Senozhatsky has improved zsmalloc's memory utilization with his
series "zsmalloc: make zspage chain size configurable".
- Joey Gouly has added prctl() support for prohibiting the creation of
writeable+executable mappings. The previous BPF-based approach had
shortcomings. See "mm: In-kernel support for memory-deny-write-execute
(MDWE)".
- Waiman Long did some kmemleak cleanup and bugfixing in the series
"mm/kmemleak: Simplify kmemleak_cond_resched() & fix UAF".
- T.J. Alumbaugh has contributed some MGLRU cleanup work in his series
"mm: multi-gen LRU: improve".
- Jiaqi Yan has provided some enhancements to our memory error
statistics reporting, mainly by presenting the statistics on a per-node
basis. See the series "Introduce per NUMA node memory error
statistics".
- Mel Gorman has a second and hopefully final shot at fixing a CPU-hog
regression in compaction via his series "Fix excessive CPU usage during
compaction".
- Christoph Hellwig does some vmalloc maintenance work in the series
"cleanup vfree and vunmap".
- Christoph Hellwig has removed block_device_operations.rw_page() in ths
series "remove ->rw_page".
- We get some maple_tree improvements and cleanups in Liam Howlett's
series "VMA tree type safety and remove __vma_adjust()".
- Suren Baghdasaryan has done some work on the maintainability of our
vm_flags handling in the series "introduce vm_flags modifier functions".
- Some pagemap cleanup and generalization work in Mike Rapoport's series
"mm, arch: add generic implementation of pfn_valid() for FLATMEM" and
"fixups for generic implementation of pfn_valid()"
- Baoquan He has done some work to make /proc/vmallocinfo and
/proc/kcore better represent the real state of things in his series
"mm/vmalloc.c: allow vread() to read out vm_map_ram areas".
- Jason Gunthorpe rationalized the GUP system's interface to the rest of
the kernel in the series "Simplify the external interface for GUP".
- SeongJae Park wishes to migrate people from DAMON's debugfs interface
over to its sysfs interface. To support this, we'll temporarily be
printing warnings when people use the debugfs interface. See the series
"mm/damon: deprecate DAMON debugfs interface".
- Andrey Konovalov provided the accurately named "lib/stackdepot: fixes
and clean-ups" series.
- Huang Ying has provided a dramatic reduction in migration's TLB flush
IPI rates with the series "migrate_pages(): batch TLB flushing".
- Arnd Bergmann has some objtool fixups in "objtool warning fixes".
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCY/PoPQAKCRDdBJ7gKXxA
jlvpAPsFECUBBl20qSue2zCYWnHC7Yk4q9ytTkPB/MMDrFEN9wD/SNKEm2UoK6/K
DmxHkn0LAitGgJRS/W9w81yrgig9tAQ=
=MlGs
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2023-02-20-13-37' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
- Daniel Verkamp has contributed a memfd series ("mm/memfd: add
F_SEAL_EXEC") which permits the setting of the memfd execute bit at
memfd creation time, with the option of sealing the state of the X
bit.
- Peter Xu adds a patch series ("mm/hugetlb: Make huge_pte_offset()
thread-safe for pmd unshare") which addresses a rare race condition
related to PMD unsharing.
- Several folioification patch serieses from Matthew Wilcox, Vishal
Moola, Sidhartha Kumar and Lorenzo Stoakes
- Johannes Weiner has a series ("mm: push down lock_page_memcg()")
which does perform some memcg maintenance and cleanup work.
- SeongJae Park has added DAMOS filtering to DAMON, with the series
"mm/damon/core: implement damos filter".
These filters provide users with finer-grained control over DAMOS's
actions. SeongJae has also done some DAMON cleanup work.
- Kairui Song adds a series ("Clean up and fixes for swap").
- Vernon Yang contributed the series "Clean up and refinement for maple
tree".
- Yu Zhao has contributed the "mm: multi-gen LRU: memcg LRU" series. It
adds to MGLRU an LRU of memcgs, to improve the scalability of global
reclaim.
- David Hildenbrand has added some userfaultfd cleanup work in the
series "mm: uffd-wp + change_protection() cleanups".
- Christoph Hellwig has removed the generic_writepages() library
function in the series "remove generic_writepages".
- Baolin Wang has performed some maintenance on the compaction code in
his series "Some small improvements for compaction".
- Sidhartha Kumar is doing some maintenance work on struct page in his
series "Get rid of tail page fields".
- David Hildenbrand contributed some cleanup, bugfixing and
generalization of pte management and of pte debugging in his series
"mm: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE on all architectures with
swap PTEs".
- Mel Gorman and Neil Brown have removed the __GFP_ATOMIC allocation
flag in the series "Discard __GFP_ATOMIC".
- Sergey Senozhatsky has improved zsmalloc's memory utilization with
his series "zsmalloc: make zspage chain size configurable".
- Joey Gouly has added prctl() support for prohibiting the creation of
writeable+executable mappings.
The previous BPF-based approach had shortcomings. See "mm: In-kernel
support for memory-deny-write-execute (MDWE)".
- Waiman Long did some kmemleak cleanup and bugfixing in the series
"mm/kmemleak: Simplify kmemleak_cond_resched() & fix UAF".
- T.J. Alumbaugh has contributed some MGLRU cleanup work in his series
"mm: multi-gen LRU: improve".
- Jiaqi Yan has provided some enhancements to our memory error
statistics reporting, mainly by presenting the statistics on a
per-node basis. See the series "Introduce per NUMA node memory error
statistics".
- Mel Gorman has a second and hopefully final shot at fixing a CPU-hog
regression in compaction via his series "Fix excessive CPU usage
during compaction".
- Christoph Hellwig does some vmalloc maintenance work in the series
"cleanup vfree and vunmap".
- Christoph Hellwig has removed block_device_operations.rw_page() in
ths series "remove ->rw_page".
- We get some maple_tree improvements and cleanups in Liam Howlett's
series "VMA tree type safety and remove __vma_adjust()".
- Suren Baghdasaryan has done some work on the maintainability of our
vm_flags handling in the series "introduce vm_flags modifier
functions".
- Some pagemap cleanup and generalization work in Mike Rapoport's
series "mm, arch: add generic implementation of pfn_valid() for
FLATMEM" and "fixups for generic implementation of pfn_valid()"
- Baoquan He has done some work to make /proc/vmallocinfo and
/proc/kcore better represent the real state of things in his series
"mm/vmalloc.c: allow vread() to read out vm_map_ram areas".
- Jason Gunthorpe rationalized the GUP system's interface to the rest
of the kernel in the series "Simplify the external interface for
GUP".
- SeongJae Park wishes to migrate people from DAMON's debugfs interface
over to its sysfs interface. To support this, we'll temporarily be
printing warnings when people use the debugfs interface. See the
series "mm/damon: deprecate DAMON debugfs interface".
- Andrey Konovalov provided the accurately named "lib/stackdepot: fixes
and clean-ups" series.
- Huang Ying has provided a dramatic reduction in migration's TLB flush
IPI rates with the series "migrate_pages(): batch TLB flushing".
- Arnd Bergmann has some objtool fixups in "objtool warning fixes".
* tag 'mm-stable-2023-02-20-13-37' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (505 commits)
include/linux/migrate.h: remove unneeded externs
mm/memory_hotplug: cleanup return value handing in do_migrate_range()
mm/uffd: fix comment in handling pte markers
mm: change to return bool for isolate_movable_page()
mm: hugetlb: change to return bool for isolate_hugetlb()
mm: change to return bool for isolate_lru_page()
mm: change to return bool for folio_isolate_lru()
objtool: add UACCESS exceptions for __tsan_volatile_read/write
kmsan: disable ftrace in kmsan core code
kasan: mark addr_has_metadata __always_inline
mm: memcontrol: rename memcg_kmem_enabled()
sh: initialize max_mapnr
m68k/nommu: add missing definition of ARCH_PFN_OFFSET
mm: percpu: fix incorrect size in pcpu_obj_full_size()
maple_tree: reduce stack usage with gcc-9 and earlier
mm: page_alloc: call panic() when memoryless node allocation fails
mm: multi-gen LRU: avoid futile retries
migrate_pages: move THP/hugetlb migration support check to simplify code
migrate_pages: batch flushing TLB
migrate_pages: share more code between _unmap and _move
...
New Features:
* Convert the read and write paths to use folios
Bugfixes and Cleanups:
* Fix tracepoint state manager flag printing
* Fix disabling swap files
* Fix NFSv4 client identifier sysfs path in the documentation
* Don't clear NFS_CAP_COPY if server returns NFS4ERR_OFFLOAD_DENIED
* Treat GETDEVICEINFO errors as a layout failure
* Replace kmap_atomic() calls with kmap_local_page()
* Constify sunrpc sysfs kobj_type structures
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAmP2lSQACgkQ18tUv7Cl
QOuaPw//TNLuEMiiMtM6EXFPc2p2tNMfM+EeMjYKZkS5WnvszVOZRzQfXz4dvvif
R2BIHT+C6AWkPOxnbuhIk86acy/wRuAPD9OMyp+3PNa4oEqjyIk5QaTvgYeeRBNe
H7haAqCiEBx/g8F9DladWLAwpilTSPpupdmiXwTS7Q7wB6NkKqc1eAJ+BTuRYAsm
19p2nGTzenWrut4HMOOmuAVrA2OcWtzgYzdJlY19lWRmzfCvX1gh3i2JxGutmnl3
tduviVBdu5QGnTKHJiP853kzs3VwEHIFSO+5z137JWKTm3IgPxW/u3zf+ijmw22t
vjbtFAhb4+u5juchvVKyIX4lClPSKZincQ6dn9soOm7qJcxYWNm40DzxAF21GUAq
d4tp+zBoe9pfKxZrylp6YxIchLQYhU+dL0hhqbKzfOAFOg28k1JsJubl/wRvWKVe
LGbvFOrUYo8arPOKfxBWMf4HCouu/vGq/qng+U/Ppjw3h8gD32aniyP3qN9Fm1JA
rFAMKTnD4I3r3x1E4HX3icUMwil36zBr9cPglQBYD5DW54ConEW8e6hvQmjTdQHP
EgN9+PSYzYXuKR/Fij7BYOymUD9grT1+5hVaOPzlM4SY3jMy2novcQGoGiLVOd0t
PGQ5047qbESFD7VkW4GBAYftrSMLiU+l5KL3aN1JRtprxu6NaYA=
=39ue
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-6.3-1' of git://git.linux-nfs.org/projects/anna/linux-nfs
Pull NFS client updates from Anna Schumaker:
"New Features:
- Convert the read and write paths to use folios
Bugfixes and Cleanups:
- Fix tracepoint state manager flag printing
- Fix disabling swap files
- Fix NFSv4 client identifier sysfs path in the documentation
- Don't clear NFS_CAP_COPY if server returns NFS4ERR_OFFLOAD_DENIED
- Treat GETDEVICEINFO errors as a layout failure
- Replace kmap_atomic() calls with kmap_local_page()
- Constify sunrpc sysfs kobj_type structures"
* tag 'nfs-for-6.3-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (25 commits)
fs/nfs: Replace kmap_atomic() with kmap_local_page() in dir.c
pNFS/filelayout: treat GETDEVICEINFO errors as layout failure
Documentation: Fix sysfs path for the NFSv4 client identifier
nfs42: do not fail with EIO if ssc returns NFS4ERR_OFFLOAD_DENIED
NFS: fix disabling of swap
SUNRPC: make kobj_type structures constant
nfs4trace: fix state manager flag printing
NFS: Remove unnecessary check in nfs_read_folio()
NFS: Improve tracing of nfs_wb_folio()
NFS: Enable tracing of nfs_invalidate_folio() and nfs_launder_folio()
NFS: fix up nfs_release_folio() to try to release the page
NFS: Clean up O_DIRECT request allocation
NFS: Fix up nfs_vm_page_mkwrite() for folios
NFS: Convert nfs_write_begin/end to use folios
NFS: Remove unused function nfs_wb_page()
NFS: Convert buffered writes to use folios
NFS: Convert the function nfs_wb_page() to use folios
NFS: Convert buffered reads to use folios
NFS: Add a helper nfs_wb_folio()
NFS: Convert the remaining pagelist helper functions to support folios
...
Two significant security enhancements are part of this release:
* NFSD's RPC header encoding and decoding, including RPCSEC GSS
and gssproxy header parsing, has been overhauled to make it
more memory-safe.
* Support for Kerberos AES-SHA2-based encryption types has been
added for both the NFS client and server. This provides a clean
path for deprecating and removing insecure encryption types
based on DES and SHA-1. AES-SHA2 is also FIPS-140 compliant, so
that NFS with Kerberos may now be used on systems with fips
enabled.
In addition to these, NFSD is now able to handle crossing into an
auto-mounted mount point on an exported NFS mount. A number of
fixes have been made to NFSD's server-side copy implementation.
RPC metrics have been converted to per-CPU variables. This helps
reduce unnecessary cross-CPU and cross-node memory bus traffic,
and significantly reduces noise when KCSAN is enabled.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmPzgiYACgkQM2qzM29m
f5dB2A//eqjpj+FgAN+UjygrwMC4ahAsPX3Sc3FG8/lTAiao3NFVFY2gxAiCPyVE
CFk+tUyfL23oXvbyfIBe3LhxSBOf621xU6up2OzqAzJqh1Q9iUWB6as3I14to8ZU
sWpxXo5ofwk1hzkbrvOAVkyfY0emwsr00iBeWMawkpBe8FZEQA31OYj3/xHr6bBI
zEVlZPBZAZlp0DZ74tb+bBLs/EOnqKj+XLWcogCH13JB3sn2umF6cQNkYgsxvHGa
TNQi4LEdzWZGme242LfBRiGGwm1xuVIjlAhYV/R1wIjaknE3QBzqfXc6lJx74WII
HaqpRJGrKqdo7B+1gaXCl/AMS7YluED1CBrxuej0wBG7l2JEB7m2MFMQ4LTQjgsn
nrr3P70DgbB4LuPCPyUS7dtsMmUXabIqP7niiCR4T1toH6lBmHAgEi4cFmkzg7Cd
EoFzn888mtDpfx4fghcsRWS5oKXEzbPJfu5+IZOD63+UB+NGpi0Xo2s23sJPK8vz
kqK/X63JYOUxWUvK0zkj/c/wW1cLqIaBwnSKbShou5/BL+cZVI+uJYrnEesgpoB2
5fh/cZv3hdcoOPO7OfcjCLQYy4J6RCWajptnk/hcS3lMvBTBrnq697iAqCVURDKU
Xfmlf7XbBwje+sk4eHgqVGEqqVjrEmoqbmA2OS44WSS5LDvxXdI=
=ZG/7
-----END PGP SIGNATURE-----
Merge tag 'nfsd-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Pull nfsd updates from Chuck Lever:
"Two significant security enhancements are part of this release:
- NFSD's RPC header encoding and decoding, including RPCSEC GSS and
gssproxy header parsing, has been overhauled to make it more
memory-safe.
- Support for Kerberos AES-SHA2-based encryption types has been added
for both the NFS client and server. This provides a clean path for
deprecating and removing insecure encryption types based on DES and
SHA-1. AES-SHA2 is also FIPS-140 compliant, so that NFS with
Kerberos may now be used on systems with fips enabled.
In addition to these, NFSD is now able to handle crossing into an
auto-mounted mount point on an exported NFS mount. A number of fixes
have been made to NFSD's server-side copy implementation.
RPC metrics have been converted to per-CPU variables. This helps
reduce unnecessary cross-CPU and cross-node memory bus traffic, and
significantly reduces noise when KCSAN is enabled"
* tag 'nfsd-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (121 commits)
NFSD: Clean up nfsd_symlink()
NFSD: copy the whole verifier in nfsd_copy_write_verifier
nfsd: don't fsync nfsd_files on last close
SUNRPC: Fix occasional warning when destroying gss_krb5_enctypes
nfsd: fix courtesy client with deny mode handling in nfs4_upgrade_open
NFSD: fix problems with cleanup on errors in nfsd4_copy
nfsd: fix race to check ls_layouts
nfsd: don't hand out delegation on setuid files being opened for write
SUNRPC: Remove ->xpo_secure_port()
SUNRPC: Clean up the svc_xprt_flags() macro
nfsd: remove fs/nfsd/fault_inject.c
NFSD: fix leaked reference count of nfsd4_ssc_umount_item
nfsd: clean up potential nfsd_file refcount leaks in COPY codepath
nfsd: zero out pointers after putting nfsd_files on COPY setup error
SUNRPC: Fix whitespace damage in svcauth_unix.c
nfsd: eliminate __nfs4_get_fd
nfsd: add some kerneldoc comments for stateid preprocessing functions
nfsd: eliminate find_deleg_file_locked
nfsd: don't take nfsd4_copy ref for OP_OFFLOAD_STATUS
SUNRPC: Add encryption self-tests
...
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmPvfncQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpob2EADXJxcr2jjYHm/7cjKkyuVX8fr80dNdMeuY
JFdsjG1k6Uj73BVhQQWYTcs/PsrWBHWRsv6uz4WgOELj55eXmf5Q0kJszyUeJW33
/DjqLvtoppVcYf80xE13wKvCfn73BjwQo6xkGM0qAYn15eaXiD/Ax3xC6eJlsBeK
PEw7EJyhacbSxZa/1D2B6+mqII1jUQWProTCc3udZ4JHi3WvdWa3Rda0qCqHl4a1
+K2aP2YTFIRPxBzfMNa/CafWVIFubTdht+4Ds6R60RImzB9e0VUBfcsiUyW5Zg7L
Fwv7ptXuWrALwVNdW56Oz1QikBxn2pdRR2HMLwKJW1MD8kP9r8LMm2jV5Rhiwe0B
OQsGRYkOzBvw+bxeP5fvk0iPGVMz6ActH4gkraA5QdLqayDaFYOadlhqz0uRo5SH
Fb42Vl658K/MHDSIk8U58TNkmrsIJsBGohXI9DOGINPPvv3XOPi4Q1HmXkGRmii0
y+lNU/QEGh7xXXew29SPP76uQpQaYfC7NxXCMw/OpOMwehzjsjshmM2lpxi8zsgt
PJUmfHv5qxCplNmTJXmUpmX7sS7550HUdu9FJb13DM+gzKg8bk9jWVuLrzqrVlG5
1hKWEl1+heg1heRfaIuJVLbPI0au6Sb4uqhih/PHyrP9TWIoAruDbDJM65GKTxyE
2uEgcHzHQw==
=poRc
-----END PGP SIGNATURE-----
Merge tag 'for-6.3/block-2023-02-16' of git://git.kernel.dk/linux
Pull block updates from Jens Axboe:
- NVMe updates via Christoph:
- Small improvements to the logging functionality (Amit Engel)
- Authentication cleanups (Hannes Reinecke)
- Cleanup and optimize the DMA mapping cod in the PCIe driver
(Keith Busch)
- Work around the command effects for Format NVM (Keith Busch)
- Misc cleanups (Keith Busch, Christoph Hellwig)
- Fix and cleanup freeing single sgl (Keith Busch)
- MD updates via Song:
- Fix a rare crash during the takeover process
- Don't update recovery_cp when curr_resync is ACTIVE
- Free writes_pending in md_stop
- Change active_io to percpu
- Updates to drbd, inching us closer to unifying the out-of-tree driver
with the in-tree one (Andreas, Christoph, Lars, Robert)
- BFQ update adding support for multi-actuator drives (Paolo, Federico,
Davide)
- Make brd compliant with REQ_NOWAIT (me)
- Fix for IOPOLL and queue entering, fixing stalled IO waiting on
timeouts (me)
- Fix for REQ_NOWAIT with multiple bios (me)
- Fix memory leak in blktrace cleanup (Greg)
- Clean up sbitmap and fix a potential hang (Kemeng)
- Clean up some bits in BFQ, and fix a bug in the request injection
(Kemeng)
- Clean up the request allocation and issue code, and fix some bugs
related to that (Kemeng)
- ublk updates and fixes:
- Add support for unprivileged ublk (Ming)
- Improve device deletion handling (Ming)
- Misc (Liu, Ziyang)
- s390 dasd fixes (Alexander, Qiheng)
- Improve utility of request caching and fixes (Anuj, Xiao)
- zoned cleanups (Pankaj)
- More constification for kobjs (Thomas)
- blk-iocost cleanups (Yu)
- Remove bio splitting from drivers that don't need it (Christoph)
- Switch blk-cgroups to use struct gendisk. Some of this is now
incomplete as select late reverts were done. (Christoph)
- Add bvec initialization helpers, and convert callers to use that
rather than open-coding it (Christoph)
- Misc fixes and cleanups (Jinke, Keith, Arnd, Bart, Li, Martin,
Matthew, Ulf, Zhong)
* tag 'for-6.3/block-2023-02-16' of git://git.kernel.dk/linux: (169 commits)
brd: use radix_tree_maybe_preload instead of radix_tree_preload
block: use proper return value from bio_failfast()
block: bio-integrity: Copy flags when bio_integrity_payload is cloned
block: Fix io statistics for cgroup in throttle path
brd: mark as nowait compatible
brd: check for REQ_NOWAIT and set correct page allocation mask
brd: return 0/-error from brd_insert_page()
block: sync mixed merged request's failfast with 1st bio's
Revert "blk-cgroup: pin the gendisk in struct blkcg_gq"
Revert "blk-cgroup: pass a gendisk to blkg_lookup"
Revert "blk-cgroup: delay blk-cgroup initialization until add_disk"
Revert "blk-cgroup: delay calling blkcg_exit_disk until disk_release"
Revert "blk-cgroup: move the cgroup information to struct gendisk"
nvme-pci: remove iod use_sgls
nvme-pci: fix freeing single sgl
block: ublk: check IO buffer based on flag need_get_data
s390/dasd: Fix potential memleak in dasd_eckd_init()
s390/dasd: sort out physical vs virtual pointers usage
block: Remove the ALLOC_CACHE_SLACK constant
block: make kobj_type structures constant
...
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCY+5NlQAKCRCRxhvAZXjc
orOaAP9i2h3OJy95nO2Fpde0Bt2UT+oulKCCcGlvXJ8/+TQpyQD/ZQq47gFQ0EAz
Br5NxeyGeecAb0lHpFz+CpLGsxMrMwQ=
=+BG5
-----END PGP SIGNATURE-----
Merge tag 'fs.idmapped.v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping
Pull vfs idmapping updates from Christian Brauner:
- Last cycle we introduced the dedicated struct mnt_idmap type for
mount idmapping and the required infrastucture in 256c8aed2b ("fs:
introduce dedicated idmap type for mounts"). As promised in last
cycle's pull request message this converts everything to rely on
struct mnt_idmap.
Currently we still pass around the plain namespace that was attached
to a mount. This is in general pretty convenient but it makes it easy
to conflate namespaces that are relevant on the filesystem with
namespaces that are relevant on the mount level. Especially for
non-vfs developers without detailed knowledge in this area this was a
potential source for bugs.
This finishes the conversion. Instead of passing the plain namespace
around this updates all places that currently take a pointer to a
mnt_userns with a pointer to struct mnt_idmap.
Now that the conversion is done all helpers down to the really
low-level helpers only accept a struct mnt_idmap argument instead of
two namespace arguments.
Conflating mount and other idmappings will now cause the compiler to
complain loudly thus eliminating the possibility of any bugs. This
makes it impossible for filesystem developers to mix up mount and
filesystem idmappings as they are two distinct types and require
distinct helpers that cannot be used interchangeably.
Everything associated with struct mnt_idmap is moved into a single
separate file. With that change no code can poke around in struct
mnt_idmap. It can only be interacted with through dedicated helpers.
That means all filesystems are and all of the vfs is completely
oblivious to the actual implementation of idmappings.
We are now also able to extend struct mnt_idmap as we see fit. For
example, we can decouple it completely from namespaces for users that
don't require or don't want to use them at all. We can also extend
the concept of idmappings so we can cover filesystem specific
requirements.
In combination with the vfs{g,u}id_t work we finished in v6.2 this
makes this feature substantially more robust and thus difficult to
implement wrong by a given filesystem and also protects the vfs.
- Enable idmapped mounts for tmpfs and fulfill a longstanding request.
A long-standing request from users had been to make it possible to
create idmapped mounts for tmpfs. For example, to share the host's
tmpfs mount between multiple sandboxes. This is a prerequisite for
some advanced Kubernetes cases. Systemd also has a range of use-cases
to increase service isolation. And there are more users of this.
However, with all of the other work going on this was way down on the
priority list but luckily someone other than ourselves picked this
up.
As usual the patch is tiny as all the infrastructure work had been
done multiple kernel releases ago. In addition to all the tests that
we already have I requested that Rodrigo add a dedicated tmpfs
testsuite for idmapped mounts to xfstests. It is to be included into
xfstests during the v6.3 development cycle. This should add a slew of
additional tests.
* tag 'fs.idmapped.v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping: (26 commits)
shmem: support idmapped mounts for tmpfs
fs: move mnt_idmap
fs: port vfs{g,u}id helpers to mnt_idmap
fs: port fs{g,u}id helpers to mnt_idmap
fs: port i_{g,u}id_into_vfs{g,u}id() to mnt_idmap
fs: port i_{g,u}id_{needs_}update() to mnt_idmap
quota: port to mnt_idmap
fs: port privilege checking helpers to mnt_idmap
fs: port inode_owner_or_capable() to mnt_idmap
fs: port inode_init_owner() to mnt_idmap
fs: port acl to mnt_idmap
fs: port xattr to mnt_idmap
fs: port ->permission() to pass mnt_idmap
fs: port ->fileattr_set() to pass mnt_idmap
fs: port ->set_acl() to pass mnt_idmap
fs: port ->get_acl() to pass mnt_idmap
fs: port ->tmpfile() to pass mnt_idmap
fs: port ->rename() to pass mnt_idmap
fs: port ->mknod() to pass mnt_idmap
fs: port ->mkdir() to pass mnt_idmap
...
-----BEGIN PGP SIGNATURE-----
iQJHBAABCAAxFiEES8DXskRxsqGE6vXTAA5oQRlWghUFAmPuDAUTHGpsYXl0b25A
a2VybmVsLm9yZwAKCRAADmhBGVaCFQu9D/9ZEYVvoY3s7tK5Cw+My+1U0sNQFUW0
D46kVF0bLxv09G8prI6FZFQORguSMZli3TEsdyUbU12ak3kNHYkTW6AxxnN0jQVE
ur6f5iOauh9mWcv2S9/gulz9KH2uiG1rGHWfClfoZgQI1QCatfaKh9nrEcNrPy71
qpTbcN8AOhhYDP2ZAHajuwTuak1jLBHYN7ihfaUPyex9kAxUalbw5wIsXYfTB+Mn
/xPuCAN9YL5KpuUzVSF7QfWWVEiqErSFeFip+NCS/srihURKr14JuVSwUwkTn8rR
ONWBawXDw+eiWZ1nQmzvPTazbvsQGO1cKx8A20nOLmJjtRqAGlfngDQrCchkoSg7
SwYDiDpDqkg5PTWSfkDMG1wY/tEm6dO89P912kY+jFuGJIrG4HPkOR20StmKkuz5
La4bUs5UwucxxS9Hlu5+aFGrYFtuixWpR/1Wu5ITRtoq+j5vzHHrQ1wkbezNXs0w
VxS7YqW1TDDUUXgBzYOctmQ50ADVVC63J6AbhA7w7Qh+1vpEvyqYcWZw/Vd54bPP
tDh3FFnEGJjX0QqtdpHpyIiOm/BXiJ/8y96vEAxVzXtOmb59AQDCPq99HgWa4G8L
62ltF7qkgXKOsWCaKk07uycE0zISnx1ZXk9bvLFeN02XhRkPxevV93Yn/YiJ7nZe
xcR23eGRsseA/w==
=EKA6
-----END PGP SIGNATURE-----
Merge tag 'iversion-v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux
Pull i_version updates from Jeff Layton:
"This overhauls how we handle i_version queries from nfsd.
Instead of having special routines and grabbing the i_version field
directly out of the inode in some cases, we've moved most of the
handling into the various filesystems' getattr operations. As a bonus,
this makes ceph's change attribute usable by knfsd as well.
This should pave the way for future work to make this value queryable
by userland, and to make it more resilient against rolling back on a
crash"
* tag 'iversion-v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
nfsd: remove fetch_iversion export operation
nfsd: use the getattr operation to fetch i_version
nfsd: move nfsd4_change_attribute to nfsfh.c
ceph: report the inode version in getattr if requested
nfs: report the inode version in getattr if requested
vfs: plumb i_version handling into struct kstat
fs: clarify when the i_version counter must be updated
fs: uninline inode_query_iversion
-----BEGIN PGP SIGNATURE-----
iQJHBAABCAAxFiEES8DXskRxsqGE6vXTAA5oQRlWghUFAmPuC5kTHGpsYXl0b25A
a2VybmVsLm9yZwAKCRAADmhBGVaCFdXpEADFN/loTtcANLPQmLmgmJLDuZr2zKrf
aziMJRjqGMx6BLdwBgX8/XGBNwpG4tkVbI+zdRoVHkpcayMDpLq0dnrvi79a/dGU
fBrI72ZDMd/S9lzbodObHMziLqvgFthsPm9ldVAZ2Kt400KKNE+ozcveiC3yVGy0
n1k5BSt/78abzpqut5whVgJBooHtUMCh3XvBJPKwgOneHfAXCm+jqaXlKKpKlpZj
s2OUyn8BLfNkTgpAZ88L5Rkf0mftjziL6C8KOMy1hvOsyiP0IkwLuQ/kO+2H0Ate
p3tbOGvUT+n1gYpFYBDLnuWB4G8+CVPxfoO6KGhwT4OlCJpPlNCM8O+w/A/dKn4I
858spkpYPMy91lEkcrRLRkg/MARWGTgZ3k76fp3OWNnfruWd6ekMlYKx9n6CIy34
Aoc3Svy9KeA7oRrbRDltmw3UVmz53GcDo337ZL1J6Jph3s86dMG7AwGYvoDfKuKK
b0oNK5db5v50scBnRHX6UWejE5fSnnHvgC7pU57u08odCVEUALB+r8f04vmkxcVJ
Qed7lolQdFzn9ddaOXzpg5KeCe/cX3p4IPZSTHad7CPr8gswmC135DfXCr64x2hC
5jyNzKbe/x+7B2xCweHmEk4ojt8IU3UaYxLJoQkNeVr8rEGC9gqZgkSDe7BnTpOf
wT0ijzhy2u5RKg==
=Zhf3
-----END PGP SIGNATURE-----
Merge tag 'locks-v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux
Pull file locking updates from Jeff Layton:
"The main change here is that I've broken out most of the file locking
definitions into a new header file. I also went ahead and completed
the removal of locks_inode function"
* tag 'locks-v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
fs: remove locks_inode
filelock: move file locking definitions to separate header file
Currently, svcauth_gss_accept() pre-reserves response buffer space
for the RPC payload length and GSS sequence number before returning
to the dispatcher, which then adds the header's accept_stat field.
The problem is the accept_stat field is supposed to go before the
length and seq_num fields. So svcauth_gss_release() has to relocate
the accept_stat value (see svcauth_gss_prepare_to_wrap()).
To enable these fields to be added to the response buffer in the
correct (final) order, the pointer to the accept_stat has to be made
available to svcauth_gss_accept() so that it can set it before
reserving space for the length and seq_num fields.
As a first step, move the pointer to the location of the accept_stat
field into struct svc_rqst.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Now that all vs_dispatch functions invoke svcxdr_init_encode(), it
is common code and can be pushed down into the generic RPC server.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Now that all vs_dispatch functions invoke svcxdr_init_decode(), it
is common code and can be pushed down into the generic RPC server.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Now with NFSD being able to cross into auto mounts,
the check can be removed.
Signed-off-by: Richard Weinberger <richard@nod.at>
Reviewed-by: Ian Kent <raven@themaw.net>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Acked-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
This reverts commit 7fd461c47c.
Unfortunately, it has come to our attention that there is still a bug
somewhere in the READ_PLUS code that can result in nfsroot systems on
ARM to crash during boot.
Let's do the right thing and revert this change so we don't break
people's nfsroot setups.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
kmap_atomic() is deprecated in favor of kmap_local_page().
With kmap_local_page() the mappings are per thread, CPU local, can take
page-faults, and can be called from any context (including interrupts).
Furthermore, the tasks can be preempted and, when they are scheduled to
run again, the kernel virtual addresses are restored and still valid.
kmap_atomic() is implemented like a kmap_local_page() which also disables
page-faults and preemption (the latter only for !PREEMPT_RT kernels,
otherwise it only disables migration).
The code within the mappings/un-mappings in the functions of dir.c don't
depend on the above-mentioned side effects of kmap_atomic(), so that mere
replacements of the old API with the new one is all that is required
(i.e., there is no need to explicitly add calls to pagefault_disable()
and/or preempt_disable()).
Therefore, replace kmap_atomic() with kmap_local_page() in fs/nfs/dir.c.
Tested in a QEMU/KVM x86_32 VM, 6GB RAM, booting a kernel with
HIGHMEM64GB enabled.
Suggested-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
When GETDEVICEINFO call fails, return the layout and fall back to MDS.
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
The NFSv4.2 server even if supports intra-SSC might prefer that for
a particular file a classic copy is performed. As returning ENOTSUPP
will clear the SSC capability of the server by the client, server
might return NFS4ERR_OFFLOAD_DENIED (well, spec talks about remote
servers there).
Update nfs42_proc_copy to handle NFS4ERR_OFFLOAD_DENIED as ENOTSUPP,
but without clearing NFS_CAP_COPY bit.
Signed-off-by: Tigran Mkrtchyan <tigran.mkrtchyan@desy.de>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
When swap is activated to a file on an NFSv4 mount we arrange that the
state manager thread is always present as starting a new thread requires
memory allocations that might block waiting for swap.
Unfortunately the code for allowing the state manager thread to exit when
swap is disabled was not tested properly and does not work.
This can be seen by examining /proc/fs/nfsfs/servers after disabling swap
and unmounting the filesystem. The servers file will still list one
entry. Also a "ps" listing will show the state manager thread is still
present.
There are two problems.
1/ rpc_clnt_swap_deactivate() doesn't walk up the ->cl_parent list to
find the primary client on which the state manager runs.
2/ The thread is not woken up properly and it immediately goes back to
sleep without checking whether it is really needed. Using
nfs4_schedule_state_manager() ensures a proper wake-up.
Reported-by: Olga Kornievskaia <aglo@umich.edu>
Fixes: 4dc73c6791 ("NFSv4: keep state manager thread active if swap is enabled")
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
__print_flags wants a mask, not the enum value. Add two more flags.
Fixes: 511ba52e4c ("NFS4: Trace state recovery operation")
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
All the callers are expected to supply a valid struct file argument, so
there is no need for the NULL check.
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Include info about which folio is being traced.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
If the gfp context allows it, and we're not kswapd, then try to write
out the folio that has private data.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Rather than adjusting the index+offset after the call to
nfs_create_request(), add a function nfs_page_create_from_page() that
takes an offset.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Mechanical conversion of struct page and functions into the folio
equivalents.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Add a helper nfs_folio_grab_cache_write_begin() that can call
__filemap_get_folio() directly with the appropriate parameters.
Since write_begin()/write_end() take struct page arguments, just pass
the folio->page back for now.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Mostly mechanical conversion of struct page and functions into struct
folio equivalents.
The lack of support for folios in write_cache_pages(), means we still
only support order 0 folio allocations. However the rest of the
writeback code should now be ready for order n > 0.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Convert to use the folio functions, but pass the struct page to
nfs_writepage_locked() for now.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Perform a largely mechanical conversion of references to struct page and
page-specific functions to use the folio equivalents.
Note that the fscache functionality remains untouched. Instead we just
pass in the folio page. This should be OK, as long as we use order 0
folios together with fscache.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
...and use it in nfs_launder_folio().
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Allow creation of subrequests from a request that is carrying a folio.
Add helpers to set up and tear down requests carrying folios.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Replace all the open coded calls to page_file_mapping(req->wb_page)->host.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Use the helper folio_size() where appropriate.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Add support for multi-page folios in the generic NFS i/o engine.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
If the same page and data is being used for multiple requests,
then ignore that when the request indicates we're reading from the start
of the page.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Patch series "Convert writepage_t to use a folio".
More folioisation. I split out the mpage work from everything else
because it completely dominated the patch, but some implementations I just
converted outright.
This patch (of 2):
We always write back an entire folio, but that's currently passed as the
head page. Convert all filesystems that use write_cache_pages() to expect
a folio instead of a page.
Link: https://lkml.kernel.org/r/20230126201255.1681189-1-willy@infradead.org
Link: https://lkml.kernel.org/r/20230126201255.1681189-2-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Now that the i_version counter is reported in struct kstat, there is no
need for this export operation.
Acked-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Allow NFS to report the i_version in getattr requests. Since the cost to
fetch it is relatively cheap, do it unconditionally and just set the
flag if it looks like it's valid. Also, conditionally enable the
MONOTONIC flag when the server reports its change attr type as such.
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Convert to struct mnt_idmap.
Last cycle we merged the necessary infrastructure in
256c8aed2b ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.
Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.
Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.
Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Convert to struct mnt_idmap.
Last cycle we merged the necessary infrastructure in
256c8aed2b ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.
Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.
Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.
Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Convert to struct mnt_idmap.
Last cycle we merged the necessary infrastructure in
256c8aed2b ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.
Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.
Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.
Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Convert to struct mnt_idmap.
Last cycle we merged the necessary infrastructure in
256c8aed2b ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.
Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.
Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.
Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Convert to struct mnt_idmap.
Last cycle we merged the necessary infrastructure in
256c8aed2b ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.
Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.
Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.
Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Convert to struct mnt_idmap.
Last cycle we merged the necessary infrastructure in
256c8aed2b ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.
Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.
Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.
Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Convert to struct mnt_idmap.
Last cycle we merged the necessary infrastructure in
256c8aed2b ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.
Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.
Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.
Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Convert to struct mnt_idmap.
Last cycle we merged the necessary infrastructure in
256c8aed2b ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.
Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.
Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.
Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Convert to struct mnt_idmap.
Last cycle we merged the necessary infrastructure in
256c8aed2b ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.
Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.
Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.
Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Convert to struct mnt_idmap.
Last cycle we merged the necessary infrastructure in
256c8aed2b ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.
Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.
Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.
Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
The file locking definitions have lived in fs.h since the dawn of time,
but they are only used by a small subset of the source files that
include it.
Move the file locking definitions to a new header file, and add the
appropriate #include directives to the source files that need them. By
doing this we trim down fs.h a bit and limit the amount of rebuilding
that has to be done when we make changes to the file locking APIs.
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Howells <dhowells@redhat.com>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Acked-by: Chuck Lever <chuck.lever@oracle.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Steve French <stfrench@microsoft.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Highlights include:
Bugfixes
- Fix a race in the RPCSEC_GSS upcall code that causes hung RPC calls
- Fix a broken coalescing test in the pNFS file layout driver
- Ensure that the access cache rcu path also applies the login test
- Fix up for a sparse warning
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEESQctxSBg8JpV8KqEZwvnipYKAPIFAmO5s6MACgkQZwvnipYK
APJ8PA//aVnu+DqLTxeGnyr4HNIcm0su4TetghQabAf0S8nk69Ly8W+N+xCRwwCR
/PvSe8j51tXVInpYQKSOTjDl8/0mq8cybf6N9DzDS0A+hKQfdK8jvPLU0yj/aZ6u
D8x17ve7RjzLgw9minltq4eTvVIHpqSh/vVUyQhnj2A8qQlWrboXiQnLQAp5KBbq
7NexvcajgfFHXm7S+edZ8NH5QZ20FNVskvzyrClp46DQLJcdNLHGJLp8tdeD31kE
lsNBsLG0LxOFyBY28MbQptXalhBP8RRwYwxdSdIUnExzB7eKTLS6LgHuK9GV+EzL
wSfRrnyM023momKmdFn3Fy/cRyeF1RwfQON+KAHo1cQ5fFiQS16p7lTELGkApyLc
yOXGpAvaPJuuBJLM+Xi07ovhqpc7SjGmr7cjZslyasNtez60Le06oSihKnrgcKDx
ZVh1pTc89O25GqGm161HaQJY9YU4XCtjxPv7rvdrlhO5Dkgxai293R5aXUFBz1dD
tssv5a2DqIekrE29YaCi/RjIqkngCXHlwniFw/L3/aFRUyHqN9/MyxH5VeCRDI2T
OJQRR056V1FXSBLH6Wuzs2Ixz2HnJCLR7LE0zAoPLm/RwW8CuNmEXu28jfU0Kcmt
Hj5VyxYfnqEAGsUO0O2oNhOnxWY9OHvlvpb/9DWcke0ASpeScB4=
=N6cT
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-6.2-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client fixes from Trond Myklebust:
- Fix a race in the RPCSEC_GSS upcall code that causes hung RPC calls
- Fix a broken coalescing test in the pNFS file layout driver
- Ensure that the access cache rcu path also applies the login test
- Fix up for a sparse warning
* tag 'nfs-for-6.2-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
NFS: Fix up a sparse warning
NFS: Judge the file access cache's timestamp in rcu path
pNFS/filelayout: Fix coalescing test for single DS
SUNRPC: ensure the matching upcall is in-flight upon downcall
If the user's login time is newer than the cache's timestamp,
we expect the cache may be stale and need to clear.
The stale cache will remain in the list's tail if no other
users operate on that inode.
Once the user accesses the inode, the stale cache will be
returned in rcu path.
Signed-off-by: Chengen Du <chengen.du@canonical.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
When there is a single DS no striping constraints need to be placed on
the IO. When such constraint is applied then buffered reads don't
coalesce to the DS's rsize.
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Here is the set of driver core and kernfs changes for 6.2-rc1.
The "big" change in here is the addition of a new macro,
container_of_const() that will preserve the "const-ness" of a pointer
passed into it.
The "problem" of the current container_of() macro is that if you pass in
a "const *", out of it can comes a non-const pointer unless you
specifically ask for it. For many usages, we want to preserve the
"const" attribute by using the same call. For a specific example, this
series changes the kobj_to_dev() macro to use it, allowing it to be used
no matter what the const value is. This prevents every subsystem from
having to declare 2 different individual macros (i.e.
kobj_const_to_dev() and kobj_to_dev()) and having the compiler enforce
the const value at build time, which having 2 macros would not do
either.
The driver for all of this have been discussions with the Rust kernel
developers as to how to properly mark driver core, and kobject, objects
as being "non-mutable". The changes to the kobject and driver core in
this pull request are the result of that, as there are lots of paths
where kobjects and device pointers are not modified at all, so marking
them as "const" allows the compiler to enforce this.
So, a nice side affect of the Rust development effort has been already
to clean up the driver core code to be more obvious about object rules.
All of this has been bike-shedded in quite a lot of detail on lkml with
different names and implementations resulting in the tiny version we
have in here, much better than my original proposal. Lots of subsystem
maintainers have acked the changes as well.
Other than this change, included in here are smaller stuff like:
- kernfs fixes and updates to handle lock contention better
- vmlinux.lds.h fixes and updates
- sysfs and debugfs documentation updates
- device property updates
All of these have been in the linux-next tree for quite a while with no
problems, OTHER than some merge issues with other trees that should be
obvious when you hit them (block tree deletes a driver that this tree
modifies, iommufd tree modifies code that this tree also touches). If
there are merge problems with these trees, please let me know.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCY5wz3A8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+yks0ACeKYUlVgCsER8eYW+x18szFa2QTXgAn2h/VhZe
1Fp53boFaQkGBjl8mGF8
=v+FB
-----END PGP SIGNATURE-----
Merge tag 'driver-core-6.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core updates from Greg KH:
"Here is the set of driver core and kernfs changes for 6.2-rc1.
The "big" change in here is the addition of a new macro,
container_of_const() that will preserve the "const-ness" of a pointer
passed into it.
The "problem" of the current container_of() macro is that if you pass
in a "const *", out of it can comes a non-const pointer unless you
specifically ask for it. For many usages, we want to preserve the
"const" attribute by using the same call. For a specific example, this
series changes the kobj_to_dev() macro to use it, allowing it to be
used no matter what the const value is. This prevents every subsystem
from having to declare 2 different individual macros (i.e.
kobj_const_to_dev() and kobj_to_dev()) and having the compiler enforce
the const value at build time, which having 2 macros would not do
either.
The driver for all of this have been discussions with the Rust kernel
developers as to how to properly mark driver core, and kobject,
objects as being "non-mutable". The changes to the kobject and driver
core in this pull request are the result of that, as there are lots of
paths where kobjects and device pointers are not modified at all, so
marking them as "const" allows the compiler to enforce this.
So, a nice side affect of the Rust development effort has been already
to clean up the driver core code to be more obvious about object
rules.
All of this has been bike-shedded in quite a lot of detail on lkml
with different names and implementations resulting in the tiny version
we have in here, much better than my original proposal. Lots of
subsystem maintainers have acked the changes as well.
Other than this change, included in here are smaller stuff like:
- kernfs fixes and updates to handle lock contention better
- vmlinux.lds.h fixes and updates
- sysfs and debugfs documentation updates
- device property updates
All of these have been in the linux-next tree for quite a while with
no problems"
* tag 'driver-core-6.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (58 commits)
device property: Fix documentation for fwnode_get_next_parent()
firmware_loader: fix up to_fw_sysfs() to preserve const
usb.h: take advantage of container_of_const()
device.h: move kobj_to_dev() to use container_of_const()
container_of: add container_of_const() that preserves const-ness of the pointer
driver core: fix up missed drivers/s390/char/hmcdrv_dev.c class.devnode() conversion.
driver core: fix up missed scsi/cxlflash class.devnode() conversion.
driver core: fix up some missing class.devnode() conversions.
driver core: make struct class.devnode() take a const *
driver core: make struct class.dev_uevent() take a const *
cacheinfo: Remove of_node_put() for fw_token
device property: Add a blank line in Kconfig of tests
device property: Rename goto label to be more precise
device property: Move PROPERTY_ENTRY_BOOL() a bit down
device property: Get rid of __PROPERTY_ENTRY_ARRAY_EL*SIZE*()
kernfs: fix all kernel-doc warnings and multiple typos
driver core: pass a const * into of_device_uevent()
kobject: kset_uevent_ops: make name() callback take a const *
kobject: kset_uevent_ops: make filter() callback take a const *
kobject: make kobject_namespace take a const *
...
- Convert flexible array members, fix -Wstringop-overflow warnings,
and fix KCFI function type mismatches that went ignored by
maintainers (Gustavo A. R. Silva, Nathan Chancellor, Kees Cook).
- Remove the remaining side-effect users of ksize() by converting
dma-buf, btrfs, and coredump to using kmalloc_size_roundup(),
add more __alloc_size attributes, and introduce full testing
of all allocator functions. Finally remove the ksize() side-effect
so that each allocation-aware checker can finally behave without
exceptions.
- Introduce oops_limit (default 10,000) and warn_limit (default off)
to provide greater granularity of control for panic_on_oops and
panic_on_warn (Jann Horn, Kees Cook).
- Introduce overflows_type() and castable_to_type() helpers for
cleaner overflow checking.
- Improve code generation for strscpy() and update str*() kern-doc.
- Convert strscpy and sigphash tests to KUnit, and expand memcpy
tests.
- Always use a non-NULL argument for prepare_kernel_cred().
- Disable structleak plugin in FORTIFY KUnit test (Anders Roxell).
- Adjust orphan linker section checking to respect CONFIG_WERROR
(Xin Li).
- Make sure siginfo is cleared for forced SIGKILL (haifeng.xu).
- Fix um vs FORTIFY warnings for always-NULL arguments.
-----BEGIN PGP SIGNATURE-----
iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAmOZSOoWHGtlZXNjb29r
QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJjAAD/0YkvpU7f03f8hcQMJK6wv//24K
AW41hEaBikq9RcmkuvkLLrJRibGgZ5O2xUkUkxRs/HxhkhrZ0kEw8sbwZe8MoWls
F4Y9+TDjsrdHmjhfcBZdLnVxwcKK5wlaEcpjZXtbsfcdhx3TbgcDA23YELl5t0K+
I11j4kYmf9SLl4CwIrSP5iACml8CBHARDh8oIMF7FT/LrjNbM8XkvBcVVT6hTbOV
yjgA8WP2e9GXvj9GzKgqvd0uE/kwPkVAeXLNFWopPi4FQ8AWjlxbBZR0gamA6/EB
d7TIs0ifpVU2JGQaTav4xO6SsFMj3ntoUI0qIrFaTxZAvV4KYGrPT/Kwz1O4SFaG
rN5lcxseQbPQSBTFNG4zFjpywTkVCgD2tZqDwz5Rrmiraz0RyIokCN+i4CD9S0Ds
oEd8JSyLBk1sRALczkuEKo0an5AyC9YWRcBXuRdIHpLo08PsbeUUSe//4pe303cw
0ApQxYOXnrIk26MLElTzSMImlSvlzW6/5XXzL9ME16leSHOIfDeerPnc9FU9Eb3z
ODv22z6tJZ9H/apSUIHZbMciMbbVTZ8zgpkfydr08o87b342N/ncYHZ5cSvQ6DWb
jS5YOIuvl46/IhMPT16qWC8p0bP5YhxoPv5l6Xr0zq0ooEj0E7keiD/SzoLvW+Qs
AHXcibguPRQBPAdiPQ==
=yaaN
-----END PGP SIGNATURE-----
Merge tag 'hardening-v6.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull kernel hardening updates from Kees Cook:
- Convert flexible array members, fix -Wstringop-overflow warnings, and
fix KCFI function type mismatches that went ignored by maintainers
(Gustavo A. R. Silva, Nathan Chancellor, Kees Cook)
- Remove the remaining side-effect users of ksize() by converting
dma-buf, btrfs, and coredump to using kmalloc_size_roundup(), add
more __alloc_size attributes, and introduce full testing of all
allocator functions. Finally remove the ksize() side-effect so that
each allocation-aware checker can finally behave without exceptions
- Introduce oops_limit (default 10,000) and warn_limit (default off) to
provide greater granularity of control for panic_on_oops and
panic_on_warn (Jann Horn, Kees Cook)
- Introduce overflows_type() and castable_to_type() helpers for cleaner
overflow checking
- Improve code generation for strscpy() and update str*() kern-doc
- Convert strscpy and sigphash tests to KUnit, and expand memcpy tests
- Always use a non-NULL argument for prepare_kernel_cred()
- Disable structleak plugin in FORTIFY KUnit test (Anders Roxell)
- Adjust orphan linker section checking to respect CONFIG_WERROR (Xin
Li)
- Make sure siginfo is cleared for forced SIGKILL (haifeng.xu)
- Fix um vs FORTIFY warnings for always-NULL arguments
* tag 'hardening-v6.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (31 commits)
ksmbd: replace one-element arrays with flexible-array members
hpet: Replace one-element array with flexible-array member
um: virt-pci: Avoid GCC non-NULL warning
signal: Initialize the info in ksignal
lib: fortify_kunit: build without structleak plugin
panic: Expose "warn_count" to sysfs
panic: Introduce warn_limit
panic: Consolidate open-coded panic_on_warn checks
exit: Allow oops_limit to be disabled
exit: Expose "oops_count" to sysfs
exit: Put an upper limit on how often we can oops
panic: Separate sysctl logic from CONFIG_SMP
mm/pgtable: Fix multiple -Wstringop-overflow warnings
mm: Make ksize() a reporting-only function
kunit/fortify: Validate __alloc_size attribute results
drm/sti: Fix return type of sti_{dvo,hda,hdmi}_connector_mode_valid()
drm/fsl-dcu: Fix return type of fsl_dcu_drm_connector_mode_valid()
driver core: Add __alloc_size hint to devm allocators
overflow: Introduce overflows_type() and castable_to_type()
coredump: Proactively round up to kmalloc bucket size
...
Highlights include:
Bugfixes
- Fix a NULL pointer dereference in the mount parser
- Fix a memory stomp in decode_attr_security_label
- Fix a credential leak in _nfs4_discover_trunking()
- Fix a buffer leak in rpcrdma_req_create()
- Fix a leaked socket in rpc_sockname()
- Fix a deadlock between nfs4_open_recover_helper() and delegreturn
- Fix an Oops in nfs_d_automount()
- Fix a potential race in nfs_call_unlink()
- Multiple fixes for the open context mode
- NFSv4.2 READ_PLUS fixes
- Fix a regression in which small rsize/wsize values are being forbidden
- Fail client initialisation if the NFSv4.x state manager thread can't run
- avoid spurious warning of lost lock that is being unlocked.
- Ensure the initialisation of struct nfs4_label
Features and cleanups
- Trigger the "ls -l" readdir heuristic sooner
- Clear the file access cache upon login to ensure supplementary group
info is in sync between the client and server
- pnfs: Fix up the logging of layout stateids
- NFSv4.2: Change the default KConfig value for READ_PLUS
- Use sysfs_emit() instead of scnprintf() where appropriate
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEESQctxSBg8JpV8KqEZwvnipYKAPIFAmOYmTQACgkQZwvnipYK
APLxWw//dAFjHXc3YOLi1J94nINZbOyAvdS2WAPVLRfMh8mEln9JAxVsUREz279N
02EFbW/ZAN3nd2SGdYgbdzTIIJ154LlDS9DmIdwgqK2I/tEZauFeDTpTBDIDNwMg
eruU4twttbgrE7lgVlE0ZDGlvMNFao462BriPeEN+WaaqQWB2NqMp/qoh1i4mbcn
rQPPiovn0LiW5K5ulMT5OFz9y4I4oNaJV77sPSUew0M0iLbGtI1OdrF844+8gzb0
BeMW/ArvM5TpuwMOGmzdt4pql8MTPf4qvcnafiMX69Eg2asGijZlr/AxOpSOYXSl
LDmAXczQ0VSsXYkKI2bJz1d5xWeFL4ZJVsaR8W/M4aewgqHWtAIs/b+Gk2x/S+Zo
s1lGFrHrvzhXd/c8LxuH+LBUxLqy6nep0sVWKOK/eQ79BUo8lLwev8YbNk8Sk4Rh
N2SrcJQeHu9uYKUbWuRWt5r2L9ffYVcSQQfEk5IzCvRuz6B7sFXtwNJK8rPjz78/
LLHNokQIPHUTYUa2P+zKwozwSwOc6y6eqEhxH8CvDJlo5OfDvT+9deyNb8ZOUxiq
UXHXG2qQivU0wTaqoowLp8D8FT3MQB9KkUZkikGf9d884qXgbWipx0kGOp+Y6JJp
JesK1KxcxtKZRiywBSz6fhWW78VknYWf4RR6cESjgpUSbuvblfE=
=Mz1l
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-6.2-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client updates from Trond Myklebust
"Bugfixes:
- Fix NULL pointer dereference in the mount parser
- Fix memory stomp in decode_attr_security_label
- Fix credential leak in _nfs4_discover_trunking()
- Fix buffer leak in rpcrdma_req_create()
- Fix leaked socket in rpc_sockname()
- Fix deadlock between nfs4_open_recover_helper() and delegreturn
- Fix an Oops in nfs_d_automount()
- Fix potential race in nfs_call_unlink()
- Multiple fixes for the open context mode
- NFSv4.2 READ_PLUS fixes
- Fix a regression in which small rsize/wsize values are being
forbidden
- Fail client initialisation if the NFSv4.x state manager thread
can't run
- Avoid spurious warning of lost lock that is being unlocked.
- Ensure the initialisation of struct nfs4_label
Features and cleanups:
- Trigger the "ls -l" readdir heuristic sooner
- Clear the file access cache upon login to ensure supplementary
group info is in sync between the client and server
- pnfs: Fix up the logging of layout stateids
- NFSv4.2: Change the default KConfig value for READ_PLUS
- Use sysfs_emit() instead of scnprintf() where appropriate"
* tag 'nfs-for-6.2-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (24 commits)
NFSv4.2: Change the default KConfig value for READ_PLUS
NFSv4.x: Fail client initialisation if state manager thread can't run
fs: nfs: sysfs: use sysfs_emit() to instead of scnprintf()
NFS: use sysfs_emit() to instead of scnprintf()
NFS: Allow very small rsize & wsize again
NFSv4.2: Fix up READ_PLUS alignment
NFSv4.2: Set the correct size scratch buffer for decoding READ_PLUS
SUNRPC: Fix missing release socket in rpc_sockname()
xprtrdma: Fix regbuf data not freed in rpcrdma_req_create()
NFS: avoid spurious warning of lost lock that is being unlocked.
nfs: fix possible null-ptr-deref when parsing param
NFSv4: check FMODE_EXEC from open context mode in nfs4_opendata_access()
NFS: make sure open context mode have FMODE_EXEC when file open for exec
NFS4.x/pnfs: Fix up logging of layout stateids
NFS: Fix a race in nfs_call_unlink()
NFS: Fix an Oops in nfs_d_automount()
NFSv4: Fix a deadlock between nfs4_open_recover_helper() and delegreturn
NFSv4: Fix a credential leak in _nfs4_discover_trunking()
NFS: Trigger the "ls -l" readdir heuristic sooner
NFSv4.2: Fix initialisation of struct nfs4_label
...
This release introduces support for the CB_RECALL_ANY operation.
NFSD can send this operation to request that clients return any
delegations they choose. The server uses this operation to handle
low memory scenarios or indicate to a client when that client has
reached the maximum number of delegations the server supports.
The NFSv4.2 READ_PLUS operation has been simplified temporarily
whilst support for sparse files in local filesystems and the VFS is
improved.
Two major data structure fixes appear in this release:
* The nfs4_file hash table is replaced with a resizable hash table
to reduce the latency of NFSv4 OPEN operations.
* Reference counting in the NFSD filecache has been hardened against
races.
In furtherance of removing support for NFSv2 in a subsequent kernel
release, a new Kconfig option enables server-side support for NFSv2
to be left out of a kernel build.
MAINTAINERS has been updated to indicate that changes to fs/exportfs
should go through the NFSD tree.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmOXPE4ACgkQM2qzM29m
f5faaRAAh7YT5V61afPbfgBybO5AbDzztpZSNjNjLZs78piSnFp6hP75yNtTviwQ
1o7St13/NkCmDaIdGUpr02U01zbM1BDOq2wGckImOJLNSgb7xHV5r4PqkRiFkh0t
QYSnwG+wp8fDUJeCL/nAOAu9I9EQUqHzWchxiU/h8ln2hN3rXUlIRSeo17Wy7zkD
cNIcoAjTi9fzY3dE6H4r+lZTdNCYH+AdzChmKrHdRZQwq0Xs3FWv4gAMTLbDuD4P
B6NDHz0Umn6XnFsJGptwozkwaWeMQw4GyJj/3iUiO8JF209SaoYXMPjJAyG6tYYa
fUrgv4UXGeXjigDbLBA5IYxfhX7GXjMQSaj3edhzyrl8P74q4/Cq/8fDUnAZ841m
E+TGSCPIQD0QuIjdXxLv9KLY8JNThSfcAt6jr5GBXhPZQr8xpS0BqK/Onr68fgZC
Lpull5xN68L4A1B7cf2GNPuMyvkBKxwSGXOehldh/BkvpVMjFwqd4/q5xWC+6CcQ
tbOkjTbbSS71nzJwZip0NphaYCa3qQPzKT4SZzn/I4I9W5otbwYBx734Bw46gTDE
ZPUXTuJ00VPgX07wbLRahg521Fwzr+8sk1WnVYq82PoaMh1l9FjzLNGouQWBdo3E
UzIo/KUfQKmoZce6O723L6OI4ffdK5oMtfaTpe+SiUPpV1lUAcA=
=jNlu
-----END PGP SIGNATURE-----
Merge tag 'nfsd-6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Pull nfsd updates from Chuck Lever:
"This release introduces support for the CB_RECALL_ANY operation. NFSD
can send this operation to request that clients return any delegations
they choose. The server uses this operation to handle low memory
scenarios or indicate to a client when that client has reached the
maximum number of delegations the server supports.
The NFSv4.2 READ_PLUS operation has been simplified temporarily whilst
support for sparse files in local filesystems and the VFS is improved.
Two major data structure fixes appear in this release:
- The nfs4_file hash table is replaced with a resizable hash table to
reduce the latency of NFSv4 OPEN operations.
- Reference counting in the NFSD filecache has been hardened against
races.
In furtherance of removing support for NFSv2 in a subsequent kernel
release, a new Kconfig option enables server-side support for NFSv2 to
be left out of a kernel build.
MAINTAINERS has been updated to indicate that changes to fs/exportfs
should go through the NFSD tree"
* tag 'nfsd-6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (49 commits)
NFSD: Avoid clashing function prototypes
SUNRPC: Fix crasher in unwrap_integ_data()
SUNRPC: Make the svc_authenticate tracepoint conditional
NFSD: Use only RQ_DROPME to signal the need to drop a reply
SUNRPC: Clean up xdr_write_pages()
SUNRPC: Don't leak netobj memory when gss_read_proxy_verf() fails
NFSD: add CB_RECALL_ANY tracepoints
NFSD: add delegation reaper to react to low memory condition
NFSD: add support for sending CB_RECALL_ANY
NFSD: refactoring courtesy_client_reaper to a generic low memory shrinker
trace: Relocate event helper files
NFSD: pass range end to vfs_fsync_range() instead of count
lockd: fix file selection in nlmsvc_cancel_blocked
lockd: ensure we use the correct file descriptor when unlocking
lockd: set missing fl_flags field when retrieving args
NFSD: Use struct_size() helper in alloc_session()
nfsd: return error if nfs4_setacl fails
lockd: set other missing fields when unlocking files
NFSD: Add an nfsd_file_fsync tracepoint
sunrpc: svc: Remove an unused static function svc_ungetu32()
...
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCY5bwTgAKCRCRxhvAZXjc
ovd2AQCK00NAtGjQCjQPQGyTa4GAPqvWgq1ef0lnhv+TL5US5gD9FncQ8UofeMXt
pBfjtAD6ettTPCTxUQfnTwWEU4rc7Qg=
=27Wm
-----END PGP SIGNATURE-----
Merge tag 'fs.acl.rework.v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping
Pull VFS acl updates from Christian Brauner:
"This contains the work that builds a dedicated vfs posix acl api.
The origins of this work trace back to v5.19 but it took quite a while
to understand the various filesystem specific implementations in
sufficient detail and also come up with an acceptable solution.
As we discussed and seen multiple times the current state of how posix
acls are handled isn't nice and comes with a lot of problems: The
current way of handling posix acls via the generic xattr api is error
prone, hard to maintain, and type unsafe for the vfs until we call
into the filesystem's dedicated get and set inode operations.
It is already the case that posix acls are special-cased to death all
the way through the vfs. There are an uncounted number of hacks that
operate on the uapi posix acl struct instead of the dedicated vfs
struct posix_acl. And the vfs must be involved in order to interpret
and fixup posix acls before storing them to the backing store, caching
them, reporting them to userspace, or for permission checking.
Currently a range of hacks and duct tape exist to make this work. As
with most things this is really no ones fault it's just something that
happened over time. But the code is hard to understand and difficult
to maintain and one is constantly at risk of introducing bugs and
regressions when having to touch it.
Instead of continuing to hack posix acls through the xattr handlers
this series builds a dedicated posix acl api solely around the get and
set inode operations.
Going forward, the vfs_get_acl(), vfs_remove_acl(), and vfs_set_acl()
helpers must be used in order to interact with posix acls. They
operate directly on the vfs internal struct posix_acl instead of
abusing the uapi posix acl struct as we currently do. In the end this
removes all of the hackiness, makes the codepaths easier to maintain,
and gets us type safety.
This series passes the LTP and xfstests suites without any
regressions. For xfstests the following combinations were tested:
- xfs
- ext4
- btrfs
- overlayfs
- overlayfs on top of idmapped mounts
- orangefs
- (limited) cifs
There's more simplifications for posix acls that we can make in the
future if the basic api has made it.
A few implementation details:
- The series makes sure to retain exactly the same security and
integrity module permission checks. Especially for the integrity
modules this api is a win because right now they convert the uapi
posix acl struct passed to them via a void pointer into the vfs
struct posix_acl format to perform permission checking on the mode.
There's a new dedicated security hook for setting posix acls which
passes the vfs struct posix_acl not a void pointer. Basing checking
on the posix acl stored in the uapi format is really unreliable.
The vfs currently hacks around directly in the uapi struct storing
values that frankly the security and integrity modules can't
correctly interpret as evidenced by bugs we reported and fixed in
this area. It's not necessarily even their fault it's just that the
format we provide to them is sub optimal.
- Some filesystems like 9p and cifs need access to the dentry in
order to get and set posix acls which is why they either only
partially or not even at all implement get and set inode
operations. For example, cifs allows setxattr() and getxattr()
operations but doesn't allow permission checking based on posix
acls because it can't implement a get acl inode operation.
Thus, this patch series updates the set acl inode operation to take
a dentry instead of an inode argument. However, for the get acl
inode operation we can't do this as the old get acl method is
called in e.g., generic_permission() and inode_permission(). These
helpers in turn are called in various filesystem's permission inode
operation. So passing a dentry argument to the old get acl inode
operation would amount to passing a dentry to the permission inode
operation which we shouldn't and probably can't do.
So instead of extending the existing inode operation Christoph
suggested to add a new one. He also requested to ensure that the
get and set acl inode operation taking a dentry are consistently
named. So for this version the old get acl operation is renamed to
->get_inode_acl() and a new ->get_acl() inode operation taking a
dentry is added. With this we can give both 9p and cifs get and set
acl inode operations and in turn remove their complex custom posix
xattr handlers.
In the future I hope to get rid of the inode method duplication but
it isn't like we have never had this situation. Readdir is just one
example. And frankly, the overall gain in type safety and the more
pleasant api wise are simply too big of a benefit to not accept
this duplication for a while.
- We've done a full audit of every codepaths using variant of the
current generic xattr api to get and set posix acls and
surprisingly it isn't that many places. There's of course always a
chance that we might have missed some and if so I'm sure we'll find
them soon enough.
The crucial codepaths to be converted are obviously stacking
filesystems such as ecryptfs and overlayfs.
For a list of all callers currently using generic xattr api helpers
see [2] including comments whether they support posix acls or not.
- The old vfs generic posix acl infrastructure doesn't obey the
create and replace semantics promised on the setxattr(2) manpage.
This patch series doesn't address this. It really is something we
should revisit later though.
The patches are roughly organized as follows:
(1) Change existing set acl inode operation to take a dentry
argument (Intended to be a non-functional change)
(2) Rename existing get acl method (Intended to be a non-functional
change)
(3) Implement get and set acl inode operations for filesystems that
couldn't implement one before because of the missing dentry.
That's mostly 9p and cifs (Intended to be a non-functional
change)
(4) Build posix acl api, i.e., add vfs_get_acl(), vfs_remove_acl(),
and vfs_set_acl() including security and integrity hooks
(Intended to be a non-functional change)
(5) Implement get and set acl inode operations for stacking
filesystems (Intended to be a non-functional change)
(6) Switch posix acl handling in stacking filesystems to new posix
acl api now that all filesystems it can stack upon support it.
(7) Switch vfs to new posix acl api (semantical change)
(8) Remove all now unused helpers
(9) Additional regression fixes reported after we merged this into
linux-next
Thanks to Seth for a lot of good discussion around this and
encouragement and input from Christoph"
* tag 'fs.acl.rework.v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping: (36 commits)
posix_acl: Fix the type of sentinel in get_acl
orangefs: fix mode handling
ovl: call posix_acl_release() after error checking
evm: remove dead code in evm_inode_set_acl()
cifs: check whether acl is valid early
acl: make vfs_posix_acl_to_xattr() static
acl: remove a slew of now unused helpers
9p: use stub posix acl handlers
cifs: use stub posix acl handlers
ovl: use stub posix acl handlers
ecryptfs: use stub posix acl handlers
evm: remove evm_xattr_acl_change()
xattr: use posix acl api
ovl: use posix acl api
ovl: implement set acl method
ovl: implement get acl method
ecryptfs: implement set acl method
ecryptfs: implement get acl method
ksmbd: use vfs_remove_acl()
acl: add vfs_remove_acl()
...
direction misannotations and (hopefully) preventing
more of the same for the future.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-----BEGIN PGP SIGNATURE-----
iHQEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCY5ZzQAAKCRBZ7Krx/gZQ
65RZAP4nTkvOn0NZLVFkuGOx8pgJelXAvrteyAuecVL8V6CR4AD40qCVY51PJp8N
MzwiRTeqnGDxTTF7mgd//IB6hoatAA==
=bcvF
-----END PGP SIGNATURE-----
Merge tag 'pull-iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull iov_iter updates from Al Viro:
"iov_iter work; most of that is about getting rid of direction
misannotations and (hopefully) preventing more of the same for the
future"
* tag 'pull-iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
use less confusing names for iov_iter direction initializers
iov_iter: saner checks for attempt to copy to/from iterator
[xen] fix "direction" argument of iov_iter_kvec()
[vhost] fix 'direction' argument of iov_iter_{init,bvec}()
[target] fix iov_iter_bvec() "direction" argument
[s390] memcpy_real(): WRITE is "data source", not destination...
[s390] zcore: WRITE is "data source", not destination...
[infiniband] READ is "data destination", not source...
[fsi] WRITE is "data source", not destination...
[s390] copy_oldmem_kernel() - WRITE is "data source", not destination
csum_and_copy_to_iter(): handle ITER_DISCARD
get rid of unlikely() on page_copy_sane() calls
Now that we've worked out performance issues and have a server patch
addressing the failed xfstests, we can safely enable this feature by
default.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Steven Rostedt says:
> The include/trace/events/ directory should only hold files that
> are to create events, not headers that hold helper functions.
>
> Can you please move them out of include/trace/events/ as that
> directory is "special" in the creation of events.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Acked-by: Leon Romanovsky <leonro@nvidia.com>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Acked-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
If the state manager thread fails to start, then we should just mark the
client initialisation as failed so that other processes or threads don't
get stuck in nfs_wait_client_init_complete().
Reported-by: ChenXiaoSong <chenxiaosong2@huawei.com>
Fixes: 4697bd5e94 ("NFSv4: Fix a race in the net namespace mount notification")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Follow the advice of the Documentation/filesystems/sysfs.rst and show()
should only use sysfs_emit() or sysfs_emit_at() when formatting the
value to be returned to user space.
Signed-off-by: ye xingchen <ye.xingchen@zte.com.cn>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Follow the advice of the Documentation/filesystems/sysfs.rst and show()
should only use sysfs_emit() or sysfs_emit_at() when formatting the
value to be returned to user space.
Signed-off-by: ye xingchen <ye.xingchen@zte.com.cn>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
940261a195 introduced nfs_io_size() to clamp the iosize to a multiple
of PAGE_SIZE. This had the unintended side effect of no longer allowing
iosizes less than a page, which could be useful in some situations.
UDP already has an exception that causes it to fall back on the
power-of-two style sizes instead. This patch adds an additional
exception for very small iosizes.
Reported-by: Jeff Layton <jlayton@kernel.org>
Fixes: 940261a195 ("NFS: Allow setting rsize / wsize to a multiple of PAGE_SIZE")
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Assume that the first segment will be a DATA segment, and place the data
directly into the xdr pages so it doesn't need to be shifted.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
The scratch_buf array is 16 bytes, but I was passing 32 to the
xdr_set_scratch_buffer() function. Fix this by using sizeof(), which is
what I probably should have been doing this whole time.
Fixes: d3b00a802c ("NFS: Replace the READ_PLUS decoding code")
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>