1311110 Commits

Author SHA1 Message Date
Mike Snitzer
c840b8e1f0 nfs_common: must not hold RCU while calling nfsd_file_put_local
Move the RCU read-side critical section from
nfs_to_nfsd_file_put_local to nfs_to_nfsd_net_put.  It is the call to
nfs_to->nfsd_serv_put that requires RCU protection anyway (the puts for
nfsd_file and netns were combined to avoid an extra indirect reference
but that micro-optimization isn't possible now).

This fixes xfstests generic/013, which was triggering:

"Voluntary context switch within RCU read-side critical section!"

[  143.545738] Call Trace:
[  143.546206]  <TASK>
[  143.546625]  ? show_regs+0x6d/0x80
[  143.547267]  ? __warn+0x91/0x140
[  143.547951]  ? rcu_note_context_switch+0x496/0x5d0
[  143.548856]  ? report_bug+0x193/0x1a0
[  143.549557]  ? handle_bug+0x63/0xa0
[  143.550214]  ? exc_invalid_op+0x1d/0x80
[  143.550938]  ? asm_exc_invalid_op+0x1f/0x30
[  143.551736]  ? rcu_note_context_switch+0x496/0x5d0
[  143.552634]  ? wakeup_preempt+0x62/0x70
[  143.553358]  __schedule+0xaa/0x1380
[  143.554025]  ? _raw_spin_unlock_irqrestore+0x12/0x40
[  143.554958]  ? try_to_wake_up+0x1fe/0x6b0
[  143.555715]  ? wake_up_process+0x19/0x20
[  143.556452]  schedule+0x2e/0x120
[  143.557066]  schedule_preempt_disabled+0x19/0x30
[  143.557933]  rwsem_down_read_slowpath+0x24d/0x4a0
[  143.558818]  ? xfs_efi_item_format+0x50/0xc0 [xfs]
[  143.559894]  down_read+0x4e/0xb0
[  143.560519]  xlog_cil_commit+0x1b2/0xbc0 [xfs]
[  143.561460]  ? _raw_spin_unlock+0x12/0x30
[  143.562212]  ? xfs_inode_item_precommit+0xc7/0x220 [xfs]
[  143.563309]  ? xfs_trans_run_precommits+0x69/0xd0 [xfs]
[  143.564394]  __xfs_trans_commit+0xb5/0x330 [xfs]
[  143.565367]  xfs_trans_roll+0x48/0xc0 [xfs]
[  143.566262]  xfs_defer_trans_roll+0x57/0x100 [xfs]
[  143.567278]  xfs_defer_finish_noroll+0x27a/0x490 [xfs]
[  143.568342]  xfs_defer_finish+0x1a/0x80 [xfs]
[  143.569267]  xfs_bunmapi_range+0x4d/0xb0 [xfs]
[  143.570208]  xfs_itruncate_extents_flags+0x13d/0x230 [xfs]
[  143.571353]  xfs_free_eofblocks+0x12e/0x190 [xfs]
[  143.572359]  xfs_file_release+0x12d/0x140 [xfs]
[  143.573324]  __fput+0xe8/0x2d0
[  143.573922]  __fput_sync+0x1d/0x30
[  143.574574]  nfsd_filp_close+0x33/0x60 [nfsd]
[  143.575430]  nfsd_file_free+0x96/0x150 [nfsd]
[  143.576274]  nfsd_file_put+0xf7/0x1a0 [nfsd]
[  143.577104]  nfsd_file_put_local+0x18/0x30 [nfsd]
[  143.578070]  nfs_close_local_fh+0x101/0x110 [nfs_localio]
[  143.579079]  __put_nfs_open_context+0xc9/0x180 [nfs]
[  143.580031]  nfs_file_clear_open_context+0x4a/0x60 [nfs]
[  143.581038]  nfs_file_release+0x3e/0x60 [nfs]
[  143.581879]  __fput+0xe8/0x2d0
[  143.582464]  __fput_sync+0x1d/0x30
[  143.583108]  __x64_sys_close+0x41/0x80
[  143.583823]  x64_sys_call+0x189a/0x20d0
[  143.584552]  do_syscall_64+0x64/0x170
[  143.585240]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  143.586185] RIP: 0033:0x7f3c5153efd7

Fixes: 65f2a5c366 ("nfs_common: fix race in NFS calls to nfsd_file_put_local() and nfsd_serv_put()")
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Reviewed-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18 20:23:12 -05:00
Al Viro
07442ec85b nfsd: get rid of include ../internal.h
added back in 2015 for the sake of vfs_clone_file_range(),
which is in linux/fs.h these days

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18 20:23:12 -05:00
Yang Erkun
98100e88dd nfsd: fix nfs4_openowner leak when concurrent nfsd4_open occur
A forced umount (umount -f) attempts to kill all rpc_tasks, even though
the umount operation may ultimately fail if some files remain open.
Consequently, a concurrent attempt to open a file can potentially send
two rpc_tasks to the NFS server.

                   NFS CLIENT
thread1                             thread2
open("file")
...
nfs4_do_open
 _nfs4_do_open
  _nfs4_open_and_get_state
   _nfs4_proc_open
    nfs4_run_open_task
     /* rpc_task1 */
     rpc_run_task
     rpc_wait_for_completion_task

                                    umount -f
                                    nfs_umount_begin
                                     rpc_killall_tasks
                                      rpc_signal_task
     rpc_task1 is woken up
     and returns -512
 _nfs4_do_open // while loop
    ...
    nfs4_run_open_task
     /* rpc_task2 */
     rpc_run_task
     rpc_wait_for_completion_task

While processing an open request, nfsd will first attempt to find or
allocate an nfs4_openowner. If it finds an nfs4_openowner that is not
marked as NFS4_OO_CONFIRMED, this nfs4_openowner will be released. Since
two rpc_tasks can attempt to open the same file simultaneously from the
client to the server, and because two instances of nfsd can run
concurrently, this situation can lead to substantial memory leaks.
Additionally, when we echo 0 to /proc/fs/nfsd/threads, a warning will be
triggered.

                    NFS SERVER
nfsd1                  nfsd2       echo 0 > /proc/fs/nfsd/threads

nfsd4_open
 nfsd4_process_open1
  find_or_alloc_open_stateowner
   // alloc oo1, stateid1
                       nfsd4_open
                        nfsd4_process_open1
                        find_or_alloc_open_stateowner
                        // find oo1, without NFS4_OO_CONFIRMED
                         release_openowner
                          unhash_openowner_locked
                          list_del_init(&oo->oo_perclient)
                          // cannot find this oo
                          // from client, LEAK!!!
                         alloc_stateowner // alloc oo2

 nfsd4_process_open2
  init_open_stateid
  // associate oo1
  // with stateid1, stateid1 LEAK!!!
  nfs4_get_vfs_file
  // alloc nfsd_file1 and nfsd_file_mark1
  // all LEAK!!!

                         nfsd4_process_open2
                         ...

                                    write_threads
                                     ...
                                     nfsd_destroy_serv
                                      nfsd_shutdown_net
                                       nfs4_state_shutdown_net
                                        nfs4_state_destroy_net
                                         destroy_client
                                          __destroy_client
                                          // won't find oo1!!!
                                     nfsd_shutdown_generic
                                      nfsd_file_cache_shutdown
                                       kmem_cache_destroy
                                       for nfsd_file_slab
                                       and nfsd_file_mark_slab
                                       // bark since nfsd_file1
                                       // and nfsd_file_mark1
                                       // still alive

=======================================================================
BUG nfsd_file (Not tainted): Objects remaining in nfsd_file on
__kmem_cache_shutdown()
-----------------------------------------------------------------------

Slab 0xffd4000004438a80 objects=34 used=1 fp=0xff11000110e2ad28
flags=0x17ffffc0000240(workingset|head|node=0|zone=2|lastcpupid=0x1fffff)
CPU: 4 UID: 0 PID: 757 Comm: sh Not tainted 6.12.0-rc6+ #19
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
1.16.1-2.fc37 04/01/2014
Call Trace:
 <TASK>
 dump_stack_lvl+0x53/0x70
 slab_err+0xb0/0xf0
 __kmem_cache_shutdown+0x15c/0x310
 kmem_cache_destroy+0x66/0x160
 nfsd_file_cache_shutdown+0xac/0x210 [nfsd]
 nfsd_destroy_serv+0x251/0x2a0 [nfsd]
 nfsd_svc+0x125/0x1e0 [nfsd]
 write_threads+0x16a/0x2a0 [nfsd]
 nfsctl_transaction_write+0x74/0xa0 [nfsd]
 vfs_write+0x1ae/0x6d0
 ksys_write+0xc1/0x160
 do_syscall_64+0x5f/0x170
 entry_SYSCALL_64_after_hwframe+0x76/0x7e

Disabling lock debugging due to kernel taint
Object 0xff11000110e2ac38 @offset=3128
Allocated in nfsd_file_do_acquire+0x20f/0xa30 [nfsd] age=1635 cpu=3
pid=800
 nfsd_file_do_acquire+0x20f/0xa30 [nfsd]
 nfsd_file_acquire_opened+0x5f/0x90 [nfsd]
 nfs4_get_vfs_file+0x4c9/0x570 [nfsd]
 nfsd4_process_open2+0x713/0x1070 [nfsd]
 nfsd4_open+0x74b/0x8b0 [nfsd]
 nfsd4_proc_compound+0x70b/0xc20 [nfsd]
 nfsd_dispatch+0x1b4/0x3a0 [nfsd]
 svc_process_common+0x5b8/0xc50 [sunrpc]
 svc_process+0x2ab/0x3b0 [sunrpc]
 svc_handle_xprt+0x681/0xa20 [sunrpc]
 nfsd+0x183/0x220 [nfsd]
 kthread+0x199/0x1e0
 ret_from_fork+0x31/0x60
 ret_from_fork_asm+0x1a/0x30

Add nfs4_openowner_unhashed to help find an unhashed nfs4_openowner, and
break out of nfsd4_open processing to fix this problem.

Cc: stable@vger.kernel.org # v5.4+
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Yang Erkun <yangerkun@huawei.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18 20:23:12 -05:00
Chuck Lever
aa0ebd21df NFSD: Add nfsd4_copy time-to-live
Keep async copy state alive for a few lease cycles after the copy
completes so that OFFLOAD_STATUS returns something meaningful.

This means that NFSD's client shutdown processing needs to purge
any of this state that happens to be waiting to die.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18 20:23:11 -05:00
Chuck Lever
ac0514f4d1 NFSD: Add a laundromat reaper for async copy state
RFC 7862 Section 4.8 states:

> A copy offload stateid will be valid until either (A) the client
> or server restarts or (B) the client returns the resource by
> issuing an OFFLOAD_CANCEL operation or the client replies to a
> CB_OFFLOAD operation.

Instead of releasing async copy state when the CB_OFFLOAD callback
completes, now let it live until the next laundromat run after the
callback completes.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18 20:23:11 -05:00
Chuck Lever
b44ffa4c4f NFSD: Block DESTROY_CLIENTID only when there are ongoing async COPY operations
Currently __destroy_client() consults the nfs4_client's async_copies
list to determine whether there are ongoing async COPY operations.
However, NFSD now keeps copy state in that list even when the
async copy has completed, to enable OFFLOAD_STATUS to find the
COPY results for a while after the COPY has completed.

DESTROY_CLIENTID should not be blocked if the client's async_copies
list contains state for only completed copy operations.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18 20:23:10 -05:00
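The distinction drawn in b44ffa4c4f above (a list that holds both ongoing and completed copies) can be pictured with a small userspace sketch. This is not NFSD's code: `struct copy_state`, `mark_completed()` and `has_ongoing_copies()` are hypothetical names, and locking is left out. It only shows a check that looks at each entry's state rather than at whether the list is empty.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for per-copy state kept on a client's list. */
struct copy_state {
	struct copy_state *next;
	bool completed;
};

/* Record that a copy has finished; its state stays on the list. */
void mark_completed(struct copy_state *copy)
{
	assert(copy != NULL);
	copy->completed = true;
}

/*
 * Return true only if at least one copy on the list is still running.
 * A list that holds nothing but completed copies does not count.
 */
bool has_ongoing_copies(const struct copy_state *head)
{
	const struct copy_state *copy;

	for (copy = head; copy != NULL; copy = copy->next) {
		if (!copy->completed)
			return true;
	}
	return false;
}
```

With a check of this shape, lingering state for finished copies no longer keeps the caller waiting, which is the behavior the message asks for.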
Chuck Lever
5c41f32147 NFSD: Handle an NFS4ERR_DELAY response to CB_OFFLOAD
RFC 7862 permits callback services to respond to CB_OFFLOAD with
NFS4ERR_DELAY. Currently NFSD drops the CB_OFFLOAD in that case.

To improve the reliability of COPY offload, NFSD should rather send
another CB_OFFLOAD completion notification.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18 20:23:10 -05:00
Chuck Lever
409d6f52bd NFSD: Free async copy information in nfsd4_cb_offload_release()
RFC 7862 Section 4.8 states:

> A copy offload stateid will be valid until either (A) the client
> or server restarts or (B) the client returns the resource by
> issuing an OFFLOAD_CANCEL operation or the client replies to a
> CB_OFFLOAD operation.

Currently, NFSD purges the metadata for an async COPY operation as
soon as the CB_OFFLOAD callback has been sent. It does not wait even
for the client's CB_OFFLOAD response, as the paragraph above
suggests that it should.

This makes the OFFLOAD_STATUS operation ineffective during the
window between the completion of an asynchronous COPY and the
server's receipt of the corresponding CB_OFFLOAD response. This is
important if, for example, the client responds with NFS4ERR_DELAY,
or the transport is lost before the server receives the response. A
client might use OFFLOAD_STATUS to query the server about the still
pending asynchronous COPY, but NFSD will respond to OFFLOAD_STATUS
as if it had never heard of the presented copy stateid.

This patch starts to address this issue by extending the lifetime of
struct nfsd4_copy at least until the server has seen the client's
CB_OFFLOAD response, or the CB_OFFLOAD has timed out.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18 20:23:10 -05:00
Chuck Lever
62a8642ba0 NFSD: Fix nfsd4_shutdown_copy()
nfsd4_shutdown_copy() is just this:

	while ((copy = nfsd4_get_copy(clp)) != NULL)
		nfsd4_stop_copy(copy);

nfsd4_get_copy() bumps @copy's reference count, preventing
nfsd4_stop_copy() from releasing @copy.

A while loop like this usually works by removing the first element
of the list, but neither nfsd4_get_copy() nor nfsd4_stop_copy()
alters the async_copies list.

As best I can tell, then, nfsd4_shutdown_copy() continues to
loop until other threads manage to remove all the items from this
list. The spinning loop blocks shutdown until these items are gone.

Possibly the reason we haven't seen this issue in the field is
that client_has_state() prevents __destroy_client() from calling
nfsd4_shutdown_copy() if there are any items on this list. In a
subsequent patch I plan to remove that restriction.

Fixes: e0639dc580 ("NFSD introduce async copy feature")
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18 20:23:09 -05:00
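The looping problem described in 62a8642ba0 above can be illustrated with a small userspace sketch. It is not the NFSD fix itself: `struct copy_item`, `struct client`, `add_item()`, `unlink_first()`, `put_item()` and `drain_all()` are hypothetical names, and locking is omitted. It shows the usual shape of a drain loop that terminates, where every iteration unlinks the head of the list before dropping it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for an async copy tracked on a per-client list. */
struct copy_item {
	struct copy_item *next;
	int refcount;
};

struct client {
	struct copy_item *async_copies;	/* singly linked list head */
};

/* Push a new item; the list owns its single reference. */
struct copy_item *add_item(struct client *clp)
{
	struct copy_item *item = calloc(1, sizeof(*item));

	if (item == NULL)
		return NULL;
	item->refcount = 1;
	item->next = clp->async_copies;
	clp->async_copies = item;
	return item;
}

/*
 * Detach the first item and hand the list's reference to the caller.
 * Because the item is unlinked here, a loop built on this helper
 * makes progress on every iteration.
 */
struct copy_item *unlink_first(struct client *clp)
{
	struct copy_item *item = clp->async_copies;

	if (item != NULL) {
		clp->async_copies = item->next;
		item->next = NULL;
	}
	return item;
}

/* Drop one reference; free the item when the last one goes away. */
void put_item(struct copy_item *item)
{
	assert(item->refcount > 0);
	if (--item->refcount == 0)
		free(item);
}

/* Drain the list; returns the number of items processed. */
int drain_all(struct client *clp)
{
	struct copy_item *item;
	int n = 0;

	while ((item = unlink_first(clp)) != NULL) {
		put_item(item);
		n++;
	}
	return n;
}
```

A loop that only took a reference on the first item, without unlinking it, would see the same head on every pass, which is the spinning behavior the message describes.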
Chuck Lever
a4452e661b NFSD: Add a tracepoint to record canceled async COPY operations
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18 20:23:09 -05:00
Jeff Layton
10c93b5101 nfsd: make nfsd4_session->se_flags a bool
While this holds the flags from the CREATE_SESSION request, nothing
ever consults them. The only flag used is NFS4_SESSION_DEAD. Make it a
simple bool instead.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18 20:23:09 -05:00
Jeff Layton
53f9ba78e0 nfsd: remove nfsd4_session->se_bchannel
This field is written and is never consulted again. Remove it.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18 20:23:08 -05:00
NeilBrown
6a404f475f nfsd: make use of warning provided by refcount_t
refcount_t, by design, checks for unwanted situations and provides
warnings.  It is rarely useful to have explicit warnings with refcount
usage.

In this case we have an explicit warning if a refcount_t reaches zero
when decremented.  Simply using refcount_dec() will provide a similar
warning and also mark the refcount_t as saturated to avoid any possible
use-after-free.

This patch drops the warning and uses refcount_dec() instead of
refcount_dec_and_test().

Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18 20:23:08 -05:00
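The behavior that 6a404f475f above relies on can be modeled in userspace. The sketch below is a simplified model, not the kernel's refcount_t: `struct model_ref`, `model_dec()`, `model_dec_and_test()`, `MODEL_SATURATED` and the `model_warnings` counter are all hypothetical, and memory-ordering details are ignored. It shows a plain decrement that warns and saturates when it unexpectedly reaches zero, next to a decrement-and-test that reports the last reference to the caller.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical sentinel: far from zero so later inc/dec cannot reach 0. */
#define MODEL_SATURATED (INT_MIN / 2)

struct model_ref {
	atomic_int refs;
};

/* Counts warnings so callers (and tests) can observe them. */
int model_warnings;

void model_ref_init(struct model_ref *r, int initial)
{
	assert(initial > 0);
	atomic_init(&r->refs, initial);
}

/* For callers that may hold the last reference: true means "free it". */
bool model_dec_and_test(struct model_ref *r)
{
	return atomic_fetch_sub(&r->refs, 1) == 1;
}

/*
 * For callers that believe they do NOT hold the last reference.
 * Reaching zero here is a bug, so warn and pin the counter at a
 * saturated value instead of letting the object be reused or freed.
 */
void model_dec(struct model_ref *r)
{
	int old = atomic_fetch_sub(&r->refs, 1);

	if (old <= 1) {
		model_warnings++;
		fprintf(stderr, "model_ref: decrement hit 0\n");
		atomic_store(&r->refs, MODEL_SATURATED);
	}
}
```

In this model the explicit "did it reach zero?" warning at the call site becomes redundant, because `model_dec()` already reports the condition.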
NeilBrown
a2c0412c05 nfsd: Don't fail OP_SETCLIENTID when there are too many clients.
Failing OP_SETCLIENTID or OP_EXCHANGE_ID should only happen if there is
a memory allocation failure.  Putting a hard limit on the number of
clients is not really helpful as it will either happen too early and
prevent clients that the server can easily handle, or too late and
allow clients when the server is swamped.

The calculated limit is still useful for expiring courtesy clients where
there are "too many" clients, but it shouldn't prevent the creation of
active clients.

Testing of lots of clients against small-mem servers reports repeated
NFS4ERR_DELAY responses which doesn't seem helpful.  There may have been
reports of similar problems in production use.

Also remove an outdated comment - we do use a slab cache.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18 20:23:07 -05:00
Ye Bin
ce89e742a4 svcrdma: fix miss destroy percpu_counter in svc_rdma_proc_init()
There's an issue as follows:
RPC: Registered rdma transport module.
RPC: Registered rdma backchannel transport module.
RPC: Unregistered rdma transport module.
RPC: Unregistered rdma backchannel transport module.
BUG: unable to handle page fault for address: fffffbfff80c609a
PGD 123fee067 P4D 123fee067 PUD 123fea067 PMD 10c624067 PTE 0
Oops: Oops: 0000 [#1] PREEMPT SMP KASAN NOPTI
RIP: 0010:percpu_counter_destroy_many+0xf7/0x2a0
Call Trace:
 <TASK>
 __die+0x1f/0x70
 page_fault_oops+0x2cd/0x860
 spurious_kernel_fault+0x36/0x450
 do_kern_addr_fault+0xca/0x100
 exc_page_fault+0x128/0x150
 asm_exc_page_fault+0x26/0x30
 percpu_counter_destroy_many+0xf7/0x2a0
 mmdrop+0x209/0x350
 finish_task_switch.isra.0+0x481/0x840
 schedule_tail+0xe/0xd0
 ret_from_fork+0x23/0x80
 ret_from_fork_asm+0x1a/0x30
 </TASK>

If register_sysctl() returns NULL, then svc_rdma_proc_cleanup() will not
destroy the percpu counters that were initialized in
svc_rdma_proc_init().
If CONFIG_HOTPLUG_CPU is enabled, residual nodes may remain in the
'percpu_counters' list. The above issue may occur once the module is
removed. If CONFIG_HOTPLUG_CPU is not enabled, memory is leaked.
To solve the above issue, destroy all percpu counters when
register_sysctl() returns NULL.

Fixes: 1e7e557316 ("svcrdma: Restore read and write stats")
Fixes: 22df5a2246 ("svcrdma: Convert rdma_stat_sq_starve to a per-CPU counter")
Fixes: df971cd853 ("svcrdma: Convert rdma_stat_recv to a per-CPU counter")
Signed-off-by: Ye Bin <yebin10@huawei.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18 20:23:07 -05:00
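The error path discussed in ce89e742a4 above follows a common pattern: when a late initialization step fails, everything set up earlier must be torn down before returning. The userspace sketch below illustrates that pattern only. `struct counter`, `counter_init()`, `counter_destroy()`, `register_table()` and `stats_init()` are hypothetical stand-ins, and `live_counters` exists purely so the cleanup can be observed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a counter that owns allocated storage. */
struct counter {
	long *slots;
};

/* Number of counters initialized and not yet destroyed. */
int live_counters;

int counter_init(struct counter *c)
{
	c->slots = calloc(4, sizeof(*c->slots));
	if (c->slots == NULL)
		return -1;
	live_counters++;
	return 0;
}

void counter_destroy(struct counter *c)
{
	assert(c->slots != NULL);
	free(c->slots);
	c->slots = NULL;
	live_counters--;
}

struct stats {
	struct counter read, recv, write;
};

/* Hypothetical registration step that can be made to fail. */
bool register_table(bool should_fail)
{
	return !should_fail;
}

/* Initialize all counters, then register; unwind fully on any failure. */
int stats_init(struct stats *s, bool fail_register)
{
	if (counter_init(&s->read))
		goto err;
	if (counter_init(&s->recv))
		goto err_read;
	if (counter_init(&s->write))
		goto err_recv;
	if (!register_table(fail_register))
		goto err_write;	/* the case the commit message is about */
	return 0;

err_write:
	counter_destroy(&s->write);
err_recv:
	counter_destroy(&s->recv);
err_read:
	counter_destroy(&s->read);
err:
	return -1;
}
```

The labels unwind in reverse order of setup, so a failure at any step releases exactly what was acquired before it.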
Chuck Lever
573954a996 xdrgen: Remove program_stat_to_errno() call sites
Refactor: Translating an on-the-wire value to a local host errno is
architecturally a job for the proc function, not the XDR decoder.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18 20:23:07 -05:00
Chuck Lever
903a7d37d9 xdrgen: Update the files included in client-side source code
In particular, client-side source code needs the definition of
"struct rpc_procinfo" and does not want header files that pull
in "struct svc_rqst". Otherwise, the source does not compile.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18 20:23:06 -05:00
Chuck Lever
82c2a36179 xdrgen: Remove check for "nfs_ok" in C templates
Obviously, "nfs_ok" is defined only for NFS protocols. Other XDR
protocols won't know "nfs_ok" from Adam.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18 20:23:06 -05:00
Chuck Lever
07decac0ac xdrgen: Remove tracepoint call site
This tracepoint was a "note to self" and is not operational. It is
added only to client-side code, which so far we haven't needed. It
will cause immediate breakage once we start generating client code,
though, so remove it now.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18 20:23:06 -05:00
Yang Erkun
f8c989a0c8 nfsd: release svc_expkey/svc_export with rcu_work
The last reference for `cache_head` can be reduced to zero in `c_show`
and `e_show` (using `rcu_read_lock` and `rcu_read_unlock`). Consequently,
`svc_export_put` and `expkey_put` will be invoked, leading to two
issues:

1. The `svc_export_put` will directly free ex_uuid. However,
   `e_show`/`c_show` will access `ex_uuid` after `cache_put`, which can
   trigger a use-after-free issue, shown below.

   ==================================================================
   BUG: KASAN: slab-use-after-free in svc_export_show+0x362/0x430 [nfsd]
   Read of size 1 at addr ff11000010fdc120 by task cat/870

   CPU: 1 UID: 0 PID: 870 Comm: cat Not tainted 6.12.0-rc3+ #1
   Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
   1.16.1-2.fc37 04/01/2014
   Call Trace:
    <TASK>
    dump_stack_lvl+0x53/0x70
    print_address_description.constprop.0+0x2c/0x3a0
    print_report+0xb9/0x280
    kasan_report+0xae/0xe0
    svc_export_show+0x362/0x430 [nfsd]
    c_show+0x161/0x390 [sunrpc]
    seq_read_iter+0x589/0x770
    seq_read+0x1e5/0x270
    proc_reg_read+0xe1/0x140
    vfs_read+0x125/0x530
    ksys_read+0xc1/0x160
    do_syscall_64+0x5f/0x170
    entry_SYSCALL_64_after_hwframe+0x76/0x7e

   Allocated by task 830:
    kasan_save_stack+0x20/0x40
    kasan_save_track+0x14/0x30
    __kasan_kmalloc+0x8f/0xa0
    __kmalloc_node_track_caller_noprof+0x1bc/0x400
    kmemdup_noprof+0x22/0x50
    svc_export_parse+0x8a9/0xb80 [nfsd]
    cache_do_downcall+0x71/0xa0 [sunrpc]
    cache_write_procfs+0x8e/0xd0 [sunrpc]
    proc_reg_write+0xe1/0x140
    vfs_write+0x1a5/0x6d0
    ksys_write+0xc1/0x160
    do_syscall_64+0x5f/0x170
    entry_SYSCALL_64_after_hwframe+0x76/0x7e

   Freed by task 868:
    kasan_save_stack+0x20/0x40
    kasan_save_track+0x14/0x30
    kasan_save_free_info+0x3b/0x60
    __kasan_slab_free+0x37/0x50
    kfree+0xf3/0x3e0
    svc_export_put+0x87/0xb0 [nfsd]
    cache_purge+0x17f/0x1f0 [sunrpc]
    nfsd_destroy_serv+0x226/0x2d0 [nfsd]
    nfsd_svc+0x125/0x1e0 [nfsd]
    write_threads+0x16a/0x2a0 [nfsd]
    nfsctl_transaction_write+0x74/0xa0 [nfsd]
    vfs_write+0x1a5/0x6d0
    ksys_write+0xc1/0x160
    do_syscall_64+0x5f/0x170
    entry_SYSCALL_64_after_hwframe+0x76/0x7e

2. We cannot sleep while using `rcu_read_lock`/`rcu_read_unlock`.
   However, `svc_export_put`/`expkey_put` will call path_put, which
   subsequently triggers a sleeping operation due to the following
   `dput`.

   =============================
   WARNING: suspicious RCU usage
   5.10.0-dirty #141 Not tainted
   -----------------------------
   ...
   Call Trace:
   dump_stack+0x9a/0xd0
   ___might_sleep+0x231/0x240
   dput+0x39/0x600
   path_put+0x1b/0x30
   svc_export_put+0x17/0x80
   e_show+0x1c9/0x200
   seq_read_iter+0x63f/0x7c0
   seq_read+0x226/0x2d0
   vfs_read+0x113/0x2c0
   ksys_read+0xc9/0x170
   do_syscall_64+0x33/0x40
   entry_SYSCALL_64_after_hwframe+0x67/0xd1

Fix these issues by using `rcu_work` to help release
`svc_expkey`/`svc_export`. This approach allows for an asynchronous
context to invoke `path_put` and also facilitates the freeing of
`uuid/exp/key` after an RCU grace period.

Fixes: 9ceddd9da1 ("knfsd: Allow lockless lookups of the exports")
Signed-off-by: Yang Erkun <yangerkun@huawei.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18 20:23:05 -05:00
Yang Erkun
2862eee078 SUNRPC: make sure cache entry active before cache_show
The function `c_show` was called with protection from RCU. This only
ensures that `cp` will not be freed. Therefore, the reference count for
`cp` can drop to zero, which will trigger a refcount use-after-free
warning when `cache_get` is called. To resolve this issue, use
`cache_get_rcu` to ensure that `cp` remains active.

------------[ cut here ]------------
refcount_t: addition on 0; use-after-free.
WARNING: CPU: 7 PID: 822 at lib/refcount.c:25
refcount_warn_saturate+0xb1/0x120
CPU: 7 UID: 0 PID: 822 Comm: cat Not tainted 6.12.0-rc3+ #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
1.16.1-2.fc37 04/01/2014
RIP: 0010:refcount_warn_saturate+0xb1/0x120

Call Trace:
 <TASK>
 c_show+0x2fc/0x380 [sunrpc]
 seq_read_iter+0x589/0x770
 seq_read+0x1e5/0x270
 proc_reg_read+0xe1/0x140
 vfs_read+0x125/0x530
 ksys_read+0xc1/0x160
 do_syscall_64+0x5f/0x170
 entry_SYSCALL_64_after_hwframe+0x76/0x7e

Cc: stable@vger.kernel.org # v4.20+
Signed-off-by: Yang Erkun <yangerkun@huawei.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18 20:23:05 -05:00
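The race described in 2862eee078 above (an object that RCU keeps from being freed, but whose reference count may already be zero) is commonly handled with a "get unless zero" operation. The userspace sketch below shows that general technique with C11 atomics. `struct entry`, `entry_get_unless_zero()` and `entry_put()` are hypothetical names, and the message does not say that `cache_get_rcu` is implemented exactly this way.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical reference-counted object. */
struct entry {
	atomic_int refs;
};

/*
 * Take a reference only if someone else still holds one.
 * Returns false when the count is already zero, meaning the object
 * is on its way to being freed and must not be revived.
 */
bool entry_get_unless_zero(struct entry *e)
{
	int old = atomic_load(&e->refs);

	while (old != 0) {
		/* On failure, 'old' is refreshed with the current value. */
		if (atomic_compare_exchange_weak(&e->refs, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference; returns true if this was the last one. */
bool entry_put(struct entry *e)
{
	int old = atomic_fetch_sub(&e->refs, 1);

	assert(old > 0);
	return old == 1;
}
```

A reader that gets `false` back simply skips the entry instead of incrementing a dead counter, which is the situation the refcount warning in the trace flags.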
Yang Erkun
be8f982c36 nfsd: make sure exp active before svc_export_show
The function `e_show` was called with protection from RCU. This only
ensures that `exp` will not be freed. Therefore, the reference count for
`exp` can drop to zero, which will trigger a refcount use-after-free
warning when `exp_get` is called. To resolve this issue, use
`cache_get_rcu` to ensure that `exp` remains active.

------------[ cut here ]------------
refcount_t: addition on 0; use-after-free.
WARNING: CPU: 3 PID: 819 at lib/refcount.c:25
refcount_warn_saturate+0xb1/0x120
CPU: 3 UID: 0 PID: 819 Comm: cat Not tainted 6.12.0-rc3+ #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
1.16.1-2.fc37 04/01/2014
RIP: 0010:refcount_warn_saturate+0xb1/0x120
...
Call Trace:
 <TASK>
 e_show+0x20b/0x230 [nfsd]
 seq_read_iter+0x589/0x770
 seq_read+0x1e5/0x270
 vfs_read+0x125/0x530
 ksys_read+0xc1/0x160
 do_syscall_64+0x5f/0x170
 entry_SYSCALL_64_after_hwframe+0x76/0x7e

Fixes: bf18f163e8 ("NFSD: Using exp_get for export getting")
Cc: stable@vger.kernel.org # 4.20+
Signed-off-by: Yang Erkun <yangerkun@huawei.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18 20:23:05 -05:00
Chuck Lever
9189d23b83 lockd: Remove unneeded initialization of file_lock::c.flc_flags
Since commit 75c7940d2a ("lockd: set missing fl_flags field when
retrieving args"), nlmsvc_retrieve_args() initializes the flc_flags
field. svcxdr_decode_lock() no longer needs to do this.

This clean up removes one dependency on the nlm_lock::fl field. No
behavior change is expected.

Analysis:

svcxdr_decode_lock() is called by:

nlm4svc_decode_testargs()
nlm4svc_decode_lockargs()
nlm4svc_decode_cancargs()
nlm4svc_decode_unlockargs()

nlm4svc_decode_testargs() is used by:
- NLMPROC4_TEST and NLMPROC4_TEST_MSG, which call nlmsvc_retrieve_args()
- NLMPROC4_GRANTED and NLMPROC4_GRANTED_MSG, which don't pass the
  lock's file_lock to the generic lock API

nlm4svc_decode_lockargs() is used by:
- NLMPROC4_LOCK and NLMPROC4_LOCK_MSG, which call nlmsvc_retrieve_args()
- NLMPROC4_UNLOCK and NLMPROC4_UNLOCK_MSG, which call nlmsvc_retrieve_args()
- NLMPROC4_NM_LOCK, which calls nlmsvc_retrieve_args()

nlm4svc_decode_cancargs() is used by:
- NLMPROC4_CANCEL and NLMPROC4_CANCEL_MSG, which call nlmsvc_retrieve_args()

nlm4svc_decode_unlockargs() is used by:
- NLMPROC4_UNLOCK and NLMPROC4_UNLOCK_MSG, which call nlmsvc_retrieve_args()

All callers except GRANTED/GRANTED_MSG eventually call
nlmsvc_retrieve_args() before using nlm_lock::fl.c.flc_flags. Thus
this change is safe.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18 20:23:04 -05:00
Chuck Lever
8994a512e2 lockd: Remove unused parameter to nlmsvc_testlock()
The nlm_cookie parameter has been unused since commit 09802fd2a8
("lockd: rip out deferred lock handling from testlock codepath").

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18 20:23:04 -05:00
Chuck Lever
a872c7313e lockd: Remove some snippets of unfinished code
Clean up.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18 20:23:04 -05:00
Chuck Lever
e594884128 lockd: Remove unnecessary memset()
Since commit 103cc1fafe ("SUNRPC: Parametrize how much of argsize
should be zeroed") (and possibly long before that, even) all of the
memory underlying rqstp->rq_argp is zeroed already. There's no need
for the memset() in nlm4svc_decode_shareargs().

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18 20:23:03 -05:00
Chuck Lever
2f746e40e9 lockd: Remove unused typedef
Clean up: Looks like the last usage of this typedef was removed by
commit 026fec7e7c ("sunrpc: properly type pc_decode callbacks") in
2017.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18 20:23:03 -05:00
Chuck Lever
f64ea4af43 NFSD: Cap the number of bytes copied by nfs4_reset_recoverydir()
Its only current caller already length-checks the string, but let's
be safe.

Fixes: 0964a3d3f1 ("[PATCH] knfsd: nfsd4 reboot dirname fix")
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18 20:23:02 -05:00
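The message for f64ea4af43 above does not say how the copy is capped, so the following is only a generic userspace sketch of a length-capped string copy. `bounded_copy()` is a hypothetical helper, not a kernel function. It never writes more than the destination can hold, always NUL-terminates, and tells the caller when the source did not fit.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Copy 'src' into 'dst', writing at most 'dst_size' bytes including
 * the terminating NUL. Returns 0 if the whole string fit, or -1 if
 * it had to be truncated.
 */
int bounded_copy(char *dst, size_t dst_size, const char *src)
{
	size_t i;

	assert(dst_size > 0);

	/* Leave room for the terminator: stop at dst_size - 1 bytes. */
	for (i = 0; i + 1 < dst_size && src[i] != '\0'; i++)
		dst[i] = src[i];
	dst[i] = '\0';

	/* If we stopped before the end of 'src', the copy was cut short. */
	return src[i] == '\0' ? 0 : -1;
}
```

Even when every current caller validates the length first, a cap inside the copy routine keeps a future caller from overrunning the buffer.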
Chuck Lever
30c1d2411a NFSD: Remove unused values from nfsd4_encode_components_esc()
Clean up. The computed value of @p is saved each time through the
loop but is never used.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18 20:23:02 -05:00
Chuck Lever
6b9c1080a6 NFSD: Remove unused results in nfsd4_encode_pathname4()
Clean up. The result of "*p++" is saved, but is not used before it
is overwritten. The result of xdr_encode_opaque() is saved each
time through the loop but is never used.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18 20:23:02 -05:00
Chuck Lever
1e02c641c3 NFSD: Prevent NULL dereference in nfsd4_process_cb_update()
@ses is initialized to NULL. If __nfsd4_find_backchannel() finds no
available backchannel session, setup_callback_client() will try to
dereference @ses and segfault.

Fixes: dcbeaa68db ("nfsd4: allow backchannel recovery")
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18 20:23:01 -05:00
Chuck Lever
da4f777e62 NFSD: Remove a never-true comparison
fh_size is an unsigned int, thus it can never be less than 0.

Fixes: d8b26071e6 ("NFSD: simplify struct nfsfh")
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18 20:23:01 -05:00
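The point made in da4f777e62 above, that an unsigned value can never be less than zero, can be shown with a short userspace sketch. `MODEL_MAX_SIZE` and `size_is_valid()` are hypothetical and are not taken from the patch. The sketch shows why only an upper-bound check is meaningful for an unsigned size: a negative input converts to a very large value, which the upper bound already rejects.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Hypothetical upper limit for a size field. */
#define MODEL_MAX_SIZE 128u

/* Converting -1 to unsigned int wraps to the largest unsigned value. */
static_assert((unsigned int)-1 == UINT_MAX,
	      "negative values wrap when converted to unsigned");

/*
 * Validate an unsigned size. There is no "size < 0" branch because
 * that condition cannot be true for an unsigned type.
 */
bool size_is_valid(unsigned int size)
{
	return size <= MODEL_MAX_SIZE;
}
```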
Chuck Lever
d08bf5ea64 NFSD: Remove dead code in nfsd4_create_session()
Clean up. AFAICT, there is no way to reach the out_free_conn label
with @old set to a non-NULL value, so the expire_client(old) call
is never reached and can be removed.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18 20:23:01 -05:00
NeilBrown
4cc9b9f2bf nfsd: refine and rename NFSD_MAY_LOCK
NFSD_MAY_LOCK means a few different things.
- it means that GSS is not required.
- it means that with NFSEXP_NOAUTHNLM, authentication is not required
- it means that OWNER_OVERRIDE is allowed.

None of these are specific to locking; they are specific to the NLM
protocol.
So:
 - rename to NFSD_MAY_NLM
 - set NFSD_MAY_OWNER_OVERRIDE and NFSD_MAY_BYPASS_GSS in nlm_fopen()
   so that NFSD_MAY_NLM doesn't need to imply these.
 - move the test on NFSEXP_NOAUTHNLM out of nfsd_permission() and
   into fh_verify where other special-case tests on the MAY flags
   happen.  nfsd_permission() can be called from other places than
   fh_verify(), but none of these will have NFSD_MAY_NLM.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18 20:23:00 -05:00
Chuck Lever
6640556b0c NFSD: Replace use of NFSD_MAY_LOCK in nfsd4_lock()
NFSv4 LOCK operations should not avoid the set of authorization
checks that apply to all other NFSv4 operations. Also, the
"no_auth_nlm" export option should apply only to NLM LOCK requests.
It's not necessary or sensible to apply it to NFSv4 LOCK operations.

Instead, set no permission bits when calling fh_verify(). Subsequent
stateid processing handles authorization checks.

Reported-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18 20:23:00 -05:00
Julia Lawall
ed9887b876 nfsd: replace call_rcu by kfree_rcu for simple kmem_cache_free callback
Since SLOB was removed and since
commit 6c6c47b063 ("mm, slab: call kvfree_rcu_barrier() from kmem_cache_destroy()"),
it is not necessary to use call_rcu when the callback only performs
kmem_cache_free. Use kfree_rcu() directly.

The changes were made using Coccinelle.

Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18 20:23:00 -05:00
Chuck Lever
a32442f6ca xdrgen: Add a utility for extracting XDR from RFCs
For convenience, copy the XDR extraction script from RFC

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18 20:22:59 -05:00
Pali Rohár
bb4f07f240 nfsd: Fix NFSD_MAY_BYPASS_GSS and NFSD_MAY_BYPASS_GSS_ON_ROOT
Currently NFSD_MAY_BYPASS_GSS and NFSD_MAY_BYPASS_GSS_ON_ROOT do not bypass
only GSS; they bypass any authentication method. This is a problem
especially for NFS3 AUTH_NULL-only exports.

The purpose of NFSD_MAY_BYPASS_GSS_ON_ROOT is described in RFC 2623,
section 2.3.2: to allow mounting an NFS2/3 GSS-only export without
authentication. A few procedures that are used at mount time and do not
pose a security risk can therefore also be called with AUTH_NONE or
AUTH_SYS, so that the client mount operation can finish successfully.

The problem with the current implementation is that for AUTH_NULL-only
exports, NFSD_MAY_BYPASS_GSS_ON_ROOT is also active for NFS3 AUTH_UNIX mount
attempts. This confuses NFS3 clients and makes them think that AUTH_UNIX is
enabled and working. The Linux NFS3 client never switches from AUTH_UNIX to
AUTH_NONE on an active mount, which makes the mount inaccessible.

Fix the NFSD_MAY_BYPASS_GSS and NFSD_MAY_BYPASS_GSS_ON_ROOT implementation
so that the bypass applies only to exports that have some real
authentication enabled (GSS, TLS, or any other).

The result: for an AUTH_NULL-only export, a client that attempts to mount
with the AUTH_UNIX flavor receives access errors, which tell the client
that the AUTH_UNIX flavor is not usable; the client then either tries
another auth flavor (AUTH_NULL if enabled) or fails the mount procedure.
Similarly, a client that attempts to mount with the AUTH_NULL flavor when
only the AUTH_UNIX flavor is enabled receives an access error.

This should fix problems with AUTH_NULL-only or AUTH_UNIX-only exports when
a client attempts to mount them with another auth flavor (e.g. with
AUTH_NULL for an AUTH_UNIX-only export, or with AUTH_UNIX for an
AUTH_NULL-only export).

Signed-off-by: Pali Rohár <pali@kernel.org>
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18 20:22:59 -05:00
Pali Rohár
600020927b nfsd: Fill NFSv4.1 server implementation fields in OP_EXCHANGE_ID response
An NFSv4.1 OP_EXCHANGE_ID response from the server may contain server
implementation details (domain, name and build time) in the optional
nfs_impl_id4 field. Currently nfsd does not fill this field.

Send this information in the NFSv4.1 OP_EXCHANGE_ID response. Fill it with
the same values that the Linux NFSv4.1 client uses. The domain is hardcoded
to "kernel.org", the name is composed in the same way as "uname -srvm"
output, and the build time is hardcoded to zeros.

NFSv4.1 client and server implementation fields are useful for statistical
purposes or for identifying the types of clients and servers.

Signed-off-by: Pali Rohár <pali@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18 20:22:58 -05:00
Pali Rohár
2dc84a7522 lockd: Fix comment about NLMv3 backwards compatibility
NLMv2 is a completely different protocol from NLMv1 and NLMv3, and in the
original Sun implementation it is used for RPC loopback callbacks from statd
to lockd services. Linux neither uses nor implements NLMv2.

Hence, NLMv3 is not backward compatible with NLMv2, but it is backward
compatible with NLMv1. Fix the comment.

Signed-off-by: Pali Rohár <pali@kernel.org>
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18 20:22:58 -05:00
Jeff Layton
b9376c7e42 nfsd: new tracepoint for after op_func in compound processing
Turn nfsd_compound_encode_err tracepoint into a class and add a new
nfsd_compound_op_err tracepoint.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18 20:22:58 -05:00
Jeff Layton
f6259e2e4f nfsd: have nfsd4_deleg_getattr_conflict pass back write deleg pointer
Currently we pass back the size and whether it has been modified, but
those just mirror values tracked inside the delegation. In a later
patch, we'll need to get at the timestamps in the delegation too, so
just pass back a reference to the write delegation, and use that to
properly override values in the iattr.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-11 13:42:07 -05:00
Jeff Layton
3a405432e7 nfsd: drop the nfsd4_fattr_args "size" field
We already have a slot for this in the kstat structure. Just overwrite
that instead of keeping a copy.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-11 13:42:07 -05:00
Jeff Layton
c757ca1a56 nfsd: drop the ncf_cb_bmap field
This is always the same value, and in a later patch we're going to need
to set bits in WORD2. We can simplify this code and save a little space
in the delegation too. Just hardcode the bitmap in the callback encode
function.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-11 13:42:07 -05:00
Jeff Layton
f67eef8da0 nfsd: drop inode parameter from nfsd4_change_attribute()
The inode that nfs4_open_delegation() passes to this function is
wrong, which throws off the result. The inode will end up getting a
directory-style change attr instead of a regular-file-style one.

Fix up nfs4_delegation_stat() to fetch STATX_MODE, and then drop the
inode parameter from nfsd4_change_attribute(), since it's no longer
needed.

Fixes: c5967721e1 ("NFSD: handle GETATTR conflict with write delegation")
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-11 13:42:06 -05:00
Chuck Lever
ac159338d5 xdrgen: emit maxsize macros
Add "definitions" subcommand logic to emit maxsize macros in
generated code.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-11 13:42:06 -05:00
Chuck Lever
e9e1e7e75a xdrgen: Add generator code for XDR width macros
Introduce logic in the code generators to emit maxsize (XDR
width) definitions. In C, these are pre-processor macros.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-11 13:42:05 -05:00
Chuck Lever
ce5a75d993 xdrgen: XDR width for union types
Not yet complete.

The tool doesn't do any math yet. Thus, even though the maximum XDR
width of a union is the width of the union enumerator plus the width
of its largest arm, we're using the sum of all the elements of the
union for the moment.

This means that buffer size requirements are overestimated, and that
the generated maxsize macro cannot yet be used for determining data
element alignment in the XDR buffer.
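
The gap between the two figures can be sketched in Python as follows. This
is only an illustration under stated assumptions, not xdrgen's code: the
function names and the sample widths are hypothetical, widths are counted
in 4-byte XDR units, and the interim figure is assumed to include the
discriminant.

```python
# Illustrative sketch only; these helpers are hypothetical, not part of xdrgen.
# All widths are counted in XDR units (4 bytes each).

def union_width_exact(discriminant: int, arms: list[int]) -> int:
    """Width of the discriminant plus the width of the largest arm."""
    return discriminant + max(arms, default=0)


def union_width_interim(discriminant: int, arms: list[int]) -> int:
    """Interim over-estimate: the discriminant plus the sum of every arm.

    Assumption: the "sum of all the elements" includes the discriminant.
    """
    return discriminant + sum(arms)


# A made-up union with a 1-unit discriminant and arms of 1, 2 and 8 units.
arms = [1, 2, 8]
print(union_width_exact(1, arms))    # 9
print(union_width_interim(1, arms))  # 12
```

The interim value is never smaller than the exact one, so it is safe for
sizing buffers but cannot stand in for the true on-the-wire width.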

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-11 13:42:05 -05:00
Chuck Lever
447dc1efeb xdrgen: XDR width for pointer types
The XDR width of a pointer type is the sum of the widths of each of
the struct's fields, except for the last field. The width of the
implicit boolean "value follows" field is added as well.
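
A minimal Python sketch of this rule, assuming widths are counted in 4-byte
XDR units. The helper name and the sample field widths are hypothetical and
are not taken from xdrgen.

```python
# Illustrative sketch only; pointer_width() is a hypothetical helper,
# not part of xdrgen. All widths are counted in XDR units (4 bytes each).

VALUE_FOLLOWS_WIDTH = 1  # the implicit boolean occupies one XDR unit


def pointer_width(field_widths: list[int]) -> int:
    """Sum every field except the last, then add the "value follows" boolean."""
    return VALUE_FOLLOWS_WIDTH + sum(field_widths[:-1])


# A made-up struct whose fields are 1, 2 and 1 units wide; the last field
# is excluded from the sum, as the rule above states.
print(pointer_width([1, 2, 1]))  # 4
```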

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-11 13:42:05 -05:00
Chuck Lever
f4bc1e996a xdrgen: XDR width for struct types
The XDR width of a struct type is the sum of the widths of each of
the struct's fields.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-11 13:42:04 -05:00