95608 Commits

Author SHA1 Message Date
Paul Aurich
a9685b409a smb: prevent use-after-free due to open_cached_dir error paths
If open_cached_dir() encounters an error parsing the lease from the
server, the error handling may race with receiving a lease break,
resulting in open_cached_dir() freeing the cfid while the queued work is
pending.

Update open_cached_dir() to drop refs rather than directly freeing the
cfid.

Have cached_dir_lease_break(), cfids_laundromat_worker(), and
invalidate_all_cached_dirs() clear has_lease immediately while still
holding cfids->cfid_list_lock, and then use this to also simplify the
reference counting in cfids_laundromat_worker() and
invalidate_all_cached_dirs().

Fixes this KASAN splat (which manually injects an error and lease break
in open_cached_dir()):

==================================================================
BUG: KASAN: slab-use-after-free in smb2_cached_lease_break+0x27/0xb0
Read of size 8 at addr ffff88811cc24c10 by task kworker/3:1/65

CPU: 3 UID: 0 PID: 65 Comm: kworker/3:1 Not tainted 6.12.0-rc6-g255cf264e6e5-dirty #87
Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 11/12/2020
Workqueue: cifsiod smb2_cached_lease_break
Call Trace:
 <TASK>
 dump_stack_lvl+0x77/0xb0
 print_report+0xce/0x660
 kasan_report+0xd3/0x110
 smb2_cached_lease_break+0x27/0xb0
 process_one_work+0x50a/0xc50
 worker_thread+0x2ba/0x530
 kthread+0x17c/0x1c0
 ret_from_fork+0x34/0x60
 ret_from_fork_asm+0x1a/0x30
 </TASK>

Allocated by task 2464:
 kasan_save_stack+0x33/0x60
 kasan_save_track+0x14/0x30
 __kasan_kmalloc+0xaa/0xb0
 open_cached_dir+0xa7d/0x1fb0
 smb2_query_path_info+0x43c/0x6e0
 cifs_get_fattr+0x346/0xf10
 cifs_get_inode_info+0x157/0x210
 cifs_revalidate_dentry_attr+0x2d1/0x460
 cifs_getattr+0x173/0x470
 vfs_statx_path+0x10f/0x160
 vfs_statx+0xe9/0x150
 vfs_fstatat+0x5e/0xc0
 __do_sys_newfstatat+0x91/0xf0
 do_syscall_64+0x95/0x1a0
 entry_SYSCALL_64_after_hwframe+0x76/0x7e

Freed by task 2464:
 kasan_save_stack+0x33/0x60
 kasan_save_track+0x14/0x30
 kasan_save_free_info+0x3b/0x60
 __kasan_slab_free+0x51/0x70
 kfree+0x174/0x520
 open_cached_dir+0x97f/0x1fb0
 smb2_query_path_info+0x43c/0x6e0
 cifs_get_fattr+0x346/0xf10
 cifs_get_inode_info+0x157/0x210
 cifs_revalidate_dentry_attr+0x2d1/0x460
 cifs_getattr+0x173/0x470
 vfs_statx_path+0x10f/0x160
 vfs_statx+0xe9/0x150
 vfs_fstatat+0x5e/0xc0
 __do_sys_newfstatat+0x91/0xf0
 do_syscall_64+0x95/0x1a0
 entry_SYSCALL_64_after_hwframe+0x76/0x7e

Last potentially related work creation:
 kasan_save_stack+0x33/0x60
 __kasan_record_aux_stack+0xad/0xc0
 insert_work+0x32/0x100
 __queue_work+0x5c9/0x870
 queue_work_on+0x82/0x90
 open_cached_dir+0x1369/0x1fb0
 smb2_query_path_info+0x43c/0x6e0
 cifs_get_fattr+0x346/0xf10
 cifs_get_inode_info+0x157/0x210
 cifs_revalidate_dentry_attr+0x2d1/0x460
 cifs_getattr+0x173/0x470
 vfs_statx_path+0x10f/0x160
 vfs_statx+0xe9/0x150
 vfs_fstatat+0x5e/0xc0
 __do_sys_newfstatat+0x91/0xf0
 do_syscall_64+0x95/0x1a0
 entry_SYSCALL_64_after_hwframe+0x76/0x7e

The buggy address belongs to the object at ffff88811cc24c00
 which belongs to the cache kmalloc-1k of size 1024
The buggy address is located 16 bytes inside of
 freed 1024-byte region [ffff88811cc24c00, ffff88811cc25000)

Cc: stable@vger.kernel.org
Signed-off-by: Paul Aurich <paul@darkrain42.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-11-21 10:45:50 -06:00
Paul Aurich
7afb867336 smb: Don't leak cfid when reconnect races with open_cached_dir
open_cached_dir() may either race with the tcon reconnection even before
compound_send_recv() or directly trigger a reconnection via
SMB2_open_init() or SMB_query_info_init().

The reconnection process invokes invalidate_all_cached_dirs() via
cifs_mark_open_files_invalid(), which removes all cfids from the
cfids->entries list but doesn't drop a ref if has_lease isn't true. This
results in the currently-being-constructed cfid not being on the list,
but still having a refcount of 2. It leaks if returned from
open_cached_dir().

Fix this by setting cfid->has_lease when the ref is actually taken; the
cfid will not be used by other threads until it has a valid time.

Addresses these kmemleaks:

unreferenced object 0xffff8881090c4000 (size 1024):
  comm "bash", pid 1860, jiffies 4295126592
  hex dump (first 32 bytes):
    00 01 00 00 00 00 ad de 22 01 00 00 00 00 ad de  ........".......
    00 ca 45 22 81 88 ff ff f8 dc 4f 04 81 88 ff ff  ..E"......O.....
  backtrace (crc 6f58c20f):
    [<ffffffff8b895a1e>] __kmalloc_cache_noprof+0x2be/0x350
    [<ffffffff8bda06e3>] open_cached_dir+0x993/0x1fb0
    [<ffffffff8bdaa750>] cifs_readdir+0x15a0/0x1d50
    [<ffffffff8b9a853f>] iterate_dir+0x28f/0x4b0
    [<ffffffff8b9a9aed>] __x64_sys_getdents64+0xfd/0x200
    [<ffffffff8cf6da05>] do_syscall_64+0x95/0x1a0
    [<ffffffff8d00012f>] entry_SYSCALL_64_after_hwframe+0x76/0x7e
unreferenced object 0xffff8881044fdcf8 (size 8):
  comm "bash", pid 1860, jiffies 4295126592
  hex dump (first 8 bytes):
    00 cc cc cc cc cc cc cc                          ........
  backtrace (crc 10c106a9):
    [<ffffffff8b89a3d3>] __kmalloc_node_track_caller_noprof+0x363/0x480
    [<ffffffff8b7d7256>] kstrdup+0x36/0x60
    [<ffffffff8bda0700>] open_cached_dir+0x9b0/0x1fb0
    [<ffffffff8bdaa750>] cifs_readdir+0x15a0/0x1d50
    [<ffffffff8b9a853f>] iterate_dir+0x28f/0x4b0
    [<ffffffff8b9a9aed>] __x64_sys_getdents64+0xfd/0x200
    [<ffffffff8cf6da05>] do_syscall_64+0x95/0x1a0
    [<ffffffff8d00012f>] entry_SYSCALL_64_after_hwframe+0x76/0x7e

And addresses these BUG splats when unmounting the SMB filesystem:

BUG: Dentry ffff888140590ba0{i=1000000000080,n=/}  still in use (2) [unmount of cifs cifs]
WARNING: CPU: 3 PID: 3433 at fs/dcache.c:1536 umount_check+0xd0/0x100
Modules linked in:
CPU: 3 UID: 0 PID: 3433 Comm: bash Not tainted 6.12.0-rc4-g850925a8133c-dirty #49
Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 11/12/2020
RIP: 0010:umount_check+0xd0/0x100
Code: 8d 7c 24 40 e8 31 5a f4 ff 49 8b 54 24 40 41 56 49 89 e9 45 89 e8 48 89 d9 41 57 48 89 de 48 c7 c7 80 e7 db ac e8 f0 72 9a ff <0f> 0b 58 31 c0 5a 5b 5d 41 5c 41 5d 41 5e 41 5f e9 2b e5 5d 01 41
RSP: 0018:ffff88811cc27978 EFLAGS: 00010286
RAX: 0000000000000000 RBX: ffff888140590ba0 RCX: ffffffffaaf20bae
RDX: dffffc0000000000 RSI: 0000000000000008 RDI: ffff8881f6fb6f40
RBP: ffff8881462ec000 R08: 0000000000000001 R09: ffffed1023984ee3
R10: ffff88811cc2771f R11: 00000000016cfcc0 R12: ffff888134383e08
R13: 0000000000000002 R14: ffff8881462ec668 R15: ffffffffaceab4c0
FS:  00007f23bfa98740(0000) GS:ffff8881f6f80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000556de4a6f808 CR3: 0000000123c80000 CR4: 0000000000350ef0
Call Trace:
 <TASK>
 d_walk+0x6a/0x530
 shrink_dcache_for_umount+0x6a/0x200
 generic_shutdown_super+0x52/0x2a0
 kill_anon_super+0x22/0x40
 cifs_kill_sb+0x159/0x1e0
 deactivate_locked_super+0x66/0xe0
 cleanup_mnt+0x140/0x210
 task_work_run+0xfb/0x170
 syscall_exit_to_user_mode+0x29f/0x2b0
 do_syscall_64+0xa1/0x1a0
 entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7f23bfb93ae7
Code: ff ff ff ff c3 66 0f 1f 44 00 00 48 8b 0d 11 93 0d 00 f7 d8 64 89 01 b8 ff ff ff ff eb bf 0f 1f 44 00 00 b8 50 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d e9 92 0d 00 f7 d8 64 89 01 48
RSP: 002b:00007ffee9138598 EFLAGS: 00000246 ORIG_RAX: 0000000000000050
RAX: 0000000000000000 RBX: 0000558f1803e9a0 RCX: 00007f23bfb93ae7
RDX: 0000000000000000 RSI: 0000000000000004 RDI: 0000558f1803e9a0
RBP: 0000558f1803e600 R08: 0000000000000007 R09: 0000558f17fab610
R10: d91d5ec34ab757b0 R11: 0000000000000246 R12: 0000000000000001
R13: 0000000000000000 R14: 0000000000000015 R15: 0000000000000000
 </TASK>
irq event stamp: 1163486
hardirqs last  enabled at (1163485): [<ffffffffac98d344>] _raw_spin_unlock_irqrestore+0x34/0x60
hardirqs last disabled at (1163486): [<ffffffffac97dcfc>] __schedule+0xc7c/0x19a0
softirqs last  enabled at (1163482): [<ffffffffab79a3ee>] __smb_send_rqst+0x3de/0x990
softirqs last disabled at (1163480): [<ffffffffac2314f1>] release_sock+0x21/0xf0
---[ end trace 0000000000000000 ]---

VFS: Busy inodes after unmount of cifs (cifs)
------------[ cut here ]------------
kernel BUG at fs/super.c:661!
Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI
CPU: 1 UID: 0 PID: 3433 Comm: bash Tainted: G        W          6.12.0-rc4-g850925a8133c-dirty #49
Tainted: [W]=WARN
Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 11/12/2020
RIP: 0010:generic_shutdown_super+0x290/0x2a0
Code: e8 15 7c f7 ff 48 8b 5d 28 48 89 df e8 09 7c f7 ff 48 8b 0b 48 89 ee 48 8d 95 68 06 00 00 48 c7 c7 80 7f db ac e8 00 69 af ff <0f> 0b 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 90 90 90 90 90 90
RSP: 0018:ffff88811cc27a50 EFLAGS: 00010246
RAX: 000000000000003e RBX: ffffffffae994420 RCX: 0000000000000027
RDX: 0000000000000000 RSI: ffffffffab06180e RDI: ffff8881f6eb18c8
RBP: ffff8881462ec000 R08: 0000000000000001 R09: ffffed103edd6319
R10: ffff8881f6eb18cb R11: 00000000016d3158 R12: ffff8881462ec9c0
R13: ffff8881462ec050 R14: 0000000000000001 R15: 0000000000000000
FS:  00007f23bfa98740(0000) GS:ffff8881f6e80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f8364005d68 CR3: 0000000123c80000 CR4: 0000000000350ef0
Call Trace:
 <TASK>
 kill_anon_super+0x22/0x40
 cifs_kill_sb+0x159/0x1e0
 deactivate_locked_super+0x66/0xe0
 cleanup_mnt+0x140/0x210
 task_work_run+0xfb/0x170
 syscall_exit_to_user_mode+0x29f/0x2b0
 do_syscall_64+0xa1/0x1a0
 entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7f23bfb93ae7
 </TASK>
Modules linked in:
---[ end trace 0000000000000000 ]---
RIP: 0010:generic_shutdown_super+0x290/0x2a0
Code: e8 15 7c f7 ff 48 8b 5d 28 48 89 df e8 09 7c f7 ff 48 8b 0b 48 89 ee 48 8d 95 68 06 00 00 48 c7 c7 80 7f db ac e8 00 69 af ff <0f> 0b 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 90 90 90 90 90 90
RSP: 0018:ffff88811cc27a50 EFLAGS: 00010246
RAX: 000000000000003e RBX: ffffffffae994420 RCX: 0000000000000027
RDX: 0000000000000000 RSI: ffffffffab06180e RDI: ffff8881f6eb18c8
RBP: ffff8881462ec000 R08: 0000000000000001 R09: ffffed103edd6319
R10: ffff8881f6eb18cb R11: 00000000016d3158 R12: ffff8881462ec9c0
R13: ffff8881462ec050 R14: 0000000000000001 R15: 0000000000000000
FS:  00007f23bfa98740(0000) GS:ffff8881f6e80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f8364005d68 CR3: 0000000123c80000 CR4: 0000000000350ef0

This reproduces eventually with an SMB mount and two shells running
these loops concurrently

- while true; do
      cd ~; sleep 1;
      for i in {1..3}; do cd /mnt/test/subdir;
          echo $PWD; sleep 1; cd ..; echo $PWD; sleep 1;
      done;
      echo ...;
  done
- while true; do
      iptables -F OUTPUT; mount -t cifs -a;
      for _ in {0..2}; do ls /mnt/test/subdir/ | wc -l; done;
      iptables -I OUTPUT -p tcp --dport 445 -j DROP;
      sleep 10
      echo "unmounting"; umount -l -t cifs -a; echo "done unmounting";
      sleep 20
      echo "recovering"; iptables -F OUTPUT;
      sleep 10;
  done

Fixes: ebe98f1447bb ("cifs: enable caching of directories for which a lease is held")
Fixes: 5c86919455c1 ("smb: client: fix use-after-free in smb2_query_info_compound()")
Cc: stable@vger.kernel.org
Signed-off-by: Paul Aurich <paul@darkrain42.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-11-21 10:45:50 -06:00
Paulo Alcantara
0812340811 smb: client: handle max length for SMB symlinks
We can't use PATH_MAX for SMB symlinks because

  (1) Windows Server will fail FSCTL_SET_REPARSE_POINT with
      STATUS_IO_REPARSE_DATA_INVALID when input buffer is larger than
      16K, as specified in MS-FSA 2.1.5.10.37.

  (2) The client won't be able to parse large SMB responses that
      includes SMB symlink path within SMB2_CREATE or SMB2_IOCTL
      responses.

Fix this by defining a maximum length value (4060) for SMB symlinks
that both client and server can handle.

Cc: David Howells <dhowells@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-11-21 10:45:50 -06:00
Paulo Alcantara
9f544d26b1 smb: client: get rid of bounds check in SMB2_ioctl_init()
smb2_set_next_command() no longer squashes request iovs into a single
iov, so the bounds check can be dropped.

Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-11-21 10:45:50 -06:00
Paulo Alcantara
bc925c1216 smb: client: improve compound padding in encryption
After commit f7f291e14dde ("cifs: fix oops during encryption"), the
encryption layer can handle vmalloc'd buffers as well as kmalloc'd
buffers, so there is no need to inefficiently squash request iovs
into a single one to handle padding in compound requests.

Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-11-21 10:45:49 -06:00
Steve French
9ed9d83a51 smb3: request handle caching when caching directories
This client was only requesting READ caching, not READ and HANDLE caching
in the LeaseState on the open requests we send for directories.  To
delay closing a handle (e.g. for caching directory contents) we should
be requesting HANDLE as well as READ (as we already do for deferred
close of files).   See MS-SMB2 3.3.1.4 e.g.

Cc: stable@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-11-21 10:45:15 -06:00
Pali Rohár
0d6b0d2e38 cifs: Recognize SFU char/block devices created by Windows NFS server on Windows Server <<2012
Windows NFS server versions on Windows Server older than 2012 release use
for storing char and block devices modified SFU format, not compatible with
the original SFU. Windows NFS server on Windows Server 2012 and new
versions use different format (reparse points), not related to SFU-style.

SFU / SUA / Interix subsystem stores the major and major numbers as pair of
64-bit integer, but Windows NFS server stores as pair of 32-bit integers.

Which makes char and block devices between Windows NFS server <<2012 and
Windows SFU/SUA/Interix subsytem incompatible.

So improve Linux SMB client.

When SFU mode is enabled (mount option -o sfu is specified) then recognize
also these kind of char and block devices and its major and minor numbers,
which are used by Windows Server versions older than 2012.

Signed-off-by: Pali Rohár <pali@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-11-21 10:44:29 -06:00
Ritvik Budhiraja
db363b0a1d CIFS: New mount option for cifs.upcall namespace resolution
In the current implementation, the SMB filesystem on a mount point can
trigger upcalls from the kernel to the userspace to enable certain
functionalities like spnego, dns_resolution, amongst others. These upcalls
usually either happen in the context of the mount or in the context of an
application/user. The upcall handler for cifs, cifs.upcall already has
existing code which switches the namespaces to the caller's namespace
before handling the upcall. This behaviour is expected for scenarios like
multiuser mounts, but might not cover all single user scenario with
services such as Kubernetes, where the mount can happen from different
locations such as on the host, from an app container, or a driver pod
which does the mount on behalf of a different pod.

This patch introduces a new mount option called upcall_target, to
customise the upcall behaviour. upcall_target can take 'mount' and 'app'
as possible values. This aids use cases like Kubernetes where the mount
happens on behalf of the application in another container altogether.
Having this new mount option allows the mount command to specify where the
upcall should happen: 'mount' for resolving the upcall to the host
namespace, and 'app' for resolving the upcall to the ns of the calling
thread. This will enable both the scenarios where the Kerberos credentials
can be found on the application namespace or the host namespace to which
just the mount operation is "delegated".

Reviewed-by: Shyam Prasad <shyam.prasad@microsoft.com>
Reviewed-by: Bharath S M <bharathsm@microsoft.com>
Reviewed-by: Ronnie Sahlberg <ronniesahlberg@gmail.com>
Signed-off-by: Ritvik Budhiraja <rbudhiraja@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-11-21 10:44:03 -06:00
Dan Carpenter
7a2158b73c smb/client: Prevent error pointer dereference
The cifs_sb_tlink() function can return error pointers, but this code
dereferences it before checking for error pointers.  Re-order the code
to fix that.

Fixes: 0f9b6b045bb2 ("fs/smb/client: implement chmod() for SMB3 POSIX Extensions")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: Bharath SM <bharathsm@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-11-21 10:43:23 -06:00
Ralph Boehme
d413eabff1 fs/smb/client: implement chmod() for SMB3 POSIX Extensions
The NT ACL format for an SMB3 POSIX Extensions chmod() is a single ACE with the
magic S-1-5-88-3-mode SID:

  NT Security Descriptor
      Revision: 1
      Type: 0x8004, Self Relative, DACL Present
      Offset to owner SID: 56
      Offset to group SID: 124
      Offset to SACL: 0
      Offset to DACL: 20
      Owner: S-1-5-21-3177838999-3893657415-1037673384-1000
      Group: S-1-22-2-1000
      NT User (DACL) ACL
          Revision: NT4 (2)
          Size: 36
          Num ACEs: 1
          NT ACE: S-1-5-88-3-438, flags 0x00, Access Allowed, mask 0x00000000
              Type: Access Allowed
              NT ACE Flags: 0x00
              Size: 28
              Access required: 0x00000000
              SID: S-1-5-88-3-438

Owner and Group should be NULL, but the server is not required to fail the
request if they are present.

Signed-off-by: Ralph Boehme <slow@samba.org>
Cc: stable@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-11-21 10:43:01 -06:00
Paul Aurich
128630e1db smb: cached directories can be more than root file handle
Update this log message since cached fids may represent things other
than the root of a mount.

Fixes: e4029e072673 ("cifs: find and use the dentry for cached non-root directories also")
Signed-off-by: Paul Aurich <paul@darkrain42.org>
Reviewed-by: Bharath SM <bharathsm@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-11-21 10:41:35 -06:00
Linus Torvalds
fcc79e1714 Networking changes for 6.13.
The most significant set of changes is the per netns RTNL. The new
 behavior is disabled by default, regression risk should be contained.
 
 Notably the new config knob PTP_1588_CLOCK_VMCLOCK will inherit its
 default value from PTP_1588_CLOCK_KVM, as the first is intended to be
 a more reliable replacement for the latter.
 
 Core
 ----
 
  - Started a very large, in-progress, effort to make the RTNL lock
    scope per network-namespace, thus reducing the lock contention
    significantly in the containerized use-case, comprising:
    - RCU-ified some relevant slices of the FIB control path
    - introduce basic per netns locking helpers
    - namespacified the IPv4 address hash table
    - remove rtnl_register{,_module}() in favour of rtnl_register_many()
    - refactor rtnl_{new,del,set}link() moving as much validation as
      possible out of RTNL lock
    - convert all phonet doit() and dumpit() handlers to RCU
    - convert IPv4 addresses manipulation to per-netns RTNL
    - convert virtual interface creation to per-netns RTNL
    the per-netns lock infra is guarded by the CONFIG_DEBUG_NET_SMALL_RTNL
    knob, disabled by default ad interim.
 
  - Introduce NAPI suspension, to efficiently switching between busy
    polling (NAPI processing suspended) and normal processing.
 
  - Migrate the IPv4 routing input, output and control path from direct
    ToS usage to DSCP macros. This is a work in progress to make ECN
    handling consistent and reliable.
 
  - Add drop reasons support to the IPv4 rotue input path, allowing
    better introspection in case of packets drop.
 
  - Make FIB seqnum lockless, dropping RTNL protection for read
    access.
 
  - Make inet{,v6} addresses hashing less predicable.
 
  - Allow providing timestamp OPT_ID via cmsg, to correlate TX packets
    and timestamps
 
 Things we sprinkled into general kernel code
 --------------------------------------------
 
  - Add small file operations for debugfs, to reduce the struct ops size.
 
  - Refactoring and optimization for the implementation of page_frag API,
    This is a preparatory work to consolidate the page_frag
    implementation.
 
 Netfilter
 ---------
 
  - Optimize set element transactions to reduce memory consumption
 
  - Extended netlink error reporting for attribute parser failure.
 
  - Make legacy xtables configs user selectable, giving users
    the option to configure iptables without enabling any other config.
 
  - Address a lot of false-positive RCU issues, pointed by recent
    CI improvements.
 
 BPF
 ---
 
  - Put xsk sockets on a struct diet and add various cleanups. Overall,
    this helps to bump performance by 12% for some workloads.
 
  - Extend BPF selftests to increase coverage of XDP features in
    combination with BPF cpumap.
 
  - Optimize and homogenize bpf_csum_diff helper for all archs and also
    add a batch of new BPF selftests for it.
 
  - Extend netkit with an option to delegate skb->{mark,priority}
    scrubbing to its BPF program.
 
  - Make the bpf_get_netns_cookie() helper available also to tc(x) BPF
    programs.
 
 Protocols
 ---------
 
  - Introduces 4-tuple hash for connected udp sockets, speeding-up
    significantly connected sockets lookup.
 
  - Add a fastpath for some TCP timers that usually expires after close,
    the socket lock contention.
 
  - Add inbound and outbound xfrm state caches to speed up state lookups.
 
  - Avoid sending MPTCP advertisements on stale subflows, reducing
    risks on loosing them.
 
  - Make neighbours table flushing more scalable, maintaining per device
    neigh lists.
 
 Driver API
 ----------
 
  - Introduce a unified interface to configure transmission H/W shaping,
    and expose it to user-space via generic-netlink.
 
  - Add support for per-NAPI config via netlink. This makes napi
    configuration persistent across queues removal and re-creation.
    Requires driver updates, currently supported drivers are:
    nVidia/Mellanox mlx4 and mlx5, Broadcom brcm and Intel ice.
 
  - Add ethtool support for writing SFP / PHY firmware blocks.
 
  - Track RSS context allocation from ethtool core.
 
  - Implement support for mirroring to DSA CPU port, via TC mirror
    offload.
 
  - Consolidate FDB updates notification, to avoid duplicates on
    device-specific entries.
 
  - Expose DPLL clock quality level to the user-space.
 
  - Support master-slave PHY config via device tree.
 
 Tests and tooling
 -----------------
 
  - forwarding: introduce deferred commands, to simplify
    the cleanup phase
 
 Drivers
 -------
 
  - Updated several drivers - Amazon vNic, Google vNic, Microsoft vNic,
    Intel e1000e and Broadcom Tigon3 - to use netdev-genl to link the
    IRQs and queues to NAPI IDs, allowing busy polling and better
    introspection.
 
  - Ethernet high-speed NICs:
    - nVidia/Mellanox:
      - mlx5:
        - a large refactor to implement support for cross E-Switch
          scheduling
        - refactor H/W conter management to let it scale better
        - H/W GRO cleanups
    - Intel (100G, ice)::
      - adds support for ethtool reset
      - implement support for per TX queue H/W shaping
    - AMD/Solarflare:
      - implement per device queue stats support
    - Broadcom (bnxt):
      - improve wildcard l4proto on IPv4/IPv6 ntuple rules
    - Marvell Octeon:
      - Adds representor support for each Resource Virtualization Unit
        (RVU) device.
    - Hisilicon:
      - adds support for the BMC Gigabit Ethernet
    - IBM (EMAC):
      - driver cleanup and modernization
    - Cisco (VIC):
      - raise the queues number limit to 256
 
  - Ethernet virtual:
    - Google vNIC:
      - implements page pool support
    - macsec:
      - inherit lower device's features and TSO limits when offloading
    - virtio_net:
      - enable premapped mode by default
      - support for XDP socket(AF_XDP) zerocopy TX
    - wireguard:
      - set the TSO max size to be GSO_MAX_SIZE, to aggregate larger
        packets.
 
  - Ethernet NICs embedded and virtual:
    - Broadcom ASP:
      - enable software timestamping
    - Freescale:
      - add enetc4 PF driver
    - MediaTek: Airoha SoC:
      - implement BQL support
    - RealTek r8169:
      - enable TSO by default on r8168/r8125
      - implement extended ethtool stats
    - Renesas AVB:
      - enable TX checksum offload
    - Synopsys (stmmac):
      - support header splitting for vlan tagged packets
      - move common code for DWMAC4 and DWXGMAC into a separate FPE
        module.
      - Add the dwmac driver support for T-HEAD TH1520 SoC
    - Synopsys (xpcs):
      - driver refactor and cleanup
    - TI:
      - icssg_prueth: add VLAN offload support
    - Xilinx emaclite:
      - adds clock support
 
  - Ethernet switches:
    - Microchip:
      - implement support for the lan969x Ethernet switch family
      - add LAN9646 switch support to KSZ DSA driver
 
  - Ethernet PHYs:
    - Marvel: 88q2x: enable auto negotiation
    - Microchip: add support for LAN865X Rev B1 and LAN867X Rev C1/C2
 
  - PTP:
    - Add support for the Amazon virtual clock device
    - Add PtP driver for s390 clocks
 
  - WiFi:
    - mac80211
      - EHT 1024 aggregation size for transmissions
      - new operation to indicate that a new interface is to be added
      - support radio separation of multi-band devices
      - move wireless extension spy implementation to libiw
    - Broadcom:
      - brcmfmac: optional LPO clock support
    - Microchip:
      - add support for Atmel WILC3000
    - Qualcomm (ath12k):
      - firmware coredump collection support
      - add debugfs support for a multitude of statistics
    - Qualcomm (ath5k):
      -  Arcadyan ARV45XX AR2417 & Gigaset SX76[23] AR241[34]A support
    - Realtek:
      - rtw88: 8821au and 8812au USB adapters support
      - rtw89: add thermal protection
      - rtw89: fine tune BT-coexsitence to improve user experience
      - rtw89: firmware secure boot for WiFi 6 chip
 
  - Bluetooth
      - add Qualcomm WCN785x support for ids Foxconn 0xe0fc/0xe0f3 and
        0x13d3:0x3623
      - add Realtek RTL8852BE support for id Foxconn 0xe123
      - add MediaTek MT7920 support for wireless module ids
      - btintel_pcie: add handshake between driver and firmware
      - btintel_pcie: add recovery mechanism
      - btnxpuart: add GPIO support to power save feature
 
 Signed-off-by: Paolo Abeni <pabeni@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmc8sukSHHBhYmVuaUBy
 ZWRoYXQuY29tAAoJECkkeY3MjxOkLEYQAIMM6Qjh0bh3Byr3gOS1xZzXG+APLjP4
 9Jr0p3i+X53i90jvVqzeVO5FTc95MVHSKZ3kvPkDMXSLUaEJxocNHCI5Dzl/2/qL
 wWdpUB6/ou+jKB4Bn6Z8OvVODT7qrr0tVa9M2/fuKWrIsOU/ntIhG8EhnGddk5U/
 vKPSf5PUIb81uNRnF58VusY3wrT1dEoh9VfJYxL+ST+inPxjEAMy6Y+lmlsjGaSX
 jrS+Pp9KYiUwl3Qt0AQs+cG4OHkJdjbnChrfosWwpkiyddO8klVq06+wX/TiSzfF
 b9VZtBfy/GZs3lkE1mQkcILdtX5pP3YHQdpsuxFfVI0JHVszx2ck7WdoRux/8F0v
 kKZsYcO7bH9I1wMFP66Ff9hIbdEQaeucK+KdDkXyPNMfP91Vzmfjii8IBxOC36Ie
 BbOeFUrXyTxxJ2u0vf/X9JtIq8bcrkNrSd1n1jlGPMqG3FVzsY95+Oi4qfsyeUbl
 lS1PlVTqPMPFdX54HnxM3y2rJjhd7iXhkvmtuXNjRFThXlOiK3maAPWlM1aZ3b8u
 Vjs4JFUsW0tleZG+RzANjsGjXbf7AiPUGLZt+acem0K+fcjG4i5aGIAJrxwa/ORx
 eG74IZRt5cOI371W7gNLGHjwnuge8tFPgOWcRP2eozNm7jvMYALBejYS7eWUTvaf
 THcvVM+bupEZ
 =GzPr
 -----END PGP SIGNATURE-----

Merge tag 'net-next-6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next

Pull networking updates from Paolo Abeni:
 "The most significant set of changes is the per netns RTNL. The new
  behavior is disabled by default, regression risk should be contained.

  Notably the new config knob PTP_1588_CLOCK_VMCLOCK will inherit its
  default value from PTP_1588_CLOCK_KVM, as the first is intended to be
  a more reliable replacement for the latter.

  Core:

   - Started a very large, in-progress, effort to make the RTNL lock
     scope per network-namespace, thus reducing the lock contention
     significantly in the containerized use-case, comprising:
       - RCU-ified some relevant slices of the FIB control path
       - introduce basic per netns locking helpers
       - namespacified the IPv4 address hash table
       - remove rtnl_register{,_module}() in favour of
         rtnl_register_many()
       - refactor rtnl_{new,del,set}link() moving as much validation as
         possible out of RTNL lock
       - convert all phonet doit() and dumpit() handlers to RCU
       - convert IPv4 addresses manipulation to per-netns RTNL
       - convert virtual interface creation to per-netns RTNL
     the per-netns lock infrastructure is guarded by the
     CONFIG_DEBUG_NET_SMALL_RTNL knob, disabled by default ad interim.

   - Introduce NAPI suspension, to efficiently switching between busy
     polling (NAPI processing suspended) and normal processing.

   - Migrate the IPv4 routing input, output and control path from direct
     ToS usage to DSCP macros. This is a work in progress to make ECN
     handling consistent and reliable.

   - Add drop reasons support to the IPv4 rotue input path, allowing
     better introspection in case of packets drop.

   - Make FIB seqnum lockless, dropping RTNL protection for read access.

   - Make inet{,v6} addresses hashing less predicable.

   - Allow providing timestamp OPT_ID via cmsg, to correlate TX packets
     and timestamps

  Things we sprinkled into general kernel code:

   - Add small file operations for debugfs, to reduce the struct ops
     size.

   - Refactoring and optimization for the implementation of page_frag
     API, This is a preparatory work to consolidate the page_frag
     implementation.

  Netfilter:

   - Optimize set element transactions to reduce memory consumption

   - Extended netlink error reporting for attribute parser failure.

   - Make legacy xtables configs user selectable, giving users the
     option to configure iptables without enabling any other config.

   - Address a lot of false-positive RCU issues, pointed by recent CI
     improvements.

  BPF:

   - Put xsk sockets on a struct diet and add various cleanups. Overall,
     this helps to bump performance by 12% for some workloads.

   - Extend BPF selftests to increase coverage of XDP features in
     combination with BPF cpumap.

   - Optimize and homogenize bpf_csum_diff helper for all archs and also
     add a batch of new BPF selftests for it.

   - Extend netkit with an option to delegate skb->{mark,priority}
     scrubbing to its BPF program.

   - Make the bpf_get_netns_cookie() helper available also to tc(x) BPF
     programs.

  Protocols:

   - Introduces 4-tuple hash for connected udp sockets, speeding-up
     significantly connected sockets lookup.

   - Add a fastpath for some TCP timers that usually expires after
     close, the socket lock contention.

   - Add inbound and outbound xfrm state caches to speed up state
     lookups.

   - Avoid sending MPTCP advertisements on stale subflows, reducing
     risks on loosing them.

   - Make neighbours table flushing more scalable, maintaining per
     device neigh lists.

  Driver API:

   - Introduce a unified interface to configure transmission H/W
     shaping, and expose it to user-space via generic-netlink.

   - Add support for per-NAPI config via netlink. This makes napi
     configuration persistent across queues removal and re-creation.
     Requires driver updates, currently supported drivers are:
     nVidia/Mellanox mlx4 and mlx5, Broadcom brcm and Intel ice.

   - Add ethtool support for writing SFP / PHY firmware blocks.

   - Track RSS context allocation from ethtool core.

   - Implement support for mirroring to DSA CPU port, via TC mirror
     offload.

   - Consolidate FDB updates notification, to avoid duplicates on
     device-specific entries.

   - Expose DPLL clock quality level to the user-space.

   - Support master-slave PHY config via device tree.

  Tests and tooling:

   - forwarding: introduce deferred commands, to simplify the cleanup
     phase

  Drivers:

   - Updated several drivers - Amazon vNic, Google vNic, Microsoft vNic,
     Intel e1000e and Broadcom Tigon3 - to use netdev-genl to link the
     IRQs and queues to NAPI IDs, allowing busy polling and better
     introspection.

   - Ethernet high-speed NICs:
      - nVidia/Mellanox:
         - mlx5:
           - a large refactor to implement support for cross E-Switch
             scheduling
           - refactor H/W conter management to let it scale better
           - H/W GRO cleanups
      - Intel (100G, ice)::
         - add support for ethtool reset
         - implement support for per TX queue H/W shaping
      - AMD/Solarflare:
         - implement per device queue stats support
      - Broadcom (bnxt):
         - improve wildcard l4proto on IPv4/IPv6 ntuple rules
      - Marvell Octeon:
         - Add representor support for each Resource Virtualization Unit
           (RVU) device.
      - Hisilicon:
         - add support for the BMC Gigabit Ethernet
      - IBM (EMAC):
         - driver cleanup and modernization
      - Cisco (VIC):
         - raise the queues number limit to 256

   - Ethernet virtual:
      - Google vNIC:
         - implement page pool support
      - macsec:
         - inherit lower device's features and TSO limits when
           offloading
      - virtio_net:
         - enable premapped mode by default
         - support for XDP socket(AF_XDP) zerocopy TX
      - wireguard:
         - set the TSO max size to be GSO_MAX_SIZE, to aggregate larger
           packets.

   - Ethernet NICs embedded and virtual:
      - Broadcom ASP:
         - enable software timestamping
      - Freescale:
         - add enetc4 PF driver
      - MediaTek: Airoha SoC:
         - implement BQL support
      - RealTek r8169:
         - enable TSO by default on r8168/r8125
         - implement extended ethtool stats
      - Renesas AVB:
         - enable TX checksum offload
      - Synopsys (stmmac):
         - support header splitting for vlan tagged packets
         - move common code for DWMAC4 and DWXGMAC into a separate FPE
           module.
         - add dwmac driver support for T-HEAD TH1520 SoC
      - Synopsys (xpcs):
         - driver refactor and cleanup
      - TI:
         - icssg_prueth: add VLAN offload support
      - Xilinx emaclite:
         - add clock support

   - Ethernet switches:
      - Microchip:
         - implement support for the lan969x Ethernet switch family
         - add LAN9646 switch support to KSZ DSA driver

   - Ethernet PHYs:
      - Marvel: 88q2x: enable auto negotiation
      - Microchip: add support for LAN865X Rev B1 and LAN867X Rev C1/C2

   - PTP:
      - Add support for the Amazon virtual clock device
      - Add PtP driver for s390 clocks

   - WiFi:
      - mac80211
         - EHT 1024 aggregation size for transmissions
         - new operation to indicate that a new interface is to be added
         - support radio separation of multi-band devices
         - move wireless extension spy implementation to libiw
      - Broadcom:
         - brcmfmac: optional LPO clock support
      - Microchip:
         - add support for Atmel WILC3000
      - Qualcomm (ath12k):
         - firmware coredump collection support
         - add debugfs support for a multitude of statistics
      - Qualcomm (ath5k):
         -  Arcadyan ARV45XX AR2417 & Gigaset SX76[23] AR241[34]A support
      - Realtek:
         - rtw88: 8821au and 8812au USB adapters support
         - rtw89: add thermal protection
         - rtw89: fine tune BT-coexsitence to improve user experience
         - rtw89: firmware secure boot for WiFi 6 chip

   - Bluetooth
      - add Qualcomm WCN785x support for ids Foxconn 0xe0fc/0xe0f3 and
        0x13d3:0x3623
      - add Realtek RTL8852BE support for id Foxconn 0xe123
      - add MediaTek MT7920 support for wireless module ids
      - btintel_pcie: add handshake between driver and firmware
      - btintel_pcie: add recovery mechanism
      - btnxpuart: add GPIO support to power save feature"

* tag 'net-next-6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1475 commits)
  mm: page_frag: fix a compile error when kernel is not compiled
  Documentation: tipc: fix formatting issue in tipc.rst
  selftests: nic_performance: Add selftest for performance of NIC driver
  selftests: nic_link_layer: Add selftest case for speed and duplex states
  selftests: nic_link_layer: Add link layer selftest for NIC driver
  bnxt_en: Add FW trace coredump segments to the coredump
  bnxt_en: Add a new ethtool -W dump flag
  bnxt_en: Add 2 parameters to bnxt_fill_coredump_seg_hdr()
  bnxt_en: Add functions to copy host context memory
  bnxt_en: Do not free FW log context memory
  bnxt_en: Manage the FW trace context memory
  bnxt_en: Allocate backing store memory for FW trace logs
  bnxt_en: Add a 'force' parameter to bnxt_free_ctx_mem()
  bnxt_en: Refactor bnxt_free_ctx_mem()
  bnxt_en: Add mem_valid bit to struct bnxt_ctx_mem_type
  bnxt_en: Update firmware interface spec to 1.10.3.85
  selftests/bpf: Add some tests with sockmap SK_PASS
  bpf: fix recursive lock when verdict program return SK_PASS
  wireguard: device: support big tcp GSO
  wireguard: selftests: load nf_conntrack if not present
  ...
2024-11-21 08:28:08 -08:00
Chao Yu
3fc5d5a182 f2fs: fix to shrink read extent node in batches
We use rwlock to protect core structure data of extent tree during
its shrink, however, if there is a huge number of extent nodes in
extent tree, during shrink of extent tree, it may hold rwlock for
a very long time, which may trigger kernel hang issue.

This patch fixes to shrink read extent node in batches, so that,
critical region of the rwlock can be shrunk to avoid its extreme
long time hold.

Reported-by: Xiuhong Wang <xiuhong.wang@unisoc.com>
Closes: https://lore.kernel.org/linux-f2fs-devel/20241112110627.1314632-1-xiuhong.wang@unisoc.com/
Signed-off-by: Xiuhong Wang <xiuhong.wang@unisoc.com>
Signed-off-by: Zhiguo Niu <zhiguo.niu@unisoc.com>
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2024-11-21 16:16:54 +00:00
Chao Yu
81520c684c f2fs: print message if fscorrupted was found in f2fs_new_node_page()
If fs corruption occurs in f2fs_new_node_page(), let's print
more information about corrupted metadata into kernel log.

Meanwhile, it updates to record ERROR_INCONSISTENT_NAT instead
of ERROR_INVALID_BLKADDR if blkaddr in nat entry is not
NULL_ADDR which means nat bitmap and nat entry is inconsistent.

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2024-11-21 16:16:39 +00:00
Sheng Yong
f88c7904b5 f2fs: clear SBI_POR_DOING before initing inmem curseg
SBI_POR_DOING can be cleared after recovery is completed, so that
changes made before recovery can be persistent, and subsequent
errors can be recorded into cp/sb.

Signed-off-by: Song Feng <songfeng@oppo.com>
Signed-off-by: Yongpeng Yang <yangyongpeng1@oppo.com>
Signed-off-by: Sheng Yong <shengyong@oppo.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2024-11-21 16:14:07 +00:00
Sheng Yong
1015035609 f2fs: fix changing cursegs if recovery fails on zoned device
Fsync data recovery attempts to check and fix write pointer consistency
of cursegs and all other zones. If the write pointers of cursegs are
unaligned, cursegs are changed to new sections.

If recovery fails, zone write pointers are still checked and fixed,
but the latest checkpoint cannot be written back. Additionally, retry-
mount skips recovery and rolls back to reuse the old cursegs whose
zones are already finished. This can lead to unaligned write later.

This patch addresses the issue by leaving writer pointers untouched if
recovery fails. When retry-mount is performed, cursegs and other zones
are checked and fixed after skipping recovery.

Signed-off-by: Song Feng <songfeng@oppo.com>
Signed-off-by: Yongpeng Yang <yangyongpeng1@oppo.com>
Signed-off-by: Sheng Yong <shengyong@oppo.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2024-11-21 16:13:53 +00:00
Daeho Jeong
a35749b1ed f2fs: adjust unusable cap before checkpoint=disable mode
The unusable cap value must be adjusted before checking whether
checkpoint=disable is feasible.

Signed-off-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2024-11-21 16:12:30 +00:00
Chao Yu
6787a82245 f2fs: fix to requery extent which cross boundary of inquiry
dd if=/dev/zero of=file bs=4k count=5
xfs_io file -c "fiemap -v 2 16384"
file:
   EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
     0: [0..31]:         139272..139303      32 0x1000
     1: [32..39]:        139304..139311       8 0x1001
xfs_io file -c "fiemap -v 0 16384"
file:
   EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
     0: [0..31]:         139272..139303      32 0x1000
xfs_io file -c "fiemap -v 0 16385"
file:
   EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
     0: [0..39]:         139272..139311      40 0x1001

There are two problems:
- continuous extent is split to two
- FIEMAP_EXTENT_LAST is missing in last extent

The root cause is: if upper boundary of inquiry crosses extent,
f2fs_map_blocks() will truncate length of returned extent to
F2FS_BYTES_TO_BLK(len), and also, it will stop to query latter
extent or hole to make sure current extent is last or not.

In order to fix this issue, once we found an extent locates
in the end of inquiry range by f2fs_map_blocks(), we need to
expand inquiry range to requiry.

Cc: stable@vger.kernel.org
Fixes: 7f63eb77af7b ("f2fs: report unwritten area in f2fs_fiemap")
Reported-by: Zhiguo Niu <zhiguo.niu@unisoc.com>
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2024-11-21 16:12:30 +00:00
Zhiguo Niu
77569f785c f2fs: fix to adjust appropriate length for fiemap
If user give a file size as "length" parameter for fiemap
operations, but if this size is non-block size aligned,
it will show 2 segments fiemap results even this whole file
is contiguous on disk, such as the following results:

 ./f2fs_io fiemap 0 19034 ylog/analyzer.py
Fiemap: offset = 0 len = 19034
        logical addr.    physical addr.   length           flags
0       0000000000000000 0000000020baa000 0000000000004000 00001000
1       0000000000004000 0000000020bae000 0000000000001000 00001001

after this patch:
./f2fs_io fiemap 0 19034 ylog/analyzer.py
Fiemap: offset = 0 len = 19034
    logical addr.    physical addr.   length           flags
0    0000000000000000 00000000315f3000 0000000000005000 00001001

Signed-off-by: Zhiguo Niu <zhiguo.niu@unisoc.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2024-11-21 16:12:30 +00:00
Chao Yu
7461f37094 f2fs: clean up w/ F2FS_{BLK_TO_BYTES,BTYES_TO_BLK}
f2fs doesn't support different blksize in one instance, so
bytes_to_blks() and blks_to_bytes() are equal to F2FS_BYTES_TO_BLK
and F2FS_BLK_TO_BYTES, let's use F2FS_BYTES_TO_BLK/F2FS_BLK_TO_BYTES
instead for cleanup.

Reviewed-by: Zhiguo Niu <zhiguo.niu@unisoc.com>
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2024-11-21 16:12:29 +00:00
Chao Yu
3273d8ad94 f2fs: fix to do cast in F2FS_{BLK_TO_BYTES, BTYES_TO_BLK} to avoid overflow
It missed to cast variable to unsigned long long type before
bit shift, which will cause overflow, fix it.

Fixes: f7ef9b83b583 ("f2fs: introduce macros to convert bytes and blocks in f2fs")
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2024-11-21 16:12:29 +00:00
Daniel Yang
789ca0eb47 f2fs: replace deprecated strcpy with strscpy
strcpy is deprecated. Kernel docs recommend replacing strcpy with
strscpy. The function strcpy() return value isn't used so there
shouldn't be an issue replacing with the safer alternative strscpy.

Signed-off-by: Daniel Yang <danielyangkang@gmail.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2024-11-21 16:12:29 +00:00
Jaegeuk Kim
acff9409dd Revert "f2fs: remove unreachable lazytime mount option parsing"
This reverts commit 54f43a10fa257ad4af02a1d157fefef6ebcfa7dc.

The above commit broke the lazytime mount, given

mount("/dev/vdb", "/mnt/test", "f2fs", 0, "lazytime");

CC: stable@vger.kernel.org # 6.11+
Signed-off-by: Daniel Rosenberg <drosen@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2024-11-21 16:12:15 +00:00
Christian Brauner
3e5360167a
statmount: fix security option retrieval
Fix the inverted check for security_sb_show_options().

Link: https://lore.kernel.org/r/c8eaa647-5d67-49b6-9401-705afcb7e4d7@stanley.mountain
Link: https://lore.kernel.org/r/20241120-verehren-rhabarber-83a11b297bcc@brauner
Fixes: aefff51e1c29 ("statmount: retrieve security mount options")
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Cc: stable@vger.kernel.org # mainline only
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-11-21 09:35:31 +01:00
Miklos Szeredi
d18516a021
statmount: clean up unescaped option handling
Move common code from opt_array/opt_sec_array to helper.  This helper
does more than just unescape options, so rename to
statmount_opt_process().

Handle corner case of just a single character in options.

Rename some local variables to better describe their function.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Link: https://lore.kernel.org/r/20241120142732.55210-1-mszeredi@redhat.com
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-11-21 09:35:26 +01:00
Brian Foster
fde4c4c3ec
iomap: elide flush from partial eof zero range
iomap zero range flushes pagecache in certain situations to
determine which parts of the range might require zeroing if dirty
data is present in pagecache. The kernel robot recently reported a
regression associated with this flushing in the following stress-ng
workload on XFS:

stress-ng --timeout 60 --times --verify --metrics --no-rand-seed --metamix 64

This workload involves repeated small, strided, extending writes. On
XFS, this produces a pattern of post-eof speculative preallocation,
conversion of preallocation from delalloc to unwritten, dirtying
pagecache over newly unwritten blocks, and then rinse and repeat
from the new EOF. This leads to repetitive flushing of the EOF folio
via the zero range call XFS uses for writes that start beyond
current EOF.

To mitigate this problem, special case EOF block zeroing to prefer
zeroing the folio over a flush when the EOF folio is already dirty.
To do this, split out and open code handling of an unaligned start
offset. This brings most of the performance back by avoiding flushes
on zero range calls via write and truncate extension operations. The
flush doesn't occur in these situations because the entire range is
post-eof and therefore the folio that overlaps EOF is the only one
in the range.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Link: https://lore.kernel.org/r/20241115200155.593665-4-bfoster@redhat.com
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-11-21 09:35:25 +01:00
Brian Foster
889ac75787
iomap: lift zeroed mapping handling into iomap_zero_range()
In preparation for special handling of subranges, lift the zeroed
mapping logic from the iterator into the caller. Since this puts the
pagecache dirty check and flushing in the same place, streamline the
comments a bit as well.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Link: https://lore.kernel.org/r/20241115200155.593665-3-bfoster@redhat.com
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-11-21 09:35:25 +01:00
Brian Foster
2519369201
iomap: reset per-iter state on non-error iter advances
iomap_iter_advance() zeroes the processed and mapping fields on
every non-error iteration except for the last expected iteration
(i.e. return 0 expected to terminate the iteration loop). This
appears to be circumstantial as nothing currently relies on these
fields after the final iteration.

Therefore to better faciliate iomap_iter reuse in subsequent
patches, update iomap_iter_advance() to always reset per-iteration
state on successful completion.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Link: https://lore.kernel.org/r/20241115200155.593665-2-bfoster@redhat.com
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-11-21 09:35:25 +01:00
Thorsten Blum
a514e6f8f5
fscache: Remove duplicate included header
Remove duplicate included header file linux/uio.h

Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com>
Link: https://lore.kernel.org/r/20240628062329.321162-2-thorsten.blum@toblux.com
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-11-21 09:35:25 +01:00
Brian Foster
eb65540aa9
iomap: warn on zero range of a post-eof folio
iomap_zero_range() uses buffered writes for manual zeroing, no
longer updates i_size for such writes, but is still explicitly
called for post-eof ranges. The historical use case for this is
zeroing post-eof speculative preallocation on extending writes from
XFS. However, XFS also recently changed to convert all post-eof
delalloc mappings to unwritten in the iomap_begin() handler, which
means it now never expects manual zeroing of post-eof mappings. In
other words, all post-eof mappings should be reported as holes or
unwritten.

This is a subtle dependency that can be hard to detect if violated
because associated codepaths are likely to update i_size after folio
locks are dropped, but before writeback happens to occur. For
example, if XFS reverts back to some form of manual zeroing of
post-eof blocks on write extension, writeback of those zeroed folios
will now race with the presumed i_size update from the subsequent
buffered write.

Since iomap_zero_range() can't correctly zero post-eof mappings
beyond EOF without updating i_size, warn if this ever occurs. This
serves as minimal indication that if this use case is reintroduced
by a filesystem, iomap_zero_range() might need to reconsider i_size
updates for write extending use cases.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Link: https://lore.kernel.org/r/20241115145931.535207-1-bfoster@redhat.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-11-21 09:35:24 +01:00
Vasiliy Kovalev
c8b359dddb ovl: Filter invalid inodes with missing lookup function
Add a check to the ovl_dentry_weird() function to prevent the
processing of directory inodes that lack the lookup function.
This is important because such inodes can cause errors in overlayfs
when passed to the lowerstack.

Reported-by: syzbot+a8c9d476508bd14a90e5@syzkaller.appspotmail.com
Link: https://syzkaller.appspot.com/bug?extid=a8c9d476508bd14a90e5
Suggested-by: Miklos Szeredi <miklos@szeredi.hu>
Link: https://lore.kernel.org/linux-unionfs/CAJfpegvx-oS9XGuwpJx=Xe28_jzWx5eRo1y900_ZzWY+=gGzUg@mail.gmail.com/
Signed-off-by: Vasiliy Kovalev <kovalev@altlinux.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2024-11-20 10:23:04 +01:00
Linus Torvalds
bf9aa14fc5 A rather large update for timekeeping and timers:
- The final step to get rid of auto-rearming posix-timers
 
     posix-timers are currently auto-rearmed by the kernel when the signal
     of the timer is ignored so that the timer signal can be delivered once
     the corresponding signal is unignored.
 
     This requires to throttle the timer to prevent a DoS by small intervals
     and keeps the system pointlessly out of low power states for no value.
     This is a long standing non-trivial problem due to the lock order of
     posix-timer lock and the sighand lock along with life time issues as
     the timer and the sigqueue have different life time rules.
 
     Cure this by:
 
      * Embedding the sigqueue into the timer struct to have the same life
        time rules. Aside of that this also avoids the lookup of the timer
        in the signal delivery and rearm path as it's just a always valid
        container_of() now.
 
      * Queuing ignored timer signals onto a seperate ignored list.
 
      * Moving queued timer signals onto the ignored list when the signal is
        switched to SIG_IGN before it could be delivered.
 
      * Walking the ignored list when SIG_IGN is lifted and requeue the
        signals to the actual signal lists. This allows the signal delivery
        code to rearm the timer.
 
     This also required to consolidate the signal delivery rules so they are
     consistent across all situations. With that all self test scenarios
     finally succeed.
 
   - Core infrastructure for VFS multigrain timestamping
 
     This is required to allow the kernel to use coarse grained time stamps
     by default and switch to fine grained time stamps when inode attributes
     are actively observed via getattr().
 
     These changes have been provided to the VFS tree as well, so that the
     VFS specific infrastructure could be built on top.
 
   - Cleanup and consolidation of the sleep() infrastructure
 
     * Move all sleep and timeout functions into one file
 
     * Rework udelay() and ndelay() into proper documented inline functions
       and replace the hardcoded magic numbers by proper defines.
 
     * Rework the fsleep() implementation to take the reality of the timer
       wheel granularity on different HZ values into account. Right now the
       boundaries are hard coded time ranges which fail to provide the
       requested accuracy on different HZ settings.
 
     * Update documentation for all sleep/timeout related functions and fix
       up stale documentation links all over the place
 
     * Fixup a few usage sites
 
   - Rework of timekeeping and adjtimex(2) to prepare for multiple PTP clocks
 
     A system can have multiple PTP clocks which are participating in
     seperate and independent PTP clock domains. So far the kernel only
     considers the PTP clock which is based on CLOCK TAI relevant as that's
     the clock which drives the timekeeping adjustments via the various user
     space daemons through adjtimex(2).
 
     The non TAI based clock domains are accessible via the file descriptor
     based posix clocks, but their usability is very limited. They can't be
     accessed fast as they always go all the way out to the hardware and
     they cannot be utilized in the kernel itself.
 
     As Time Sensitive Networking (TSN) gains traction it is required to
     provide fast user and kernel space access to these clocks.
 
     The approach taken is to utilize the timekeeping and adjtimex(2)
     infrastructure to provide this access in a similar way how the kernel
     provides access to clock MONOTONIC, REALTIME etc.
 
     Instead of creating a duplicated infrastructure this rework converts
     timekeeping and adjtimex(2) into generic functionality which operates
     on pointers to data structures instead of using static variables.
 
     This allows to provide time accessors and adjtimex(2) functionality for
     the independent PTP clocks in a subsequent step.
 
   - Consolidate hrtimer initialization
 
     hrtimers are set up by initializing the data structure and then
     seperately setting the callback function for historical reasons.
 
     That's an extra unnecessary step and makes Rust support less straight
     forward than it should be.
 
     Provide a new set of hrtimer_setup*() functions and convert the core
     code and a few usage sites of the less frequently used interfaces over.
 
     The bulk of the htimer_init() to hrtimer_setup() conversion is already
     prepared and scheduled for the next merge window.
 
   - Drivers:
 
     * Ensure that the global timekeeping clocksource is utilizing the
       cluster 0 timer on MIPS multi-cluster systems.
 
       Otherwise CPUs on different clusters use their cluster specific
       clocksource which is not guaranteed to be synchronized with other
       clusters.
 
     * Mostly boring cleanups, fixes, improvements and code movement
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmc7kPITHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoZKkD/9OUL6fOJrDUmOYBa4QVeMyfTef4EaL
 tvwIMM/29XQFeiq3xxCIn+EMnHjXn2lvIhYGQ7GKsbKYwvJ7ZBDpQb+UMhZ2nKI9
 6D6BP6WomZohKeH2fZbJQAdqOi3KRYdvQdIsVZUexkqiaVPphRvOH9wOr45gHtZM
 EyMRSotPlQTDqcrbUejDMEO94GyjDCYXRsyATLxjmTzL/N4xD4NRIiotjM2vL/a9
 8MuCgIhrKUEyYlFoOxxeokBsF3kk3/ez2jlG9b/N8VLH3SYIc2zgL58FBgWxlmgG
 bY71nVG3nUgEjxBd2dcXAVVqvb+5widk8p6O7xxOAQKTLMcJ4H0tQDkMnzBtUzvB
 DGAJDHAmAr0g+ja9O35Pkhunkh4HYFIbq0Il4d1HMKObhJV0JumcKuQVxrXycdm3
 UZfq3seqHsZJQbPgCAhlFU0/2WWScocbee9bNebGT33KVwSp5FoVv89C/6Vjb+vV
 Gusc3thqrQuMAZW5zV8g4UcBAA/xH4PB0I+vHib+9XPZ4UQ7/6xKl2jE0kd5hX7n
 AAUeZvFNFqIsY+B6vz+Jx/yzyM7u5cuXq87pof5EHVFzv56lyTp4ToGcOGYRgKH5
 JXeYV1OxGziSDrd5vbf9CzdWMzqMvTefXrHbWrjkjhNOe8E1A8O88RZ5uRKZhmSw
 hZZ4hdM9+3T7cg==
 =2VC6
 -----END PGP SIGNATURE-----

Merge tag 'timers-core-2024-11-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer updates from Thomas Gleixner:
 "A rather large update for timekeeping and timers:

   - The final step to get rid of auto-rearming posix-timers

     posix-timers are currently auto-rearmed by the kernel when the
     signal of the timer is ignored so that the timer signal can be
     delivered once the corresponding signal is unignored.

     This requires to throttle the timer to prevent a DoS by small
     intervals and keeps the system pointlessly out of low power states
     for no value. This is a long standing non-trivial problem due to
     the lock order of posix-timer lock and the sighand lock along with
     life time issues as the timer and the sigqueue have different life
     time rules.

     Cure this by:

       - Embedding the sigqueue into the timer struct to have the same
         life time rules. Aside of that this also avoids the lookup of
         the timer in the signal delivery and rearm path as it's just a
         always valid container_of() now.

       - Queuing ignored timer signals onto a seperate ignored list.

       - Moving queued timer signals onto the ignored list when the
         signal is switched to SIG_IGN before it could be delivered.

       - Walking the ignored list when SIG_IGN is lifted and requeue the
         signals to the actual signal lists. This allows the signal
         delivery code to rearm the timer.

     This also required to consolidate the signal delivery rules so they
     are consistent across all situations. With that all self test
     scenarios finally succeed.

   - Core infrastructure for VFS multigrain timestamping

     This is required to allow the kernel to use coarse grained time
     stamps by default and switch to fine grained time stamps when inode
     attributes are actively observed via getattr().

     These changes have been provided to the VFS tree as well, so that
     the VFS specific infrastructure could be built on top.

   - Cleanup and consolidation of the sleep() infrastructure

       - Move all sleep and timeout functions into one file

       - Rework udelay() and ndelay() into proper documented inline
         functions and replace the hardcoded magic numbers by proper
         defines.

       - Rework the fsleep() implementation to take the reality of the
         timer wheel granularity on different HZ values into account.
         Right now the boundaries are hard coded time ranges which fail
         to provide the requested accuracy on different HZ settings.

       - Update documentation for all sleep/timeout related functions
         and fix up stale documentation links all over the place

       - Fixup a few usage sites

   - Rework of timekeeping and adjtimex(2) to prepare for multiple PTP
     clocks

     A system can have multiple PTP clocks which are participating in
     seperate and independent PTP clock domains. So far the kernel only
     considers the PTP clock which is based on CLOCK TAI relevant as
     that's the clock which drives the timekeeping adjustments via the
     various user space daemons through adjtimex(2).

     The non TAI based clock domains are accessible via the file
     descriptor based posix clocks, but their usability is very limited.
     They can't be accessed fast as they always go all the way out to
     the hardware and they cannot be utilized in the kernel itself.

     As Time Sensitive Networking (TSN) gains traction it is required to
     provide fast user and kernel space access to these clocks.

     The approach taken is to utilize the timekeeping and adjtimex(2)
     infrastructure to provide this access in a similar way how the
     kernel provides access to clock MONOTONIC, REALTIME etc.

     Instead of creating a duplicated infrastructure this rework
     converts timekeeping and adjtimex(2) into generic functionality
     which operates on pointers to data structures instead of using
     static variables.

     This allows to provide time accessors and adjtimex(2) functionality
     for the independent PTP clocks in a subsequent step.

   - Consolidate hrtimer initialization

     hrtimers are set up by initializing the data structure and then
     seperately setting the callback function for historical reasons.

     That's an extra unnecessary step and makes Rust support less
     straight forward than it should be.

     Provide a new set of hrtimer_setup*() functions and convert the
     core code and a few usage sites of the less frequently used
     interfaces over.

     The bulk of the htimer_init() to hrtimer_setup() conversion is
     already prepared and scheduled for the next merge window.

   - Drivers:

       - Ensure that the global timekeeping clocksource is utilizing the
         cluster 0 timer on MIPS multi-cluster systems.

         Otherwise CPUs on different clusters use their cluster specific
         clocksource which is not guaranteed to be synchronized with
         other clusters.

       - Mostly boring cleanups, fixes, improvements and code movement"

* tag 'timers-core-2024-11-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (140 commits)
  posix-timers: Fix spurious warning on double enqueue versus do_exit()
  clocksource/drivers/arm_arch_timer: Use of_property_present() for non-boolean properties
  clocksource/drivers/gpx: Remove redundant casts
  clocksource/drivers/timer-ti-dm: Fix child node refcount handling
  dt-bindings: timer: actions,owl-timer: convert to YAML
  clocksource/drivers/ralink: Add Ralink System Tick Counter driver
  clocksource/drivers/mips-gic-timer: Always use cluster 0 counter as clocksource
  clocksource/drivers/timer-ti-dm: Don't fail probe if int not found
  clocksource/drivers:sp804: Make user selectable
  clocksource/drivers/dw_apb: Remove unused dw_apb_clockevent functions
  hrtimers: Delete hrtimer_init_on_stack()
  alarmtimer: Switch to use hrtimer_setup() and hrtimer_setup_on_stack()
  io_uring: Switch to use hrtimer_setup_on_stack()
  sched/idle: Switch to use hrtimer_setup_on_stack()
  hrtimers: Delete hrtimer_init_sleeper_on_stack()
  wait: Switch to use hrtimer_setup_sleeper_on_stack()
  timers: Switch to use hrtimer_setup_sleeper_on_stack()
  net: pktgen: Switch to use hrtimer_setup_sleeper_on_stack()
  futex: Switch to use hrtimer_setup_sleeper_on_stack()
  fs/aio: Switch to use hrtimer_setup_sleeper_on_stack()
  ...
2024-11-19 16:35:06 -08:00
Linus Torvalds
5c2b050848 A set of updates for the interrupt subsystem:
- Tree wide:
 
     * Make nr_irqs static to the core code and provide accessor functions
       to remove existing and prevent future aliasing problems with local
       variables or function arguments of the same name.
 
   - Core code:
 
     * Prevent freeing an interrupt in the devres code which is not managed
       by devres in the first place.
 
     * Use seq_put_decimal_ull_width() for decimal values output in
       /proc/interrupts which increases performance significantly as it
       avoids parsing the format strings over and over.
 
     * Optimize raising the timer and hrtimer soft interrupts by using the
       'set bit only' variants instead of the combined version which checks
       whether ksoftirqd should be woken up. The latter is a pointless
       exercise as both soft interrupts are raised in the context of the
       timer interrupt and therefore never wake up ksoftirqd.
 
     * Delegate timer/hrtimer soft interrupt processing to a dedicated thread
       on RT.
 
       Timer and hrtimer soft interrupts are always processed in ksoftirqd
       on RT enabled kernels. This can lead to high latencies when other
       soft interrupts are delegated to ksoftirqd as well.
 
       The separate thread allows to run them seperately under a RT
       scheduling policy to reduce the latency overhead.
 
   - Drivers:
 
     * New drivers or extensions of existing drivers to support Renesas
       RZ/V2H(P), Aspeed AST27XX, T-HEAD C900 and ATMEL sam9x7 interrupt
       chips
 
     * Support for multi-cluster GICs on MIPS.
 
       MIPS CPUs can come with multiple CPU clusters, where each CPU cluster
       has its own GIC (Generic Interrupt Controller). This requires to
       access the GIC of a remote cluster through a redirect register block.
 
       This is encapsulated into a set of helper functions to keep the
       complexity out of the actual code paths which handle the GIC details.
 
     * Support for encrypted guests in the ARM GICV3 ITS driver
 
       The ITS page needs to be shared with the hypervisor and therefore
       must be decrypted.
 
     * Small cleanups and fixes all over the place
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmc7ggcTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoaf7D/9G6FgJXx/60zqnpnOr9Yx0hxjaI47x
 PFyCd3P05qyVMBYXfI99vrSKuVdMZXJ/fH5L83y+sOaTASyLTzg37igZycIDJzLI
 FnHh/m/+UA8k2aIC5VUiNAjne2RLaTZiRN15uEHFVjByC5Y+YTlCNUE4BBhg5RfQ
 hKmskeffWdtui3ou13CSNvbFn+pmqi4g6n1ysUuLhiwM2E5b1rZMprcCOnun/cGP
 IdUQsODNWTTv9eqPJez985M6A1x2SCGNv7Z73h58B9N0pBRPEC1xnhUnCJ1sA0cJ
 pnfde2C1lztEjYbwDngy0wgq0P6LINjQ5Ma2YY2F2hTMsXGJxGPDZm24/u5uR46x
 N/gsOQMXqw6f5yvbiS7Asx9WzR6ry8rJl70QRgTyozz7xxJTaiNm2HqVFe2wc+et
 Q/BzaKdhmUJj1GMZmqD2rrgwYeDcb4wWYNtwjM4PVHHxYlJVq0mEF1kLLS8YDyjf
 HuGPVqtSkt3E0+Br3FKcv5ltUQP8clXbudc6L1u98YBfNK12hW8L+c3YSvIiFoYM
 ZOAeANPM7VtQbP2Jg2q81Dd3CShImt5jqL2um+l8g7+mUE7l9gyuO/w/a5dQ57+b
 kx7mHHIW2zCeHrkZZbRUYzI2BJfMCCOVN4Ax5OZxTLnLsL9VEehy8NM8QYT4TS8R
 XmTOYW3U9XR3gw==
 =JqxC
 -----END PGP SIGNATURE-----

Merge tag 'irq-core-2024-11-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull interrupt subsystem updates from Thomas Gleixner:
 "Tree wide:

   - Make nr_irqs static to the core code and provide accessor functions
     to remove existing and prevent future aliasing problems with local
     variables or function arguments of the same name.

  Core code:

   - Prevent freeing an interrupt in the devres code which is not
     managed by devres in the first place.

   - Use seq_put_decimal_ull_width() for decimal values output in
     /proc/interrupts which increases performance significantly as it
     avoids parsing the format strings over and over.

   - Optimize raising the timer and hrtimer soft interrupts by using the
     'set bit only' variants instead of the combined version which
     checks whether ksoftirqd should be woken up. The latter is a
     pointless exercise as both soft interrupts are raised in the
     context of the timer interrupt and therefore never wake up
     ksoftirqd.

   - Delegate timer/hrtimer soft interrupt processing to a dedicated
     thread on RT.

     Timer and hrtimer soft interrupts are always processed in ksoftirqd
     on RT enabled kernels. This can lead to high latencies when other
     soft interrupts are delegated to ksoftirqd as well.

     The separate thread allows to run them seperately under a RT
     scheduling policy to reduce the latency overhead.

  Drivers:

   - New drivers or extensions of existing drivers to support Renesas
     RZ/V2H(P), Aspeed AST27XX, T-HEAD C900 and ATMEL sam9x7 interrupt
     chips

   - Support for multi-cluster GICs on MIPS.

     MIPS CPUs can come with multiple CPU clusters, where each CPU
     cluster has its own GIC (Generic Interrupt Controller). This
     requires to access the GIC of a remote cluster through a redirect
     register block.

     This is encapsulated into a set of helper functions to keep the
     complexity out of the actual code paths which handle the GIC
     details.

   - Support for encrypted guests in the ARM GICV3 ITS driver

     The ITS page needs to be shared with the hypervisor and therefore
     must be decrypted.

   - Small cleanups and fixes all over the place"

* tag 'irq-core-2024-11-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (50 commits)
  irqchip/riscv-aplic: Prevent crash when MSI domain is missing
  genirq/proc: Use seq_put_decimal_ull_width() for decimal values
  softirq: Use a dedicated thread for timer wakeups on PREEMPT_RT.
  timers: Use __raise_softirq_irqoff() to raise the softirq.
  hrtimer: Use __raise_softirq_irqoff() to raise the softirq
  riscv: defconfig: Enable T-HEAD C900 ACLINT SSWI drivers
  irqchip: Add T-HEAD C900 ACLINT SSWI driver
  dt-bindings: interrupt-controller: Add T-HEAD C900 ACLINT SSWI device
  irqchip/stm32mp-exti: Use of_property_present() for non-boolean properties
  irqchip/mips-gic: Fix selection of GENERIC_IRQ_EFFECTIVE_AFF_MASK
  irqchip/mips-gic: Prevent indirect access to clusters without CPU cores
  irqchip/mips-gic: Multi-cluster support
  irqchip/mips-gic: Setup defaults in each cluster
  irqchip/mips-gic: Support multi-cluster in for_each_online_cpu_gic()
  irqchip/mips-gic: Replace open coded online CPU iterations
  genirq/irqdesc: Use str_enabled_disabled() helper in wakeup_show()
  genirq/devres: Don't free interrupt which is not managed by devres
  irqchip/gic-v3-its: Fix over allocation in itt_alloc_pool()
  irqchip/aspeed-intc: Add AST27XX INTC support
  dt-bindings: interrupt-controller: Add support for ASPEED AST27XX INTC
  ...
2024-11-19 15:54:19 -08:00
Linus Torvalds
3f020399e4 Scheduler changes for v6.13:
- Core facilities:
 
     - Add the "Lazy preemption" model (CONFIG_PREEMPT_LAZY=y), which optimizes
       fair-class preemption by delaying preemption requests to the
       tick boundary, while working as full preemption for RR/FIFO/DEADLINE
       classes. (Peter Zijlstra)
 
         - x86: Enable Lazy preemption (Peter Zijlstra)
         - riscv: Enable Lazy preemption (Jisheng Zhang)
 
     - Initialize idle tasks only once (Thomas Gleixner)
 
     - sched/ext: Remove sched_fork() hack (Thomas Gleixner)
 
  - Fair scheduler:
     - Optimize the PLACE_LAG when se->vlag is zero (Huang Shijie)
 
  - Idle loop:
       Optimize the generic idle loop by removing unnecessary
       memory barrier (Zhongqiu Han)
 
  - RSEQ:
     - Improve cache locality of RSEQ concurrency IDs for
       intermittent workloads (Mathieu Desnoyers)
 
  - Waitqueues:
     - Make wake_up_{bit,var} less fragile (Neil Brown)
 
  - PSI:
     - Pass enqueue/dequeue flags to psi callbacks directly (Johannes Weiner)
 
  - Preparatory patches for proxy execution:
     - core: Add move_queued_task_locked helper (Connor O'Brien)
     - core: Consolidate pick_*_task to task_is_pushable helper (Connor O'Brien)
     - core: Split out __schedule() deactivate task logic into a helper (John Stultz)
     - core: Split scheduler and execution contexts (Peter Zijlstra)
     - locking/mutex: Make mutex::wait_lock irq safe (Juri Lelli)
     - locking/mutex: Expose __mutex_owner() (Juri Lelli)
     - locking/mutex: Remove wakeups from under mutex::wait_lock (Peter Zijlstra)
 
  - Misc fixes and cleanups:
     - core: Remove unused __HAVE_THREAD_FUNCTIONS hook support (David Disseldorp)
     - core: Update the comment for TIF_NEED_RESCHED_LAZY (Sebastian Andrzej Siewior)
     - wait: Remove unused bit_wait_io_timeout (Dr. David Alan Gilbert)
     - fair: remove the DOUBLE_TICK feature (Huang Shijie)
     - fair: fix the comment for PREEMPT_SHORT (Huang Shijie)
     - uclamp: Fix unnused variable warning (Christian Loehle)
     - rt: No PREEMPT_RT=y for all{yes,mod}config
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmc7fnQRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1hZTBAAozVdWA2m51aNa67HvAZta/olmrIagVbW
 inwbTgqa8b+UfeWEuKOfrZr5khjEh6pLgR3dBTib1uH6xxYj/Okds+qbPWSBPVLh
 yzavlm/zJZM1U1XtxE3eyVfqWik4GrY7DoIMDQQr+YH7rNXonJeJkll38OI2E5MC
 q3Q01qyMo8RJJX8qkf3f8ObOoP/51NsVniTw0Zb2fzEhXz8FjezLlxk6cMfgSkJG
 lg9gfIwUZ7Xg5neRo4kJcc3Ht31KYOhWSiupBJzRD1hss/N/AybvMcTX/Cm8d07w
 HIAdDDAn84o46miFo/a0V/hsJZ72idWbqxVJUCtaezrpOUiFkG+uInRvG/ynr0lF
 5dEI9f+6PUw8Nc7L72IyHkobjPqS2IefSaxYYCBKmxMX2qrenfTor/pKiWzzhBIl
 rX3MZSuUJ8NjV4rNGD/qXRM1IsMJrsDwxDyv+sRec3XdH33x286ds6aAUEPDQ6N7
 96VS0sOKcNUJN8776ErNjlIxRl8HTlpkaO3nZlQIfXgTlXUpRvOuKbEWqP+606lo
 oANgJTKgUhgJPWZnvmdRxDjSiOp93QcImjus9i1tN81FGiEDleONsJUxu2Di1E5+
 s1nCiytjq+cdvzCqFyiOZUh+g6kSZ4yXxNgLg2UvbXzX1zOeUQT3WtyKUhMPXhU8
 esh1TgbUbpE=
 =Zcqj
 -----END PGP SIGNATURE-----

Merge tag 'sched-core-2024-11-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler updates from Ingo Molnar:
 "Core facilities:

   - Add the "Lazy preemption" model (CONFIG_PREEMPT_LAZY=y), which
     optimizes fair-class preemption by delaying preemption requests to
     the tick boundary, while working as full preemption for
     RR/FIFO/DEADLINE classes. (Peter Zijlstra)
        - x86: Enable Lazy preemption (Peter Zijlstra)
        - riscv: Enable Lazy preemption (Jisheng Zhang)

   - Initialize idle tasks only once (Thomas Gleixner)

   - sched/ext: Remove sched_fork() hack (Thomas Gleixner)

  Fair scheduler:

   - Optimize the PLACE_LAG when se->vlag is zero (Huang Shijie)

  Idle loop:

   - Optimize the generic idle loop by removing unnecessary memory
     barrier (Zhongqiu Han)

  RSEQ:

   - Improve cache locality of RSEQ concurrency IDs for intermittent
     workloads (Mathieu Desnoyers)

  Waitqueues:

   - Make wake_up_{bit,var} less fragile (Neil Brown)

  PSI:

   - Pass enqueue/dequeue flags to psi callbacks directly (Johannes
     Weiner)

  Preparatory patches for proxy execution:

   - Add move_queued_task_locked helper (Connor O'Brien)

   - Consolidate pick_*_task to task_is_pushable helper (Connor O'Brien)

   - Split out __schedule() deactivate task logic into a helper (John
     Stultz)

   - Split scheduler and execution contexts (Peter Zijlstra)

   - Make mutex::wait_lock irq safe (Juri Lelli)

   - Expose __mutex_owner() (Juri Lelli)

   - Remove wakeups from under mutex::wait_lock (Peter Zijlstra)

  Misc fixes and cleanups:

   - Remove unused __HAVE_THREAD_FUNCTIONS hook support (David
     Disseldorp)

   - Update the comment for TIF_NEED_RESCHED_LAZY (Sebastian Andrzej
     Siewior)

   - Remove unused bit_wait_io_timeout (Dr. David Alan Gilbert)

   - remove the DOUBLE_TICK feature (Huang Shijie)

   - fix the comment for PREEMPT_SHORT (Huang Shijie)

   - Fix unnused variable warning (Christian Loehle)

   - No PREEMPT_RT=y for all{yes,mod}config"

* tag 'sched-core-2024-11-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits)
  sched, x86: Update the comment for TIF_NEED_RESCHED_LAZY.
  sched: No PREEMPT_RT=y for all{yes,mod}config
  riscv: add PREEMPT_LAZY support
  sched, x86: Enable Lazy preemption
  sched: Enable PREEMPT_DYNAMIC for PREEMPT_RT
  sched: Add Lazy preemption model
  sched: Add TIF_NEED_RESCHED_LAZY infrastructure
  sched/ext: Remove sched_fork() hack
  sched: Initialize idle tasks only once
  sched: psi: pass enqueue/dequeue flags to psi callbacks directly
  sched/uclamp: Fix unnused variable warning
  sched: Split scheduler and execution contexts
  sched: Split out __schedule() deactivate task logic into a helper
  sched: Consolidate pick_*_task to task_is_pushable helper
  sched: Add move_queued_task_locked helper
  locking/mutex: Expose __mutex_owner()
  locking/mutex: Make mutex::wait_lock irq safe
  locking/mutex: Remove wakeups from under mutex::wait_lock
  sched: Improve cache locality of RSEQ concurrency IDs for intermittent workloads
  sched: idle: Optimize the generic idle loop by removing needless memory barrier
  ...
2024-11-19 14:16:06 -08:00
Linus Torvalds
8a7fa81137 Random number generator updates for Linux 6.13-rc1.
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEq5lC5tSkz8NBJiCnSfxwEqXeA64FAmc6oE0ACgkQSfxwEqXe
 A65n5BAAtNmfBJhYRiC6Svsg7+ktHmhCAHoHwnP7sv+bjs81FRAEv21CsfI+02Nb
 zUvaPuyiLtYzlWxzE5Yg44v1cADHAq+QZE1Fg5yl7ge6zPZ3+S1pv/8suNSyyI2M
 PKvh1sb4OkUtqplveYSuP1J87u55zAtV9mP9qC3hSlY3XkeQUObt9Awss8peOMdv
 sH2AxwBlRkqFXpY2worxlfg3p5iLemb3AUZ3f0Jc6fRmOagSJCt7i4mDrWo3EXke
 90Ao8ypY0x3YVGRFACHnxCS53X20HGwLxm7jdicfriMCzAJ6JQR6asO+NYnXR+Ev
 9Za3UquVHP6HbQGWj6d1k5k2nF+IbkTHTgFBPRK/CY9ZpVbP04B2K7tE1gmT81wj
 AscRGi9RBVBPKAUguyi99MXYlprFG/ZTLOux3hvdarv5u0bP94eXmy1FrRM+IO0r
 u4BiQ39FlkDdtRxjzKfCiKkMrf3NmFEciZJhxCnflzmOBaj64r1hRt/ea8Bjxvp3
 a4k0MfULmcEn2JwPiT1/Swz45ypZQc4OgbP87SCU8P0a23r21r2oK+9v3No/rCzB
 TI0fP6ykDTFQoiKUOSg1mJmkipdjeDyQ9E+0XIDsKd+T8Yv9rFoaV6RWoMrkt4AJ
 Yea9+V+XEI8F3SjhdD4OL/s3/+bjTjnRHDaXnJf2XzGmXcuvnbs=
 =o4ww
 -----END PGP SIGNATURE-----

Merge tag 'random-6.13-rc1-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random

Pull random number generator updates from Jason Donenfeld:
 "This contains a single series from Uros to replace uses of
  <linux/random.h> with prandom.h or other more specific headers
  as needed, in order to avoid a circular header issue.

  Uros' goal is to be able to use percpu.h from prandom.h, which
  will then allow him to define __percpu in percpu.h rather than
  in compiler_types.h"

* tag 'random-6.13-rc1-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random:
  prandom: Include <linux/percpu.h> in <linux/prandom.h>
  random: Do not include <linux/prandom.h> in <linux/random.h>
  netem: Include <linux/prandom.h> in sch_netem.c
  lib/test_scanf: Include <linux/prandom.h> instead of <linux/random.h>
  lib/test_parman: Include <linux/prandom.h> instead of <linux/random.h>
  bpf/tests: Include <linux/prandom.h> instead of <linux/random.h>
  lib/rbtree-test: Include <linux/prandom.h> instead of <linux/random.h>
  random32: Include <linux/prandom.h> instead of <linux/random.h>
  kunit: string-stream-test: Include <linux/prandom.h>
  lib/interval_tree_test.c: Include <linux/prandom.h> instead of <linux/random.h>
  bpf: Include <linux/prandom.h> instead of <linux/random.h>
  scsi: libfcoe: Include <linux/prandom.h> instead of <linux/random.h>
  fscrypt: Include <linux/once.h> in fs/crypto/keyring.c
  mtd: tests: Include <linux/prandom.h> instead of <linux/random.h>
  media: vivid: Include <linux/prandom.h> in vivid-vid-cap.c
  drm/lib: Include <linux/prandom.h> instead of <linux/random.h>
  drm/i915/selftests: Include <linux/prandom.h> instead of <linux/random.h>
  crypto: testmgr: Include <linux/prandom.h> instead of <linux/random.h>
  x86/kaslr: Include <linux/prandom.h> instead of <linux/random.h>
2024-11-19 10:43:44 -08:00
Andreas Gruenbacher
ffd1cf0443 gfs2: Prevent inode creation race
When a request to evict an inode comes in over the network, we are
trying to grab an inode reference via the iopen glock's gl_object
pointer.  There is a very small probability that by the time such a
request comes in, inode creation hasn't completed and the I_NEW flag is
still set.  To deal with that, wait for the inode and then check if
inode creation was successful.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2024-11-19 13:05:41 +01:00
Andreas Gruenbacher
c5b7a2400e gfs2: Only defer deletes when we have an iopen glock
The mechanism to defer deleting unlinked inodes is tied to
delete_work_func(), which is tied to iopen glocks.  When we don't have
an iopen glock, we must carry out deletes immediately instead.

Fixes a NULL pointer dereference in gfs2_evict_inode().

Fixes: 8c21c2c71e66 ("gfs2: Call gfs2_queue_verify_delete from gfs2_evict_inode")
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2024-11-19 12:33:20 +01:00
Patrick Donnelly
8ea412e181 ceph: improve caps debugging output
This improves uniformity and exposes important sequence numbers.

Now looks like:

    <7>[   73.749563] ceph:           caps.c:4465 : [c9653bca-110b-4f70-9f84-5a195b205e9a 15290]  caps mds2 op export ino 20000000000.fffffffffffffffe inode 0000000008d2e5ea seq 0 iseq 0 mseq 0
    ...
    <7>[   73.749574] ceph:           caps.c:4102 : [c9653bca-110b-4f70-9f84-5a195b205e9a 15290]  cap 20000000000.fffffffffffffffe export to peer 1 piseq 1 pmseq 1
    ...
    <7>[   73.749645] ceph:           caps.c:4465 : [c9653bca-110b-4f70-9f84-5a195b205e9a 15290]  caps mds1 op import ino 20000000000.fffffffffffffffe inode 0000000008d2e5ea seq 1 iseq 1 mseq 1
    ...
    <7>[   73.749681] ceph:           caps.c:4244 : [c9653bca-110b-4f70-9f84-5a195b205e9a 15290]  cap 20000000000.fffffffffffffffe import from peer 2 piseq 686 pmseq 0
    ...
    <7>[  248.645596] ceph:           caps.c:4465 : [c9653bca-110b-4f70-9f84-5a195b205e9a 15290]  caps mds1 op revoke ino 20000000000.fffffffffffffffe inode 0000000008d2e5ea seq 2538 iseq 1 mseq 1

See also ceph.git commit cb4ff28af09f ("mds: add issue_seq to all cap
messages").

Link: https://tracker.ceph.com/issues/66704
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-11-19 11:47:16 +01:00
Patrick Donnelly
8b41ac43c7 ceph: correct ceph_mds_cap_peer field name
The peer seq is used as the issue_seq. Use that name for consistency.
See also ceph.git commit 1da6ef237fc7 ("include/ceph_fs: correct
ceph_mds_cap_peer field name").

Link: https://tracker.ceph.com/issues/66704
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-11-19 11:08:17 +01:00
Patrick Donnelly
50f42c4895 ceph: correct ceph_mds_cap_item field name
The issue_seq is sent with bulk cap releases, not the current sequence
number. See also ceph.git commit 655cddb7c9f3 ("include/ceph_fs: correct
ceph_mds_cap_item field name").

Link: https://tracker.ceph.com/issues/66704
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-11-19 11:08:17 +01:00
Linus Torvalds
ba1f9c8fe3 arm64 updates for 6.13:
* Support for running Linux in a protected VM under the Arm Confidential
   Compute Architecture (CCA)
 
 * Guarded Control Stack user-space support. Current patches follow the
   x86 ABI of implicitly creating a shadow stack on clone(). Subsequent
   patches (already on the list) will add support for clone3() allowing
   finer-grained control of the shadow stack size and placement from libc
 
 * AT_HWCAP3 support (not running out of HWCAP2 bits yet but we are
   getting close with the upcoming dpISA support)
 
 * Other arch features:
 
   - In-kernel use of the memcpy instructions, FEAT_MOPS (previously only
     exposed to user; uaccess support not merged yet)
 
   - MTE: hugetlbfs support and the corresponding kselftests
 
   - Optimise CRC32 using the PMULL instructions
 
   - Support for FEAT_HAFT enabling ARCH_HAS_NONLEAF_PMD_YOUNG
 
   - Optimise the kernel TLB flushing to use the range operations
 
   - POE/pkey (permission overlays): further cleanups after bringing the
     signal handler in line with the x86 behaviour for 6.12
 
 * arm64 perf updates:
 
   - Support for the NXP i.MX91 PMU in the existing IMX driver
 
   - Support for Ampere SoCs in the Designware PCIe PMU driver
 
   - Support for Marvell's 'PEM' PCIe PMU present in the 'Odyssey' SoC
 
   - Support for Samsung's 'Mongoose' CPU PMU
 
   - Support for PMUv3.9 finer-grained userspace counter access control
 
   - Switch back to platform_driver::remove() now that it returns 'void'
 
   - Add some missing events for the CXL PMU driver
 
 * Miscellaneous arm64 fixes/cleanups:
 
   - Page table accessors cleanup: type updates, drop unused macros,
     reorganise arch_make_huge_pte() and clean up pte_mkcont(), sanity
     check addresses before runtime P4D/PUD folding
 
   - Command line override for ID_AA64MMFR0_EL1.ECV (advertising the
     FEAT_ECV for the generic timers) allowing Linux to boot with
     firmware deployments that don't set SCTLR_EL3.ECVEn
 
   - ACPI/arm64: tighten the check for the array of platform timer
     structures and adjust the error handling procedure in
     gtdt_parse_timer_block()
 
   - Optimise the cache flush for the uprobes xol slot (skip if no
     change) and other uprobes/kprobes cleanups
 
   - Fix the context switching of tpidrro_el0 when kpti is enabled
 
   - Dynamic shadow call stack fixes
 
   - Sysreg updates
 
   - Various arm64 kselftest improvements
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE5RElWfyWxS+3PLO2a9axLQDIXvEFAmc5POIACgkQa9axLQDI
 XvEDYA//a3eeNkgMuGdnSCVcLz+zy+oNwAwboG/4X1DqL8jiCbI4npwugPx95RIA
 YZOUvo9T2aL3OyefpUHll4gFHqx9OwoZIig2F70TEUmlPsGUbh0KBkdfQF3xZPdl
 EwV0kHSGEqMWMBwsGJGwgCYrUaf1MUQzh1GBl7VJ2ts5XsJBaBeOyKkysij26wtZ
 V+aHq2IUx7qQS7+HC/4P6IoHxKziFcsCMovaKaynP4cw9xXBQbDMcNlHEwndOMyk
 pu2zrv7GG0j3KQuVP/2Alf5FKhmI0GVGP/6Nc/zsOmw96w8Kf7HfzEtkHawr2aRq
 rqg/c9ivzDn1p+fUBo4ZYtrRk4IAY+yKu6hdzdLTP5+bQrBTWTO9rjQVBm9FAGYT
 sCdEj1NqzvExvNHD7X6ut/GJ05lmce3K+qeSXSEysN9gqiT3eomYWMXrD2V2lxzb
 rIDDcb/icfaqjt14Mksh19r/rzNeq7noj9CGSmcqw0BHZfHzl38Lai6pdfYzCNyn
 vCM/c4c1D/WWX8/lifO1JZVbhDk1jy82Iphg2KEhL8iKPxDsKBBZLmYuU1oa7tMo
 WryGAz9+GQwd+W9chFuaOEtMnzvW2scEJ5Eb2fEf0Qj0aEurkL+C9dZR6o1GN77V
 DBUxtU628Ef4PJJGfbNCwZzdd8UPYG3a/mKfQQ3dz0oz2LySlW4=
 =wDot
 -----END PGP SIGNATURE-----

Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull arm64 updates from Catalin Marinas:

 - Support for running Linux in a protected VM under the Arm
   Confidential Compute Architecture (CCA)

 - Guarded Control Stack user-space support. Current patches follow the
   x86 ABI of implicitly creating a shadow stack on clone(). Subsequent
   patches (already on the list) will add support for clone3() allowing
   finer-grained control of the shadow stack size and placement from
   libc

 - AT_HWCAP3 support (not running out of HWCAP2 bits yet but we are
   getting close with the upcoming dpISA support)

 - Other arch features:

     - In-kernel use of the memcpy instructions, FEAT_MOPS (previously
       only exposed to user; uaccess support not merged yet)

     - MTE: hugetlbfs support and the corresponding kselftests

     - Optimise CRC32 using the PMULL instructions

     - Support for FEAT_HAFT enabling ARCH_HAS_NONLEAF_PMD_YOUNG

     - Optimise the kernel TLB flushing to use the range operations

     - POE/pkey (permission overlays): further cleanups after bringing
       the signal handler in line with the x86 behaviour for 6.12

 - arm64 perf updates:

     - Support for the NXP i.MX91 PMU in the existing IMX driver

     - Support for Ampere SoCs in the Designware PCIe PMU driver

     - Support for Marvell's 'PEM' PCIe PMU present in the 'Odyssey' SoC

     - Support for Samsung's 'Mongoose' CPU PMU

     - Support for PMUv3.9 finer-grained userspace counter access
       control

     - Switch back to platform_driver::remove() now that it returns
       'void'

     - Add some missing events for the CXL PMU driver

 - Miscellaneous arm64 fixes/cleanups:

     - Page table accessors cleanup: type updates, drop unused macros,
       reorganise arch_make_huge_pte() and clean up pte_mkcont(), sanity
       check addresses before runtime P4D/PUD folding

     - Command line override for ID_AA64MMFR0_EL1.ECV (advertising the
       FEAT_ECV for the generic timers) allowing Linux to boot with
       firmware deployments that don't set SCTLR_EL3.ECVEn

     - ACPI/arm64: tighten the check for the array of platform timer
       structures and adjust the error handling procedure in
       gtdt_parse_timer_block()

     - Optimise the cache flush for the uprobes xol slot (skip if no
       change) and other uprobes/kprobes cleanups

     - Fix the context switching of tpidrro_el0 when kpti is enabled

     - Dynamic shadow call stack fixes

     - Sysreg updates

     - Various arm64 kselftest improvements

* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (168 commits)
  arm64: tls: Fix context-switching of tpidrro_el0 when kpti is enabled
  kselftest/arm64: Try harder to generate different keys during PAC tests
  kselftest/arm64: Don't leak pipe fds in pac.exec_sign_all()
  arm64/ptrace: Clarify documentation of VL configuration via ptrace
  kselftest/arm64: Corrupt P0 in the irritator when testing SSVE
  acpi/arm64: remove unnecessary cast
  arm64/mm: Change protval as 'pteval_t' in map_range()
  kselftest/arm64: Fix missing printf() argument in gcs/gcs-stress.c
  kselftest/arm64: Add FPMR coverage to fp-ptrace
  kselftest/arm64: Expand the set of ZA writes fp-ptrace does
  kselftets/arm64: Use flag bits for features in fp-ptrace assembler code
  kselftest/arm64: Enable build of PAC tests with LLVM=1
  kselftest/arm64: Check that SVCR is 0 in signal handlers
  selftests/mm: Fix unused function warning for aarch64_write_signal_pkey()
  kselftest/arm64: Fix printf() compiler warnings in the arm64 syscall-abi.c tests
  kselftest/arm64: Fix printf() warning in the arm64 MTE prctl() test
  kselftest/arm64: Fix printf() compiler warnings in the arm64 fp tests
  kselftest/arm64: Fix build with stricter assemblers
  arm64/scs: Drop unused prototype __pi_scs_patch_vmlinux()
  arm64/scs: Deal with 64-bit relative offsets in FDE frames
  ...
2024-11-18 18:10:37 -08:00
Jeff Layton
583772eec7 nfsd: allow for up to 32 callback session slots
nfsd currently only uses a single slot in the callback channel, which is
proving to be a bottleneck in some cases. Widen the callback channel to
a max of 32 slots (subject to the client's target_maxreqs value).

Change the cb_holds_slot boolean to an integer that tracks the current
slot number (with -1 meaning "unassigned").  Move the callback slot
tracking info into the session. Add a new u32 that acts as a bitmap to
track which slots are in use, and a u32 to track the latest callback
target_slotid that the client reports. To protect the new fields, add
a new per-session spinlock (the se_lock). Fix nfsd41_cb_get_slot to always
search for the lowest slotid (using ffs()).

Finally, convert the session->se_cb_seq_nr field into an array of
ints and add the necessary handling to ensure that the seqids get
reset when the slot table grows after shrinking.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18 20:23:13 -05:00
Mike Snitzer
c840b8e1f0 nfs_common: must not hold RCU while calling nfsd_file_put_local
Move holding the RCU from nfs_to_nfsd_file_put_local to
nfs_to_nfsd_net_put.  It is the call to nfs_to->nfsd_serv_put that
requires the RCU anyway (the puts for nfsd_file and netns were
combined to avoid an extra indirect reference but that
micro-optimization isn't possible now).

This fixes xfstests generic/013 and it triggering:

"Voluntary context switch within RCU read-side critical section!"

[  143.545738] Call Trace:
[  143.546206]  <TASK>
[  143.546625]  ? show_regs+0x6d/0x80
[  143.547267]  ? __warn+0x91/0x140
[  143.547951]  ? rcu_note_context_switch+0x496/0x5d0
[  143.548856]  ? report_bug+0x193/0x1a0
[  143.549557]  ? handle_bug+0x63/0xa0
[  143.550214]  ? exc_invalid_op+0x1d/0x80
[  143.550938]  ? asm_exc_invalid_op+0x1f/0x30
[  143.551736]  ? rcu_note_context_switch+0x496/0x5d0
[  143.552634]  ? wakeup_preempt+0x62/0x70
[  143.553358]  __schedule+0xaa/0x1380
[  143.554025]  ? _raw_spin_unlock_irqrestore+0x12/0x40
[  143.554958]  ? try_to_wake_up+0x1fe/0x6b0
[  143.555715]  ? wake_up_process+0x19/0x20
[  143.556452]  schedule+0x2e/0x120
[  143.557066]  schedule_preempt_disabled+0x19/0x30
[  143.557933]  rwsem_down_read_slowpath+0x24d/0x4a0
[  143.558818]  ? xfs_efi_item_format+0x50/0xc0 [xfs]
[  143.559894]  down_read+0x4e/0xb0
[  143.560519]  xlog_cil_commit+0x1b2/0xbc0 [xfs]
[  143.561460]  ? _raw_spin_unlock+0x12/0x30
[  143.562212]  ? xfs_inode_item_precommit+0xc7/0x220 [xfs]
[  143.563309]  ? xfs_trans_run_precommits+0x69/0xd0 [xfs]
[  143.564394]  __xfs_trans_commit+0xb5/0x330 [xfs]
[  143.565367]  xfs_trans_roll+0x48/0xc0 [xfs]
[  143.566262]  xfs_defer_trans_roll+0x57/0x100 [xfs]
[  143.567278]  xfs_defer_finish_noroll+0x27a/0x490 [xfs]
[  143.568342]  xfs_defer_finish+0x1a/0x80 [xfs]
[  143.569267]  xfs_bunmapi_range+0x4d/0xb0 [xfs]
[  143.570208]  xfs_itruncate_extents_flags+0x13d/0x230 [xfs]
[  143.571353]  xfs_free_eofblocks+0x12e/0x190 [xfs]
[  143.572359]  xfs_file_release+0x12d/0x140 [xfs]
[  143.573324]  __fput+0xe8/0x2d0
[  143.573922]  __fput_sync+0x1d/0x30
[  143.574574]  nfsd_filp_close+0x33/0x60 [nfsd]
[  143.575430]  nfsd_file_free+0x96/0x150 [nfsd]
[  143.576274]  nfsd_file_put+0xf7/0x1a0 [nfsd]
[  143.577104]  nfsd_file_put_local+0x18/0x30 [nfsd]
[  143.578070]  nfs_close_local_fh+0x101/0x110 [nfs_localio]
[  143.579079]  __put_nfs_open_context+0xc9/0x180 [nfs]
[  143.580031]  nfs_file_clear_open_context+0x4a/0x60 [nfs]
[  143.581038]  nfs_file_release+0x3e/0x60 [nfs]
[  143.581879]  __fput+0xe8/0x2d0
[  143.582464]  __fput_sync+0x1d/0x30
[  143.583108]  __x64_sys_close+0x41/0x80
[  143.583823]  x64_sys_call+0x189a/0x20d0
[  143.584552]  do_syscall_64+0x64/0x170
[  143.585240]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  143.586185] RIP: 0033:0x7f3c5153efd7

Fixes: 65f2a5c36635 ("nfs_common: fix race in NFS calls to nfsd_file_put_local() and nfsd_serv_put()")
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Reviewed-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18 20:23:12 -05:00
Al Viro
07442ec85b nfsd: get rid of include ../internal.h
added back in 2015 for the sake of vfs_clone_file_range(),
which is in linux/fs.h these days

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18 20:23:12 -05:00
Yang Erkun
98100e88dd nfsd: fix nfs4_openowner leak when concurrent nfsd4_open occur
The action force umount(umount -f) will attempt to kill all rpc_task even
umount operation may ultimately fail if some files remain open.
Consequently, if an action attempts to open a file, it can potentially
send two rpc_task to nfs server.

                   NFS CLIENT
thread1                             thread2
open("file")
...
nfs4_do_open
 _nfs4_do_open
  _nfs4_open_and_get_state
   _nfs4_proc_open
    nfs4_run_open_task
     /* rpc_task1 */
     rpc_run_task
     rpc_wait_for_completion_task

                                    umount -f
                                    nfs_umount_begin
                                     rpc_killall_tasks
                                      rpc_signal_task
     rpc_task1 been wakeup
     and return -512
 _nfs4_do_open // while loop
    ...
    nfs4_run_open_task
     /* rpc_task2 */
     rpc_run_task
     rpc_wait_for_completion_task

While processing an open request, nfsd will first attempt to find or
allocate an nfs4_openowner. If it finds an nfs4_openowner that is not
marked as NFS4_OO_CONFIRMED, this nfs4_openowner will released. Since
two rpc_task can attempt to open the same file simultaneously from the
client to server, and because two instances of nfsd can run
concurrently, this situation can lead to lots of memory leak.
Additionally, when we echo 0 to /proc/fs/nfsd/threads, warning will be
triggered.

                    NFS SERVER
nfsd1                  nfsd2       echo 0 > /proc/fs/nfsd/threads

nfsd4_open
 nfsd4_process_open1
  find_or_alloc_open_stateowner
   // alloc oo1, stateid1
                       nfsd4_open
                        nfsd4_process_open1
                        find_or_alloc_open_stateowner
                        // find oo1, without NFS4_OO_CONFIRMED
                         release_openowner
                          unhash_openowner_locked
                          list_del_init(&oo->oo_perclient)
                          // cannot find this oo
                          // from client, LEAK!!!
                         alloc_stateowner // alloc oo2

 nfsd4_process_open2
  init_open_stateid
  // associate oo1
  // with stateid1, stateid1 LEAK!!!
  nfs4_get_vfs_file
  // alloc nfsd_file1 and nfsd_file_mark1
  // all LEAK!!!

                         nfsd4_process_open2
                         ...

                                    write_threads
                                     ...
                                     nfsd_destroy_serv
                                      nfsd_shutdown_net
                                       nfs4_state_shutdown_net
                                        nfs4_state_destroy_net
                                         destroy_client
                                          __destroy_client
                                          // won't find oo1!!!
                                     nfsd_shutdown_generic
                                      nfsd_file_cache_shutdown
                                       kmem_cache_destroy
                                       for nfsd_file_slab
                                       and nfsd_file_mark_slab
                                       // bark since nfsd_file1
                                       // and nfsd_file_mark1
                                       // still alive

=======================================================================
BUG nfsd_file (Not tainted): Objects remaining in nfsd_file on
__kmem_cache_shutdown()
-----------------------------------------------------------------------

Slab 0xffd4000004438a80 objects=34 used=1 fp=0xff11000110e2ad28
flags=0x17ffffc0000240(workingset|head|node=0|zone=2|lastcpupid=0x1fffff)
CPU: 4 UID: 0 PID: 757 Comm: sh Not tainted 6.12.0-rc6+ #19
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
1.16.1-2.fc37 04/01/2014
Call Trace:
 <TASK>
 dump_stack_lvl+0x53/0x70
 slab_err+0xb0/0xf0
 __kmem_cache_shutdown+0x15c/0x310
 kmem_cache_destroy+0x66/0x160
 nfsd_file_cache_shutdown+0xac/0x210 [nfsd]
 nfsd_destroy_serv+0x251/0x2a0 [nfsd]
 nfsd_svc+0x125/0x1e0 [nfsd]
 write_threads+0x16a/0x2a0 [nfsd]
 nfsctl_transaction_write+0x74/0xa0 [nfsd]
 vfs_write+0x1ae/0x6d0
 ksys_write+0xc1/0x160
 do_syscall_64+0x5f/0x170
 entry_SYSCALL_64_after_hwframe+0x76/0x7e

Disabling lock debugging due to kernel taint
Object 0xff11000110e2ac38 @offset=3128
Allocated in nfsd_file_do_acquire+0x20f/0xa30 [nfsd] age=1635 cpu=3
pid=800
 nfsd_file_do_acquire+0x20f/0xa30 [nfsd]
 nfsd_file_acquire_opened+0x5f/0x90 [nfsd]
 nfs4_get_vfs_file+0x4c9/0x570 [nfsd]
 nfsd4_process_open2+0x713/0x1070 [nfsd]
 nfsd4_open+0x74b/0x8b0 [nfsd]
 nfsd4_proc_compound+0x70b/0xc20 [nfsd]
 nfsd_dispatch+0x1b4/0x3a0 [nfsd]
 svc_process_common+0x5b8/0xc50 [sunrpc]
 svc_process+0x2ab/0x3b0 [sunrpc]
 svc_handle_xprt+0x681/0xa20 [sunrpc]
 nfsd+0x183/0x220 [nfsd]
 kthread+0x199/0x1e0
 ret_from_fork+0x31/0x60
 ret_from_fork_asm+0x1a/0x30

Add nfs4_openowner_unhashed to help found unhashed nfs4_openowner, and
break nfsd4_open process to fix this problem.

Cc: stable@vger.kernel.org # v5.4+
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Yang Erkun <yangerkun@huawei.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18 20:23:12 -05:00
Chuck Lever
aa0ebd21df NFSD: Add nfsd4_copy time-to-live
Keep async copy state alive for a few lease cycles after the copy
completes so that OFFLOAD_STATUS returns something meaningful.

This means that NFSD's client shutdown processing needs to purge
any of this state that happens to be waiting to die.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18 20:23:11 -05:00
Chuck Lever
ac0514f4d1 NFSD: Add a laundromat reaper for async copy state
RFC 7862 Section 4.8 states:

> A copy offload stateid will be valid until either (A) the client
> or server restarts or (B) the client returns the resource by
> issuing an OFFLOAD_CANCEL operation or the client replies to a
> CB_OFFLOAD operation.

Instead of releasing async copy state when the CB_OFFLOAD callback
completes, now let it live until the next laundromat run after the
callback completes.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18 20:23:11 -05:00
Chuck Lever
b44ffa4c4f NFSD: Block DESTROY_CLIENTID only when there are ongoing async COPY operations
Currently __destroy_client() consults the nfs4_client's async_copies
list to determine whether there are ongoing async COPY operations.
However, NFSD now keeps copy state in that list even when the
async copy has completed, to enable OFFLOAD_STATUS to find the
COPY results for a while after the COPY has completed.

DESTROY_CLIENTID should not be blocked if the client's async_copies
list contains state for only completed copy operations.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18 20:23:10 -05:00
Chuck Lever
5c41f32147 NFSD: Handle an NFS4ERR_DELAY response to CB_OFFLOAD
RFC 7862 permits callback services to respond to CB_OFFLOAD with
NFS4ERR_DELAY. Currently NFSD drops the CB_OFFLOAD in that case.

To improve the reliability of COPY offload, NFSD should rather send
another CB_OFFLOAD completion notification.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18 20:23:10 -05:00
Chuck Lever
409d6f52bd NFSD: Free async copy information in nfsd4_cb_offload_release()
RFC 7862 Section 4.8 states:

> A copy offload stateid will be valid until either (A) the client
> or server restarts or (B) the client returns the resource by
> issuing an OFFLOAD_CANCEL operation or the client replies to a
> CB_OFFLOAD operation.

Currently, NFSD purges the metadata for an async COPY operation as
soon as the CB_OFFLOAD callback has been sent. It does not wait even
for the client's CB_OFFLOAD response, as the paragraph above
suggests that it should.

This makes the OFFLOAD_STATUS operation ineffective during the
window between the completion of an asynchronous COPY and the
server's receipt of the corresponding CB_OFFLOAD response. This is
important if, for example, the client responds with NFS4ERR_DELAY,
or the transport is lost before the server receives the response. A
client might use OFFLOAD_STATUS to query the server about the still
pending asynchronous COPY, but NFSD will respond to OFFLOAD_STATUS
as if it had never heard of the presented copy stateid.

This patch starts to address this issue by extending the lifetime of
struct nfsd4_copy at least until the server has seen the client's
CB_OFFLOAD response, or the CB_OFFLOAD has timed out.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18 20:23:10 -05:00