Commit Graph

14373 Commits

Author SHA1 Message Date
Selvin Xavier
c6b2b5c86d RDMA/bnxt_re: Fix the compatibility flag for variable size WQE
For older adapters that doesn't support variable size WQE,
driver is wrongly reporting that variable WQE is supported,
when the latest library is used.

Report the variable WQE capability only if the driver supports
it.

Fixes: 10a104c0de ("RDMA/bnxt_re: Enable variable size WQEs for user space applications")
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Link: https://patch.msgid.link/1725444253-13221-2-git-send-email-selvin.xavier@broadcom.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-09-09 21:17:09 +03:00
Michael Guralnik
7ebb00cea4 RDMA/mlx5: Fix MR cache temp entries cleanup
Fix the cleanup of the temp cache entries that are dynamically created
in the MR cache.

The cleanup of the temp cache entries is currently scheduled only when a
new entry is created. Since in the cleanup of the entries only the mkeys
are destroyed and the cache entry stays in the cache, subsequent
registrations might reuse the entry and it will eventually be filled with
new mkeys without cleanup ever getting scheduled again.

On workloads that register and deregister MRs with a wide range of
properties we see the cache ends up holding many cache entries, each
holding the max number of mkeys that were ever used through it.

Additionally, as the cleanup work is scheduled to run over the whole
cache, any mkey that is returned to the cache after the cleanup was
scheduled will be held for less than the intended 30 seconds timeout.

Solve both issues by dropping the existing remove_ent_work and reusing
the existing per-entry work to also handle the temp entries cleanup.

Schedule the work to run with a 30 seconds delay every time we push an
mkey to a clean temp entry.
This ensures the cleanup runs on each entry only 30 seconds after the
first mkey was pushed to an empty entry.

As we have already been distinguishing between persistent and temp entries
when scheduling the cache_work_func, it is not being scheduled in any
other flows for the temp entries.

Another benefit from moving to a per-entry cleanup is we now not
required to hold the rb_tree mutex, thus enabling other flow to run
concurrently.

Fixes: dd1b913fb0 ("RDMA/mlx5: Cache all user cacheable mkeys on dereg MR flow")
Signed-off-by: Michael Guralnik <michaelgur@nvidia.com>
Link: https://patch.msgid.link/e4fa4bb03bebf20dceae320f26816cd2dde23a26.1725362530.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-09-09 21:17:09 +03:00
Michael Guralnik
ee6d57a2e1 RDMA/mlx5: Limit usage of over-sized mkeys from the MR cache
When searching the MR cache for suitable cache entries, don't use mkeys
larger than twice the size required for the MR.
This should ensure the usage of mkeys closer to the minimal required size
and reduce memory waste.

On driver init we create entries for mkeys with clear attributes and
powers of 2 sizes from 4 to the max supported size.
This solves the issue for anyone using mkeys that fit these
requirements.

In the use case where an MR is registered with different attributes,
like an access flag we can't UMR, we'll create a new cache entry to store
it upon dereg.
Without this fix, any later registration with same attributes and smaller
size will use the newly created cache entry and it's mkeys, disregarding
the memory waste of using mkeys larger than required.

For example, one worst-case scenario can be when registering and
deregistering a 1GB mkey with ATS enabled which will cause the creation of
a new cache entry to hold those type of mkeys. A user registering a 4k MR
with ATS will end up using the new cache entry and an mkey that can
support a 1GB MR, thus wasting x250k memory than actually needed in the HW.

Additionally, allow all small registration to use the smallest size
cache entry that is initialized on driver load even if size is larger
than twice the required size.

Fixes: 73d09b2fe8 ("RDMA/mlx5: Introduce mlx5r_cache_rb_key")
Signed-off-by: Michael Guralnik <michaelgur@nvidia.com>
Link: https://patch.msgid.link/8ba3a6e3748aace2026de8b83da03aba084f78f4.1725362530.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-09-09 21:17:09 +03:00
Michael Guralnik
6f5cd6ac9a RDMA/mlx5: Fix counter update on MR cache mkey creation
After an mkey is created, update the counter for pending mkeys before
reshceduling the work that is filling the cache.

Rescheduling the work with a full MR cache entry and a wrong 'pending'
counter will cause us to miss disabling the fill_to_high_water flag.
Thus leaving the cache full but with an indication that it's still
needs to be filled up to it's full size (2 * limit).
Next time an mkey will be taken from the cache, we'll unnecessarily
continue the process of filling the cache to it's full size.

Fixes: 57e7071683 ("RDMA/mlx5: Implement mkeys management via LIFO queue")
Signed-off-by: Michael Guralnik <michaelgur@nvidia.com>
Link: https://patch.msgid.link/0f44f462ba22e45f72cb3d0ec6a748634086b8d0.1725362530.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-09-09 21:17:09 +03:00
Michael Guralnik
30e6bd8d3b RDMA/mlx5: Drop redundant work canceling from clean_keys()
The canceling of dealyed work in clean_keys() is a leftover from years
back and was added to prevent races in the cleanup process of MR cache.
The cleanup process was rewritten a few years ago and the canceling of
delayed work and flushing of workqueue was added before the call to
clean_keys().

Signed-off-by: Michael Guralnik <michaelgur@nvidia.com>
Link: https://patch.msgid.link/943d21f5a9dba7b98a3e1d531e3561ffe9745d71.1725362530.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-09-09 21:17:09 +03:00
Cheng Xu
e77127ff64 RDMA/erdma: Return QP state in erdma_query_qp
Fix qp_state and cur_qp_state to return correct values in
struct ib_qp_attr.

Fixes: 1550557717 ("RDMA/erdma: Add verbs implementation")
Signed-off-by: Cheng Xu <chengyou@linux.alibaba.com>
Link: https://patch.msgid.link/20240902112920.58749-4-chengyou@linux.alibaba.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-09-09 21:17:09 +03:00
Cheng Xu
b80330f105 RDMA/erdma: Add disassociate ucontext support
All IO pages mapped to user space are handled by rdma_user_mmap_io,
so add empty stub for disassociate ucontext.

Signed-off-by: Cheng Xu <chengyou@linux.alibaba.com>
Link: https://patch.msgid.link/20240902112920.58749-3-chengyou@linux.alibaba.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-09-09 21:17:09 +03:00
Cheng Xu
b24506f1c3 RDMA/erdma: Refactor the initialization and destruction of EQ
We extracted the common parts of the initialization/destruction
process to make the code cleaner.

Signed-off-by: Cheng Xu <chengyou@linux.alibaba.com>
Link: https://patch.msgid.link/20240902112920.58749-2-chengyou@linux.alibaba.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-09-09 21:17:08 +03:00
Maher Sanalla
34efda1735 RDMA/mlx5: Enable ATS when allocating kernel MRs
When creating kernel MRs, it is not definitive whether they will be used
for peer-to-peer transactions or for other usecases, since address
mapping is performed only after the MR is created.

Since peer-to-peer transactions benefit significantly from ATS
performance-wise, enable ATS on newly-allocated kernel MRs when
supported.

Signed-off-by: Maher Sanalla <msanalla@nvidia.com>
Reviewed-by: Gal Shalom <galshalom@nvidia.com>
Link: https://patch.msgid.link/fafd4c9f14cf438d2882d88649c2947e1d05d0b4.1725273403.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-09-09 21:16:40 +03:00
Patrisious Haddad
1403c8b147 IB/core: Fix ib_cache_setup_one error flow cleanup
When ib_cache_update return an error, we exit ib_cache_setup_one
instantly with no proper cleanup, even though before this we had
already successfully done gid_table_setup_one, that results in
the kernel WARN below.

Do proper cleanup using gid_table_cleanup_one before returning
the err in order to fix the issue.

WARNING: CPU: 4 PID: 922 at drivers/infiniband/core/cache.c:806 gid_table_release_one+0x181/0x1a0
Modules linked in:
CPU: 4 UID: 0 PID: 922 Comm: c_repro Not tainted 6.11.0-rc1+ #3
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
RIP: 0010:gid_table_release_one+0x181/0x1a0
Code: 44 8b 38 75 0c e8 2f cb 34 ff 4d 8b b5 28 05 00 00 e8 23 cb 34 ff 44 89 f9 89 da 4c 89 f6 48 c7 c7 d0 58 14 83 e8 4f de 21 ff <0f> 0b 4c 8b 75 30 e9 54 ff ff ff 48 8    3 c4 10 5b 5d 41 5c 41 5d 41
RSP: 0018:ffffc90002b835b0 EFLAGS: 00010286
RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffff811c8527
RDX: 0000000000000000 RSI: ffffffff811c8534 RDI: 0000000000000001
RBP: ffff8881011b3d00 R08: ffff88810b3abe00 R09: 205d303839303631
R10: 666572207972746e R11: 72746e6520444947 R12: 0000000000000001
R13: ffff888106390000 R14: ffff8881011f2110 R15: 0000000000000001
FS:  00007fecc3b70800(0000) GS:ffff88813bd00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020000340 CR3: 000000010435a001 CR4: 00000000003706b0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <TASK>
 ? show_regs+0x94/0xa0
 ? __warn+0x9e/0x1c0
 ? gid_table_release_one+0x181/0x1a0
 ? report_bug+0x1f9/0x340
 ? gid_table_release_one+0x181/0x1a0
 ? handle_bug+0xa2/0x110
 ? exc_invalid_op+0x31/0xa0
 ? asm_exc_invalid_op+0x16/0x20
 ? __warn_printk+0xc7/0x180
 ? __warn_printk+0xd4/0x180
 ? gid_table_release_one+0x181/0x1a0
 ib_device_release+0x71/0xe0
 ? __pfx_ib_device_release+0x10/0x10
 device_release+0x44/0xd0
 kobject_put+0x135/0x3d0
 put_device+0x20/0x30
 rxe_net_add+0x7d/0xa0
 rxe_newlink+0xd7/0x190
 nldev_newlink+0x1b0/0x2a0
 ? __pfx_nldev_newlink+0x10/0x10
 rdma_nl_rcv_msg+0x1ad/0x2e0
 rdma_nl_rcv_skb.constprop.0+0x176/0x210
 netlink_unicast+0x2de/0x400
 netlink_sendmsg+0x306/0x660
 __sock_sendmsg+0x110/0x120
 ____sys_sendmsg+0x30e/0x390
 ___sys_sendmsg+0x9b/0xf0
 ? kstrtouint+0x6e/0xa0
 ? kstrtouint_from_user+0x7c/0xb0
 ? get_pid_task+0xb0/0xd0
 ? proc_fail_nth_write+0x5b/0x140
 ? __fget_light+0x9a/0x200
 ? preempt_count_add+0x47/0xa0
 __sys_sendmsg+0x61/0xd0
 do_syscall_64+0x50/0x110
 entry_SYSCALL_64_after_hwframe+0x76/0x7e

Fixes: 1901b91f99 ("IB/core: Fix potential NULL pointer dereference in pkey cache")
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Maher Sanalla <msanalla@nvidia.com>
Link: https://patch.msgid.link/79137687d829899b0b1c9835fcb4b258004c439a.1725273354.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-09-04 10:33:09 +03:00
Chris Mi
112e6e83a8 IB/mlx5: Fix UMR pd cleanup on error flow of driver init
The cited commit moves the pd allocation from function
mlx5r_umr_resource_cleanup() to a new function mlx5r_umr_cleanup().
So the fix in commit [1] is broken. In error flow, will hit panic [2].

Fix it by checking pd pointer to avoid panic if it is NULL;

[1] RDMA/mlx5: Fix UMR cleanup on error flow of driver init
[2]
 [  347.567063] infiniband mlx5_0: Couldn't register device with driver model
 [  347.591382] BUG: kernel NULL pointer dereference, address: 0000000000000020
 [  347.593438] #PF: supervisor read access in kernel mode
 [  347.595176] #PF: error_code(0x0000) - not-present page
 [  347.596962] PGD 0 P4D 0
 [  347.601361] RIP: 0010:ib_dealloc_pd_user+0x12/0xc0 [ib_core]
 [  347.604171] RSP: 0018:ffff888106293b10 EFLAGS: 00010282
 [  347.604834] RAX: 0000000000000000 RBX: 000000000000000e RCX: 0000000000000000
 [  347.605672] RDX: ffff888106293ad0 RSI: 0000000000000000 RDI: 0000000000000000
 [  347.606529] RBP: 0000000000000000 R08: ffff888106293ae0 R09: ffff888106293ae0
 [  347.607379] R10: 0000000000000a06 R11: 0000000000000000 R12: 0000000000000000
 [  347.608224] R13: ffffffffa0704dc0 R14: 0000000000000001 R15: 0000000000000001
 [  347.609067] FS:  00007fdc720cd9c0(0000) GS:ffff88852c880000(0000) knlGS:0000000000000000
 [  347.610094] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 [  347.610727] CR2: 0000000000000020 CR3: 0000000103012003 CR4: 0000000000370eb0
 [  347.611421] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 [  347.612113] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 [  347.612804] Call Trace:
 [  347.613130]  <TASK>
 [  347.613417]  ? __die+0x20/0x60
 [  347.613793]  ? page_fault_oops+0x150/0x3e0
 [  347.614243]  ? free_msg+0x68/0x80 [mlx5_core]
 [  347.614840]  ? cmd_exec+0x48f/0x11d0 [mlx5_core]
 [  347.615359]  ? exc_page_fault+0x74/0x130
 [  347.615808]  ? asm_exc_page_fault+0x22/0x30
 [  347.616273]  ? ib_dealloc_pd_user+0x12/0xc0 [ib_core]
 [  347.616801]  mlx5r_umr_cleanup+0x23/0x90 [mlx5_ib]
 [  347.617365]  mlx5_ib_stage_pre_ib_reg_umr_cleanup+0x36/0x40 [mlx5_ib]
 [  347.618025]  __mlx5_ib_add+0x96/0xd0 [mlx5_ib]
 [  347.618539]  mlx5r_probe+0xe9/0x310 [mlx5_ib]
 [  347.619032]  ? kernfs_add_one+0x107/0x150
 [  347.619478]  ? __mlx5_ib_add+0xd0/0xd0 [mlx5_ib]
 [  347.619984]  auxiliary_bus_probe+0x3e/0x90
 [  347.620448]  really_probe+0xc5/0x3a0
 [  347.620857]  __driver_probe_device+0x80/0x160
 [  347.621325]  driver_probe_device+0x1e/0x90
 [  347.621770]  __driver_attach+0xec/0x1c0
 [  347.622213]  ? __device_attach_driver+0x100/0x100
 [  347.622724]  bus_for_each_dev+0x71/0xc0
 [  347.623151]  bus_add_driver+0xed/0x240
 [  347.623570]  driver_register+0x58/0x100
 [  347.623998]  __auxiliary_driver_register+0x6a/0xc0
 [  347.624499]  ? driver_register+0xae/0x100
 [  347.624940]  ? 0xffffffffa0893000
 [  347.625329]  mlx5_ib_init+0x16a/0x1e0 [mlx5_ib]
 [  347.625845]  do_one_initcall+0x4a/0x2a0
 [  347.626273]  ? gcov_event+0x2e2/0x3a0
 [  347.626706]  do_init_module+0x8a/0x260
 [  347.627126]  init_module_from_file+0x8b/0xd0
 [  347.627596]  __x64_sys_finit_module+0x1ca/0x2f0
 [  347.628089]  do_syscall_64+0x4c/0x100

Fixes: 638420115c ("IB/mlx5: Create UMR QP just before first reg_mr occurs")
Signed-off-by: Chris Mi <cmi@nvidia.com>
Reviewed-by: Jianbo Liu <jianbol@nvidia.com>
Link: https://patch.msgid.link/778c40c60287992da5d6ec92bb07b67f7bb5e6ef.1725273295.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-09-04 10:28:49 +03:00
Kalesh AP
dc116b7fdd RDMA/bnxt_re: Add support for MR Relaxed Ordering
Some of the adapters support Relaxed Ordering for the MRs.
Driver queries support for Memory region relax ordering  support from
firmware and  set relax ordering bit in REGISTER_MR request, if the users
request for the support. Also, this is supported only if the PCIe device
has enabled relaxed ordering attribute.

Reviewed-by: Chandramohan Akula <chandramohan.akula@broadcom.com>
Reviewed-by: Selvin Xavier <selvin.xavier@broadcom.com>
Reviewed-by: Vijay Kumar Mandadapu <vijaykumar.mandadapu@broadcom.com>
Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Link: https://patch.msgid.link/1725256351-12751-5-git-send-email-selvin.xavier@broadcom.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-09-02 12:33:47 +03:00
Kalesh AP
f786eebbbe RDMA/bnxt_re: Avoid an extra hwrm per MR creation
Firmware now have a new mr registration command where
both MR allocation and registration can be done in a
single hwrm command. Driver has to issue this new hwrm
command whenever the support flag is set. This reduces
the number of hwrm issued per MR creation and speed up
the MR creation.

Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Link: https://patch.msgid.link/1725256351-12751-4-git-send-email-selvin.xavier@broadcom.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-09-02 12:33:47 +03:00
Kalesh AP
b98d969719 RDMA/bnxt_re: Rename a variable
Renaming flags to access_flags for clarity.

Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Link: https://patch.msgid.link/1725256351-12751-3-git-send-email-selvin.xavier@broadcom.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-09-02 12:33:47 +03:00
Kalesh AP
543b455c6e RDMA/bnxt_re: Update HW interface headers
Updating the HW structures for the pcie relax ordering support.
Newly added interface structures will be used in the
followup patch.

Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Link: https://patch.msgid.link/1725256351-12751-2-git-send-email-selvin.xavier@broadcom.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-09-02 12:33:47 +03:00
Long Li
4a3b99bc04 RDMA/mana_ib: use the correct page size for mapping user-mode doorbell page
When mapping doorbell page from user-mode, the driver should use the system
page size as this memory is allocated via mmap() from user-mode.

Cc: stable@vger.kernel.org
Fixes: 0266a17763 ("RDMA/mana_ib: Add a driver for Microsoft Azure Network Adapter")
Signed-off-by: Long Li <longli@microsoft.com>
Link: https://patch.msgid.link/1725030993-16213-2-git-send-email-longli@linuxonhyperv.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-09-02 10:32:08 +03:00
Long Li
9e517a8e9d RDMA/mana_ib: use the correct page table index based on hardware page size
MANA hardware uses 4k page size. When calculating the page table index,
it should use the hardware page size, not the system page size.

Cc: stable@vger.kernel.org
Fixes: 0266a17763 ("RDMA/mana_ib: Add a driver for Microsoft Azure Network Adapter")
Signed-off-by: Long Li <longli@microsoft.com>
Link: https://patch.msgid.link/1725030993-16213-1-git-send-email-longli@linuxonhyperv.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-09-02 10:32:08 +03:00
Chandramohan Akula
181028a0d8 RDMA/bnxt_re: Share a page to expose per SRQ info with userspace
Gen P7 adapters needs to share a toggle bits information received
in kernel driver with the user space. User space needs this
info to arm the SRQ.

User space application can get this page using the
UAPI routines. Library will mmap this page and get the
toggle bits to be used in the next ARM Doorbell.

Uses a hash list to map the SRQ structure from the SRQ ID.
SRQ structure is retrieved from the hash list while the
library calls the UAPI routine to get the toggle page
mapping. Currently the full page is mapped per SRQ. This
can be optimized to enable multiple SRQs from the same
application share the same page and different offsets
in the page

Signed-off-by: Chandramohan Akula <chandramohan.akula@broadcom.com>
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Link: https://patch.msgid.link/1724945645-14989-4-git-send-email-selvin.xavier@broadcom.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-09-02 10:10:36 +03:00
Chandramohan Akula
b4207630e0 RDMA/bnxt_re: Refactor the BNXT_RE_METHOD_GET_TOGGLE_MEM method
Refactor the code in this function to have common code.
This is used in subsequent patches.

Signed-off-by: Chandramohan Akula <chandramohan.akula@broadcom.com>
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Link: https://patch.msgid.link/1724945645-14989-3-git-send-email-selvin.xavier@broadcom.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-09-02 10:10:36 +03:00
Hongguang Gao
640c2cf84e RDMA/bnxt_re: Get the toggle bits from SRQ events
SRQ arming requires the toggle bits received from hardware.
Get the toggle bits from SRQ notification for the
gen p7 adapters. This value will be zero for the older adapters.

Signed-off-by: Hongguang Gao <hongguang.gao@broadcom.com>
Signed-off-by: Chandramohan Akula <chandramohan.akula@broadcom.com>
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Link: https://patch.msgid.link/1724945645-14989-2-git-send-email-selvin.xavier@broadcom.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-09-02 10:10:36 +03:00
Shen Lichuan
e012316d83 RDMA/rdmavt: Convert to use ERR_CAST()
As opposed to open-code, using the ERR_CAST macro clearly indicates that
this is a pointer to an error value and a type conversion was performed.

Signed-off-by: Shen Lichuan <shenlichuan@vivo.com>
Link: https://patch.msgid.link/20240828082720.33231-1-shenlichuan@vivo.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-09-02 10:09:42 +03:00
Yue Haibing
2d10b05bce RDMA/cxgb4: Remove unused declarations
Since commit be4c9bad9d ("MAINTAINERS: Add cxgb4 and iw_cxgb4 entries")
c4iw_post_terminate() declaration is not used anymore.

And other declarations were never implemented since introduction in
commit cfdda9d764 ("RDMA/cxgb4: Add driver for Chelsio T4 RNIC").

Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Link: https://patch.msgid.link/20240824091629.3659565-1-yuehaibing@huawei.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-08-28 15:24:35 +03:00
Jack Wang
e5bba9e027 RDMA/rtrs-clt: Remove an extra space
No functional changes.

Signed-off-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Alexei Pastuchov <alexei.pastuchov@ionos.com>
Signed-off-by: Grzegorz Prajsner <grzegorz.prajsner@ionos.com>
Signed-off-by: Md Haris Iqbal <haris.iqbal@ionos.com>
Link: https://patch.msgid.link/20240821112217.41827-12-haris.iqbal@ionos.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-08-28 15:24:35 +03:00
Jack Wang
bab9f8db42 RDMA/rtrs-clt: Do local invalidate after write io completion
Switch local invalidate after write io completion avoid the
chain usage of WR, this fixed the local protection error on
LOCAL INVALIDATE WR.

Signed-off-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Md Haris Iqbal <haris.iqbal@ionos.com>
Signed-off-by: Grzegorz Prajsner <grzegorz.prajsner@ionos.com>
Link: https://patch.msgid.link/20240821112217.41827-11-haris.iqbal@ionos.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-08-28 15:24:35 +03:00
Grzegorz Prajsner
667db86bcb RDMA/rtrs: Register ib event handler
Use ib_register_event_handler() to register event handlers for both
client and server side. For now, all those handlers do, is to print
type of incoming event.

Signed-off-by: Grzegorz Prajsner <grzegorz.prajsner@ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Md Haris Iqbal <haris.iqbal@ionos.com>
Link: https://patch.msgid.link/20240821112217.41827-10-haris.iqbal@ionos.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-08-28 15:24:35 +03:00
Md Haris Iqbal
d0e62bf7b5 RDMA/rtrs-srv: Avoid null pointer deref during path establishment
For RTRS path establishment, RTRS client initiates and completes con_num
of connections. After establishing all its connections, the information
is exchanged between the client and server through the info_req message.
During this exchange, it is essential that all connections have been
established, and the state of the RTRS srv path is CONNECTED.

So add these sanity checks, to make sure we detect and abort process in
error scenarios to avoid null pointer deref.

Signed-off-by: Md Haris Iqbal <haris.iqbal@ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Grzegorz Prajsner <grzegorz.prajsner@ionos.com>
Link: https://patch.msgid.link/20240821112217.41827-9-haris.iqbal@ionos.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-08-28 15:24:35 +03:00
Jack Wang
ff73958905 RDMA/rtrs-clt: Print request type for errors
Extend the output to print also the request type.

Signed-off-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Grzegorz Prajsner <grzegorz.prajsner@ionos.com>
Signed-off-by: Md Haris Iqbal <haris.iqbal@ionos.com>
Link: https://patch.msgid.link/20240821112217.41827-8-haris.iqbal@ionos.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-08-28 15:24:35 +03:00
Md Haris Iqbal
3e4289b29e RDMA/rtrs-clt: Reset cid to con_num - 1 to stay in bounds
In the function init_conns(), after the create_con() and create_cm() for
loop if something fails. In the cleanup for loop after the destroy tag, we
access out of bound memory because cid is set to clt_path->s.con_num.

This commits resets the cid to clt_path->s.con_num - 1, to stay in bounds
in the cleanup loop later.

Fixes: 6a98d71dae ("RDMA/rtrs: client: main functionality")
Signed-off-by: Md Haris Iqbal <haris.iqbal@ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Grzegorz Prajsner <grzegorz.prajsner@ionos.com>
Link: https://patch.msgid.link/20240821112217.41827-7-haris.iqbal@ionos.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-08-28 15:24:35 +03:00
Jack Wang
6793f9581f RDMA/rtrs-clt: Reuse need_inval from mr
mr has a member need_inval, which can be used to indicate if
local invalidate is needed, switch to it and remove need_inv
from rtrs_clt_io_req.

Signed-off-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Md Haris Iqbal <haris.iqbal@ionos.com>
Signed-off-by: Grzegorz Prajsner <grzegorz.prajsner@ionos.com>
Link: https://patch.msgid.link/20240821112217.41827-6-haris.iqbal@ionos.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-08-28 15:24:35 +03:00
Jack Wang
3258cbbd86 RDMA/rtrs: Reset hb_missed_cnt after receiving other traffic from peer
Reset hb_missed_cnt after receiving traffic from other peer, so
hb is more robust again high load on host or network.

Fixes: 6a98d71dae ("RDMA/rtrs: client: main functionality")
Signed-off-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Md Haris Iqbal <haris.iqbal@ionos.com>
Signed-off-by: Grzegorz Prajsner <grzegorz.prajsner@ionos.com>
Link: https://patch.msgid.link/20240821112217.41827-5-haris.iqbal@ionos.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-08-28 15:24:35 +03:00
Jack Wang
53c26f3ecd RDMA/rtrs-clt: Rate limit errors in IO path
On network errors, a large number of these logs are printed due to all the
inflight IOs, rate limit them so they do not clutter kernel log.

Signed-off-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Md Haris Iqbal <haris.iqbal@ionos.com>
Signed-off-by: Grzegorz Prajsner <grzegorz.prajsner@ionos.com>
Link: https://patch.msgid.link/20240821112217.41827-4-haris.iqbal@ionos.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-08-28 15:24:35 +03:00
Jack Wang
8c8dd4e13b RDMA/rtrs-clt: Fix need_inv setting in error case
In some cases need_inv can be missed for write requests, additionally
driver has to handle missing invalidates for write requests. While at
it, remove the else case from write invalidate path as it is possible
to reach there.

Signed-off-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Md Haris Iqbal <haris.iqbal@ionos.com>
Signed-off-by: Grzegorz Prajsner <grzegorz.prajsner@ionos.com>
Link: https://patch.msgid.link/20240821112217.41827-3-haris.iqbal@ionos.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-08-28 15:24:35 +03:00
Md Haris Iqbal
4842cfb07a RDMA/rtrs: For HB error add additional clt/srv specific logging
In case of HB error, we need to know the specific path on which it
happened, for better debugging. Since the clt/srv path structures are not
available in rtrs.c, it needs to be done in the individual HB error
handler.

This commit add those loging. A sample kernel log output after this commit:

rtrs_core L357: <blya>: HB missed max reached.
rtrs_server L717: <blya>: HB err handler for path=ip:x.x.x.x@ip:x.x.x.x
.
.
rtrs_core L357: <blya>: HB missed max reached.
rtrs_client L1519: <blya>: HB err handler for path=ip:x.x.x.x@ip:x.x.x.x

Signed-off-by: Md Haris Iqbal <haris.iqbal@ionos.com>
Reviewed-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Grzegorz Prajsner <grzegorz.prajsner@ionos.com>
Link: https://patch.msgid.link/20240821112217.41827-2-haris.iqbal@ionos.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-08-28 15:24:34 +03:00
Jason Gunthorpe
34cd192881 Merge branch 'bnxt_re_variable_wqes' into rdma.git for-next
Selvin Xavier says:

=============
Enable the Variable size Work Queue entry support for Gen P7
adapters. This would help in the better utilization of the queue memory
and pci bandwidth due to the smaller send queue Work entries.
=============

Based on v6.11-rc5 for dependencies.

* bnxt_re_variable_wqes: (829 commits)
  RDMA/bnxt_re: Enable variable size WQEs for user space applications
  RDMA/bnxt_re: Handle variable WQE support for user applications
  RDMA/bnxt_re: Fix the table size for PSN/MSN entries
  RDMA/bnxt_re: Get the WQE index from slot index while completing the WQEs
  RDMA/bnxt_re: Add support for Variable WQE in Genp7 adapters
  Linux 6.11-rc5
  ...

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2024-08-28 09:11:29 -03:00
Selvin Xavier
10a104c0de RDMA/bnxt_re: Enable variable size WQEs for user space applications
Add backward compatibility code to enable variable size WQEs only if the
user lib supports it.

Link: https://patch.msgid.link/r/1724042847-1481-6-git-send-email-selvin.xavier@broadcom.com
Signed-off-by: Hongguang Gao <hongguang.gao@broadcom.com>
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2024-08-27 10:15:57 -03:00
Selvin Xavier
d8ea645d69 RDMA/bnxt_re: Handle variable WQE support for user applications
User library calculates the number of slots required for user applications
and it can pass that information to the driver.  Driver can use this value
and update the HW directly. This mechanism is currently used only for the
newly introduced variable size WQEs.

Extend the bnxt_re_qp_req structure to pass the Send Queue slot count.
Reorganize the code to get the sq_slots before initializing the Send Queue
attributes.

Link: https://patch.msgid.link/r/1724042847-1481-5-git-send-email-selvin.xavier@broadcom.com
Signed-off-by: Hongguang Gao <hongguang.gao@broadcom.com>
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2024-08-27 10:15:57 -03:00
Selvin Xavier
b930d0bac9 RDMA/bnxt_re: Fix the table size for PSN/MSN entries
HW MSN table size is always a power of 2. So the pages should be mapped
accordingly.

Use the power of two calculation while get the number of PSN/MSN entries.

Fixes: 6f6bfbc595 ("RDMA/bnxt_re: Expose the MSN table capability for user library")
Link: https://patch.msgid.link/r/1724042847-1481-4-git-send-email-selvin.xavier@broadcom.com
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2024-08-27 10:15:57 -03:00
Selvin Xavier
51edebb734 RDMA/bnxt_re: Get the WQE index from slot index while completing the WQEs
While reporting the completions, SQ Work Queue index is required to
identify the WQE that generated the completions. In variable WQE mode, FW
returns the slot index for Error completions. Driver need to walk through
the shadow queue between the consumer index and producer index and matches
the slot index returned by FW. If a match is found, the next index of the
shadow queue is the WQE index to be considered for remaining poll_cq loop.

Link: https://patch.msgid.link/r/1724042847-1481-3-git-send-email-selvin.xavier@broadcom.com
Signed-off-by: Hongguang Gao <hongguang.gao@broadcom.com>
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2024-08-27 10:15:57 -03:00
Selvin Xavier
de1d364c38 RDMA/bnxt_re: Add support for Variable WQE in Genp7 adapters
Variable size WQE means that each send Work Queue Entry to HW can use
different WQE sizes as opposed to the static WQE size on the current
devices. Set variable WQE mode for Gen P7 devices. Depth of the Queue will
be a multiple of slot which is 16 bytes. The number of slots should be a
multiple of 256 as per the HW requirement.

Initialize the Software shadow queue to hold requests equal to the number
of slots. Also, do not expose the variable size WQE capability until the
last patch in the series.

Link: https://patch.msgid.link/r/1724042847-1481-2-git-send-email-selvin.xavier@broadcom.com
Signed-off-by: Hongguang Gao <hongguang.gao@broadcom.com>
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2024-08-27 10:15:57 -03:00
Yehuda Yitschak
04e36fd27a RDMA/efa: Add support for node guid
Propagate the unique, per device, ID in the device attributes to the
standard node_guid value in IB device.

Link: https://patch.msgid.link/r/20240822171143.2800-1-mrgolin@amazon.com
Reviewed-by: Yonatan Nachum <ynachum@amazon.com>
Signed-off-by: Yehuda Yitschak <yehuday@amazon.com>
Signed-off-by: Michael Margolin <mrgolin@amazon.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2024-08-23 11:52:45 -03:00
Zhu Yanjun
86dfdd8288 RDMA/iwcm: Fix WARNING:at_kernel/workqueue.c:#check_flush_dependency
In the commit aee2424246 ("RDMA/iwcm: Fix a use-after-free related to
destroying CM IDs"), the function flush_workqueue is invoked to flush the
work queue iwcm_wq.

But at that time, the work queue iwcm_wq was created via the function
alloc_ordered_workqueue without the flag WQ_MEM_RECLAIM.

Because the current process is trying to flush the whole iwcm_wq, if
iwcm_wq doesn't have the flag WQ_MEM_RECLAIM, verify that the current
process is not reclaiming memory or running on a workqueue which doesn't
have the flag WQ_MEM_RECLAIM as that can break forward-progress guarantee
leading to a deadlock.

The call trace is as below:

[  125.350876][ T1430] Call Trace:
[  125.356281][ T1430]  <TASK>
[ 125.361285][ T1430] ? __warn (kernel/panic.c:693)
[ 125.367640][ T1430] ? check_flush_dependency (kernel/workqueue.c:3706 (discriminator 9))
[ 125.375689][ T1430] ? report_bug (lib/bug.c:180 lib/bug.c:219)
[ 125.382505][ T1430] ? handle_bug (arch/x86/kernel/traps.c:239)
[ 125.388987][ T1430] ? exc_invalid_op (arch/x86/kernel/traps.c:260 (discriminator 1))
[ 125.395831][ T1430] ? asm_exc_invalid_op (arch/x86/include/asm/idtentry.h:621)
[ 125.403125][ T1430] ? check_flush_dependency (kernel/workqueue.c:3706 (discriminator 9))
[ 125.410984][ T1430] ? check_flush_dependency (kernel/workqueue.c:3706 (discriminator 9))
[ 125.418764][ T1430] __flush_workqueue (kernel/workqueue.c:3970)
[ 125.426021][ T1430] ? __pfx___might_resched (kernel/sched/core.c:10151)
[ 125.433431][ T1430] ? destroy_cm_id (drivers/infiniband/core/iwcm.c:375) iw_cm
[ 125.441209][ T1430] ? __pfx___flush_workqueue (kernel/workqueue.c:3910)
[ 125.473900][ T1430] ? _raw_spin_lock_irqsave (arch/x86/include/asm/atomic.h:107 include/linux/atomic/atomic-arch-fallback.h:2170 include/linux/atomic/atomic-instrumented.h:1302 include/asm-generic/qspinlock.h:111 include/linux/spinlock.h:187 include/linux/spinlock_api_smp.h:111 kernel/locking/spinlock.c:162)
[ 125.473909][ T1430] ? __pfx__raw_spin_lock_irqsave (kernel/locking/spinlock.c:161)
[ 125.482537][ T1430] _destroy_id (drivers/infiniband/core/cma.c:2044) rdma_cm
[ 125.495072][ T1430] nvme_rdma_free_queue (drivers/nvme/host/rdma.c:656 drivers/nvme/host/rdma.c:650) nvme_rdma
[ 125.505827][ T1430] nvme_rdma_reset_ctrl_work (drivers/nvme/host/rdma.c:2180) nvme_rdma
[ 125.505831][ T1430] process_one_work (kernel/workqueue.c:3231)
[ 125.515122][ T1430] worker_thread (kernel/workqueue.c:3306 kernel/workqueue.c:3393)
[ 125.515127][ T1430] ? __pfx_worker_thread (kernel/workqueue.c:3339)
[ 125.531837][ T1430] kthread (kernel/kthread.c:389)
[ 125.539864][ T1430] ? __pfx_kthread (kernel/kthread.c:342)
[ 125.550628][ T1430] ret_from_fork (arch/x86/kernel/process.c:147)
[ 125.558840][ T1430] ? __pfx_kthread (kernel/kthread.c:342)
[ 125.558844][ T1430] ret_from_fork_asm (arch/x86/entry/entry_64.S:257)
[  125.566487][ T1430]  </TASK>
[  125.566488][ T1430] ---[ end trace 0000000000000000 ]---

Fixes: aee2424246 ("RDMA/iwcm: Fix a use-after-free related to destroying CM IDs")
Link: https://patch.msgid.link/r/20240820113336.19860-1-yanjun.zhu@linux.dev
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202408151633.fc01893c-oliver.sang@intel.com
Tested-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2024-08-23 11:44:43 -03:00
zhenwei pi
444948ee12 RDMA/rxe: Fix __bth_set_resv6a
__bth_set_resv6a is used to clear BIT [24, 29] of rxe_bth::qpn, the
wrong expression leads other BITs into 1.

Link: https://patch.msgid.link/r/20240822065223.1117056-4-pizhenwei@bytedance.com
Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2024-08-23 11:42:38 -03:00
zhenwei pi
938aa9a333 RDMA/rxe: Fix misspelling of 'rmda'
Fix 'rmda' into 'RDMA'.

Link: https://patch.msgid.link/r/20240822065223.1117056-3-pizhenwei@bytedance.com
Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2024-08-23 11:42:38 -03:00
zhenwei pi
c87c5f47ff RDMA/rxe: Use sizeof instead of hard code number
Use 'sizeof(union rdma_network_hdr)' instead of hard code GRH length
for GSI and UD.

Link: https://patch.msgid.link/r/20240822065223.1117056-2-pizhenwei@bytedance.com
Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2024-08-23 11:42:38 -03:00
Jinjie Ruan
ae46d3fc17 RDMA/mlx4: Simplify an alloc_ordered_workqueue() invocation
Let alloc_ordered_workqueue() format the workqueue name instead of calling
snprintf() explicitly.

Link: https://patch.msgid.link/r/20240823101840.515398-5-ruanjinjie@huawei.com
Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2024-08-23 11:37:49 -03:00
Jinjie Ruan
87a55daa67 RDMA/mlx4: Simplify an alloc_ordered_workqueue() invocation
Let alloc_ordered_workqueue() format the workqueue name instead of calling
snprintf() explicitly.

Link: https://patch.msgid.link/r/20240823101840.515398-4-ruanjinjie@huawei.com
Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2024-08-23 11:37:49 -03:00
Jinjie Ruan
7229d7b64e RDMA/mad: Simplify an alloc_ordered_workqueue() invocation
Let alloc_ordered_workqueue() format the workqueue name instead of calling
snprintf() explicitly.

Link: https://patch.msgid.link/r/20240823101840.515398-3-ruanjinjie@huawei.com
Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2024-08-23 11:37:49 -03:00
Jinjie Ruan
92c7ad8364 RDMA/qib: Simplify an alloc_ordered_workqueue() invocation
Let alloc_ordered_workqueue() format the workqueue name instead of
calling snprintf() explicitly.

Link: https://patch.msgid.link/r/20240823101840.515398-2-ruanjinjie@huawei.com
Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2024-08-23 11:37:49 -03:00
Zhang Zekun
e2e641fe1c RDMA/ipoib: Remove unused declarations
There are some declarations without function definition, which are listed
as below:

1. ipoib_ib_tx_timer_func() has also been removed since
   commit 8966e28d2e ("IB/ipoib: Use NAPI in UD/TX flows")

2. ipoib_pkey_event() has been removed since
   commit ee1e2c82c2 ("IPoIB: Refresh paths instead of flushing them on
   SM change events")

3. ipoib_mcast_dev_down() has been removed since
   commit 988bd50300 ("IPoIB: Fix memory leak of multicast group
   structures")

4. ipoib_pkey_open() has been removed since
   commit dd57c9308a ("IB/ipoib: Avoid multicast join attempts with
   invalid P_key")

Remove these unused declarations.

Link: https://patch.msgid.link/r/20240818055702.79547-3-zhangzekun11@huawei.com
Signed-off-by: Zhang Zekun <zhangzekun11@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2024-08-19 15:19:52 -03:00
Zhang Zekun
1fb797af8a RDMA/core: Remove unused declaration rdma_resolve_ip_route()
The definition of rdma_resolve_ip_route() has been removed. Remove the
unused declaration.

Fixes: 6aaecd3856 ("RDMA/core: Simplify roce_resolve_route_from_path()")
Link: https://patch.msgid.link/r/20240818055702.79547-2-zhangzekun11@huawei.com
Signed-off-by: Zhang Zekun <zhangzekun11@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2024-08-19 15:19:52 -03:00