Commit Graph

14373 Commits

Author SHA1 Message Date
Yonatan Nachum
fe0812e4bc RDMA/efa: Remove duplicate aenq enable macro
We have the same macro in main and verbs files and we don't use the macro
in the verbs file, remove it.

Link: https://lore.kernel.org/r/20240624160918.27060-3-mrgolin@amazon.com
Reviewed-by: Yossi Leybovich <sleybo@amazon.com>
Signed-off-by: Yonatan Nachum <ynachum@amazon.com>
Signed-off-by: Michael Margolin <mrgolin@amazon.com>
Reviewed-by: Gal Pressman <gal.pressman@linux.dev>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2024-06-26 12:35:00 -03:00
Gal Pressman
47f9b4190a RDMA/efa: Use offset_in_page() function
Use offset_in_page() instead of open-coding it.

Link: https://lore.kernel.org/r/20240624160918.27060-2-mrgolin@amazon.com
Reviewed-by: Yossi Leybovich <sleybo@amazon.com>
Reviewed-by: Firas Jahjah <firasj@amazon.com>
Signed-off-by: Gal Pressman <galpress@amazon.com>
Signed-off-by: Michael Margolin <mrgolin@amazon.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2024-06-26 12:34:43 -03:00
Christophe JAILLET
5a905e33b2 RDMA/hfi1: Constify struct mmu_rb_ops
'struct mmu_rb_ops' is not modified in this driver.

Constifying this structure moves some data to a read-only section, so
increase overall security.

On a x86_64, with allmodconfig, as an example:
Before:
======
   text	   data	    bss	    dec	    hex	filename
  10879	    164	      0	  11043	   2b23	drivers/infiniband/hw/hfi1/pin_system.o

After:
=====
   text	   data	    bss	    dec	    hex	filename
  10907	    140	      0	  11047	   2b27	drivers/infiniband/hw/hfi1/pin_system.o

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Link: https://lore.kernel.org/r/b826dd05eefa5f4d6a7a1b4d191eaf37c714ed04.1719259997.git.christophe.jaillet@wanadoo.fr
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2024-06-26 10:53:29 -03:00
Honggang LI
4adcaf969d RDMA/rxe: Don't set BTH_ACK_MASK for UC or UD QPs
BTH_ACK_MASK bit is used to indicate that an acknowledge
(for this packet) should be scheduled by the responder.
Both UC and UD QPs are unacknowledged, so don't set
BTH_ACK_MASK for UC or UD QPs.

Fixes: 8700e3e7c4 ("Soft RoCE driver")
Signed-off-by: Honggang LI <honggangli@163.com>
Link: https://lore.kernel.org/r/20240624020348.494338-1-honggangli@163.com
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2024-06-26 10:53:29 -03:00
Max Gurtovoy
58945ddd71 IB/isert: remove the handling of last WQE reached event
This event is raised for QPs that are associated with a Shared RQ (SRQ).
The iSER target does not support SRQ. Remove this dead code.

Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Link: https://lore.kernel.org/r/20240619171153.34631-3-mgurtovoy@nvidia.com
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2024-06-26 10:53:29 -03:00
Max Gurtovoy
844bc12e6d IB/core: add support for draining Shared receive queues
To avoid leakage for QPs assocoated with SRQ, according to IB spec
(section 10.3.1):

"Note, for QPs that are associated with an SRQ, the Consumer should take
the QP through the Error State before invoking a Destroy QP or a Modify
QP to the Reset State. The Consumer may invoke the Destroy QP without
first performing a Modify QP to the Error State and waiting for the Affiliated
Asynchronous Last WQE Reached Event. However, if the Consumer
does not wait for the Affiliated Asynchronous Last WQE Reached Event,
then WQE and Data Segment leakage may occur. Therefore, it is good
programming practice to teardown a QP that is associated with an SRQ
by using the following process:
 - Put the QP in the Error State;
 - wait for the Affiliated Asynchronous Last WQE Reached Event;
 - either:
   - drain the CQ by invoking the Poll CQ verb and either wait for CQ
     to be empty or the number of Poll CQ operations has exceeded
     CQ capacity size; or
   - post another WR that completes on the same CQ and wait for this
     WR to return as a WC;
 - and then invoke a Destroy QP or Reset QP."

Catch the Last WQE Reached Event in the core layer during drain QP flow.

Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Link: https://lore.kernel.org/r/20240619171153.34631-2-mgurtovoy@nvidia.com
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2024-06-26 10:53:29 -03:00
Leon Romanovsky
5953e0647c RDMA/mlx4: Fix truncated output warning in alias_GUID.c
drivers/infiniband/hw/mlx4/alias_GUID.c: In function ‘mlx4_ib_init_alias_guid_service’:
drivers/infiniband/hw/mlx4/alias_GUID.c:878:74: error: ‘%d’ directive
output may be truncated writing between 1 and 11 bytes into a region of
size 5 [-Werror=format-truncation=]
  878 |                 snprintf(alias_wq_name, sizeof alias_wq_name, "alias_guid%d", i);
      |                                                                          ^~
drivers/infiniband/hw/mlx4/alias_GUID.c:878:63: note: directive argument in the range [-2147483641, 2147483646]
  878 |                 snprintf(alias_wq_name, sizeof alias_wq_name, "alias_guid%d", i);
      |                                                               ^~~~~~~~~~~~~~
drivers/infiniband/hw/mlx4/alias_GUID.c:878:17: note: ‘snprintf’ output
between 12 and 22 bytes into a destination of size 15
  878 |                 snprintf(alias_wq_name, sizeof alias_wq_name, "alias_guid%d", i);
      |                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
cc1: all warnings being treated as errors

Fixes: a0c64a17ab ("mlx4: Add alias_guid mechanism")
Link: https://lore.kernel.org/r/1951c9500109ca7e36dcd523f8a5f2d0d2a608d1.1718554641.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2024-06-26 10:53:29 -03:00
Leon Romanovsky
0d2e6992fc RDMA/mlx4: Fix truncated output warning in mad.c
Increase size of the name array to avoid truncated output warning.

drivers/infiniband/hw/mlx4/mad.c: In function ‘mlx4_ib_alloc_demux_ctx’:
drivers/infiniband/hw/mlx4/mad.c:2197:47: error: ‘%d’ directive output
may be truncated writing between 1 and 11 bytes into a region of size 4
[-Werror=format-truncation=]
 2197 |         snprintf(name, sizeof(name), "mlx4_ibt%d", port);
      |                                               ^~
drivers/infiniband/hw/mlx4/mad.c:2197:38: note: directive argument in
the range [-2147483645, 2147483647]
 2197 |         snprintf(name, sizeof(name), "mlx4_ibt%d", port);
      |                                      ^~~~~~~~~~~~
drivers/infiniband/hw/mlx4/mad.c:2197:9: note: ‘snprintf’ output between
10 and 20 bytes into a destination of size 12
 2197 |         snprintf(name, sizeof(name), "mlx4_ibt%d", port);
      |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
drivers/infiniband/hw/mlx4/mad.c:2205:48: error: ‘%d’ directive output
may be truncated writing between 1 and 11 bytes into a region of size 3
[-Werror=format-truncation=]
 2205 |         snprintf(name, sizeof(name), "mlx4_ibwi%d", port);
      |                                                ^~
drivers/infiniband/hw/mlx4/mad.c:2205:38: note: directive argument in
the range [-2147483645, 2147483647]
 2205 |         snprintf(name, sizeof(name), "mlx4_ibwi%d", port);
      |                                      ^~~~~~~~~~~~~
drivers/infiniband/hw/mlx4/mad.c:2205:9: note: ‘snprintf’ output between
11 and 21 bytes into a destination of size 12
 2205 |         snprintf(name, sizeof(name), "mlx4_ibwi%d", port);
      |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
drivers/infiniband/hw/mlx4/mad.c:2213:48: error: ‘%d’ directive output
may be truncated writing between 1 and 11 bytes into a region of size 3
[-Werror=format-truncation=]
 2213 |         snprintf(name, sizeof(name), "mlx4_ibud%d", port);
      |                                                ^~
drivers/infiniband/hw/mlx4/mad.c:2213:38: note: directive argument in
the range [-2147483645, 2147483647]
 2213 |         snprintf(name, sizeof(name), "mlx4_ibud%d", port);
      |                                      ^~~~~~~~~~~~~
drivers/infiniband/hw/mlx4/mad.c:2213:9: note: ‘snprintf’ output between
11 and 21 bytes into a destination of size 12
 2213 |         snprintf(name, sizeof(name), "mlx4_ibud%d", port);
      |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
cc1: all warnings being treated as errors
make[6]: *** [scripts/Makefile.build:244: drivers/infiniband/hw/mlx4/mad.o] Error 1

Fixes: fc06573dfa ("IB/mlx4: Initialize SR-IOV IB support for slaves in master context")
Link: https://lore.kernel.org/r/f3798b3ce9a410257d7e1ec7c9e285f1352e256a.1718554569.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2024-06-26 10:53:29 -03:00
Konstantin Taranov
82a5cc783d RDMA/mana_ib: Ignore optional access flags for MRs
Ignore optional ib_access_flags when an MR is created.

Fixes: 0266a17763 ("RDMA/mana_ib: Add a driver for Microsoft Azure Network Adapter")
Signed-off-by: Konstantin Taranov <kotaranov@microsoft.com>
Link: https://lore.kernel.org/r/1717575368-14879-1-git-send-email-kotaranov@linux.microsoft.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-06-21 10:19:36 -03:00
Patrisious Haddad
36ab7ada64 RDMA/mlx5: Add check for srq max_sge attribute
max_sge attribute is passed by the user, and is inserted and used
unchecked, so verify that the value doesn't exceed maximum allowed value
before using it.

Fixes: e126ba97db ("mlx5: Add driver for Mellanox Connect-IB adapters")
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Link: https://lore.kernel.org/r/277ccc29e8d57bfd53ddeb2ac633f2760cf8cdd0.1716900410.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-06-21 10:19:36 -03:00
Yishai Hadas
81497c148b RDMA/mlx5: Fix unwind flow as part of mlx5_ib_stage_init_init
Fix unwind flow as part of mlx5_ib_stage_init_init to use the correct
goto upon an error.

Fixes: 758ce14aee ("RDMA/mlx5: Implement MACsec gid addition and deletion")
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Reviewed-by: Patrisious Haddad <phaddad@nvidia.com>
Link: https://lore.kernel.org/r/aa40615116eda14ec9eca21d52017d632ea89188.1716900410.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-06-21 10:19:36 -03:00
Jason Gunthorpe
2e4c02fdec RDMA/mlx5: Ensure created mkeys always have a populated rb_key
cachable and mmkey.rb_key together are used by mlx5_revoke_mr() to put the
MR/mkey back into the cache. In all cases they should be set correctly.

alloc_cacheable_mr() was setting cachable but not filling rb_key,
resulting in cache_ent_find_and_store() bucketing them all into a 0 length
entry.

implicit_get_child_mr()/mlx5_ib_alloc_implicit_mr() failed to set cachable
or rb_key at all, so the cache was not working at all for implicit ODP.

Cc: stable@vger.kernel.org
Fixes: 8c1185fef6 ("RDMA/mlx5: Change check for cacheable mkeys")
Fixes: dd1b913fb0 ("RDMA/mlx5: Cache all user cacheable mkeys on dereg MR flow")
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/7778c02dfa0999a30d6746c79a23dd7140a9c729.1716900410.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-06-21 10:19:36 -03:00
Jason Gunthorpe
f637040c33 RDMA/mlx5: Follow rb_key.ats when creating new mkeys
When a cache ent already exists but doesn't have any mkeys in it the cache
will automatically create a new one based on the specification in the
ent->rb_key.

ent->ats was missed when creating the new key and so ma_translation_mode
was not being set even though the ent requires it.

Cc: stable@vger.kernel.org
Fixes: 73d09b2fe8 ("RDMA/mlx5: Introduce mlx5r_cache_rb_key")
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Michael Guralnik <michaelgur@nvidia.com>
Link: https://lore.kernel.org/r/7c5613458ecb89fbe5606b7aa4c8d990bdea5b9a.1716900410.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-06-21 10:19:36 -03:00
Jason Gunthorpe
c1eb251259 RDMA/mlx5: Remove extra unlock on error path
The below commit lifted the locking out of this function but left this
error path unlock behind resulting in unbalanced locking. Remove the
missed unlock too.

Cc: stable@vger.kernel.org
Fixes: 627122280c ("RDMA/mlx5: Add work to remove temporary entries from the cache")
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Michael Guralnik <michaelgur@nvidia.com>
Link: https://lore.kernel.org/r/78090c210c750f47219b95248f9f782f34548bb1.1716900410.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-06-21 10:19:36 -03:00
Leon Romanovsky
a92fbeac7e RDMA/cache: Release GID table even if leak is detected
When the table is released, we nullify pointer to GID table, it means
that in case GID entry leak is detected, we will leak table too.

Delete code that prevents table destruction.

Fixes: b150c3862d ("IB/core: Introduce GID entry reference counts")
Link: https://lore.kernel.org/r/a62560af06ba82c88ef9194982bfa63d14768ff9.1716900410.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-06-21 10:17:44 -03:00
Leon Romanovsky
b7161db2d9 Merge branch 'mlx5-next' into wip/leon-for-next
The req_transport_retries_exceeded counter shows the number of times
requester detected transport retries exceed error.

The req_rnr_retries_exceeded counter show the number of times the
requester detected RNR NAKs retries exceed error.

Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-06-16 18:54:04 +03:00
Patrisious Haddad
b339e0a39d RDMA/mlx5: Add Qcounters req_transport_retries_exceeded/req_rnr_retries_exceeded
The req_transport_retries_exceeded counter shows the number of times
requester detected transport retries exceed error.

The req_rnr_retries_exceeded counter show the number of times the
requester detected RNR NAKs retries exceed error.

Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Link: https://lore.kernel.org/r/250466af94f4989d638fab168e246035530e912f.1718301543.git.leon@kernel.org
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-06-16 18:53:23 +03:00
Chiara Meiohas
a4e540119b RDMA/mlx5: Set mkeys for dmabuf at PAGE_SIZE
Set the mkey for dmabuf at PAGE_SIZE to support any SGL
after a move operation.

ib_umem_find_best_pgsz returns 0 on error, so it is
incorrect to check the returned page_size against PAGE_SIZE

Fixes: 90da7dc820 ("RDMA/mlx5: Support dma-buf based userspace memory region")
Signed-off-by: Chiara Meiohas <cmeiohas@nvidia.com>
Reviewed-by: Michael Guralnik <michaelgur@nvidia.com>
Link: https://lore.kernel.org/r/1e2289b9133e89f273a4e68d459057d032cbc2ce.1718301631.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-06-16 18:39:29 +03:00
Jianbo Liu
5895e70f2e IB/mlx5: Allocate resources just before first QP/SRQ is created
Previously, all IB dev resources are initialized on driver load. As
they are not always used, move the initialization to the time when
they are needed.

To be more specific, move PD (p0) and CQ (c0) initialization to the
time when the first SRQ is created. and move SRQs(s0 and s1)
initialization to the time first QP is created. To avoid concurrent
creations, two new mutexes are also added.

Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Link: https://lore.kernel.org/r/98c3e53a8cc0bdfeb6dec6e5bb8b037d78ab00d8.1717409369.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-06-16 18:37:57 +03:00
Jianbo Liu
638420115c IB/mlx5: Create UMR QP just before first reg_mr occurs
UMR QP is not used in some cases, so move QP and its CQ creations from
driver load flow to the time first reg_mr occurs, that is when MR
interfaces are first called.

The initialization of dev->umrc.pd and dev->umrc.lock is still done in
driver load because pd is needed for mlx5_mkey_cache_init and the lock
is reused to protect against the concurrent creation.

When testing 4G bytes memory registration latency with rtool [1] and 8
threads in parallel, there is minor performance degradation (<5% for
the max latency) is seen for the first reg_mr with this change.

Link: https://github.com/paravmellanox/rtool [1]

Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Link: https://lore.kernel.org/r/55d3c4f8a542fd974d8a4c5816eccfb318a59b38.1717409369.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-06-16 18:37:50 +03:00
Leon Romanovsky
ae6f6dd5fd Delay mlx5_ib internal resources allocations
From: Leon Romanovsky <leonro@nvidia.com>

Internal mlx5_ib resources are created during mlx5_ib module load. This
behavior is not optimal because it consumes resources that are not
needed when SFs are created. This patch series delays the creation of
mlx5_ib internal resources to the stage when they actually used.

Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-06-16 18:35:47 +03:00
Jianbo Liu
d98995b4bf net/mlx5: Reimplement write combining test
The test of write combining was added before in mlx5_ib driver. It
opens UD QP and posts NOP WQEs, and uses BlueFlame doorbell. When
BlueFlame is used, WQEs get written directly to a PCI BAR of the
device (in addition to memory) so that the device handles them without
having to access memory.

In this test, the WQEs written in memory are different from the ones
written to the BlueFlame which request CQE update. By checking the
completion reports posted on CQ, we can know if BlueFlame succeeds or
not. The write combining must be supported if BlueFlame succeeds as
its register is written using write combining.

This patch reimplements the test in the same way, but using a pair of
SQ and CQ only. It is moved to mlx5_core as a general feature used by
both mlx5_core and mlx5_ib.

Besides, save write combine test result of the PCI function, so that
its thousands of child functions such as SF can query without paying
the time and resource penalty by itself. The test function is called
only after failing to get the cached result. With this enhancement,
all thousands of SFs of the PF attached to same driver no longer need
to perform WC check explicitly, which is already done in the system.
This saves several commands per SF, thereby speeds up SF creation and
also saves completion EQ creation.

Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/4ff5a8cc4c5b5b0d98397baa45a5019bcdbf096e.1717409369.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-06-16 18:33:59 +03:00
Leon Romanovsky
ef5513526b Merge branch 'mana-shared' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Leon Romanovsky says:

====================
net: mana: Allow variable size indirection table

Like we talked, I created new shared branch for this patch:
https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/log/?h=mana-shared

* 'mana-shared' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
   net: mana: Allow variable size indirection table
====================

Link: https://lore.kernel.org/all/20240612183051.GE4966@unreal
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-06-16 10:51:25 +03:00
Shradha Gupta
7fc45cb686 net: mana: Allow variable size indirection table
Allow variable size indirection table allocation in MANA instead
of using a constant value MANA_INDIRECT_TABLE_SIZE.
The size is now derived from the MANA_QUERY_VPORT_CONFIG and the
indirection table is allocated dynamically.

Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
Link: https://lore.kernel.org/r/1718015319-9609-1-git-send-email-shradhagupta@linux.microsoft.com
Reviewed-by: Dexuan Cui <decui@microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-06-12 21:08:42 +03:00
Konstantin Taranov
2a1251e3db RDMA/mana_ib: Process QP error events in mana_ib
Process QP fatal events from the error event queue.
For that, find the QP, using QPN from the event, and then call its
event_handler. To find the QPs, store created RC QPs in an xarray.

Signed-off-by: Konstantin Taranov <kotaranov@microsoft.com>
Link: https://lore.kernel.org/r/1717754897-19858-1-git-send-email-kotaranov@linux.microsoft.com
Reviewed-by: Wei Hu <weh@microsoft.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-06-09 13:46:44 +03:00
Honggang LI
f67ac0061c RDMA/rxe: Fix responder length checking for UD request packets
According to the IBA specification:
If a UD request packet is detected with an invalid length, the request
shall be an invalid request and it shall be silently dropped by
the responder. The responder then waits for a new request packet.

commit 689c5421bf ("RDMA/rxe: Fix incorrect responder length checking")
defers responder length check for UD QPs in function `copy_data`.
But it introduces a regression issue for UD QPs.

When the packet size is too large to fit in the receive buffer.
`copy_data` will return error code -EINVAL. Then `send_data_in`
will return RESPST_ERR_MALFORMED_WQE. UD QP will transfer into
ERROR state.

Fixes: 689c5421bf ("RDMA/rxe: Fix incorrect responder length checking")
Signed-off-by: Honggang LI <honggangli@163.com>
Link: https://lore.kernel.org/r/20240523094617.141148-1-honggangli@163.com
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-06-09 11:51:32 +03:00
Bart Van Assche
aee2424246 RDMA/iwcm: Fix a use-after-free related to destroying CM IDs
iw_conn_req_handler() associates a new struct rdma_id_private (conn_id) with
an existing struct iw_cm_id (cm_id) as follows:

        conn_id->cm_id.iw = cm_id;
        cm_id->context = conn_id;
        cm_id->cm_handler = cma_iw_handler;

rdma_destroy_id() frees both the cm_id and the struct rdma_id_private. Make
sure that cm_work_handler() does not trigger a use-after-free by only
freeing of the struct rdma_id_private after all pending work has finished.

Cc: stable@vger.kernel.org
Fixes: 59c68ac31e ("iw_cm: free cm_id resources on the last deref")
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20240605145117.397751-6-bvanassche@acm.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-06-09 11:25:10 +03:00
Bart Van Assche
a1babdb5b6 RDMA/iwcm: Simplify cm_work_handler()
Instead of complicating the code to avoid a spin_lock_irqsave() /
spin_lock_irqrestore() pair before returning, simplify the code by removing
the local variable 'empty'.

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20240605145117.397751-5-bvanassche@acm.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-06-09 11:15:27 +03:00
Bart Van Assche
e1168f09b3 RDMA/iwcm: Simplify cm_event_handler()
queue_work() can test efficiently whether or not work is pending. Hence,
simplify cm_event_handler() by always calling queue_work() instead of only
if the list with pending work is empty.

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20240605145117.397751-4-bvanassche@acm.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-06-09 11:15:27 +03:00
Bart Van Assche
fc772e38bc RDMA/iwcm: Change the return type of iwcm_deref_id()
Since iwcm_deref_id() returns either 0 or 1, change its return type from
'int' into 'bool'.

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20240605145117.397751-3-bvanassche@acm.org
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-06-09 11:15:27 +03:00
Bart Van Assche
b1bc15f8fb RDMA/iwcm: Use list_first_entry() where appropriate
Improve source code readability by using list_first_entry() where appropriate.

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20240605145117.397751-2-bvanassche@acm.org
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-06-09 11:15:27 +03:00
Konstantin Taranov
c8683b995d RDMA/mana_ib: extend query device
Fill in properties of the ib device.
Order the assignment in the order of fields in the struct ib_device_attr.

Signed-off-by: Konstantin Taranov <kotaranov@microsoft.com>
Link: https://lore.kernel.org/r/1717070117-1234-3-git-send-email-kotaranov@linux.microsoft.com
Reviewed-by: Long Li <longli@microsoft.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-06-02 11:50:15 +03:00
Konstantin Taranov
65357e2c16 RDMA/mana_ib: set node_guid
Use the mac address for the node_guid of the IB device.

Signed-off-by: Konstantin Taranov <kotaranov@microsoft.com>
Link: https://lore.kernel.org/r/1717070117-1234-2-git-send-email-kotaranov@linux.microsoft.com
Reviewed-by: Long Li <longli@microsoft.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-06-02 11:50:15 +03:00
Honggang LI
03fa18a992 RDMA/rxe: Fix data copy for IB_SEND_INLINE
For RDMA Send and Write with IB_SEND_INLINE, the memory buffers
specified in sge list will be placed inline in the Send Request.

The data should be copied by CPU from the virtual addresses of
corresponding sge list DMA addresses.

Cc: stable@kernel.org
Fixes: 8d7c7c0eeb ("RDMA: Add ib_virt_dma_to_page()")
Signed-off-by: Honggang LI <honggangli@163.com>
Link: https://lore.kernel.org/r/20240516095052.542767-1-honggangli@163.com
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Reviewed-by: Li Zhijian <lizhijian@fujitsu.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-05-30 16:20:18 +03:00
Michael Margolin
2d0e7ba468 RDMA/efa: Properly handle unexpected AQ completions
Do not try to handle admin command completion if it has an unexpected
command id and print a relevant error message.

Reviewed-by: Firas Jahjah <firasj@amazon.com>
Reviewed-by: Yehuda Yitschak <yehuday@amazon.com>
Signed-off-by: Michael Margolin <mrgolin@amazon.com>
Link: https://lore.kernel.org/r/20240513064630.6247-1-mrgolin@amazon.com
Reviewed-by: Gal Pressman <gal.pressman@linux.dev>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-05-30 15:48:13 +03:00
Konstantin Taranov
e095405b45 RDMA/mana_ib: Modify QP state
Implement modify QP state for RC QPs.

Signed-off-by: Konstantin Taranov <kotaranov@microsoft.com>
Link: https://lore.kernel.org/r/1716366242-558-4-git-send-email-kotaranov@linux.microsoft.com
Reviewed-by: Long Li <longli@microsoft.com>
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-05-30 15:26:57 +03:00
Konstantin Taranov
fdefb91849 RDMA/mana_ib: Implement uapi to create and destroy RC QP
Implement user requests to create and destroy an RC QP.
As the user does not have an FMR queue, it is skipped and NO_FMR flag
is used.

Signed-off-by: Konstantin Taranov <kotaranov@microsoft.com>
Link: https://lore.kernel.org/r/1716366242-558-3-git-send-email-kotaranov@linux.microsoft.com
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-05-30 15:26:57 +03:00
Konstantin Taranov
53657a0419 RDMA/mana_ib: Create and destroy RC QP
Implement HW requests to create and destroy an RC QP.
An RC QP may have 5 queues.

Signed-off-by: Konstantin Taranov <kotaranov@microsoft.com>
Link: https://lore.kernel.org/r/1716366242-558-2-git-send-email-kotaranov@linux.microsoft.com
Reviewed-by: Long Li <longli@microsoft.com>
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-05-30 15:26:57 +03:00
Christophe JAILLET
38c02d813a RDMA/irdma: Annotate flexible array with __counted_by() in struct irdma_qvlist_info
'num_vectors' is used to count the number of elements in the 'qv_info'
flexible array in "struct irdma_qvlist_info".

So annotate it with __counted_by() to make it explicit and enable some
additional checks.

This allocation is done in irdma_save_msix_info().

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Link: https://lore.kernel.org/r/2ca8b14adf79c4795d7aa95bbfc79253a6bfed82.1716102112.git.christophe.jaillet@wanadoo.fr
Acked-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-05-30 15:14:19 +03:00
Michael Margolin
435cdbe9f7 RDMA/efa: Fail probe on missing BARs
In case any of PCI BARs is missing during device probe we would like to
fail as early as possible. Fail if any of the required BARs isn't listed
as a memory BAR.

Reviewed-by: Daniel Kranzdorf <dkkranzd@amazon.com>
Reviewed-by: Firas Jahjah <firasj@amazon.com>
Signed-off-by: Michael Margolin <mrgolin@amazon.com>
Link: https://lore.kernel.org/r/20240513081019.26998-1-mrgolin@amazon.com
Reviewed-by: Gal Pressman <gal.pressman@linux.dev>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-05-30 15:07:55 +03:00
Selvin Xavier
056620da89 RDMA/bnxt_re: Fix the max msix vectors macro
bnxt_re no longer decide the number of MSI-x vectors used by itself.
Its decided by bnxt_en now. So when bnxt_en changes this value, system
crash is seen.

Depend on the max value reported by bnxt_en instead of using the its own macros.

Fixes: 3034322113 ("bnxt_en: Remove runtime interrupt vector allocation")
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Link: https://lore.kernel.org/r/1716195418-11767-1-git-send-email-selvin.xavier@broadcom.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-05-30 15:00:07 +03:00
Selvin Xavier
6f6bfbc595 RDMA/bnxt_re: Expose the MSN table capability for user library
BNXT_RE_COMP_MASK_UCNTX_HW_RETX_ENABLED was introduced to share the HW
retransmit capability between driver and lib. The main difference in
implementation for HW Retransmit support is the usage of MSN table or
PSN table . When HW retrans is enabled, HW expects MSN table to be
allocated by driver/lib, else PSN table (for older adapters).

FW expose a new field which gives MSN capability. Drivers and libs can
depend on the new field instead of HW Retrasns capability. For
adapters which support HW_RETX feature, MSN table capability will be
set. For older adapters, this value will be 0(to maintain backward
compatibility with older FW).

Rename UAPI just to capture the correct name of the HW capability
that driver/library is interested in. No functional impact even if
older rdma-core is used.

Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Link: https://lore.kernel.org/r/1716876697-25970-3-git-send-email-selvin.xavier@broadcom.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-05-30 13:28:02 +03:00
Selvin Xavier
8d310ba845 RDMA/bnxt_re: Allow MSN table capability check
FW reports the HW capability to use PSN table or MSN table and
driver/library need to select it based on this capability.
Use the new capability instead of the older capability check for HW
retransmission while handling the MSN/PSN table. FW report
zero (PSN table) for older adapters to maintain backward compatibility.

Also, Updated the FW interface structures to handle the new fields.

Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Link: https://lore.kernel.org/r/1716876697-25970-2-git-send-email-selvin.xavier@broadcom.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-05-30 13:28:02 +03:00
Steven Rostedt (Google)
2c92ca849f tracing/treewide: Remove second parameter of __assign_str()
With the rework of how the __string() handles dynamic strings where it
saves off the source string in field in the helper structure[1], the
assignment of that value to the trace event field is stored in the helper
value and does not need to be passed in again.

This means that with:

  __string(field, mystring)

Which use to be assigned with __assign_str(field, mystring), no longer
needs the second parameter and it is unused. With this, __assign_str()
will now only get a single parameter.

There's over 700 users of __assign_str() and because coccinelle does not
handle the TRACE_EVENT() macro I ended up using the following sed script:

  git grep -l __assign_str | while read a ; do
      sed -e 's/\(__assign_str([^,]*[^ ,]\) *,[^;]*/\1)/' $a > /tmp/test-file;
      mv /tmp/test-file $a;
  done

I then searched for __assign_str() that did not end with ';' as those
were multi line assignments that the sed script above would fail to catch.

Note, the same updates will need to be done for:

  __assign_str_len()
  __assign_rel_str()
  __assign_rel_str_len()

I tested this with both an allmodconfig and an allyesconfig (build only for both).

[1] https://lore.kernel.org/linux-trace-kernel/20240222211442.634192653@goodmis.org/

Link: https://lore.kernel.org/linux-trace-kernel/20240516133454.681ba6a0@rorschach.local.home

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Julia Lawall <Julia.Lawall@inria.fr>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Acked-by: Jani Nikula <jani.nikula@intel.com>
Acked-by: Christian König <christian.koenig@amd.com> for the amdgpu parts.
Acked-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> #for
Acked-by: Rafael J. Wysocki <rafael@kernel.org> # for thermal
Acked-by: Takashi Iwai <tiwai@suse.de>
Acked-by: Darrick J. Wong <djwong@kernel.org>	# xfs
Tested-by: Guenter Roeck <linux@roeck-us.net>
2024-05-22 20:14:47 -04:00
Linus Torvalds
d90be6e4aa Driver core changes for 6.10-rc1
Here is the small set of driver core and kernfs changes for 6.10-rc1.
 
 Nothing major here at all, just a small set of changes for some driver
 core apis, and minor fixups.  Included in here are:
   - sysfs_bin_attr_simple_read() helper added and used
   - device_show_string() helper added and used
 All usages of these were acked by the various maintainers.  Also in here
 are:
   - kernfs minor cleanup
   - removed unused functions
   - typo fix in documentation
   - pay attention to sysfs_create_link() failures in module.c finally.
 
 All of these have been in linux-next for a very long time with no
 reported problems.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCZk3+hQ8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+ylfTwCfUyHWkDZuZ7ehdtjzfmcd4EKZBK8An3AAV99G
 ox8PXMxuFTaUEdT/69FQ
 =2sEo
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-6.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core updates from Greg KH:
 "Here is the small set of driver core and kernfs changes for 6.10-rc1.

  Nothing major here at all, just a small set of changes for some driver
  core apis, and minor fixups. Included in here are:

   - sysfs_bin_attr_simple_read() helper added and used

   - device_show_string() helper added and used

  All usages of these were acked by the various maintainers. Also in
  here are:

   - kernfs minor cleanup

   - removed unused functions

   - typo fix in documentation

   - pay attention to sysfs_create_link() failures in module.c finally

  All of these have been in linux-next for a very long time with no
  reported problems"

* tag 'driver-core-6.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
  device property: Fix a typo in the description of device_get_child_node_count()
  kernfs: mount: Remove unnecessary ‘NULL’ values from knparent
  scsi: Use device_show_string() helper for sysfs attributes
  platform/x86: Use device_show_string() helper for sysfs attributes
  perf: Use device_show_string() helper for sysfs attributes
  IB/qib: Use device_show_string() helper for sysfs attributes
  hwmon: Use device_show_string() helper for sysfs attributes
  driver core: Add device_show_string() helper for sysfs attributes
  treewide: Use sysfs_bin_attr_simple_read() helper
  sysfs: Add sysfs_bin_attr_simple_read() helper
  module: don't ignore sysfs_create_link() failures
  driver core: Remove unused platform_notify, platform_notify_remove
2024-05-22 12:13:40 -07:00
Linus Torvalds
f0bae243b2 pci-v6.10-changes
-----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCgAyFiEEgMe7l+5h9hnxdsnuWYigwDrT+vwFAmZLzNIUHGJoZWxnYWFz
 QGdvb2dsZS5jb20ACgkQWYigwDrT+vwr/Q//STe2XGKI8bAKqP2wbbkzm+ISnK4A
 Lqf3FEAIXunxDRspszfXKKV2p4vaIkmOFiwIdtp/kWvd0DQn5+ATXJ/iQtp8aFX/
 R+6BQ7EZc2G7fN5fbQuK54+CvmWEpkKEMbXYbd6ivQ14Cijdb3Nbu+w+DYFjS+6C
 k2a9lS1bTW7Xcy0fyiO1w6GQiWqtmOH8U3OlQtIrI0EVkDG9OG1LsLuc92/FgkOo
 REN+sU+hX1K5fHrvm2CtjYDn/9/B6bJ/It22H1dPgUL9nKvKC67fYzosMtUCOX1M
 6XSPjZIuXOmQGeZXHhpSlVwaidxoUjYO98I7nMquxKdCy6yct3geK7ULG/xeQCgD
 ML7MGQB4+sTiSWalXUQaziKqF1FIDEvU3HMGXFWnoBL5l56eRp8KS1EI9Eqk9pU3
 pk9fJaCkcFnkzPtMFzqPOm5q9zUZ6bGbfYb0hs72TUKplmVDhFo2T1YsW2AOyHZ7
 mjuDzUYZX0H7uM1tntA56IgZX+oNOrLvhBt5L5M/BQeCsZFBBUfIcAEaYoL9LwXO
 AYgIG3jdqzHHyAUzutJF+XHKinJLMHm0XVYbFmO6saPhFzrUJSNHqT7NzW1DGGTl
 OnO8e1WNMX1EcnKvnc6fXyGmM3SgVwy45FsbG/zRnhn4uBKqKtjrh6uX/myA22LK
 CSeqSUK9XmXxFNA=
 =xjoS
 -----END PGP SIGNATURE-----

Merge tag 'pci-v6.10-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci

Pull pci updates from Bjorn Helgaas:
 "Enumeration:

   - Skip E820 checks for MCFG ECAM regions for new (2016+) machines,
     since there's no requirement to describe them in E820 and some
     platforms require ECAM to work (Bjorn Helgaas)

   - Rename PCI_IRQ_LEGACY to PCI_IRQ_INTX to be more specific (Damien
     Le Moal)

   - Remove last user and pci_enable_device_io() (Heiner Kallweit)

   - Wait for Link Training==0 to avoid possible race (Ilpo Järvinen)

   - Skip waiting for devices that have been disconnected while
     suspended (Ilpo Järvinen)

   - Clear Secondary Status errors after enumeration since Master Aborts
     and Unsupported Request errors are an expected part of enumeration
     (Vidya Sagar)

  MSI:

   - Remove unused IMS (Interrupt Message Store) support (Bjorn Helgaas)

  Error handling:

   - Mask Genesys GL975x SD host controller Replay Timer Timeout
     correctable errors caused by a hardware defect; the errors cause
     interrupts that prevent system suspend (Kai-Heng Feng)

   - Fix EDR-related _DSM support, which previously evaluated revision 5
     but assumed revision 6 behavior (Kuppuswamy Sathyanarayanan)

  ASPM:

   - Simplify link state definitions and mask calculation (Ilpo
     Järvinen)

  Power management:

   - Avoid D3cold for HP Pavilion 17 PC/1972 PCIe Ports, where BIOS
     apparently doesn't know how to put them back in D0 (Mario
     Limonciello)

  CXL:

   - Support resetting CXL devices; special handling required because
     CXL Ports mask Secondary Bus Reset by default (Dave Jiang)

  DOE:

   - Support DOE Discovery Version 2 (Alexey Kardashevskiy)

  Endpoint framework:

   - Set endpoint BAR to be 64-bit if the driver says that's all the
     device supports, in addition to doing so if the size is >2GB
     (Niklas Cassel)

   - Simplify endpoint BAR allocation and setting interfaces (Niklas
     Cassel)

  Cadence PCIe controller driver:

   - Drop DT binding redundant msi-parent and pci-bus.yaml (Krzysztof
     Kozlowski)

  Cadence PCIe endpoint driver:

   - Configure endpoint BARs to be 64-bit based on the BAR type, not the
     BAR value (Niklas Cassel)

  Freescale Layerscape PCIe controller driver:

   - Convert DT binding to YAML (Frank Li)

  MediaTek MT7621 PCIe controller driver:

   - Add DT binding missing 'reg' property for child Root Ports
     (Krzysztof Kozlowski)

   - Fix theoretical string truncation in PHY name (Sergio Paracuellos)

  NVIDIA Tegra194 PCIe controller driver:

   - Return success for endpoint probe instead of falling through to the
     failure path (Vidya Sagar)

  Renesas R-Car PCIe controller driver:

   - Add DT binding missing IOMMU properties (Geert Uytterhoeven)

   - Add DT binding R-Car V4H compatible for host and endpoint mode
     (Yoshihiro Shimoda)

  Rockchip PCIe controller driver:

   - Configure endpoint BARs to be 64-bit based on the BAR type, not the
     BAR value (Niklas Cassel)

   - Add DT binding missing maxItems to ep-gpios (Krzysztof Kozlowski)

   - Set the Subsystem Vendor ID, which was previously zero because it
     was masked incorrectly (Rick Wertenbroek)

  Synopsys DesignWare PCIe controller driver:

   - Restructure DBI register access to accommodate devices where this
     requires Refclk to be active (Manivannan Sadhasivam)

   - Remove the deinit() callback, which was only need by the
     pcie-rcar-gen4, and do it directly in that driver (Manivannan
     Sadhasivam)

   - Add dw_pcie_ep_cleanup() so drivers that support PERST# can clean
     up things like eDMA (Manivannan Sadhasivam)

   - Rename dw_pcie_ep_exit() to dw_pcie_ep_deinit() to make it parallel
     to dw_pcie_ep_init() (Manivannan Sadhasivam)

   - Rename dw_pcie_ep_init_complete() to dw_pcie_ep_init_registers() to
     reflect the actual functionality (Manivannan Sadhasivam)

   - Call dw_pcie_ep_init_registers() directly from all the glue
     drivers, not just those that require active Refclk from the host
     (Manivannan Sadhasivam)

   - Remove the "core_init_notifier" flag, which was an obscure way for
     glue drivers to indicate that they depend on Refclk from the host
     (Manivannan Sadhasivam)

  TI J721E PCIe driver:

   - Add DT binding J784S4 SoC Device ID (Siddharth Vadapalli)

   - Add DT binding J722S SoC support (Siddharth Vadapalli)

  TI Keystone PCIe controller driver:

   - Add DT binding missing num-viewport, phys and phy-name properties
     (Jan Kiszka)

  Miscellaneous:

   - Constify and annotate with __ro_after_init (Heiner Kallweit)

   - Convert DT bindings to YAML (Krzysztof Kozlowski)

   - Check for kcalloc() failure in of_pci_prop_intr_map() (Duoming
     Zhou)"

* tag 'pci-v6.10-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci: (97 commits)
  PCI: Do not wait for disconnected devices when resuming
  x86/pci: Skip early E820 check for ECAM region
  PCI: Remove unused pci_enable_device_io()
  ata: pata_cs5520: Remove unnecessary call to pci_enable_device_io()
  PCI: Update pci_find_capability() stub return types
  PCI: Remove PCI_IRQ_LEGACY
  scsi: vmw_pvscsi: Do not use PCI_IRQ_LEGACY instead of PCI_IRQ_LEGACY
  scsi: pmcraid: Use PCI_IRQ_INTX instead of PCI_IRQ_LEGACY
  scsi: mpt3sas: Use PCI_IRQ_INTX instead of PCI_IRQ_LEGACY
  scsi: megaraid_sas: Use PCI_IRQ_INTX instead of PCI_IRQ_LEGACY
  scsi: ipr: Use PCI_IRQ_INTX instead of PCI_IRQ_LEGACY
  scsi: hpsa: Use PCI_IRQ_INTX instead of PCI_IRQ_LEGACY
  scsi: arcmsr: Use PCI_IRQ_INTX instead of PCI_IRQ_LEGACY
  wifi: rtw89: Use PCI_IRQ_INTX instead of PCI_IRQ_LEGACY
  dt-bindings: PCI: rockchip,rk3399-pcie: Add missing maxItems to ep-gpios
  Revert "genirq/msi: Provide constants for PCI/IMS support"
  Revert "x86/apic/msi: Enable PCI/IMS"
  Revert "iommu/vt-d: Enable PCI/IMS"
  Revert "iommu/amd: Enable PCI/IMS"
  Revert "PCI/MSI: Provide IMS (Interrupt Message Store) support"
  ...
2024-05-21 10:09:28 -07:00
Linus Torvalds
25f4874662 RDMA v6.10 merge window
Normal set of driver updates and small fixes:
 
 - Small improvements and fixes for erdma, efa, hfi1, bnxt_re
 
 - Fix a UAF crash after module unload on leaking restrack entry
 
 - Continue adding full RDMA support in mana with support for EQs, GID's
   and CQs
 
 - Improvements to the mkey cache in mlx5
 
 - DSCP traffic class support in hns and several bug fixes
 
 - Cap the maximum number of MADs in the receive queue to avoid OOM
 
 - Another batch of rxe bug fixes from large scale testing
 
 - __iowrite64_copy() optimizations for write combining MMIO memory
 
 - Remove NULL checks before dev_put/hold()
 
 - EFA support for receive with immediate
 
 - Fix a recent memleaking regression in a cma error path
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQRRRCHOFoQz/8F5bUaFwuHvBreFYQUCZkeo2gAKCRCFwuHvBreF
 YbuNAQChzGmS4F0JAn5Wj0CDvkZghELqtvzEb92SzqcgdyQafAD/fC7f23LJ4OsO
 1ZIaQEZu7j9DVg5PKFZ7WfdXjGTKqwA=
 =QRXg
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma

Pull rdma updates from Jason Gunthorpe:
 "Aside from the usual things this has an arch update for
  __iowrite64_copy() used by the RDMA drivers.

  This API was intended to generate large 64 byte MemWr TLPs on PCI.
  These days most processors had done this by just repeating writel() in
  a loop. S390 and some new ARM64 designs require a special helper to
  get this to generate.

   - Small improvements and fixes for erdma, efa, hfi1, bnxt_re

   - Fix a UAF crash after module unload on leaking restrack entry

   - Continue adding full RDMA support in mana with support for EQs,
     GID's and CQs

   - Improvements to the mkey cache in mlx5

   - DSCP traffic class support in hns and several bug fixes

   - Cap the maximum number of MADs in the receive queue to avoid OOM

   - Another batch of rxe bug fixes from large scale testing

   - __iowrite64_copy() optimizations for write combining MMIO memory

   - Remove NULL checks before dev_put/hold()

   - EFA support for receive with immediate

   - Fix a recent memleaking regression in a cma error path"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (70 commits)
  RDMA/cma: Fix kmemleak in rdma_core observed during blktests nvme/rdma use siw
  RDMA/IPoIB: Fix format truncation compilation errors
  bnxt_re: avoid shift undefined behavior in bnxt_qplib_alloc_init_hwq
  RDMA/efa: Support QP with unsolicited write w/ imm. receive
  IB/hfi1: Remove generic .ndo_get_stats64
  IB/hfi1: Do not use custom stat allocator
  RDMA/hfi1: Use RMW accessors for changing LNKCTL2
  RDMA/mana_ib: implement uapi for creation of rnic cq
  RDMA/mana_ib: boundary check before installing cq callbacks
  RDMA/mana_ib: introduce a helper to remove cq callbacks
  RDMA/mana_ib: create and destroy RNIC cqs
  RDMA/mana_ib: create EQs for RNIC CQs
  RDMA/core: Remove NULL check before dev_{put, hold}
  RDMA/ipoib: Remove NULL check before dev_{put, hold}
  RDMA/mlx5: Remove NULL check before dev_{put, hold}
  RDMA/mlx5: Track DCT, DCI and REG_UMR QPs as diver_detail resources.
  RDMA/core: Add an option to display driver-specific QPs in the rdmatool
  RDMA/efa: Add shutdown notifier
  RDMA/mana_ib: Fix missing ret value
  IB/mlx5: Use __iowrite64_copy() for write combining stores
  ...
2024-05-18 13:04:15 -07:00
Zhu Yanjun
9c0731832d RDMA/cma: Fix kmemleak in rdma_core observed during blktests nvme/rdma use siw
When running blktests nvme/rdma, the following kmemleak issue will appear.

kmemleak: Kernel memory leak detector initialized (mempool available:36041)
kmemleak: Automatic memory scanning thread started
kmemleak: 2 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
kmemleak: 8 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
kmemleak: 17 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
kmemleak: 4 new suspected memory leaks (see /sys/kernel/debug/kmemleak)

unreferenced object 0xffff88855da53400 (size 192):
  comm "rdma", pid 10630, jiffies 4296575922
  hex dump (first 32 bytes):
    37 00 00 00 00 00 00 00 c0 ff ff ff 1f 00 00 00  7...............
    10 34 a5 5d 85 88 ff ff 10 34 a5 5d 85 88 ff ff  .4.].....4.]....
  backtrace (crc 47f66721):
    [<ffffffff911251bd>] kmalloc_trace+0x30d/0x3b0
    [<ffffffffc2640ff7>] alloc_gid_entry+0x47/0x380 [ib_core]
    [<ffffffffc2642206>] add_modify_gid+0x166/0x930 [ib_core]
    [<ffffffffc2643468>] ib_cache_update.part.0+0x6d8/0x910 [ib_core]
    [<ffffffffc2644e1a>] ib_cache_setup_one+0x24a/0x350 [ib_core]
    [<ffffffffc263949e>] ib_register_device+0x9e/0x3a0 [ib_core]
    [<ffffffffc2a3d389>] 0xffffffffc2a3d389
    [<ffffffffc2688cd8>] nldev_newlink+0x2b8/0x520 [ib_core]
    [<ffffffffc2645fe3>] rdma_nl_rcv_msg+0x2c3/0x520 [ib_core]
    [<ffffffffc264648c>]
rdma_nl_rcv_skb.constprop.0.isra.0+0x23c/0x3a0 [ib_core]
    [<ffffffff9270e7b5>] netlink_unicast+0x445/0x710
    [<ffffffff9270f1f1>] netlink_sendmsg+0x761/0xc40
    [<ffffffff9249db29>] __sys_sendto+0x3a9/0x420
    [<ffffffff9249dc8c>] __x64_sys_sendto+0xdc/0x1b0
    [<ffffffff92db0ad3>] do_syscall_64+0x93/0x180
    [<ffffffff92e00126>] entry_SYSCALL_64_after_hwframe+0x71/0x79

The root cause: rdma_put_gid_attr is not called when sgid_attr is set
to ERR_PTR(-ENODEV).

Reported-and-tested-by: Yi Zhang <yi.zhang@redhat.com>
Closes: https://lore.kernel.org/all/19bf5745-1b3b-4b8a-81c2-20d945943aaf@linux.dev/T/
Fixes: f8ef1be816 ("RDMA/cma: Avoid GID lookups on iWARP devices")
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Link: https://lore.kernel.org/r/20240510211247.31345-1-yanjun.zhu@linux.dev
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-05-12 13:04:11 +03:00
Leon Romanovsky
49ca2b2ef3 RDMA/IPoIB: Fix format truncation compilation errors
Truncate the device name to store IPoIB VLAN name.

[leonro@5b4e8fba4ddd kernel]$ make -s -j 20 allmodconfig
[leonro@5b4e8fba4ddd kernel]$ make -s -j 20 W=1 drivers/infiniband/ulp/ipoib/
drivers/infiniband/ulp/ipoib/ipoib_vlan.c: In function ‘ipoib_vlan_add’:
drivers/infiniband/ulp/ipoib/ipoib_vlan.c:187:52: error: ‘%04x’
directive output may be truncated writing 4 bytes into a region of size
between 0 and 15 [-Werror=format-truncation=]
  187 |         snprintf(intf_name, sizeof(intf_name), "%s.%04x",
      |                                                    ^~~~
drivers/infiniband/ulp/ipoib/ipoib_vlan.c:187:48: note: directive
argument in the range [0, 65535]
  187 |         snprintf(intf_name, sizeof(intf_name), "%s.%04x",
      |                                                ^~~~~~~~~
drivers/infiniband/ulp/ipoib/ipoib_vlan.c:187:9: note: ‘snprintf’ output
between 6 and 21 bytes into a destination of size 16
  187 |         snprintf(intf_name, sizeof(intf_name), "%s.%04x",
      |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  188 |                  ppriv->dev->name, pkey);
      |                  ~~~~~~~~~~~~~~~~~~~~~~~
cc1: all warnings being treated as errors
make[6]: *** [scripts/Makefile.build:244: drivers/infiniband/ulp/ipoib/ipoib_vlan.o] Error 1
make[6]: *** Waiting for unfinished jobs....

Fixes: 9baa0b0364 ("IB/ipoib: Add rtnl_link_ops support")
Link: https://lore.kernel.org/r/e9d3e1fef69df4c9beaf402cc3ac342bad680791.1715240029.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
2024-05-12 12:36:46 +03:00
Jakub Kicinski
e7073830cc Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

No conflicts.

Adjacent changes:

drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
  35d92abfba ("net: hns3: fix kernel crash when devlink reload during initialization")
  2a1a1a7b5f ("net: hns3: add command queue trace for hns3")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-09 10:01:01 -07:00
Linus Torvalds
1bbc991585 qibfs leak fix
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZjwicQAKCRBZ7Krx/gZQ
 67WgAP4iDssAVApazVvOUecQBtOd6xBvQUf1k3acMKa03hcdwAEAmM8qN0MpYUkz
 MyVA6oRvC8selaWh62eXamWBgjc1Rwo=
 =jLRy
 -----END PGP SIGNATURE-----

Merge tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull dentry leak fix from Al Viro:
 "Dentry leak fix in the qibfs driver that I forgot to send a pull
  request for ;-/

  My apologies - it actually sat in vfs.git#fixes for more than two
  months..."

* tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  qibfs: fix dentry leak
2024-05-09 08:39:10 -07:00
Michal Schmidt
78cfd17142 bnxt_re: avoid shift undefined behavior in bnxt_qplib_alloc_init_hwq
Undefined behavior is triggered when bnxt_qplib_alloc_init_hwq is called
with hwq_attr->aux_depth != 0 and hwq_attr->aux_stride == 0.
In that case, "roundup_pow_of_two(hwq_attr->aux_stride)" gets called.
roundup_pow_of_two is documented as undefined for 0.

Fix it in the one caller that had this combination.

The undefined behavior was detected by UBSAN:
  UBSAN: shift-out-of-bounds in ./include/linux/log2.h:57:13
  shift exponent 64 is too large for 64-bit type 'long unsigned int'
  CPU: 24 PID: 1075 Comm: (udev-worker) Not tainted 6.9.0-rc6+ #4
  Hardware name: Abacus electric, s.r.o. - servis@abacus.cz Super Server/H12SSW-iN, BIOS 2.7 10/25/2023
  Call Trace:
   <TASK>
   dump_stack_lvl+0x5d/0x80
   ubsan_epilogue+0x5/0x30
   __ubsan_handle_shift_out_of_bounds.cold+0x61/0xec
   __roundup_pow_of_two+0x25/0x35 [bnxt_re]
   bnxt_qplib_alloc_init_hwq+0xa1/0x470 [bnxt_re]
   bnxt_qplib_create_qp+0x19e/0x840 [bnxt_re]
   bnxt_re_create_qp+0x9b1/0xcd0 [bnxt_re]
   ? srso_alias_return_thunk+0x5/0xfbef5
   ? srso_alias_return_thunk+0x5/0xfbef5
   ? __kmalloc+0x1b6/0x4f0
   ? create_qp.part.0+0x128/0x1c0 [ib_core]
   ? __pfx_bnxt_re_create_qp+0x10/0x10 [bnxt_re]
   create_qp.part.0+0x128/0x1c0 [ib_core]
   ib_create_qp_kernel+0x50/0xd0 [ib_core]
   create_mad_qp+0x8e/0xe0 [ib_core]
   ? __pfx_qp_event_handler+0x10/0x10 [ib_core]
   ib_mad_init_device+0x2be/0x680 [ib_core]
   add_client_context+0x10d/0x1a0 [ib_core]
   enable_device_and_get+0xe0/0x1d0 [ib_core]
   ib_register_device+0x53c/0x630 [ib_core]
   ? srso_alias_return_thunk+0x5/0xfbef5
   bnxt_re_probe+0xbd8/0xe50 [bnxt_re]
   ? __pfx_bnxt_re_probe+0x10/0x10 [bnxt_re]
   auxiliary_bus_probe+0x49/0x80
   ? driver_sysfs_add+0x57/0xc0
   really_probe+0xde/0x340
   ? pm_runtime_barrier+0x54/0x90
   ? __pfx___driver_attach+0x10/0x10
   __driver_probe_device+0x78/0x110
   driver_probe_device+0x1f/0xa0
   __driver_attach+0xba/0x1c0
   bus_for_each_dev+0x8f/0xe0
   bus_add_driver+0x146/0x220
   driver_register+0x72/0xd0
   __auxiliary_driver_register+0x6e/0xd0
   ? __pfx_bnxt_re_mod_init+0x10/0x10 [bnxt_re]
   bnxt_re_mod_init+0x3e/0xff0 [bnxt_re]
   ? __pfx_bnxt_re_mod_init+0x10/0x10 [bnxt_re]
   do_one_initcall+0x5b/0x310
   do_init_module+0x90/0x250
   init_module_from_file+0x86/0xc0
   idempotent_init_module+0x121/0x2b0
   __x64_sys_finit_module+0x5e/0xb0
   do_syscall_64+0x82/0x160
   ? srso_alias_return_thunk+0x5/0xfbef5
   ? syscall_exit_to_user_mode_prepare+0x149/0x170
   ? srso_alias_return_thunk+0x5/0xfbef5
   ? syscall_exit_to_user_mode+0x75/0x230
   ? srso_alias_return_thunk+0x5/0xfbef5
   ? do_syscall_64+0x8e/0x160
   ? srso_alias_return_thunk+0x5/0xfbef5
   ? __count_memcg_events+0x69/0x100
   ? srso_alias_return_thunk+0x5/0xfbef5
   ? count_memcg_events.constprop.0+0x1a/0x30
   ? srso_alias_return_thunk+0x5/0xfbef5
   ? handle_mm_fault+0x1f0/0x300
   ? srso_alias_return_thunk+0x5/0xfbef5
   ? do_user_addr_fault+0x34e/0x640
   ? srso_alias_return_thunk+0x5/0xfbef5
   ? srso_alias_return_thunk+0x5/0xfbef5
   entry_SYSCALL_64_after_hwframe+0x76/0x7e
  RIP: 0033:0x7f4e5132821d
  Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d e3 db 0c 00 f7 d8 64 89 01 48
  RSP: 002b:00007ffca9c906a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
  RAX: ffffffffffffffda RBX: 0000563ec8a8f130 RCX: 00007f4e5132821d
  RDX: 0000000000000000 RSI: 00007f4e518fa07d RDI: 000000000000003b
  RBP: 00007ffca9c90760 R08: 00007f4e513f6b20 R09: 00007ffca9c906f0
  R10: 0000563ec8a8faa0 R11: 0000000000000246 R12: 00007f4e518fa07d
  R13: 0000000000020000 R14: 0000563ec8409e90 R15: 0000563ec8a8fa60
   </TASK>
  ---[ end trace ]---

Fixes: 0c4dcd6028 ("RDMA/bnxt_re: Refactor hardware queue memory allocation")
Signed-off-by: Michal Schmidt <mschmidt@redhat.com>
Link: https://lore.kernel.org/r/20240507103929.30003-1-mschmidt@redhat.com
Acked-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-05-09 11:59:31 +03:00
Michael Margolin
2b8af5001a RDMA/efa: Support QP with unsolicited write w/ imm. receive
Add a new EFA flags attribute for QP creation, and support unsolicited
write with immediate flag. QPs created with this flag set will not
consume receive work requests for incoming RDMA write with immediate.
Expose device capability bit for this feature support.

Reviewed-by: Daniel Kranzdorf <dkkranzd@amazon.com>
Reviewed-by: Firas Jahjah <firasj@amazon.com>
Signed-off-by: Michael Margolin <mrgolin@amazon.com>
Link: https://lore.kernel.org/r/20240506151829.6475-1-mrgolin@amazon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-05-08 11:18:26 +03:00
Eric Dumazet
1eb2cded45 net: annotate writes on dev->mtu from ndo_change_mtu()
Simon reported that ndo_change_mtu() methods were never
updated to use WRITE_ONCE(dev->mtu, new_mtu) as hinted
in commit 501a90c945 ("inet: protect against too small
mtu values.")

We read dev->mtu without holding RTNL in many places,
with READ_ONCE() annotations.

It is time to take care of ndo_change_mtu() methods
to use corresponding WRITE_ONCE()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Simon Horman <horms@kernel.org>
Closes: https://lore.kernel.org/netdev/20240505144608.GB67882@kernel.org/
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Acked-by: Shannon Nelson <shannon.nelson@amd.com>
Link: https://lore.kernel.org/r/20240506102812.3025432-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-07 16:19:14 -07:00
Breno Leitao
f483f6a29d IB/hfi1: Remove generic .ndo_get_stats64
Commit 3e2f544dd8 ("net: get stats64 if device if driver is
configured") moved the callback to dev_get_tstats64() to net core, so,
unless the driver is doing some custom stats collection, it does not
need to set .ndo_get_stats64.

Since this driver is now relying in NETDEV_PCPU_STAT_TSTATS, then, it
doesn't need to set the dev_get_tstats64() generic .ndo_get_stats64
function pointer.

Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://lore.kernel.org/r/20240503111333.552360-2-leitao@debian.org
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-05-05 17:10:38 +03:00
Breno Leitao
5194947e6a IB/hfi1: Do not use custom stat allocator
With commit 34d21de99c ("net: Move {l,t,d}stats allocation to core and
convert veth & vrf"), stats allocation could be done on net core
instead of in this driver.

With this new approach, the driver doesn't have to bother with error
handling (allocation failure checking, making sure free happens in the
right spot, etc). This is core responsibility now.

Remove the allocation in the hfi1 driver and leverage the network
core allocation instead.

Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://lore.kernel.org/r/20240503111333.552360-1-leitao@debian.org
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-05-05 17:10:38 +03:00
Ilpo Järvinen
8f3b7103b4 RDMA/hfi1: Use RMW accessors for changing LNKCTL2
Convert open coded RMW accesses for LNKCTL2 to use
pcie_capability_clear_and_set_word() which makes its easier to
understand what the code tries to do.

In addition, this futureproofs the code. LNKCTL2 is not really owned by
any driver because it is a collection of control bits that PCI core
might need to touch. RMW accessors already have support for proper
locking for a selected set of registers to avoid losing concurrent
updates (LNKCTL2 is not yet among the registers that need protection
but likely will be in the future).

Suggested-by: Lukas Wunner <lukas@wunner.de>
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Link: https://lore.kernel.org/r/20240503133640.15899-1-ilpo.jarvinen@linux.intel.com
Reviewed-by: Dean Luick <dean.luick@cornelisnetworks.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-05-05 16:11:40 +03:00
Konstantin Taranov
44b607ad4c RDMA/mana_ib: implement uapi for creation of rnic cq
Enable users to create RNIC CQs using a corresponding flag.
With the previous request size, an ethernet CQ is created.
As a response, return ID of the created CQ.

Signed-off-by: Konstantin Taranov <kotaranov@microsoft.com>
Link: https://lore.kernel.org/r/1714137160-5222-6-git-send-email-kotaranov@linux.microsoft.com
Reviewed-by: Long Li <longli@microsoft.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-05-05 15:27:23 +03:00
Konstantin Taranov
f79edef79b RDMA/mana_ib: boundary check before installing cq callbacks
Add a boundary check inside mana_ib_install_cq_cb to prevent index overflow.

Fixes: 2a31c5a7e0 ("RDMA/mana_ib: Introduce mana_ib_install_cq_cb helper function")
Signed-off-by: Konstantin Taranov <kotaranov@microsoft.com>
Link: https://lore.kernel.org/r/1714137160-5222-5-git-send-email-kotaranov@linux.microsoft.com
Reviewed-by: Long Li <longli@microsoft.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-05-05 15:27:23 +03:00
Konstantin Taranov
3e41105263 RDMA/mana_ib: introduce a helper to remove cq callbacks
Intoduce the mana_ib_remove_cq_cb helper to remove cq callbacks.
The helper removes code duplicates.

Signed-off-by: Konstantin Taranov <kotaranov@microsoft.com>
Link: https://lore.kernel.org/r/1714137160-5222-4-git-send-email-kotaranov@linux.microsoft.com
Reviewed-by: Long Li <longli@microsoft.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-05-05 15:27:23 +03:00
Konstantin Taranov
5843415916 RDMA/mana_ib: create and destroy RNIC cqs
Implement RNIC requests for creation and destruction of RNIC CQs.

Signed-off-by: Konstantin Taranov <kotaranov@microsoft.com>
Link: https://lore.kernel.org/r/1714137160-5222-3-git-send-email-kotaranov@linux.microsoft.com
Reviewed-by: Long Li <longli@microsoft.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-05-05 15:27:23 +03:00
Konstantin Taranov
e73c882f0a RDMA/mana_ib: create EQs for RNIC CQs
Create EQs within mana_ib device. Such EQs are required
for creation of RNIC CQs.

Signed-off-by: Konstantin Taranov <kotaranov@microsoft.com>
Link: https://lore.kernel.org/r/1714137160-5222-2-git-send-email-kotaranov@linux.microsoft.com
Reviewed-by: Long Li <longli@microsoft.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-05-05 15:27:23 +03:00
Jules Irenge
48d80b4844 RDMA/core: Remove NULL check before dev_{put, hold}
Coccinelle reports a warning

WARNING: NULL check before dev_{put, hold} functions is not needed

The reason is the call netdev_{put, hold} of dev_{put,hold} will check NULL
There is no need to check before using dev_{put, hold}

Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
Link: https://lore.kernel.org/r/ZjF1Eedxwhn4JSkz@octinomon.home
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-05-05 15:12:35 +03:00
Lukas Wunner
7b53e926bf IB/qib: Use device_show_string() helper for sysfs attributes
Deduplicate sysfs ->show() callbacks which expose a string at a static
memory location.  Use the newly introduced device_show_string() helper
in the driver core instead by declaring those sysfs attributes with
DEVICE_STRING_ATTR_RO().

No functional change intended.

Signed-off-by: Lukas Wunner <lukas@wunner.de>
Link: https://lore.kernel.org/r/185a88dd3e6e18bf508a875d87c95ea2a5c3fa13.1713608122.git.lukas@wunner.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-05-04 17:37:03 +02:00
Breno Leitao
1c8f43f477 IB/hfi1: allocate dummy net_device dynamically
Embedding net_device into structures prohibits the usage of flexible
arrays in the net_device structure. For more details, see the discussion
at [1].

Un-embed the net_device from struct hfi1_netdev_rx by converting it
into a pointer. Then use the leverage alloc_netdev() to allocate the
net_device object at hfi1_alloc_rx().

[1] https://lore.kernel.org/all/20240229225910.79e224cf@kernel.org/

Acked-by: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>
Signed-off-by: Breno Leitao <leitao@debian.org>
Acked-by: Leon Romanovsky <leon@kernel.org>
Link: https://lore.kernel.org/r/20240430162213.746492-1-leitao@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-02 18:20:17 -07:00
Jules Irenge
e4e40a8702 RDMA/ipoib: Remove NULL check before dev_{put, hold}
Coccinelle reports a warning

WARNING: NULL check before dev_{put, hold} functions is not needed

The reason is the call netdev_{put, hold} of dev_{put,hold} will check NULL
There is no need to check before using dev_{put, hold}

Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
Link: https://lore.kernel.org/r/ZjGDFatHRMI6Eg7M@octinomon.home
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-05-02 17:46:38 +03:00
Jules Irenge
82e966130d RDMA/mlx5: Remove NULL check before dev_{put, hold}
Coccinelle reports a warning

WARNING: NULL check before dev_{put, hold} functions is not needed

The reason is the call netdev_{put, hold} of dev_{put,hold} will check NULL
There is no need to check before using dev_{put, hold}

Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
Link: https://lore.kernel.org/r/ZjGC4qXrOwZE0aHi@octinomon.home
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-05-02 17:45:59 +03:00
Eric Dumazet
05d6d49209 inet: introduce dst_rtable() helper
I added dst_rt6_info() in commit
e8dfd42c17 ("ipv6: introduce dst_rt6_info() helper")

This patch does a similar change for IPv4.

Instead of (struct rtable *)dst casts, we can use :

 #define dst_rtable(_ptr) \
             container_of_const(_ptr, struct rtable, dst)

Patch is smaller than IPv6 one, because IPv4 has skb_rtable() helper.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://lore.kernel.org/r/20240429133009.1227754-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-30 18:32:38 -07:00
Chiara Meiohas
fd3af5e218 RDMA/mlx5: Track DCT, DCI and REG_UMR QPs as diver_detail resources.
Allow user to see driver-specific QPs (the "driver_detail" QPs)
through the rdmatool, when requested.

When creating DCT, DCI and REG_UMR QPs, we designate them as driver_detail
resources.

When filling the QP info for the rdma tool, for the driver_detail QPs:
-the QP type is IB_QPT_DRIVER
-the subtype is a string with the QP name ("DCT", "DCI", "REG_UMR")

Signed-off-by: Chiara Meiohas <cmeiohas@nvidia.com>
Link: https://lore.kernel.org/r/452432d7d0917f053a80a893a614169857fe3b10.1713268997.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-04-30 11:19:37 +03:00
Chiara Meiohas
e18fa0bbce RDMA/core: Add an option to display driver-specific QPs in the rdmatool
Utilize the -dd flag (driver-specific details) in the rdmatool
to view driver-specific QPs which are not exposed yet.

Add the netlink attribute to mark request to convey driver details and
use it to return QP subtype as a string.

$ rdma resource show qp link ibp8s0f1
link ibp8s0f1/1 lqpn 360 type UD state RTS sq-psn 0 comm [mlx5_ib]
link ibp8s0f1/1 lqpn 0 type SMI state RTS sq-psn 0 comm [ib_core]
link ibp8s0f1/1 lqpn 1 type GSI state RTS sq-psn 0 comm [ib_core]

$ rdma resource show qp link ibp8s0f1 -dd
link ibp8s0f1/1 lqpn 360 type UD state RTS sq-psn 0 comm [mlx5_ib]
link ibp8s0f1/1 lqpn 465 type DRIVER subtype REG_UMR state RTS sq-psn 0 comm [mlx5_ib]
link ibp8s0f1/1 lqpn 0 type SMI state RTS sq-psn 0 comm [ib_core]
link ibp8s0f1/1 lqpn 1 type GSI state RTS sq-psn 0 comm [ib_core]

$ rdma resource show
0: ibp8s0f0: pd 3 cq 4 qp 3 cm_id 0 mr 0 ctx 0 srq 2
1: ibp8s0f1: pd 3 cq 4 qp 3 cm_id 0 mr 0 ctx 0 srq 2

$ rdma resource show -dd
0: ibp8s0f0: pd 3 cq 4 qp 4 cm_id 0 mr 0 ctx 0 srq 2
1: ibp8s0f1: pd 3 cq 4 qp 4 cm_id 0 mr 0 ctx 0 srq 2

Signed-off-by: Chiara Meiohas <cmeiohas@nvidia.com>
Link: https://lore.kernel.org/r/2607bb3ddec3cae3443c2ea19e9f700825d20a98.1713268997.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-04-30 11:19:29 +03:00
Eric Dumazet
e8dfd42c17 ipv6: introduce dst_rt6_info() helper
Instead of (struct rt6_info *)dst casts, we can use :

 #define dst_rt6_info(_ptr) \
         container_of_const(_ptr, struct rt6_info, dst)

Some places needed missing const qualifiers :

ip6_confirm_neigh(), ipv6_anycast_destination(),
ipv6_unicast_destination(), has_gateway()

v2: added missing parts (David Ahern)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-29 13:32:01 +01:00
Michael Margolin
f847e84015 RDMA/efa: Add shutdown notifier
Add driver function to stop the device and release any active IRQs as
preparation for shutdown. This should fix issues caused by unexpected AQ
interrupts when booting kernel using kexec and possible data integrity
issues when the system is being shutdown during traffic.

Link: https://lore.kernel.org/r/20240425171814.25216-1-mrgolin@amazon.com
Reviewed-by: Firas Jahjah <firasj@amazon.com>
Reviewed-by: Yonatan Nachum <ynachum@amazon.com>
Signed-off-by: Michael Margolin <mrgolin@amazon.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2024-04-29 09:02:58 -03:00
Jakub Kicinski
2bd87951de Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

Conflicts:

drivers/net/ethernet/ti/icssg/icssg_prueth.c

net/mac80211/chan.c
  89884459a0 ("wifi: mac80211: fix idle calculation with multi-link")
  87f5500285 ("wifi: mac80211: simplify ieee80211_assign_link_chanctx()")
https://lore.kernel.org/all/20240422105623.7b1fbda2@canb.auug.org.au/

net/unix/garbage.c
  1971d13ffa ("af_unix: Suppress false-positive lockdep splat for spin_lock() in __unix_gc().")
  4090fa373f ("af_unix: Replace garbage collection algorithm.")

drivers/net/ethernet/ti/icssg/icssg_prueth.c
drivers/net/ethernet/ti/icssg/icssg_common.c
  4dcd0e83ea ("net: ti: icssg-prueth: Fix signedness bug in prueth_init_rx_chns()")
  e2dc7bfd67 ("net: ti: icssg-prueth: Move common functions into a separate file")

No adjacent changes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-25 12:41:37 -07:00
Damien Le Moal
d6f0bbcd2f RDMA/vmw_pvrdma: Use PCI_IRQ_INTX instead of PCI_IRQ_LEGACY
Use the macro PCI_IRQ_INTX instead of the deprecated PCI_IRQ_LEGACY macro.

Link: https://lore.kernel.org/r/20240325070944.3600338-13-dlemoal@kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2024-04-25 12:53:31 -05:00
Damien Le Moal
91d343de20 IB/qib: Use PCI_IRQ_INTX instead of PCI_IRQ_LEGACY
Use the macro PCI_IRQ_INTX instead of the deprecated PCI_IRQ_LEGACY macro.

Link: https://lore.kernel.org/r/20240325070944.3600338-12-dlemoal@kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>
2024-04-25 12:53:31 -05:00
Konstantin Taranov
f88320b698 RDMA/mana_ib: Fix missing ret value
Set ret to -ENODEV when netdev_master_upper_dev_get_rcu returns NULL.

Fixes: 8b184e4f1c ("RDMA/mana_ib: Enable RoCE on port 1")
Link: https://lore.kernel.org/r/1713881751-21621-1-git-send-email-kotaranov@linux.microsoft.com
Signed-off-by: Konstantin Taranov <kotaranov@microsoft.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2024-04-23 12:02:44 -03:00
Jason Gunthorpe
ef302283dd IB/mlx5: Use __iowrite64_copy() for write combining stores
mlx5 has a built in self-test at driver startup to evaluate if the
platform supports write combining to generate a 64 byte PCIe TLP or
not. This has proven necessary because a lot of common scenarios end up
with broken write combining (especially inside virtual machines) and there
is other way to learn this information.

This self test has been consistently failing on new ARM64 CPU
designs (specifically with NVIDIA Grace's implementation of Neoverse
V2). The C loop around writeq() generates some pretty terrible ARM64
assembly, but historically this has worked on a lot of existing ARM64 CPUs
till now.

We see it succeed about 1 time in 10,000 on the worst effected
systems. The CPU architects speculate that the load instructions
interspersed with the stores makes the WC buffers statistically flush too
often and thus the generation of large TLPs becomes infrequent. This makes
the boot up test unreliable in that it indicates no write-combining,
however userspace would be fine since it uses a ST4 instruction.

Further, S390 has similar issues where only the special zpci_memcpy_toio()
will actually generate large TLPs, and the open coded loop does not
trigger it at all.

Fix both ARM64 and S390 by switching to __iowrite64_copy() which now
provides architecture specific variants that have a high change of
generating a large TLP with write combining. x86 continues to use a
similar writeq loop in the generate __iowrite64_copy().

Fixes: 11f552e217 ("IB/mlx5: Test write combining support")
Link: https://lore.kernel.org/r/6-v3-1893cd8b9369+1925-mlx5_arm_wc_jgg@nvidia.com
Tested-by: Niklas Schnelle <schnelle@linux.ibm.com>
Acked-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2024-04-22 17:11:20 -03:00
Bob Pearson
1a633bdc8f RDMA/rxe: Let destroy qp succeed with stuck packet
In some situations a sent packet may get queued in the NIC longer than
than timeout of a ULP. Currently if this happens the ULP may try to reset
the link by destroying the qp and setting up an alternate connection but
will fail because the rxe driver is waiting for the packet to finish
getting sent and be returned to the skb destructor function where the qp
reference holding things up will be dropped. This patch modifies the way
that the qp is passed to the destructor to pass the qp index and not a qp
pointer.  Then the destructor will attempt to lookup the qp from its index
and if it fails exit early. This requires taking a reference on the struct
sock rather than the qp allowing the qp to be destroyed while the sk is
still around waiting for the packet to finish.

Link: https://lore.kernel.org/r/20240329145513.35381-15-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2024-04-22 16:55:57 -03:00
Bob Pearson
9cc6290991 RDMA/rxe: Get rid of pkt resend on err
Currently the rxe_driver detects packet drops by ip_local_out() which
occur before the packet is sent on the wire and attempts to resend
them. This is redundant with the usual retry mechanism which covers
packets that get dropped in transit to or from the remote node.

The way this is implemented is not robust since it sets need_req_skb and
waits for the number of local skbs outstanding for this qp to drop below a
low water mark. This is racy since the skb may be sent to the destructor
before the requester can set the need_req_skb flag. This will cause a
deadlock in the send path for that qp.

This patch removes this mechanism since the normal retry path will correct
the error and resend the packet and it makes no difference if the packet
is dropped locally or later.

Link: https://lore.kernel.org/r/20240329145513.35381-14-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2024-04-22 16:54:34 -03:00
Bob Pearson
55bec1c440 RDMA/rxe: Make rxe_loopback match rxe_send behavior
The rxe send path currently counts the number of skbs outstanding between
the rxe driver and the ethernet driver to prevent too many packets to
accumulate waiting to send. This patch makes the local loopback path
behave the same way. The loopback path forwards the packets to the receive
path which will eventually call kfree_skb on all packets and drop the qp
references. This makes the loopback path more useful for software testing.

Link: https://lore.kernel.org/r/20240329145513.35381-13-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2024-04-22 16:54:33 -03:00
Bob Pearson
8776618dbb RDMA/rxe: Fix incorrect rxe_put in error path
In rxe_send() a ref is taken on the qp to keep it alive until the
kfree_skb() has a chance to call the skb destructor rxe_skb_tx_dtor()
which drops the reference. If the packet has an incorrect protocol the
error path just calls kfree_skb() which will call the destructor which
will drop the ref. Currently the driver also calls rxe_put() which is
incorrect. Additionally since the packets sent to rxe_send() are under the
control of the driver and it only ever produces IPV4 or IPV6 packets the
simplest fix is to remove all the code in this block.

Link: https://lore.kernel.org/r/20240329145513.35381-12-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Fixes: 9eb7f8e44d ("IB/rxe: Move refcounting earlier in rxe_send()")
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2024-04-22 16:54:33 -03:00
Bob Pearson
23bc06af54 RDMA/rxe: Don't call direct between tasks
Replace calls to rxe_run_task() with rxe_sched_task().  This prevents the
tasks from all running on the same cpu.

This change slightly reduces performance for single qp send and write
benchmarks in loopback mode but greatly improves the performance with
multiple qps because if run task is used all the work tends to be
performed on one cpu. For actual on the wire benchmarks there is no
noticeable performance change.

Link: https://lore.kernel.org/r/20240329145513.35381-11-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2024-04-22 16:54:33 -03:00
Bob Pearson
3d807a3ebc RDMA/rxe: Don't call rxe_requester from rxe_completer
Instead of rescheduling rxe_requester from rxe_completer() just extend the
duration of rxe_sender() by one pass. Setting run_requester_again forces
rxe_completer() to return 0 which will cause rxe_sender() to be called at
least one more time.

Link: https://lore.kernel.org/r/20240329145513.35381-10-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2024-04-22 16:54:33 -03:00
Bob Pearson
4891f4fed0 RDMA/rxe: Don't schedule rxe_completer from rxe_requester
Now that rxe_completer() is always called serially after rxe_requester()
there is no reason to schedule rxe_completer() from rxe_requester().

Link: https://lore.kernel.org/r/20240329145513.35381-9-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2024-04-22 16:54:33 -03:00
Bob Pearson
cd8aaddf0d RDMA/rxe: Remove save/rollback_state in rxe_requester
Now that req.task and comp.task are merged it is no longer necessary to
call save_state() before calling rxe_xmit_pkt() and rollback_state() if
rxe_xmit_pkt() fails. This was done originally to prevent races between
rxe_completer() and rxe_requester() which now cannot happen.

Link: https://lore.kernel.org/r/20240329145513.35381-8-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2024-04-22 16:54:33 -03:00
Bob Pearson
67f57892f9 RDMA/rxe: Merge request and complete tasks
Currently the rxe driver has three work queue tasks per qp.  These are the
req.task, comp.task and resp.task which call rxe_requester(),
rxe_completer() and rxe_responder() respectively directly or on work
queues. Each of these subroutines checks to see if there is work to be
performed on the send queue or on the response packet queue or the request
packet queue and will run until there is no work remaining or yield the
cpu and reschedule itself until there is no work remaining.

This commit combines the req.task and comp.task into a single send.task
and renames the resp.task to the recv.task. The combined send.task calls
rxe_requester() and rxe_completer() serially and continues until all work
on both the send queue and the response packet queue are done.

In various benchmarks the performance is either improved or left the
same. At high scale there is a significant reduction in the load on the
cpu.

This is the first step in combining these two tasks. Once they are
serialized cross rescheduling of req.task and comp.task can be more
efficiently handled by just letting the send.task continue to run. This
will be done in the next several patches.

Link: https://lore.kernel.org/r/20240329145513.35381-7-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2024-04-22 16:54:33 -03:00
Bob Pearson
ff30e45376 RDMA/rxe: Remove redundant scheduling of rxe_completer
In rxe_post_send_kernel() if the qp is in the error state after posting
the work requests the rxe_completer() task is scheduled.

But, the only way to move the qp into the error state is to call
rxe_qp_error() which also schedules the rxe_completer() task to drain the
queues. Calling it a second time has no effect. This commit removes the
redundant call.

Link: https://lore.kernel.org/r/20240329145513.35381-6-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2024-04-22 16:54:33 -03:00
Bob Pearson
b703374837 RDMA/rxe: Allow good work requests to be executed
A previous commit incorrectly added an 'if(!err)' before scheduling the
requester task in rxe_post_send_kernel(). But if there were send wrs
successfully added to the send queue before a bad wr they might never get
executed.

This commit fixes this by scheduling the requester task if any wqes were
successfully posted in rxe_post_send_kernel() in rxe_verbs.c.

Link: https://lore.kernel.org/r/20240329145513.35381-5-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Fixes: 5bf944f241 ("RDMA/rxe: Add error messages")
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2024-04-22 16:54:32 -03:00
Bob Pearson
2b23b60973 RDMA/rxe: Fix seg fault in rxe_comp_queue_pkt
In rxe_comp_queue_pkt() an incoming response packet skb is enqueued to the
resp_pkts queue and then a decision is made whether to run the completer
task inline or schedule it. Finally the skb is dereferenced to bump a 'hw'
performance counter. This is wrong because if the completer task is
already running in a separate thread it may have already processed the skb
and freed it which can cause a seg fault.  This has been observed
infrequently in testing at high scale.

This patch fixes this by changing the order of enqueuing the packet until
after the counter is accessed.

Link: https://lore.kernel.org/r/20240329145513.35381-4-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Fixes: 0b1e5b99a4 ("IB/rxe: Add port protocol stats")
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2024-04-22 16:54:32 -03:00
Michael Guralnik
ca0b44e20a IB/core: Implement a limit on UMAD receive List
The existing behavior of ib_umad, which maintains received MAD
packets in an unbounded list, poses a risk of uncontrolled growth.
As user-space applications extract packets from this list, the rate
of extraction may not match the rate of incoming packets, leading
to potential list overflow.

To address this, we introduce a limit to the size of the list. After
considering typical scenarios, such as OpenSM processing, which can
handle approximately 100k packets per second, and the 1-second retry
timeout for most packets, we set the list size limit to 200k. Packets
received beyond this limit are dropped, assuming they are likely timed
out by the time they are handled by user-space.

Notably, packets queued on the receive list due to reasons like
timed-out sends are preserved even when the list is full.

Signed-off-by: Michael Guralnik <michaelgur@nvidia.com>
Reviewed-by: Mark Zhang <markzhang@nvidia.com>
Link: https://lore.kernel.org/r/7197cb58a7d9e78399008f25036205ceab07fbd5.1713268818.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-04-21 14:00:35 +03:00
Chengchang Tang
349e859952 RDMA/hns: Modify the print level of CQE error
Too much print may lead to a panic in kernel. Change ibdev_err() to
ibdev_err_ratelimited(), and change the printing level of cqe dump
to debug level.

Fixes: 7c044adca2 ("RDMA/hns: Simplify the cqe code of poll cq")
Signed-off-by: Chengchang Tang <tangchengchang@huawei.com>
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Link: https://lore.kernel.org/r/20240412091616.370789-11-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-04-16 15:06:47 +03:00
Chengchang Tang
4125269bb9 RDMA/hns: Use complete parentheses in macros
Use complete parentheses to ensure that macro expansion does
not produce unexpected results.

Fixes: a25d13cbe8 ("RDMA/hns: Add the interfaces to support multi hop addressing for the contexts in hip08")
Signed-off-by: Chengchang Tang <tangchengchang@huawei.com>
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Link: https://lore.kernel.org/r/20240412091616.370789-10-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-04-16 15:06:47 +03:00
wenglianfa
9a84848dce RDMA/hns: Add mutex_destroy()
Add mutex_destroy().

Signed-off-by: wenglianfa <wenglianfa@huawei.com>
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Link: https://lore.kernel.org/r/20240412091616.370789-9-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-04-16 15:06:47 +03:00
Chengchang Tang
ee04549328 RDMA/hns: Fix GMV table pagesize
GMV's BA table only supports 4K pages. Currently, PAGESIZE is used to
calculate gmv_bt_num, which will cause an abnormal number of gmv_bt_num
in a 64K OS.

Fixes: d6d91e4621 ("RDMA/hns: Add support for configuring GMV table")
Signed-off-by: Chengchang Tang <tangchengchang@huawei.com>
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Link: https://lore.kernel.org/r/20240412091616.370789-8-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-04-16 15:06:47 +03:00
wenglianfa
dc3bda6e56 RDMA/hns: Fix mismatch exception rollback
When dma_alloc_coherent() fails in hns_roce_alloc_hem(), just call
kfree() to release hem instead of hns_roce_free_hem().

Fixes: c00743cbf2 ("RDMA/hns: Simplify 'struct hns_roce_hem' allocation")
Signed-off-by: wenglianfa <wenglianfa@huawei.com>
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Link: https://lore.kernel.org/r/20240412091616.370789-7-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-04-16 15:06:47 +03:00
Chengchang Tang
a942ec2745 RDMA/hns: Fix UAF for cq async event
The refcount of CQ is not protected by locks. When CQ asynchronous
events and CQ destruction are concurrent, CQ may have been released,
which will cause UAF.

Use the xa_lock() to protect the CQ refcount.

Fixes: 9a4435375c ("IB/hns: Add driver files for hns RoCE driver")
Signed-off-by: Chengchang Tang <tangchengchang@huawei.com>
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Link: https://lore.kernel.org/r/20240412091616.370789-6-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-04-16 15:06:47 +03:00
Chengchang Tang
b46494b6f9 RDMA/hns: Fix deadlock on SRQ async events.
xa_lock for SRQ table may be required in AEQ. Use xa_store_irq()/
xa_erase_irq() to avoid deadlock.

Fixes: 81fce6291d ("RDMA/hns: Add SRQ asynchronous event support")
Signed-off-by: Chengchang Tang <tangchengchang@huawei.com>
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Link: https://lore.kernel.org/r/20240412091616.370789-5-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-04-16 15:06:47 +03:00
Chengchang Tang
2ce384307f RDMA/hns: Add max_ah and cq moderation capacities in query_device()
Add max_ah and cq moderation capacities to hns_roce_query_device().

Fixes: 9a4435375c ("IB/hns: Add driver files for hns RoCE driver")
Signed-off-by: Chengchang Tang <tangchengchang@huawei.com>
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Link: https://lore.kernel.org/r/20240412091616.370789-4-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-04-16 15:06:47 +03:00
Chengchang Tang
f4caa864af RDMA/hns: Remove unused parameters and variables
Remove unused parameters and variables.

Signed-off-by: Chengchang Tang <tangchengchang@huawei.com>
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Link: https://lore.kernel.org/r/20240412091616.370789-3-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-04-16 15:06:47 +03:00
Yangyang Li
bfb6be4014 RDMA/hns: Use macro instead of magic number
Use macro instead of magic number.

Signed-off-by: Yangyang Li <liyangyang20@huawei.com>
Signed-off-by: Chengchang Tang <tangchengchang@huawei.com>
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Link: https://lore.kernel.org/r/20240412091616.370789-2-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-04-16 15:06:47 +03:00
Zhengchao Shao
203b70fda6 RDMA/hns: Fix return value in hns_roce_map_mr_sg
As described in the ib_map_mr_sg function comment, it returns the number
of sg elements that were mapped to the memory region. However,
hns_roce_map_mr_sg returns the number of pages required for mapping the
DMA area. Fix it.

Fixes: 9b2cf76c9f ("RDMA/hns: Optimize PBL buffer allocation process")
Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Link: https://lore.kernel.org/r/20240411033851.2884771-1-shaozhengchao@huawei.com
Reviewed-by: Junxian Huang <huangjunxian6@hisilicon.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-04-16 14:59:08 +03:00
Konstantin Taranov
8859f009ac RDMA/mana_ib: Configure mac address in RNIC
Set local mac address in RNIC, which is required by the HW.

Signed-off-by: Konstantin Taranov <kotaranov@microsoft.com>
Link: https://lore.kernel.org/r/1712738551-22075-7-git-send-email-kotaranov@linux.microsoft.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-04-16 14:28:26 +03:00
Konstantin Taranov
faafb8b126 RDMA/mana_ib: Adding and deleting GIDs
Implement add_gid and del_gid for RNIC.
IPv4 and IPv6 addresses are supported.

Signed-off-by: Konstantin Taranov <kotaranov@microsoft.com>
Link: https://lore.kernel.org/r/1712738551-22075-6-git-send-email-kotaranov@linux.microsoft.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-04-16 14:28:26 +03:00
Konstantin Taranov
8b184e4f1c RDMA/mana_ib: Enable RoCE on port 1
Set netdev and RoCEv2 flag to enable GID population on port 1.
Use GIDs of the master netdev. As mc->ports[] stores slave devices,
use a helper to get the master netdev.

Signed-off-by: Konstantin Taranov <kotaranov@microsoft.com>
Link: https://lore.kernel.org/r/1712738551-22075-5-git-send-email-kotaranov@linux.microsoft.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-04-16 14:28:26 +03:00
Konstantin Taranov
4bda1d5332 RDMA/mana_ib: Implement port parameters
Implement port parameters for RNIC:
1) extend query_port() method
2) implement get_link_layer()
3) implement query_pkey()

Only port 1 can store GIDs.

Signed-off-by: Konstantin Taranov <kotaranov@microsoft.com>
Link: https://lore.kernel.org/r/1712738551-22075-4-git-send-email-kotaranov@linux.microsoft.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-04-16 14:28:26 +03:00
Konstantin Taranov
1a79c2b9d4 RDMA/mana_ib: Create and destroy rnic adapter
Add functions for RNIC creation and destruction.
If creation fails, the ib_probe fails as well.

Signed-off-by: Konstantin Taranov <kotaranov@microsoft.com>
Link: https://lore.kernel.org/r/1712738551-22075-3-git-send-email-kotaranov@linux.microsoft.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-04-16 14:28:26 +03:00
Konstantin Taranov
98b889c439 RDMA/mana_ib: Add EQ creation for rnic adapter
Create an error EQ for the RNIC adapter.

Signed-off-by: Konstantin Taranov <kotaranov@microsoft.com>
Link: https://lore.kernel.org/r/1712738551-22075-2-git-send-email-kotaranov@linux.microsoft.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-04-16 14:28:26 +03:00
Konstantin Taranov
23f59f4e83 RDMA/mana_ib: Use num_comp_vectors of ib_device
Use num_comp_vectors of struct ib_device instead of max_num_queues
from gdma_context.

Signed-off-by: Konstantin Taranov <kotaranov@microsoft.com>
Link: https://lore.kernel.org/r/1712911656-17352-1-git-send-email-kotaranov@linux.microsoft.com
Reviewed-by: Long Li <longli@microsoft.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-04-16 13:29:47 +03:00
Jakub Kicinski
0e36c21d76 Merge branch mana-ib-flex of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git
Erick Archer says:

====================
mana: Add flex array to struct mana_cfg_rx_steer_req_v2 (part)

The "struct mana_cfg_rx_steer_req_v2" uses a dynamically sized set of
trailing elements. Specifically, it uses a "mana_handle_t" array. So,
use the preferred way in the kernel declaring a flexible array [1].

At the same time, prepare for the coming implementation by GCC and Clang
of the __counted_by attribute. Flexible array members annotated with
__counted_by can have their accesses bounds-checked at run-time via
CONFIG_UBSAN_BOUNDS (for array indexing) and CONFIG_FORTIFY_SOURCE (for
strcpy/memcpy-family functions).

Also, avoid the open-coded arithmetic in the memory allocator functions
[2] using the "struct_size" macro.

Moreover, use the "offsetof" helper to get the indirect table offset
instead of the "sizeof" operator and avoid the open-coded arithmetic in
pointers using the new flex member. This new structure member also allow
us to remove the "req_indir_tab" variable since it is no longer needed.

Now, it is also possible to use the "flex_array_size" helper to compute
the size of these trailing elements in the "memcpy" function.

Specifically, the first commit adds the flex member and the patches 2 and
3 refactor the consumers of the "struct mana_cfg_rx_steer_req_v2".

This code was detected with the help of Coccinelle, and audited and
modified manually. The Coccinelle script used to detect this code pattern
is the following:

virtual report

@rule1@
type t1;
type t2;
identifier i0;
identifier i1;
identifier i2;
identifier ALLOC =~ "kmalloc|kzalloc|kmalloc_node|kzalloc_node|vmalloc|vzalloc|kvmalloc|kvzalloc";
position p1;
@@

i0 = sizeof(t1) + sizeof(t2) * i1;
...
i2 = ALLOC@p1(..., i0, ...);

@script:python depends on report@
p1 << rule1.p1;
@@

msg = "WARNING: verify allocation on line %s" % (p1[0].line)
coccilib.report.print_report(p1[0],msg)

Link: https://www.kernel.org/doc/html/next/process/deprecated.html#zero-length-and-one-element-arrays [1]
Link: https://www.kernel.org/doc/html/next/process/deprecated.html#open-coded-arithmetic-in-allocator-arguments [2]

v1: https://lore.kernel.org/linux-hardening/AS8PR02MB7237974EF1B9BAFA618166C38B382@AS8PR02MB7237.eurprd02.prod.outlook.com/
v2: https://lore.kernel.org/linux-hardening/AS8PR02MB723729C5A63F24C312FC9CD18B3F2@AS8PR02MB7237.eurprd02.prod.outlook.com/
====================

Link: https://lore.kernel.org/r/AS8PR02MB72374BD1B23728F2E3C3B1A18B022@AS8PR02MB7237.eurprd02.prod.outlook.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-11 07:38:28 -07:00
Zhu Yanjun
dfcdb38b21 RDMA/rxe: Return the correct errno
In the function __rxe_add_to_pool, the function xa_alloc_cyclic is
called. The return value of the function xa_alloc_cyclic is as below:
"
 Return: 0 if the allocation succeeded without wrapping.  1 if the
 allocation succeeded after wrapping, -ENOMEM if memory could not be
 allocated or -EBUSY if there are no free entries in @limit.
"
But now the function __rxe_add_to_pool only returns -EINVAL. All the
returned error value should be returned to the caller.

Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Link: https://lore.kernel.org/r/20240408142142.792413-1-yanjun.zhu@linux.dev
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-04-11 14:47:26 +03:00
Konstantin Taranov
c8fc935f4b RDMA/mana_ib: remove useless return values from dbg prints
Remove printing ret value on success as it was always 0.

Signed-off-by: Konstantin Taranov <kotaranov@microsoft.com>
Link: https://lore.kernel.org/r/1712672465-29960-1-git-send-email-kotaranov@linux.microsoft.com
Reviewed-by: Long Li <longli@microsoft.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-04-11 14:30:24 +03:00
Leon Romanovsky
e537deecda RDMA/mana_ib: Add flex array to struct mana_cfg_rx_steer_req_v2
The "struct mana_cfg_rx_steer_req_v2" uses a dynamically sized set of
trailing elements. Specifically, it uses a "mana_handle_t" array. So,
use the preferred way in the kernel declaring a flexible array [1].

At the same time, prepare for the coming implementation by GCC and Clang
of the __counted_by attribute. Flexible array members annotated with
__counted_by can have their accesses bounds-checked at run-time via
CONFIG_UBSAN_BOUNDS (for array indexing) and CONFIG_FORTIFY_SOURCE (for
strcpy/memcpy-family functions).

Also, avoid the open-coded arithmetic in the memory allocator functions
[2] using the "struct_size" macro.

Moreover, use the "offsetof" helper to get the indirect table offset
instead of the "sizeof" operator and avoid the open-coded arithmetic in
pointers using the new flex member. This new structure member also allow
us to remove the "req_indir_tab" variable since it is no longer needed.

Now, it is also possible to use the "flex_array_size" helper to compute
the size of these trailing elements in the "memcpy" function.

Specifically, the first commit adds the flex member and the patches 2
and 3 refactor the consumers of the "struct mana_cfg_rx_steer_req_v2".

This code was detected with the help of Coccinelle, and audited and
modified manually. The Coccinelle script used to detect this code
pattern is the following:

virtual report

@rule1@
type t1;
type t2;
identifier i0;
identifier i1;
identifier i2;
identifier ALLOC =~
"kmalloc|kzalloc|kmalloc_node|kzalloc_node|vmalloc|vzalloc|kvmalloc|kvzalloc";
position p1;
@@

i0 = sizeof(t1) + sizeof(t2) * i1;
...
i2 = ALLOC@p1(..., i0, ...);

@script:python depends on report@
p1 << rule1.p1;
@@

msg = "WARNING: verify allocation on line %s" % (p1[0].line)
coccilib.report.print_report(p1[0],msg)

[1] https://www.kernel.org/doc/html/next/process/deprecated.html#zero-length-and-one-element-arrays
[2] https://www.kernel.org/doc/html/next/process/deprecated.html#open-coded-arithmetic-in-allocator-arguments
Link: https://lore.kernel.org/all/AS8PR02MB72374BD1B23728F2E3C3B1A18B022@AS8PR02MB7237.eurprd02.prod.outlook.com
Signed-off-by: Erick Archer <erick.archer@outlook.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>

* mana-ib-flex:
  net: mana: Avoid open coded arithmetic
  RDMA/mana_ib: Prefer struct_size over open coded arithmetic
  net: mana: Add flex array to struct mana_cfg_rx_steer_req_v2
2024-04-11 13:48:53 +03:00
Erick Archer
29b8e13a8b RDMA/mana_ib: Prefer struct_size over open coded arithmetic
This is an effort to get rid of all multiplications from allocation
functions in order to prevent integer overflows [1][2].

As the "req" variable is a pointer to "struct mana_cfg_rx_steer_req_v2"
and this structure ends in a flexible array:

struct mana_cfg_rx_steer_req_v2 {
	[...]
        mana_handle_t indir_tab[] __counted_by(num_indir_entries);
};

the preferred way in the kernel is to use the struct_size() helper to
do the arithmetic instead of the calculation "size + size * count" in
the kzalloc() function.

Moreover, use the "offsetof" helper to get the indirect table offset
instead of the "sizeof" operator and avoid the open-coded arithmetic in
pointers using the new flex member. This new structure member also allow
us to remove the "req_indir_tab" variable since it is no longer needed.

This way, the code is more readable and safer.

This code was detected with the help of Coccinelle, and audited and
modified manually.

Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#open-coded-arithmetic-in-allocator-arguments [1]
Link: https://github.com/KSPP/linux/issues/160 [2]
Signed-off-by: Erick Archer <erick.archer@outlook.com>
Link: https://lore.kernel.org/r/AS8PR02MB72375EB06EE1A84A67BE722E8B022@AS8PR02MB7237.eurprd02.prod.outlook.com
Reviewed-by: Long Li <longli@microsoft.com>
Reviewed-by: Justin Stitt <justinstitt@google.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-04-11 13:46:47 +03:00
Junxian Huang
ee20cc17e9 RDMA/hns: Support DSCP
Add support for DSCP configuration. For DSCP, get dscp-prio mapping
via hns3 nic driver api .get_dscp_prio() and fill the SL (in WQE for
UD or in QPC for RC) with the priority value. The prio-tc mapping is
configured to HW by hns3 nic driver. HW will select a corresponding
TC according to SL and the prio-tc mapping.

Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Link: https://lore.kernel.org/r/20240315093551.1650088-1-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-04-09 10:27:41 +03:00
Guillaume Nault
ec20b28300 ipv4: Set scope explicitly in ip_route_output().
Add a "scope" parameter to ip_route_output() so that callers don't have
to override the tos parameter with the RTO_ONLINK flag if they want a
local scope.

This will allow converting flowi4_tos to dscp_t in the future, thus
allowing static analysers to flag invalid interactions between
"tos" (the DSCP bits) and ECN.

Only three users ask for local scope (bonding, arp and atm). The others
continue to use RT_SCOPE_UNIVERSE. While there, add a comment to warn
users about the limitations of ip_route_output().

Signed-off-by: Guillaume Nault <gnault@redhat.com>
Acked-by: Leon Romanovsky <leonro@nvidia.com> # infiniband
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-08 13:20:51 +01:00
Or Har-Toov
2ca7e93bc9 RDMA/mlx5: Adding remote atomic access flag to updatable flags
Currently IB_ACCESS_REMOTE_ATOMIC is blocked from being updated via UMR
although in some cases it should be possible. These cases are checked in
mlx5r_umr_can_reconfig function.

Fixes: ef3642c4f5 ("RDMA/mlx5: Fix error unwinds for rereg_mr")
Signed-off-by: Or Har-Toov <ohartoov@nvidia.com>
Link: https://lore.kernel.org/r/24dac73e2fa48cb806f33a932d97f3e402a5ea2c.1712140377.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-04-08 13:34:54 +03:00
Or Har-Toov
8c1185fef6 RDMA/mlx5: Change check for cacheable mkeys
umem can be NULL for user application mkeys in some cases. Therefore
umem can't be used for checking if the mkey is cacheable and it is
changed for checking a flag that indicates it. Also make sure that
all mkeys which are not returned to the cache will be destroyed.

Fixes: dd1b913fb0 ("RDMA/mlx5: Cache all user cacheable mkeys on dereg MR flow")
Signed-off-by: Or Har-Toov <ohartoov@nvidia.com>
Link: https://lore.kernel.org/r/2690bc5c6896bcb937f89af16a1ff0343a7ab3d0.1712140377.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-04-08 13:34:51 +03:00
Or Har-Toov
0611a8e8b4 RDMA/mlx5: Uncacheable mkey has neither rb_key or cache_ent
As some mkeys can't be modified with UMR due to some UMR limitations,
like the size of translation that can be updated, not all user mkeys can
be cached.

Fixes: dd1b913fb0 ("RDMA/mlx5: Cache all user cacheable mkeys on dereg MR flow")
Signed-off-by: Or Har-Toov <ohartoov@nvidia.com>
Link: https://lore.kernel.org/r/f2742dd934ed73b2d32c66afb8e91b823063880c.1712140377.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-04-08 13:34:47 +03:00
Michael Guralnik
be121ffb38 RDMA/mlx5: Fix port number for counter query in multi-port configuration
Set the correct port when querying PPCNT in multi-port configuration.
Distinguish between cases where switchdev mode was enabled to multi-port
configuration and don't overwrite the queried port to 1 in multi-port
case.

Fixes: 74b30b3ad5 ("RDMA/mlx5: Set local port to one when accessing counters")
Signed-off-by: Michael Guralnik <michaelgur@nvidia.com>
Link: https://lore.kernel.org/r/9bfcc8ade958b760a51408c3ad654a01b11f7d76.1712134988.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-04-08 13:33:10 +03:00
Konstantin Taranov
f10242b3da RDMA/mana_ib: Use struct mana_ib_queue for RAW QPs
Use struct mana_ib_queue and its helpers for RAW QPs

Signed-off-by: Konstantin Taranov <kotaranov@microsoft.com>
Link: https://lore.kernel.org/r/1711483688-24358-5-git-send-email-kotaranov@linux.microsoft.com
Reviewed-by: Long Li <longli@microsoft.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-04-02 11:30:23 +03:00
Konstantin Taranov
688bac28e3 RDMA/mana_ib: Use struct mana_ib_queue for WQs
Use struct mana_ib_queue and its helpers for WQs

Signed-off-by: Konstantin Taranov <kotaranov@microsoft.com>
Link: https://lore.kernel.org/r/1711483688-24358-4-git-send-email-kotaranov@linux.microsoft.com
Reviewed-by: Long Li <longli@microsoft.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-04-02 11:30:23 +03:00
Konstantin Taranov
60a7ac0b8b RDMA/mana_ib: Use struct mana_ib_queue for CQs
Use struct mana_ib_queue and its helpers for CQs

Signed-off-by: Konstantin Taranov <kotaranov@microsoft.com>
Link: https://lore.kernel.org/r/1711483688-24358-3-git-send-email-kotaranov@linux.microsoft.com
Reviewed-by: Long Li <longli@microsoft.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-04-02 11:30:23 +03:00
Konstantin Taranov
46f5be7cd4 RDMA/mana_ib: Introduce helpers to create and destroy mana queues
Intoduce helpers to work with mana ib queues (struct mana_ib_queue).
A queue always consists of umem, gdma_region, and id.
A queue can become a WQ or a CQ.

Signed-off-by: Konstantin Taranov <kotaranov@microsoft.com>
Link: https://lore.kernel.org/r/1711483688-24358-2-git-send-email-kotaranov@linux.microsoft.com
Reviewed-by: Long Li <longli@microsoft.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-04-02 11:30:22 +03:00
Mark Zhang
b68e1acb58 RDMA/cm: Print the old state when cm_destroy_id gets timeout
The old state is helpful for debugging, as the current state is always
IB_CM_IDLE when timeout happens.

Fixes: 96d9cbe2f2 ("RDMA/cm: add timeout to cm_destroy_id wait")
Signed-off-by: Mark Zhang <markzhang@nvidia.com>
Link: https://lore.kernel.org/r/20240322112049.2022994-1-markzhang@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-04-01 15:16:36 +03:00
Wenchao Hao
ca537a3477 RDMA/restrack: Fix potential invalid address access
struct rdma_restrack_entry's kern_name was set to KBUILD_MODNAME
in ib_create_cq(), while if the module exited but forgot del this
rdma_restrack_entry, it would cause a invalid address access in
rdma_restrack_clean() when print the owner of this rdma_restrack_entry.

These code is used to help find one forgotten PD release in one of the
ULPs. But it is not needed anymore, so delete them.

Signed-off-by: Wenchao Hao <haowenchao2@huawei.com>
Link: https://lore.kernel.org/r/20240318092320.1215235-1-haowenchao2@huawei.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-04-01 14:49:27 +03:00
Boshi Yu
df0e16bab5 RDMA/erdma: Remove unnecessary __GFP_ZERO flag
The dma_alloc_coherent() interface automatically zero the memory returned.
Thus, we do not need to specify the __GFP_ZERO flag explicitly when we call
dma_alloc_coherent().

Reviewed-by: Cheng Xu <chengyou@linux.alibaba.com>
Signed-off-by: Boshi Yu <boshiyu@linux.alibaba.com>
Link: https://lore.kernel.org/r/20240311113821.22482-4-boshiyu@alibaba-inc.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-04-01 14:46:01 +03:00
Boshi Yu
fdb09ed15f RDMA/erdma: Unify the names related to doorbell records
There exist two different names for the doorbell records: db_info and
db_record. We use dbrec for cpu address of the doorbell record and
dbrec_dma for dma address of the doorbell recordi uniformly.

Reviewed-by: Cheng Xu <chengyou@linux.alibaba.com>
Signed-off-by: Boshi Yu <boshiyu@linux.alibaba.com>
Link: https://lore.kernel.org/r/20240311113821.22482-3-boshiyu@alibaba-inc.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-04-01 14:46:01 +03:00
Boshi Yu
f0697bf078 RDMA/erdma: Allocate doorbell records from dma pool
Currently, the 8 byte doorbell record is allocated along with the queue
buffer, which may result in waste of dma space when the queue buffer is
page aligned. To address this issue, we introduce a dma pool named
db_pool and allocate doorbell record from it.

Reviewed-by: Cheng Xu <chengyou@linux.alibaba.com>
Signed-off-by: Boshi Yu <boshiyu@linux.alibaba.com>
Link: https://lore.kernel.org/r/20240311113821.22482-2-boshiyu@alibaba-inc.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-04-01 14:46:01 +03:00
Yanjun.Zhu
481047d7e8 RDMA/rxe: Fix the problem "mutex_destroy missing"
When a mutex lock is not used any more, the function mutex_destroy
should be called to mark the mutex lock uninitialized.

Fixes: 8700e3e7c4 ("Soft RoCE driver")
Signed-off-by: Yanjun.Zhu <yanjun.zhu@linux.dev>
Link: https://lore.kernel.org/r/20240314065140.27468-1-yanjun.zhu@linux.dev
Reviewed-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-04-01 14:36:41 +03:00
Linus Torvalds
6207b37eb5 RDMA v6.9
Very small update this cycle:
 
 - Minor code improvements in fi, rxe, ipoib, mana, cxgb4, mlx5, irdma,
   rxe, rtrs, mana
 
 - Simplify the hns hem mechanism
 
 - Fix EFA's MSI-X allocation in resource constrained configurations
 
 - Fix a KASN splat in srpt
 
 - Narrow hns's congestion control selection to QPs granularity and allow
   userspace to select it
 
 - Solve a parallel module loading race between the CM module and a driver
   module
 
 - Flexible array cleanup
 
 - Dump hns's SCC Conext to 'rdma res' for debugging
 
 - Make mana build page lists for HW objects that require a 0 offset
   correctly
 
 - Stuck CM ID debugging
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQRRRCHOFoQz/8F5bUaFwuHvBreFYQUCZfgzdQAKCRCFwuHvBreF
 YbS7AQDLy6uJ/1dgrZQ4efcyQDs6H93LG4jWZKoA7F9Oho+MFQEAsQM/UL4nj18O
 T6vHl30N0Ee0aOCqET7HBbnFGKEADAE=
 =KxUj
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma

Pull rdma updates from Jason Gunthorpe:
 "Very small update this cycle:

   - Minor code improvements in fi, rxe, ipoib, mana, cxgb4, mlx5,
     irdma, rxe, rtrs, mana

   - Simplify the hns hem mechanism

   - Fix EFA's MSI-X allocation in resource constrained configurations

   - Fix a KASN splat in srpt

   - Narrow hns's congestion control selection to QPs granularity and
     allow userspace to select it

   - Solve a parallel module loading race between the CM module and a
     driver module

   - Flexible array cleanup

   - Dump hns's SCC Conext to 'rdma res' for debugging

   - Make mana build page lists for HW objects that require a 0 offset
     correctly

   - Stuck CM ID debugging"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (29 commits)
  RDMA/cm: add timeout to cm_destroy_id wait
  RDMA/mana_ib: Use virtual address in dma regions for MRs
  RDMA/mana_ib: Fix bug in creation of dma regions
  RDMA/hns: Append SCC context to the raw dump of QPC
  RDMA/uverbs: Avoid -Wflex-array-member-not-at-end warnings
  RDMA/hns: Support userspace configuring congestion control algorithm with QP granularity
  RDMA/rtrs-clt: Check strnlen return len in sysfs mpath_policy_store()
  RDMA/uverbs: Remove flexible arrays from struct *_filter
  RDMA/device: Fix a race between mad_client and cm_client init
  RDMA/hns: Fix mis-modifying default congestion control algorithm
  RDMA/rxe: Remove unused 'iova' parameter from rxe_mr_init_user
  RDMA/srpt: Do not register event handler until srpt device is fully setup
  RDMA/irdma: Remove duplicate assignment
  RDMA/efa: Limit EQs to available MSI-X vectors
  RDMA/mlx5: Delete unused mlx5_ib_copy_pas prototype
  RDMA/cxgb4: Delete unused c4iw_ep_redirect prototype
  RDMA/mana_ib: Introduce mana_ib_install_cq_cb helper function
  RDMA/mana_ib: Introduce mana_ib_get_netdev helper function
  RDMA/mana_ib: Introduce mdev_to_gc helper function
  RDMA/hns: Simplify 'struct hns_roce_hem' allocation
  ...
2024-03-18 15:34:03 -07:00
Manjunath Patil
96d9cbe2f2 RDMA/cm: add timeout to cm_destroy_id wait
Add timeout to cm_destroy_id, so that userspace can trigger any data
collection that would help in analyzing the cause of delay in destroying
the cm_id.

New noinline function helps dtrace/ebpf programs to hook on to it.
Existing functionality isn't changed except triggering a probe-able new
function at every timeout interval.

We have seen cases where CM messages stuck with MAD layer (either due to
software bug or faulty HCA), leading to cm_id getting stuck in the
following call stack. This patch helps in resolving such issues faster.

kernel: ... INFO: task XXXX:56778 blocked for more than 120 seconds.
...
	Call Trace:
	__schedule+0x2bc/0x895
	schedule+0x36/0x7c
	schedule_timeout+0x1f6/0x31f
 	? __slab_free+0x19c/0x2ba
	wait_for_completion+0x12b/0x18a
	? wake_up_q+0x80/0x73
	cm_destroy_id+0x345/0x610 [ib_cm]
	ib_destroy_cm_id+0x10/0x20 [ib_cm]
	rdma_destroy_id+0xa8/0x300 [rdma_cm]
	ucma_destroy_id+0x13e/0x190 [rdma_ucm]
	ucma_write+0xe0/0x160 [rdma_ucm]
	__vfs_write+0x3a/0x16d
	vfs_write+0xb2/0x1a1
	? syscall_trace_enter+0x1ce/0x2b8
	SyS_write+0x5c/0xd3
	do_syscall_64+0x79/0x1b9
	entry_SYSCALL_64_after_hwframe+0x16d/0x0

Signed-off-by: Manjunath Patil <manjunath.b.patil@oracle.com>
Link: https://lore.kernel.org/r/20240309063323.458102-1-manjunath.b.patil@oracle.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-03-10 13:17:54 +02:00
Konstantin Taranov
2d5c008157 RDMA/mana_ib: Use virtual address in dma regions for MRs
Introduce mana_ib_create_dma_region() to create dma regions with iova
for MRs. It allows creating MRs with any page offset. Previously,
only page-aligned addresses worked.

For dma regions that must have a zero dma offset (e.g., for queues),
mana_ib_create_zero_offset_dma_region() is added.
To get the zero offset, ib_umem_find_best_pgoff() is used with zero
pgoff_bitmask.

Fixes: 0266a17763 ("RDMA/mana_ib: Add a driver for Microsoft Azure Network Adapter")
Signed-off-by: Konstantin Taranov <kotaranov@microsoft.com>
Link: https://lore.kernel.org/r/1709560361-26393-3-git-send-email-kotaranov@linux.microsoft.com
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-03-07 11:32:33 +02:00
Konstantin Taranov
e02497fb65 RDMA/mana_ib: Fix bug in creation of dma regions
Use ib_umem_dma_offset() helper to calculate correct dma offset.

Fixes: 0266a17763 ("RDMA/mana_ib: Add a driver for Microsoft Azure Network Adapter")
Signed-off-by: Konstantin Taranov <kotaranov@microsoft.com>
Link: https://lore.kernel.org/r/1709560361-26393-2-git-send-email-kotaranov@linux.microsoft.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-03-07 11:32:33 +02:00
wenglianfa
124a9fbe43 RDMA/hns: Append SCC context to the raw dump of QPC
SCCC (SCC Context) is a context with QP granularity that contains
information about congestion control. Dump SCCC and QPC together
to improve troubleshooting.

When dumping raw QPC with rdmatool, there will be a total of 576 bytes
data output, where the first 512 bytes is QPC and the last 64 bytes is
SCCC. When congestion control is disabled, the 64 byte SCCC will be all 0.

Example:
$rdma res show qp -jpr
[ {
        "ifindex": 0,
        "ifname": "hns_0",
	"data": [ 67,0,0,0... 512bytes
		  4,0,2... 64bytes]
  },...
} ]

Signed-off-by: wenglianfa <wenglianfa@huawei.com>
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Link: https://lore.kernel.org/r/20240305055257.823513-1-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-03-07 11:26:10 +02:00
Gustavo A. R. Silva
155f04366e RDMA/uverbs: Avoid -Wflex-array-member-not-at-end warnings
-Wflex-array-member-not-at-end is coming in GCC-14, and we are getting
ready to enable it globally.

There are currently a couple of objects (`alloc_head` and `bundle`) in
`struct bundle_priv` that contain a couple of flexible structures:

struct bundle_priv {
        /* Must be first */
        struct bundle_alloc_head alloc_head;

	...

        /*
         * Must be last. bundle ends in a flex array which overlaps
         * internal_buffer.
         */
        struct uverbs_attr_bundle bundle;
        u64 internal_buffer[32];
};

So, in order to avoid ending up with a couple of flexible-array members
in the middle of a struct, we use the `struct_group_tagged()` helper to
separate the flexible array from the rest of the members in the flexible
structures:

struct uverbs_attr_bundle {
        struct_group_tagged(uverbs_attr_bundle_hdr, hdr,
		... the rest of the members
        );
        struct uverbs_attr attrs[];
};

With the change described above, we now declare objects of the type of
the tagged struct without embedding flexible arrays in the middle of
another struct:

struct bundle_priv {
        /* Must be first */
        struct bundle_alloc_head_hdr alloc_head;

        ...

        struct uverbs_attr_bundle_hdr bundle;
        u64 internal_buffer[32];
};

We also use `container_of()` whenever we need to retrieve a pointer
to the flexible structures.

Notice that the `bundle_size` computed in `uapi_compute_bundle_size()`
remains the same.

So, with these changes, fix the following warnings:

drivers/infiniband/core/uverbs_ioctl.c:45:34: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end]
   45 |         struct bundle_alloc_head alloc_head;
      |                                  ^~~~~~~~~~
drivers/infiniband/core/uverbs_ioctl.c:67:35: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end]
   67 |         struct uverbs_attr_bundle bundle;
      |                                   ^~~~~~

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Link: https://lore.kernel.org/r/ZeIgeZ5Sb0IZTOyt@neat
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-03-03 15:38:44 +02:00
Junxian Huang
6ec429d588 RDMA/hns: Support userspace configuring congestion control algorithm with QP granularity
Currently, congestion control algorithm is statically configured in
FW, and all QPs use the same algorithm(except UD which has a fixed
configuration of DCQCN). This is not flexible enough.

Support userspace configuring congestion control algorithm with QP
granularity while creating QPs. If the algorithm is not specified in
userspace, use the default one.

Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Link: https://lore.kernel.org/r/20240301104845.1141083-1-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-03-03 15:01:33 +02:00
Eric Dumazet
e353ea9ce4 rtnetlink: prepare nla_put_iflink() to run under RCU
We want to be able to run rtnl_fill_ifinfo() under RCU protection
instead of RTNL in the future.

This patch prepares dev_get_iflink() and nla_put_iflink()
to run either with RTNL or RCU held.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-26 11:46:12 +00:00
Al Viro
aa23317d02 qibfs: fix dentry leak
simple_recursive_removal() drops the pinning references to all positives
in subtree.  For the cases when its argument has been kept alive by
the pinning alone that's exactly the right thing to do, but here
the argument comes from dcache lookup, that needs to be balanced by
explicit dput().

Fixes: e41d237818 "qib_fs: switch to simple_recursive_removal()"
Fucked-up-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-02-25 23:58:42 -05:00
Alexey Kodanev
7a7b7f575a RDMA/rtrs-clt: Check strnlen return len in sysfs mpath_policy_store()
strnlen() may return 0 (e.g. for "\0\n" string), it's better to
check the result of strnlen() before using 'len - 1' expression
for the 'buf' array index.

Detected using the static analysis tool - Svace.

Fixes: dc3b66a0ce ("RDMA/rtrs-clt: Add a minimum latency multipath policy")
Signed-off-by: Alexey Kodanev <aleksei.kodanev@bell-sw.com>
Link: https://lore.kernel.org/r/20240221113204.147478-1-aleksei.kodanev@bell-sw.com
Acked-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-02-25 18:21:42 +02:00
Erick Archer
14b526f55b RDMA/uverbs: Remove flexible arrays from struct *_filter
When a struct containing a flexible array is included in another struct,
and there is a member after the struct-with-flex-array, there is a
possibility of memory overlap. These cases must be audited [1]. See:

struct inner {
	...
	int flex[];
};

struct outer {
	...
	struct inner header;
	int overlap;
	...
};

This is the scenario for all the "struct *_filter" structures that are
included in the following "struct ib_flow_spec_*" structures:

struct ib_flow_spec_eth
struct ib_flow_spec_ib
struct ib_flow_spec_ipv4
struct ib_flow_spec_ipv6
struct ib_flow_spec_tcp_udp
struct ib_flow_spec_tunnel
struct ib_flow_spec_esp
struct ib_flow_spec_gre
struct ib_flow_spec_mpls

The pattern is like the one shown below:

struct *_filter {
	...
	u8 real_sz[];
};

struct ib_flow_spec_* {
	...
	struct *_filter val;
	struct *_filter mask;
};

In this case, the trailing flexible array "real_sz" is never allocated
and is only used to calculate the size of the structures. Here the use
of the "offsetof" helper can be changed by the "sizeof" operator because
the goal is to get the size of these structures. Therefore, the trailing
flexible arrays can also be removed.

However, due to the trailing padding that can be induced in structs it
is possible that the:

offsetof(struct *_filter, real_sz) != sizeof(struct *_filter)

This situation happens with the "struct ib_flow_ipv6_filter" and to
avoid it the "__packed" macro is used in this structure. But now, the
"sizeof(struct ib_flow_ipv6_filter)" has changed. This is not a problem
since this size is not used in the code.

The situation now is that "sizeof(struct ib_flow_spec_ipv6)" has also
changed (this struct contains the struct ib_flow_ipv6_filter). This is
also not a problem since it is only used to set the size of the "union
ib_flow_spec", which can store all the "ib_flow_spec_*" structures.

Link: https://lore.kernel.org/r/20240217142913.4285-1-erick.archer@gmx.com
Signed-off-by: Erick Archer <erick.archer@gmx.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2024-02-21 13:28:52 -04:00
Shifeng Li
7a8bccd8b2 RDMA/device: Fix a race between mad_client and cm_client init
The mad_client will be initialized in enable_device_and_get(), while the
devices_rwsem will be downgraded to a read semaphore. There is a window
that leads to the failed initialization for cm_client, since it can not
get matched mad port from ib_mad_port_list, and the matched mad port will
be added to the list after that.

    mad_client    |                       cm_client
------------------|--------------------------------------------------------
ib_register_device|
enable_device_and_get
down_write(&devices_rwsem)
xa_set_mark(&devices, DEVICE_REGISTERED)
downgrade_write(&devices_rwsem)
                  |
                  |ib_cm_init
                  |ib_register_client(&cm_client)
                  |down_read(&devices_rwsem)
                  |xa_for_each_marked (&devices, DEVICE_REGISTERED)
                  |add_client_context
                  |cm_add_one
                  |ib_register_mad_agent
                  |ib_get_mad_port
                  |__ib_get_mad_port
                  |list_for_each_entry(entry, &ib_mad_port_list, port_list)
                  |return NULL
                  |up_read(&devices_rwsem)
                  |
add_client_context|
ib_mad_init_device|
ib_mad_port_open  |
list_add_tail(&port_priv->port_list, &ib_mad_port_list)
up_read(&devices_rwsem)
                  |

Fix it by using down_write(&devices_rwsem) in ib_register_client().

Fixes: d0899892ed ("RDMA/device: Provide APIs from the core code to help unregistration")
Link: https://lore.kernel.org/r/20240203035313.98991-1-lishifeng@sangfor.com.cn
Suggested-by: Jason Gunthorpe <jgg@ziepe.ca>
Signed-off-by: Shifeng Li <lishifeng@sangfor.com.cn>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2024-02-21 13:15:50 -04:00
Luoyouming
d20a7cf9f7 RDMA/hns: Fix mis-modifying default congestion control algorithm
Commit 27c5fd271d ("RDMA/hns: The UD mode can only be configured
with DCQCN") adds a check of congest control alorithm for UD. But
that patch causes a problem: hr_dev->caps.congest_type is global,
used by all QPs, so modifying this field to DCQCN for UD QPs causes
other QPs unable to use any other algorithm except DCQCN.

Revert the modification in commit 27c5fd271d ("RDMA/hns: The UD
mode can only be configured with DCQCN"). Add a new field cong_type
to struct hns_roce_qp and configure DCQCN for UD QPs.

Fixes: 27c5fd271d ("RDMA/hns: The UD mode can only be configured with DCQCN")
Fixes: f91696f2f0 ("RDMA/hns: Support congestion control type selection according to the FW")
Signed-off-by: Luoyouming <luoyouming@huawei.com>
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Link: https://lore.kernel.org/r/20240219061805.668170-1-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-02-19 09:50:31 +02:00
Arnd Bergmann
eb5c7465c3 RDMA/srpt: fix function pointer cast warnings
clang-16 notices that srpt_qp_event() gets called through an incompatible
pointer here:

drivers/infiniband/ulp/srpt/ib_srpt.c:1815:5: error: cast from 'void (*)(struct ib_event *, struct srpt_rdma_ch *)' to 'void (*)(struct ib_event *, void *)' converts to incompatible function type [-Werror,-Wcast-function-type-strict]
 1815 |                 = (void(*)(struct ib_event *, void*))srpt_qp_event;

Change srpt_qp_event() to use the correct prototype and adjust the
argument inside of it.

Fixes: a42d985bd5 ("ib_srpt: Initial SRP Target merge for v3.3-rc1")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/r/20240213100728.458348-1-arnd@kernel.org
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-02-14 11:15:54 +02:00
Kamal Heib
5ba4e6d586 RDMA/qedr: Fix qedr_create_user_qp error flow
Avoid the following warning by making sure to free the allocated
resources in case that qedr_init_user_queue() fail.

-----------[ cut here ]-----------
WARNING: CPU: 0 PID: 143192 at drivers/infiniband/core/rdma_core.c:874 uverbs_destroy_ufile_hw+0xcf/0xf0 [ib_uverbs]
Modules linked in: tls target_core_user uio target_core_pscsi target_core_file target_core_iblock ib_srpt ib_srp scsi_transport_srp nfsd nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache netfs 8021q garp mrp stp llc ext4 mbcache jbd2 opa_vnic ib_umad ib_ipoib sunrpc rdma_ucm ib_isert iscsi_target_mod target_core_mod ib_iser libiscsi scsi_transport_iscsi rdma_cm iw_cm ib_cm hfi1 intel_rapl_msr intel_rapl_common mgag200 qedr sb_edac drm_shmem_helper rdmavt x86_pkg_temp_thermal drm_kms_helper intel_powerclamp ib_uverbs coretemp i2c_algo_bit kvm_intel dell_wmi_descriptor ipmi_ssif sparse_keymap kvm ib_core rfkill syscopyarea sysfillrect video sysimgblt irqbypass ipmi_si ipmi_devintf fb_sys_fops rapl iTCO_wdt mxm_wmi iTCO_vendor_support intel_cstate pcspkr dcdbas intel_uncore ipmi_msghandler lpc_ich acpi_power_meter mei_me mei fuse drm xfs libcrc32c qede sd_mod ahci libahci t10_pi sg crct10dif_pclmul crc32_pclmul crc32c_intel qed libata tg3
ghash_clmulni_intel megaraid_sas crc8 wmi [last unloaded: ib_srpt]
CPU: 0 PID: 143192 Comm: fi_rdm_tagged_p Kdump: loaded Not tainted 5.14.0-408.el9.x86_64 #1
Hardware name: Dell Inc. PowerEdge R430/03XKDV, BIOS 2.14.0 01/25/2022
RIP: 0010:uverbs_destroy_ufile_hw+0xcf/0xf0 [ib_uverbs]
Code: 5d 41 5c 41 5d 41 5e e9 0f 26 1b dd 48 89 df e8 67 6a ff ff 49 8b 86 10 01 00 00 48 85 c0 74 9c 4c 89 e7 e8 83 c0 cb dd eb 92 <0f> 0b eb be 0f 0b be 04 00 00 00 48 89 df e8 8e f5 ff ff e9 6d ff
RSP: 0018:ffffb7c6cadfbc60 EFLAGS: 00010286
RAX: ffff8f0889ee3f60 RBX: ffff8f088c1a5200 RCX: 00000000802a0016
RDX: 00000000802a0017 RSI: 0000000000000001 RDI: ffff8f0880042600
RBP: 0000000000000001 R08: 0000000000000001 R09: 0000000000000000
R10: ffff8f11fffd5000 R11: 0000000000039000 R12: ffff8f0d5b36cd80
R13: ffff8f088c1a5250 R14: ffff8f1206d91000 R15: 0000000000000000
FS: 0000000000000000(0000) GS:ffff8f11d7c00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000147069200e20 CR3: 00000001c7210002 CR4: 00000000001706f0
Call Trace:
<TASK>
? show_trace_log_lvl+0x1c4/0x2df
? show_trace_log_lvl+0x1c4/0x2df
? ib_uverbs_close+0x1f/0xb0 [ib_uverbs]
? uverbs_destroy_ufile_hw+0xcf/0xf0 [ib_uverbs]
? __warn+0x81/0x110
? uverbs_destroy_ufile_hw+0xcf/0xf0 [ib_uverbs]
? report_bug+0x10a/0x140
? handle_bug+0x3c/0x70
? exc_invalid_op+0x14/0x70
? asm_exc_invalid_op+0x16/0x20
? uverbs_destroy_ufile_hw+0xcf/0xf0 [ib_uverbs]
ib_uverbs_close+0x1f/0xb0 [ib_uverbs]
__fput+0x94/0x250
task_work_run+0x5c/0x90
do_exit+0x270/0x4a0
do_group_exit+0x2d/0x90
get_signal+0x87c/0x8c0
arch_do_signal_or_restart+0x25/0x100
? ib_uverbs_ioctl+0xc2/0x110 [ib_uverbs]
exit_to_user_mode_loop+0x9c/0x130
exit_to_user_mode_prepare+0xb6/0x100
syscall_exit_to_user_mode+0x12/0x40
do_syscall_64+0x69/0x90
? syscall_exit_work+0x103/0x130
? syscall_exit_to_user_mode+0x22/0x40
? do_syscall_64+0x69/0x90
? syscall_exit_work+0x103/0x130
? syscall_exit_to_user_mode+0x22/0x40
? do_syscall_64+0x69/0x90
? do_syscall_64+0x69/0x90
? common_interrupt+0x43/0xa0
entry_SYSCALL_64_after_hwframe+0x72/0xdc
RIP: 0033:0x1470abe3ec6b
Code: Unable to access opcode bytes at RIP 0x1470abe3ec41.
RSP: 002b:00007fff13ce9108 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: fffffffffffffffc RBX: 00007fff13ce9218 RCX: 00001470abe3ec6b
RDX: 00007fff13ce9200 RSI: 00000000c0181b01 RDI: 0000000000000004
RBP: 00007fff13ce91e0 R08: 0000558d9655da10 R09: 0000558d9655dd00
R10: 00007fff13ce95c0 R11: 0000000000000246 R12: 00007fff13ce9358
R13: 0000000000000013 R14: 0000558d9655db50 R15: 00007fff13ce9470
</TASK>
--[ end trace 888a9b92e04c5c97 ]--

Fixes: df15856132 ("RDMA/qedr: restructure functions that create/destroy QPs")
Signed-off-by: Kamal Heib <kheib@redhat.com>
Link: https://lore.kernel.org/r/20240208223628.2040841-1-kheib@redhat.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-02-12 13:49:04 +02:00
Bart Van Assche
fdfa083549 RDMA/srpt: Support specifying the srpt_service_guid parameter
Make loading ib_srpt with this parameter set work. The current behavior is
that setting that parameter while loading the ib_srpt kernel module
triggers the following kernel crash:

BUG: kernel NULL pointer dereference, address: 0000000000000000
Call Trace:
 <TASK>
 parse_one+0x18c/0x1d0
 parse_args+0xe1/0x230
 load_module+0x8de/0xa60
 init_module_from_file+0x8b/0xd0
 idempotent_init_module+0x181/0x240
 __x64_sys_finit_module+0x5a/0xb0
 do_syscall_64+0x5f/0xe0
 entry_SYSCALL_64_after_hwframe+0x6e/0x76

Cc: LiHonggang <honggangli@163.com>
Reported-by: LiHonggang <honggangli@163.com>
Fixes: a42d985bd5 ("ib_srpt: Initial SRP Target merge for v3.3-rc1")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20240205004207.17031-1-bvanassche@acm.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-02-05 09:57:53 +02:00
Guoqing Jiang
aafe4cc509 RDMA/rxe: Remove unused 'iova' parameter from rxe_mr_init_user
This one is not needed since commit 954afc5a8f ("RDMA/rxe:
Use members of generic struct in rxe_mr").

Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Link: https://lore.kernel.org/r/20240202124144.16033-1-guoqing.jiang@linux.dev
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-02-04 13:05:43 +02:00
William Kucharski
c21a8870c9 RDMA/srpt: Do not register event handler until srpt device is fully setup
Upon rare occasions, KASAN reports a use-after-free Write
in srpt_refresh_port().

This seems to be because an event handler is registered before the
srpt device is fully setup and a race condition upon error may leave a
partially setup event handler in place.

Instead, only register the event handler after srpt device initialization
is complete.

Fixes: a42d985bd5 ("ib_srpt: Initial SRP Target merge for v3.3-rc1")
Signed-off-by: William Kucharski <william.kucharski@oracle.com>
Link: https://lore.kernel.org/r/20240202091549.991784-2-william.kucharski@oracle.com
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-02-04 11:44:50 +02:00
Daniel Vacek
e6f57c6881 IB/hfi1: Fix sdma.h tx->num_descs off-by-one error
Unfortunately the commit `fd8958efe877` introduced another error
causing the `descs` array to overflow. This reults in further crashes
easily reproducible by `sendmsg` system call.

[ 1080.836473] general protection fault, probably for non-canonical address 0x400300015528b00a: 0000 [#1] PREEMPT SMP PTI
[ 1080.869326] RIP: 0010:hfi1_ipoib_build_ib_tx_headers.constprop.0+0xe1/0x2b0 [hfi1]
--
[ 1080.974535] Call Trace:
[ 1080.976990]  <TASK>
[ 1081.021929]  hfi1_ipoib_send_dma_common+0x7a/0x2e0 [hfi1]
[ 1081.027364]  hfi1_ipoib_send_dma_list+0x62/0x270 [hfi1]
[ 1081.032633]  hfi1_ipoib_send+0x112/0x300 [hfi1]
[ 1081.042001]  ipoib_start_xmit+0x2a9/0x2d0 [ib_ipoib]
[ 1081.046978]  dev_hard_start_xmit+0xc4/0x210
--
[ 1081.148347]  __sys_sendmsg+0x59/0xa0

crash> ipoib_txreq 0xffff9cfeba229f00
struct ipoib_txreq {
  txreq = {
    list = {
      next = 0xffff9cfeba229f00,
      prev = 0xffff9cfeba229f00
    },
    descp = 0xffff9cfeba229f40,
    coalesce_buf = 0x0,
    wait = 0xffff9cfea4e69a48,
    complete = 0xffffffffc0fe0760 <hfi1_ipoib_sdma_complete>,
    packet_len = 0x46d,
    tlen = 0x0,
    num_desc = 0x0,
    desc_limit = 0x6,
    next_descq_idx = 0x45c,
    coalesce_idx = 0x0,
    flags = 0x0,
    descs = {{
        qw = {0x8024000120dffb00, 0x4}  # SDMA_DESC0_FIRST_DESC_FLAG (bit 63)
      }, {
        qw = {  0x3800014231b108, 0x4}
      }, {
        qw = { 0x310000e4ee0fcf0, 0x8}
      }, {
        qw = {  0x3000012e9f8000, 0x8}
      }, {
        qw = {  0x59000dfb9d0000, 0x8}
      }, {
        qw = {  0x78000e02e40000, 0x8}
      }}
  },
  sdma_hdr =  0x400300015528b000,  <<< invalid pointer in the tx request structure
  sdma_status = 0x0,                   SDMA_DESC0_LAST_DESC_FLAG (bit 62)
  complete = 0x0,
  priv = 0x0,
  txq = 0xffff9cfea4e69880,
  skb = 0xffff9d099809f400
}

If an SDMA send consists of exactly 6 descriptors and requires dword
padding (in the 7th descriptor), the sdma_txreq descriptor array is not
properly expanded and the packet will overflow into the container
structure. This results in a panic when the send completion runs. The
exact panic varies depending on what elements of the container structure
get corrupted. The fix is to use the correct expression in
_pad_sdma_tx_descs() to test the need to expand the descriptor array.

With this patch the crashes are no longer reproducible and the machine is
stable.

Fixes: fd8958efe8 ("IB/hfi1: Fix sdma.h tx->num_descs off-by-one errors")
Cc: stable@vger.kernel.org
Reported-by: Mats Kronberg <kronberg@nsc.liu.se>
Tested-by: Mats Kronberg <kronberg@nsc.liu.se>
Signed-off-by: Daniel Vacek <neelx@redhat.com>
Link: https://lore.kernel.org/r/20240201081009.1109442-1-neelx@redhat.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-02-04 11:40:06 +02:00
Mustafa Ismail
926e8ea4b8 RDMA/irdma: Remove duplicate assignment
Remove the unneeded assignment of the qp_num which is already
set in irdma_create_qp().

Fixes: b48c24c2d7 ("RDMA/irdma: Implement device supported verb APIs")
Signed-off-by: Mustafa Ismail <mustafa.ismail@intel.com>
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Sindhu Devale <sindhu.devale@intel.com>
Link: https://lore.kernel.org/r/20240131233953.400483-1-sindhu.devale@intel.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-02-04 11:37:58 +02:00
Mustafa Ismail
630bdb6f28 RDMA/irdma: Add AE for too many RNRS
Add IRDMA_AE_LLP_TOO_MANY_RNRS to the list of AE's processed as an
abnormal asyncronous event.

Fixes: b48c24c2d7 ("RDMA/irdma: Implement device supported verb APIs")
Signed-off-by: Mustafa Ismail <mustafa.ismail@intel.com>
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Sindhu Devale <sindhu.devale@gmail.com>
Link: https://lore.kernel.org/r/20240131233849.400285-5-sindhu.devale@intel.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-02-04 11:36:26 +02:00