Jakub Kicinski says:
====================
selftests: net: page_poll allocation error injection
Add a test for exercising driver memory allocation failure paths.
page pool is a bit tricky to inject errors into at the page allocator
level because of the bulk alloc and recycling, so add explicit error
injection support "in front" of the caches.
Add a test to exercise that using only the standard APIs.
This is the first useful test for the new tests with an endpoint.
There's no point testing netdevsim here, so this is also the first
HW-only test in Python.
I'm not super happy with the traffic generation using iperf3,
my initial approach was to use mausezahn. But it turned out to be
5x slower in terms of PPS. Hopefully this is good enough for now.
v1: https://lore.kernel.org/all/20240426232400.624864-1-kuba@kernel.org/
====================
Link: https://lore.kernel.org/r/20240429144426.743476-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Bugs in memory allocation failure paths are quite common.
Add a test exercising those paths based on qstat and page pool
failure hook.
Running on bnxt:
# ./drivers/net/hw/pp_alloc_fail.py
KTAP version 1
1..1
# ethtool -G change retval: success
ok 1 pp_alloc_fail.test_pp_alloc
# Totals: pass:1 fail:0 xfail:0 xpass:0 skip:0 error:0
I initially wrote this test to validate commit be43b7489a ("net/mlx5e:
RX, Fix page_pool allocation failure recovery for striding rq") but mlx5
still doesn't have qstat. So I run it on bnxt, and while bnxt survives
I found the problem fixed in commit 7301177307 ("eth: bnxt: fix counting
packets discarded due to OOM and netpoll").
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/20240429144426.743476-7-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
While we are not very interested in testing performance
it's useful to be able to generate a lot of traffic.
iperf is the simplest way of getting relatively high PPS.
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/20240429144426.743476-6-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When picking TCP ports to use, avoid all below 10k.
This should lower the chance of collision or running
afoul whatever random policies may be on the host.
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/20240429144426.743476-5-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The main use of the ip() wrapper over cmd() is that it can parse JSON.
cmd("ip -j link show") will return stdout as a string, and test has
to call json.loads(). With ip("link show", json=True) the return value
will be already parsed.
More tools (ethtool, bpftool etc.) support the --json switch.
To avoid having to wrap all of them individually create a tool()
helper.
Switch from -j to --json (for ethtool).
While at it consume the netns attribute at the ip() level.
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/20240429144426.743476-4-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We created a separate directory for HW-only tests, recently.
Glue in the Python test library there, Python is a bit annoying
when it comes to using library code located "lower"
in the directory structure.
Reuse the Env class, but let tests require non-nsim setup.
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/20240429144426.743476-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Because of caching / recycling using the general page allocation
failures to induce errors in page pool allocation is very hard.
Add direct error injection support to page_pool_alloc_pages().
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/20240429144426.743476-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub reports that some tests fail on netdev CI when executed in a debug
kernel.
Increase test timeout to 30m, this should hopefully be enough.
Also reduce test duration where possible for "slow" machines.
Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://lore.kernel.org/r/20240429105736.22677-1-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
sfp_select_interface() does not modify its link_modes argument, so
make this a const pointer.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Daniel Machon <daniel.machon@microchip.com>
Link: https://lore.kernel.org/r/E1s15s0-00AHyq-8E@rmk-PC.armlinux.org.uk
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Allow use of 2500base-X interface mode for PHY modules that support
2500base-T.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Daniel Machon <daniel.machon@microchip.com>
Link: https://lore.kernel.org/r/E1s15rv-00AHyk-5S@rmk-PC.armlinux.org.uk
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Add a debugging print in phylink_validate_phy() when we detect that the
PHY has not supplied a possible_interfaces bitmap.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Daniel Machon <daniel.machon@microchip.com>
Link: https://lore.kernel.org/r/E1s15rq-00AHye-22@rmk-PC.armlinux.org.uk
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Convert realtek to provide its own phylink MAC operations, thus
avoiding the shim layer in DSA's port.c. We need to provide a stub for
the mandatory mac_config() method for rtl8366rb.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Linus Walleij <linus.walleij@linaro.org>
Link: https://lore.kernel.org/r/E1s11qJ-00AHi0-Kk@rmk-PC.armlinux.org.uk
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
DSA initalises the ds->num_ports amount of ports in
dsa_switch_touch_ports(). When the PHY muxing feature is in use, port 5
won't be defined in the device tree. Because of this, the type member of
the dsa_port structure for this port will be assigned DSA_PORT_TYPE_UNUSED.
The dsa_port_setup() function calls ds->ops->port_disable() when the port
type is DSA_PORT_TYPE_UNUSED.
The MT7530_P5_DIS bit is unset in mt7530_setup() when PHY muxing is being
used. mt7530_port_disable() which is assigned to ds->ops->port_disable() is
called afterwards. Currently, mt7530_port_disable() sets MT7530_P5_DIS
which breaks network connectivity when PHY muxing is being used.
Therefore, do not set MT7530_P5_DIS when PHY muxing is being used.
Fixes: 377174c576 ("net: dsa: mt7530: move MT753X_MTRAP operations for MT7530")
Reported-by: Daniel Golle <daniel@makrotopia.org>
Signed-off-by: Arınç ÜNAL <arinc.unal@arinc9.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/20240428-for-netnext-mt7530-do-not-disable-port5-when-phy-muxing-v2-1-bb7c37d293f8@arinc9.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Wen Gu says:
====================
net/smc: SMC intra-OS shortcut with loopback-ism
This patch set acts as the second part of the new version of [1] (The first
part can be referred from [2]), the updated things of this version are listed
at the end.
- Background
SMC-D is now used in IBM z with ISM function to optimize network interconnect
for intra-CPC communications. Inspired by this, we try to make SMC-D available
on the non-s390 architecture through a software-implemented Emulated-ISM device,
that is the loopback-ism device here, to accelerate inter-process or
inter-containers communication within the same OS instance.
- Design
This patch set includes 3 parts:
- Patch #1: some prepare work for loopback-ism.
- Patch #2-#7: implement loopback-ism device and adapt SMC-D for it.
loopback-ism now serves only SMC and no userspace interfaces exposed.
- Patch #8-#11: memory copy optimization for intra-OS scenario.
The loopback-ism device is designed as an ISMv2 device and not be limited to
a specific net namespace, ends of both inter-process connection (1/1' in diagram
below) or inter-container connection (2/2' in diagram below) can find the same
available loopback-ism and choose it during the CLC handshake.
Container 1 (ns1) Container 2 (ns2)
+-----------------------------------------+ +-------------------------+
| +-------+ +-------+ +-------+ | | +-------+ |
| | App A | | App B | | App C | | | | App D |<-+ |
| +-------+ +---^---+ +-------+ | | +-------+ |(2') |
| |127.0.0.1 (1')| |192.168.0.11 192.168.0.12| |
| (1)| +--------+ | +--------+ |(2) | | +--------+ +--------+ |
| `-->| lo |-` | eth0 |<-` | | | lo | | eth0 | |
+---------+--|---^-+---+-----|--+---------+ +-+--------+---+-^------+-+
| | | |
Kernel | | | |
+----+-------v---+-----------v----------------------------------+---+----+
| | TCP | |
| | | |
| +--------------------------------------------------------------+ |
| |
| +--------------+ |
| | smc loopback | |
+---------------------------+--------------+-----------------------------+
loopback-ism device creates DMBs (shared memory) for each connection peer.
Since data transfer occurs within the same kernel, the sndbuf of each peer
is only a descriptor and point to the same memory region as peer DMB, so that
the data copy from sndbuf to peer DMB can be avoided in loopback-ism case.
Container 1 (ns1) Container 2 (ns2)
+-----------------------------------------+ +-------------------------+
| +-------+ | | +-------+ |
| | App C |-----+ | | | App D | |
| +-------+ | | | +-^-----+ |
| | | | | |
| (2) | | | (2') | |
| | | | | |
+---------------|-------------------------+ +----------|--------------+
| |
Kernel | |
+---------------|-----------------------------------------|--------------+
| +--------+ +--v-----+ +--------+ +--------+ |
| |dmb_desc| |snd_desc| |dmb_desc| |snd_desc| |
| +-----|--+ +--|-----+ +-----|--+ +--------+ |
| +-----|--+ | +-----|--+ |
| | DMB C | +---------------------------------| DMB D | |
| +--------+ +--------+ |
| |
| +--------------+ |
| | smc loopback | |
+---------------------------+--------------+-----------------------------+
- Benchmark Test
* Test environments:
- VM with Intel Xeon Platinum 8 core 2.50GHz, 16 GiB mem.
- SMC sndbuf/DMB size 1MB.
* Test object:
- TCP: run on TCP loopback.
- SMC lo: run on SMC loopback-ism.
1. ipc-benchmark (see [3])
- ./<foo> -c 1000000 -s 100
TCP SMC-lo
Message
rate (msg/s) 84991 151293(+78.01%)
2. sockperf
- serv: <smc_run> sockperf sr --tcp
- clnt: <smc_run> sockperf { tp | pp } --tcp --msg-size={ 64000 for tp | 14 for pp } -i 127.0.0.1 -t 30
TCP SMC-lo
Bandwidth(MBps) 5033.569 7987.732(+58.69%)
Latency(us) 5.986 3.398(-43.23%)
3. nginx/wrk
- serv: <smc_run> nginx
- clnt: <smc_run> wrk -t 8 -c 1000 -d 30 http://127.0.0.1:80
TCP SMC-lo
Requests/s 187951.76 267107.90(+42.12%)
4. redis-benchmark
- serv: <smc_run> redis-server
- clnt: <smc_run> redis-benchmark -h 127.0.0.1 -q -t set,get -n 400000 -c 200 -d 1024
TCP SMC-lo
GET(Requests/s) 86132.64 118133.49(+37.15%)
SET(Requests/s) 87374.40 122887.86(+40.65%)
Change log:
v7->v6
- Patch #2: minor: remove unnecessary 'return' of inline smc_loopback_exit().
- Patch #10: minor: directly return 0 instead of 'rc' in smcd_cdc_msg_send().
- all: collect the Reviewed-by tags.
v6->RFC v5
Link: https://lore.kernel.org/netdev/20240414040304.54255-1-guwen@linux.alibaba.com/
- Patch #2: make the use of CONFIG_SMC_LO cleaner.
- Patch #5: mark some smcd_ops that loopback-ism doesn't support as
optional and check for the support when they are called.
- Patch #7: keep loopback-ism at the beginning of the SMC-D device list.
- Some expression changes in commit logs and comments.
RFC v5->RFC v4:
Link: https://lore.kernel.org/netdev/20240324135522.108564-1-guwen@linux.alibaba.com/
- Patch #2: minor changes in description of config SMC_LO and comments.
- Patch #10: minor changes in comments and if(smc_ism_support_dmb_nocopy())
check in smcd_cdc_msg_send().
- Patch #3: change smc_lo_generate_id() to smc_lo_generate_ids() and SMC_LO_CHID
to SMC_LO_RESERVED_CHID.
- Patch #5: memcpy while holding the ldev->dmb_ht_lock.
- Some expression changes in commit logs.
RFC v4->v3:
Link: https://lore.kernel.org/netdev/20240317100545.96663-1-guwen@linux.alibaba.com/
- The merge window of v6.9 is open, so post this series as an RFC.
- Patch #6: since some information fed back by smc_nl_handle_smcd_dev() dose
not apply to Emulated-ISM (including loopback-ism here), loopback-ism is
not exposed through smc netlink for the time being. we may refactor this
part when smc netlink interface is updated.
v3->v2:
Link: https://lore.kernel.org/netdev/20240312142743.41406-1-guwen@linux.alibaba.com/
- Patch #11: use tasklet_schedule(&conn->rx_tsklet) instead of smcd_cdc_rx_handler()
to avoid possible recursive locking of conn->send_lock and use {read|write}_lock_bh()
to acquire dmb_ht_lock.
v2->v1:
Link: https://lore.kernel.org/netdev/20240307095536.29648-1-guwen@linux.alibaba.com/
- All the patches: changed the term virtual-ISM to Emulated-ISM as defined by SMCv2.1.
- Patch #3: optimized the description of SMC_LO config. Avoid exposing loopback-ism
to sysfs and remove all the knobs until future definition clear.
- Patch #3: try to make lockdep happy by using read_lock_bh() in smc_lo_move_data().
- Patch #6: defaultly use physical contiguous DMB buffers.
- Patch #11: defaultly enable DMB no-copy for loopback-ism and free the DMB in
unregister_dmb or detach_dmb when dmb_node->refcnt reaches 0, instead of using
wait_event to keep waiting in unregister_dmb.
v1->RFC:
Link: https://lore.kernel.org/netdev/20240111120036.109903-1-guwen@linux.alibaba.com/
- Patch #9: merge rx_bytes and tx_bytes as xfer_bytes statistics:
/sys/devices/virtual/smc/loopback-ism/xfer_bytes
- Patch #10: add support_dmb_nocopy operation to check if SMC-D device supports
merging sndbuf with peer DMB.
- Patch #13 & #14: introduce loopback-ism device control of DMB memory type and
control of whether to merge sndbuf and DMB. They can be respectively set by:
/sys/devices/virtual/smc/loopback-ism/dmb_type
/sys/devices/virtual/smc/loopback-ism/dmb_copy
The motivation for these two control is that a performance bottleneck was
found when using vzalloced DMB and sndbuf is merged with DMB, and there are
many CPUs and CONFIG_HARDENED_USERCOPY is set [4]. The bottleneck is caused
by the lock contention in vmap_area_lock [5] which is involved in memcpy_from_msg()
or memcpy_to_msg(). Currently, Uladzislau Rezki is working on mitigating the
vmap lock contention [6]. It has significant effects, but using virtual memory
still has additional overhead compared to using physical memory.
So this new version provides controls of dmb_type and dmb_copy to suit
different scenarios.
- Some minor changes and comments improvements.
RFC->old version([1]):
Link: https://lore.kernel.org/netdev/1702214654-32069-1-git-send-email-guwen@linux.alibaba.com/
- Patch #1: improve the loopback-ism dump, it shows as follows now:
# smcd d
FID Type PCI-ID PCHID InUse #LGs PNET-ID
0000 0 loopback-ism ffff No 0
- Patch #3: introduce the smc_ism_set_v2_capable() helper and set
smc_ism_v2_capable when ISMv2 or virtual ISM is registered,
regardless of whether there is already a device in smcd device list.
- Patch #3: loopback-ism will be added into /sys/devices/virtual/smc/loopback-ism/.
- Patch #8: introduce the runtime switch /sys/devices/virtual/smc/loopback-ism/active
to activate or deactivate the loopback-ism.
- Patch #9: introduce the statistics of loopback-ism by
/sys/devices/virtual/smc/loopback-ism/{{tx|rx}_tytes|dmbs_cnt}.
- Some minor changes and comments improvements.
[1] https://lore.kernel.org/netdev/1695568613-125057-1-git-send-email-guwen@linux.alibaba.com/
[2] https://lore.kernel.org/netdev/20231219142616.80697-1-guwen@linux.alibaba.com/
[3] https://github.com/goldsborough/ipc-bench
[4] https://lore.kernel.org/all/3189e342-c38f-6076-b730-19a6efd732a5@linux.alibaba.com/
[5] https://lore.kernel.org/all/238e63cd-e0e8-4fbf-852f-bc4d5bc35d5a@linux.alibaba.com/
[6] https://lore.kernel.org/all/20240102184633.748113-1-urezki@gmail.com/
====================
Link: https://lore.kernel.org/r/20240428060738.60843-1-guwen@linux.alibaba.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
This implements operations related to merging sndbuf with peer DMB in
loopback-ism. The DMB won't be freed until no sndbuf is attached to it.
Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com>
Reviewed-and-tested-by: Jan Karcher <jaka@linux.ibm.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
If the local sndbuf shares the same physical memory with peer DMB,
the cursor update processing needs to be adapted to ensure that the
data to be consumed won't be overwritten.
So in this case, the fin_curs and sndbuf_space that were originally
updated after sending the CDC message should be modified to not be
update until the peer updates cons_curs.
Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com>
Reviewed-and-tested-by: Jan Karcher <jaka@linux.ibm.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
If the device used by SMC-D supports merging local sndbuf to peer DMB,
then create sndbuf descriptor and attach it to peer DMB once peer
token is obtained, and detach and free the sndbuf descriptor when the
connection is freed.
Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com>
Reviewed-and-tested-by: Jan Karcher <jaka@linux.ibm.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
In some scenarios using Emulated-ISM device, sndbuf can share the same
physical memory region with peer DMB to avoid data copy from one side
to the other. In such case the sndbuf is only a descriptor that
describes the shared memory and does not actually occupy memory, it's
more like a ghost buffer.
+----------+ +----------+
| socket A | | socket B |
+----------+ +----------+
| |
+--------+ +--------+
| sndbuf | | DMB |
| desc | | desc |
+--------+ +--------+
| |
| +----v-----+
+--------------------------> memory |
+----------+
So here introduces three new SMC-D device operations to check if this
feature is supported by device, and to {attach|detach} ghost sndbuf to
peer DMB. For now only loopback-ism supports this.
Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com>
Reviewed-and-tested-by: Jan Karcher <jaka@linux.ibm.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
After the loopback-ism device is ready, add it to the SMC-D device list
as an ISMv2 device, and always keep it at the beginning to ensure it is
preferred for providing a shortcut for data transfer within the same
kernel.
Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com>
Reviewed-and-tested-by: Jan Karcher <jaka@linux.ibm.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Since loopback-ism is not a PCI device, the PCI information fed back by
smc_nl_handle_smcd_dev() does not apply to loopback-ism. So currently
ignore loopback-ism when dumping SMC-D devices. The netlink function of
loopback-ism will be refactored when SMC netlink interface is updated.
Link: https://lore.kernel.org/r/caab067b-f5c3-490f-9259-262624c236b4@linux.ibm.com/
Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com>
Reviewed-and-tested-by: Jan Karcher <jaka@linux.ibm.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Some operations are not supported by new introduced Emulated-ISM, so
mark them as optional and check if the device supports them when called.
Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com>
Reviewed-and-tested-by: Jan Karcher <jaka@linux.ibm.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
This implements DMB (un)registration and data move operations of
loopback-ism device.
Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com>
Reviewed-and-tested-by: Jan Karcher <jaka@linux.ibm.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
This implements operations related to IDs for the loopback-ism device.
loopback-ism uses an Extended GID that is a 128-bit GID instead of the
existing ISM 64-bit GID, and uses the CHID defined with the reserved
value 0xFFFF.
Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com>
Reviewed-and-tested-by: Jan Karcher <jaka@linux.ibm.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
This introduces a kind of Emulated-ISM device named loopback-ism for
SMCv2.1. The loopback-ism device is currently exclusive for SMC usage,
and aims to provide an SMC shortcut for sockets within the same kernel,
leading to improved intra-OS traffic performance. Configuration of this
feature is managed through the config SMC_LO.
Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
Reviewed-by: Gerd Bayer <gbayer@linux.ibm.com>
Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com>
Reviewed-and-tested-by: Jan Karcher <jaka@linux.ibm.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The struct 'ism_client' is specialized for s390 platform firmware ISM.
So replace it with 'void' to make SMCD DMB registration helper generic
for both Emulated-ISM and existing ISM.
Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com>
Reviewed-and-tested-by: Jan Karcher <jaka@linux.ibm.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
This is an effort to get rid of all multiplications from allocation
functions in order to prevent integer overflows [1][2].
As the "ids" variable is a pointer to "struct sctp_assoc_ids" and this
structure ends in a flexible array:
struct sctp_assoc_ids {
[...]
sctp_assoc_t gaids_assoc_id[];
};
the preferred way in the kernel is to use the struct_size() helper to
do the arithmetic instead of the calculation "size + size * count" in
the kmalloc() function.
Also, refactor the code adding the "ids_size" variable to avoid sizing
twice.
This way, the code is more readable and safer.
This code was detected with the help of Coccinelle, and audited and
modified manually.
Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#open-coded-arithmetic-in-allocator-arguments [1]
Link: https://github.com/KSPP/linux/issues/160 [2]
Signed-off-by: Erick Archer <erick.archer@outlook.com>
Acked-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/PAXPR02MB724871DB78375AB06B5171C88B152@PAXPR02MB7248.eurprd02.prod.outlook.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
To enhance functionality, we now support reporting statistics through
the netdev-generic netlink (netdev-genl) queue stats interface. However,
this does not extend to all statistics, so a new field, qstat_offset,
has been introduced. This field determines which statistics should be
reported via netdev-genl queue stats.
Given that queue stats are retrieved individually per queue, it's
necessary for the virtnet_get_hw_stats() function to be capable of
fetching statistics for a specific queue.
As the document https://docs.kernel.org/next/networking/statistics.html#notes-for-driver-authors
We should not duplicate the stats which get reported via the netlink API in
ethtool. If the stats are for queue stat, that will not be reported by
ethtool -S.
python3 ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml
--dump qstats-get --json '{"scope": "queue"}'
[{'ifindex': 2,
'queue-id': 0,
'queue-type': 'rx',
'rx-bytes': 157844011,
'rx-csum-bad': 0,
'rx-csum-none': 0,
'rx-csum-unnecessary': 2195386,
'rx-hw-drop-overruns': 0,
'rx-hw-drop-ratelimits': 0,
'rx-hw-drops': 12964,
'rx-packets': 598929},
{'ifindex': 2,
'queue-id': 0,
'queue-type': 'tx',
'tx-bytes': 1938511,
'tx-csum-none': 0,
'tx-hw-drop-errors': 0,
'tx-hw-drop-ratelimits': 0,
'tx-hw-drops': 0,
'tx-needs-csum': 61263,
'tx-packets': 15515}]
Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
In the last commit, we introduced some helpers for device stats.
And the drivers stats are realized by the open code.
This commit make the helpers to support driver stats.
Then we can have the unify helper for device and driver stats.
Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The key size of ethtool -S is controlled by this macro.
ETH_GSTRING_LEN 32
That includes the \0 at the end. So the max length of the key name must
is 31. But the length of the prefix "rx_queue_0_" is 11. If the queue
num is larger than 10, the length of the prefix is 12. So the
key name max is 19. That is too short. We will introduce some keys
such as "gso_packets_coalesced". So we should change the prefix
to "rx0_".
Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The virtio-net device stats spec:
42f3899898
We introduce the relative feature and structures.
Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
As the spec 42f3899898
Based on the description provided in the above specification, we have
enabled the virtio-net driver to support acquiring some response
information from the device via the CVQ (Control Virtqueue).
Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Use phylink_pcs_change() when reporting changes in PCS link state to
phylink as the interrupts are informing us about changes to the PCS
state.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Acked-by: Jiawen Wu <jiawenwu@trustnetic.com>
Link: https://lore.kernel.org/r/E1s0OH2-009hgx-Qw@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Use phylink_pcs_change() when reporting changes in PCS link state to
phylink as the interrupts are informing us about changes to the PCS
state.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/E1s0OGx-009hgr-NP@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Use phylink_pcs_change() when reporting changes in PCS link state to
phylink as the interrupts are informing us about changes to the PCS
state.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/E1s0OGs-009hgl-Jg@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Use phylink_pcs_change() when reporting changes in PCS link state to
phylink as the interrupts are informing us about changes to the PCS
state.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/E1s0OGn-009hgf-G6@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We must initialize prune_proxy_timer before we attempt
a del_timer_sync() on it.
syzbot reported the following splat:
INFO: trying to register non-static key.
The code is fine but needs lockdep annotation, or maybe
you didn't initialize this object before use?
turning off the locking correctness validator.
CPU: 1 PID: 11 Comm: kworker/u8:1 Not tainted 6.9.0-rc5-syzkaller-01199-gfc48de77d69d #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/27/2024
Workqueue: netns cleanup_net
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0x241/0x360 lib/dump_stack.c:114
assign_lock_key+0x238/0x270 kernel/locking/lockdep.c:976
register_lock_class+0x1cf/0x980 kernel/locking/lockdep.c:1289
__lock_acquire+0xda/0x1fd0 kernel/locking/lockdep.c:5014
lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5754
__timer_delete_sync+0x148/0x310 kernel/time/timer.c:1648
del_timer_sync include/linux/timer.h:185 [inline]
hsr_dellink+0x33/0x80 net/hsr/hsr_netlink.c:132
default_device_exit_batch+0x956/0xa90 net/core/dev.c:11737
ops_exit_list net/core/net_namespace.c:175 [inline]
cleanup_net+0x89d/0xcc0 net/core/net_namespace.c:637
process_one_work kernel/workqueue.c:3254 [inline]
process_scheduled_works+0xa10/0x17c0 kernel/workqueue.c:3335
worker_thread+0x86d/0xd70 kernel/workqueue.c:3416
kthread+0x2f0/0x390 kernel/kthread.c:388
ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244
</TASK>
ODEBUG: assert_init not available (active state 0) object: ffff88806d3fcd88 object type: timer_list hint: 0x0
WARNING: CPU: 1 PID: 11 at lib/debugobjects.c:517 debug_print_object+0x17a/0x1f0 lib/debugobjects.c:514
Fixes: 5055cccfc2 ("net: hsr: Provide RedBox support (HSR-SAN)")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Lukasz Majewski <lukma@denx.de>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240426163355.2613767-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Russell King says:
====================
net: dsa: microchip: use phylink_mac_ops for ksz driver
This four patch series switches the Microchip KSZ DSA driver to use
phylink_mac_ops support, and for this one we go a little further
beyond a simple conversion. This driver has four distinct cases:
lan937x
ksz9477
ksz8
ksz8830
Three of these cases are handled by shimming the existing DSA calls
through ksz_dev_ops, and the final case is handled through a
conditional in ksz_phylink_mac_config(). These can all be handled
with separate phylink_mac_ops.
To get there, we do a progressive conversion.
Patch 1 removes ksz_dev_ops' phylink_mac_config() method which is
not populated in any of the arrays - and is thus redundant.
Patch 2 switches the driver to use a common set of phylink_mac_ops
for all cases, doing the simple conversion to avoid the DSA shim.
Patch 3 pushes the phylink_mac_ops down to the first three classes
(lan937x, ksz9477, ksz8) adding an appropriate pointer to the
phylink_mac_ops to struct ksz_chip_data, and using that to
populate DSA's ds->phylink_mac_ops pointer. The difference between
each of these are the mac_link_up() method. mac_config() and
mac_link_down() remain common between each at this stage.
Patch 4 splits out ksz8830, which needs different mac_config()
handling, and thus means we have a difference in mac_config()
methods between the now four phylink_mac_ops structures.
Build tested only, with additional -Wunused-const-variable flag.
====================
Link: https://lore.kernel.org/r/ZivP/R1IwKEPb5T6@shell.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Convert ksz_common to provide its own phylink MAC operations, thus
avoiding the shim layer in DSA's port.c
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Link: https://lore.kernel.org/r/E1s0O7H-009gpq-IF@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The phylink_mac_config function pointer member of struct ksz_dev_ops is
never initialised, so let's remove it to simplify the code.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Link: https://lore.kernel.org/r/E1s0O7C-009gpk-Dh@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZi9+AAAKCRDbK58LschI
g0nEAP487m7L0nLVriC2oIOWsi29tklW3etm6DO7gmGRGIHgrgEAnMyV1xBj3bGj
v6jJwDcybCym1hLx+1x1JCZ4eoAFswE=
=xbna
-----END PGP SIGNATURE-----
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:
====================
pull-request: bpf-next 2024-04-29
We've added 147 non-merge commits during the last 32 day(s) which contain
a total of 158 files changed, 9400 insertions(+), 2213 deletions(-).
The main changes are:
1) Add an internal-only BPF per-CPU instruction for resolving per-CPU
memory addresses and implement support in x86 BPF JIT. This allows
inlining per-CPU array and hashmap lookups
and the bpf_get_smp_processor_id() helper, from Andrii Nakryiko.
2) Add BPF link support for sk_msg and sk_skb programs, from Yonghong Song.
3) Optimize x86 BPF JIT's emit_mov_imm64, and add support for various
atomics in bpf_arena which can be JITed as a single x86 instruction,
from Alexei Starovoitov.
4) Add support for passing mark with bpf_fib_lookup helper,
from Anton Protopopov.
5) Add a new bpf_wq API for deferring events and refactor sleepable
bpf_timer code to keep common code where possible,
from Benjamin Tissoires.
6) Fix BPF_PROG_TEST_RUN infra with regards to bpf_dummy_struct_ops programs
to check when NULL is passed for non-NULLable parameters,
from Eduard Zingerman.
7) Harden the BPF verifier's and/or/xor value tracking,
from Harishankar Vishwanathan.
8) Introduce crypto kfuncs to make BPF programs able to utilize the kernel
crypto subsystem, from Vadim Fedorenko.
9) Various improvements to the BPF instruction set standardization doc,
from Dave Thaler.
10) Extend libbpf APIs to partially consume items from the BPF ringbuffer,
from Andrea Righi.
11) Bigger batch of BPF selftests refactoring to use common network helpers
and to drop duplicate code, from Geliang Tang.
12) Support bpf_tail_call_static() helper for BPF programs with GCC 13,
from Jose E. Marchesi.
13) Add bpf_preempt_{disable,enable}() kfuncs in order to allow a BPF
program to have code sections where preemption is disabled,
from Kumar Kartikeya Dwivedi.
14) Allow invoking BPF kfuncs from BPF_PROG_TYPE_SYSCALL programs,
from David Vernet.
15) Extend the BPF verifier to allow different input maps for a given
bpf_for_each_map_elem() helper call in a BPF program, from Philo Lu.
16) Add support for PROBE_MEM32 and bpf_addr_space_cast instructions
for riscv64 and arm64 JITs to enable BPF Arena, from Puranjay Mohan.
17) Shut up a false-positive KMSAN splat in interpreter mode by unpoison
the stack memory, from Martin KaFai Lau.
18) Improve xsk selftest coverage with new tests on maximum and minimum
hardware ring size configurations, from Tushar Vyavahare.
19) Various ReST man pages fixes as well as documentation and bash completion
improvements for bpftool, from Rameez Rehman & Quentin Monnet.
20) Fix libbpf with regards to dumping subsequent char arrays,
from Quentin Deslandes.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (147 commits)
bpf, docs: Clarify PC use in instruction-set.rst
bpf_helpers.h: Define bpf_tail_call_static when building with GCC
bpf, docs: Add introduction for use in the ISA Internet Draft
selftests/bpf: extend BPF_SOCK_OPS_RTT_CB test for srtt and mrtt_us
bpf: add mrtt and srtt as BPF_SOCK_OPS_RTT_CB args
selftests/bpf: dummy_st_ops should reject 0 for non-nullable params
bpf: check bpf_dummy_struct_ops program params for test runs
selftests/bpf: do not pass NULL for non-nullable params in dummy_st_ops
selftests/bpf: adjust dummy_st_ops_success to detect additional error
bpf: mark bpf_dummy_struct_ops.test_1 parameter as nullable
selftests/bpf: Add ring_buffer__consume_n test.
bpf: Add bpf_guard_preempt() convenience macro
selftests: bpf: crypto: add benchmark for crypto functions
selftests: bpf: crypto skcipher algo selftests
bpf: crypto: add skcipher to bpf crypto
bpf: make common crypto API for TC/XDP programs
bpf: update the comment for BTF_FIELDS_MAX
selftests/bpf: Fix wq test.
selftests/bpf: Use make_sockaddr in test_sock_addr
selftests/bpf: Use connect_to_addr in test_sock_addr
...
====================
Link: https://lore.kernel.org/r/20240429131657.19423-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Extend the PTP programmable gpios to implement also PTP_PF_EXTTS
function. The pins can be configured to capture both of rising
and falling edge. Once the event is seen, then an interrupt is
generated and the LTC is saved in the registers.
On lan8814 only GPIO 3 can be configured for this.
This was tested using:
ts2phc -m -l 7 -s generic -f ts2phc.cfg
Where the configuration was the following:
---
[global]
ts2phc.pin_index 3
[eth0]
---
Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Luiz Angelo Daros de Luca says:
====================
net: dsa: realtek: fix LED support for rtl8366
This series fixes the LED support for rtl8366. The existing code was not
tested in a device with switch LEDs and it was using a flawed logic.
The driver now keeps the default LED configuration if nothing requests a
different behavior. This may be enough for most devices. This can be
achieved either by omitting the LED from the device-tree or configuring
all LEDs in a group with the default state set to "keep".
The hardware trigger for LEDs in Realtek switches is shared among all
LEDs in a group. This behavior doesn't align well with the Linux LED
API, which controls LEDs individually. Once the OS changes the
brightness of a LED in a group still triggered by the hardware, the
entire group switches to software-controlled LEDs, even for those not
metioned in the device-tree. This shared behavior also prevents
offloading the trigger to the hardware as it would require an
orchestration between LEDs in a group, not currently present in the LED
API.
The assertion of device hardware reset during driver removal was removed
because it was causing an issue with the LED release code. Devres
devices are released after the driver's removal is executed. Asserting
the reset at that point was causing timeout errors during LED release
when it attempted to turn off the LED.
To: Linus Walleij <linus.walleij@linaro.org>
To: Alvin Šipraga <alsi@bang-olufsen.dk>
To: Andrew Lunn <andrew@lunn.ch>
To: Florian Fainelli <f.fainelli@gmail.com>
To: Vladimir Oltean <olteanv@gmail.com>
To: David S. Miller <davem@davemloft.net>
To: Eric Dumazet <edumazet@google.com>
To: Jakub Kicinski <kuba@kernel.org>
To: Paolo Abeni <pabeni@redhat.com>
To: Rob Herring <robh+dt@kernel.org>
To: Krzysztof Kozlowski <krzysztof.kozlowski+dt@linaro.org>
To: Conor Dooley <conor+dt@kernel.org>
Cc: netdev@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: devicetree@vger.kernel.org
Signed-off-by: Luiz Angelo Daros de Luca <luizluca@gmail.com>
Changes in v2:
- Fixed commit message formatting
- Added GROUP to LED group enum values. With that, moved the code that
disables LED into a new function to keep 80-collumn limit.
- Dropped unused enable argument in rb8366rb_get_port_led()
- Fixed variable order in rtl8366rb_setup_led()
- Removed redundant led group test in rb8366rb_{g,s}et_port_led()
- Initialize ret as 0 in rtl8366rb_setup_leds()
- Updated comments related to LED blinking and setup
- Link to v1: https://lore.kernel.org/r/20240310-realtek-led-v1-0-4d9813ce938e@gmail.com
Changes in v1:
- Rebased on new relatek DSA drivers
- Improved commit messages
- Added commit to remove the reset assert during .remove
- Link to RFC: https://lore.kernel.org/r/20240106184651.3665-1-luizluca@gmail.com
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This commit introduces LED drivers for rtl8366rb, enabling LEDs to be
described in the device tree using the same format as qca8k. Each port
can configure up to 4 LEDs.
If all LEDs in a group use the default state "keep", they will use the
default behavior after a reset. Changing the brightness of one LED,
either manually or by a trigger, will disable the default hardware
trigger and switch the entire LED group to manually controlled LEDs.
Once in this mode, there is no way to revert to hardware-controlled LEDs
(except by resetting the switch).
Software triggers function as expected with manually controlled LEDs.
Signed-off-by: Luiz Angelo Daros de Luca <luizluca@gmail.com>
Reviewed-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The necessity of asserting the reset on removal was previously
questioned, as DSA's own cleanup methods should suffice to prevent
traffic leakage[1].
When a driver has subdrivers controlled by devres, they will be
unregistered after the main driver's .remove is executed. If it asserts
a reset, the subdrivers will be unable to communicate with the hardware
during their cleanup. For LEDs, this means that they will fail to turn
off, resulting in a timeout error.
[1] https://lore.kernel.org/r/20240123215606.26716-9-luizluca@gmail.com/
Signed-off-by: Luiz Angelo Daros de Luca <luizluca@gmail.com>
Reviewed-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>