linux-next

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git synced 2025-01-18 06:15:12 +00:00

Author	SHA1	Message	Date
Brian Haley	62e395f82d	neighbor: fix proxy_delay usage when it is zero When set to zero, the neighbor sysctl proxy_delay value does not cause an immediate reply for ARP/ND requests as expected, it instead causes a random delay between [0, U32_MAX). Looking at this comment from __get_random_u32_below() explains the reason: /* * This function is technically undefined for ceil == 0, and in fact * for the non-underscored constant version in the header, we build bug * on that. But for the non-constant case, it's convenient to have that * evaluate to being a straight call to get_random_u32(), so that * get_random_u32_inclusive() can work over its whole range without * undefined behavior. */ Added helper function that does not call get_random_u32_below() if proxy_delay is zero and just uses the current value of jiffies instead, causing pneigh_enqueue() to respond immediately. Also added definition of proxy_delay to ip-sysctl.txt since it was missing. Signed-off-by: Brian Haley <haleyb.dev@gmail.com> Link: https://lore.kernel.org/r/20230130171428.367111-1-haleyb.dev@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-02-01 21:02:54 -08:00
Jakub Kicinski	983f507c30	Merge branch 'net-support-ipv4-big-tcp' Xin Long says: ==================== net: support ipv4 big tcp This is similar to the BIG TCP patchset added by Eric for IPv6: https://lwn.net/Articles/895398/ Different from IPv6, IPv4 tot_len is 16-bit long only, and IPv4 header doesn't have exthdrs(options) for the BIG TCP packets' length. To make it simple, as David and Paolo suggested, we set IPv4 tot_len to 0 to indicate this might be a BIG TCP packet and use skb->len as the real IPv4 total length. This will work safely, as all BIG TCP packets are GSO/GRO packets and processed on the same host as they were created; There is no padding in GSO/GRO packets, and skb->len - network_offset is exactly the IPv4 packet total length; Also, before implementing the feature, all those places that may get iph tot_len from BIG TCP packets are taken care with some new APIs: Patch 1 adds some APIs for iph tot_len setting and getting, which are used in all these places where IPv4 BIG TCP packets may reach in Patch 2-7, Patch 8 adds a GSO_TCP tp_status for af_packet users, and Patch 9 add new netlink attributes to make IPv4 BIG TCP independent from IPv6 BIG TCP on configuration, and Patch 10 implements this feature. Note that the similar change as in Patch 2-6 are also needed for IPv6 BIG TCP packets, and will be addressed in another patchset. The similar performance test is done for IPv4 BIG TCP with 25Gbit NIC and 1.5K MTU: No BIG TCP: for i in {1..10}; do netperf -t TCP_RR -H 192.168.100.1 -- -r80000,80000 -O MIN_LATENCY,P90_LATENCY,P99_LATENCY,THROUGHPUT\|tail -1; done 168 322 337 3776.49 143 236 277 4654.67 128 258 288 4772.83 171 229 278 4645.77 175 228 243 4678.93 149 239 279 4599.86 164 234 268 4606.94 155 276 289 4235.82 180 255 268 4418.95 168 241 249 4417.82 Enable BIG TCP: ip link set dev ens1f0np0 gro_ipv4_max_size 128000 gso_ipv4_max_size 128000 for i in {1..10}; do netperf -t TCP_RR -H 192.168.100.1 -- -r80000,80000 -O MIN_LATENCY,P90_LATENCY,P99_LATENCY,THROUGHPUT\|tail -1; done 161 241 252 4821.73 174 205 217 5098.28 167 208 220 5001.43 164 228 249 4883.98 150 233 249 4914.90 180 233 244 4819.66 154 208 219 5004.92 157 209 247 4999.78 160 218 246 4842.31 174 206 217 5080.99 Thanks for the feedback from Eric and David Ahern. ==================== Link: https://lore.kernel.org/r/cover.1674921359.git.lucien.xin@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-02-01 20:54:30 -08:00
Xin Long	b1a78b9b98	net: add support for ipv4 big tcp Similar to Eric's IPv6 BIG TCP, this patch is to enable IPv4 BIG TCP. Firstly, allow sk->sk_gso_max_size to be set to a value greater than GSO_LEGACY_MAX_SIZE by not trimming gso_max_size in sk_trim_gso_size() for IPv4 TCP sockets. Then on TX path, set IP header tot_len to 0 when skb->len > IP_MAX_MTU in __ip_local_out() to allow to send BIG TCP packets, and this implies that skb->len is the length of a IPv4 packet; On RX path, use skb->len as the length of the IPv4 packet when the IP header tot_len is 0 and skb->len > IP_MAX_MTU in ip_rcv_core(). As the API iph_set_totlen() and skb_ip_totlen() are used in __ip_local_out() and ip_rcv_core(), we only need to update these APIs. Also in GRO receive, add the check for ETH_P_IP/IPPROTO_TCP, and allows the merged packet size >= GRO_LEGACY_MAX_SIZE in skb_gro_receive(). In GRO complete, set IP header tot_len to 0 when the merged packet size greater than IP_MAX_MTU in iph_set_totlen() so that it can be processed on RX path. Note that by checking skb_is_gso_tcp() in API iph_totlen(), it makes this implementation safe to use iph->len == 0 indicates IPv4 BIG TCP packets. Signed-off-by: Xin Long <lucien.xin@gmail.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-02-01 20:54:27 -08:00
Xin Long	9eefedd58a	net: add gso_ipv4_max_size and gro_ipv4_max_size per device This patch introduces gso_ipv4_max_size and gro_ipv4_max_size per device and adds netlink attributes for them, so that IPV4 BIG TCP can be guarded by a separate tunable in the next patch. To not break the old application using "gso/gro_max_size" for IPv4 GSO packets, this patch updates "gso/gro_ipv4_max_size" in netif_set_gso/gro_max_size() if the new size isn't greater than GSO_LEGACY_MAX_SIZE, so that nothing will change even if userspace doesn't realize the new netlink attributes. Signed-off-by: Xin Long <lucien.xin@gmail.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-02-01 20:54:27 -08:00
Xin Long	8e08bb75b6	packet: add TP_STATUS_GSO_TCP for tp_status Introduce TP_STATUS_GSO_TCP tp_status flag to tell the af_packet user that this is a TCP GSO packet. When parsing IPv4 BIG TCP packets in tcpdump/libpcap, it can use tp_len as the IPv4 packet len when this flag is set, as iph tot_len is set to 0 for IPv4 BIG TCP packets. Signed-off-by: Xin Long <lucien.xin@gmail.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-02-01 20:54:27 -08:00
Xin Long	50e6fb5c6e	ipvlan: use skb_ip_totlen in ipvlan_get_L3_hdr ipvlan devices calls netif_inherit_tso_max() to get the tso_max_size/segs from the lower device, so when lower device supports BIG TCP, the ipvlan devices support it too. We also should consider its iph tot_len accessing. Signed-off-by: Xin Long <lucien.xin@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-02-01 20:54:27 -08:00
Xin Long	7eb072be41	cipso_ipv4: use iph_set_totlen in skbuff_setattr It may process IPv4 TCP GSO packets in cipso_v4_skbuff_setattr(), so the iph->tot_len update should use iph_set_totlen(). Note that for these non GSO packets, the new iph tot_len with extra iph option len added may become greater than 65535, the old process will cast it and set iph->tot_len to it, which is a bug. In theory, iph options shouldn't be added for these big packets in here, a fix may be needed here in the future. For now this patch is only to set iph->tot_len to 0 when it happens. Signed-off-by: Xin Long <lucien.xin@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-02-01 20:54:27 -08:00
Xin Long	a13fbf5ed5	netfilter: use skb_ip_totlen and iph_totlen There are also quite some places in netfilter that may process IPv4 TCP GSO packets, we need to replace them too. In length_mt(), we have to use u_int32_t/int to accept skb_ip_totlen() return value, otherwise it may overflow and mismatch. This change will also help us add selftest for IPv4 BIG TCP in the following patch. Note that we don't need to replace the one in tcpmss_tg4(), as it will return if there is data after tcphdr in tcpmss_mangle_packet(). The same in mangle_contents() in nf_nat_helper.c, it returns false when skb->len + extra > 65535 in enlarge_skb(). Signed-off-by: Xin Long <lucien.xin@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-02-01 20:54:27 -08:00
Xin Long	043e397e48	net: sched: use skb_ip_totlen and iph_totlen There are 1 action and 1 qdisc that may process IPv4 TCP GSO packets and access iph->tot_len, replace them with skb_ip_totlen() and iph_totlen() accordingly. Note that we don't need to replace the one in tcf_csum_ipv4(), as it will return for TCP GSO packets in tcf_csum_ipv4_tcp(). Signed-off-by: Xin Long <lucien.xin@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-02-01 20:54:27 -08:00
Xin Long	ec84c955a0	openvswitch: use skb_ip_totlen in conntrack IPv4 GSO packets may get processed in ovs_skb_network_trim(), and we need to use skb_ip_totlen() to get iph totlen. Signed-off-by: Xin Long <lucien.xin@gmail.com> Reviewed-by: Aaron Conole <aconole@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-02-01 20:54:27 -08:00
Xin Long	46abd17302	bridge: use skb_ip_totlen in br netfilter These 3 places in bridge netfilter are called on RX path after GRO and IPv4 TCP GSO packets may come through, so replace iph tot_len accessing with skb_ip_totlen() in there. Signed-off-by: Xin Long <lucien.xin@gmail.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-02-01 20:54:27 -08:00
Xin Long	058a8f7f73	net: add a couple of helpers for iph tot_len This patch adds three APIs to replace the iph->tot_len setting and getting in all places where IPv4 BIG TCP packets may reach, they will be used in the following patches. Note that iph_totlen() will be used when iph is not in linear data of the skb. Signed-off-by: Xin Long <lucien.xin@gmail.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-02-01 20:54:27 -08:00
Jakub Kicinski	d8673afbf5	Merge branch 'virtio_net-vdpa-update-mac-address-when-it-is-generated-by-virtio-net' Laurent Vivier says: ==================== virtio_net: vdpa: update MAC address when it is generated by virtio-net When the MAC address is not provided by the vdpa device virtio_net driver assigns a random one without notifying the device. The consequence, in the case of mlx5_vdpa, is the internal routing tables of the device are not updated and this can block the communication between two namespaces. To fix this problem, use virtnet_send_command(VIRTIO_NET_CTRL_MAC) to set the address from virtnet_probe() when the MAC address is not provided by the device. ==================== Link: https://lore.kernel.org/r/20230127204500.51930-1-lvivier@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-02-01 20:53:06 -08:00
Laurent Vivier	9f62d221a4	virtio_net: notify MAC address change on device initialization In virtnet_probe(), if the device doesn't provide a MAC address the driver assigns a random one. As we modify the MAC address we need to notify the device to allow it to update all the related information. The problem can be seen with vDPA and mlx5_vdpa driver as it doesn't assign a MAC address by default. The virtio_net device uses a random MAC address (we can see it with "ip link"), but we can't ping a net namespace from another one using the virtio-vdpa device because the new MAC address has not been provided to the hardware: RX packets are dropped since they don't go through the receive filters, TX packets go through unaffected. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-02-01 20:53:04 -08:00
Laurent Vivier	7c06458c10	virtio_net: disable VIRTIO_NET_F_STANDBY if VIRTIO_NET_F_MAC is not set failover relies on the MAC address to pair the primary and the standby devices: "[...] the hypervisor needs to enable VIRTIO_NET_F_STANDBY feature on the virtio-net interface and assign the same MAC address to both virtio-net and VF interfaces." Documentation/networking/net_failover.rst This patch disables the STANDBY feature if the MAC address is not provided by the hypervisor. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-02-01 20:53:04 -08:00
Huayu Chen	ca3daf437d	nfp: correct cleanup related to DCB resources This patch corrects two oversights relating to releasing resources and DCB initialisation. 1. If mapping of the dcbcfg_tbl area fails: an error should be propagated, allowing partial initialisation (probe) to be unwound. 2. Conversely, if where dcbcfg_tbl is successfully mapped: it should be unmapped in nfp_nic_dcb_clean() which is called via various error cleanup paths, and shutdown or removal of the PCIE device. Fixes: 9b7fe8046d74 ("nfp: add DCB IEEE support") Signed-off-by: Huayu Chen <huayu.chen@corigine.com> Reviewed-by: Niklas Söderlund <niklas.soderlund@corigine.com> Signed-off-by: Simon Horman <simon.horman@corigine.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Link: https://lore.kernel.org/r/20230131163033.981937-1-simon.horman@corigine.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-02-01 19:57:30 -08:00
Jiapeng Chong	bc61761394	ipv6: ICMPV6: Use swap() instead of open coding it Swap is a function interface that provides exchange function. To avoid code duplication, we can use swap function. ./net/ipv6/icmp.c:344:25-26: WARNING opportunity for swap(). Reported-by: Abaci Robot <abaci@linux.alibaba.com> Link: https://bugzilla.openanolis.cn/show_bug.cgi?id=3896 Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20230131063456.76302-1-jiapeng.chong@linux.alibaba.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-02-01 19:55:41 -08:00
Jakub Kicinski	074dd3b35a	Merge branch 'devlink-trivial-names-cleanup' Jiri Pirko says: ==================== devlink: trivial names cleanup This is a follow-up to Jakub's devlink code split and dump iteration helper patchset. No functional changes, just couple of renames to makes things consistent and perhaps easier to follow. ==================== Link: https://lore.kernel.org/r/20230131090613.2131740-1-jiri@resnulli.us Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-02-01 10:57:04 -08:00
Jiri Pirko	8589ba4e64	devlink: rename and reorder instances of struct devlink_cmd In order to maintain naming consistency, rename and reorder all usages of struct struct devlink_cmd in the following way: 1) Remove "gen" and replace it with "cmd" to match the struct name 2) Order devl_cmds[] and the header file to match the order of enum devlink_command 3) Move devl_cmd_rate_get among the peers 4) Remove "inst" for DEVLINK_CMD_GET 5) Add "_get" suffix to all to match DEVLINK_CMD_*_GET (only rate had it done correctly) Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-02-01 10:57:01 -08:00
Jiri Pirko	f87445953d	devlink: remove "gen" from struct devlink_gen_cmd name No need to have "gen" inside name of the structure for devlink commands. Remove it. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-02-01 10:57:01 -08:00
Jiri Pirko	c3a4fd5718	devlink: rename devlink_nl_instance_iter_dump() to "dumpit" To have the name of the function consistent with the struct cb name, rename devlink_nl_instance_iter_dump() to devlink_nl_instance_iter_dumpit(). Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-02-01 10:57:01 -08:00
Jakub Kicinski	dd25cfab16	Merge branch 'net-ipa-remaining-ipa-v5-0-support' Alex Elder says: ==================== net: ipa: remaining IPA v5.0 support This series includes almost all remaining IPA code changes required to support IPA v5.0. IPA register definitions and configuration data for IPA v5.0 will be sent later (soon). Note that the GSI register definitions still require work. GSI for IPA v5.0 supports up to 256 (rather than 32) channels, and this changes the way GSI register offsets are calculated. A few GSI register fields also change. The first patch in this series increases the number of IPA endpoints supported by the driver, from 32 to 36. The next updates the width of the destination field for the IP_PACKET_INIT immediate command so it can represent up to 256 endpoints rather than just 32. The next adds a few definitions of some IPA registers and fields that are first available in IPA v5.0. The next two patches update the code that handles router and filter table caches. Previously these were referred to as "hashed" tables, and the IPv4 and IPv6 tables are now combined into one "unified" table. The sixth and seventh patches add support for a new pulse generator, which allows time periods to be specified with a wider range of clock resolution. And the last patch just defines two new memory regions that were not previously used. ==================== Link: https://lore.kernel.org/r/20230130210158.4126129-1-elder@linaro.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-01-31 21:45:54 -08:00
Alex Elder	5157d6bfca	net: ipa: define two new memory regions IPA v5.0 uses two memory regions not previously used. Define them and treat them as valid only for IPA v5.0. Signed-off-by: Alex Elder <elder@linaro.org> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-01-31 21:45:52 -08:00
Alex Elder	2cdbcbfd48	net: ipa: support a third pulse register The AP has third pulse generator available starting with IPA v5.0. Redefine ipa_qtime_val() to support that possibility. Pass the IPA pointer as an argument so the version can be determined. And stop using the sign of the returned tick count to indicate which of two pulse generators to use. Instead, have the caller provide the address of a variable that will hold the selected pulse generator for the Qtime value. And for version 5.0, check whether the third pulse generator best represents the time period. Add code in ipa_qtime_config() to configure the fourth pulse generator for IPA v5.0+; in that case configure both the third and fourth pulse generators to use 10 msec granularity. Consistently use "ticks" for local variables that represent a tick count. Signed-off-by: Alex Elder <elder@linaro.org> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-01-31 21:45:52 -08:00
Alex Elder	32079a4ab1	net: ipa: greater timer granularity options Starting with IPA v5.0, the head-of-line blocking timer has more than two pulse generators available to define timer granularity. To prepare for that, change the way the field value is encoded to use ipa_reg_encode() rather than ipa_reg_bit(). The aggregation granularity selection could (in principle) also use an additional pulse generator starting with IPA v5.0. Encode the AGGR_GRAN_SEL field differently to allow that as well. Signed-off-by: Alex Elder <elder@linaro.org> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-01-31 21:45:52 -08:00
Alex Elder	a08cedc31d	net: ipa: support zeroing new cache tables IPA v5.0+ separates the configuration of entries in the cached (previously "hashed") routing and filtering tables into distinct registers. Previously a single "filter and router" register updated entries in both tables at once; now the routing and filter table caches have separate registers that define their content. This patch updates the code that zeroes entries in the cached filter and router tables to support IPA versions including v5.0+. Signed-off-by: Alex Elder <elder@linaro.org> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-01-31 21:45:52 -08:00
Alex Elder	8e7c89d84a	net: ipa: update table cache flushing Update the code that causes filter and router table caches to be flushed so that it supports IPA versions 5.0+. It adds a comment in ipa_hardware_config_hashing() that explains that cacheing does not need to be enabled, just as before, because it's enabled by default. (For the record, the FILT_ROUT_CACHE_CFG register would have been used if we wanted to explicitly enable these.) Signed-off-by: Alex Elder <elder@linaro.org> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-01-31 21:45:51 -08:00
Alex Elder	8ba59716d1	net: ipa: define IPA v5.0+ registers Define some new registers that appear starting with IPA v5.0, along with enumerated types identifying their fields. Code that uses these will be added by upcoming patches. Most of the new registers are related to filter and routing tables, and in particular, their "hashed" variant. These tables are better described as "cached", where a hash value determines which entries are cached. From now on, naming related to this functionality will use "cache" instead of "hash", and that is reflected in these new register names. Some registers for managing these caches and their contents have changed as well. A few other new field definitions for registers (unrelated to table caches) are also defined. Signed-off-by: Alex Elder <elder@linaro.org> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-01-31 21:45:51 -08:00
Alex Elder	c84ddc1197	net: ipa: extend endpoints in packet init command The IP_PACKET_INIT immediate command defines the destination endpoint to which a packet should be sent. Prior to IPA v5.0, a 5 bit field in that command represents the endpoint, but starting with IPA v5.0, the field is extended to 8 bits to support more than 32 endpoints. Signed-off-by: Alex Elder <elder@linaro.org> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-01-31 21:45:51 -08:00
Alex Elder	07abde549b	net: ipa: support more endpoints Increase the number of endpoints supported by the driver to 36, which IPA v5.0 supports. This makes it impossible to check at build time whether the supported number is too big to fit within the (5-bit) PACKET_INIT destination endpoint field. Instead, convert the build time check to compare against what fits in 8 bits. Add a check in ipa_endpoint_config() to also ensure the hardware reports an endpoint count that's in the expected range. Just open-code 32 as the limit (the PACKET_INIT field mask is not available where we'd want to use it). Signed-off-by: Alex Elder <elder@linaro.org> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-01-31 21:45:51 -08:00
Jakub Kicinski	71af6a2ddf	mlx5-updates-2023-01-30 Add fast update encryption key Jianbo Liu Says: ================ Data encryption keys (DEKs) are the keys used for data encryption and decryption operations. Starting from version 22.33.0783, firmware is optimized to accelerate the update of user keys into DEK object in hardware. The support for bulk allocation and destruction of DEK objects is added, and the bulk allocated DEKs are uninitialized, as the bulk creation requires no input key. When offload encryption/decryption, user gets one object from a bulk, and updates key by a new "modify DEK" command. This command is the same as create DEK object, but requires no heavy context memory allocation in firmware, which consumes most cpu cycles of the create DEK command. DEKs are cached internally by the NIC, so invalidating internal NIC caches is required before reusing DEKs. The SYNC_CRYPTO command is added to support it. DEK object can be reused, the keys in it can be updated after this command is executed. This patchset enhances the key creation and destruction flow, to get use of this new feature. Any user, for example, ktls, ipsec and macsec, can use it to offload keys. But, only ktls uses it, as others don't need many keys, and caching two many DEKs in pool is wasteful. There are two new data struts added: a. DEK pool. One pool is created for each key type. The bulks by the type, are placed in the pool's different bulk lists, according to the number of available and in_used DEKs in the bulk. b. DEK bulk. All DEKs in one bulk allocation are store here. There are two bitmaps to indicate the state of each DEK. New APIs are then added. When user need a DEK object, a. Fetch one bulk with avail DEKs, from the partial_list or avail_list, otherwise create new one. b. Pick one DEK, and set its need_sync and in_used bits to 1. Move the bulk to full_list if no more available keys, or put it to partial_list if the bulk is newly created. c. Update DEK object's key with user key, by the "modify DEK" command. d. Return DEK struct to user, then it gets the object id and fills it into the offload commands. When user free a DEK, a. Set in_use bit to 0. If all need_sync bits are 1 and all in_use bits of this bulk are 0, move it to sync_list. b. If the number of DEKs, which are freed by users, is over the threshold (128), schedule a workqueue to do the sync process. For the sync process, the SYNC_CRYPTO command is executed first. Then, for each bulks in partial_list, full_list and sync_list, reset need_sync bits of the freed DEK objects. If all need_sync bits in one bulk are zero, move it to avail_list. We already supported TIS pool to recycle the TISes. With this series and TIS pool, TLS CPS performance is improved greatly. And we tested https on the system: CPU: dual AMD EPYC 7763 64-Core processors RAM: 512G DEV: ConnectX-6 DX, with FW ver 22.33.0838 and TLS_OPTIMISE=true TLS CPS performance numbers are: Before: 11k connections/sec After: 101 connections/sec ================ -----BEGIN PGP SIGNATURE----- iQEzBAABCAAdFiEEGhZs6bAKwk/OTgTpSD+KveBX+j4FAmPYho4ACgkQSD+KveBX +j4tmQf/UnDnj55lf7zxvDYCgIThSFeqIPCnwnbRRTPB85jsjsBMx+52ugYGJ5kZ Mci93QfkDoIEAAamBwj76X3skobmsdKZsOmFyLKpfWBz6K98EZVC7nAPPRO9o80Z YGQQAbUn8I/USC0cB2BICCnjkbcpeMUgYYqnLteBsKBiH3IkMoEtkeaWN0M3SHK/ xKLZwlpX+2gIotr6h2ftd8B8ygL1CSyMTqIp0vrSQY69ucTpgtsbDufODbU58p7n JUOVtNM5irwi2QdfSJjPAc1vMkkVJYCGbE1mxMjbKyDOMEnK5vIiMgb7RWRSMbdC FzSY0/vQxFoOB21+CeiGN/rPvMnRCA== =/Cb4 -----END PGP SIGNATURE----- Merge tag 'mlx5-updates-2023-01-30' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux Saeed Mahameed says: ==================== mlx5-updates-2023-01-30 Add fast update encryption key Jianbo Liu Says: ================ Data encryption keys (DEKs) are the keys used for data encryption and decryption operations. Starting from version 22.33.0783, firmware is optimized to accelerate the update of user keys into DEK object in hardware. The support for bulk allocation and destruction of DEK objects is added, and the bulk allocated DEKs are uninitialized, as the bulk creation requires no input key. When offload encryption/decryption, user gets one object from a bulk, and updates key by a new "modify DEK" command. This command is the same as create DEK object, but requires no heavy context memory allocation in firmware, which consumes most cpu cycles of the create DEK command. DEKs are cached internally by the NIC, so invalidating internal NIC caches is required before reusing DEKs. The SYNC_CRYPTO command is added to support it. DEK object can be reused, the keys in it can be updated after this command is executed. This patchset enhances the key creation and destruction flow, to get use of this new feature. Any user, for example, ktls, ipsec and macsec, can use it to offload keys. But, only ktls uses it, as others don't need many keys, and caching two many DEKs in pool is wasteful. There are two new data struts added: a. DEK pool. One pool is created for each key type. The bulks by the type, are placed in the pool's different bulk lists, according to the number of available and in_used DEKs in the bulk. b. DEK bulk. All DEKs in one bulk allocation are store here. There are two bitmaps to indicate the state of each DEK. New APIs are then added. When user need a DEK object, a. Fetch one bulk with avail DEKs, from the partial_list or avail_list, otherwise create new one. b. Pick one DEK, and set its need_sync and in_used bits to 1. Move the bulk to full_list if no more available keys, or put it to partial_list if the bulk is newly created. c. Update DEK object's key with user key, by the "modify DEK" command. d. Return DEK struct to user, then it gets the object id and fills it into the offload commands. When user free a DEK, a. Set in_use bit to 0. If all need_sync bits are 1 and all in_use bits of this bulk are 0, move it to sync_list. b. If the number of DEKs, which are freed by users, is over the threshold (128), schedule a workqueue to do the sync process. For the sync process, the SYNC_CRYPTO command is executed first. Then, for each bulks in partial_list, full_list and sync_list, reset need_sync bits of the freed DEK objects. If all need_sync bits in one bulk are zero, move it to avail_list. We already supported TIS pool to recycle the TISes. With this series and TIS pool, TLS CPS performance is improved greatly. And we tested https on the system: CPU: dual AMD EPYC 7763 64-Core processors RAM: 512G DEV: ConnectX-6 DX, with FW ver 22.33.0838 and TLS_OPTIMISE=true TLS CPS performance numbers are: Before: 11k connections/sec After: 101 connections/sec ================ * tag 'mlx5-updates-2023-01-30' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux: net/mlx5e: kTLS, Improve connection rate by using fast update encryption key net/mlx5: Keep only one bulk of full available DEKs net/mlx5: Add async garbage collector for DEK bulk net/mlx5: Reuse DEKs after executing SYNC_CRYPTO command net/mlx5: Use bulk allocation for fast update encryption key net/mlx5: Add bulk allocation and modify_dek operation net/mlx5: Add support SYNC_CRYPTO command net/mlx5: Add new APIs for fast update encryption key net/mlx5: Refactor the encryption key creation net/mlx5: Add const to the key pointer of encryption key creation net/mlx5: Prepare for fast crypto key update if hardware supports it net/mlx5: Change key type to key purpose net/mlx5: Add IFC bits and enums for crypto key net/mlx5: Add IFC bits for general obj create param net/mlx5: Header file for crypto ==================== Link: https://lore.kernel.org/r/20230131031201.35336-1-saeed@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-01-31 21:35:34 -08:00
Jakub Kicinski	c925ed5f66	Merge branch '10GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue Tony Nguyen says: ==================== Intel Wired LAN: Remove redundant Device Control Error Reporting Enable Bjorn Helgaas says: Since f26e58bf6f54 ("PCI/AER: Enable error reporting when AER is native"), the PCI core sets the Device Control bits that enable error reporting for PCIe devices. This series removes redundant calls to pci_enable_pcie_error_reporting() that do the same thing from several NIC drivers. There are several more drivers where this should be removed; I started with just the Intel drivers here. * '10GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue: ixgbe: Remove redundant pci_enable_pcie_error_reporting() igc: Remove redundant pci_enable_pcie_error_reporting() igb: Remove redundant pci_enable_pcie_error_reporting() ice: Remove redundant pci_enable_pcie_error_reporting() iavf: Remove redundant pci_enable_pcie_error_reporting() i40e: Remove redundant pci_enable_pcie_error_reporting() fm10k: Remove redundant pci_enable_pcie_error_reporting() e1000e: Remove redundant pci_enable_pcie_error_reporting() ==================== Link: https://lore.kernel.org/r/20230130192519.686446-1-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-01-31 21:04:25 -08:00
Jakub Kicinski	67971c381f	Merge branch 'selftests-mlxsw-convert-to-iproute2-dcb' Petr Machata says: ==================== selftests: mlxsw: Convert to iproute2 dcb There is a dedicated tool for configuration of DCB in iproute2. Use it in the selftests instead of lldpad. Patches #1-#3 convert three tests. Patch #4 drops the now-unnecessary lldpad helpers. ==================== Link: https://lore.kernel.org/r/cover.1675096231.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-01-31 21:02:13 -08:00
Petr Machata	bd32ff6872	selftests: net: forwarding: lib: Drop lldpad_app_wait_set(), _del() The existing users of these helpers have been converted to iproute2 dcb. Drop the helpers. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Danielle Ratson <danieller@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-01-31 21:02:11 -08:00
Petr Machata	5b3ef0452c	selftests: mlxsw: qos_defprio: Convert from lldptool to dcb Set up default port priority through the iproute2 dcb tool, which is easier to understand and manage. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Danielle Ratson <danieller@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-01-31 21:02:11 -08:00
Petr Machata	10d5bd0b69	selftests: mlxsw: qos_dscp_router: Convert from lldptool to dcb Set up DSCP prioritization through the iproute2 dcb tool, which is easier to understand and manage. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Danielle Ratson <danieller@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-01-31 21:02:11 -08:00
Petr Machata	1680801ef6	selftests: mlxsw: qos_dscp_bridge: Convert from lldptool to dcb Set up DSCP prioritization through the iproute2 dcb tool, which is easier to understand and manage. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Danielle Ratson <danieller@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-01-31 21:02:11 -08:00
Jakub Kicinski	88b49402fa	Merge branch 'net-mdio-add-amlogic-gxl-mdio-mux-support' Jerome Brunet says: ==================== net: mdio: add amlogic gxl mdio mux support Add support for the MDIO multiplexer found in the Amlogic GXL SoC family. This multiplexer allows to choose between the external (SoC pins) MDIO bus, or the internal one leading to the integrated 10/100M PHY. This multiplexer has been handled with the mdio-mux-mmioreg generic driver so far. When it was added, it was thought the logic was handled by a single register. It turns out more than a single register need to be properly set. As long as the device is using the Amlogic vendor bootloader, or upstream u-boot with net support, it is working fine since the kernel is inheriting the bootloader settings. Without net support in the bootloader, this glue comes unset in the kernel and only the external path may operate properly. With this driver (and the associated change in arch/arm64/boot/dts/amlogic/meson-gxl.dtsi), the kernel no longer relies on the bootloader to set things up, fixing the problem. ==================== Link: https://lore.kernel.org/r/20230130151616.375168-1-jbrunet@baylibre.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-01-31 20:59:09 -08:00
Jerome Brunet	9a24e1ff43	net: mdio: add amlogic gxl mdio mux support Add support for the mdio mux and internal phy glue of the GXL SoC family Reported-by: Da Xue <da@lessconfused.com> Signed-off-by: Jerome Brunet <jbrunet@baylibre.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-01-31 20:59:07 -08:00
Jerome Brunet	cc732d2351	dt-bindings: net: add amlogic gxl mdio multiplexer Add documentation for the MDIO bus multiplexer found on the Amlogic GXL SoC family Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Signed-off-by: Jerome Brunet <jbrunet@baylibre.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-01-31 20:59:07 -08:00
Jakub Kicinski	1b98ac0fc8	Merge branch 'tools-ynl-more-docs-and-basic-ethtool-support' Jakub Kicinski says: ==================== tools: ynl: more docs and basic ethtool support I got discouraged from supporting ethtool in specs, because generating the user space C code seems a little tricky. The messages are ID'ed in a "directional" way (to and from kernel are separate ID "spaces"). There is value, however, in having the spec and being able to for example use it in Python. After paying off some technical debt - add a partial ethtool spec. Partial because the header for ethtool is almost a 1000 LoC, so converting in one sitting is tough. But adding new commands should be trivial now. Last but not least I add more docs, I realized that I've been sending a similar "instructions" email to people working on new families. It's now intro-specs.rst. ==================== Link: https://lore.kernel.org/r/20230131023354.1732677-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-01-31 20:36:05 -08:00
Jakub Kicinski	981cbcb030	tools: net: use python3 explicitly The scripts require Python 3 and some distros are dropping Python 2 support. Reported-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-01-31 20:36:03 -08:00
Jakub Kicinski	01e47a3722	docs: netlink: add a starting guide for working with specs We have a bit of documentation about the internals of Netlink and the specs, but really the goal is for most people to not worry about those. Add a practical guide for beginners who want to poke at the specs. Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-01-31 20:36:03 -08:00
Jakub Kicinski	b784db7ae8	netlink: specs: add partial specification for ethtool Ethtool is one of the most actively developed families. With the changes to the CLI it should be possible to use the YNL based code for easy prototyping and development. Add a partial family definition. I've tested the string set and rings. I don't have any MAC Merge implementation to test with, but I added the definition for it, anyway, because it's last. New commands can simply be added at the end without having to worry about manually providing IDs / values. Set (with notification support - None is the response, the data is from the notification): $ sudo ./tools/net/ynl/cli.py \ --spec Documentation/netlink/specs/ethtool.yaml \ --do rings-set \ --json '{"header":{"dev-name":"enp0s31f6"}, "rx":129}' \ --subscribe monitor None [{'msg': {'header': {'dev-index': 2, 'dev-name': 'enp0s31f6'}, 'rx': 136, 'rx-max': 4096, 'tx': 256, 'tx-max': 4096, 'tx-push': 0}, 'name': 'rings-ntf'}] Do / dump (yes, the kernel requires that even for dump and even if empty - the "header" nest must be there): $ ./tools/net/ynl/cli.py \ --spec Documentation/netlink/specs/ethtool.yaml \ --do rings-get \ --json '{"header":{"dev-index": 2}}' {'header': {'dev-index': 2, 'dev-name': 'enp0s31f6'}, 'rx': 136, 'rx-max': 4096, 'tx': 256, 'tx-max': 4096, 'tx-push': 0} $ ./tools/net/ynl/cli.py \ --spec Documentation/netlink/specs/ethtool.yaml \ --dump rings-get \ --json '{"header":{}}' [{'header': {'dev-index': 2, 'dev-name': 'enp0s31f6'}, 'rx': 136, 'rx-max': 4096, 'tx': 256, 'tx-max': 4096, 'tx-push': 0}, {'header': {'dev-index': 3, 'dev-name': 'wlp0s20f3'}, 'tx-push': 0}, {'header': {'dev-index': 19, 'dev-name': 'enp58s0u1u1'}, 'rx': 100, 'rx-max': 4096, 'tx-push': 0}] And error reporting: $ ./tools/net/ynl/cli.py \ --spec Documentation/netlink/specs/ethtool.yaml \ --dump rings-get \ --json '{"header":{"flags":5}}' Netlink error: Invalid argument nl_len = 68 (52) nl_flags = 0x300 nl_type = 2 error: -22 extack: {'msg': 'reserved bit set', 'bad-attr-offs': 24, 'bad-attr': '.header.flags'} None Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-01-31 20:36:03 -08:00
Jakub Kicinski	8403bf0445	netlink: specs: finish up operation enum-models I had a (bright?) idea of introducing the concept of enum-models to account for all the weird ways families enumerate their messages. I've never finished it because generating C code for each of them is pretty daunting. But for languages which can use ID values directly the support is simple enough, so clean this up a bit. "unified" model is what I recommend going forward. "directional" model is what ethtool uses. "notify-split" is used by the proposed DPLL code, but we can just make them use "unified", it hasn't been merged :) Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-01-31 20:36:03 -08:00
Jakub Kicinski	5c6674f6eb	tools: ynl: load jsonschema on demand The CLI script tries to validate jsonschema by default. It's seems better to validate too many times than too few. However, when copying the scripts to random servers having to install jsonschema is tedious. Load jsonschema via importlib, and let the user opt out. Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-01-31 20:36:03 -08:00
Jakub Kicinski	8dfec0a888	tools: ynl: use operation names from spec on the CLI When I wrote the first version of the Python code I was quite excited that we can generate class methods directly from the spec. Unfortunately we need to use valid identifiers for method names (specifically no dashes are allowed). Don't reuse those names on the CLI, it's much more natural to use the operation names exactly as listed in the spec. Instead of: ./cli --do rings_get use: ./cli --do rings-get Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-01-31 20:36:03 -08:00
Jakub Kicinski	4cd2796f3f	tools: ynl: support pretty printing bad attribute names One of my favorite features of the Netlink specs is that they make decoding structured extack a ton easier. Implement pretty printing bad attribute names in YNL. For example it will now say: 'bad-attr': '.header.flags' rather than the useless: 'bad-attr-offs': 32 Proof: $ ./cli.py --spec ethtool.yaml --do rings_get \ --json '{"header":{"dev-index":1, "flags":4}}' Netlink error: Invalid argument nl_len = 68 (52) nl_flags = 0x300 nl_type = 2 error: -22 extack: {'msg': 'reserved bit set', 'bad-attr': '.header.flags'} Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-01-31 20:36:03 -08:00
Jakub Kicinski	90256f3f80	tools: ynl: support multi-attr Ethtool uses mutli-attr, add the support to YNL. Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-01-31 20:36:03 -08:00
Jakub Kicinski	fd0616d342	tools: ynl: support directional enum-model in CLI Support families which use different IDs for messages to and from the kernel. Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-01-31 20:36:03 -08:00

1 2 3 4 5 ...

1155803 Commits