linux/net
Eric Dumazet 4bfe744ff1 tcp: fix potential xmit stalls caused by TCP_NOTSENT_LOWAT
I had this bug sitting for too long in my pile, it is time to fix it.

Thanks to Doug Porter for reminding me of it!

We had various attempts in the past, including commit
0cbe6a8f08 ("tcp: remove SOCK_QUEUE_SHRUNK"),
but the issue is that TCP stack currently only generates
EPOLLOUT from input path, when tp->snd_una has advanced
and skb(s) cleaned from rtx queue.

If a flow has a big RTT, and/or receives SACKs, it is possible
that the notsent part (tp->write_seq - tp->snd_nxt) reaches 0
and no more data can be sent until tp->snd_una finally advances.

What is needed is to also check if POLLOUT needs to be generated
whenever tp->snd_nxt is advanced, from output path.

This bug triggers more often after an idle period, as
we do not receive ACK for at least one RTT. tcp_notsent_lowat
could be a fraction of what CWND and pacing rate would allow to
send during this RTT.

In a followup patch, I will remove the bogus call
to tcp_chrono_stop(sk, TCP_CHRONO_SNDBUF_LIMITED)
from tcp_check_space(). Fact that we have decided to generate
an EPOLLOUT does not mean the application has immediately
refilled the transmit queue. This optimistic call
might have been the reason the bug seemed not too serious.

Tested:

200 ms rtt, 1% packet loss, 32 MB tcp_rmem[2] and tcp_wmem[2]

$ echo 500000 >/proc/sys/net/ipv4/tcp_notsent_lowat
$ cat bench_rr.sh
SUM=0
for i in {1..10}
do
 V=`netperf -H remote_host -l30 -t TCP_RR -- -r 10000000,10000 -o LOCAL_BYTES_SENT | egrep -v "MIGRATED|Bytes"`
 echo $V
 SUM=$(($SUM + $V))
done
echo SUM=$SUM

Before patch:
$ bench_rr.sh
130000000
80000000
140000000
140000000
140000000
140000000
130000000
40000000
90000000
110000000
SUM=1140000000

After patch:
$ bench_rr.sh
430000000
590000000
530000000
450000000
450000000
350000000
450000000
490000000
480000000
460000000
SUM=4680000000  # This is 410 % of the value before patch.

Fixes: c9bee3b7fd ("tcp: TCP_NOTSENT_LOWAT socket option")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Doug Porter <dsp@fb.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-25 12:07:45 +01:00
..
6lowpan net: don't include ndisc.h from ipv6.h 2022-02-04 14:15:11 -08:00
9p xen/grant-table: remove readonly parameter from functions 2022-03-15 20:34:40 -05:00
802 net: 802: Use memset_startat() to clear struct fields 2021-11-19 11:23:23 +00:00
8021q vlan: use correct format characters 2022-03-17 16:34:49 -07:00
appletalk net: socket: rework compat_ifreq_ioctl() 2021-07-23 14:20:25 +01:00
atm proc: remove PDE_DATA() completely 2022-01-22 08:33:37 +02:00
ax25 ax25: Fix UAF bugs in ax25 timers 2022-03-29 10:24:34 +02:00
batman-adv batman-adv: Use netif_rx(). 2022-03-07 11:40:41 +00:00
bluetooth Bluetooth: call hci_le_conn_failed with hdev lock in hci_le_conn_failed 2022-03-18 17:12:08 +01:00
bpf bpf, test_run: Fix packet size check for live packet mode 2022-03-11 22:01:26 +01:00
bpfilter uaccess: remove CONFIG_SET_FS 2022-02-25 09:36:06 +01:00
bridge net: bridge: switchdev: check br_vlan_group() return value 2022-04-22 15:12:18 -07:00
caif net: caif: Use netif_rx(). 2022-03-04 12:02:19 +00:00
can can: isotp: stop timeout monitoring when no first frame was sent 2022-04-17 17:21:22 +02:00
ceph libceph: drop else branches in prepare_read_data{,_cont} 2022-03-01 18:26:36 +01:00
core rtnetlink: Fix handling of disabled L3 stats in RTM_GETSTATS replies 2022-04-14 09:01:26 +02:00
dcb net: dcb: disable softirqs in dcbnl_flush_dev() 2022-03-03 08:01:55 -08:00
dccp dccp: remove max48() 2022-01-27 13:53:27 +00:00
decnet net: decnet: use time_is_before_jiffies() instead of open coding it 2022-02-28 13:21:32 +00:00
dns_resolver
dsa net: dsa: flood multicast to CPU when slave has IFF_PROMISC 2022-04-25 11:46:24 +01:00
ethernet gro: remove rcu_read_lock/rcu_read_unlock from gro_complete handlers 2021-11-24 17:21:42 -08:00
ethtool ethtool: add support to set/get completion queue event size 2022-02-23 20:33:05 -08:00
hsr net: add per-cpu storage and net->core_stats 2022-03-11 23:17:24 -08:00
ieee802154 net: ipv6: Handle delivery_time in ipv6 defrag 2022-03-03 14:38:48 +00:00
ife
ipv4 tcp: fix potential xmit stalls caused by TCP_NOTSENT_LOWAT 2022-04-25 12:07:45 +01:00
ipv6 ip_gre, ip6_gre: Fix race condition on o_seqno in collect_md mode 2022-04-25 11:40:45 +01:00
iucv s390/iucv: sort out physical vs virtual pointers usage 2022-02-22 16:09:13 -08:00
kcm net: Don't include filter.h from net/sock.h 2021-12-29 08:48:14 -08:00
key af_key: add __GFP_ZERO flag for compose_sadb_supported in function pfkey_register 2022-03-10 07:39:47 +01:00
l2tp l2tp: add netns refcount tracker to l2tp_dfs_seq_data 2021-12-10 06:38:27 -08:00
l3mdev l3mdev: l3mdev_master_upper_ifindex_by_index_rcu should be using netdev_master_upper_dev_get_rcu 2022-04-15 14:27:24 -07:00
lapb net: lapb: Use list_for_each_entry() to simplify code in lapb_iface.c 2021-06-08 16:31:25 -07:00
llc llc: only change llc->dev when bind() succeeds 2022-03-25 16:55:41 -07:00
mac80211 mac80211: fix ht_capa printout in debugfs 2022-04-11 11:57:27 +02:00
mac802154 mac802154: use dev_addr_set() - manual 2021-10-20 14:27:40 +01:00
mctp mctp: Use output netdev to allocate skb headroom 2022-04-01 12:04:15 +01:00
mpls net: mpls: Fix GCC 12 warning 2022-02-10 15:29:39 +00:00
mptcp Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2022-03-23 10:53:49 -07:00
ncsi all: replace find_next{,_zero}_bit with find_first{,_zero}_bit where appropriate 2022-01-15 08:47:31 -08:00
netfilter netfilter: nft_set_rbtree: overlap detection with element re-addition after deletion 2022-04-22 15:49:15 +02:00
netlabel netlabel: fix out-of-bounds memory accesses 2022-03-21 10:59:11 +00:00
netlink netlink: reset network and mac headers in netlink_dump() 2022-04-19 15:05:03 +02:00
netrom netrom: fix api breakage in nr_setsockopt() 2022-01-07 14:11:05 +00:00
nfc nfc: nci: add flush_workqueue to prevent uaf 2022-04-13 14:44:44 +01:00
nsh
openvswitch openvswitch: fix OOB access in reserve_sfa_size() 2022-04-15 11:50:02 +01:00
packet net/packet: fix packet_sock xmit return value checking 2022-04-15 11:17:30 +01:00
phonet phonet: Use netif_rx(). 2022-03-07 11:40:41 +00:00
psample
qrtr bus: mhi: core: Add an API for auto queueing buffers for DL channel 2021-12-17 17:17:14 +01:00
rds Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2021-12-16 16:13:19 -08:00
rfkill rfkill: make new event layout opt-in 2022-03-18 13:09:17 +02:00
rose net: Don't include filter.h from net/sock.h 2021-12-29 08:48:14 -08:00
rxrpc rxrpc: Restore removed timer deletion 2022-04-15 10:54:49 +01:00
sched net/sched: cls_u32: fix possible leak in u32_init_knode() 2022-04-15 14:26:11 -07:00
sctp sctp: check asoc strreset_chunk in sctp_generate_reconf_event 2022-04-23 22:34:17 +01:00
smc net/smc: sync err code when tcp connection was refused 2022-04-25 11:10:49 +01:00
strparser bpf: sockmap, strparser, and tls are reusing qdisc_skb_cb and colliding 2021-11-09 01:05:28 +01:00
sunrpc NFSD bug fixes for 5.18-rc: 2022-04-12 14:23:19 -10:00
switchdev net: switchdev: remove lag_mod_cb from switchdev_handle_fdb_event_to_device 2022-02-24 21:31:43 -08:00
tipc Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2022-03-23 10:53:49 -07:00
tls net/tls: fix slab-out-of-bounds bug in decrypt_internal 2022-04-01 11:53:35 +01:00
unix Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2022-03-23 10:53:49 -07:00
vmw_vsock vsock/virtio: enable VQs early on probe 2022-03-24 18:36:36 -07:00
wireless cfg80211: hold bss_lock while updating nontrans_list 2022-04-11 11:55:36 +02:00
x25 net/x25: Fix null-ptr-deref caused by x25_disconnect 2022-03-26 11:48:16 -07:00
xdp xsk: Do not write NULL in SW ring at allocation failure 2022-03-28 19:56:28 -07:00
xfrm xfrm: Pass flowi_oif or l3mdev as oif to xfrm_dst_lookup 2022-04-04 13:32:56 +02:00
compat.c net: Return the correct errno code 2021-06-03 15:13:56 -07:00
devres.c net: devres: Correct a grammatical error 2021-06-11 12:55:28 -07:00
Kconfig page_pool: Add allocation stats 2022-03-03 09:55:28 +00:00
Kconfig.debug net: add networking namespace refcount tracker 2021-12-10 06:38:26 -08:00
Makefile mctp: Add MCTP base 2021-07-29 15:06:49 +01:00
socket.c fs: allocate inode by using alloc_inode_sb() 2022-03-22 15:57:03 -07:00
sysctl_net.c sections: move and rename core_kernel_data() to is_kernel_core_data() 2021-11-09 10:02:50 -08:00