Stefano Brivio a502ea6fa9 udp: Deal with race between UDP socket address change and rehash
If a UDP socket changes its local address while it's receiving
datagrams, as a result of connect(), there is a period during which
a lookup operation might fail to find it, after the address is changed
but before the secondary hash (port and address) and the four-tuple
hash (local and remote ports and addresses) are updated.
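
The window described above can be illustrated with a small userspace
model. This is only a sketch under simplifying assumptions: a single
socket, one slot per hash value, and a made-up hash function. Every
name in it (toy_sock, toy_hash2(), toy_rehash() and so on) is
hypothetical and does not correspond to a kernel structure or function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model only: names and layout are hypothetical, not kernel code. */
#define TOY_SLOTS    16u
#define TOY_ADDR_ANY 0u

struct toy_sock {
	uint32_t rcv_addr;  /* current local (receive) address */
	uint16_t port;      /* local port */
	unsigned int slot2; /* slot the socket is currently stored in */
};

/* Secondary "hash table": one socket per slot, keyed by (address, port). */
static struct toy_sock *toy_table2[TOY_SLOTS];

/* Made-up hash, just good enough to separate the values used here. */
static unsigned int toy_hash2(uint32_t addr, uint16_t port)
{
	return (addr ^ (addr >> 16) ^ port) % TOY_SLOTS;
}

/* Store the socket in the slot matching its current address and port. */
static void toy_hash(struct toy_sock *sk)
{
	sk->slot2 = toy_hash2(sk->rcv_addr, sk->port);
	toy_table2[sk->slot2] = sk;
}

/* First half of the address change: the address moves, the slot doesn't. */
static void toy_set_rcv_addr(struct toy_sock *sk, uint32_t addr)
{
	sk->rcv_addr = addr;
}

/* Second half: move the socket to the slot matching its new address. */
static void toy_rehash(struct toy_sock *sk)
{
	toy_table2[sk->slot2] = NULL;
	toy_hash(sk);
}

/* Check one slot; the stored socket must match address and port exactly. */
static struct toy_sock *toy_lookup2(uint32_t addr, uint16_t port)
{
	struct toy_sock *sk = toy_table2[toy_hash2(addr, port)];

	if (sk && sk->rcv_addr == addr && sk->port == port)
		return sk;
	return NULL;
}

/* Receive path: try the specific address first, then the wildcard. */
static struct toy_sock *toy_lookup(uint32_t daddr, uint16_t dport)
{
	struct toy_sock *sk = toy_lookup2(daddr, dport);

	return sk ? sk : toy_lookup2(TOY_ADDR_ANY, dport);
}
```

With a socket hashed at TOY_ADDR_ANY, port 1337, a toy_lookup() for
0x7f000001 (127.0.0.1), port 1337 finds it through the wildcard slot.
After toy_set_rcv_addr() moves it to 0x7f000001, the same lookup
returns NULL, and it succeeds again only once toy_rehash() has run.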

Secondary hash chains were introduced by commit 30fff9231fad ("udp:
bind() optimisation") and, as a result, a rehash operation became
needed to make a bound socket reachable again after a connect().

This operation was introduced by commit 719f835853a9 ("udp: add
rehash on connect()"), which, however, isn't a complete fix: the
socket will be found once the rehashing completes, but not while
it's pending.

This is noticeable with a socat(1) server in UDP4-LISTEN mode, and a
client sending datagrams to it. After the server receives the first
datagram (cf. _xioopen_ipdgram_listen()), it issues a connect() to
the address of the sender, in order to set up a directed flow.
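
The receive-then-connect pattern can be sketched in userspace with
plain POSIX socket calls. This is not socat's code: the helper names
udp_bind_any() and udp_recv_then_connect() are hypothetical, and error
handling is reduced to the bare minimum.

```c
#include <assert.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: bind a UDP socket to the wildcard address on a
 * kernel-chosen port. On success, return the descriptor and store the
 * bound address in *bound; return -1 on failure.
 */
static int udp_bind_any(struct sockaddr_in *bound)
{
	socklen_t len = sizeof(*bound);
	int fd = socket(AF_INET, SOCK_DGRAM, 0);

	if (fd < 0)
		return -1;

	memset(bound, 0, sizeof(*bound));
	bound->sin_family = AF_INET;
	bound->sin_addr.s_addr = htonl(INADDR_ANY);
	bound->sin_port = 0; /* let the kernel pick a free port */

	if (bind(fd, (struct sockaddr *)bound, sizeof(*bound)) < 0 ||
	    getsockname(fd, (struct sockaddr *)bound, &len) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}

/* Hypothetical helper: wait for one datagram, then connect() the
 * socket to its sender, so that later traffic is a directed flow.
 * Return 0 on success, -1 on failure; the sender is stored in *peer.
 */
static int udp_recv_then_connect(int fd, struct sockaddr_in *peer)
{
	socklen_t len = sizeof(*peer);
	char buf[64];

	if (recvfrom(fd, buf, sizeof(buf), 0,
		     (struct sockaddr *)peer, &len) < 0)
		return -1;

	/* This connect() is what changes the local address and needs a rehash */
	return connect(fd, (struct sockaddr *)peer, len);
}
```

A getsockname() call before and after udp_recv_then_connect() is one
way to observe the local address moving from the wildcard to a
specific address, which is the change the lookup races against.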

Now, if the client, running on a different CPU thread, happens to
send a (subsequent) datagram while the server's socket changes its
address, but is not rehashed yet, this will result in a failed
lookup and a port unreachable error delivered to the client, as
apparent from the following reproducer:

  LEN=$(($(cat /proc/sys/net/core/wmem_default) / 4))
  dd if=/dev/urandom bs=1 count=${LEN} of=tmp.in

  while :; do
  	taskset -c 1 socat UDP4-LISTEN:1337,null-eof OPEN:tmp.out,create,trunc &
  	sleep 0.1 || sleep 1
  	taskset -c 2 socat OPEN:tmp.in UDP4:localhost:1337,shut-null
  	wait
  done

where the client will eventually get ECONNREFUSED on a write()
(typically the second or third one of a given iteration):

  2024/11/13 21:28:23 socat[46901] E write(6, 0x556db2e3c000, 8192): Connection refused

This issue was first observed as a rare failure in Podman's tests
checking UDP functionality while using pasta(1) to connect the
container's network namespace, which leads us to a reproducer with
the lookup error resulting in an ICMP packet on a tap device:

  LOCAL_ADDR="$(ip -j -4 addr show|jq -rM '.[] | .addr_info[0] | select(.scope == "global").local')"

  while :; do
  	./pasta --config-net -p pasta.pcap -u 1337 socat UDP4-LISTEN:1337,null-eof OPEN:tmp.out,create,trunc &
  	sleep 0.2 || sleep 1
  	socat OPEN:tmp.in UDP4:${LOCAL_ADDR}:1337,shut-null
  	wait
  	cmp tmp.in tmp.out
  done

Once this fails:

  tmp.in tmp.out differ: char 8193, line 29

we can finally have a look at what's going on:

  $ tshark -r pasta.pcap
      1   0.000000           :: → ff02::16     ICMPv6 110 Multicast Listener Report Message v2
      2   0.168690 88.198.0.161 → 88.198.0.164 UDP 8234 60260 → 1337 Len=8192
      3   0.168767 88.198.0.161 → 88.198.0.164 UDP 8234 60260 → 1337 Len=8192
      4   0.168806 88.198.0.161 → 88.198.0.164 UDP 8234 60260 → 1337 Len=8192
      5   0.168827 c6:47:05:8d:dc:04 → Broadcast    ARP 42 Who has 88.198.0.161? Tell 88.198.0.164
      6   0.168851 9a:55:9a:55:9a:55 → c6:47:05:8d:dc:04 ARP 42 88.198.0.161 is at 9a:55:9a:55:9a:55
      7   0.168875 88.198.0.161 → 88.198.0.164 UDP 8234 60260 → 1337 Len=8192
      8   0.168896 88.198.0.164 → 88.198.0.161 ICMP 590 Destination unreachable (Port unreachable)
      9   0.168926 88.198.0.161 → 88.198.0.164 UDP 8234 60260 → 1337 Len=8192
     10   0.168959 88.198.0.161 → 88.198.0.164 UDP 8234 60260 → 1337 Len=8192
     11   0.168989 88.198.0.161 → 88.198.0.164 UDP 4138 60260 → 1337 Len=4096
     12   0.169010 88.198.0.161 → 88.198.0.164 UDP 42 60260 → 1337 Len=0

On the third datagram received, the network namespace of the container
initiates an ARP lookup to deliver the ICMP message.

In another variant of this reproducer, starting the server with:

  strace -f pasta --config-net -u 1337 socat UDP4-LISTEN:1337,null-eof OPEN:tmp.out,create,trunc 2>strace.log &

and connecting to the socat server using a loopback address:

  socat OPEN:tmp.in UDP4:localhost:1337,shut-null

we can more clearly observe a sendmmsg() call failing after the
first datagram is delivered:

  [pid 278012] connect(173, 0x7fff96c95fc0, 16) = 0
  [...]
  [pid 278012] recvmmsg(173, 0x7fff96c96020, 1024, MSG_DONTWAIT, NULL) = -1 EAGAIN (Resource temporarily unavailable)
  [pid 278012] sendmmsg(173, 0x561c5ad0a720, 1, MSG_NOSIGNAL) = 1
  [...]
  [pid 278012] sendmmsg(173, 0x561c5ad0a720, 1, MSG_NOSIGNAL) = -1 ECONNREFUSED (Connection refused)

and, somewhat confusingly, after a connect() on the same socket
succeeded.

Until commit 4cdeeee9252a ("net: udp: prefer listeners bound to an
address"), the race between receive address change and lookup didn't
actually cause visible issues, because, once the lookup based on the
secondary hash chain failed, we would still attempt a lookup based on
the primary hash (destination port only), and find the socket with the
outdated secondary hash.

That change, however, dropped port-only lookups altogether as a side
effect, making the race visible.

To fix this, while avoiding the need to make address changes and
rehash atomic against lookups, reintroduce primary hash lookups as
fallback, if lookups based on four-tuple and secondary hashes fail.

To this end, introduce a simplified lookup implementation, which
doesn't take care of SO_REUSEPORT groups: if we have one, there are
multiple sockets that would match the four-tuple or secondary hash,
meaning that we can't run into this race at all.
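
The lookup order with the fallback can be sketched as a userspace
model. This is an illustration under simplifying assumptions, not the
kernel implementation: there is no four-tuple hash, no SO_REUSEPORT
handling and no RCU, and all names (demo_sock, demo_lookup() and so
on) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model only: names and layout are hypothetical, not kernel code. */
#define DEMO_SLOTS    16u
#define DEMO_ADDR_ANY 0u

struct demo_sock {
	uint32_t rcv_addr;       /* local address, DEMO_ADDR_ANY if unset */
	uint16_t port;           /* local port */
	uint32_t peer_addr;      /* remote address, 0 if not connected */
	struct demo_sock *next1; /* primary chain: hashed by port only */
	struct demo_sock *next2; /* secondary chain: hashed by address, port */
};

static struct demo_sock *demo_hash1[DEMO_SLOTS];
static struct demo_sock *demo_hash2[DEMO_SLOTS];

static unsigned int demo_slot1(uint16_t port)
{
	return port % DEMO_SLOTS;
}

static unsigned int demo_slot2(uint32_t addr, uint16_t port)
{
	return (addr ^ (addr >> 16) ^ port) % DEMO_SLOTS;
}

/* Link the socket into both chains, using its address at this moment. */
static void demo_insert(struct demo_sock *sk)
{
	unsigned int s1 = demo_slot1(sk->port);
	unsigned int s2 = demo_slot2(sk->rcv_addr, sk->port);

	sk->next1 = demo_hash1[s1];
	demo_hash1[s1] = sk;
	sk->next2 = demo_hash2[s2];
	demo_hash2[s2] = sk;
}

/* Return -1 if the socket can't receive this packet, else a score. */
static int demo_score(const struct demo_sock *sk, uint32_t saddr,
		      uint32_t daddr, uint16_t dport)
{
	int score = 0;

	if (sk->port != dport || sk->rcv_addr != daddr)
		return -1;
	if (sk->peer_addr) {
		if (sk->peer_addr != saddr)
			return -1;
		score++;
	}
	return score;
}

/* Secondary hash lookup: only the chain selected by (daddr, dport). */
static struct demo_sock *demo_lookup2(uint32_t saddr, uint32_t daddr,
				      uint16_t dport)
{
	struct demo_sock *sk, *result = NULL;
	int score, best = -1;

	for (sk = demo_hash2[demo_slot2(daddr, dport)]; sk; sk = sk->next2) {
		score = demo_score(sk, saddr, daddr, dport);
		if (score > best) {
			best = score;
			result = sk;
		}
	}
	return result;
}

/* Primary hash lookup: every socket on the port is scored. */
static struct demo_sock *demo_lookup1(uint32_t saddr, uint32_t daddr,
				      uint16_t dport)
{
	struct demo_sock *sk, *result = NULL;
	int score, best = -1;

	for (sk = demo_hash1[demo_slot1(dport)]; sk; sk = sk->next1) {
		score = demo_score(sk, saddr, daddr, dport);
		if (score > best) {
			best = score;
			result = sk;
		}
	}
	return result;
}

/* Specific address, then wildcard, then the optional primary fallback. */
static struct demo_sock *demo_lookup(uint32_t saddr, uint32_t daddr,
				     uint16_t dport, int use_fallback)
{
	struct demo_sock *sk = demo_lookup2(saddr, daddr, dport);

	if (!sk)
		sk = demo_lookup2(saddr, DEMO_ADDR_ANY, dport);
	if (!sk && use_fallback)
		sk = demo_lookup1(saddr, daddr, dport);
	return sk;
}
```

In this model, a socket whose rcv_addr changed after demo_insert() is
missed by both secondary lookups, but the primary chain still holds it,
so demo_lookup() with use_fallback set finds it without any
synchronisation between address change and rehash.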

v2:
  - instead of synchronising lookup operations against address change
    plus rehash, reintroduce a simplified version of the original
    primary hash lookup as fallback

v1:
  - fix build with CONFIG_IPV6=n: add ifdef around sk_v6_rcv_saddr
    usage (Kuniyuki Iwashima)
  - directly use sk_rcv_saddr for IPv4 receive addresses instead of
    fetching inet_rcv_saddr (Kuniyuki Iwashima)
  - move inet_update_saddr() to inet_hashtables.h and use that
    to set IPv4/IPv6 addresses as suitable (Kuniyuki Iwashima)
  - rebase onto net-next, update commit message accordingly

Reported-by: Ed Santiago <santiago@redhat.com>
Link: https://github.com/containers/podman/issues/24147
Analysed-by: David Gibson <david@gibson.dropbear.id.au>
Fixes: 30fff9231fad ("udp: bind() optimisation")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-12-23 11:39:55 +00:00


// SPDX-License-Identifier: GPL-2.0-or-later
/*
 *	UDP over IPv6
 *	Linux INET6 implementation
 *
 *	Authors:
 *	Pedro Roque		<roque@di.fc.ul.pt>
 *
 *	Based on linux/ipv4/udp.c
 *
 *	Fixes:
 *	Hideaki YOSHIFUJI	:	sin6_scope_id support
 *	YOSHIFUJI Hideaki @USAGI and:	Support IPV6_V6ONLY socket option, which
 *	Alexey Kuznetsov		allow both IPv4 and IPv6 sockets to bind
 *					a single port at the same time.
 *	Kazunori MIYAZAWA @USAGI:	change process style to use ip6_append_data
 *	YOSHIFUJI Hideaki @USAGI:	convert /proc/net/udp6 to seq_file.
 */
#include <linux/bpf-cgroup.h>
#include <linux/errno.h>
#include <linux/types.h>
#include <linux/socket.h>
#include <linux/sockios.h>
#include <linux/net.h>
#include <linux/in6.h>
#include <linux/netdevice.h>
#include <linux/if_arp.h>
#include <linux/ipv6.h>
#include <linux/icmpv6.h>
#include <linux/init.h>
#include <linux/module.h>
#include <linux/skbuff.h>
#include <linux/slab.h>
#include <linux/uaccess.h>
#include <linux/indirect_call_wrapper.h>
#include <trace/events/udp.h>
#include <net/addrconf.h>
#include <net/ndisc.h>
#include <net/protocol.h>
#include <net/transp_v6.h>
#include <net/ip6_route.h>
#include <net/raw.h>
#include <net/seg6.h>
#include <net/tcp_states.h>
#include <net/ip6_checksum.h>
#include <net/ip6_tunnel.h>
#include <net/xfrm.h>
#include <net/inet_hashtables.h>
#include <net/inet6_hashtables.h>
#include <net/busy_poll.h>
#include <net/sock_reuseport.h>
#include <net/gro.h>
#include <linux/proc_fs.h>
#include <linux/seq_file.h>
#include <trace/events/skb.h>
#include "udp_impl.h"
static void udpv6_destruct_sock(struct sock *sk)
{
udp_destruct_common(sk);
inet6_sock_destruct(sk);
}
int udpv6_init_sock(struct sock *sk)
{
udp_lib_init_sock(sk);
sk->sk_destruct = udpv6_destruct_sock;
set_bit(SOCK_SUPPORT_ZC, &sk->sk_socket->flags);
return 0;
}
INDIRECT_CALLABLE_SCOPE
u32 udp6_ehashfn(const struct net *net,
const struct in6_addr *laddr,
const u16 lport,
const struct in6_addr *faddr,
const __be16 fport)
{
u32 lhash, fhash;
net_get_random_once(&udp6_ehash_secret,
sizeof(udp6_ehash_secret));
net_get_random_once(&udp_ipv6_hash_secret,
sizeof(udp_ipv6_hash_secret));
lhash = (__force u32)laddr->s6_addr32[3];
fhash = __ipv6_addr_jhash(faddr, udp_ipv6_hash_secret);
return __inet6_ehashfn(lhash, lport, fhash, fport,
udp6_ehash_secret + net_hash_mix(net));
}
int udp_v6_get_port(struct sock *sk, unsigned short snum)
{
unsigned int hash2_nulladdr =
ipv6_portaddr_hash(sock_net(sk), &in6addr_any, snum);
unsigned int hash2_partial =
ipv6_portaddr_hash(sock_net(sk), &sk->sk_v6_rcv_saddr, 0);
/* precompute partial secondary hash */
udp_sk(sk)->udp_portaddr_hash = hash2_partial;
return udp_lib_get_port(sk, snum, hash2_nulladdr);
}

void udp_v6_rehash(struct sock *sk)
{
	u16 new_hash = ipv6_portaddr_hash(sock_net(sk),
					  &sk->sk_v6_rcv_saddr,
					  inet_sk(sk)->inet_num);
	u16 new_hash4;

	if (ipv6_addr_v4mapped(&sk->sk_v6_rcv_saddr)) {
		new_hash4 = udp_ehashfn(sock_net(sk),
					sk->sk_rcv_saddr, sk->sk_num,
					sk->sk_daddr, sk->sk_dport);
	} else {
		new_hash4 = udp6_ehashfn(sock_net(sk),
					 &sk->sk_v6_rcv_saddr, sk->sk_num,
					 &sk->sk_v6_daddr, sk->sk_dport);
	}

	udp_lib_rehash(sk, new_hash, new_hash4);
}

static int compute_score(struct sock *sk, const struct net *net,
			 const struct in6_addr *saddr, __be16 sport,
			 const struct in6_addr *daddr, unsigned short hnum,
			 int dif, int sdif)
{
	int bound_dev_if, score;
	struct inet_sock *inet;
	bool dev_match;

	if (!net_eq(sock_net(sk), net) ||
	    udp_sk(sk)->udp_port_hash != hnum ||
	    sk->sk_family != PF_INET6)
		return -1;

	if (!ipv6_addr_equal(&sk->sk_v6_rcv_saddr, daddr))
		return -1;

	score = 0;
	inet = inet_sk(sk);

	if (inet->inet_dport) {
		if (inet->inet_dport != sport)
			return -1;
		score++;
	}

	if (!ipv6_addr_any(&sk->sk_v6_daddr)) {
		if (!ipv6_addr_equal(&sk->sk_v6_daddr, saddr))
			return -1;
		score++;
	}

	bound_dev_if = READ_ONCE(sk->sk_bound_dev_if);
	dev_match = udp_sk_bound_dev_eq(net, bound_dev_if, dif, sdif);
	if (!dev_match)
		return -1;
	if (bound_dev_if)
		score++;

	if (READ_ONCE(sk->sk_incoming_cpu) == raw_smp_processor_id())
		score++;

	return score;
}

/**
 * udp6_lib_lookup1() - Simplified lookup using primary hash (destination port)
 * @net:	Network namespace
 * @saddr:	Source address, network order
 * @sport:	Source port, network order
 * @daddr:	Destination address, network order
 * @hnum:	Destination port, host order
 * @dif:	Destination interface index
 * @sdif:	Destination bridge port index, if relevant
 * @udptable:	Set of UDP hash tables
 *
 * Simplified lookup to be used as fallback if no sockets are found due to a
 * potential race between (receive) address change, and lookup happening before
 * the rehash operation. This function ignores SO_REUSEPORT groups while scoring
 * result sockets, because if we have one, we don't need the fallback at all.
 *
 * Called under rcu_read_lock().
 *
 * Return: socket with highest matching score if any, NULL if none
 */
static struct sock *udp6_lib_lookup1(const struct net *net,
				     const struct in6_addr *saddr, __be16 sport,
				     const struct in6_addr *daddr,
				     unsigned int hnum, int dif, int sdif,
				     const struct udp_table *udptable)
{
	unsigned int slot = udp_hashfn(net, hnum, udptable->mask);
	struct udp_hslot *hslot = &udptable->hash[slot];
	struct sock *sk, *result = NULL;
	int score, badness = 0;

	sk_for_each_rcu(sk, &hslot->head) {
		score = compute_score(sk, net,
				      saddr, sport, daddr, hnum, dif, sdif);
		if (score > badness) {
			result = sk;
			badness = score;
		}
	}

	return result;
}

/* called with rcu_read_lock() */
static struct sock *udp6_lib_lookup2(const struct net *net,
		const struct in6_addr *saddr, __be16 sport,
		const struct in6_addr *daddr, unsigned int hnum,
		int dif, int sdif, struct udp_hslot *hslot2,
		struct sk_buff *skb)
{
	struct sock *sk, *result;
	int score, badness;
	bool need_rescore;

	result = NULL;
	badness = -1;
	udp_portaddr_for_each_entry_rcu(sk, &hslot2->head) {
		need_rescore = false;
rescore:
		score = compute_score(need_rescore ? result : sk, net, saddr,
				      sport, daddr, hnum, dif, sdif);
		if (score > badness) {
			badness = score;

			if (need_rescore)
				continue;

			if (sk->sk_state == TCP_ESTABLISHED) {
				result = sk;
				continue;
			}

			result = inet6_lookup_reuseport(net, sk, skb, sizeof(struct udphdr),
							saddr, sport, daddr, hnum, udp6_ehashfn);
			if (!result) {
				result = sk;
				continue;
			}

			/* Fall back to scoring if group has connections */
			if (!reuseport_has_conns(sk))
				return result;

			/* Reuseport logic returned an error, keep original score. */
			if (IS_ERR(result))
				continue;

			/* compute_score is too long of a function to be
			 * inlined, and calling it again here yields
			 * measurable overhead for some
			 * workloads. Work around it by jumping
			 * backwards to rescore 'result'.
			 */
			need_rescore = true;
			goto rescore;
		}
	}
	return result;
}
#if IS_ENABLED(CONFIG_BASE_SMALL)
static struct sock *udp6_lib_lookup4(const struct net *net,
const struct in6_addr *saddr, __be16 sport,
const struct in6_addr *daddr,
unsigned int hnum, int dif, int sdif,
struct udp_table *udptable)
{
return NULL;
}
static void udp6_hash4(struct sock *sk)
{
}
#else /* !CONFIG_BASE_SMALL */
static struct sock *udp6_lib_lookup4(const struct net *net,
const struct in6_addr *saddr, __be16 sport,
const struct in6_addr *daddr,
unsigned int hnum, int dif, int sdif,
struct udp_table *udptable)
{
const __portpair ports = INET_COMBINED_PORTS(sport, hnum);
const struct hlist_nulls_node *node;
struct udp_hslot *hslot4;
unsigned int hash4, slot;
struct udp_sock *up;
struct sock *sk;
hash4 = udp6_ehashfn(net, daddr, hnum, saddr, sport);
slot = hash4 & udptable->mask;
hslot4 = &udptable->hash4[slot];
begin:
udp_lrpa_for_each_entry_rcu(up, node, &hslot4->nulls_head) {
sk = (struct sock *)up;
if (inet6_match(net, sk, saddr, daddr, ports, dif, sdif))
return sk;
}
/* if the nulls value we got at the end of this lookup is not the
* expected one, we must restart lookup. We probably met an item that
* was moved to another chain due to rehash.
*/
if (get_nulls_value(node) != slot)
goto begin;
return NULL;
}
static void udp6_hash4(struct sock *sk)
{
struct net *net = sock_net(sk);
unsigned int hash;
if (ipv6_addr_v4mapped(&sk->sk_v6_rcv_saddr)) {
udp4_hash4(sk);
return;
}
if (sk_unhashed(sk) || ipv6_addr_any(&sk->sk_v6_rcv_saddr))
return;
hash = udp6_ehashfn(net, &sk->sk_v6_rcv_saddr, sk->sk_num,
&sk->sk_v6_daddr, sk->sk_dport);
udp_lib_hash4(sk, hash);
}
#endif /* CONFIG_BASE_SMALL */

/* rcu_read_lock() must be held */
struct sock *__udp6_lib_lookup(const struct net *net,
			       const struct in6_addr *saddr, __be16 sport,
			       const struct in6_addr *daddr, __be16 dport,
			       int dif, int sdif, struct udp_table *udptable,
			       struct sk_buff *skb)
{
	unsigned short hnum = ntohs(dport);
	struct udp_hslot *hslot2;
	struct sock *result, *sk;
	unsigned int hash2;

	hash2 = ipv6_portaddr_hash(net, daddr, hnum);
	hslot2 = udp_hashslot2(udptable, hash2);

	if (udp_has_hash4(hslot2)) {
		result = udp6_lib_lookup4(net, saddr, sport, daddr, hnum,
					  dif, sdif, udptable);
		if (result) /* udp6_lib_lookup4 returns sk or NULL */
			return result;
	}

	/* Lookup connected or non-wildcard sockets */
	result = udp6_lib_lookup2(net, saddr, sport,
				  daddr, hnum, dif, sdif,
				  hslot2, skb);
	if (!IS_ERR_OR_NULL(result) && result->sk_state == TCP_ESTABLISHED)
		goto done;

	/* Lookup redirect from BPF */
	if (static_branch_unlikely(&bpf_sk_lookup_enabled) &&
	    udptable == net->ipv4.udp_table) {
		sk = inet6_lookup_run_sk_lookup(net, IPPROTO_UDP, skb, sizeof(struct udphdr),
						saddr, sport, daddr, hnum, dif,
						udp6_ehashfn);
		if (sk) {
			result = sk;
			goto done;
		}
	}

	/* Got non-wildcard socket or error on first lookup */
	if (result)
		goto done;

	/* Lookup wildcard sockets */
	hash2 = ipv6_portaddr_hash(net, &in6addr_any, hnum);
	hslot2 = udp_hashslot2(udptable, hash2);

	result = udp6_lib_lookup2(net, saddr, sport,
				  &in6addr_any, hnum, dif, sdif,
				  hslot2, skb);
	if (!IS_ERR_OR_NULL(result))
		goto done;

	/* Cover address change/lookup/rehash race: see __udp4_lib_lookup() */
	result = udp6_lib_lookup1(net, saddr, sport, daddr, hnum, dif, sdif,
				  udptable);

done:
	if (IS_ERR(result))
		return NULL;
	return result;
}
EXPORT_SYMBOL_GPL(__udp6_lib_lookup);
static struct sock *__udp6_lib_lookup_skb(struct sk_buff *skb,
__be16 sport, __be16 dport,
struct udp_table *udptable)
{
const struct ipv6hdr *iph = ipv6_hdr(skb);
return __udp6_lib_lookup(dev_net(skb->dev), &iph->saddr, sport,
&iph->daddr, dport, inet6_iif(skb),
inet6_sdif(skb), udptable, skb);
}
struct sock *udp6_lib_lookup_skb(const struct sk_buff *skb,
__be16 sport, __be16 dport)
{
const u16 offset = NAPI_GRO_CB(skb)->network_offsets[skb->encapsulation];
const struct ipv6hdr *iph = (struct ipv6hdr *)(skb->data + offset);
struct net *net = dev_net(skb->dev);
int iif, sdif;
inet6_get_iif_sdif(skb, &iif, &sdif);
return __udp6_lib_lookup(net, &iph->saddr, sport,
&iph->daddr, dport, iif,
sdif, net->ipv4.udp_table, NULL);
}
/* Must be called under rcu_read_lock().
* Does increment socket refcount.
*/
#if IS_ENABLED(CONFIG_NF_TPROXY_IPV6) || IS_ENABLED(CONFIG_NF_SOCKET_IPV6)
struct sock *udp6_lib_lookup(const struct net *net, const struct in6_addr *saddr, __be16 sport,
const struct in6_addr *daddr, __be16 dport, int dif)
{
struct sock *sk;
sk = __udp6_lib_lookup(net, saddr, sport, daddr, dport,
dif, 0, net->ipv4.udp_table, NULL);
if (sk && !refcount_inc_not_zero(&sk->sk_refcnt))
sk = NULL;
return sk;
}
EXPORT_SYMBOL_GPL(udp6_lib_lookup);
#endif
/* do not use the scratch area len for jumbogram: their length exceeds the
* scratch area space; note that the IP6CB flags is still in the first
* cacheline, so checking for jumbograms is cheap
*/
static int udp6_skb_len(struct sk_buff *skb)
{
return unlikely(inet6_is_jumbogram(skb)) ? skb->len : udp_skb_len(skb);
}
/*
* This should be easy, if there is something there we
* return it, otherwise we block.
*/
int udpv6_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
int flags, int *addr_len)
{
struct ipv6_pinfo *np = inet6_sk(sk);
struct inet_sock *inet = inet_sk(sk);
struct sk_buff *skb;
unsigned int ulen, copied;
int off, err, peeking = flags & MSG_PEEK;
int is_udplite = IS_UDPLITE(sk);
struct udp_mib __percpu *mib;
bool checksum_valid = false;
int is_udp4;
if (flags & MSG_ERRQUEUE)
return ipv6_recv_error(sk, msg, len, addr_len);
if (np->rxpmtu && np->rxopt.bits.rxpmtu)
return ipv6_recv_rxpmtu(sk, msg, len, addr_len);
try_again:
off = sk_peek_offset(sk, flags);
skb = __skb_recv_udp(sk, flags, &off, &err);
if (!skb)
return err;
ulen = udp6_skb_len(skb);
copied = len;
if (copied > ulen - off)
copied = ulen - off;
else if (copied < ulen)
msg->msg_flags |= MSG_TRUNC;
is_udp4 = (skb->protocol == htons(ETH_P_IP));
mib = __UDPX_MIB(sk, is_udp4);
/*
* If checksum is needed at all, try to do it while copying the
* data. If the data is truncated, or if we only want a partial
* coverage checksum (UDP-Lite), do it before the copy.
*/
if (copied < ulen || peeking ||
(is_udplite && UDP_SKB_CB(skb)->partial_cov)) {
checksum_valid = udp_skb_csum_unnecessary(skb) ||
!__udp_lib_checksum_complete(skb);
if (!checksum_valid)
goto csum_copy_err;
}
if (checksum_valid || udp_skb_csum_unnecessary(skb)) {
if (udp_skb_is_linear(skb))
err = copy_linear_skb(skb, copied, off, &msg->msg_iter);
else
err = skb_copy_datagram_msg(skb, off, msg, copied);
} else {
err = skb_copy_and_csum_datagram_msg(skb, off, msg);
if (err == -EINVAL)
goto csum_copy_err;
}
if (unlikely(err)) {
if (!peeking) {
atomic_inc(&sk->sk_drops);
SNMP_INC_STATS(mib, UDP_MIB_INERRORS);
}
kfree_skb(skb);
return err;
}
if (!peeking)
SNMP_INC_STATS(mib, UDP_MIB_INDATAGRAMS);
sock_recv_cmsgs(msg, sk, skb);
/* Copy the address. */
if (msg->msg_name) {
DECLARE_SOCKADDR(struct sockaddr_in6 *, sin6, msg->msg_name);
sin6->sin6_family = AF_INET6;
sin6->sin6_port = udp_hdr(skb)->source;
sin6->sin6_flowinfo = 0;
if (is_udp4) {
ipv6_addr_set_v4mapped(ip_hdr(skb)->saddr,
&sin6->sin6_addr);
sin6->sin6_scope_id = 0;
} else {
sin6->sin6_addr = ipv6_hdr(skb)->saddr;
sin6->sin6_scope_id =
ipv6_iface_scope_id(&sin6->sin6_addr,
inet6_iif(skb));
}
*addr_len = sizeof(*sin6);
BPF_CGROUP_RUN_PROG_UDP6_RECVMSG_LOCK(sk,
(struct sockaddr *)sin6,
addr_len);
}
if (udp_test_bit(GRO_ENABLED, sk))
udp_cmsg_recv(msg, sk, skb);
if (np->rxopt.all)
ip6_datagram_recv_common_ctl(sk, msg, skb);
if (is_udp4) {
if (inet_cmsg_flags(inet))
ip_cmsg_recv_offset(msg, sk, skb,
sizeof(struct udphdr), off);
} else {
if (np->rxopt.all)
ip6_datagram_recv_specific_ctl(sk, msg, skb);
}
err = copied;
if (flags & MSG_TRUNC)
err = ulen;
skb_consume_udp(sk, skb, peeking ? -err : err);
return err;
csum_copy_err:
if (!__sk_queue_drop_skb(sk, &udp_sk(sk)->reader_queue, skb, flags,
udp_skb_destructor)) {
SNMP_INC_STATS(mib, UDP_MIB_CSUMERRORS);
SNMP_INC_STATS(mib, UDP_MIB_INERRORS);
}
kfree_skb(skb);
/* starting over for a new packet, but check if we need to yield */
cond_resched();
msg->msg_flags &= ~MSG_TRUNC;
goto try_again;
}
DECLARE_STATIC_KEY_FALSE(udpv6_encap_needed_key);
void udpv6_encap_enable(void)
{
static_branch_inc(&udpv6_encap_needed_key);
}
EXPORT_SYMBOL(udpv6_encap_enable);
/* Handler for tunnels with arbitrary destination ports: no socket lookup, go
* through error handlers in encapsulations looking for a match.
*/
static int __udp6_lib_err_encap_no_sk(struct sk_buff *skb,
struct inet6_skb_parm *opt,
u8 type, u8 code, int offset, __be32 info)
{
int i;
for (i = 0; i < MAX_IPTUN_ENCAP_OPS; i++) {
int (*handler)(struct sk_buff *skb, struct inet6_skb_parm *opt,
u8 type, u8 code, int offset, __be32 info);
const struct ip6_tnl_encap_ops *encap;
encap = rcu_dereference(ip6tun_encaps[i]);
if (!encap)
continue;
handler = encap->err_handler;
if (handler && !handler(skb, opt, type, code, offset, info))
return 0;
}
return -ENOENT;
}
/* Try to match ICMP errors to UDP tunnels by looking up a socket without
* reversing source and destination port: this will match tunnels that force the
* same destination port on both endpoints (e.g. VXLAN, GENEVE). Note that
* lwtunnels might actually break this assumption by being configured with
* different destination ports on endpoints, in this case we won't be able to
* trace ICMP messages back to them.
*
* If this doesn't match any socket, probe tunnels with arbitrary destination
* ports (e.g. FoU, GUE): there, the receiving socket is useless, as the port
* we've sent packets to won't necessarily match the local destination port.
*
* Then ask the tunnel implementation to match the error against a valid
* association.
*
* Return an error if we can't find a match, the socket if we need further
* processing, zero otherwise.
*/
static struct sock *__udp6_lib_err_encap(struct net *net,
const struct ipv6hdr *hdr, int offset,
struct udphdr *uh,
struct udp_table *udptable,
struct sock *sk,
struct sk_buff *skb,
struct inet6_skb_parm *opt,
u8 type, u8 code, __be32 info)
{
int (*lookup)(struct sock *sk, struct sk_buff *skb);
int network_offset, transport_offset;
struct udp_sock *up;
network_offset = skb_network_offset(skb);
transport_offset = skb_transport_offset(skb);
/* Network header needs to point to the outer IPv6 header inside ICMP */
skb_reset_network_header(skb);
/* Transport header needs to point to the UDP header */
skb_set_transport_header(skb, offset);
if (sk) {
up = udp_sk(sk);
lookup = READ_ONCE(up->encap_err_lookup);
if (lookup && lookup(sk, skb))
sk = NULL;
goto out;
}
sk = __udp6_lib_lookup(net, &hdr->daddr, uh->source,
&hdr->saddr, uh->dest,
inet6_iif(skb), 0, udptable, skb);
if (sk) {
up = udp_sk(sk);
lookup = READ_ONCE(up->encap_err_lookup);
if (!lookup || lookup(sk, skb))
sk = NULL;
}
out:
if (!sk) {
sk = ERR_PTR(__udp6_lib_err_encap_no_sk(skb, opt, type, code,
offset, info));
}
skb_set_transport_header(skb, transport_offset);
skb_set_network_header(skb, network_offset);
return sk;
}
int __udp6_lib_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
u8 type, u8 code, int offset, __be32 info,
struct udp_table *udptable)
{
struct ipv6_pinfo *np;
const struct ipv6hdr *hdr = (const struct ipv6hdr *)skb->data;
const struct in6_addr *saddr = &hdr->saddr;
const struct in6_addr *daddr = seg6_get_daddr(skb, opt) ? : &hdr->daddr;
struct udphdr *uh = (struct udphdr *)(skb->data+offset);
bool tunnel = false;
struct sock *sk;
int harderr;
int err;
struct net *net = dev_net(skb->dev);
sk = __udp6_lib_lookup(net, daddr, uh->dest, saddr, uh->source,
inet6_iif(skb), inet6_sdif(skb), udptable, NULL);
if (!sk || READ_ONCE(udp_sk(sk)->encap_type)) {
/* No socket for error: try tunnels before discarding */
if (static_branch_unlikely(&udpv6_encap_needed_key)) {
sk = __udp6_lib_err_encap(net, hdr, offset, uh,
udptable, sk, skb,
opt, type, code, info);
if (!sk)
return 0;
} else
sk = ERR_PTR(-ENOENT);
if (IS_ERR(sk)) {
__ICMP6_INC_STATS(net, __in6_dev_get(skb->dev),
ICMP6_MIB_INERRORS);
return PTR_ERR(sk);
}
tunnel = true;
}
harderr = icmpv6_err_convert(type, code, &err);
np = inet6_sk(sk);
if (type == ICMPV6_PKT_TOOBIG) {
if (!ip6_sk_accept_pmtu(sk))
goto out;
ip6_sk_update_pmtu(skb, sk, info);
if (READ_ONCE(np->pmtudisc) != IPV6_PMTUDISC_DONT)
harderr = 1;
}
if (type == NDISC_REDIRECT) {
if (tunnel) {
ip6_redirect(skb, sock_net(sk), inet6_iif(skb),
READ_ONCE(sk->sk_mark), sk->sk_uid);
} else {
ip6_sk_redirect(skb, sk);
}
goto out;
}
/* Tunnels don't have an application socket: don't pass errors back */
if (tunnel) {
if (udp_sk(sk)->encap_err_rcv)
udp_sk(sk)->encap_err_rcv(sk, skb, err, uh->dest,
ntohl(info), (u8 *)(uh+1));
goto out;
}
if (!inet6_test_bit(RECVERR6, sk)) {
if (!harderr || sk->sk_state != TCP_ESTABLISHED)
goto out;
} else {
ipv6_icmp_error(sk, skb, err, uh->dest, ntohl(info), (u8 *)(uh+1));
}
sk->sk_err = err;
sk_error_report(sk);
out:
return 0;
}
static int __udpv6_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
{
int rc;
if (!ipv6_addr_any(&sk->sk_v6_daddr)) {
sock_rps_save_rxhash(sk, skb);
sk_mark_napi_id(sk, skb);
sk_incoming_cpu_update(sk);
} else {
sk_mark_napi_id_once(sk, skb);
}
rc = __udp_enqueue_schedule_skb(sk, skb);
if (rc < 0) {
int is_udplite = IS_UDPLITE(sk);
enum skb_drop_reason drop_reason;
/* Note that an ENOMEM error is charged twice */
if (rc == -ENOMEM) {
UDP6_INC_STATS(sock_net(sk),
UDP_MIB_RCVBUFERRORS, is_udplite);
drop_reason = SKB_DROP_REASON_SOCKET_RCVBUFF;
} else {
UDP6_INC_STATS(sock_net(sk),
UDP_MIB_MEMERRORS, is_udplite);
drop_reason = SKB_DROP_REASON_PROTO_MEM;
}
UDP6_INC_STATS(sock_net(sk), UDP_MIB_INERRORS, is_udplite);
trace_udp_fail_queue_rcv_skb(rc, sk, skb);
sk_skb_reason_drop(sk, skb, drop_reason);
return -1;
}
return 0;
}
static __inline__ int udpv6_err(struct sk_buff *skb,
struct inet6_skb_parm *opt, u8 type,
u8 code, int offset, __be32 info)
{
return __udp6_lib_err(skb, opt, type, code, offset, info,
dev_net(skb->dev)->ipv4.udp_table);
}
static int udpv6_queue_rcv_one_skb(struct sock *sk, struct sk_buff *skb)
{
enum skb_drop_reason drop_reason = SKB_DROP_REASON_NOT_SPECIFIED;
struct udp_sock *up = udp_sk(sk);
int is_udplite = IS_UDPLITE(sk);
if (!xfrm6_policy_check(sk, XFRM_POLICY_IN, skb)) {
drop_reason = SKB_DROP_REASON_XFRM_POLICY;
goto drop;
}
nf_reset_ct(skb);
if (static_branch_unlikely(&udpv6_encap_needed_key) &&
READ_ONCE(up->encap_type)) {
int (*encap_rcv)(struct sock *sk, struct sk_buff *skb);
/*
* This is an encapsulation socket so pass the skb to
* the socket's udp_encap_rcv() hook. Otherwise, just
* fall through and pass this up the UDP socket.
* up->encap_rcv() returns the following value:
* =0 if skb was successfully passed to the encap
* handler or was discarded by it.
* >0 if skb should be passed on to UDP.
* <0 if skb should be resubmitted as proto -N
*/
/* if we're overly short, let UDP handle it */
encap_rcv = READ_ONCE(up->encap_rcv);
if (encap_rcv) {
int ret;
/* Verify checksum before giving to encap */
if (udp_lib_checksum_complete(skb))
goto csum_error;
ret = encap_rcv(sk, skb);
if (ret <= 0) {
__UDP6_INC_STATS(sock_net(sk),
UDP_MIB_INDATAGRAMS,
is_udplite);
return -ret;
}
}
/* FALLTHROUGH -- it's a UDP Packet */
}
/*
* UDP-Lite specific tests, ignored on UDP sockets (see net/ipv4/udp.c).
*/
if (udp_test_bit(UDPLITE_RECV_CC, sk) && UDP_SKB_CB(skb)->partial_cov) {
u16 pcrlen = READ_ONCE(up->pcrlen);
if (pcrlen == 0) { /* full coverage was set */
net_dbg_ratelimited("UDPLITE6: partial coverage %d while full coverage %d requested\n",
UDP_SKB_CB(skb)->cscov, skb->len);
goto drop;
}
if (UDP_SKB_CB(skb)->cscov < pcrlen) {
net_dbg_ratelimited("UDPLITE6: coverage %d too small, need min %d\n",
UDP_SKB_CB(skb)->cscov, pcrlen);
goto drop;
}
}
prefetch(&sk->sk_rmem_alloc);
if (rcu_access_pointer(sk->sk_filter) &&
udp_lib_checksum_complete(skb))
goto csum_error;
if (sk_filter_trim_cap(sk, skb, sizeof(struct udphdr))) {
drop_reason = SKB_DROP_REASON_SOCKET_FILTER;
goto drop;
}
udp_csum_pull_header(skb);
skb_dst_drop(skb);
return __udpv6_queue_rcv_skb(sk, skb);
csum_error:
drop_reason = SKB_DROP_REASON_UDP_CSUM;
__UDP6_INC_STATS(sock_net(sk), UDP_MIB_CSUMERRORS, is_udplite);
drop:
__UDP6_INC_STATS(sock_net(sk), UDP_MIB_INERRORS, is_udplite);
atomic_inc(&sk->sk_drops);
sk_skb_reason_drop(sk, skb, drop_reason);
return -1;
}
static int udpv6_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
{
struct sk_buff *next, *segs;
int ret;
if (likely(!udp_unexpected_gso(sk, skb)))
return udpv6_queue_rcv_one_skb(sk, skb);
__skb_push(skb, -skb_mac_offset(skb));
segs = udp_rcv_segment(sk, skb, false);
skb_list_walk_safe(segs, skb, next) {
__skb_pull(skb, skb_transport_offset(skb));
udp_post_segment_fix_csum(skb);
ret = udpv6_queue_rcv_one_skb(sk, skb);
if (ret > 0)
ip6_protocol_deliver_rcu(dev_net(skb->dev), skb, ret,
true);
}
return 0;
}
static bool __udp_v6_is_mcast_sock(struct net *net, const struct sock *sk,
__be16 loc_port, const struct in6_addr *loc_addr,
__be16 rmt_port, const struct in6_addr *rmt_addr,
int dif, int sdif, unsigned short hnum)
{
const struct inet_sock *inet = inet_sk(sk);
if (!net_eq(sock_net(sk), net))
return false;
if (udp_sk(sk)->udp_port_hash != hnum ||
sk->sk_family != PF_INET6 ||
(inet->inet_dport && inet->inet_dport != rmt_port) ||
(!ipv6_addr_any(&sk->sk_v6_daddr) &&
!ipv6_addr_equal(&sk->sk_v6_daddr, rmt_addr)) ||
!udp_sk_bound_dev_eq(net, READ_ONCE(sk->sk_bound_dev_if), dif, sdif) ||
(!ipv6_addr_any(&sk->sk_v6_rcv_saddr) &&
!ipv6_addr_equal(&sk->sk_v6_rcv_saddr, loc_addr)))
return false;
if (!inet6_mc_check(sk, loc_addr, rmt_addr))
return false;
return true;
}
static void udp6_csum_zero_error(struct sk_buff *skb)
{
/* RFC 2460 section 8.1 says that we SHOULD log
* this error. Well, it is reasonable.
*/
net_dbg_ratelimited("IPv6: udp checksum is 0 for [%pI6c]:%u->[%pI6c]:%u\n",
&ipv6_hdr(skb)->saddr, ntohs(udp_hdr(skb)->source),
&ipv6_hdr(skb)->daddr, ntohs(udp_hdr(skb)->dest));
}
/*
* Note: called only from the BH handler context,
* so we don't need to lock the hashes.
*/
static int __udp6_lib_mcast_deliver(struct net *net, struct sk_buff *skb,
const struct in6_addr *saddr, const struct in6_addr *daddr,
struct udp_table *udptable, int proto)
{
struct sock *sk, *first = NULL;
const struct udphdr *uh = udp_hdr(skb);
unsigned short hnum = ntohs(uh->dest);
struct udp_hslot *hslot = udp_hashslot(udptable, net, hnum);
unsigned int offset = offsetof(typeof(*sk), sk_node);
unsigned int hash2 = 0, hash2_any = 0, use_hash2 = (hslot->count > 10);
int dif = inet6_iif(skb);
int sdif = inet6_sdif(skb);
struct hlist_node *node;
struct sk_buff *nskb;
if (use_hash2) {
hash2_any = ipv6_portaddr_hash(net, &in6addr_any, hnum) &
udptable->mask;
hash2 = ipv6_portaddr_hash(net, daddr, hnum) & udptable->mask;
start_lookup:
hslot = &udptable->hash2[hash2].hslot;
offset = offsetof(typeof(*sk), __sk_common.skc_portaddr_node);
}
sk_for_each_entry_offset_rcu(sk, node, &hslot->head, offset) {
if (!__udp_v6_is_mcast_sock(net, sk, uh->dest, daddr,
uh->source, saddr, dif, sdif,
hnum))
continue;
/* If the checksum is zero and zero-checksum reception
* (no_check6_rx) is not enabled on the socket, skip it.
*/
if (!uh->check && !udp_get_no_check6_rx(sk))
continue;
if (!first) {
first = sk;
continue;
}
nskb = skb_clone(skb, GFP_ATOMIC);
if (unlikely(!nskb)) {
atomic_inc(&sk->sk_drops);
__UDP6_INC_STATS(net, UDP_MIB_RCVBUFERRORS,
IS_UDPLITE(sk));
__UDP6_INC_STATS(net, UDP_MIB_INERRORS,
IS_UDPLITE(sk));
continue;
}
if (udpv6_queue_rcv_skb(sk, nskb) > 0)
consume_skb(nskb);
}
/* Also lookup *:port if we are using hash2 and haven't done so yet. */
if (use_hash2 && hash2 != hash2_any) {
hash2 = hash2_any;
goto start_lookup;
}
if (first) {
if (udpv6_queue_rcv_skb(first, skb) > 0)
consume_skb(skb);
} else {
kfree_skb(skb);
__UDP6_INC_STATS(net, UDP_MIB_IGNOREDMULTI,
proto == IPPROTO_UDPLITE);
}
return 0;
}
static void udp6_sk_rx_dst_set(struct sock *sk, struct dst_entry *dst)
{
if (udp_sk_rx_dst_set(sk, dst))
sk->sk_rx_dst_cookie = rt6_get_cookie(dst_rt6_info(dst));
}
/* wrapper for udp_queue_rcv_skb taking care of csum conversion and
* return code conversion for ip layer consumption
*/
static int udp6_unicast_rcv_skb(struct sock *sk, struct sk_buff *skb,
struct udphdr *uh)
{
int ret;
if (inet_get_convert_csum(sk) && uh->check && !IS_UDPLITE(sk))
skb_checksum_try_convert(skb, IPPROTO_UDP, ip6_compute_pseudo);
ret = udpv6_queue_rcv_skb(sk, skb);
/* a return value > 0 means to resubmit the input */
if (ret > 0)
return ret;
return 0;
}
int __udp6_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
int proto)
{
enum skb_drop_reason reason = SKB_DROP_REASON_NOT_SPECIFIED;
const struct in6_addr *saddr, *daddr;
struct net *net = dev_net(skb->dev);
struct sock *sk = NULL;
struct udphdr *uh;
bool refcounted;
u32 ulen = 0;
if (!pskb_may_pull(skb, sizeof(struct udphdr)))
goto discard;
saddr = &ipv6_hdr(skb)->saddr;
daddr = &ipv6_hdr(skb)->daddr;
uh = udp_hdr(skb);
ulen = ntohs(uh->len);
if (ulen > skb->len)
goto short_packet;
if (proto == IPPROTO_UDP) {
/* UDP validates ulen. */
/* Check for jumbo payload */
if (ulen == 0)
ulen = skb->len;
if (ulen < sizeof(*uh))
goto short_packet;
if (ulen < skb->len) {
if (pskb_trim_rcsum(skb, ulen))
goto short_packet;
saddr = &ipv6_hdr(skb)->saddr;
daddr = &ipv6_hdr(skb)->daddr;
uh = udp_hdr(skb);
}
}
if (udp6_csum_init(skb, uh, proto))
goto csum_error;
/* Check if the socket is already available, e.g. due to early demux */
sk = inet6_steal_sock(net, skb, sizeof(struct udphdr), saddr, uh->source, daddr, uh->dest,
&refcounted, udp6_ehashfn);
if (IS_ERR(sk))
goto no_sk;
if (sk) {
struct dst_entry *dst = skb_dst(skb);
int ret;
if (unlikely(rcu_dereference(sk->sk_rx_dst) != dst))
udp6_sk_rx_dst_set(sk, dst);
if (!uh->check && !udp_get_no_check6_rx(sk)) {
if (refcounted)
sock_put(sk);
goto report_csum_error;
}
ret = udp6_unicast_rcv_skb(sk, skb, uh);
if (refcounted)
sock_put(sk);
return ret;
}
/*
* Multicast receive code
*/
if (ipv6_addr_is_multicast(daddr))
return __udp6_lib_mcast_deliver(net, skb,
saddr, daddr, udptable, proto);
/* Unicast */
sk = __udp6_lib_lookup_skb(skb, uh->source, uh->dest, udptable);
if (sk) {
if (!uh->check && !udp_get_no_check6_rx(sk))
goto report_csum_error;
return udp6_unicast_rcv_skb(sk, skb, uh);
}
no_sk:
reason = SKB_DROP_REASON_NO_SOCKET;
if (!uh->check)
goto report_csum_error;
if (!xfrm6_policy_check(NULL, XFRM_POLICY_IN, skb))
goto discard;
nf_reset_ct(skb);
if (udp_lib_checksum_complete(skb))
goto csum_error;
__UDP6_INC_STATS(net, UDP_MIB_NOPORTS, proto == IPPROTO_UDPLITE);
icmpv6_send(skb, ICMPV6_DEST_UNREACH, ICMPV6_PORT_UNREACH, 0);
sk_skb_reason_drop(sk, skb, reason);
return 0;
short_packet:
if (reason == SKB_DROP_REASON_NOT_SPECIFIED)
reason = SKB_DROP_REASON_PKT_TOO_SMALL;
net_dbg_ratelimited("UDP%sv6: short packet: From [%pI6c]:%u %d/%d to [%pI6c]:%u\n",
proto == IPPROTO_UDPLITE ? "-Lite" : "",
saddr, ntohs(uh->source),
ulen, skb->len,
daddr, ntohs(uh->dest));
goto discard;
report_csum_error:
udp6_csum_zero_error(skb);
csum_error:
if (reason == SKB_DROP_REASON_NOT_SPECIFIED)
reason = SKB_DROP_REASON_UDP_CSUM;
__UDP6_INC_STATS(net, UDP_MIB_CSUMERRORS, proto == IPPROTO_UDPLITE);
discard:
__UDP6_INC_STATS(net, UDP_MIB_INERRORS, proto == IPPROTO_UDPLITE);
sk_skb_reason_drop(sk, skb, reason);
return 0;
}
static struct sock *__udp6_lib_demux_lookup(struct net *net,
__be16 loc_port, const struct in6_addr *loc_addr,
__be16 rmt_port, const struct in6_addr *rmt_addr,
int dif, int sdif)
{
struct udp_table *udptable = net->ipv4.udp_table;
unsigned short hnum = ntohs(loc_port);
struct udp_hslot *hslot2;
unsigned int hash2;
__portpair ports;
struct sock *sk;
hash2 = ipv6_portaddr_hash(net, loc_addr, hnum);
hslot2 = udp_hashslot2(udptable, hash2);
ports = INET_COMBINED_PORTS(rmt_port, hnum);
udp_portaddr_for_each_entry_rcu(sk, &hslot2->head) {
if (sk->sk_state == TCP_ESTABLISHED &&
inet6_match(net, sk, rmt_addr, loc_addr, ports, dif, sdif))
return sk;
/* Only check first socket in chain */
break;
}
return NULL;
}
void udp_v6_early_demux(struct sk_buff *skb)
{
struct net *net = dev_net(skb->dev);
const struct udphdr *uh;
struct sock *sk;
struct dst_entry *dst;
int dif = skb->dev->ifindex;
int sdif = inet6_sdif(skb);
if (!pskb_may_pull(skb, skb_transport_offset(skb) +
sizeof(struct udphdr)))
return;
uh = udp_hdr(skb);
if (skb->pkt_type == PACKET_HOST)
sk = __udp6_lib_demux_lookup(net, uh->dest,
&ipv6_hdr(skb)->daddr,
uh->source, &ipv6_hdr(skb)->saddr,
dif, sdif);
else
return;
if (!sk)
return;
skb->sk = sk;
DEBUG_NET_WARN_ON_ONCE(sk_is_refcounted(sk));
skb->destructor = sock_pfree;
dst = rcu_dereference(sk->sk_rx_dst);
if (dst)
dst = dst_check(dst, sk->sk_rx_dst_cookie);
if (dst) {
/* set noref for now.
* any place which wants to hold dst has to call
* dst_hold_safe()
*/
skb_dst_set_noref(skb, dst);
}
}
INDIRECT_CALLABLE_SCOPE int udpv6_rcv(struct sk_buff *skb)
{
return __udp6_lib_rcv(skb, dev_net(skb->dev)->ipv4.udp_table, IPPROTO_UDP);
}
/*
* Throw away all pending data and cancel the corking. Socket is locked.
*/
static void udp_v6_flush_pending_frames(struct sock *sk)
{
struct udp_sock *up = udp_sk(sk);
if (up->pending == AF_INET)
udp_flush_pending_frames(sk);
else if (up->pending) {
up->len = 0;
WRITE_ONCE(up->pending, 0);
ip6_flush_pending_frames(sk);
}
}
static int udpv6_pre_connect(struct sock *sk, struct sockaddr *uaddr,
int addr_len)
{
if (addr_len < offsetofend(struct sockaddr, sa_family))
return -EINVAL;
/* The following checks are replicated from __ip6_datagram_connect()
* and intended to prevent BPF program called below from accessing
* bytes that are out of the bound specified by user in addr_len.
*/
if (uaddr->sa_family == AF_INET) {
if (ipv6_only_sock(sk))
return -EAFNOSUPPORT;
return udp_pre_connect(sk, uaddr, addr_len);
}
if (addr_len < SIN6_LEN_RFC2133)
return -EINVAL;
return BPF_CGROUP_RUN_PROG_INET6_CONNECT_LOCK(sk, uaddr, &addr_len);
}
static int udpv6_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
{
int res;
lock_sock(sk);
res = __ip6_datagram_connect(sk, uaddr, addr_len);
if (!res)
udp6_hash4(sk);
release_sock(sk);
return res;
}
/**
* udp6_hwcsum_outgoing - handle outgoing HW checksumming
* @sk: socket we are sending on
* @skb: sk_buff containing the filled-in UDP header
* (checksum field must be zeroed out)
* @saddr: source address
* @daddr: destination address
* @len: length of packet
*/
static void udp6_hwcsum_outgoing(struct sock *sk, struct sk_buff *skb,
const struct in6_addr *saddr,
const struct in6_addr *daddr, int len)
{
unsigned int offset;
struct udphdr *uh = udp_hdr(skb);
struct sk_buff *frags = skb_shinfo(skb)->frag_list;
__wsum csum = 0;
if (!frags) {
/* Only one fragment on the socket. */
skb->csum_start = skb_transport_header(skb) - skb->head;
skb->csum_offset = offsetof(struct udphdr, check);
uh->check = ~csum_ipv6_magic(saddr, daddr, len, IPPROTO_UDP, 0);
} else {
/*
* HW checksum offload won't work here: the datagram spans two
* or more fragments, so the checksums of all the sk_buffs
* have to be summed together in software.
*/
offset = skb_transport_offset(skb);
skb->csum = skb_checksum(skb, offset, skb->len - offset, 0);
csum = skb->csum;
skb->ip_summed = CHECKSUM_NONE;
do {
csum = csum_add(csum, frags->csum);
} while ((frags = frags->next));
uh->check = csum_ipv6_magic(saddr, daddr, len, IPPROTO_UDP,
csum);
if (uh->check == 0)
uh->check = CSUM_MANGLED_0;
}
}
/*
* Sending
*/
static int udp_v6_send_skb(struct sk_buff *skb, struct flowi6 *fl6,
struct inet_cork *cork)
{
struct sock *sk = skb->sk;
struct udphdr *uh;
int err = 0;
int is_udplite = IS_UDPLITE(sk);
__wsum csum = 0;
int offset = skb_transport_offset(skb);
int len = skb->len - offset;
int datalen = len - sizeof(*uh);
/*
* Create a UDP header
*/
uh = udp_hdr(skb);
uh->source = fl6->fl6_sport;
uh->dest = fl6->fl6_dport;
uh->len = htons(len);
uh->check = 0;
if (cork->gso_size) {
const int hlen = skb_network_header_len(skb) +
sizeof(struct udphdr);
if (hlen + cork->gso_size > cork->fragsize) {
kfree_skb(skb);
return -EINVAL;
}
if (datalen > cork->gso_size * UDP_MAX_SEGMENTS) {
kfree_skb(skb);
return -EINVAL;
}
if (udp_get_no_check6_tx(sk)) {
kfree_skb(skb);
return -EINVAL;
}
if (is_udplite || dst_xfrm(skb_dst(skb))) {
kfree_skb(skb);
return -EIO;
}
if (datalen > cork->gso_size) {
skb_shinfo(skb)->gso_size = cork->gso_size;
skb_shinfo(skb)->gso_type = SKB_GSO_UDP_L4;
skb_shinfo(skb)->gso_segs = DIV_ROUND_UP(datalen,
cork->gso_size);
/* Don't checksum the payload, skb will get segmented */
goto csum_partial;
}
}
if (is_udplite)
csum = udplite_csum(skb);
else if (udp_get_no_check6_tx(sk)) { /* UDP csum disabled */
skb->ip_summed = CHECKSUM_NONE;
goto send;
} else if (skb->ip_summed == CHECKSUM_PARTIAL) { /* UDP hardware csum */
csum_partial:
udp6_hwcsum_outgoing(sk, skb, &fl6->saddr, &fl6->daddr, len);
goto send;
} else
csum = udp_csum(skb);
/* add protocol-dependent pseudo-header */
uh->check = csum_ipv6_magic(&fl6->saddr, &fl6->daddr,
len, fl6->flowi6_proto, csum);
if (uh->check == 0)
uh->check = CSUM_MANGLED_0;
send:
err = ip6_send_skb(skb);
if (err) {
if (err == -ENOBUFS && !inet6_test_bit(RECVERR6, sk)) {
UDP6_INC_STATS(sock_net(sk),
UDP_MIB_SNDBUFERRORS, is_udplite);
err = 0;
}
} else {
UDP6_INC_STATS(sock_net(sk),
UDP_MIB_OUTDATAGRAMS, is_udplite);
}
return err;
}
static int udp_v6_push_pending_frames(struct sock *sk)
{
struct sk_buff *skb;
struct udp_sock *up = udp_sk(sk);
int err = 0;
if (up->pending == AF_INET)
return udp_push_pending_frames(sk);
skb = ip6_finish_skb(sk);
if (!skb)
goto out;
err = udp_v6_send_skb(skb, &inet_sk(sk)->cork.fl.u.ip6,
&inet_sk(sk)->cork.base);
out:
up->len = 0;
WRITE_ONCE(up->pending, 0);
return err;
}
int udpv6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
{
struct ipv6_txoptions opt_space;
struct udp_sock *up = udp_sk(sk);
struct inet_sock *inet = inet_sk(sk);
struct ipv6_pinfo *np = inet6_sk(sk);
DECLARE_SOCKADDR(struct sockaddr_in6 *, sin6, msg->msg_name);
struct in6_addr *daddr, *final_p, final;
struct ipv6_txoptions *opt = NULL;
struct ipv6_txoptions *opt_to_free = NULL;
struct ip6_flowlabel *flowlabel = NULL;
struct inet_cork_full cork;
struct flowi6 *fl6 = &cork.fl.u.ip6;
struct dst_entry *dst;
struct ipcm6_cookie ipc6;
int addr_len = msg->msg_namelen;
bool connected = false;
int ulen = len;
int corkreq = udp_test_bit(CORK, sk) || msg->msg_flags & MSG_MORE;
int err;
int is_udplite = IS_UDPLITE(sk);
int (*getfrag)(void *, char *, int, int, int, struct sk_buff *);
ipcm6_init(&ipc6);
ipc6.gso_size = READ_ONCE(up->gso_size);
ipc6.sockc.tsflags = READ_ONCE(sk->sk_tsflags);
ipc6.sockc.mark = READ_ONCE(sk->sk_mark);
ipc6.sockc.priority = READ_ONCE(sk->sk_priority);
/* destination address check */
if (sin6) {
if (addr_len < offsetof(struct sockaddr, sa_data))
return -EINVAL;
switch (sin6->sin6_family) {
case AF_INET6:
if (addr_len < SIN6_LEN_RFC2133)
return -EINVAL;
daddr = &sin6->sin6_addr;
if (ipv6_addr_any(daddr) &&
ipv6_addr_v4mapped(&np->saddr))
ipv6_addr_set_v4mapped(htonl(INADDR_LOOPBACK),
daddr);
break;
case AF_INET:
goto do_udp_sendmsg;
case AF_UNSPEC:
msg->msg_name = sin6 = NULL;
msg->msg_namelen = addr_len = 0;
daddr = NULL;
break;
default:
return -EINVAL;
}
} else if (!READ_ONCE(up->pending)) {
if (sk->sk_state != TCP_ESTABLISHED)
return -EDESTADDRREQ;
daddr = &sk->sk_v6_daddr;
} else
daddr = NULL;
if (daddr) {
if (ipv6_addr_v4mapped(daddr)) {
struct sockaddr_in sin;
sin.sin_family = AF_INET;
sin.sin_port = sin6 ? sin6->sin6_port : inet->inet_dport;
sin.sin_addr.s_addr = daddr->s6_addr32[3];
msg->msg_name = &sin;
msg->msg_namelen = sizeof(sin);
do_udp_sendmsg:
err = ipv6_only_sock(sk) ?
-ENETUNREACH : udp_sendmsg(sk, msg, len);
msg->msg_name = sin6;
msg->msg_namelen = addr_len;
return err;
}
}
/* Rough check on arithmetic overflow;
* a better check is made in ip6_append_data().
*/
if (len > INT_MAX - sizeof(struct udphdr))
return -EMSGSIZE;
getfrag = is_udplite ? udplite_getfrag : ip_generic_getfrag;
if (READ_ONCE(up->pending)) {
if (READ_ONCE(up->pending) == AF_INET)
return udp_sendmsg(sk, msg, len);
/*
* There are pending frames.
* The socket lock must be held while it's corked.
*/
lock_sock(sk);
if (likely(up->pending)) {
if (unlikely(up->pending != AF_INET6)) {
release_sock(sk);
return -EAFNOSUPPORT;
}
dst = NULL;
goto do_append_data;
}
release_sock(sk);
}
ulen += sizeof(struct udphdr);
memset(fl6, 0, sizeof(*fl6));
if (sin6) {
if (sin6->sin6_port == 0)
return -EINVAL;
fl6->fl6_dport = sin6->sin6_port;
daddr = &sin6->sin6_addr;
if (inet6_test_bit(SNDFLOW, sk)) {
fl6->flowlabel = sin6->sin6_flowinfo&IPV6_FLOWINFO_MASK;
if (fl6->flowlabel & IPV6_FLOWLABEL_MASK) {
flowlabel = fl6_sock_lookup(sk, fl6->flowlabel);
if (IS_ERR(flowlabel))
return -EINVAL;
}
}
/*
* Otherwise it will be difficult to maintain
* sk->sk_dst_cache.
*/
if (sk->sk_state == TCP_ESTABLISHED &&
ipv6_addr_equal(daddr, &sk->sk_v6_daddr))
daddr = &sk->sk_v6_daddr;
if (addr_len >= sizeof(struct sockaddr_in6) &&
sin6->sin6_scope_id &&
__ipv6_addr_needs_scope_id(__ipv6_addr_type(daddr)))
fl6->flowi6_oif = sin6->sin6_scope_id;
} else {
if (sk->sk_state != TCP_ESTABLISHED)
return -EDESTADDRREQ;
fl6->fl6_dport = inet->inet_dport;
daddr = &sk->sk_v6_daddr;
fl6->flowlabel = np->flow_label;
connected = true;
}
if (!fl6->flowi6_oif)
fl6->flowi6_oif = READ_ONCE(sk->sk_bound_dev_if);
if (!fl6->flowi6_oif)
fl6->flowi6_oif = np->sticky_pktinfo.ipi6_ifindex;
fl6->flowi6_uid = sk->sk_uid;
if (msg->msg_controllen) {
opt = &opt_space;
memset(opt, 0, sizeof(struct ipv6_txoptions));
opt->tot_len = sizeof(*opt);
ipc6.opt = opt;
err = udp_cmsg_send(sk, msg, &ipc6.gso_size);
if (err > 0) {
err = ip6_datagram_send_ctl(sock_net(sk), sk, msg, fl6,
&ipc6);
connected = false;
}
if (err < 0) {
fl6_sock_release(flowlabel);
return err;
}
if ((fl6->flowlabel&IPV6_FLOWLABEL_MASK) && !flowlabel) {
flowlabel = fl6_sock_lookup(sk, fl6->flowlabel);
if (IS_ERR(flowlabel))
return -EINVAL;
}
if (!(opt->opt_nflen|opt->opt_flen))
opt = NULL;
}
if (!opt) {
opt = txopt_get(np);
opt_to_free = opt;
}
if (flowlabel)
opt = fl6_merge_options(&opt_space, flowlabel, opt);
opt = ipv6_fixup_options(&opt_space, opt);
ipc6.opt = opt;
fl6->flowi6_proto = sk->sk_protocol;
fl6->flowi6_mark = ipc6.sockc.mark;
fl6->daddr = *daddr;
if (ipv6_addr_any(&fl6->saddr) && !ipv6_addr_any(&np->saddr))
fl6->saddr = np->saddr;
fl6->fl6_sport = inet->inet_sport;
if (cgroup_bpf_enabled(CGROUP_UDP6_SENDMSG) && !connected) {
err = BPF_CGROUP_RUN_PROG_UDP6_SENDMSG_LOCK(sk,
(struct sockaddr *)sin6,
&addr_len,
&fl6->saddr);
if (err)
goto out_no_dst;
if (sin6) {
if (ipv6_addr_v4mapped(&sin6->sin6_addr)) {
/* BPF program rewrote an IPv6-only address into an
* IPv4-mapped IPv6 one. This is currently unsupported.
*/
err = -ENOTSUPP;
goto out_no_dst;
}
if (sin6->sin6_port == 0) {
/* BPF program set invalid port. Reject it. */
err = -EINVAL;
goto out_no_dst;
}
fl6->fl6_dport = sin6->sin6_port;
fl6->daddr = sin6->sin6_addr;
}
}
if (ipv6_addr_any(&fl6->daddr))
fl6->daddr.s6_addr[15] = 0x1; /* :: means loopback (BSD'ism) */
final_p = fl6_update_dst(fl6, opt, &final);
if (final_p)
connected = false;
if (!fl6->flowi6_oif && ipv6_addr_is_multicast(&fl6->daddr)) {
fl6->flowi6_oif = READ_ONCE(np->mcast_oif);
connected = false;
} else if (!fl6->flowi6_oif)
fl6->flowi6_oif = READ_ONCE(np->ucast_oif);
security_sk_classify_flow(sk, flowi6_to_flowi_common(fl6));
if (ipc6.tclass < 0)
ipc6.tclass = np->tclass;
fl6->flowlabel = ip6_make_flowinfo(ipc6.tclass, fl6->flowlabel);
dst = ip6_sk_dst_lookup_flow(sk, fl6, final_p, connected);
if (IS_ERR(dst)) {
err = PTR_ERR(dst);
dst = NULL;
goto out;
}
if (ipc6.hlimit < 0)
ipc6.hlimit = ip6_sk_dst_hoplimit(np, fl6, dst);
if (msg->msg_flags&MSG_CONFIRM)
goto do_confirm;
back_from_confirm:
/* Lockless fast path for the non-corking case */
if (!corkreq) {
struct sk_buff *skb;
skb = ip6_make_skb(sk, getfrag, msg, ulen,
sizeof(struct udphdr), &ipc6,
dst_rt6_info(dst),
msg->msg_flags, &cork);
err = PTR_ERR(skb);
if (!IS_ERR_OR_NULL(skb))
err = udp_v6_send_skb(skb, fl6, &cork.base);
/* ip6_make_skb steals dst reference */
goto out_no_dst;
}
lock_sock(sk);
if (unlikely(up->pending)) {
/* The socket is already corked while preparing it. */
/* ... which is an evident application bug. --ANK */
release_sock(sk);
net_dbg_ratelimited("udp cork app bug 2\n");
err = -EINVAL;
goto out;
}
WRITE_ONCE(up->pending, AF_INET6);
do_append_data:
if (ipc6.dontfrag < 0)
ipc6.dontfrag = inet6_test_bit(DONTFRAG, sk);
up->len += ulen;
err = ip6_append_data(sk, getfrag, msg, ulen, sizeof(struct udphdr),
&ipc6, fl6, dst_rt6_info(dst),
corkreq ? msg->msg_flags|MSG_MORE : msg->msg_flags);
if (err)
udp_v6_flush_pending_frames(sk);
else if (!corkreq)
err = udp_v6_push_pending_frames(sk);
else if (unlikely(skb_queue_empty(&sk->sk_write_queue)))
WRITE_ONCE(up->pending, 0);
if (err > 0)
err = inet6_test_bit(RECVERR6, sk) ? net_xmit_errno(err) : 0;
release_sock(sk);
out:
dst_release(dst);
out_no_dst:
fl6_sock_release(flowlabel);
txopt_put(opt_to_free);
if (!err)
return len;
/*
* ENOBUFS = no kernel mem, SOCK_NOSPACE = no sndbuf space. Reporting
* ENOBUFS might not be good (it's not tunable per se), but otherwise
* we don't have a good statistic (IpOutDiscards but it can be too many
* things). We could add another new stat but at least for now that
* seems like overkill.
*/
if (err == -ENOBUFS || test_bit(SOCK_NOSPACE, &sk->sk_socket->flags)) {
UDP6_INC_STATS(sock_net(sk),
UDP_MIB_SNDBUFERRORS, is_udplite);
}
return err;
do_confirm:
if (msg->msg_flags & MSG_PROBE)
dst_confirm_neigh(dst, &fl6->daddr);
if (!(msg->msg_flags&MSG_PROBE) || len)
goto back_from_confirm;
err = 0;
goto out;
}
EXPORT_SYMBOL(udpv6_sendmsg);
static void udpv6_splice_eof(struct socket *sock)
{
struct sock *sk = sock->sk;
struct udp_sock *up = udp_sk(sk);
if (!READ_ONCE(up->pending) || udp_test_bit(CORK, sk))
return;
lock_sock(sk);
if (up->pending && !udp_test_bit(CORK, sk))
udp_v6_push_pending_frames(sk);
release_sock(sk);
}
void udpv6_destroy_sock(struct sock *sk)
{
struct udp_sock *up = udp_sk(sk);
lock_sock(sk);
/* protects from races with udp_abort() */
sock_set_flag(sk, SOCK_DEAD);
udp_v6_flush_pending_frames(sk);
release_sock(sk);
if (static_branch_unlikely(&udpv6_encap_needed_key)) {
if (up->encap_type) {
void (*encap_destroy)(struct sock *sk);
encap_destroy = READ_ONCE(up->encap_destroy);
if (encap_destroy)
encap_destroy(sk);
}
if (udp_test_bit(ENCAP_ENABLED, sk)) {
static_branch_dec(&udpv6_encap_needed_key);
udp_encap_disable();
}
}
}
/*
* Socket option code for UDP
*/
int udpv6_setsockopt(struct sock *sk, int level, int optname, sockptr_t optval,
unsigned int optlen)
{
if (level == SOL_UDP || level == SOL_UDPLITE || level == SOL_SOCKET)
return udp_lib_setsockopt(sk, level, optname,
optval, optlen,
udp_v6_push_pending_frames);
return ipv6_setsockopt(sk, level, optname, optval, optlen);
}
int udpv6_getsockopt(struct sock *sk, int level, int optname,
char __user *optval, int __user *optlen)
{
if (level == SOL_UDP || level == SOL_UDPLITE)
return udp_lib_getsockopt(sk, level, optname, optval, optlen);
return ipv6_getsockopt(sk, level, optname, optval, optlen);
}
/* ------------------------------------------------------------------------ */
#ifdef CONFIG_PROC_FS
int udp6_seq_show(struct seq_file *seq, void *v)
{
if (v == SEQ_START_TOKEN) {
seq_puts(seq, IPV6_SEQ_DGRAM_HEADER);
} else {
int bucket = ((struct udp_iter_state *)seq->private)->bucket;
const struct inet_sock *inet = inet_sk((const struct sock *)v);
__u16 srcp = ntohs(inet->inet_sport);
__u16 destp = ntohs(inet->inet_dport);
__ip6_dgram_sock_seq_show(seq, v, srcp, destp,
udp_rqueue_get(v), bucket);
}
return 0;
}
const struct seq_operations udp6_seq_ops = {
.start = udp_seq_start,
.next = udp_seq_next,
.stop = udp_seq_stop,
.show = udp6_seq_show,
};
EXPORT_SYMBOL(udp6_seq_ops);
static struct udp_seq_afinfo udp6_seq_afinfo = {
.family = AF_INET6,
.udp_table = NULL,
};
int __net_init udp6_proc_init(struct net *net)
{
if (!proc_create_net_data("udp6", 0444, net->proc_net, &udp6_seq_ops,
sizeof(struct udp_iter_state), &udp6_seq_afinfo))
return -ENOMEM;
return 0;
}
void udp6_proc_exit(struct net *net)
{
remove_proc_entry("udp6", net->proc_net);
}
#endif /* CONFIG_PROC_FS */
/* ------------------------------------------------------------------------ */
struct proto udpv6_prot = {
.name = "UDPv6",
.owner = THIS_MODULE,
.close = udp_lib_close,
.pre_connect = udpv6_pre_connect,
.connect = udpv6_connect,
.disconnect = udp_disconnect,
.ioctl = udp_ioctl,
.init = udpv6_init_sock,
.destroy = udpv6_destroy_sock,
.setsockopt = udpv6_setsockopt,
.getsockopt = udpv6_getsockopt,
.sendmsg = udpv6_sendmsg,
.recvmsg = udpv6_recvmsg,
.splice_eof = udpv6_splice_eof,
.release_cb = ip6_datagram_release_cb,
.hash = udp_lib_hash,
.unhash = udp_lib_unhash,
.rehash = udp_v6_rehash,
.get_port = udp_v6_get_port,
.put_port = udp_lib_unhash,
#ifdef CONFIG_BPF_SYSCALL
.psock_update_sk_prot = udp_bpf_update_proto,
#endif
.memory_allocated = &udp_memory_allocated,
.per_cpu_fw_alloc = &udp_memory_per_cpu_fw_alloc,
.sysctl_mem = sysctl_udp_mem,
.sysctl_wmem_offset = offsetof(struct net, ipv4.sysctl_udp_wmem_min),
.sysctl_rmem_offset = offsetof(struct net, ipv4.sysctl_udp_rmem_min),
.obj_size = sizeof(struct udp6_sock),
.ipv6_pinfo_offset = offsetof(struct udp6_sock, inet6),
.h.udp_table = NULL,
.diag_destroy = udp_abort,
};
static struct inet_protosw udpv6_protosw = {
.type = SOCK_DGRAM,
.protocol = IPPROTO_UDP,
.prot = &udpv6_prot,
.ops = &inet6_dgram_ops,
.flags = INET_PROTOSW_PERMANENT,
};
int __init udpv6_init(void)
{
int ret;
net_hotdata.udpv6_protocol = (struct inet6_protocol) {
.handler = udpv6_rcv,
.err_handler = udpv6_err,
.flags = INET6_PROTO_NOPOLICY | INET6_PROTO_FINAL,
};
ret = inet6_add_protocol(&net_hotdata.udpv6_protocol, IPPROTO_UDP);
if (ret)
goto out;
ret = inet6_register_protosw(&udpv6_protosw);
if (ret)
goto out_udpv6_protocol;
out:
return ret;
out_udpv6_protocol:
inet6_del_protocol(&net_hotdata.udpv6_protocol, IPPROTO_UDP);
goto out;
}
void udpv6_exit(void)
{
inet6_unregister_protosw(&udpv6_protosw);
inet6_del_protocol(&net_hotdata.udpv6_protocol, IPPROTO_UDP);
}