2019-05-29 07:18:09 -07:00
|
|
|
// SPDX-License-Identifier: GPL-2.0-only
|
2014-11-13 17:36:45 -08:00
|
|
|
/* Copyright (c) 2011-2014 PLUMgrid, http://plumgrid.com
|
bpf: pre-allocate hash map elements
If kprobe is placed on spin_unlock then calling kmalloc/kfree from
bpf programs is not safe, since the following deadlock is possible:
kfree->spin_lock(kmem_cache_node->lock)...spin_unlock->kprobe->
bpf_prog->map_update->kmalloc->spin_lock(of the same kmem_cache_node->lock)
and deadlocks.
The following solutions were considered and some implemented, but
eventually discarded:
- kmem_cache_create for every map
- add recursion check to slow-path of slub
- use reserved memory in bpf_map_update for in_irq or in preempt_disabled
- kmalloc via irq_work
At the end pre-allocation of all map elements turned out to be the simplest
solution and since the user is charged upfront for all the memory, such
pre-allocation doesn't affect the user space visible behavior.
Since it's impossible to tell whether kprobe is triggered in a safe
location from kmalloc point of view, use pre-allocation by default
and introduce new BPF_F_NO_PREALLOC flag.
While testing per-cpu hash maps it was discovered
that alloc_percpu(GFP_ATOMIC) has odd corner cases and often
fails to allocate memory even when 90% of it is free.
The pre-allocation of per-cpu hash elements solves this problem as well.
It turned out that bpf_map_update() quickly followed by
bpf_map_lookup()+bpf_map_delete() is a very common pattern used
in many of iovisor/bcc/tools, so there is an additional benefit of
pre-allocation, since such use cases are much faster.
Since all hash map elements are now pre-allocated we can remove the
atomic increment of htab->count and save a few more cycles.
Also add bpf_map_precharge_memlock() to check rlimit_memlock early to avoid
large malloc/free done by users who don't have sufficient limits.
Pre-allocation is done with vmalloc and alloc/free is done
via percpu_freelist. Here are performance numbers for different
pre-allocation algorithms that were implemented, but discarded
in favor of percpu_freelist:
1 cpu:
  pcpu_ida          2.1M
  pcpu_ida nolock   2.3M
  bt                2.4M
  kmalloc           1.8M
  hlist+spinlock    2.3M
  pcpu_freelist     2.6M
4 cpu:
  pcpu_ida          1.5M
  pcpu_ida nolock   1.8M
  bt w/smp_align    1.7M
  bt no/smp_align   1.1M
  kmalloc           0.7M
  hlist+spinlock    0.2M
  pcpu_freelist     2.0M
8 cpu:
  pcpu_ida          0.7M
  bt w/smp_align    0.8M
  kmalloc           0.4M
  pcpu_freelist     1.5M
32 cpu:
  kmalloc           0.13M
  pcpu_freelist     0.49M
pcpu_ida nolock is a modified percpu_ida algorithm without
percpu_ida_cpu locks and without cross-cpu tag stealing.
It's faster than existing percpu_ida, but not as fast as pcpu_freelist.
bt is a variant of block/blk-mq-tag.c simplified and customized
for the bpf use case. bt w/smp_align uses a cache line for every 'long'
(similar to blk-mq-tag). bt no/smp_align allocates 'long'
bitmasks contiguously to save memory. It's comparable to percpu_ida
and in some cases faster, but slower than percpu_freelist.
hlist+spinlock is the simplest free list with a single spinlock.
As expected it has very bad scaling in SMP.
kmalloc is the existing implementation which is still available via
the BPF_F_NO_PREALLOC flag. It's significantly slower on a single cpu and
in an 8 cpu setup it's 3 times slower than pre-allocation with pcpu_freelist,
but saves memory, so in cases where map->max_entries can be large
and the number of map updates/deletes per second is low, it may make
sense to use it.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-07 21:57:15 -08:00
|
|
|
* Copyright (c) 2016 Facebook
|
2014-11-13 17:36:45 -08:00
|
|
|
*/
|
|
|
|
#include <linux/bpf.h>
|
2018-08-09 08:55:20 -07:00
|
|
|
#include <linux/btf.h>
|
2014-11-13 17:36:45 -08:00
|
|
|
#include <linux/jhash.h>
|
|
|
|
#include <linux/filter.h>
|
2017-03-07 20:00:13 -08:00
|
|
|
#include <linux/rculist_nulls.h>
|
2018-08-22 23:49:37 +02:00
|
|
|
#include <linux/random.h>
|
2018-08-09 08:55:20 -07:00
|
|
|
#include <uapi/linux/btf.h>
|
2020-08-27 15:01:11 -07:00
|
|
|
#include <linux/rcupdate_trace.h>
|
2022-04-25 21:32:47 +08:00
|
|
|
#include <linux/btf_ids.h>
|
2016-03-07 21:57:15 -08:00
|
|
|
#include "percpu_freelist.h"
|
2016-11-11 10:55:09 -08:00
|
|
|
#include "bpf_lru_list.h"
|
2017-03-22 10:00:34 -07:00
|
|
|
#include "map_in_map.h"
|
2022-09-02 14:10:44 -07:00
|
|
|
#include <linux/bpf_mem_alloc.h>
|
2014-11-13 17:36:45 -08:00
|
|
|
|
2017-10-18 13:00:22 -07:00
|
|
|
#define HTAB_CREATE_FLAG_MASK \
|
|
|
|
(BPF_F_NO_PREALLOC | BPF_F_NO_COMMON_LRU | BPF_F_NUMA_NODE | \
|
bpf: add program side {rd, wr}only support for maps
This work adds two new map creation flags BPF_F_RDONLY_PROG
and BPF_F_WRONLY_PROG in order to allow for read-only or
write-only BPF maps from a BPF program side.
Today we have BPF_F_RDONLY and BPF_F_WRONLY, but this only
applies to system call side, meaning the BPF program has full
read/write access to the map as usual while bpf(2) calls with
map fd can either only read or write into the map depending
on the flags. BPF_F_RDONLY_PROG and BPF_F_WRONLY_PROG allow
for the exact opposite such that verifier is going to reject
program loads if write into a read-only map or a read into a
write-only map is detected. For read-only map case also some
helpers are forbidden for programs that would alter the map
state such as map deletion, update, etc. As opposed to the two
BPF_F_RDONLY / BPF_F_WRONLY flags, BPF_F_RDONLY_PROG as well
as BPF_F_WRONLY_PROG really do correspond to the map lifetime.
We've enabled this generic map extension to various non-special
maps holding normal user data: array, hash, lru, lpm, local
storage, queue and stack. Further generic map types could be
followed up in future depending on use-case. Main use case
here is to forbid writes into .rodata map values from verifier
side.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-04-09 23:20:05 +02:00
|
|
|
BPF_F_ACCESS_MASK | BPF_F_ZERO_SEED)
|
2017-08-18 11:28:00 -07:00
|
|
|
|
2020-01-15 10:43:04 -08:00
|
|
|
#define BATCH_OPS(_name) \
|
|
|
|
.map_lookup_batch = \
|
|
|
|
_name##_map_lookup_batch, \
|
|
|
|
.map_lookup_and_delete_batch = \
|
|
|
|
_name##_map_lookup_and_delete_batch, \
|
|
|
|
.map_update_batch = \
|
|
|
|
generic_map_update_batch, \
|
|
|
|
.map_delete_batch = \
|
|
|
|
generic_map_delete_batch
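The BATCH_OPS() macro above relies on preprocessor token pasting to fill in designated initializers for an operations table. A minimal userspace sketch of the same idiom follows. Every name in it (`demo_ops`, `DEMO_BATCH_OPS`, the `demo_htab_*` functions) is hypothetical and exists only for illustration; it is not the kernel's `bpf_map_ops` layout.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, cut-down operations table (not the kernel's bpf_map_ops). */
struct demo_ops {
	int (*map_lookup_batch)(void);
	int (*map_lookup_and_delete_batch)(void);
};

/* Hypothetical per-map-type implementations sharing the "demo_htab" prefix. */
static int demo_htab_map_lookup_batch(void)
{
	return 1;
}

static int demo_htab_map_lookup_and_delete_batch(void)
{
	return 2;
}

/* _name##_suffix pastes the prefix onto each member's function name. */
#define DEMO_BATCH_OPS(_name)				\
	.map_lookup_batch =				\
	_name##_map_lookup_batch,			\
	.map_lookup_and_delete_batch =			\
	_name##_map_lookup_and_delete_batch

/* Expands to two designated initializers for the demo_htab_* functions. */
static const struct demo_ops demo_htab_ops = {
	DEMO_BATCH_OPS(demo_htab),
};

/* Call through the table, as a generic caller would. */
static int demo_call_lookup_batch(const struct demo_ops *ops)
{
	assert(ops != NULL && ops->map_lookup_batch != NULL);
	return ops->map_lookup_batch();
}
```

Because the macro body ends without a trailing comma, the user supplies it, which lets the macro sit anywhere in an initializer list.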
|
|
|
|
|
2020-02-24 15:01:34 +01:00
|
|
|
/*
|
|
|
|
* The bucket lock has two protection scopes:
|
|
|
|
*
|
2021-03-11 04:31:03 -08:00
|
|
|
* 1) Serializing concurrent operations from BPF programs on different
|
2020-02-24 15:01:34 +01:00
|
|
|
* CPUs
|
|
|
|
*
|
|
|
|
* 2) Serializing concurrent operations from BPF programs and sys_bpf()
|
|
|
|
*
|
|
|
|
* BPF programs can execute in any context including perf, kprobes and
|
|
|
|
* tracing. As there are almost no limits where perf, kprobes and tracing
|
|
|
|
* can be invoked from, the lock operations need to be protected against
|
|
|
|
* deadlocks. Deadlocks can be caused by recursion and by an invocation in
|
|
|
|
* the lock held section when functions which acquire this lock are invoked
|
|
|
|
* from sys_bpf(). BPF recursion is prevented by incrementing the per CPU
|
|
|
|
* variable bpf_prog_active, which prevents BPF programs attached to perf
|
|
|
|
* events, kprobes and tracing from being invoked before the prior invocation
|
|
|
|
* from one of these contexts completed. sys_bpf() uses the same mechanism
|
|
|
|
* by pinning the task to the current CPU and incrementing the recursion
|
2021-05-25 10:56:59 +08:00
|
|
|
* protection across the map operation.
|
2020-02-24 15:01:51 +01:00
|
|
|
*
|
|
|
|
* This has subtle implications on PREEMPT_RT. PREEMPT_RT forbids certain
|
|
|
|
* operations like memory allocations (even with GFP_ATOMIC) from atomic
|
|
|
|
* contexts. This is required because even with GFP_ATOMIC the memory
|
2021-05-25 10:56:59 +08:00
|
|
|
* allocator calls into code paths which acquire locks with long held lock
|
2020-02-24 15:01:51 +01:00
|
|
|
* sections. To ensure the deterministic behaviour these locks are regular
|
|
|
|
* spinlocks, which are converted to 'sleepable' spinlocks on RT. The only
|
|
|
|
* true atomic contexts on an RT kernel are the low level hardware
|
|
|
|
* handling, scheduling, low level interrupt handling, NMIs etc. None of
|
|
|
|
* these contexts should ever do memory allocations.
|
|
|
|
*
|
|
|
|
* As regular device interrupt handlers and soft interrupts are forced into
|
|
|
|
* thread context, the existing code which does
|
bpf: Make non-preallocated allocation low priority
GFP_ATOMIC doesn't cooperate well with memcg pressure so far, especially
if we allocate too much GFP_ATOMIC memory. For example, when we set the
memcg limit to limit a non-preallocated bpf memory, the GFP_ATOMIC can
easily break the memcg limit by force charge. So it is very dangerous to
use GFP_ATOMIC in non-preallocated case. One way to make it safe is to
remove __GFP_HIGH from GFP_ATOMIC, IOW, use (__GFP_ATOMIC |
__GFP_KSWAPD_RECLAIM) instead, then it will be limited if we allocate
too much memory. There's a plan to completely remove __GFP_ATOMIC in the
mm side[1], so let's use GFP_NOWAIT instead.
We introduced BPF_F_NO_PREALLOC because full map pre-allocation is
too memory expensive for some cases. That means removing __GFP_HIGH
doesn't break the rule of BPF_F_NO_PREALLOC, but has the same goal as
it: avoiding issues caused by too much memory. So let's remove it.
This fix can also apply to other run-time allocations, for example, the
allocations in lpm trie, local storage and devmap. So let's fix it
consistently across the bpf code.
It also fixes a typo in the comment.
[1]. https://lore.kernel.org/linux-mm/163712397076.13692.4727608274002939094@noble.neil.brown.name/
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: NeilBrown <neilb@suse.de>
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Link: https://lore.kernel.org/r/20220709154457.57379-2-laoar.shao@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-07-09 15:44:56 +00:00
|
|
|
* spin_lock*(); alloc(GFP_ATOMIC); spin_unlock*();
|
2020-02-24 15:01:51 +01:00
|
|
|
* just works.
|
|
|
|
*
|
|
|
|
* In theory the BPF locks could be converted to regular spinlocks as well,
|
|
|
|
* but the bucket locks and percpu_freelist locks can be taken from
|
|
|
|
* arbitrary contexts (perf, kprobes, tracepoints) which are required to be
|
2022-09-21 15:38:26 +08:00
|
|
|
* atomic contexts even on RT. Before the introduction of bpf_mem_alloc,
|
|
|
|
* it was only safe to use a raw spinlock for a preallocated hash map on an RT kernel,
|
|
|
|
* because there is no memory allocation within the lock held sections. However
|
|
|
|
* after hash map was fully converted to use bpf_mem_alloc, there will be
|
|
|
|
* non-synchronous memory allocation for non-preallocated hash map, so it is
|
|
|
|
* safe to always use raw spinlock for bucket lock.
|
2020-02-24 15:01:34 +01:00
|
|
|
*/
|
2015-12-29 22:40:27 +08:00
|
|
|
struct bucket {
|
2017-03-07 20:00:13 -08:00
|
|
|
struct hlist_nulls_head head;
|
2022-09-21 15:38:26 +08:00
|
|
|
raw_spinlock_t raw_lock;
|
2015-12-29 22:40:27 +08:00
|
|
|
};
|
|
|
|
|
2020-10-29 00:19:25 -07:00
|
|
|
#define HASHTAB_MAP_LOCK_COUNT 8
|
|
|
|
#define HASHTAB_MAP_LOCK_MASK (HASHTAB_MAP_LOCK_COUNT - 1)
|
|
|
|
|
2014-11-13 17:36:45 -08:00
|
|
|
struct bpf_htab {
|
|
|
|
struct bpf_map map;
|
2022-09-02 14:10:44 -07:00
|
|
|
struct bpf_mem_alloc ma;
|
2022-09-02 14:10:53 -07:00
|
|
|
struct bpf_mem_alloc pcpu_ma;
|
2015-12-29 22:40:27 +08:00
|
|
|
struct bucket *buckets;
|
2016-03-07 21:57:15 -08:00
|
|
|
void *elems;
|
2016-11-11 10:55:09 -08:00
|
|
|
union {
|
|
|
|
struct pcpu_freelist freelist;
|
|
|
|
struct bpf_lru lru;
|
|
|
|
};
|
2017-03-21 19:05:04 -07:00
|
|
|
struct htab_elem *__percpu *extra_elems;
|
2022-09-02 14:10:48 -07:00
|
|
|
/* number of elements in non-preallocated hashtable is kept
|
|
|
|
* in either pcount or count
|
|
|
|
*/
|
|
|
|
struct percpu_counter pcount;
|
|
|
|
atomic_t count;
|
|
|
|
bool use_percpu_counter;
|
2014-11-13 17:36:45 -08:00
|
|
|
u32 n_buckets; /* number of hash buckets */
|
|
|
|
u32 elem_size; /* size of each element in bytes */
|
2018-08-22 23:49:37 +02:00
|
|
|
u32 hashrnd;
|
2020-10-29 00:19:24 -07:00
|
|
|
struct lock_class_key lockdep_key;
|
2020-10-29 00:19:25 -07:00
|
|
|
int __percpu *map_locked[HASHTAB_MAP_LOCK_COUNT];
|
2014-11-13 17:36:45 -08:00
|
|
|
};
|
|
|
|
|
|
|
|
/* each htab element is struct htab_elem + key + value */
|
|
|
|
struct htab_elem {
|
2016-02-01 22:39:53 -08:00
|
|
|
union {
|
2017-03-07 20:00:13 -08:00
|
|
|
struct hlist_nulls_node hash_node;
|
2017-03-07 20:00:12 -08:00
|
|
|
struct {
|
|
|
|
void *padding;
|
|
|
|
union {
|
|
|
|
struct pcpu_freelist_node fnode;
|
2020-02-19 15:47:57 -08:00
|
|
|
struct htab_elem *batch_flink;
|
2017-03-07 20:00:12 -08:00
|
|
|
};
|
|
|
|
};
|
2016-02-01 22:39:53 -08:00
|
|
|
};
|
2016-08-05 14:01:27 -07:00
|
|
|
union {
|
2022-09-02 14:10:53 -07:00
|
|
|
/* pointer to per-cpu pointer */
|
|
|
|
void *ptr_to_pptr;
|
2016-11-11 10:55:09 -08:00
|
|
|
struct bpf_lru_node lru_node;
|
2016-08-05 14:01:27 -07:00
|
|
|
};
|
2016-03-07 21:57:15 -08:00
|
|
|
u32 hash;
|
2020-02-26 18:17:44 -06:00
|
|
|
char key[] __aligned(8);
|
2014-11-13 17:36:45 -08:00
|
|
|
};
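The comment "each htab element is struct htab_elem + key + value" describes a single allocation with the key and value stored inline after the header. The sketch below shows one way such a layout can be sized and addressed in userspace. The `demo_*` names are hypothetical, the header is a simplified stand-in, and the sizing rule (key and value each padded to 8 bytes) is an assumption for illustration rather than something this excerpt states.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Simplified stand-in for struct htab_elem: header fields, then key bytes. */
struct demo_elem {
	void *node[2];		/* placeholder for the list-node union */
	uint32_t hash;
	char key[] __attribute__((aligned(8)));	/* key, then value, inline */
};

/* Round n up to the next multiple of 8. */
static size_t demo_round_up8(size_t n)
{
	return (n + 7) & ~(size_t)7;
}

/* Assumed sizing rule: header + padded key + padded value. */
static size_t demo_elem_size(size_t key_size, size_t value_size)
{
	return sizeof(struct demo_elem) + demo_round_up8(key_size) +
	       demo_round_up8(value_size);
}

/* The value starts right after the padded key. */
static void *demo_elem_value(struct demo_elem *e, size_t key_size)
{
	assert(e != NULL);
	return e->key + demo_round_up8(key_size);
}

/* Allocate one element and copy the key and value into place. */
static struct demo_elem *demo_elem_new(const void *key, size_t key_size,
				       const void *value, size_t value_size)
{
	struct demo_elem *e = calloc(1, demo_elem_size(key_size, value_size));

	if (!e)
		return NULL;
	memcpy(e->key, key, key_size);
	memcpy(demo_elem_value(e, key_size), value, value_size);
	return e;
}
```

Keeping header, key and value in one block means a lookup that finds the element already has the value in hand, with no second pointer to chase.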
|
|
|
|
|
2020-02-24 15:01:51 +01:00
|
|
|
static inline bool htab_is_prealloc(const struct bpf_htab *htab)
|
|
|
|
{
|
|
|
|
return !(htab->map.map_flags & BPF_F_NO_PREALLOC);
|
|
|
|
}
|
|
|
|
|
2020-02-24 15:01:50 +01:00
|
|
|
static void htab_init_buckets(struct bpf_htab *htab)
|
|
|
|
{
|
2022-05-10 01:22:20 -07:00
|
|
|
unsigned int i;
|
2020-02-24 15:01:50 +01:00
|
|
|
|
|
|
|
for (i = 0; i < htab->n_buckets; i++) {
|
|
|
|
INIT_HLIST_NULLS_HEAD(&htab->buckets[i].head, i);
|
2022-09-21 15:38:26 +08:00
|
|
|
raw_spin_lock_init(&htab->buckets[i].raw_lock);
|
|
|
|
lockdep_set_class(&htab->buckets[i].raw_lock,
|
2020-10-29 00:19:24 -07:00
|
|
|
&htab->lockdep_key);
|
2020-12-21 11:25:06 -08:00
|
|
|
cond_resched();
|
2020-02-24 15:01:50 +01:00
|
|
|
}
|
|
|
|
}
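htab_init_buckets() passes the bucket index `i` as the "nulls" value of each list head. As far as I understand the kernel's nulls lists, the end-of-list marker is an odd integer that carries this value, so a lockless reader can tell which list it finished on. The sketch below models that encoding in userspace. The `demo_*` helpers are hypothetical stand-ins and the encoding (low bit set, value shifted left by one) is stated from memory of `list_nulls.h`, so treat it as an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Encode an end-of-list marker: low bit set, value in the upper bits. */
static uintptr_t demo_nulls_marker(unsigned long value)
{
	return (uintptr_t)1 | ((uintptr_t)value << 1);
}

/* Real node pointers are at least 2-byte aligned, so their low bit is 0. */
static int demo_is_a_nulls(uintptr_t ptr)
{
	return (int)(ptr & 1);
}

/* Recover the value stored in a marker. */
static unsigned long demo_get_nulls_value(uintptr_t ptr)
{
	assert(demo_is_a_nulls(ptr));
	return (unsigned long)(ptr >> 1);
}

/*
 * A reader that started walking bucket `start` checks the marker it hit.
 * A mismatch means an element moved to another bucket mid-walk, so the
 * reader should retry the lookup.
 */
static int demo_walk_ended_in_same_bucket(uintptr_t end_marker,
					  unsigned long start)
{
	return demo_get_nulls_value(end_marker) == start;
}
```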
|
|
|
|
|
2020-10-29 00:19:25 -07:00
|
|
|
static inline int htab_lock_bucket(const struct bpf_htab *htab,
|
|
|
|
struct bucket *b, u32 hash,
|
|
|
|
unsigned long *pflags)
|
2020-02-24 15:01:50 +01:00
|
|
|
{
|
|
|
|
unsigned long flags;
|
|
|
|
|
bpf: hash map, avoid deadlock with suitable hash mask
The deadlock may still occur when the map is accessed in both NMI and
non-NMI context, because in NMI we may still access the same bucket but
with a different map_locked index.
For example, on the same CPU, with .max_entries = 2, we update the hash map
with key = 4 while a bpf prog running in NMI nmi_handle() updates the
hash map with key = 20, so it will have the same bucket index but a
different map_locked index.
To fix this issue, use the min mask to hash again.
Fixes: 20b6cc34ea74 ("bpf: Avoid hashtab deadlock with map_locked")
Signed-off-by: Tonghao Zhang <tong@infragraf.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Song Liu <song@kernel.org>
Cc: Yonghong Song <yhs@fb.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: KP Singh <kpsingh@kernel.org>
Cc: Stanislav Fomichev <sdf@google.com>
Cc: Hao Luo <haoluo@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Hou Tao <houtao1@huawei.com>
Acked-by: Yonghong Song <yhs@fb.com>
Acked-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20230111092903.92389-1-tong@infragraf.org
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-01-11 17:29:01 +08:00
|
|
|
hash = hash & min_t(u32, HASHTAB_MAP_LOCK_MASK, htab->n_buckets - 1);
|
2020-10-29 00:19:25 -07:00
|
|
|
|
2022-09-21 15:38:26 +08:00
|
|
|
preempt_disable();
|
2020-10-29 00:19:25 -07:00
|
|
|
if (unlikely(__this_cpu_inc_return(*(htab->map_locked[hash])) != 1)) {
|
|
|
|
__this_cpu_dec(*(htab->map_locked[hash]));
|
2022-09-21 15:38:26 +08:00
|
|
|
preempt_enable();
|
2020-10-29 00:19:25 -07:00
|
|
|
return -EBUSY;
|
|
|
|
}
|
|
|
|
|
2022-09-21 15:38:26 +08:00
|
|
|
raw_spin_lock_irqsave(&b->raw_lock, flags);
|
2020-10-29 00:19:25 -07:00
|
|
|
*pflags = flags;
|
|
|
|
|
|
|
|
return 0;
|
2020-02-24 15:01:50 +01:00
|
|
|
}
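The `min_t()` mask in htab_lock_bucket() keeps the map_locked index from being finer-grained than the bucket index. The sketch below works through that with small numbers. The `demo_*` names and the hash values 2 and 4 are illustrative choices, and `demo_lock_index_old()` reflects my reading of the pre-fix behaviour (masking with the lock mask only), which is an assumption based on the commit message.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_LOCK_COUNT 8
#define DEMO_LOCK_MASK (DEMO_LOCK_COUNT - 1)

/* Bucket selection: n_buckets is assumed to be a power of two. */
static uint32_t demo_bucket_index(uint32_t hash, uint32_t n_buckets)
{
	assert(n_buckets != 0 && (n_buckets & (n_buckets - 1)) == 0);
	return hash & (n_buckets - 1);
}

/* Assumed pre-fix rule: mask with the lock mask only. */
static uint32_t demo_lock_index_old(uint32_t hash)
{
	return hash & DEMO_LOCK_MASK;
}

/* Rule from the code above: mask with min(lock mask, n_buckets - 1). */
static uint32_t demo_lock_index(uint32_t hash, uint32_t n_buckets)
{
	uint32_t bucket_mask = n_buckets - 1;
	uint32_t mask = bucket_mask < DEMO_LOCK_MASK ? bucket_mask
						     : DEMO_LOCK_MASK;

	return hash & mask;
}
```

With `n_buckets = 2`, hashes 2 and 4 both land in bucket 0. The old rule gives them lock indexes 2 and 4, so a nested update on the same CPU would pass the recursion check and then spin on the bucket lock that is already held. The min mask gives both index 0, so the nested attempt is rejected instead.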
|
|
|
|
|
|
|
|
static inline void htab_unlock_bucket(const struct bpf_htab *htab,
|
2020-10-29 00:19:25 -07:00
|
|
|
struct bucket *b, u32 hash,
|
2020-02-24 15:01:50 +01:00
|
|
|
unsigned long flags)
|
|
|
|
{
|
2023-01-11 17:29:01 +08:00
|
|
|
hash = hash & min_t(u32, HASHTAB_MAP_LOCK_MASK, htab->n_buckets - 1);
|
2022-09-21 15:38:26 +08:00
|
|
|
raw_spin_unlock_irqrestore(&b->raw_lock, flags);
|
2020-10-29 00:19:25 -07:00
|
|
|
__this_cpu_dec(*(htab->map_locked[hash]));
|
2022-09-21 15:38:26 +08:00
|
|
|
preempt_enable();
|
2020-02-24 15:01:50 +01:00
|
|
|
}
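htab_lock_bucket() and htab_unlock_bucket() wrap the spinlock in a per-CPU counter so that a nested attempt on the same CPU fails with -EBUSY instead of spinning forever. The single-threaded sketch below models only that counter logic. `demo_map_locked` is a plain array standing in for one CPU's counters, and the preemption control and the raw spinlock itself are deliberately left out, so this is a model of the guard and not a usable lock.

```c
#include <assert.h>
#include <errno.h>

#define DEMO_LOCK_COUNT 8

/* Stand-in for one CPU's map_locked[] counters (per-CPU ints in the kernel). */
static int demo_map_locked[DEMO_LOCK_COUNT];

/* Returns 0 on success, -EBUSY if this "CPU" already holds index idx. */
static int demo_lock(unsigned int idx)
{
	assert(idx < DEMO_LOCK_COUNT);
	if (++demo_map_locked[idx] != 1) {
		/* Nested entry: undo the increment and back off. */
		--demo_map_locked[idx];
		return -EBUSY;
	}
	/* The real code would take the bucket's raw spinlock here. */
	return 0;
}

/* Release in the reverse order: (spinlock first, then) the counter. */
static void demo_unlock(unsigned int idx)
{
	assert(idx < DEMO_LOCK_COUNT && demo_map_locked[idx] == 1);
	--demo_map_locked[idx];
}
```

A caller that gets -EBUSY has to treat the map operation as failed. The guard trades a possible spurious failure for freedom from self-deadlock.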
|
|
|
|
|
2016-11-11 10:55:09 -08:00
|
|
|
static bool htab_lru_map_delete_node(void *arg, struct bpf_lru_node *node);
|
|
|
|
|
|
|
|
static bool htab_is_lru(const struct bpf_htab *htab)
|
|
|
|
{
|
2016-11-11 10:55:10 -08:00
|
|
|
return htab->map.map_type == BPF_MAP_TYPE_LRU_HASH ||
|
|
|
|
htab->map.map_type == BPF_MAP_TYPE_LRU_PERCPU_HASH;
|
|
|
|
}
|
|
|
|
|
|
|
|
static bool htab_is_percpu(const struct bpf_htab *htab)
|
|
|
|
{
|
|
|
|
return htab->map.map_type == BPF_MAP_TYPE_PERCPU_HASH ||
|
|
|
|
htab->map.map_type == BPF_MAP_TYPE_LRU_PERCPU_HASH;
|
2016-11-11 10:55:09 -08:00
|
|
|
}
|
|
|
|
|
bpf: pre-allocate hash map elements
If kprobe is placed on spin_unlock then calling kmalloc/kfree from
bpf programs is not safe, since the following dead lock is possible:
kfree->spin_lock(kmem_cache_node->lock)...spin_unlock->kprobe->
bpf_prog->map_update->kmalloc->spin_lock(of the same kmem_cache_node->lock)
and deadlocks.
The following solutions were considered and some implemented, but
eventually discarded
- kmem_cache_create for every map
- add recursion check to slow-path of slub
- use reserved memory in bpf_map_update for in_irq or in preempt_disabled
- kmalloc via irq_work
At the end pre-allocation of all map elements turned out to be the simplest
solution and since the user is charged upfront for all the memory, such
pre-allocation doesn't affect the user space visible behavior.
Since it's impossible to tell whether kprobe is triggered in a safe
location from kmalloc point of view, use pre-allocation by default
and introduce new BPF_F_NO_PREALLOC flag.
While testing of per-cpu hash maps it was discovered
that alloc_percpu(GFP_ATOMIC) has odd corner cases and often
fails to allocate memory even when 90% of it is free.
The pre-allocation of per-cpu hash elements solves this problem as well.
Turned out that bpf_map_update() quickly followed by
bpf_map_lookup()+bpf_map_delete() is very common pattern used
in many of iovisor/bcc/tools, so there is additional benefit of
pre-allocation, since such use cases are must faster.
Since all hash map elements are now pre-allocated we can remove
atomic increment of htab->count and save few more cycles.
Also add bpf_map_precharge_memlock() to check rlimit_memlock early to avoid
large malloc/free done by users who don't have sufficient limits.
Pre-allocation is done with vmalloc and alloc/free is done
via percpu_freelist. Here are performance numbers for different
pre-allocation algorithms that were implemented, but discarded
in favor of percpu_freelist:
1 cpu:
pcpu_ida 2.1M
pcpu_ida nolock 2.3M
bt 2.4M
kmalloc 1.8M
hlist+spinlock 2.3M
pcpu_freelist 2.6M
4 cpu:
pcpu_ida 1.5M
pcpu_ida nolock 1.8M
bt w/smp_align 1.7M
bt no/smp_align 1.1M
kmalloc 0.7M
hlist+spinlock 0.2M
pcpu_freelist 2.0M
8 cpu:
pcpu_ida 0.7M
bt w/smp_align 0.8M
kmalloc 0.4M
pcpu_freelist 1.5M
32 cpu:
kmalloc 0.13M
pcpu_freelist 0.49M
pcpu_ida nolock is a modified percpu_ida algorithm without
percpu_ida_cpu locks and without cross-cpu tag stealing.
It's faster than existing percpu_ida, but not as fast as pcpu_freelist.
bt is a variant of block/blk-mq-tag.c simplified and customized
for the bpf use case. bt w/smp_align uses a cache line for every 'long'
(similar to blk-mq-tag). bt no/smp_align allocates 'long'
bitmasks contiguously to save memory. It's comparable to percpu_ida
and in some cases faster, but slower than percpu_freelist.
hlist+spinlock is the simplest free list with a single spinlock.
As expected it has very bad scaling in SMP.
kmalloc is the existing implementation, which is still available via
the BPF_F_NO_PREALLOC flag. It's significantly slower on a single cpu and
in the 8 cpu setup it's more than 3 times slower than pre-allocation with
pcpu_freelist, but it saves memory, so in cases where map->max_entries can
be large and the number of map updates/deletes per second is low, it may
make sense to use it.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
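The pre-allocation idea in the commit message above can be illustrated with a small user-space sketch: all elements are allocated once up front, and "alloc"/"free" only pop from and push onto a free list, so no allocator is called on the update path. This is a single-threaded illustration only; it does not reproduce the per-CPU lists or locking of percpu_freelist, and every name in it (demo_elem, demo_pool, demo_pool_*) is hypothetical.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical element: the link is only meaningful while the element is free. */
struct demo_elem {
	struct demo_elem *next_free;
	int value;
};

/* Hypothetical pool: one up-front allocation plus a LIFO free list. */
struct demo_pool {
	struct demo_elem *elems;     /* backing array, allocated once */
	struct demo_elem *free_head; /* top of the free list */
};

/* Allocate every element up front and thread them all onto the free list. */
static int demo_pool_init(struct demo_pool *p, size_t max_entries)
{
	size_t i;

	p->elems = calloc(max_entries, sizeof(*p->elems));
	if (!p->elems)
		return -1;
	p->free_head = NULL;
	for (i = 0; i < max_entries; i++) {
		p->elems[i].next_free = p->free_head;
		p->free_head = &p->elems[i];
	}
	return 0;
}

/* Pop one element; returns NULL when all pre-allocated elements are in use. */
static struct demo_elem *demo_pool_alloc(struct demo_pool *p)
{
	struct demo_elem *e = p->free_head;

	if (e)
		p->free_head = e->next_free;
	return e;
}

/* Push an element back; no memory is returned to the system here. */
static void demo_pool_free(struct demo_pool *p, struct demo_elem *e)
{
	e->next_free = p->free_head;
	p->free_head = e;
}
```

Because the list is LIFO, an update quickly followed by a delete and another update tends to reuse the same element, which is one plausible reason the update/lookup/delete pattern mentioned above benefits from pre-allocation.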
2016-03-07 21:57:15 -08:00
|
|
|
static inline void htab_elem_set_ptr(struct htab_elem *l, u32 key_size,
|
|
|
|
void __percpu *pptr)
|
|
|
|
{
|
|
|
|
*(void __percpu **)(l->key + key_size) = pptr;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline void __percpu *htab_elem_get_ptr(struct htab_elem *l, u32 key_size)
|
|
|
|
{
|
|
|
|
return *(void __percpu **)(l->key + key_size);
|
|
|
|
}
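The two helpers above stash a pointer in the bytes that immediately follow the key. A user-space sketch of the same layout trick is shown below; struct demo_elem and the demo_* functions are hypothetical stand-ins, not kernel types. The sketch copies the pointer with memcpy so it does not depend on the slot being pointer-aligned, whereas the kernel helpers dereference the slot directly.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for struct htab_elem: a header followed by key bytes. */
struct demo_elem {
	uint32_t hash;
	char key[]; /* key bytes, then one pointer-sized slot */
};

/* Allocate room for the header, the key, and one pointer slot after the key. */
static struct demo_elem *demo_elem_new(uint32_t key_size)
{
	return calloc(1, sizeof(struct demo_elem) + key_size + sizeof(void *));
}

/* Store ptr in the slot that starts right after the key. */
static void demo_elem_set_ptr(struct demo_elem *l, uint32_t key_size, void *ptr)
{
	memcpy(l->key + key_size, &ptr, sizeof(ptr));
}

/* Read the pointer back from the slot that starts right after the key. */
static void *demo_elem_get_ptr(const struct demo_elem *l, uint32_t key_size)
{
	void *ptr;

	memcpy(&ptr, l->key + key_size, sizeof(ptr));
	return ptr;
}
```

The caller has to pass the same key_size to both functions, since the slot's position is derived from it rather than stored in the element.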
|
|
|
|
|
2017-03-22 10:00:34 -07:00
|
|
|
static void *fd_htab_map_get_ptr(const struct bpf_map *map, struct htab_elem *l)
|
|
|
|
{
|
|
|
|
return *(void **)(l->key + roundup(map->key_size, 8));
|
|
|
|
}
|
|
|
|
|
2016-03-07 21:57:15 -08:00
|
|
|
static struct htab_elem *get_htab_elem(struct bpf_htab *htab, int i)
|
|
|
|
{
|
bpf: Avoid overflows involving hash elem_size
Use of bpf_map_charge_init() was making sure hash tables would not use more
than 4GB of memory.
Since the implicit check disappeared, we have to be more careful
about overflows, to support big hash tables.
syzbot triggers a panic using:
bpf(BPF_MAP_CREATE, {map_type=BPF_MAP_TYPE_LRU_HASH, key_size=16384, value_size=8,
max_entries=262200, map_flags=0, inner_map_fd=-1, map_name="",
map_ifindex=0, btf_fd=-1, btf_key_type_id=0, btf_value_type_id=0,
btf_vmlinux_value_type_id=0}, 64) = ...
BUG: KASAN: vmalloc-out-of-bounds in bpf_percpu_lru_populate kernel/bpf/bpf_lru_list.c:594 [inline]
BUG: KASAN: vmalloc-out-of-bounds in bpf_lru_populate+0x4ef/0x5e0 kernel/bpf/bpf_lru_list.c:611
Write of size 2 at addr ffffc90017e4a020 by task syz-executor.5/19786
CPU: 0 PID: 19786 Comm: syz-executor.5 Not tainted 5.10.0-rc3-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x107/0x163 lib/dump_stack.c:118
print_address_description.constprop.0.cold+0x5/0x4c8 mm/kasan/report.c:385
__kasan_report mm/kasan/report.c:545 [inline]
kasan_report.cold+0x1f/0x37 mm/kasan/report.c:562
bpf_percpu_lru_populate kernel/bpf/bpf_lru_list.c:594 [inline]
bpf_lru_populate+0x4ef/0x5e0 kernel/bpf/bpf_lru_list.c:611
prealloc_init kernel/bpf/hashtab.c:319 [inline]
htab_map_alloc+0xf6e/0x1230 kernel/bpf/hashtab.c:507
find_and_alloc_map kernel/bpf/syscall.c:123 [inline]
map_create kernel/bpf/syscall.c:829 [inline]
__do_sys_bpf+0xa81/0x5170 kernel/bpf/syscall.c:4336
do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x45deb9
Code: 0d b4 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 db b3 fb ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007fd93fbc0c78 EFLAGS: 00000246 ORIG_RAX: 0000000000000141
RAX: ffffffffffffffda RBX: 0000000000001a40 RCX: 000000000045deb9
RDX: 0000000000000040 RSI: 0000000020000280 RDI: 0000000000000000
RBP: 000000000119bf60 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 000000000119bf2c
R13: 00007ffc08a7be8f R14: 00007fd93fbc19c0 R15: 000000000119bf2c
Fixes: 755e5d55367a ("bpf: Eliminate rlimit-based memory accounting for hashtab maps")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Roman Gushchin <guro@fb.com>
Link: https://lore.kernel.org/bpf/20201207182821.3940306-1-eric.dumazet@gmail.com
2020-12-07 10:28:21 -08:00
|
|
|
return (struct htab_elem *) (htab->elems + i * (u64)htab->elem_size);
|
2016-03-07 21:57:15 -08:00
|
|
|
}
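The (u64) cast in get_htab_elem() above matters once the table exceeds 4GB: with both operands 32 bits wide, the product i * elem_size wraps before it is added to the base pointer. The sketch below contrasts the two computations in user space. The element size of 16440 bytes used in the note after the code is an illustrative assumption (roughly the 16384-byte key from the syzbot report plus value and header), not a value taken from the kernel.

```c
#include <stdint.h>

/* Offset computed entirely in 32 bits: wraps once the product exceeds UINT32_MAX. */
static uint32_t elem_offset_32(uint32_t i, uint32_t elem_size)
{
	return i * elem_size;
}

/* Offset with one operand widened to 64 bits before multiplying. */
static uint64_t elem_offset_64(uint32_t i, uint32_t elem_size)
{
	return i * (uint64_t)elem_size;
}
```

For the last index of a 262200-entry table, elem_offset_64(262199, 16440) is 4310551560, which is above 2^32, while elem_offset_32(262199, 16440) wraps around to 15584264 and would point back near the start of the array.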
|
|
|
|
|
2021-07-14 17:54:10 -07:00
|
|
|
static bool htab_has_extra_elems(struct bpf_htab *htab)
|
|
|
|
{
|
|
|
|
return !htab_is_percpu(htab) && !htab_is_lru(htab);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void htab_free_prealloced_timers(struct bpf_htab *htab)
|
|
|
|
{
|
|
|
|
u32 num_entries = htab->map.max_entries;
|
|
|
|
int i;
|
|
|
|
|
2022-11-04 00:39:56 +05:30
|
|
|
if (!btf_record_has_field(htab->map.record, BPF_TIMER))
|
2021-07-14 17:54:10 -07:00
|
|
|
return;
|
|
|
|
if (htab_has_extra_elems(htab))
|
|
|
|
num_entries += num_possible_cpus();
|
|
|
|
|
|
|
|
for (i = 0; i < num_entries; i++) {
|
|
|
|
struct htab_elem *elem;
|
|
|
|
|
|
|
|
elem = get_htab_elem(htab, i);
|
2022-11-04 00:39:56 +05:30
|
|
|
bpf_obj_free_timer(htab->map.record, elem->key + round_up(htab->map.key_size, 8));
|
2021-07-14 17:54:10 -07:00
|
|
|
cond_resched();
|
|
|
|
}
|
|
|
|
}
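The loop bound used above depends on the map flavour: plain hash maps carry num_possible_cpus() extra pre-allocated elements, while per-CPU and LRU maps do not. A minimal user-space sketch of that calculation follows; struct demo_htab and its boolean fields are hypothetical stand-ins for struct bpf_htab and the htab_is_percpu()/htab_is_lru() checks, and the CPU count is passed in as a parameter.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified view of the fields the calculation needs. */
struct demo_htab {
	uint32_t max_entries;
	bool percpu;
	bool lru;
};

/* Mirrors the shape of htab_has_extra_elems(): only plain hash maps qualify. */
static bool demo_has_extra_elems(const struct demo_htab *h)
{
	return !h->percpu && !h->lru;
}

/* Number of pre-allocated elements a cleanup loop has to visit. */
static uint32_t demo_num_prealloc(const struct demo_htab *h, uint32_t possible_cpus)
{
	uint32_t n = h->max_entries;

	if (demo_has_extra_elems(h))
		n += possible_cpus;
	return n;
}
```

With 1024 entries and 8 possible CPUs, a plain hash map would walk 1032 elements, while a per-CPU or LRU map would walk 1024.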
|
|
|
|
|
bpf: Refactor kptr_off_tab into btf_record
To prepare the BPF verifier to handle special fields in both map values
and program allocated types coming from program BTF, we need to refactor
the kptr_off_tab handling code into something more generic and reusable
across both cases to avoid code duplication.
Later patches also require passing this data to helpers at runtime, so
that they can work on user defined types, initialize them, destruct
them, etc.
The main observation is that both map values and such allocated types
point to a type in program BTF, hence they can be handled similarly. We
can prepare a field metadata table for both cases and store them in
struct bpf_map or struct btf depending on the use case.
Hence, refactor the code into generic btf_record and btf_field member
structs. The btf_record represents the fields of a specific btf_type in
user BTF. The cnt indicates the number of special fields we successfully
recognized, and field_mask is a bitmask of fields that were found, to
enable quick determination of availability of a certain field.
Subsequently, refactor the rest of the code to work with these generic
types, remove assumptions about kptr and kptr_off_tab, rename variables
to more meaningful names, etc.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221103191013.1236066-7-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-04 00:39:55 +05:30
|
|
|
static void htab_free_prealloced_fields(struct bpf_htab *htab)
|
bpf: Wire up freeing of referenced kptr
A destructor kfunc can be defined as void func(type *), where type may
be void or any other pointer type as per convenience.
In this patch, we ensure that the type is sane and capture the function
pointer into off_desc of ptr_off_tab for the specific pointer offset,
with the invariant that the dtor pointer is always set when 'kptr_ref'
tag is applied to the pointer's pointee type, which is indicated by the
flag BPF_MAP_VALUE_OFF_F_REF.
Note that only BTF IDs whose destructor kfunc is registered become
the allowed BTF IDs for embedding as referenced kptr. Hence it serves
the purpose of finding the dtor kfunc BTF ID, as well as acting as a check
against the whitelist of allowed BTF IDs for this purpose.
Finally, wire up the actual freeing of the referenced pointer if any at
all available offsets, so that no references are leaked after the BPF
map goes away and the BPF program previously moved the ownership of a
referenced pointer into it.
The behavior is similar to BPF timers, where bpf_map_{update,delete}_elem
will free any existing referenced kptr. The same case is with LRU map's
bpf_lru_push_free/htab_lru_push_free functions, which are extended to
reset unreferenced and free referenced kptr.
Note that unlike BPF timers, kptr is not reset or freed when map uref
drops to zero.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220424214901.2743946-8-memxor@gmail.com
2022-04-25 03:18:55 +05:30
|
|
|
{
|
|
|
|
u32 num_entries = htab->map.max_entries;
|
|
|
|
int i;
|
|
|
|
|
2022-11-04 00:39:55 +05:30
|
|
|
if (IS_ERR_OR_NULL(htab->map.record))
|
2022-04-25 03:18:55 +05:30
|
|
|
return;
|
|
|
|
if (htab_has_extra_elems(htab))
|
|
|
|
num_entries += num_possible_cpus();
|
|
|
|
for (i = 0; i < num_entries; i++) {
|
|
|
|
struct htab_elem *elem;
|
|
|
|
|
|
|
|
elem = get_htab_elem(htab, i);
|
2023-02-25 16:40:08 +01:00
|
|
|
if (htab_is_percpu(htab)) {
|
|
|
|
void __percpu *pptr = htab_elem_get_ptr(elem, htab->map.key_size);
|
|
|
|
int cpu;
|
|
|
|
|
|
|
|
for_each_possible_cpu(cpu) {
|
|
|
|
bpf_obj_free_fields(htab->map.record, per_cpu_ptr(pptr, cpu));
|
|
|
|
cond_resched();
|
|
|
|
}
|
|
|
|
} else {
|
|
|
|
bpf_obj_free_fields(htab->map.record, elem->key + round_up(htab->map.key_size, 8));
|
|
|
|
cond_resched();
|
|
|
|
}
|
2022-04-25 03:18:55 +05:30
|
|
|
cond_resched();
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2016-03-07 21:57:15 -08:00
|
|
|
static void htab_free_elems(struct bpf_htab *htab)
|
|
|
|
{
|
|
|
|
int i;
|
|
|
|
|
2016-11-11 10:55:10 -08:00
|
|
|
if (!htab_is_percpu(htab))
|
2016-03-07 21:57:15 -08:00
|
|
|
goto free_elems;
|
|
|
|
|
|
|
|
for (i = 0; i < htab->map.max_entries; i++) {
|
|
|
|
void __percpu *pptr;
|
|
|
|
|
|
|
|
pptr = htab_elem_get_ptr(get_htab_elem(htab, i),
|
|
|
|
htab->map.key_size);
|
|
|
|
free_percpu(pptr);
|
2017-12-12 14:22:39 -08:00
|
|
|
cond_resched();
|
2016-03-07 21:57:15 -08:00
|
|
|
}
|
|
|
|
free_elems:
|
bpf: don't trigger OOM killer under pressure with map alloc
This patch adds two helpers, bpf_map_area_alloc() and bpf_map_area_free(),
that are to be used for map allocations. Using kmalloc() for very large
allocations can cause excessive work within the page allocator, so i) fall
back earlier to vmalloc() when the attempt is considered costly anyway,
and even more importantly ii) don't trigger OOM killer with any of the
allocators.
Since this is based on a user space request, for example, when creating
maps with element pre-allocation, we really want such requests to fail
instead of killing other user space processes.
Also, don't spam the kernel log with warnings should any of the allocations
fail under pressure. Given that, we can make backend selection in
bpf_map_area_alloc() generic, and convert all maps over to use this API
for spots with potentially large allocation requests.
Note, replacing the one kmalloc_array() is fine as overflow checks happen
earlier in htab_map_alloc(), since it must also protect the multiplication
for vmalloc() should kmalloc_array() fail.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-18 15:14:17 +01:00
|
|
|
bpf_map_area_free(htab->elems);
|
2016-03-07 21:57:15 -08:00
|
|
|
}
|
|
|
|
|
2020-02-19 15:47:57 -08:00
|
|
|
/* The LRU list has a lock (lru_lock). Each htab bucket has a lock
 * (bucket_lock). If both locks need to be acquired together, the lock
 * order is always lru_lock -> bucket_lock and this only happens in
 * bpf_lru_list.c logic. For example, certain code path of
 * bpf_lru_pop_free(), which is called by function prealloc_lru_pop(),
 * will acquire lru_lock first followed by acquiring bucket_lock.
 *
 * In hashtab.c, to avoid deadlock, lock acquisition of
 * bucket_lock followed by lru_lock is not allowed. In such cases,
 * bucket_lock needs to be released first before acquiring lru_lock.
 */
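The ordering rule in the comment above can be illustrated with a small user-space sketch. This is only an analogy under stated assumptions: the two pthread mutexes, the counters, and the function names `lru_side_evict()` and `htab_side_delete()` are hypothetical stand-ins, not the kernel's locks or functions.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical stand-ins for the two kernel locks (not the kernel types). */
static pthread_mutex_t lru_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t bucket_lock = PTHREAD_MUTEX_INITIALIZER;

static int lru_free_nodes;	/* protected by lru_lock */
static int bucket_elems;	/* protected by bucket_lock */

/* LRU side: the only place where both locks nest, lru_lock -> bucket_lock. */
static int lru_side_evict(void)
{
	int evicted = 0;

	pthread_mutex_lock(&lru_lock);
	pthread_mutex_lock(&bucket_lock);
	if (bucket_elems > 0) {
		bucket_elems--;
		lru_free_nodes++;
		evicted = 1;
	}
	pthread_mutex_unlock(&bucket_lock);
	pthread_mutex_unlock(&lru_lock);
	return evicted;
}

/* Hash-table side: drop bucket_lock before touching lru_lock. */
static int htab_side_delete(void)
{
	int removed = 0;

	pthread_mutex_lock(&bucket_lock);
	assert(bucket_elems >= 0);
	if (bucket_elems > 0) {
		bucket_elems--;
		removed = 1;
	}
	pthread_mutex_unlock(&bucket_lock);	/* released first */

	if (removed) {
		pthread_mutex_lock(&lru_lock);	/* taken on its own, never nested */
		lru_free_nodes++;
		pthread_mutex_unlock(&lru_lock);
	}
	return removed;
}
```

Because no path ever holds bucket_lock while waiting for lru_lock, the two paths cannot form a lock cycle in this sketch.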
|
2016-11-11 10:55:09 -08:00
|
|
|
static struct htab_elem *prealloc_lru_pop(struct bpf_htab *htab, void *key,
					  u32 hash)
{
	struct bpf_lru_node *node = bpf_lru_pop_free(&htab->lru, hash);
	struct htab_elem *l;

	if (node) {
		l = container_of(node, struct htab_elem, lru_node);
|
bpf: Don't reinit map value in prealloc_lru_pop
The LRU map that is preallocated may have its elements reused while
another program holds a pointer to it from bpf_map_lookup_elem. Hence,
only check_and_free_fields is appropriate when the element is being
deleted, as it ensures proper synchronization against concurrent access
of the map value. After that, we cannot call check_and_init_map_value
again as it may rewrite bpf_spin_lock, bpf_timer, and kptr fields while
they can be concurrently accessed from a BPF program.
This is safe to do as when the map entry is deleted, concurrent access
is protected against by check_and_free_fields, i.e. an existing timer
would be freed, and any existing kptr will be released by it. The
program can create further timers and kptrs after check_and_free_fields,
but they will eventually be released once the preallocated items are
freed on map destruction, even if the item is never reused again. Hence,
the deleted item sitting in the free list can still have resources
attached to it, and they would never leak.
With spin_lock, we never touch the field at all on delete or update, as
we may end up modifying the state of the lock. Since the verifier
ensures that a bpf_spin_lock call is always paired with bpf_spin_unlock
call, the program will eventually release the lock so that on reuse the
new user of the value can take the lock.
Essentially, for the preallocated case, we must assume that the map
value may always be in use by the program, even when it is sitting in
the freelist, and handle things accordingly, i.e. use proper
synchronization inside check_and_free_fields, and never reinitialize the
special fields when it is reused on update.
Fixes: 68134668c17f ("bpf: Add map side support for bpf timers.")
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/r/20220809213033.24147-3-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-08-09 23:30:32 +02:00
|
|
|
memcpy(l->key, key, htab->map.key_size);
|
2016-11-11 10:55:09 -08:00
|
|
|
		return l;
	}

	return NULL;
}
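prealloc_lru_pop() recovers the enclosing element from a pointer to its embedded LRU node and then copies the key into it. A minimal user-space sketch of that pattern follows; `struct lru_node`, `struct elem`, and `elem_from_node()` are hypothetical simplified types and names, and the `container_of` macro here is a plain portable version rather than the kernel's definition.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Plain user-space container_of; the kernel macro adds type checking. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct lru_node {		/* hypothetical stand-in for struct bpf_lru_node */
	int ref;
};

struct elem {			/* hypothetical simplified element layout */
	unsigned int hash;
	struct lru_node lru_node;	/* embedded, deliberately not at offset 0 */
	char key[8];
};

/* Map a popped node back to its element and store the new key in it. */
static struct elem *elem_from_node(struct lru_node *node, const void *key,
				   size_t key_size)
{
	struct elem *l;

	if (!node)
		return NULL;

	l = container_of(node, struct elem, lru_node);
	assert(key_size <= sizeof(l->key));
	memcpy(l->key, key, key_size);
	return l;
}
```

Only the key is written on reuse in this sketch, which mirrors the function above: the value area is left untouched.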
|
|
|
|
|
|
|
|
static int prealloc_init(struct bpf_htab *htab)
|
2016-03-07 21:57:15 -08:00
|
|
|
{
|
2017-03-21 19:05:04 -07:00
|
|
|
u32 num_entries = htab->map.max_entries;
|
2016-03-07 21:57:15 -08:00
|
|
|
int err = -ENOMEM, i;
|
|
|
|
|
2021-07-14 17:54:10 -07:00
|
|
|
if (htab_has_extra_elems(htab))
|
2017-03-21 19:05:04 -07:00
|
|
|
num_entries += num_possible_cpus();
|
|
|
|
|
bpf: Avoid overflows involving hash elem_size
Use of bpf_map_charge_init() was making sure hash tables would not use more
than 4GB of memory.
Since the implicit check disappeared, we have to be more careful
about overflows, to support big hash tables.
syzbot triggers a panic using :
bpf(BPF_MAP_CREATE, {map_type=BPF_MAP_TYPE_LRU_HASH, key_size=16384, value_size=8,
max_entries=262200, map_flags=0, inner_map_fd=-1, map_name="",
map_ifindex=0, btf_fd=-1, btf_key_type_id=0, btf_value_type_id=0,
btf_vmlinux_value_type_id=0}, 64) = ...
BUG: KASAN: vmalloc-out-of-bounds in bpf_percpu_lru_populate kernel/bpf/bpf_lru_list.c:594 [inline]
BUG: KASAN: vmalloc-out-of-bounds in bpf_lru_populate+0x4ef/0x5e0 kernel/bpf/bpf_lru_list.c:611
Write of size 2 at addr ffffc90017e4a020 by task syz-executor.5/19786
CPU: 0 PID: 19786 Comm: syz-executor.5 Not tainted 5.10.0-rc3-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x107/0x163 lib/dump_stack.c:118
print_address_description.constprop.0.cold+0x5/0x4c8 mm/kasan/report.c:385
__kasan_report mm/kasan/report.c:545 [inline]
kasan_report.cold+0x1f/0x37 mm/kasan/report.c:562
bpf_percpu_lru_populate kernel/bpf/bpf_lru_list.c:594 [inline]
bpf_lru_populate+0x4ef/0x5e0 kernel/bpf/bpf_lru_list.c:611
prealloc_init kernel/bpf/hashtab.c:319 [inline]
htab_map_alloc+0xf6e/0x1230 kernel/bpf/hashtab.c:507
find_and_alloc_map kernel/bpf/syscall.c:123 [inline]
map_create kernel/bpf/syscall.c:829 [inline]
__do_sys_bpf+0xa81/0x5170 kernel/bpf/syscall.c:4336
do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x45deb9
Code: 0d b4 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 db b3 fb ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007fd93fbc0c78 EFLAGS: 00000246 ORIG_RAX: 0000000000000141
RAX: ffffffffffffffda RBX: 0000000000001a40 RCX: 000000000045deb9
RDX: 0000000000000040 RSI: 0000000020000280 RDI: 0000000000000000
RBP: 000000000119bf60 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 000000000119bf2c
R13: 00007ffc08a7be8f R14: 00007fd93fbc19c0 R15: 000000000119bf2c
Fixes: 755e5d55367a ("bpf: Eliminate rlimit-based memory accounting for hashtab maps")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Roman Gushchin <guro@fb.com>
Link: https://lore.kernel.org/bpf/20201207182821.3940306-1-eric.dumazet@gmail.com
2020-12-07 10:28:21 -08:00
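The overflow described in this commit message can be reproduced with plain integer arithmetic. The sketch below is an illustration under assumptions: the helper names are hypothetical, and the element size used is only a lower bound (key_size + value_size from the syzbot report), since the real elem_size also includes the element header.

```c
#include <assert.h>
#include <stdint.h>

/* Product evaluated in 32 bits: the result wraps modulo 2^32. */
static uint32_t alloc_size_u32(uint32_t elem_size, uint32_t num_entries)
{
	return elem_size * num_entries;
}

/* Widening one operand first keeps the full product. */
static uint64_t alloc_size_u64(uint32_t elem_size, uint32_t num_entries)
{
	return (uint64_t)elem_size * num_entries;
}

/* Self-check with the sizes from the report: key_size=16384, value_size=8,
 * max_entries=262200. The header size is left out (assumption).
 */
static void check_syzbot_sizes(void)
{
	uint32_t elem_size = 16384 + 8;
	uint32_t num_entries = 262200;

	assert(alloc_size_u64(elem_size, num_entries) > UINT32_MAX);
	assert(alloc_size_u32(elem_size, num_entries) <
	       alloc_size_u64(elem_size, num_entries));
}
```

With a wrapped 32-bit size the allocation would be far smaller than the area the populate loop later walks, which is consistent with the out-of-bounds write in the report.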
|
|
|
htab->elems = bpf_map_area_alloc((u64)htab->elem_size * num_entries,
|
2017-08-18 11:28:00 -07:00
|
|
|
htab->map.numa_node);
|
2016-03-07 21:57:15 -08:00
|
|
|
	if (!htab->elems)
		return -ENOMEM;
|
|
|
|
|
2016-11-11 10:55:10 -08:00
|
|
|
if (!htab_is_percpu(htab))
|
2016-03-07 21:57:15 -08:00
|
|
|
goto skip_percpu_elems;
|
|
|
|
|
2017-03-21 19:05:04 -07:00
|
|
|
for (i = 0; i < num_entries; i++) {
|
2016-03-07 21:57:15 -08:00
|
|
|
		u32 size = round_up(htab->map.value_size, 8);
		void __percpu *pptr;
|
|
|
|
|
2020-12-01 13:58:38 -08:00
|
|
|
pptr = bpf_map_alloc_percpu(&htab->map, size, 8,
|
|
|
|
GFP_USER | __GFP_NOWARN);
|
2016-03-07 21:57:15 -08:00
|
|
|
		if (!pptr)
			goto free_elems;
		htab_elem_set_ptr(get_htab_elem(htab, i), htab->map.key_size,
				  pptr);
|
2017-12-12 14:22:39 -08:00
|
|
|
cond_resched();
|
2016-03-07 21:57:15 -08:00
|
|
|
}
|
|
|
|
|
|
|
|
skip_percpu_elems:
|
2016-11-11 10:55:09 -08:00
|
|
|
if (htab_is_lru(htab))
|
|
|
|
err = bpf_lru_init(&htab->lru,
|
|
|
|
htab->map.map_flags & BPF_F_NO_COMMON_LRU,
|
|
|
|
offsetof(struct htab_elem, hash) -
|
|
|
|
offsetof(struct htab_elem, lru_node),
|
|
|
|
htab_lru_map_delete_node,
|
|
|
|
htab);
|
|
|
|
else
|
|
|
|
err = pcpu_freelist_init(&htab->freelist);
|
|
|
|
|
2016-03-07 21:57:15 -08:00
|
|
|
if (err)
|
|
|
|
goto free_elems;
|
|
|
|
|
2016-11-11 10:55:09 -08:00
|
|
|
if (htab_is_lru(htab))
|
|
|
|
bpf_lru_populate(&htab->lru, htab->elems,
|
|
|
|
offsetof(struct htab_elem, lru_node),
|
2017-03-21 19:05:04 -07:00
|
|
|
htab->elem_size, num_entries);
|
2016-11-11 10:55:09 -08:00
|
|
|
else
|
2017-03-07 20:00:12 -08:00
|
|
|
pcpu_freelist_populate(&htab->freelist,
|
|
|
|
htab->elems + offsetof(struct htab_elem, fnode),
|
2017-03-21 19:05:04 -07:00
|
|
|
htab->elem_size, num_entries);
|
2016-11-11 10:55:09 -08:00
|
|
|
|
2016-03-07 21:57:15 -08:00
|
|
|
return 0;
|
|
|
|
|
|
|
|
free_elems:
|
|
|
|
htab_free_elems(htab);
|
|
|
|
return err;
|
|
|
|
}
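The pre-allocation pattern used by prealloc_init() above (carve one contiguous area into fixed-size elements, then thread every element onto a free list) can be sketched in userspace as follows. This is a minimal single-threaded illustration only: the `toy_*` names are hypothetical, the free list is a plain singly linked stack rather than the kernel's per-cpu free list, and the node is assumed to sit at offset 0 of each element.

```c
#include <stddef.h>

/* Hypothetical userspace stand-ins; not the kernel's pcpu_freelist API. */
struct toy_node {
	struct toy_node *next;
};

struct toy_elem {
	struct toy_node fnode;	/* assumed to be the first member */
	unsigned int hash;
	char key[8];
};

struct toy_freelist {
	struct toy_node *head;
};

/* Return an element to the free list (LIFO). */
static void toy_push(struct toy_freelist *fl, struct toy_node *n)
{
	n->next = fl->head;
	fl->head = n;
}

/* Take an element from the free list; NULL when the list is empty. */
static struct toy_node *toy_pop(struct toy_freelist *fl)
{
	struct toy_node *n = fl->head;

	if (n)
		fl->head = n->next;
	return n;
}

/* Walk a contiguous buffer in elem_size strides and push every element. */
static void toy_populate(struct toy_freelist *fl, void *buf,
			 size_t elem_size, size_t nr)
{
	char *p = buf;
	size_t i;

	for (i = 0; i < nr; i++, p += elem_size)
		toy_push(fl, (struct toy_node *)p);
}
```

After toy_populate() runs over a buffer of nr elements, exactly nr calls to toy_pop() succeed and the next one returns NULL, so no allocator call is needed on the update path.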
|
|
|
|
|
2016-11-11 10:55:09 -08:00
|
|
|
static void prealloc_destroy(struct bpf_htab *htab)
|
|
|
|
{
|
|
|
|
htab_free_elems(htab);
|
|
|
|
|
|
|
|
if (htab_is_lru(htab))
|
|
|
|
bpf_lru_destroy(&htab->lru);
|
|
|
|
else
|
|
|
|
pcpu_freelist_destroy(&htab->freelist);
|
|
|
|
}
|
|
|
|
|
2016-08-05 14:01:27 -07:00
|
|
|
static int alloc_extra_elems(struct bpf_htab *htab)
|
|
|
|
{
|
2017-03-21 19:05:04 -07:00
|
|
|
struct htab_elem *__percpu *pptr, *l_new;
|
|
|
|
struct pcpu_freelist_node *l;
|
2016-08-05 14:01:27 -07:00
|
|
|
int cpu;
|
|
|
|
|
2020-12-01 13:58:38 -08:00
|
|
|
pptr = bpf_map_alloc_percpu(&htab->map, sizeof(struct htab_elem *), 8,
|
|
|
|
GFP_USER | __GFP_NOWARN);
|
2016-08-05 14:01:27 -07:00
|
|
|
if (!pptr)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
for_each_possible_cpu(cpu) {
|
2017-03-21 19:05:04 -07:00
|
|
|
l = pcpu_freelist_pop(&htab->freelist);
|
|
|
|
/* pop will succeed, since prealloc_init()
|
|
|
|
* preallocated extra num_possible_cpus elements
|
|
|
|
*/
|
|
|
|
l_new = container_of(l, struct htab_elem, fnode);
|
|
|
|
*per_cpu_ptr(pptr, cpu) = l_new;
|
2016-08-05 14:01:27 -07:00
|
|
|
}
|
|
|
|
htab->extra_elems = pptr;
|
|
|
|
return 0;
|
|
|
|
}
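The comment inside alloc_extra_elems() relies on the pool having been sized with one additional element per possible CPU, so that reserving one spare per CPU cannot fail and still leaves max_entries elements free. A minimal sketch of that accounting, assuming a hypothetical index-based pool (`toy_pool`) and a fixed `TOY_NR_CPUS` in place of num_possible_cpus():

```c
#include <stddef.h>

#define TOY_NR_CPUS 4	/* assumed stand-in for num_possible_cpus() */

/* Hypothetical pool: a stack of free element indices. */
struct toy_pool {
	int *free_idx;
	size_t nr_free;
};

/* Pop one free index, or -1 when the pool is empty. */
static int toy_pool_pop(struct toy_pool *p)
{
	return p->nr_free ? p->free_idx[--p->nr_free] : -1;
}

/* Reserve one spare element per CPU; returns 0 on success, -1 otherwise. */
static int toy_reserve_extra(struct toy_pool *p, int extra[TOY_NR_CPUS])
{
	int cpu;

	for (cpu = 0; cpu < TOY_NR_CPUS; cpu++) {
		int idx = toy_pool_pop(p);

		if (idx < 0)
			return -1;
		extra[cpu] = idx;
	}
	return 0;
}
```

If the pool starts with max_entries + TOY_NR_CPUS free indices, toy_reserve_extra() succeeds and nr_free ends up equal to max_entries.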
|
|
|
|
|
2014-11-13 17:36:45 -08:00
|
|
|
/* Called from syscall */
|
2018-01-11 20:29:05 -08:00
|
|
|
static int htab_map_alloc_check(union bpf_attr *attr)
|
2014-11-13 17:36:45 -08:00
|
|
|
{
|
2016-11-11 10:55:10 -08:00
|
|
|
bool percpu = (attr->map_type == BPF_MAP_TYPE_PERCPU_HASH ||
|
|
|
|
attr->map_type == BPF_MAP_TYPE_LRU_PERCPU_HASH);
|
|
|
|
bool lru = (attr->map_type == BPF_MAP_TYPE_LRU_HASH ||
|
|
|
|
attr->map_type == BPF_MAP_TYPE_LRU_PERCPU_HASH);
|
2016-11-11 10:55:09 -08:00
|
|
|
/* percpu_lru means each cpu has its own LRU list.
|
|
|
|
* it is different from BPF_MAP_TYPE_PERCPU_HASH where
|
|
|
|
* the map's value itself is percpu. percpu_lru has
|
|
|
|
* nothing to do with the map's value.
|
|
|
|
*/
|
|
|
|
bool percpu_lru = (attr->map_flags & BPF_F_NO_COMMON_LRU);
|
|
|
|
bool prealloc = !(attr->map_flags & BPF_F_NO_PREALLOC);
|
2018-11-16 11:41:08 +00:00
|
|
|
bool zero_seed = (attr->map_flags & BPF_F_ZERO_SEED);
|
2017-08-18 11:28:00 -07:00
|
|
|
int numa_node = bpf_map_attr_numa_node(attr);
|
2014-11-13 17:36:45 -08:00
|
|
|
|
2017-03-07 20:00:12 -08:00
|
|
|
BUILD_BUG_ON(offsetof(struct htab_elem, fnode.next) !=
|
|
|
|
offsetof(struct htab_elem, hash_node.pprev));
|
|
|
|
|
2020-05-13 16:03:54 -07:00
|
|
|
if (lru && !bpf_capable())
|
2016-11-11 10:55:09 -08:00
|
|
|
/* LRU implementation is much more complicated than other
|
2020-05-13 16:03:54 -07:00
|
|
|
* maps. Hence, limit to CAP_BPF.
|
2016-11-11 10:55:09 -08:00
|
|
|
*/
|
2018-01-11 20:29:05 -08:00
|
|
|
return -EPERM;
|
2016-11-11 10:55:09 -08:00
|
|
|
|
2018-11-16 11:41:08 +00:00
|
|
|
if (zero_seed && !capable(CAP_SYS_ADMIN))
|
|
|
|
/* Guard against local DoS, and discourage production use. */
|
|
|
|
return -EPERM;
|
|
|
|
|
bpf: add program side {rd, wr}only support for maps
This work adds two new map creation flags BPF_F_RDONLY_PROG
and BPF_F_WRONLY_PROG in order to allow for read-only or
write-only BPF maps from a BPF program side.
Today we have BPF_F_RDONLY and BPF_F_WRONLY, but this only
applies to system call side, meaning the BPF program has full
read/write access to the map as usual while bpf(2) calls with
map fd can either only read or write into the map depending
on the flags. BPF_F_RDONLY_PROG and BPF_F_WRONLY_PROG allows
for the exact opposite such that verifier is going to reject
program loads if write into a read-only map or a read into a
write-only map is detected. For read-only map case also some
helpers are forbidden for programs that would alter the map
state such as map deletion, update, etc. As opposed to the two
BPF_F_RDONLY / BPF_F_WRONLY flags, BPF_F_RDONLY_PROG as well
as BPF_F_WRONLY_PROG really do correspond to the map lifetime.
We've enabled this generic map extension to various non-special
maps holding normal user data: array, hash, lru, lpm, local
storage, queue and stack. Further generic map types could be
followed up in future depending on use-case. Main use case
here is to forbid writes into .rodata map values from verifier
side.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-04-09 23:20:05 +02:00
|
|
|
if (attr->map_flags & ~HTAB_CREATE_FLAG_MASK ||
|
|
|
|
!bpf_map_flags_access_ok(attr->map_flags))
|
2018-01-11 20:29:05 -08:00
|
|
|
return -EINVAL;
|
2016-03-07 21:57:15 -08:00
|
|
|
|
2016-11-11 10:55:09 -08:00
|
|
|
if (!lru && percpu_lru)
|
2018-01-11 20:29:05 -08:00
|
|
|
return -EINVAL;
|
2016-11-11 10:55:09 -08:00
|
|
|
|
|
|
|
if (lru && !prealloc)
|
2018-01-11 20:29:05 -08:00
|
|
|
return -ENOTSUPP;
|
2016-11-11 10:55:09 -08:00
|
|
|
|
2017-08-18 11:28:00 -07:00
|
|
|
if (numa_node != NUMA_NO_NODE && (percpu || percpu_lru))
|
2018-01-11 20:29:05 -08:00
|
|
|
return -EINVAL;
|
2017-08-18 11:28:00 -07:00
|
|
|
|
2018-01-11 20:29:04 -08:00
|
|
|
/* check sanity of attributes.
|
|
|
|
* value_size == 0 may be allowed in the future to use map as a set
|
|
|
|
*/
|
|
|
|
if (attr->max_entries == 0 || attr->key_size == 0 ||
|
|
|
|
attr->value_size == 0)
|
2018-01-11 20:29:05 -08:00
|
|
|
return -EINVAL;
|
2018-01-11 20:29:04 -08:00
|
|
|
|
2020-10-29 21:14:42 +01:00
|
|
|
if ((u64)attr->key_size + attr->value_size >= KMALLOC_MAX_SIZE -
|
|
|
|
sizeof(struct htab_elem))
|
|
|
|
/* if key_size + value_size is bigger, the user space won't be
|
|
|
|
* able to access the elements via bpf syscall. This check
|
|
|
|
* also makes sure that the elem_size doesn't overflow and it's
|
2018-01-11 20:29:04 -08:00
|
|
|
* kmalloc-able later in htab_map_update_elem()
|
|
|
|
*/
|
2018-01-11 20:29:05 -08:00
|
|
|
return -E2BIG;
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
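The size checks at the end of htab_map_alloc_check() can be illustrated with a small userspace sketch. The two limits below are assumed placeholder values, not the kernel's real KMALLOC_MAX_SIZE or sizeof(struct htab_elem), and `toy_check_sizes` is a hypothetical name. The point of the sketch is the 64-bit widening: adding two large 32-bit sizes in 32-bit arithmetic could wrap around and slip past the limit.

```c
#include <errno.h>
#include <stdint.h>

#define TOY_KMALLOC_MAX_SIZE	(1ULL << 22)	/* assumed limit */
#define TOY_ELEM_HDR		48ULL		/* assumed element header size */

/* Mirror the attribute sanity checks: reject zero sizes, then oversized pairs. */
static int toy_check_sizes(uint32_t max_entries, uint32_t key_size,
			   uint32_t value_size)
{
	if (max_entries == 0 || key_size == 0 || value_size == 0)
		return -EINVAL;

	/* Widen before adding so the sum of two u32 values cannot wrap. */
	if ((uint64_t)key_size + value_size >=
	    TOY_KMALLOC_MAX_SIZE - TOY_ELEM_HDR)
		return -E2BIG;

	return 0;
}
```

For instance, key_size = 0xFFFFFFFF with value_size = 2 would sum to 1 in 32-bit arithmetic, but the widened comparison rejects it with -E2BIG.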
|
|
|
|
|
|
|
|
static struct bpf_map *htab_map_alloc(union bpf_attr *attr)
|
|
|
|
{
|
|
|
|
bool percpu = (attr->map_type == BPF_MAP_TYPE_PERCPU_HASH ||
|
|
|
|
attr->map_type == BPF_MAP_TYPE_LRU_PERCPU_HASH);
|
|
|
|
bool lru = (attr->map_type == BPF_MAP_TYPE_LRU_HASH ||
|
|
|
|
attr->map_type == BPF_MAP_TYPE_LRU_PERCPU_HASH);
|
|
|
|
/* percpu_lru means each cpu has its own LRU list.
|
|
|
|
* it is different from BPF_MAP_TYPE_PERCPU_HASH where
|
|
|
|
* the map's value itself is percpu. percpu_lru has
|
|
|
|
* nothing to do with the map's value.
|
|
|
|
*/
|
|
|
|
bool percpu_lru = (attr->map_flags & BPF_F_NO_COMMON_LRU);
|
|
|
|
bool prealloc = !(attr->map_flags & BPF_F_NO_PREALLOC);
|
|
|
|
struct bpf_htab *htab;
|
2020-10-29 00:19:25 -07:00
|
|
|
int err, i;
|
2018-01-11 20:29:04 -08:00
|
|
|
|
2022-08-10 15:18:29 +00:00
|
|
|
htab = bpf_map_area_alloc(sizeof(*htab), NUMA_NO_NODE);
|
2014-11-13 17:36:45 -08:00
|
|
|
if (!htab)
|
|
|
|
return ERR_PTR(-ENOMEM);
|
|
|
|
|
2020-11-02 03:41:00 -08:00
|
|
|
lockdep_register_key(&htab->lockdep_key);
|
|
|
|
|
2018-01-11 20:29:06 -08:00
|
|
|
bpf_map_init_from_attr(&htab->map, attr);
|
2014-11-13 17:36:45 -08:00
|
|
|
|
2016-11-11 10:55:09 -08:00
|
|
|
if (percpu_lru) {
|
|
|
|
/* ensure each CPU's lru list has >=1 elements.
|
|
|
|
* since we are at it, make each lru list have the same
|
|
|
|
* number of elements.
|
|
|
|
*/
|
|
|
|
htab->map.max_entries = roundup(attr->max_entries,
|
|
|
|
num_possible_cpus());
|
|
|
|
if (htab->map.max_entries < attr->max_entries)
|
|
|
|
htab->map.max_entries = rounddown(attr->max_entries,
|
|
|
|
num_possible_cpus());
|
|
|
|
}
|
|
|
|
|
2014-11-13 17:36:45 -08:00
|
|
|
/* hash table size must be power of 2 */
|
|
|
|
htab->n_buckets = roundup_pow_of_two(htab->map.max_entries);
|
|
|
|
|
2015-11-29 16:59:35 -08:00
|
|
|
htab->elem_size = sizeof(struct htab_elem) +
|
2016-02-01 22:39:53 -08:00
|
|
|
round_up(htab->map.key_size, 8);
|
|
|
|
if (percpu)
|
|
|
|
htab->elem_size += sizeof(void *);
|
|
|
|
else
|
2016-03-07 21:57:15 -08:00
|
|
|
htab->elem_size += round_up(htab->map.value_size, 8);
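The sizing arithmetic above (bucket count rounded up to a power of two, key and value each rounded up to 8 bytes, per-cpu maps storing a pointer instead of the value) can be checked with a short userspace sketch. The `toy_*` helpers are hypothetical re-implementations for illustration, and the header size is passed in as a parameter because sizeof(struct htab_elem) is not known here.

```c
#include <stddef.h>
#include <stdint.h>

/* Round x up to the next multiple of 8. */
static uint32_t toy_round_up8(uint32_t x)
{
	return (x + 7u) & ~7u;
}

/* Smallest power of two >= n; returns 0 if it does not fit in 32 bits. */
static uint32_t toy_roundup_pow_of_two(uint32_t n)
{
	uint32_t p = 1;

	while (p && p < n)
		p <<= 1;
	return p;
}

/* Element size: header + padded key + (pointer for per-cpu, else padded value). */
static uint32_t toy_elem_size(uint32_t hdr, uint32_t key_size,
			      uint32_t value_size, int percpu)
{
	uint32_t size = hdr + toy_round_up8(key_size);

	size += percpu ? (uint32_t)sizeof(void *) : toy_round_up8(value_size);
	return size;
}
```

With an assumed 48-byte header, a 5-byte key and a 12-byte value give 48 + 8 + 16 = 72 bytes per element, and a map with max_entries = 1000 gets 1024 buckets.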
|
2015-11-29 16:59:35 -08:00
|
|
|
|
2018-01-11 20:29:04 -08:00
|
|
|
err = -E2BIG;
|
2014-11-18 17:32:16 -08:00
|
|
|
/* prevent zero size kmalloc and check for u32 overflow */
|
|
|
|
if (htab->n_buckets == 0 ||
|
2015-12-29 22:40:27 +08:00
|
|
|
htab->n_buckets > U32_MAX / sizeof(struct bucket))
|
2014-11-18 17:32:16 -08:00
|
|
|
goto free_htab;
|
|
|
|
|
2015-11-29 16:59:35 -08:00
|
|
|
err = -ENOMEM;
|
bpf: don't trigger OOM killer under pressure with map alloc
This patch adds two helpers, bpf_map_area_alloc() and bpf_map_area_free(),
that are to be used for map allocations. Using kmalloc() for very large
allocations can cause excessive work within the page allocator, so i) fall
back earlier to vmalloc() when the attempt is considered costly anyway,
and even more importantly ii) don't trigger OOM killer with any of the
allocators.
Since this is based on a user space request, for example, when creating
maps with element pre-allocation, we really want such requests to fail
instead of killing other user space processes.
Also, don't spam the kernel log with warnings should any of the allocations
fail under pressure. Given that, we can make backend selection in
bpf_map_area_alloc() generic, and convert all maps over to use this API
for spots with potentially large allocation requests.
Note, replacing the one kmalloc_array() is fine as overflow checks happen
earlier in htab_map_alloc(), since it must also protect the multiplication
for vmalloc() should kmalloc_array() fail.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-18 15:14:17 +01:00
|
|
|
htab->buckets = bpf_map_area_alloc(htab->n_buckets *
|
2017-08-18 11:28:00 -07:00
|
|
|
sizeof(struct bucket),
|
|
|
|
htab->map.numa_node);
|
2017-01-18 15:14:17 +01:00
|
|
|
if (!htab->buckets)
|
2020-12-01 13:58:49 -08:00
|
|
|
goto free_htab;
|
2014-11-13 17:36:45 -08:00
|
|
|
|
2020-10-29 00:19:25 -07:00
|
|
|
for (i = 0; i < HASHTAB_MAP_LOCK_COUNT; i++) {
|
2020-12-01 13:58:38 -08:00
|
|
|
htab->map_locked[i] = bpf_map_alloc_percpu(&htab->map,
|
|
|
|
sizeof(int),
|
|
|
|
sizeof(int),
|
|
|
|
GFP_USER);
|
2020-10-29 00:19:25 -07:00
|
|
|
if (!htab->map_locked[i])
|
|
|
|
goto free_map_locked;
|
|
|
|
}
|
|
|
|
|
2018-11-16 11:41:08 +00:00
|
|
|
if (htab->map.map_flags & BPF_F_ZERO_SEED)
|
|
|
|
htab->hashrnd = 0;
|
|
|
|
else
|
2022-10-05 17:43:22 +02:00
|
|
|
htab->hashrnd = get_random_u32();
|
2018-11-16 11:41:08 +00:00
|
|
|
|
2020-02-24 15:01:50 +01:00
|
|
|
htab_init_buckets(htab);
|
2014-11-13 17:36:45 -08:00
|
|
|
|
2022-09-02 14:10:48 -07:00
|
|
|
/* compute_batch_value() computes batch value as num_online_cpus() * 2
|
|
|
|
* and __percpu_counter_compare() needs
|
|
|
|
* htab->max_entries - cur_number_of_elems to be more than batch * num_online_cpus()
|
|
|
|
* for percpu_counter to be faster than atomic_t. In practice the average bpf
|
|
|
|
* hash map size is 10k, which means that a system with 64 cpus will fill
|
|
|
|
* hashmap to 20% of 10k before percpu_counter becomes ineffective. Therefore
|
|
|
|
* define our own batch count as 32 then 10k hash map can be filled up to 80%:
|
|
|
|
* 10k - 8k > 32 _batch_ * 64 _cpus_
|
|
|
|
* and __percpu_counter_compare() will still be fast. At that point hash map
|
|
|
|
* collisions will dominate its performance anyway. Assume that hash map filled
|
|
|
|
* to 50+% isn't going to be O(1) and use the following formula to choose
|
|
|
|
* between percpu_counter and atomic_t.
|
|
|
|
*/
|
|
|
|
#define PERCPU_COUNTER_BATCH 32
|
|
|
|
if (attr->max_entries / 2 > num_online_cpus() * PERCPU_COUNTER_BATCH)
|
|
|
|
htab->use_percpu_counter = true;
|
|
|
|
|
|
|
|
if (htab->use_percpu_counter) {
|
|
|
|
err = percpu_counter_init(&htab->pcount, 0, GFP_KERNEL);
|
|
|
|
if (err)
|
|
|
|
goto free_map_locked;
|
|
|
|
}
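The selection rule between percpu_counter and atomic_t described in the comment above reduces to a single comparison. The sketch below restates it as a standalone function so the threshold can be tried with concrete numbers; `toy_use_percpu_counter` and the `online_cpus` parameter are hypothetical, standing in for the in-kernel condition and num_online_cpus().

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_PERCPU_COUNTER_BATCH 32	/* same batch value as in the source */

/* True when half of max_entries exceeds online_cpus * batch. */
static bool toy_use_percpu_counter(uint32_t max_entries,
				   unsigned int online_cpus)
{
	return max_entries / 2 > online_cpus * TOY_PERCPU_COUNTER_BATCH;
}
```

For a 64-CPU system the threshold is 64 * 32 = 2048, so a map with 10000 entries (half is 5000) selects the per-cpu counter, while a map with 4096 entries (half is exactly 2048) does not.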
|
|
|
|
|
2016-11-11 10:55:09 -08:00
|
|
|
if (prealloc) {
|
|
|
|
err = prealloc_init(htab);
|
bpf: pre-allocate hash map elements
If kprobe is placed on spin_unlock then calling kmalloc/kfree from
bpf programs is not safe, since the following dead lock is possible:
kfree->spin_lock(kmem_cache_node->lock)...spin_unlock->kprobe->
bpf_prog->map_update->kmalloc->spin_lock(of the same kmem_cache_node->lock)
and deadlocks.
The following solutions were considered and some implemented, but
eventually discarded
- kmem_cache_create for every map
- add recursion check to slow-path of slub
- use reserved memory in bpf_map_update for in_irq or in preempt_disabled
- kmalloc via irq_work
At the end pre-allocation of all map elements turned out to be the simplest
solution and since the user is charged upfront for all the memory, such
pre-allocation doesn't affect the user space visible behavior.
Since it's impossible to tell whether kprobe is triggered in a safe
location from kmalloc point of view, use pre-allocation by default
and introduce new BPF_F_NO_PREALLOC flag.
While testing of per-cpu hash maps it was discovered
that alloc_percpu(GFP_ATOMIC) has odd corner cases and often
fails to allocate memory even when 90% of it is free.
The pre-allocation of per-cpu hash elements solves this problem as well.
Turned out that bpf_map_update() quickly followed by
bpf_map_lookup()+bpf_map_delete() is very common pattern used
in many of iovisor/bcc/tools, so there is additional benefit of
pre-allocation, since such use cases are must faster.
Since all hash map elements are now pre-allocated we can remove
atomic increment of htab->count and save few more cycles.
Also add bpf_map_precharge_memlock() to check rlimit_memlock early to avoid
large malloc/free done by users who don't have sufficient limits.
Pre-allocation is done with vmalloc and alloc/free is done
via percpu_freelist. Here are performance numbers for different
pre-allocation algorithms that were implemented, but discarded
in favor of percpu_freelist:
1 cpu:
pcpu_ida 2.1M
pcpu_ida nolock 2.3M
bt 2.4M
kmalloc 1.8M
hlist+spinlock 2.3M
pcpu_freelist 2.6M
4 cpu:
pcpu_ida 1.5M
pcpu_ida nolock 1.8M
bt w/smp_align 1.7M
bt no/smp_align 1.1M
kmalloc 0.7M
hlist+spinlock 0.2M
pcpu_freelist 2.0M
8 cpu:
pcpu_ida 0.7M
bt w/smp_align 0.8M
kmalloc 0.4M
pcpu_freelist 1.5M
32 cpu:
kmalloc 0.13M
pcpu_freelist 0.49M
pcpu_ida nolock is a modified percpu_ida algorithm without
percpu_ida_cpu locks and without cross-cpu tag stealing.
It's faster than existing percpu_ida, but not as fast as pcpu_freelist.
bt is a variant of block/blk-mq-tag.c simplified and customized
for bpf use case. bt w/smp_align is using cache line for every 'long'
(similar to blk-mq-tag). bt no/smp_align allocates 'long'
bitmasks contiguously to save memory. It's comparable to percpu_ida
and in some cases faster, but slower than percpu_freelist.
hlist+spinlock is the simplest free list with single spinlock.
As expected it has very bad scaling in SMP.
kmalloc is existing implementation which is still available via
BPF_F_NO_PREALLOC flag. It's significantly slower in single cpu and
in 8 cpu setup it's 3 times slower than pre-allocation with pcpu_freelist,
but saves memory, so in cases where map->max_entries can be large
and number of map update/delete per second is low, it may make
sense to use it.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
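The pre-allocation idea described in the commit message above can be sketched in user space as follows. This is only an illustration under simplifying assumptions: it is single-threaded, uses calloc() instead of vmalloc, and keeps one global free list rather than the per-CPU lists of pcpu_freelist. The names struct pool, pool_init, pool_pop, pool_push and pool_destroy are hypothetical and do not exist in the kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical free-list link stored inside each unused element. */
struct pool_node {
	struct pool_node *next;
};

struct pool {
	void *mem;              /* one upfront allocation for all elements */
	struct pool_node *free; /* singly linked list of unused elements */
};

/* Allocate n elements at once and thread them onto the free list. */
static int pool_init(struct pool *p, size_t elem_size, size_t n)
{
	size_t i;

	if (elem_size < sizeof(struct pool_node))
		elem_size = sizeof(struct pool_node);
	/* round up so that every element stays pointer-aligned */
	elem_size = (elem_size + sizeof(void *) - 1) & ~(sizeof(void *) - 1);

	p->mem = calloc(n, elem_size);
	if (!p->mem)
		return -1;

	p->free = NULL;
	for (i = 0; i < n; i++) {
		struct pool_node *node =
			(struct pool_node *)((char *)p->mem + i * elem_size);

		node->next = p->free;
		p->free = node;
	}
	return 0;
}

/* Take one element; returns NULL when the pool is exhausted. */
static void *pool_pop(struct pool *p)
{
	struct pool_node *node = p->free;

	if (node)
		p->free = node->next;
	return node;
}

/* Return an element to the pool; no allocator call is involved. */
static void pool_push(struct pool *p, void *elem)
{
	struct pool_node *node = elem;

	node->next = p->free;
	p->free = node;
}

static void pool_destroy(struct pool *p)
{
	free(p->mem);
	p->mem = NULL;
	p->free = NULL;
}
```

Because pool_pop() and pool_push() never call into the allocator, an update path built on them cannot re-enter allocator locks, which is the property the commit message is after. The price is that memory for all elements is held from creation time.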
2016-03-07 21:57:15 -08:00
|
|
|
if (err)
|
2020-10-29 00:19:25 -07:00
|
|
|
goto free_map_locked;
|
2017-03-21 19:05:04 -07:00
|
|
|
|
|
|
|
if (!percpu && !lru) {
|
|
|
|
/* lru itself can remove the least used element, so
|
|
|
|
* there is no need for an extra elem during map_update.
|
|
|
|
*/
|
|
|
|
err = alloc_extra_elems(htab);
|
|
|
|
if (err)
|
|
|
|
goto free_prealloc;
|
|
|
|
}
|
2022-09-02 14:10:44 -07:00
|
|
|
} else {
|
2022-09-02 14:10:52 -07:00
|
|
|
err = bpf_mem_alloc_init(&htab->ma, htab->elem_size, false);
|
2022-09-02 14:10:44 -07:00
|
|
|
if (err)
|
|
|
|
goto free_map_locked;
|
2022-09-02 14:10:53 -07:00
|
|
|
if (percpu) {
|
|
|
|
err = bpf_mem_alloc_init(&htab->pcpu_ma,
|
|
|
|
round_up(htab->map.value_size, 8), true);
|
|
|
|
if (err)
|
|
|
|
goto free_map_locked;
|
|
|
|
}
|
2016-03-07 21:57:15 -08:00
|
|
|
}
|
2014-11-13 17:36:45 -08:00
|
|
|
|
|
|
|
return &htab->map;
|
|
|
|
|
2017-03-21 19:05:04 -07:00
|
|
|
free_prealloc:
|
|
|
|
prealloc_destroy(htab);
|
2020-10-29 00:19:25 -07:00
|
|
|
free_map_locked:
|
2022-09-11 00:07:11 +09:00
|
|
|
if (htab->use_percpu_counter)
|
|
|
|
percpu_counter_destroy(&htab->pcount);
|
2020-10-29 00:19:25 -07:00
|
|
|
for (i = 0; i < HASHTAB_MAP_LOCK_COUNT; i++)
|
|
|
|
free_percpu(htab->map_locked[i]);
|
bpf: don't trigger OOM killer under pressure with map alloc
This patch adds two helpers, bpf_map_area_alloc() and bpf_map_area_free(),
that are to be used for map allocations. Using kmalloc() for very large
allocations can cause excessive work within the page allocator, so i) fall
back earlier to vmalloc() when the attempt is considered costly anyway,
and even more importantly ii) don't trigger OOM killer with any of the
allocators.
Since this is based on a user space request, for example, when creating
maps with element pre-allocation, we really want such requests to fail
instead of killing other user space processes.
Also, don't spam the kernel log with warnings should any of the allocations
fail under pressure. Given that, we can make backend selection in
bpf_map_area_alloc() generic, and convert all maps over to use this API
for spots with potentially large allocation requests.
Note, replacing the one kmalloc_array() is fine as overflow checks happen
earlier in htab_map_alloc(), since it must also protect the multiplication
for vmalloc() should kmalloc_array() fail.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-18 15:14:17 +01:00
|
|
|
bpf_map_area_free(htab->buckets);
|
2022-09-02 14:10:53 -07:00
|
|
|
bpf_mem_alloc_destroy(&htab->pcpu_ma);
|
2022-09-02 14:10:44 -07:00
|
|
|
bpf_mem_alloc_destroy(&htab->ma);
|
2014-11-13 17:36:45 -08:00
|
|
|
free_htab:
|
2020-11-02 03:41:00 -08:00
|
|
|
lockdep_unregister_key(&htab->lockdep_key);
|
2022-08-10 15:18:29 +00:00
|
|
|
bpf_map_area_free(htab);
|
2014-11-13 17:36:45 -08:00
|
|
|
return ERR_PTR(err);
|
|
|
|
}
|
|
|
|
|
2018-08-22 23:49:37 +02:00
|
|
|
static inline u32 htab_map_hash(const void *key, u32 key_len, u32 hashrnd)
|
2014-11-13 17:36:45 -08:00
|
|
|
{
|
2023-04-01 20:06:02 +00:00
|
|
|
if (likely(key_len % 4 == 0))
|
|
|
|
return jhash2(key, key_len / 4, hashrnd);
|
2018-08-22 23:49:37 +02:00
|
|
|
return jhash(key, key_len, hashrnd);
|
2014-11-13 17:36:45 -08:00
|
|
|
}
|
|
|
|
|
2015-12-29 22:40:27 +08:00
|
|
|
static inline struct bucket *__select_bucket(struct bpf_htab *htab, u32 hash)
|
2014-11-13 17:36:45 -08:00
|
|
|
{
|
|
|
|
return &htab->buckets[hash & (htab->n_buckets - 1)];
|
|
|
|
}
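The mask in __select_bucket() only works as a modulo when n_buckets is a power of two. A small user-space sketch of that assumption follows; bucket_index and is_pow2 are hypothetical helper names used for illustration, not kernel functions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: true when n is a non-zero power of two. */
static inline int is_pow2(uint32_t n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/* Hypothetical helper: bucket index for a hash.
 * For a power-of-two n_buckets, hash & (n_buckets - 1) equals
 * hash % n_buckets but avoids the division.
 */
static inline uint32_t bucket_index(uint32_t hash, uint32_t n_buckets)
{
	assert(is_pow2(n_buckets));
	return hash & (n_buckets - 1);
}
```

With n_buckets equal to 16, hashes 13 and 29 both land in bucket 13, because only the low four bits of the hash survive the mask.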
|
|
|
|
|
2017-03-07 20:00:13 -08:00
|
|
|
static inline struct hlist_nulls_head *select_bucket(struct bpf_htab *htab, u32 hash)
|
2015-12-29 22:40:27 +08:00
|
|
|
{
|
|
|
|
return &__select_bucket(htab, hash)->head;
|
|
|
|
}
|
|
|
|
|
2017-03-07 20:00:13 -08:00
|
|
|
/* this lookup function can only be called with bucket lock taken */
|
|
|
|
static struct htab_elem *lookup_elem_raw(struct hlist_nulls_head *head, u32 hash,
|
2014-11-13 17:36:45 -08:00
|
|
|
void *key, u32 key_size)
|
|
|
|
{
|
2017-03-07 20:00:13 -08:00
|
|
|
struct hlist_nulls_node *n;
|
2014-11-13 17:36:45 -08:00
|
|
|
struct htab_elem *l;
|
|
|
|
|
2017-03-07 20:00:13 -08:00
|
|
|
hlist_nulls_for_each_entry_rcu(l, n, head, hash_node)
|
2014-11-13 17:36:45 -08:00
|
|
|
if (l->hash == hash && !memcmp(&l->key, key, key_size))
|
|
|
|
return l;
|
|
|
|
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
2017-03-07 20:00:13 -08:00
|
|
|
/* can be called without bucket lock. it will repeat the loop in
|
|
|
|
* the unlikely event when elements moved from one bucket into another
|
|
|
|
* while the linked list is being walked
|
|
|
|
*/
|
|
|
|
static struct htab_elem *lookup_nulls_elem_raw(struct hlist_nulls_head *head,
|
|
|
|
u32 hash, void *key,
|
|
|
|
u32 key_size, u32 n_buckets)
|
|
|
|
{
|
|
|
|
struct hlist_nulls_node *n;
|
|
|
|
struct htab_elem *l;
|
|
|
|
|
|
|
|
again:
|
|
|
|
hlist_nulls_for_each_entry_rcu(l, n, head, hash_node)
|
|
|
|
if (l->hash == hash && !memcmp(&l->key, key, key_size))
|
|
|
|
return l;
|
|
|
|
|
|
|
|
if (unlikely(get_nulls_value(n) != (hash & (n_buckets - 1))))
|
|
|
|
goto again;
|
|
|
|
|
|
|
|
return NULL;
|
|
|
|
}
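The retry in lookup_nulls_elem_raw() relies on the end-of-list marker carrying the bucket index. The sketch below models that idea in user space, assuming the end marker is an odd word that encodes the index as (bucket << 1) | 1. All names here (struct node, make_nulls, walk_once, and the WALK_* results) are hypothetical, and the sketch is single-threaded, so it reports "retry" rather than looping.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical list node: next is either a real pointer (even value)
 * or an odd end-of-list marker that encodes a bucket index.
 */
struct node {
	uintptr_t next;
	uint32_t hash;
};

static inline uintptr_t make_nulls(uint32_t bucket)
{
	return ((uintptr_t)bucket << 1) | 1;
}

static inline int is_nulls(uintptr_t p)
{
	return (int)(p & 1);
}

static inline uint32_t nulls_value(uintptr_t p)
{
	return (uint32_t)(p >> 1);
}

enum walk_result { WALK_FOUND, WALK_MISS, WALK_RETRY };

/* One pass over a chain. If the walk ends on a marker for a different
 * bucket, the element being followed was moved mid-walk and the caller
 * should start again from the head.
 */
static enum walk_result walk_once(uintptr_t first, uint32_t hash,
				  uint32_t n_buckets, struct node **out)
{
	uintptr_t p = first;

	*out = NULL;
	while (!is_nulls(p)) {
		struct node *n = (struct node *)p;

		if (n->hash == hash) {
			*out = n;
			return WALK_FOUND;
		}
		p = n->next;
	}
	return nulls_value(p) == (hash & (n_buckets - 1)) ?
		WALK_MISS : WALK_RETRY;
}
```

A miss is only trusted when the walk finishes in the bucket it started in. Otherwise the reader may have skipped elements of its own bucket after following a node that was reused in another one.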
|
|
|
|
|
2017-03-15 18:26:43 -07:00
|
|
|
/* Called from syscall or from eBPF program directly, so
|
|
|
|
* arguments have to match bpf_map_lookup_elem() exactly.
|
|
|
|
* The return value is adjusted by BPF instructions
|
|
|
|
* in htab_map_gen_lookup().
|
|
|
|
*/
|
2016-02-01 22:39:53 -08:00
|
|
|
static void *__htab_map_lookup_elem(struct bpf_map *map, void *key)
|
2014-11-13 17:36:45 -08:00
|
|
|
{
|
|
|
|
struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
|
2017-03-07 20:00:13 -08:00
|
|
|
struct hlist_nulls_head *head;
|
2014-11-13 17:36:45 -08:00
|
|
|
struct htab_elem *l;
|
|
|
|
u32 hash, key_size;
|
|
|
|
|
2021-06-24 18:05:54 +02:00
|
|
|
WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_trace_held() &&
|
|
|
|
!rcu_read_lock_bh_held());
|
2014-11-13 17:36:45 -08:00
|
|
|
|
|
|
|
key_size = map->key_size;
|
|
|
|
|
2018-08-22 23:49:37 +02:00
|
|
|
hash = htab_map_hash(key, key_size, htab->hashrnd);
|
2014-11-13 17:36:45 -08:00
|
|
|
|
|
|
|
head = select_bucket(htab, hash);
|
|
|
|
|
2017-03-07 20:00:13 -08:00
|
|
|
l = lookup_nulls_elem_raw(head, hash, key, key_size, htab->n_buckets);
|
2014-11-13 17:36:45 -08:00
|
|
|
|
2016-02-01 22:39:53 -08:00
|
|
|
return l;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void *htab_map_lookup_elem(struct bpf_map *map, void *key)
|
|
|
|
{
|
|
|
|
struct htab_elem *l = __htab_map_lookup_elem(map, key);
|
|
|
|
|
2014-11-13 17:36:45 -08:00
|
|
|
if (l)
|
|
|
|
return l->key + round_up(map->key_size, 8);
|
|
|
|
|
|
|
|
return NULL;
|
|
|
|
}
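htab_map_lookup_elem() returns a pointer just past the key, padded to an 8-byte boundary. The arithmetic can be sketched as below; round_up8 and elem_value are hypothetical stand-ins for the kernel's round_up() macro and the l->key + round_up(map->key_size, 8) expression.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for round_up(x, 8). */
static inline size_t round_up8(size_t x)
{
	return (x + 7) & ~(size_t)7;
}

/* elem_key points at the first key byte inside one element; the value
 * starts at the next 8-byte boundary after the key.
 */
static inline void *elem_value(void *elem_key, size_t key_size)
{
	return (char *)elem_key + round_up8(key_size);
}
```

For example, a 5-byte key puts the value at offset 8 from the key, and a 16-byte key puts it at offset 16, so the value is always 8-byte aligned relative to the key start.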
|
|
|
|
|
2017-03-15 18:26:43 -07:00
|
|
|
/* inline bpf_map_lookup_elem() call.
|
|
|
|
* Instead of:
|
|
|
|
* bpf_prog
|
|
|
|
* bpf_map_lookup_elem
|
|
|
|
* map->ops->map_lookup_elem
|
|
|
|
* htab_map_lookup_elem
|
|
|
|
* __htab_map_lookup_elem
|
|
|
|
* do:
|
|
|
|
* bpf_prog
|
|
|
|
* __htab_map_lookup_elem
|
|
|
|
*/
|
2020-10-11 01:40:03 +02:00
|
|
|
static int htab_map_gen_lookup(struct bpf_map *map, struct bpf_insn *insn_buf)
|
2017-03-15 18:26:43 -07:00
|
|
|
{
|
|
|
|
struct bpf_insn *insn = insn_buf;
|
|
|
|
const int ret = BPF_REG_0;
|
|
|
|
|
2018-06-02 23:06:35 +02:00
|
|
|
BUILD_BUG_ON(!__same_type(&__htab_map_lookup_elem,
|
|
|
|
(void *(*)(struct bpf_map *map, void *key))NULL));
|
2021-09-28 16:09:45 -07:00
|
|
|
*insn++ = BPF_EMIT_CALL(__htab_map_lookup_elem);
|
2017-03-15 18:26:43 -07:00
|
|
|
*insn++ = BPF_JMP_IMM(BPF_JEQ, ret, 0, 1);
|
|
|
|
*insn++ = BPF_ALU64_IMM(BPF_ADD, ret,
|
|
|
|
offsetof(struct htab_elem, key) +
|
|
|
|
round_up(map->key_size, 8));
|
|
|
|
return insn - insn_buf;
|
|
|
|
}
|
|
|
|
|
2019-05-14 01:18:56 +02:00
|
|
|
static __always_inline void *__htab_lru_map_lookup_elem(struct bpf_map *map,
|
|
|
|
void *key, const bool mark)
|
2016-11-11 10:55:09 -08:00
|
|
|
{
|
|
|
|
struct htab_elem *l = __htab_map_lookup_elem(map, key);
|
|
|
|
|
|
|
|
if (l) {
|
2019-05-14 01:18:56 +02:00
|
|
|
if (mark)
|
|
|
|
bpf_lru_node_set_ref(&l->lru_node);
|
2016-11-11 10:55:09 -08:00
|
|
|
return l->key + round_up(map->key_size, 8);
|
|
|
|
}
|
|
|
|
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
2019-05-14 01:18:56 +02:00
|
|
|
static void *htab_lru_map_lookup_elem(struct bpf_map *map, void *key)
|
|
|
|
{
|
|
|
|
return __htab_lru_map_lookup_elem(map, key, true);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void *htab_lru_map_lookup_elem_sys(struct bpf_map *map, void *key)
|
|
|
|
{
|
|
|
|
return __htab_lru_map_lookup_elem(map, key, false);
|
|
|
|
}
|
|
|
|
|
2020-10-11 01:40:03 +02:00
|
|
|
static int htab_lru_map_gen_lookup(struct bpf_map *map,
|
2017-08-31 23:27:12 -07:00
|
|
|
struct bpf_insn *insn_buf)
|
|
|
|
{
|
|
|
|
struct bpf_insn *insn = insn_buf;
|
|
|
|
const int ret = BPF_REG_0;
|
2017-08-31 23:27:13 -07:00
|
|
|
const int ref_reg = BPF_REG_1;
|
2017-08-31 23:27:12 -07:00
|
|
|
|
2018-06-02 23:06:35 +02:00
|
|
|
BUILD_BUG_ON(!__same_type(&__htab_map_lookup_elem,
|
|
|
|
(void *(*)(struct bpf_map *map, void *key))NULL));
|
2021-09-28 16:09:45 -07:00
|
|
|
*insn++ = BPF_EMIT_CALL(__htab_map_lookup_elem);
|
2017-08-31 23:27:13 -07:00
|
|
|
*insn++ = BPF_JMP_IMM(BPF_JEQ, ret, 0, 4);
|
|
|
|
*insn++ = BPF_LDX_MEM(BPF_B, ref_reg, ret,
|
|
|
|
offsetof(struct htab_elem, lru_node) +
|
|
|
|
offsetof(struct bpf_lru_node, ref));
|
|
|
|
*insn++ = BPF_JMP_IMM(BPF_JNE, ref_reg, 0, 1);
|
2017-08-31 23:27:12 -07:00
|
|
|
*insn++ = BPF_ST_MEM(BPF_B, ret,
|
|
|
|
offsetof(struct htab_elem, lru_node) +
|
|
|
|
offsetof(struct bpf_lru_node, ref),
|
|
|
|
1);
|
|
|
|
*insn++ = BPF_ALU64_IMM(BPF_ADD, ret,
|
|
|
|
offsetof(struct htab_elem, key) +
|
|
|
|
round_up(map->key_size, 8));
|
|
|
|
return insn - insn_buf;
|
|
|
|
}
|
|
|
|
|
bpf: Wire up freeing of referenced kptr
A destructor kfunc can be defined as void func(type *), where type may
be void or any other pointer type as per convenience.
In this patch, we ensure that the type is sane and capture the function
pointer into off_desc of ptr_off_tab for the specific pointer offset,
with the invariant that the dtor pointer is always set when 'kptr_ref'
tag is applied to the pointer's pointee type, which is indicated by the
flag BPF_MAP_VALUE_OFF_F_REF.
Note that only BTF IDs whose destructor kfunc is registered become
the allowed BTF IDs for embedding as referenced kptr. Hence it serves
the purpose of finding dtor kfunc BTF ID, as well as acting as a check
against the whitelist of allowed BTF IDs for this purpose.
Finally, wire up the actual freeing of the referenced pointer if any at
all available offsets, so that no references are leaked after the BPF
map goes away and the BPF program previously moved the ownership of a
referenced pointer into it.
The behavior is similar to BPF timers, where bpf_map_{update,delete}_elem
will free any existing referenced kptr. The same case is with LRU map's
bpf_lru_push_free/htab_lru_push_free functions, which are extended to
reset unreferenced and free referenced kptr.
Note that unlike BPF timers, kptr is not reset or freed when map uref
drops to zero.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220424214901.2743946-8-memxor@gmail.com
2022-04-25 03:18:55 +05:30
|
|
|
static void check_and_free_fields(struct bpf_htab *htab,
|
|
|
|
struct htab_elem *elem)
|
2021-07-14 17:54:10 -07:00
|
|
|
{
|
2023-02-25 16:40:08 +01:00
|
|
|
if (htab_is_percpu(htab)) {
|
|
|
|
void __percpu *pptr = htab_elem_get_ptr(elem, htab->map.key_size);
|
|
|
|
int cpu;
|
2022-04-25 03:18:55 +05:30
|
|
|
|
2023-02-25 16:40:08 +01:00
|
|
|
for_each_possible_cpu(cpu)
|
|
|
|
bpf_obj_free_fields(htab->map.record, per_cpu_ptr(pptr, cpu));
|
|
|
|
} else {
|
|
|
|
void *map_value = elem->key + round_up(htab->map.key_size, 8);
|
|
|
|
|
|
|
|
bpf_obj_free_fields(htab->map.record, map_value);
|
|
|
|
}
|
2021-07-14 17:54:10 -07:00
|
|
|
}
|
|
|
|
|
2016-11-11 10:55:09 -08:00
|
|
|
/* It is called from the bpf_lru_list when the LRU needs to delete
|
|
|
|
* older elements from the htab.
|
|
|
|
*/
|
|
|
|
static bool htab_lru_map_delete_node(void *arg, struct bpf_lru_node *node)
|
|
|
|
{
|
2022-04-12 18:50:48 -07:00
|
|
|
struct bpf_htab *htab = arg;
|
2017-03-07 20:00:13 -08:00
|
|
|
struct htab_elem *l = NULL, *tgt_l;
|
|
|
|
struct hlist_nulls_head *head;
|
|
|
|
struct hlist_nulls_node *n;
|
2016-11-11 10:55:09 -08:00
|
|
|
unsigned long flags;
|
|
|
|
struct bucket *b;
|
2020-10-29 00:19:25 -07:00
|
|
|
int ret;
|
2016-11-11 10:55:09 -08:00
|
|
|
|
|
|
|
tgt_l = container_of(node, struct htab_elem, lru_node);
|
|
|
|
b = __select_bucket(htab, tgt_l->hash);
|
|
|
|
head = &b->head;
|
|
|
|
|
2020-10-29 00:19:25 -07:00
|
|
|
ret = htab_lock_bucket(htab, b, tgt_l->hash, &flags);
|
|
|
|
if (ret)
|
|
|
|
return false;
|
2016-11-11 10:55:09 -08:00
|
|
|
|
2017-03-07 20:00:13 -08:00
|
|
|
hlist_nulls_for_each_entry_rcu(l, n, head, hash_node)
|
2016-11-11 10:55:09 -08:00
|
|
|
if (l == tgt_l) {
|
2017-03-07 20:00:13 -08:00
|
|
|
hlist_nulls_del_rcu(&l->hash_node);
|
2022-04-25 03:18:55 +05:30
|
|
|
check_and_free_fields(htab, l);
|
2016-11-11 10:55:09 -08:00
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
2020-10-29 00:19:25 -07:00
|
|
|
htab_unlock_bucket(htab, b, tgt_l->hash, flags);
|
2016-11-11 10:55:09 -08:00
|
|
|
|
|
|
|
return l == tgt_l;
|
|
|
|
}
|
|
|
|
|
2014-11-13 17:36:45 -08:00
|
|
|
/* Called from syscall */
|
|
|
|
static int htab_map_get_next_key(struct bpf_map *map, void *key, void *next_key)
|
|
|
|
{
|
|
|
|
struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
|
2017-03-07 20:00:13 -08:00
|
|
|
struct hlist_nulls_head *head;
|
2014-11-13 17:36:45 -08:00
|
|
|
struct htab_elem *l, *next_l;
|
|
|
|
u32 hash, key_size;
|
2017-04-24 19:00:37 -07:00
|
|
|
int i = 0;
|
2014-11-13 17:36:45 -08:00
|
|
|
|
|
|
|
WARN_ON_ONCE(!rcu_read_lock_held());
|
|
|
|
|
|
|
|
key_size = map->key_size;
|
|
|
|
|
2017-04-24 19:00:37 -07:00
|
|
|
if (!key)
|
|
|
|
goto find_first_elem;
|
|
|
|
|
2018-08-22 23:49:37 +02:00
|
|
|
hash = htab_map_hash(key, key_size, htab->hashrnd);
|
2014-11-13 17:36:45 -08:00
|
|
|
|
|
|
|
head = select_bucket(htab, hash);
|
|
|
|
|
|
|
|
/* lookup the key */
|
2017-03-07 20:00:13 -08:00
|
|
|
l = lookup_nulls_elem_raw(head, hash, key, key_size, htab->n_buckets);
|
2014-11-13 17:36:45 -08:00
|
|
|
|
2017-04-24 19:00:37 -07:00
|
|
|
if (!l)
|
2014-11-13 17:36:45 -08:00
|
|
|
goto find_first_elem;
|
|
|
|
|
|
|
|
/* key was found, get next key in the same bucket */
|
2017-03-07 20:00:13 -08:00
|
|
|
next_l = hlist_nulls_entry_safe(rcu_dereference_raw(hlist_nulls_next_rcu(&l->hash_node)),
|
2014-11-13 17:36:45 -08:00
|
|
|
struct htab_elem, hash_node);
|
|
|
|
|
|
|
|
if (next_l) {
|
|
|
|
/* if next elem in this hash list is non-zero, just return it */
|
|
|
|
memcpy(next_key, next_l->key, key_size);
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* no more elements in this hash list, go to the next bucket */
|
|
|
|
i = hash & (htab->n_buckets - 1);
|
|
|
|
i++;
|
|
|
|
|
|
|
|
find_first_elem:
|
|
|
|
/* iterate over buckets */
|
|
|
|
for (; i < htab->n_buckets; i++) {
|
|
|
|
head = select_bucket(htab, i);
|
|
|
|
|
|
|
|
/* pick first element in the bucket */
|
2017-03-07 20:00:13 -08:00
|
|
|
next_l = hlist_nulls_entry_safe(rcu_dereference_raw(hlist_nulls_first_rcu(head)),
|
2014-11-13 17:36:45 -08:00
|
|
|
struct htab_elem, hash_node);
|
|
|
|
if (next_l) {
|
|
|
|
/* if it's not empty, just return it */
|
|
|
|
memcpy(next_key, next_l->key, key_size);
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2016-03-07 21:57:15 -08:00
|
|
|
/* iterated over all buckets and all elements */
|
2014-11-13 17:36:45 -08:00
|
|
|
return -ENOENT;
|
|
|
|
}
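The traversal order of htab_map_get_next_key() can be followed on a toy table. This is a minimal user-space sketch, assuming plain NULL-terminated chains, no RCU, 32-bit integer keys and an identity hash. The toy_* names and N_BUCKETS are hypothetical, and -1 stands in for -ENOENT.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define N_BUCKETS 4u /* power of two, as the mask below requires */

struct toy_elem {
	struct toy_elem *next;
	uint32_t key;
};

struct toy_htab {
	struct toy_elem *buckets[N_BUCKETS];
};

/* Identity "hash": a placeholder for illustration, not jhash. */
static uint32_t toy_hash(uint32_t key)
{
	return key;
}

/* Insert at the head of the bucket chain. */
static void toy_insert(struct toy_htab *t, struct toy_elem *e)
{
	uint32_t b = toy_hash(e->key) & (N_BUCKETS - 1);

	e->next = t->buckets[b];
	t->buckets[b] = e;
}

/* Returns 0 and writes *next_key, or -1 once every bucket was visited.
 * A NULL or unknown key restarts the iteration from the first bucket.
 */
static int toy_get_next_key(const struct toy_htab *t, const uint32_t *key,
			    uint32_t *next_key)
{
	const struct toy_elem *l;
	uint32_t i = 0;

	if (key) {
		uint32_t b = toy_hash(*key) & (N_BUCKETS - 1);

		for (l = t->buckets[b]; l; l = l->next)
			if (l->key == *key)
				break;
		if (l) {
			/* key found: next element in the same chain, if any */
			if (l->next) {
				*next_key = l->next->key;
				return 0;
			}
			/* chain exhausted: continue with the next bucket */
			i = b + 1;
		}
	}

	for (; i < N_BUCKETS; i++) {
		if (t->buckets[i]) {
			*next_key = t->buckets[i]->key;
			return 0;
		}
	}
	return -1;
}
```

A caller iterates by passing NULL first and then feeding each returned key back in until the function reports the end. If the key passed in has been deleted in the meantime, the walk restarts from the first bucket, so a caller may see keys more than once.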
|
|
|
|
|
2016-03-07 21:57:15 -08:00
|
|
|
static void htab_elem_free(struct bpf_htab *htab, struct htab_elem *l)
|
2016-02-01 22:39:53 -08:00
|
|
|
{
|
2023-02-25 16:40:08 +01:00
|
|
|
check_and_free_fields(htab, l);
|
2016-03-07 21:57:15 -08:00
|
|
|
if (htab->map.map_type == BPF_MAP_TYPE_PERCPU_HASH)
|
2022-09-02 14:10:53 -07:00
|
|
|
bpf_mem_cache_free(&htab->pcpu_ma, l->ptr_to_pptr);
|
2022-09-02 14:10:44 -07:00
|
|
|
bpf_mem_cache_free(&htab->ma, l);
|
2016-02-01 22:39:53 -08:00
|
|
|
}
|
|
|
|
|
2020-07-28 21:09:12 -07:00
|
|
|
static void htab_put_fd_value(struct bpf_htab *htab, struct htab_elem *l)
|
2016-02-01 22:39:53 -08:00
|
|
|
{
|
2017-03-22 10:00:34 -07:00
|
|
|
struct bpf_map *map = &htab->map;
|
2020-07-28 21:09:12 -07:00
|
|
|
void *ptr;
|
2017-03-22 10:00:34 -07:00
|
|
|
|
|
|
|
if (map->ops->map_fd_put_ptr) {
|
2020-07-28 21:09:12 -07:00
|
|
|
ptr = fd_htab_map_get_ptr(map, l);
|
2017-03-22 10:00:34 -07:00
|
|
|
map->ops->map_fd_put_ptr(ptr);
|
|
|
|
}
|
2020-07-28 21:09:12 -07:00
|
|
|
}
|
|
|
|
|
2022-09-02 14:10:48 -07:00
|
|
|
static bool is_map_full(struct bpf_htab *htab)
|
|
|
|
{
|
|
|
|
if (htab->use_percpu_counter)
|
|
|
|
return __percpu_counter_compare(&htab->pcount, htab->map.max_entries,
|
|
|
|
PERCPU_COUNTER_BATCH) >= 0;
|
|
|
|
return atomic_read(&htab->count) >= htab->map.max_entries;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void inc_elem_count(struct bpf_htab *htab)
|
|
|
|
{
|
|
|
|
if (htab->use_percpu_counter)
|
|
|
|
percpu_counter_add_batch(&htab->pcount, 1, PERCPU_COUNTER_BATCH);
|
|
|
|
else
|
|
|
|
atomic_inc(&htab->count);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void dec_elem_count(struct bpf_htab *htab)
|
|
|
|
{
|
|
|
|
if (htab->use_percpu_counter)
|
|
|
|
percpu_counter_add_batch(&htab->pcount, -1, PERCPU_COUNTER_BATCH);
|
|
|
|
else
|
|
|
|
atomic_dec(&htab->count);
|
|
|
|
}
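The three counting helpers above can be modelled in user space roughly as follows. This is only a sketch: it covers the plain atomic variant (not the percpu_counter one), uses C11 atomics instead of the kernel's atomic_t, and the names elem_counter, counter_is_full, counter_inc, counter_dec and counter_reserve are hypothetical. counter_reserve mirrors the check-then-increment sequence used on the non-preallocated allocation path, where a full map still accepts an update that replaces an existing element.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical user-space stand-in for the count/max_entries pair. */
struct elem_counter {
	atomic_int count;
	int max_entries;
};

static bool counter_is_full(struct elem_counter *c)
{
	return atomic_load(&c->count) >= c->max_entries;
}

static void counter_inc(struct elem_counter *c)
{
	atomic_fetch_add(&c->count, 1);
}

static void counter_dec(struct elem_counter *c)
{
	atomic_fetch_sub(&c->count, 1);
}

/*
 * Reserve room for one element. A full map rejects new keys, but a
 * replacement is allowed because the old element is freed right after.
 * The check and the increment are two separate steps, as in the source.
 */
static bool counter_reserve(struct elem_counter *c, bool replacing)
{
	assert(c->max_entries > 0);
	if (counter_is_full(c) && !replacing)
		return false;
	counter_inc(c);
	return true;
}
```

A caller that fails later in the allocation is expected to undo the reservation with counter_dec, which is what the dec_count error path does in the real code.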
|
|
|
|
|
|
|
|
|
2020-07-28 21:09:12 -07:00
|
|
|
static void free_htab_elem(struct bpf_htab *htab, struct htab_elem *l)
|
|
|
|
{
|
|
|
|
htab_put_fd_value(htab, l);
|
2017-03-22 10:00:34 -07:00
|
|
|
|
2017-03-21 19:05:04 -07:00
|
|
|
if (htab_is_prealloc(htab)) {
|
bpf: Wire up freeing of referenced kptr
A destructor kfunc can be defined as void func(type *), where type may
be void or any other pointer type as per convenience.
In this patch, we ensure that the type is sane and capture the function
pointer into off_desc of ptr_off_tab for the specific pointer offset,
with the invariant that the dtor pointer is always set when 'kptr_ref'
tag is applied to the pointer's pointee type, which is indicated by the
flag BPF_MAP_VALUE_OFF_F_REF.
Note that only BTF IDs whose destructor kfunc is registered become
the allowed BTF IDs for embedding as referenced kptr. Hence it serves
the purpose of finding dtor kfunc BTF ID, as well as acting as a check
against the whitelist of allowed BTF IDs for this purpose.
Finally, wire up the actual freeing of the referenced pointer if any at
all available offsets, so that no references are leaked after the BPF
map goes away and the BPF program previously moved the ownership of a
referenced pointer into it.
The behavior is similar to BPF timers, where bpf_map_{update,delete}_elem
will free any existing referenced kptr. The same case is with LRU map's
bpf_lru_push_free/htab_lru_push_free functions, which are extended to
reset unreferenced and free referenced kptr.
Note that unlike BPF timers, kptr is not reset or freed when map uref
drops to zero.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220424214901.2743946-8-memxor@gmail.com
2022-04-25 03:18:55 +05:30
|
|
|
check_and_free_fields(htab, l);
|
2019-01-30 18:12:43 -08:00
|
|
|
__pcpu_freelist_push(&htab->freelist, &l->fnode);
|
2016-02-01 22:39:53 -08:00
|
|
|
} else {
|
2022-09-02 14:10:48 -07:00
|
|
|
dec_elem_count(htab);
|
2022-09-02 14:10:53 -07:00
|
|
|
htab_elem_free(htab, l);
|
2016-02-01 22:39:53 -08:00
|
|
|
}
|
|
|
|
}
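In the preallocated case, free_htab_elem returns the element to a per-CPU free list instead of freeing memory. The idea can be sketched with a simplified, single-threaded model. Everything here is an assumption made for illustration: the names fl_node, fl_model, fl_populate, fl_push and fl_pop, the fixed MODEL_NR_CPUS, and the absence of locking (the kernel's pcpu_freelist protects each list with its own lock).

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_NR_CPUS 4	/* hypothetical CPU count for the model */

struct fl_node {
	struct fl_node *next;
};

struct fl_model {
	struct fl_node *head[MODEL_NR_CPUS];	/* one list per CPU */
};

/* Spread the pre-allocated nodes round-robin over the per-CPU lists. */
static void fl_populate(struct fl_model *fl, struct fl_node *nodes, size_t n)
{
	size_t i;

	for (i = 0; i < MODEL_NR_CPUS; i++)
		fl->head[i] = NULL;
	for (i = 0; i < n; i++) {
		int cpu = (int)(i % MODEL_NR_CPUS);

		nodes[i].next = fl->head[cpu];
		fl->head[cpu] = &nodes[i];
	}
}

/* Return a node to the list of the CPU that frees it. */
static void fl_push(struct fl_model *fl, int cpu, struct fl_node *node)
{
	assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
	node->next = fl->head[cpu];
	fl->head[cpu] = node;
}

/*
 * Take a node from the local list first, then try the other CPUs' lists.
 * Returns NULL only when every list is empty, i.e. the map is full.
 */
static struct fl_node *fl_pop(struct fl_model *fl, int cpu)
{
	int i;

	assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
	for (i = 0; i < MODEL_NR_CPUS; i++) {
		int c = (cpu + i) % MODEL_NR_CPUS;
		struct fl_node *node = fl->head[c];

		if (node) {
			fl->head[c] = node->next;
			return node;
		}
	}
	return NULL;
}
```

Because every node is allocated up front, push and pop never call an allocator, which is the property that makes this scheme safe to use from arbitrary probe contexts.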
|
|
|
|
|
2016-11-11 10:55:08 -08:00
|
|
|
static void pcpu_copy_value(struct bpf_htab *htab, void __percpu *pptr,
|
|
|
|
void *value, bool onallcpus)
|
|
|
|
{
|
|
|
|
if (!onallcpus) {
|
|
|
|
/* copy true value_size bytes */
|
2023-02-25 16:40:08 +01:00
|
|
|
copy_map_value(&htab->map, this_cpu_ptr(pptr), value);
|
2016-11-11 10:55:08 -08:00
|
|
|
} else {
|
|
|
|
u32 size = round_up(htab->map.value_size, 8);
|
|
|
|
int off = 0, cpu;
|
|
|
|
|
|
|
|
for_each_possible_cpu(cpu) {
|
2023-02-25 16:40:08 +01:00
|
|
|
copy_map_value_long(&htab->map, per_cpu_ptr(pptr, cpu), value + off);
|
2016-11-11 10:55:08 -08:00
|
|
|
off += size;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
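The layout that pcpu_copy_value expects from its caller can be illustrated with a user-space model. This is a sketch under stated assumptions: per-CPU storage is represented as an array of plain pointers, MODEL_NR_CPUS and model_copy_value are hypothetical names, and memcpy of value_size bytes stands in for copy_map_value and copy_map_value_long. The point it shows is that with onallcpus set, the source buffer holds one value per CPU at a stride of round_up(value_size, 8).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MODEL_NR_CPUS 4			/* hypothetical CPU count */
#define MODEL_ROUND_UP8(x) (((x) + 7u) & ~7u)

/*
 * percpu[cpu] points at that CPU's slot, at least value_size bytes long.
 * Without onallcpus only this_cpu's slot is written; with onallcpus the
 * flat buffer is scattered into every slot, advancing by the rounded stride.
 */
static void model_copy_value(unsigned char *percpu[MODEL_NR_CPUS], int this_cpu,
			     const unsigned char *value,
			     unsigned int value_size, int onallcpus)
{
	unsigned int stride = MODEL_ROUND_UP8(value_size);
	unsigned int off = 0;
	int cpu;

	assert(this_cpu >= 0 && this_cpu < MODEL_NR_CPUS);
	if (!onallcpus) {
		memcpy(percpu[this_cpu], value, value_size);
		return;
	}
	for (cpu = 0; cpu < MODEL_NR_CPUS; cpu++) {
		memcpy(percpu[cpu], value + off, value_size);
		off += stride;
	}
}
```

With a 4-byte value, for instance, the flat buffer still uses 8 bytes per CPU, so the value for CPU 2 starts at byte offset 16.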
|
|
|
|
|
2020-11-04 12:23:32 +01:00
|
|
|
static void pcpu_init_value(struct bpf_htab *htab, void __percpu *pptr,
|
|
|
|
void *value, bool onallcpus)
|
|
|
|
{
|
2022-09-02 14:10:53 -07:00
|
|
|
/* When not setting the initial value on all cpus, zero-fill element
|
|
|
|
* values for other cpus. Otherwise, bpf program has no way to ensure
|
2020-11-04 12:23:32 +01:00
|
|
|
* known initial values for cpus other than current one
|
|
|
|
* (onallcpus=false always when coming from bpf prog).
|
|
|
|
*/
|
2022-09-02 14:10:53 -07:00
|
|
|
if (!onallcpus) {
|
2020-11-04 12:23:32 +01:00
|
|
|
int current_cpu = raw_smp_processor_id();
|
|
|
|
int cpu;
|
|
|
|
|
|
|
|
for_each_possible_cpu(cpu) {
|
|
|
|
if (cpu == current_cpu)
|
2023-02-25 16:40:08 +01:00
|
|
|
copy_map_value_long(&htab->map, per_cpu_ptr(pptr, cpu), value);
|
|
|
|
else /* Since elem is preallocated, we cannot touch special fields */
|
|
|
|
zero_map_value(&htab->map, per_cpu_ptr(pptr, cpu));
|
2020-11-04 12:23:32 +01:00
|
|
|
}
|
|
|
|
} else {
|
|
|
|
pcpu_copy_value(htab, pptr, value, onallcpus);
|
|
|
|
}
|
|
|
|
}
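The zero-fill rule described in the comment of pcpu_init_value can be modelled as below. As before this is only a user-space sketch: model_init_value and MODEL_NR_CPUS are hypothetical names, plain pointers stand in for per-CPU storage, and memset stands in for zero_map_value. It covers just the onallcpus=false branch, where the current CPU receives the value and every other CPU starts from zero.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MODEL_NR_CPUS 4	/* hypothetical CPU count */

/*
 * Initialise a fresh per-CPU element when the caller supplies a single
 * value: this_cpu gets the value, all other CPUs get zeroes, so a recycled
 * element never exposes stale contents on the other CPUs.
 */
static void model_init_value(unsigned char *percpu[MODEL_NR_CPUS], int this_cpu,
			     const unsigned char *value, size_t value_size)
{
	int cpu;

	assert(this_cpu >= 0 && this_cpu < MODEL_NR_CPUS);
	for (cpu = 0; cpu < MODEL_NR_CPUS; cpu++) {
		if (cpu == this_cpu)
			memcpy(percpu[cpu], value, value_size);
		else
			memset(percpu[cpu], 0, value_size);
	}
}
```

The zeroing matters mostly for preallocated maps, where an element popped from the free list may still carry the per-CPU values of a previously deleted key.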
|
|
|
|
|
2017-08-23 00:06:09 +02:00
|
|
|
static bool fd_htab_map_needs_adjust(const struct bpf_htab *htab)
|
|
|
|
{
|
|
|
|
return htab->map.map_type == BPF_MAP_TYPE_HASH_OF_MAPS &&
|
|
|
|
BITS_PER_LONG == 64;
|
|
|
|
}
|
|
|
|
|
2016-02-01 22:39:53 -08:00
|
|
|
static struct htab_elem *alloc_htab_elem(struct bpf_htab *htab, void *key,
|
|
|
|
void *value, u32 key_size, u32 hash,
|
2016-08-05 14:01:27 -07:00
|
|
|
bool percpu, bool onallcpus,
|
2017-03-21 19:05:04 -07:00
|
|
|
struct htab_elem *old_elem)
|
2016-02-01 22:39:53 -08:00
|
|
|
{
|
2019-01-31 15:40:04 -08:00
|
|
|
u32 size = htab->map.value_size;
|
2017-03-21 19:05:04 -07:00
|
|
|
bool prealloc = htab_is_prealloc(htab);
|
|
|
|
struct htab_elem *l_new, **pl_new;
|
2016-02-01 22:39:53 -08:00
|
|
|
void __percpu *pptr;
|
|
|
|
|
bpf: pre-allocate hash map elements
2016-03-07 21:57:15 -08:00
|
|
|
if (prealloc) {
|
2017-03-21 19:05:04 -07:00
|
|
|
if (old_elem) {
|
|
|
|
/* if we're updating the existing element,
|
|
|
|
* use per-cpu extra elems to avoid freelist_pop/push
|
|
|
|
*/
|
|
|
|
pl_new = this_cpu_ptr(htab->extra_elems);
|
|
|
|
l_new = *pl_new;
|
2020-07-28 21:09:12 -07:00
|
|
|
htab_put_fd_value(htab, old_elem);
|
2017-03-21 19:05:04 -07:00
|
|
|
*pl_new = old_elem;
|
|
|
|
} else {
|
|
|
|
struct pcpu_freelist_node *l;
|
2017-03-07 20:00:12 -08:00
|
|
|
|
2019-01-30 18:12:43 -08:00
|
|
|
l = __pcpu_freelist_pop(&htab->freelist);
|
2017-03-21 19:05:04 -07:00
|
|
|
if (!l)
|
|
|
|
return ERR_PTR(-E2BIG);
|
2017-03-07 20:00:12 -08:00
|
|
|
l_new = container_of(l, struct htab_elem, fnode);
|
bpf: pre-allocate hash map elements
2016-03-07 21:57:15 -08:00
|
|
|
}
|
2016-08-05 14:01:27 -07:00
|
|
|
} else {
|
2022-09-02 14:10:48 -07:00
|
|
|
if (is_map_full(htab))
|
|
|
|
if (!old_elem)
|
2017-03-21 19:05:04 -07:00
|
|
|
/* when map is full and update() is replacing
|
|
|
|
* old element, it's ok to allocate, since
|
|
|
|
* old element will be freed immediately.
|
|
|
|
* Otherwise return an error
|
|
|
|
*/
|
2022-09-02 14:10:48 -07:00
|
|
|
return ERR_PTR(-E2BIG);
|
|
|
|
inc_elem_count(htab);
|
2022-09-02 14:10:44 -07:00
|
|
|
l_new = bpf_mem_cache_alloc(&htab->ma);
|
2018-06-29 14:48:20 +02:00
|
|
|
if (!l_new) {
|
|
|
|
l_new = ERR_PTR(-ENOMEM);
|
|
|
|
goto dec_count;
|
|
|
|
}
|
bpf: pre-allocate hash map elements
2016-03-07 21:57:15 -08:00
|
|
|
}
|
2016-02-01 22:39:53 -08:00
|
|
|
|
|
|
|
memcpy(l_new->key, key, key_size);
|
|
|
|
if (percpu) {
|
bpf: pre-allocate hash map elements
2016-03-07 21:57:15 -08:00
|
|
|
if (prealloc) {
|
|
|
|
pptr = htab_elem_get_ptr(l_new, key_size);
|
|
|
|
} else {
|
|
|
|
/* alloc_percpu zero-fills */
|
2022-09-02 14:10:53 -07:00
|
|
|
pptr = bpf_mem_cache_alloc(&htab->pcpu_ma);
|
bpf: pre-allocate hash map elements
2016-03-07 21:57:15 -08:00
|
|
|
if (!pptr) {
|
2022-09-02 14:10:44 -07:00
|
|
|
bpf_mem_cache_free(&htab->ma, l_new);
|
2018-06-29 14:48:20 +02:00
|
|
|
l_new = ERR_PTR(-ENOMEM);
|
|
|
|
goto dec_count;
|
bpf: pre-allocate hash map elements
2016-03-07 21:57:15 -08:00
|
|
|
}
|
2022-09-02 14:10:53 -07:00
|
|
|
l_new->ptr_to_pptr = pptr;
|
|
|
|
pptr = *(void **)pptr;
|
2016-02-01 22:39:53 -08:00
|
|
|
}
|
|
|
|
|
2020-11-04 12:23:32 +01:00
|
|
|
pcpu_init_value(htab, pptr, value, onallcpus);
|
bpf: add lookup/update support for per-cpu hash and array maps
The functions bpf_map_lookup_elem(map, key, value) and
bpf_map_update_elem(map, key, value, flags) need to get/set
values from all-cpus for per-cpu hash and array maps,
so that user space can aggregate/update them as necessary.
Example of single counter aggregation in user space:
unsigned int nr_cpus = sysconf(_SC_NPROCESSORS_CONF);
long values[nr_cpus];
long value = 0;
bpf_lookup_elem(fd, key, values);
for (i = 0; i < nr_cpus; i++)
value += values[i];
The user space must provide round_up(value_size, 8) * nr_cpus
array to get/set values, since kernel will use 'long' copy
of per-cpu values to try to copy good counters atomically.
It's a best-effort, since bpf programs and user space are racing
to access the same memory.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
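The buffer sizing rule from the message above, round_up(value_size, 8) * nr_cpus, can be sketched in user space as follows. The helper names percpu_buf_size and sum_percpu_long are hypothetical, and the buffer is assumed to have already been filled by a per-CPU map lookup; no map syscall is made here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Bytes user space must provide for one per-CPU lookup or update. */
static size_t percpu_buf_size(size_t value_size, size_t nr_cpus)
{
	return ((value_size + 7) & ~(size_t)7) * nr_cpus;
}

/*
 * Sum one 'long' counter per CPU. Values sit at a stride of
 * round_up(sizeof(long), 8), which is 8 bytes for both 4-byte and
 * 8-byte 'long', so the offset is computed in bytes, not in longs.
 */
static long sum_percpu_long(const void *buf, size_t nr_cpus)
{
	size_t stride = (sizeof(long) + 7) & ~(size_t)7;
	const unsigned char *p = buf;
	long total = 0;
	size_t i;

	assert(buf != NULL);
	for (i = 0; i < nr_cpus; i++) {
		long v;

		memcpy(&v, p + i * stride, sizeof(v));
		total += v;
	}
	return total;
}
```

Indexing the buffer as a plain long array, as the short example in the message does, gives the same result only where sizeof(long) is 8; computing the stride explicitly keeps the sketch correct on 32-bit targets too.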
2016-02-01 22:39:55 -08:00
|
|
|
|
bpf: pre-allocate hash map elements
If kprobe is placed on spin_unlock then calling kmalloc/kfree from
bpf programs is not safe, since the following dead lock is possible:
kfree->spin_lock(kmem_cache_node->lock)...spin_unlock->kprobe->
bpf_prog->map_update->kmalloc->spin_lock(of the same kmem_cache_node->lock)
and deadlocks.
The following solutions were considered and some implemented, but
eventually discarded
- kmem_cache_create for every map
- add recursion check to slow-path of slub
- use reserved memory in bpf_map_update for in_irq or in preempt_disabled
- kmalloc via irq_work
At the end pre-allocation of all map elements turned out to be the simplest
solution and since the user is charged upfront for all the memory, such
pre-allocation doesn't affect the user space visible behavior.
Since it's impossible to tell whether kprobe is triggered in a safe
location from kmalloc point of view, use pre-allocation by default
and introduce new BPF_F_NO_PREALLOC flag.
While testing of per-cpu hash maps it was discovered
that alloc_percpu(GFP_ATOMIC) has odd corner cases and often
fails to allocate memory even when 90% of it is free.
The pre-allocation of per-cpu hash elements solves this problem as well.
It turned out that bpf_map_update() quickly followed by
bpf_map_lookup()+bpf_map_delete() is a very common pattern used
in many of iovisor/bcc/tools, so there is an additional benefit of
pre-allocation, since such use cases are much faster.
Since all hash map elements are now pre-allocated we can remove
atomic increment of htab->count and save few more cycles.
Also add bpf_map_precharge_memlock() to check rlimit_memlock early to avoid
large malloc/free done by users who don't have sufficient limits.
Pre-allocation is done with vmalloc and alloc/free is done
via percpu_freelist. Here are performance numbers for different
pre-allocation algorithms that were implemented, but discarded
in favor of percpu_freelist:
1 cpu:
  pcpu_ida          2.1M
  pcpu_ida nolock   2.3M
  bt                2.4M
  kmalloc           1.8M
  hlist+spinlock    2.3M
  pcpu_freelist     2.6M

4 cpu:
  pcpu_ida          1.5M
  pcpu_ida nolock   1.8M
  bt w/smp_align    1.7M
  bt no/smp_align   1.1M
  kmalloc           0.7M
  hlist+spinlock    0.2M
  pcpu_freelist     2.0M

8 cpu:
  pcpu_ida          0.7M
  bt w/smp_align    0.8M
  kmalloc           0.4M
  pcpu_freelist     1.5M

32 cpu:
  kmalloc           0.13M
  pcpu_freelist     0.49M

pcpu_ida nolock is a modified percpu_ida algorithm without
percpu_ida_cpu locks and without cross-cpu tag stealing.
It's faster than existing percpu_ida, but not as fast as pcpu_freelist.
bt is a variant of block/blk-mq-tag.c simplified and customized
for bpf use case. bt w/smp_align is using cache line for every 'long'
(similar to blk-mq-tag). bt no/smp_align allocates 'long'
bitmasks contiguously to save memory. It's comparable to percpu_ida
and in some cases faster, but slower than percpu_freelist.
hlist+spinlock is the simplest free list with single spinlock.
As expected it has very bad scaling in SMP.
kmalloc is existing implementation which is still available via
BPF_F_NO_PREALLOC flag. It's significantly slower in single cpu and
in 8 cpu setup it's 3 times slower than pre-allocation with pcpu_freelist,
but saves memory, so in cases where map->max_entries can be large
and number of map update/delete per second is low, it may make
sense to use it.
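The core idea of pre-allocation, taking elements from a fixed pool instead of calling an allocator on the update path, can be sketched as below. This is a single-threaded illustration under stated assumptions: struct pool, pool_init(), pool_alloc(), pool_free() and POOL_MAX are hypothetical names, and the sketch uses one plain free list rather than the per-cpu free lists of pcpu_freelist.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_MAX 4	/* hypothetical stand-in for map->max_entries */

struct pool_elem {
	struct pool_elem *next;
	long value;
};

struct pool {
	struct pool_elem storage[POOL_MAX];	/* allocated once, up front */
	struct pool_elem *free_head;
};

/* Thread every pre-allocated element onto the free list. */
static void pool_init(struct pool *p)
{
	size_t i;

	p->free_head = NULL;
	for (i = 0; i < POOL_MAX; i++) {
		p->storage[i].next = p->free_head;
		p->free_head = &p->storage[i];
	}
}

/* Pop a free element; NULL means all pre-allocated elements are in use. */
static struct pool_elem *pool_alloc(struct pool *p)
{
	struct pool_elem *e = p->free_head;

	if (e)
		p->free_head = e->next;
	return e;
}

/* Push an element back; no allocator is involved on either path. */
static void pool_free(struct pool *p, struct pool_elem *e)
{
	assert(e != NULL);
	e->next = p->free_head;
	p->free_head = e;
}
```

Because neither pool_alloc() nor pool_free() calls into a general-purpose allocator, the update path cannot recurse into allocator locks, which is the property the commit message is after.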
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-07 21:57:15 -08:00
		if (!prealloc)
			htab_elem_set_ptr(l_new, key_size, pptr);
2019-01-31 15:40:04 -08:00
	} else if (fd_htab_map_needs_adjust(htab)) {
		size = round_up(size, 8);
2016-02-01 22:39:53 -08:00
		memcpy(l_new->key + round_up(key_size, 8), value, size);
2019-01-31 15:40:04 -08:00
	} else {
		copy_map_value(&htab->map,
			       l_new->key + round_up(key_size, 8),
			       value);
2016-02-01 22:39:53 -08:00
	}

	l_new->hash = hash;
	return l_new;
2018-06-29 14:48:20 +02:00
dec_count:
2022-09-02 14:10:48 -07:00
	dec_elem_count(htab);
2018-06-29 14:48:20 +02:00
	return l_new;
2016-02-01 22:39:53 -08:00
}
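The expression l_new->key + round_up(key_size, 8) used above places the value right after the key, padded to an 8-byte boundary. A small user-space sketch of that layout follows; elem_store() and elem_value() are hypothetical helpers operating on a flat byte buffer, not kernel functions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Round x up to the next multiple of 8, mirroring round_up(x, 8). */
static size_t round_up8(size_t x)
{
	return (x + 7) & ~(size_t)7;
}

/* Start of the value inside a buffer laid out as [key][padding][value]. */
static unsigned char *elem_value(unsigned char *buf, size_t key_size)
{
	return buf + round_up8(key_size);
}

/* Copy key and value into buf using the padded layout. */
static void elem_store(unsigned char *buf, size_t buf_size,
		       const void *key, size_t key_size,
		       const void *value, size_t value_size)
{
	assert(round_up8(key_size) + value_size <= buf_size);
	memcpy(buf, key, key_size);
	memcpy(elem_value(buf, key_size), value, value_size);
}
```

For a 5-byte key the value starts at offset 8, so an 8-byte value stays naturally aligned regardless of the key length.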
static int check_flags(struct bpf_htab *htab, struct htab_elem *l_old,
		       u64 map_flags)
{
2019-01-31 15:40:09 -08:00
	if (l_old && (map_flags & ~BPF_F_LOCK) == BPF_NOEXIST)
2016-02-01 22:39:53 -08:00
		/* elem already exists */
		return -EEXIST;

2019-01-31 15:40:09 -08:00
	if (!l_old && (map_flags & ~BPF_F_LOCK) == BPF_EXIST)
2016-02-01 22:39:53 -08:00
		/* elem doesn't exist, cannot update it */
		return -ENOENT;

	return 0;
}
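The flag semantics of check_flags() can be exercised in user space with the sketch below. The enum values mirror those in include/uapi/linux/bpf.h; check_flags_sketch() is a hypothetical stand-in that takes a boolean instead of an element pointer.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Same numeric values as the uapi update flags. */
enum {
	BPF_ANY		= 0,	/* create new element or update existing */
	BPF_NOEXIST	= 1,	/* create new element only if it didn't exist */
	BPF_EXIST	= 2,	/* only update an existing element */
	BPF_F_LOCK	= 4,	/* spin_lock-ed update; masked out below */
};

static int check_flags_sketch(bool elem_exists, uint64_t map_flags)
{
	uint64_t mode = map_flags & ~(uint64_t)BPF_F_LOCK;

	if (elem_exists && mode == BPF_NOEXIST)
		return -EEXIST;		/* elem already exists */
	if (!elem_exists && mode == BPF_EXIST)
		return -ENOENT;		/* nothing to update */
	return 0;
}
```

Masking BPF_F_LOCK first means the lock bit never changes whether an update is treated as create-only or update-only.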
2014-11-13 17:36:45 -08:00
/* Called from syscall or from eBPF program */
bpf: return long from bpf_map_ops funcs
This patch changes the return types of bpf_map_ops functions to long, where
previously int was returned. Using long allows for bpf programs to maintain
the sign bit in the absence of sign extension during situations where
inlined bpf helper funcs make calls to the bpf_map_ops funcs and a negative
error is returned.
The definitions of the helper funcs are generated from comments in the bpf
uapi header at `include/uapi/linux/bpf.h`. The return type of these
helpers was previously changed from int to long in commit bdb7b79b4ce8. For
any case where one of the map helpers calls the bpf_map_ops funcs that are
still returning 32-bit int, a compiler might not include sign extension
instructions to properly convert the 32-bit negative value to a 64-bit
negative value.
For example:
bpf assembly excerpt of an inlined helper calling a kernel function and
checking for a specific error:
; err = bpf_map_update_elem(&mymap, &key, &val, BPF_NOEXIST);
...
46: call 0xffffffffe103291c ; htab_map_update_elem
; if (err && err != -EEXIST) {
4b: cmp $0xffffffffffffffef,%rax ; cmp -EEXIST,%rax
kernel function assembly excerpt of return value from
`htab_map_update_elem` returning 32-bit int:
movl $0xffffffef, %r9d
...
movl %r9d, %eax
...results in the comparison:
cmp $0xffffffffffffffef, $0x00000000ffffffef
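The failed comparison above can be reproduced with ordinary integer conversions. This is a sketch that models the 64-bit return register as a uint64_t; reg_after_int_return(), reg_after_long_return() and matches_error() are hypothetical names, and the zero-extended case is assumed to match what the movl sequence in the excerpt leaves in the register.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Register contents when a 32-bit int result is zero-extended. */
static uint64_t reg_after_int_return(int ret)
{
	return (uint64_t)(uint32_t)ret;
}

/* Register contents when the callee returns a 64-bit signed value. */
static uint64_t reg_after_long_return(int64_t ret)
{
	return (uint64_t)ret;
}

/* The caller's check: compare the full register against a 64-bit error. */
static int matches_error(uint64_t reg, int err)
{
	return reg == (uint64_t)(int64_t)err;
}
```

With -EEXIST (-17), the zero-extended register holds 0x00000000ffffffef and does not match, while the 64-bit return holds 0xffffffffffffffef and does.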
Fixes: bdb7b79b4ce8 ("bpf: Switch most helper return values from 32-bit int to 64-bit long")
Tested-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
Link: https://lore.kernel.org/r/20230322194754.185781-3-inwardvessel@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-22 12:47:54 -07:00
static long htab_map_update_elem(struct bpf_map *map, void *key, void *value,
				 u64 map_flags)
2014-11-13 17:36:45 -08:00
{
	struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
2016-02-01 22:39:53 -08:00
	struct htab_elem *l_new = NULL, *l_old;
2017-03-07 20:00:13 -08:00
	struct hlist_nulls_head *head;
2014-11-13 17:36:45 -08:00
	unsigned long flags;
2016-02-01 22:39:53 -08:00
	struct bucket *b;
	u32 key_size, hash;
2014-11-13 17:36:45 -08:00
	int ret;

2019-01-31 15:40:09 -08:00
	if (unlikely((map_flags & ~BPF_F_LOCK) > BPF_EXIST))
2014-11-13 17:36:45 -08:00
		/* unknown flags */
		return -EINVAL;

2021-06-24 18:05:54 +02:00
	WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_trace_held() &&
		     !rcu_read_lock_bh_held());
2014-11-13 17:36:45 -08:00

	key_size = map->key_size;

2018-08-22 23:49:37 +02:00
	hash = htab_map_hash(key, key_size, htab->hashrnd);
2016-02-01 22:39:53 -08:00

	b = __select_bucket(htab, hash);
2015-12-29 22:40:27 +08:00
	head = &b->head;
2014-11-13 17:36:45 -08:00

2019-01-31 15:40:09 -08:00
	if (unlikely(map_flags & BPF_F_LOCK)) {
2022-11-04 00:39:56 +05:30
		if (unlikely(!btf_record_has_field(map->record, BPF_SPIN_LOCK)))
2019-01-31 15:40:09 -08:00
			return -EINVAL;
		/* find an element without taking the bucket lock */
		l_old = lookup_nulls_elem_raw(head, hash, key, key_size,
					      htab->n_buckets);
		ret = check_flags(htab, l_old, map_flags);
		if (ret)
			return ret;
		if (l_old) {
			/* grab the element lock and update value in place */
			copy_map_value_locked(map,
					      l_old->key + round_up(key_size, 8),
					      value, false);
			return 0;
		}
		/* fall through, grab the bucket lock and lookup again.
		 * 99.9% chance that the element won't be found,
		 * but second lookup under lock has to be done.
		 */
	}

2020-10-29 00:19:25 -07:00
	ret = htab_lock_bucket(htab, b, hash, &flags);
	if (ret)
		return ret;
2014-11-13 17:36:45 -08:00

2016-02-01 22:39:53 -08:00
	l_old = lookup_elem_raw(head, hash, key, key_size);
2014-11-13 17:36:45 -08:00

2016-02-01 22:39:53 -08:00
	ret = check_flags(htab, l_old, map_flags);
	if (ret)
2014-11-13 17:36:45 -08:00
		goto err;

2019-01-31 15:40:09 -08:00
	if (unlikely(l_old && (map_flags & BPF_F_LOCK))) {
		/* first lookup without the bucket lock didn't find the element,
		 * but second lookup with the bucket lock found it.
		 * This case is highly unlikely, but has to be dealt with:
		 * grab the element lock in addition to the bucket lock
		 * and update element in place
		 */
		copy_map_value_locked(map,
				      l_old->key + round_up(key_size, 8),
				      value, false);
		ret = 0;
		goto err;
	}

2016-08-05 14:01:27 -07:00
	l_new = alloc_htab_elem(htab, key, value, key_size, hash, false, false,
2017-03-21 19:05:04 -07:00
				l_old);
2016-03-07 21:57:15 -08:00
	if (IS_ERR(l_new)) {
		/* all pre-allocated elements are in use or memory exhausted */
		ret = PTR_ERR(l_new);
		goto err;
	}

2016-02-01 22:39:53 -08:00
	/* add new element to the head of the list, so that
	 * concurrent search will find it before old elem
2014-11-13 17:36:45 -08:00
	 */
2017-03-07 20:00:13 -08:00
	hlist_nulls_add_head_rcu(&l_new->hash_node, head);
2014-11-13 17:36:45 -08:00
	if (l_old) {
2017-03-07 20:00:13 -08:00
		hlist_nulls_del_rcu(&l_old->hash_node);
2017-03-21 19:05:04 -07:00
		if (!htab_is_prealloc(htab))
			free_htab_elem(htab, l_old);
2021-07-14 17:54:10 -07:00
		else
bpf: Wire up freeing of referenced kptr
A destructor kfunc can be defined as void func(type *), where type may
be void or any other pointer type as per convenience.
In this patch, we ensure that the type is sane and capture the function
pointer into off_desc of ptr_off_tab for the specific pointer offset,
with the invariant that the dtor pointer is always set when 'kptr_ref'
tag is applied to the pointer's pointee type, which is indicated by the
flag BPF_MAP_VALUE_OFF_F_REF.
Note that only BTF IDs whose destructor kfunc is registered become
the allowed BTF IDs for embedding as referenced kptr. Hence it serves
the purpose of finding dtor kfunc BTF ID, as well as acting as a check
against the whitelist of allowed BTF IDs for this purpose.
Finally, wire up the actual freeing of the referenced pointer if any at
all available offsets, so that no references are leaked after the BPF
map goes away and the BPF program previously moved the ownership of a
referenced pointer into it.
The behavior is similar to BPF timers, where bpf_map_{update,delete}_elem
will free any existing referenced kptr. The same case is with LRU map's
bpf_lru_push_free/htab_lru_push_free functions, which are extended to
reset unreferenced and free referenced kptr.
Note that unlike BPF timers, kptr is not reset or freed when map uref
drops to zero.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220424214901.2743946-8-memxor@gmail.com
2022-04-25 03:18:55 +05:30
			check_and_free_fields(htab, l_old);
2014-11-13 17:36:45 -08:00
	}
2016-03-07 21:57:15 -08:00
	ret = 0;
2014-11-13 17:36:45 -08:00
err:
2020-10-29 00:19:25 -07:00
	htab_unlock_bucket(htab, b, hash, flags);
2014-11-13 17:36:45 -08:00
	return ret;
}
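The ordering used in htab_map_update_elem(), linking the new element at the head of the bucket before unlinking the old one, can be illustrated with a plain singly linked list. This is a single-threaded sketch with hypothetical names (struct node, add_head(), lookup(), unlink_node()); it has no RCU or locking and only shows why a search that stops at the first match sees the new element.

```c
#include <assert.h>
#include <stddef.h>

struct node {
	struct node *next;
	int key;
	long value;
};

/* Link n in front of the current list head. */
static void add_head(struct node **head, struct node *n)
{
	n->next = *head;
	*head = n;
}

/* Return the first node with a matching key, or NULL. */
static struct node *lookup(struct node *head, int key)
{
	for (; head; head = head->next)
		if (head->key == key)
			return head;
	return NULL;
}

/* Remove victim from the list if it is present. */
static void unlink_node(struct node **head, struct node *victim)
{
	struct node **pp;

	for (pp = head; *pp; pp = &(*pp)->next) {
		if (*pp == victim) {
			*pp = victim->next;
			return;
		}
	}
}
```

Between add_head() and unlink_node() both elements are reachable, but lookup() already returns the new one, so a reader never observes the key as missing.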
2021-07-14 17:54:10 -07:00
static void htab_lru_push_free(struct bpf_htab *htab, struct htab_elem *elem)
{
2022-04-25 03:18:55 +05:30
	check_and_free_fields(htab, elem);
2021-07-14 17:54:10 -07:00
	bpf_lru_push_free(&htab->lru, &elem->lru_node);
}
2023-03-22 12:47:54 -07:00
static long htab_lru_map_update_elem(struct bpf_map *map, void *key, void *value,
				     u64 map_flags)
2016-11-11 10:55:09 -08:00
{
	struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
	struct htab_elem *l_new, *l_old = NULL;
2017-03-07 20:00:13 -08:00
	struct hlist_nulls_head *head;
2016-11-11 10:55:09 -08:00
	unsigned long flags;
	struct bucket *b;
	u32 key_size, hash;
	int ret;

	if (unlikely(map_flags > BPF_EXIST))
		/* unknown flags */
		return -EINVAL;

2021-06-24 18:05:54 +02:00
	WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_trace_held() &&
		     !rcu_read_lock_bh_held());
2016-11-11 10:55:09 -08:00

	key_size = map->key_size;

2018-08-22 23:49:37 +02:00
	hash = htab_map_hash(key, key_size, htab->hashrnd);
2016-11-11 10:55:09 -08:00

	b = __select_bucket(htab, hash);
	head = &b->head;

	/* For LRU, we need to alloc before taking bucket's
	 * spinlock because getting free nodes from LRU may need
	 * to remove older elements from htab and this removal
	 * operation will need a bucket lock.
	 */
	l_new = prealloc_lru_pop(htab, key, hash);
	if (!l_new)
		return -ENOMEM;
2021-07-14 17:54:10 -07:00
	copy_map_value(&htab->map,
		       l_new->key + round_up(map->key_size, 8), value);
2016-11-11 10:55:09 -08:00

2020-10-29 00:19:25 -07:00
	ret = htab_lock_bucket(htab, b, hash, &flags);
	if (ret)
		return ret;
2016-11-11 10:55:09 -08:00

	l_old = lookup_elem_raw(head, hash, key, key_size);

	ret = check_flags(htab, l_old, map_flags);
	if (ret)
		goto err;

	/* add new element to the head of the list, so that
	 * concurrent search will find it before old elem
	 */
2017-03-07 20:00:13 -08:00
	hlist_nulls_add_head_rcu(&l_new->hash_node, head);
2016-11-11 10:55:09 -08:00
	if (l_old) {
		bpf_lru_node_set_ref(&l_new->lru_node);
2017-03-07 20:00:13 -08:00
		hlist_nulls_del_rcu(&l_old->hash_node);
2016-11-11 10:55:09 -08:00
	}
	ret = 0;

err:
2020-10-29 00:19:25 -07:00
	htab_unlock_bucket(htab, b, hash, flags);
2016-11-11 10:55:09 -08:00

	if (ret)
2021-07-14 17:54:10 -07:00
		htab_lru_push_free(htab, l_new);
2016-11-11 10:55:09 -08:00
	else if (l_old)
2021-07-14 17:54:10 -07:00
		htab_lru_push_free(htab, l_old);
2016-11-11 10:55:09 -08:00

	return ret;
}
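The cleanup at the end of htab_lru_map_update_elem() decides, after the bucket lock is dropped, which element goes back to the LRU free list. That decision can be isolated as a small pure function; struct elem and elem_to_recycle() are hypothetical names used only for this sketch.

```c
#include <assert.h>
#include <stddef.h>

struct elem {
	int id;
};

/*
 * Pick the element to hand back to the free list once the update
 * has finished, or NULL when nothing needs recycling.
 */
static struct elem *elem_to_recycle(int ret, struct elem *l_new,
				    struct elem *l_old)
{
	assert(l_new != NULL);
	if (ret)
		return l_new;	/* update failed: new element was never linked */
	return l_old;		/* replaced element, NULL if the key was new */
}
```

Exactly one of the two elements survives in the bucket in every outcome: on failure the old state is kept, on success the new element replaces the old one.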
2023-03-22 12:47:54 -07:00
static long __htab_percpu_map_update_elem(struct bpf_map *map, void *key,
					  void *value, u64 map_flags,
					  bool onallcpus)
2016-02-01 22:39:53 -08:00
{
	struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
	struct htab_elem *l_new = NULL, *l_old;
2017-03-07 20:00:13 -08:00
	struct hlist_nulls_head *head;
2016-02-01 22:39:53 -08:00
	unsigned long flags;
	struct bucket *b;
	u32 key_size, hash;
	int ret;

	if (unlikely(map_flags > BPF_EXIST))
		/* unknown flags */
		return -EINVAL;

2021-06-24 18:05:54 +02:00
	WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_trace_held() &&
		     !rcu_read_lock_bh_held());
2016-02-01 22:39:53 -08:00

	key_size = map->key_size;

2018-08-22 23:49:37 +02:00
	hash = htab_map_hash(key, key_size, htab->hashrnd);
2016-02-01 22:39:53 -08:00

	b = __select_bucket(htab, hash);
	head = &b->head;

2020-10-29 00:19:25 -07:00
	ret = htab_lock_bucket(htab, b, hash, &flags);
	if (ret)
		return ret;
2016-02-01 22:39:53 -08:00

	l_old = lookup_elem_raw(head, hash, key, key_size);

	ret = check_flags(htab, l_old, map_flags);
	if (ret)
		goto err;

	if (l_old) {
		/* per-cpu hash map can update value in-place */
2016-11-11 10:55:08 -08:00
		pcpu_copy_value(htab, htab_elem_get_ptr(l_old, key_size),
				value, onallcpus);
2016-02-01 22:39:53 -08:00
	} else {
		l_new = alloc_htab_elem(htab, key, value, key_size,
2017-03-21 19:05:04 -07:00
					hash, true, onallcpus, NULL);
2016-03-07 21:57:15 -08:00
		if (IS_ERR(l_new)) {
			ret = PTR_ERR(l_new);
2016-02-01 22:39:53 -08:00
			goto err;
		}
2017-03-07 20:00:13 -08:00
		hlist_nulls_add_head_rcu(&l_new->hash_node, head);
2016-02-01 22:39:53 -08:00
	}
	ret = 0;
err:
2020-10-29 00:19:25 -07:00
	htab_unlock_bucket(htab, b, hash, flags);
2016-02-01 22:39:53 -08:00
	return ret;
}
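The in-place per-cpu update depends on the onallcpus argument. The sketch below models one plausible reading of it in user space: with onallcpus set, the source buffer is assumed to hold one slot per CPU spaced round_up(value_size, 8) bytes apart; otherwise only the current CPU's copy is written. struct percpu_value, pcpu_copy_sketch(), NR_CPUS_SKETCH and SLOT_MAX are hypothetical names, and the kernel's pcpu_copy_value() is not reproduced here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NR_CPUS_SKETCH	4	/* hypothetical CPU count */
#define SLOT_MAX	16	/* hypothetical per-cpu slot capacity */

struct percpu_value {
	unsigned char slot[NR_CPUS_SKETCH][SLOT_MAX];
};

/* Round x up to the next multiple of 8, mirroring round_up(x, 8). */
static size_t round_up8(size_t x)
{
	return (x + 7) & ~(size_t)7;
}

static void pcpu_copy_sketch(struct percpu_value *dst, unsigned int this_cpu,
			     const void *value, size_t value_size,
			     int onallcpus)
{
	size_t stride = round_up8(value_size);
	unsigned int cpu;

	assert(stride <= SLOT_MAX && this_cpu < NR_CPUS_SKETCH);
	if (!onallcpus) {
		/* program context: touch only the current CPU's copy */
		memcpy(dst->slot[this_cpu], value, value_size);
		return;
	}
	/* syscall context: one padded source slot per CPU */
	for (cpu = 0; cpu < NR_CPUS_SKETCH; cpu++)
		memcpy(dst->slot[cpu],
		       (const unsigned char *)value + cpu * stride,
		       value_size);
}
```

The stride matches the user-space buffer size of round_up(value_size, 8) * nr_cpus, so each CPU's slot can be located by a simple multiplication.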
2023-03-22 12:47:54 -07:00
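To make the comparison in the commit message above concrete, here is a minimal user-space sketch in C. It is not kernel code: all function names and the DEMO_EEXIST macro are hypothetical, and it only models what a caller sees when a 32-bit negative return value is read as a 64-bit value with and without sign extension, assuming EEXIST is 17 as on Linux.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_EEXIST 17 /* assumed value of EEXIST on Linux; illustrative only */

/* Hypothetical stand-in for a map op that still returns a 32-bit int. */
static int32_t op_returning_int(void)
{
	return -DEMO_EEXIST;
}

/*
 * Models a caller that reads the full 64-bit return register while the
 * callee only wrote the low 32 bits (upper bits are zero, as after
 * "movl %r9d, %eax").
 */
static int64_t read_zero_extended(int32_t ret)
{
	return (int64_t)(uint32_t)ret;
}

/* Models the fixed case: the callee returns long, already sign-extended. */
static int64_t read_sign_extended(int32_t ret)
{
	return (int64_t)ret;
}

/* Returns 1 if a check of the form "err == -EEXIST" would succeed. */
static int matches_eexist(int64_t err)
{
	return err == -(int64_t)DEMO_EEXIST;
}

/* Small self-check showing both outcomes. */
static void demo(void)
{
	assert(!matches_eexist(read_zero_extended(op_returning_int())));
	assert(matches_eexist(read_sign_extended(op_returning_int())));
}
```

Under these assumptions the zero-extended value is 0x00000000ffffffef, which no longer compares equal to -EEXIST, while the sign-extended value does.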
|
|
|
static long __htab_lru_percpu_map_update_elem(struct bpf_map *map, void *key,
|
|
|
|
void *value, u64 map_flags,
|
|
|
|
bool onallcpus)
|
2016-11-11 10:55:10 -08:00
|
|
|
{
|
|
|
|
struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
|
|
|
|
struct htab_elem *l_new = NULL, *l_old;
|
2017-03-07 20:00:13 -08:00
|
|
|
struct hlist_nulls_head *head;
|
2016-11-11 10:55:10 -08:00
|
|
|
unsigned long flags;
|
|
|
|
struct bucket *b;
|
|
|
|
u32 key_size, hash;
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
if (unlikely(map_flags > BPF_EXIST))
|
|
|
|
/* unknown flags */
|
|
|
|
return -EINVAL;
|
|
|
|
|
2021-06-24 18:05:54 +02:00
|
|
|
WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_trace_held() &&
|
|
|
|
!rcu_read_lock_bh_held());
|
2016-11-11 10:55:10 -08:00
|
|
|
|
|
|
|
key_size = map->key_size;
|
|
|
|
|
2018-08-22 23:49:37 +02:00
|
|
|
hash = htab_map_hash(key, key_size, htab->hashrnd);
|
2016-11-11 10:55:10 -08:00
|
|
|
|
|
|
|
b = __select_bucket(htab, hash);
|
|
|
|
head = &b->head;
|
|
|
|
|
|
|
|
/* For LRU, we need to alloc before taking bucket's
|
|
|
|
* spinlock because LRU's elem alloc may need
|
|
|
|
* to remove older elem from htab and this removal
|
|
|
|
* operation will need a bucket lock.
|
|
|
|
*/
|
|
|
|
if (map_flags != BPF_EXIST) {
|
|
|
|
l_new = prealloc_lru_pop(htab, key, hash);
|
|
|
|
if (!l_new)
|
|
|
|
return -ENOMEM;
|
|
|
|
}
|
|
|
|
|
2020-10-29 00:19:25 -07:00
|
|
|
ret = htab_lock_bucket(htab, b, hash, &flags);
|
|
|
|
if (ret)
|
|
|
|
return ret;
|
2016-11-11 10:55:10 -08:00
|
|
|
|
|
|
|
l_old = lookup_elem_raw(head, hash, key, key_size);
|
|
|
|
|
|
|
|
ret = check_flags(htab, l_old, map_flags);
|
|
|
|
if (ret)
|
|
|
|
goto err;
|
|
|
|
|
|
|
|
if (l_old) {
|
|
|
|
bpf_lru_node_set_ref(&l_old->lru_node);
|
|
|
|
|
|
|
|
/* per-cpu hash map can update value in-place */
|
|
|
|
pcpu_copy_value(htab, htab_elem_get_ptr(l_old, key_size),
|
|
|
|
value, onallcpus);
|
|
|
|
} else {
|
2020-11-04 12:23:32 +01:00
|
|
|
pcpu_init_value(htab, htab_elem_get_ptr(l_new, key_size),
|
2016-11-11 10:55:10 -08:00
|
|
|
value, onallcpus);
|
2017-03-07 20:00:13 -08:00
|
|
|
hlist_nulls_add_head_rcu(&l_new->hash_node, head);
|
2016-11-11 10:55:10 -08:00
|
|
|
l_new = NULL;
|
|
|
|
}
|
|
|
|
ret = 0;
|
|
|
|
err:
|
2020-10-29 00:19:25 -07:00
|
|
|
htab_unlock_bucket(htab, b, hash, flags);
|
2016-11-11 10:55:10 -08:00
|
|
|
if (l_new)
|
|
|
|
bpf_lru_push_free(&htab->lru, &l_new->lru_node);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
bpf: return long from bpf_map_ops funcs
2023-03-22 12:47:54 -07:00
|
|
|
static long htab_percpu_map_update_elem(struct bpf_map *map, void *key,
|
|
|
|
void *value, u64 map_flags)
|
bpf: add lookup/update support for per-cpu hash and array maps
The functions bpf_map_lookup_elem(map, key, value) and
bpf_map_update_elem(map, key, value, flags) need to get/set
values from all-cpus for per-cpu hash and array maps,
so that user space can aggregate/update them as necessary.
Example of single counter aggregation in user space:
unsigned int nr_cpus = sysconf(_SC_NPROCESSORS_CONF);
long values[nr_cpus];
long value = 0;
bpf_lookup_elem(fd, key, values);
for (i = 0; i < nr_cpus; i++)
value += values[i];
User space must provide an array of round_up(value_size, 8) * nr_cpus
bytes to get/set values, since the kernel copies per-cpu values in
'long'-sized units to try to copy good counters atomically.
This is best-effort, since bpf programs and user space are racing
to access the same memory.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-01 22:39:55 -08:00
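The buffer sizing rule and the aggregation loop from the commit message above can be sketched in plain C as follows. This is only an illustration under stated assumptions: the helper names are hypothetical, no BPF syscall is made, and each per-CPU slot is assumed to hold a single 64-bit counter.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Round x up to the next multiple of 8, mirroring round_up(value_size, 8). */
static size_t round_up8(size_t x)
{
	return (x + 7) & ~(size_t)7;
}

/* Bytes user space must provide for one per-CPU lookup or update. */
static size_t percpu_buf_size(size_t value_size, unsigned int nr_cpus)
{
	return round_up8(value_size) * nr_cpus;
}

/* Sum one counter across all per-CPU slots (one 64-bit value per CPU). */
static int64_t sum_percpu(const int64_t *values, unsigned int nr_cpus)
{
	int64_t total = 0;

	assert(values != NULL);
	for (unsigned int i = 0; i < nr_cpus; i++)
		total += values[i];
	return total;
}
```

For example, a 4-byte value on an 8-CPU system needs a 64-byte buffer under this rule, because every per-CPU slot is padded to 8 bytes.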
|
|
|
{
|
|
|
|
return __htab_percpu_map_update_elem(map, key, value, map_flags, false);
|
|
|
|
}
|
|
|
|
|
bpf: return long from bpf_map_ops funcs
2023-03-22 12:47:54 -07:00
|
|
|
static long htab_lru_percpu_map_update_elem(struct bpf_map *map, void *key,
|
|
|
|
void *value, u64 map_flags)
|
2016-11-11 10:55:10 -08:00
|
|
|
{
|
|
|
|
return __htab_lru_percpu_map_update_elem(map, key, value, map_flags,
|
|
|
|
false);
|
|
|
|
}
|
|
|
|
|
2014-11-13 17:36:45 -08:00
|
|
|
/* Called from syscall or from eBPF program */
|
bpf: return long from bpf_map_ops funcs
2023-03-22 12:47:54 -07:00
|
|
|
static long htab_map_delete_elem(struct bpf_map *map, void *key)
|
2014-11-13 17:36:45 -08:00
|
|
|
{
|
|
|
|
struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
|
2017-03-07 20:00:13 -08:00
|
|
|
struct hlist_nulls_head *head;
|
2015-12-29 22:40:27 +08:00
|
|
|
struct bucket *b;
|
2014-11-13 17:36:45 -08:00
|
|
|
struct htab_elem *l;
|
|
|
|
unsigned long flags;
|
|
|
|
u32 hash, key_size;
|
2020-10-29 00:19:25 -07:00
|
|
|
int ret;
|
2014-11-13 17:36:45 -08:00
|
|
|
|
2021-06-24 18:05:54 +02:00
|
|
|
WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_trace_held() &&
|
|
|
|
!rcu_read_lock_bh_held());
|
2014-11-13 17:36:45 -08:00
|
|
|
|
|
|
|
key_size = map->key_size;
|
|
|
|
|
2018-08-22 23:49:37 +02:00
|
|
|
hash = htab_map_hash(key, key_size, htab->hashrnd);
|
2015-12-29 22:40:27 +08:00
|
|
|
b = __select_bucket(htab, hash);
|
|
|
|
head = &b->head;
|
2014-11-13 17:36:45 -08:00
|
|
|
|
2020-10-29 00:19:25 -07:00
|
|
|
ret = htab_lock_bucket(htab, b, hash, &flags);
|
|
|
|
if (ret)
|
|
|
|
return ret;
|
2014-11-13 17:36:45 -08:00
|
|
|
|
|
|
|
l = lookup_elem_raw(head, hash, key, key_size);
|
|
|
|
|
|
|
|
if (l) {
|
2017-03-07 20:00:13 -08:00
|
|
|
hlist_nulls_del_rcu(&l->hash_node);
|
bpf: pre-allocate hash map elements
If kprobe is placed on spin_unlock then calling kmalloc/kfree from
bpf programs is not safe, since the following deadlock is possible:
kfree->spin_lock(kmem_cache_node->lock)...spin_unlock->kprobe->
bpf_prog->map_update->kmalloc->spin_lock(of the same kmem_cache_node->lock)
and deadlocks.
The following solutions were considered and some implemented, but
eventually discarded
- kmem_cache_create for every map
- add recursion check to slow-path of slub
- use reserved memory in bpf_map_update for in_irq or in preempt_disabled
- kmalloc via irq_work
In the end, pre-allocation of all map elements turned out to be the simplest
solution and since the user is charged upfront for all the memory, such
pre-allocation doesn't affect the user space visible behavior.
Since it's impossible to tell whether kprobe is triggered in a safe
location from kmalloc point of view, use pre-allocation by default
and introduce new BPF_F_NO_PREALLOC flag.
While testing of per-cpu hash maps it was discovered
that alloc_percpu(GFP_ATOMIC) has odd corner cases and often
fails to allocate memory even when 90% of it is free.
The pre-allocation of per-cpu hash elements solves this problem as well.
It turned out that bpf_map_update() quickly followed by
bpf_map_lookup()+bpf_map_delete() is a very common pattern used
in many of iovisor/bcc/tools, so there is an additional benefit of
pre-allocation, since such use cases are much faster.
Since all hash map elements are now pre-allocated we can remove
the atomic increment of htab->count and save a few more cycles.
Also add bpf_map_precharge_memlock() to check rlimit_memlock early to avoid
large malloc/free done by users who don't have sufficient limits.
Pre-allocation is done with vmalloc and alloc/free is done
via percpu_freelist. Here are performance numbers for different
pre-allocation algorithms that were implemented, but discarded
in favor of percpu_freelist:
1 cpu:
pcpu_ida 2.1M
pcpu_ida nolock 2.3M
bt 2.4M
kmalloc 1.8M
hlist+spinlock 2.3M
pcpu_freelist 2.6M
4 cpu:
pcpu_ida 1.5M
pcpu_ida nolock 1.8M
bt w/smp_align 1.7M
bt no/smp_align 1.1M
kmalloc 0.7M
hlist+spinlock 0.2M
pcpu_freelist 2.0M
8 cpu:
pcpu_ida 0.7M
bt w/smp_align 0.8M
kmalloc 0.4M
pcpu_freelist 1.5M
32 cpu:
kmalloc 0.13M
pcpu_freelist 0.49M
pcpu_ida nolock is a modified percpu_ida algorithm without
percpu_ida_cpu locks and without cross-cpu tag stealing.
It's faster than existing percpu_ida, but not as fast as pcpu_freelist.
bt is a variant of block/blk-mq-tag.c simplified and customized
for the bpf use case. bt w/smp_align uses a cache line for every 'long'
(similar to blk-mq-tag). bt no/smp_align allocates 'long'
bitmasks contiguously to save memory. It's comparable to percpu_ida
and in some cases faster, but slower than percpu_freelist.
hlist+spinlock is the simplest free list with a single spinlock.
As expected, it scales very badly on SMP.
kmalloc is the existing implementation, which is still available via the
BPF_F_NO_PREALLOC flag. It's significantly slower in single cpu and
in 8 cpu setup it's 3 times slower than pre-allocation with pcpu_freelist,
but saves memory, so in cases where map->max_entries can be large
and number of map update/delete per second is low, it may make
sense to use it.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-07 21:57:15 -08:00
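The core idea of the commit message above, taking elements from a pre-allocated pool instead of calling the allocator on every update, can be sketched as a simple free list. This is a single-threaded illustration with hypothetical names; it is not the kernel's percpu_freelist, which keeps per-CPU lists and takes locks.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical element type; the kernel's htab_elem is more involved. */
struct elem {
	struct elem *next;
	int key;
	int value;
};

struct freelist {
	struct elem *head;
};

/* Thread every element of a caller-provided pool onto the free list. */
static void freelist_init(struct freelist *fl, struct elem *pool, size_t n)
{
	fl->head = NULL;
	for (size_t i = 0; i < n; i++) {
		pool[i].next = fl->head;
		fl->head = &pool[i];
	}
}

/* Take one element for an update; returns NULL when the pool is exhausted. */
static struct elem *freelist_pop(struct freelist *fl)
{
	struct elem *e = fl->head;

	if (e)
		fl->head = e->next;
	return e;
}

/* Return an element to the pool on delete; no allocator call is involved. */
static void freelist_push(struct freelist *fl, struct elem *e)
{
	assert(e != NULL);
	e->next = fl->head;
	fl->head = e;
}
```

Because the pool is sized once up front, an update can only fail by running out of pre-allocated elements, never by recursing into the memory allocator from an unsafe context.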
|
|
|
free_htab_elem(htab, l);
|
2020-10-29 00:19:25 -07:00
|
|
|
} else {
|
|
|
|
ret = -ENOENT;
|
2014-11-13 17:36:45 -08:00
|
|
|
}
|
|
|
|
|
2020-10-29 00:19:25 -07:00
|
|
|
htab_unlock_bucket(htab, b, hash, flags);
|
2014-11-13 17:36:45 -08:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
bpf: return long from bpf_map_ops funcs
2023-03-22 12:47:54 -07:00
|
|
|
static long htab_lru_map_delete_elem(struct bpf_map *map, void *key)
|
2016-11-11 10:55:09 -08:00
|
|
|
{
|
|
|
|
struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
|
2017-03-07 20:00:13 -08:00
|
|
|
struct hlist_nulls_head *head;
|
2016-11-11 10:55:09 -08:00
|
|
|
struct bucket *b;
|
|
|
|
struct htab_elem *l;
|
|
|
|
unsigned long flags;
|
|
|
|
u32 hash, key_size;
|
2020-10-29 00:19:25 -07:00
|
|
|
int ret;
|
2016-11-11 10:55:09 -08:00
|
|
|
|
2021-06-24 18:05:54 +02:00
|
|
|
WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_trace_held() &&
|
|
|
|
!rcu_read_lock_bh_held());
|
2016-11-11 10:55:09 -08:00
|
|
|
|
|
|
|
key_size = map->key_size;
|
|
|
|
|
2018-08-22 23:49:37 +02:00
|
|
|
hash = htab_map_hash(key, key_size, htab->hashrnd);
|
2016-11-11 10:55:09 -08:00
|
|
|
b = __select_bucket(htab, hash);
|
|
|
|
head = &b->head;
|
|
|
|
|
2020-10-29 00:19:25 -07:00
|
|
|
ret = htab_lock_bucket(htab, b, hash, &flags);
|
|
|
|
if (ret)
|
|
|
|
return ret;
|
2016-11-11 10:55:09 -08:00
|
|
|
|
|
|
|
l = lookup_elem_raw(head, hash, key, key_size);
|
|
|
|
|
2020-10-29 00:19:25 -07:00
|
|
|
if (l)
|
2017-03-07 20:00:13 -08:00
|
|
|
hlist_nulls_del_rcu(&l->hash_node);
|
2020-10-29 00:19:25 -07:00
|
|
|
else
|
|
|
|
ret = -ENOENT;
|
2016-11-11 10:55:09 -08:00
|
|
|
|
2020-10-29 00:19:25 -07:00
|
|
|
htab_unlock_bucket(htab, b, hash, flags);
|
2016-11-11 10:55:09 -08:00
|
|
|
if (l)
|
2021-07-14 17:54:10 -07:00
|
|
|
htab_lru_push_free(htab, l);
|
2016-11-11 10:55:09 -08:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2014-11-13 17:36:45 -08:00
|
|
|
static void delete_all_elements(struct bpf_htab *htab)
|
|
|
|
{
|
|
|
|
int i;
|
|
|
|
|
2022-09-02 14:10:44 -07:00
|
|
|
/* It's called from a worker thread, so disable migration here,
|
|
|
|
* since bpf_mem_cache_free() relies on that.
|
|
|
|
*/
|
|
|
|
migrate_disable();
|
2014-11-13 17:36:45 -08:00
|
|
|
for (i = 0; i < htab->n_buckets; i++) {
|
2017-03-07 20:00:13 -08:00
|
|
|
struct hlist_nulls_head *head = select_bucket(htab, i);
|
|
|
|
struct hlist_nulls_node *n;
|
2014-11-13 17:36:45 -08:00
|
|
|
struct htab_elem *l;
|
|
|
|
|
2017-03-07 20:00:13 -08:00
|
|
|
hlist_nulls_for_each_entry_safe(l, n, head, hash_node) {
|
|
|
|
hlist_nulls_del_rcu(&l->hash_node);
|
2017-03-21 19:05:04 -07:00
|
|
|
htab_elem_free(htab, l);
|
2014-11-13 17:36:45 -08:00
|
|
|
}
|
|
|
|
}
|
2022-09-02 14:10:44 -07:00
|
|
|
migrate_enable();
|
2014-11-13 17:36:45 -08:00
|
|
|
}
|
2017-03-22 10:00:34 -07:00
|
|
|
|
2021-07-14 17:54:10 -07:00
|
|
|
static void htab_free_malloced_timers(struct bpf_htab *htab)
|
|
|
|
{
|
|
|
|
int i;
|
|
|
|
|
|
|
|
rcu_read_lock();
|
|
|
|
for (i = 0; i < htab->n_buckets; i++) {
|
|
|
|
struct hlist_nulls_head *head = select_bucket(htab, i);
|
|
|
|
struct hlist_nulls_node *n;
|
|
|
|
struct htab_elem *l;
|
|
|
|
|
bpf: Wire up freeing of referenced kptr
A destructor kfunc can be defined as void func(type *), where type may
be void or any other pointer type as per convenience.
In this patch, we ensure that the type is sane and capture the function
pointer into off_desc of ptr_off_tab for the specific pointer offset,
with the invariant that the dtor pointer is always set when 'kptr_ref'
tag is applied to the pointer's pointee type, which is indicated by the
flag BPF_MAP_VALUE_OFF_F_REF.
Note that only BTF IDs whose destructor kfunc is registered become
the allowed BTF IDs for embedding as referenced kptr. Hence it serves
the purpose of finding the dtor kfunc BTF ID, as well as acting as a check
against the whitelist of allowed BTF IDs for this purpose.
Finally, wire up the actual freeing of the referenced pointer if any at
all available offsets, so that no references are leaked after the BPF
map goes away and the BPF program previously moved the ownership of a
referenced pointer into it.
The behavior is similar to BPF timers, where bpf_map_{update,delete}_elem
will free any existing referenced kptr. The same case is with LRU map's
bpf_lru_push_free/htab_lru_push_free functions, which are extended to
reset unreferenced and free referenced kptr.
Note that unlike BPF timers, kptr is not reset or freed when map uref
drops to zero.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220424214901.2743946-8-memxor@gmail.com
2022-04-25 03:18:55 +05:30
|
|
|
hlist_nulls_for_each_entry(l, n, head, hash_node) {
|
2022-11-04 00:39:56 +05:30
|
|
|
/* We only free timer on uref dropping to zero */
|
|
|
|
bpf_obj_free_timer(htab->map.record, l->key + round_up(htab->map.key_size, 8));
|
bpf: Wire up freeing of referenced kptr
2022-04-25 03:18:55 +05:30
|
|
|
}
|
2021-07-14 17:54:10 -07:00
|
|
|
cond_resched_rcu();
|
|
|
|
}
|
|
|
|
rcu_read_unlock();
|
|
|
|
}
|
|
|
|
|
|
|
|
static void htab_map_free_timers(struct bpf_map *map)
|
|
|
|
{
|
|
|
|
struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
|
|
|
|
|
2022-11-04 00:39:56 +05:30
|
|
|
/* We only free timer on uref dropping to zero */
|
|
|
|
if (!btf_record_has_field(htab->map.record, BPF_TIMER))
|
2021-07-14 17:54:10 -07:00
|
|
|
return;
|
|
|
|
if (!htab_is_prealloc(htab))
|
|
|
|
htab_free_malloced_timers(htab);
|
|
|
|
else
|
|
|
|
htab_free_prealloced_timers(htab);
|
|
|
|
}
|
|
|
|
|
2014-11-13 17:36:45 -08:00
|
|
|
/* Called when map->refcnt goes to zero, either from workqueue or from syscall */
|
|
|
|
static void htab_map_free(struct bpf_map *map)
|
|
|
|
{
|
|
|
|
struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
|
2020-10-29 00:19:25 -07:00
|
|
|
int i;
|
2014-11-13 17:36:45 -08:00
|
|
|
|
2020-06-29 21:33:39 -07:00
|
|
|
/* bpf_free_used_maps() or close(map_fd) will trigger this map_free callback.
|
|
|
|
* bpf_free_used_maps() is called after bpf prog is no longer executing.
|
|
|
|
* There is no need to synchronize_rcu() here to protect map elements.
|
2014-11-13 17:36:45 -08:00
|
|
|
*/
|
|
|
|
|
2022-09-02 14:10:58 -07:00
|
|
|
/* htab no longer uses call_rcu() directly. bpf_mem_alloc does it
|
|
|
|
* underneath and is responsible for waiting for callbacks to finish
|
|
|
|
* during bpf_mem_alloc_destroy().
|
2014-11-13 17:36:45 -08:00
|
|
|
*/
|
bpf: Wire up freeing of referenced kptr
2022-04-25 03:18:55 +05:30
|
|
|
if (!htab_is_prealloc(htab)) {
|
bpf: pre-allocate hash map elements
2016-03-07 21:57:15 -08:00
|
|
|
delete_all_elements(htab);
|
bpf: Wire up freeing of referenced kptr
2022-04-25 03:18:55 +05:30
|
|
|
} else {
|
bpf: Refactor kptr_off_tab into btf_record
To prepare the BPF verifier to handle special fields in both map values
and program allocated types coming from program BTF, we need to refactor
the kptr_off_tab handling code into something more generic and reusable
across both cases to avoid code duplication.
Later patches also require passing this data to helpers at runtime, so
that they can work on user defined types, initialize them, destruct
them, etc.
The main observation is that both map values and such allocated types
point to a type in program BTF, hence they can be handled similarly. We
can prepare a field metadata table for both cases and store them in
struct bpf_map or struct btf depending on the use case.
Hence, refactor the code into generic btf_record and btf_field member
structs. The btf_record represents the fields of a specific btf_type in
user BTF. The cnt indicates the number of special fields we successfully
recognized, and field_mask is a bitmask of fields that were found, to
enable quick determination of availability of a certain field.
Subsequently, refactor the rest of the code to work with these generic
types, remove assumptions about kptr and kptr_off_tab, rename variables
to more meaningful names, etc.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221103191013.1236066-7-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-04 00:39:55 +05:30
|
|
|
htab_free_prealloced_fields(htab);
|
2016-11-11 10:55:09 -08:00
|
|
|
prealloc_destroy(htab);
|
bpf: Wire up freeing of referenced kptr
A destructor kfunc can be defined as void func(type *), where type may
be void or any other pointer type as per convenience.
In this patch, we ensure that the type is sane and capture the function
pointer into off_desc of ptr_off_tab for the specific pointer offset,
with the invariant that the dtor pointer is always set when 'kptr_ref'
tag is applied to the pointer's pointee type, which is indicated by the
flag BPF_MAP_VALUE_OFF_F_REF.
Note that only BTF IDs whose destructor kfunc is registered become
the allowed BTF IDs for embedding as referenced kptr. Hence it serves
the purpose of finding the dtor kfunc BTF ID, as well as acting as a check
against the whitelist of allowed BTF IDs for this purpose.
Finally, wire up the actual freeing of the referenced pointer if any at
all available offsets, so that no references are leaked after the BPF
map goes away and the BPF program previously moved the ownership of a
referenced pointer into it.
The behavior is similar to BPF timers, where bpf_map_{update,delete}_elem
will free any existing referenced kptr. The same case is with LRU map's
bpf_lru_push_free/htab_lru_push_free functions, which are extended to
reset unreferenced and free referenced kptr.
Note that unlike BPF timers, kptr is not reset or freed when map uref
drops to zero.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220424214901.2743946-8-memxor@gmail.com
2022-04-25 03:18:55 +05:30
|
|
|
}
|
2016-11-11 10:55:09 -08:00
|
|
|
|
2016-08-05 14:01:27 -07:00
|
|
|
free_percpu(htab->extra_elems);
|
bpf: don't trigger OOM killer under pressure with map alloc
This patch adds two helpers, bpf_map_area_alloc() and bpf_map_area_free(),
that are to be used for map allocations. Using kmalloc() for very large
allocations can cause excessive work within the page allocator, so i) fall
back earlier to vmalloc() when the attempt is considered costly anyway,
and even more importantly ii) don't trigger OOM killer with any of the
allocators.
Since this is based on a user space request, for example, when creating
maps with element pre-allocation, we really want such requests to fail
instead of killing other user space processes.
Also, don't spam the kernel log with warnings should any of the allocations
fail under pressure. Given that, we can make backend selection in
bpf_map_area_alloc() generic, and convert all maps over to use this API
for spots with potentially large allocation requests.
Note, replacing the one kmalloc_array() is fine as overflow checks happen
earlier in htab_map_alloc(), since it must also protect the multiplication
for vmalloc() should kmalloc_array() fail.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-18 15:14:17 +01:00
|
|
|
bpf_map_area_free(htab->buckets);
|
2022-09-02 14:10:53 -07:00
|
|
|
bpf_mem_alloc_destroy(&htab->pcpu_ma);
|
2022-09-02 14:10:44 -07:00
|
|
|
bpf_mem_alloc_destroy(&htab->ma);
|
2022-09-02 14:10:48 -07:00
|
|
|
if (htab->use_percpu_counter)
|
|
|
|
percpu_counter_destroy(&htab->pcount);
|
2020-10-29 00:19:25 -07:00
|
|
|
for (i = 0; i < HASHTAB_MAP_LOCK_COUNT; i++)
|
|
|
|
free_percpu(htab->map_locked[i]);
|
2020-11-02 03:41:00 -08:00
|
|
|
lockdep_unregister_key(&htab->lockdep_key);
|
2022-08-10 15:18:29 +00:00
|
|
|
bpf_map_area_free(htab);
|
2014-11-13 17:36:45 -08:00
|
|
|
}
|
|
|
|
|
2018-08-09 08:55:20 -07:00
|
|
|
static void htab_map_seq_show_elem(struct bpf_map *map, void *key,
				   struct seq_file *m)
{
	void *value;

	rcu_read_lock();

	value = htab_map_lookup_elem(map, key);
	if (!value) {
		rcu_read_unlock();
		return;
	}

	btf_type_seq_show(map->btf, map->btf_key_type_id, key, m);
	seq_puts(m, ": ");
	btf_type_seq_show(map->btf, map->btf_value_type_id, value, m);
	seq_puts(m, "\n");

	rcu_read_unlock();
}
|
|
|
|
|
2021-05-11 23:00:04 +02:00
|
|
|
static int __htab_map_lookup_and_delete_elem(struct bpf_map *map, void *key,
|
|
|
|
void *value, bool is_lru_map,
|
|
|
|
bool is_percpu, u64 flags)
|
|
|
|
{
|
|
|
|
struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
|
|
|
|
struct hlist_nulls_head *head;
|
|
|
|
unsigned long bflags;
|
|
|
|
struct htab_elem *l;
|
|
|
|
u32 hash, key_size;
|
|
|
|
struct bucket *b;
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
key_size = map->key_size;
|
|
|
|
|
|
|
|
hash = htab_map_hash(key, key_size, htab->hashrnd);
|
|
|
|
b = __select_bucket(htab, hash);
|
|
|
|
head = &b->head;
|
|
|
|
|
|
|
|
ret = htab_lock_bucket(htab, b, hash, &bflags);
|
|
|
|
if (ret)
|
|
|
|
return ret;
|
|
|
|
|
|
|
|
l = lookup_elem_raw(head, hash, key, key_size);
|
|
|
|
if (!l) {
|
|
|
|
ret = -ENOENT;
|
|
|
|
} else {
|
|
|
|
if (is_percpu) {
|
|
|
|
u32 roundup_value_size = round_up(map->value_size, 8);
|
|
|
|
void __percpu *pptr;
|
|
|
|
int off = 0, cpu;
|
|
|
|
|
|
|
|
pptr = htab_elem_get_ptr(l, key_size);
|
|
|
|
for_each_possible_cpu(cpu) {
|
2023-02-25 16:40:08 +01:00
|
|
|
copy_map_value_long(&htab->map, value + off, per_cpu_ptr(pptr, cpu));
|
|
|
|
check_and_init_map_value(&htab->map, value + off);
|
2021-05-11 23:00:04 +02:00
|
|
|
off += roundup_value_size;
|
|
|
|
}
|
|
|
|
} else {
|
|
|
|
u32 roundup_key_size = round_up(map->key_size, 8);
|
|
|
|
|
|
|
|
if (flags & BPF_F_LOCK)
|
|
|
|
copy_map_value_locked(map, value, l->key +
|
|
|
|
roundup_key_size,
|
|
|
|
true);
|
|
|
|
else
|
|
|
|
copy_map_value(map, value, l->key +
|
|
|
|
roundup_key_size);
|
bpf: Zeroing allocated object from slab in bpf memory allocator
Currently the freed element in bpf memory allocator may be immediately
reused, for htab map the reuse will reinitialize special fields in map
value (e.g., bpf_spin_lock), but lookup procedure may still access
these special fields, and it may lead to hard-lockup as shown below:
NMI backtrace for cpu 16
CPU: 16 PID: 2574 Comm: htab.bin Tainted: G L 6.1.0+ #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
RIP: 0010:queued_spin_lock_slowpath+0x283/0x2c0
......
Call Trace:
<TASK>
copy_map_value_locked+0xb7/0x170
bpf_map_copy_value+0x113/0x3c0
__sys_bpf+0x1c67/0x2780
__x64_sys_bpf+0x1c/0x20
do_syscall_64+0x30/0x60
entry_SYSCALL_64_after_hwframe+0x46/0xb0
......
</TASK>
For htab map, just like the preallocated case, there is no need to
initialize these special fields in map value again once these fields
have been initialized. For preallocated htab map, these fields are
initialized through __GFP_ZERO in bpf_map_area_alloc(), so do the
similar thing for non-preallocated htab in bpf memory allocator. And
there is no need to use __GFP_ZERO for per-cpu bpf memory allocator,
because __alloc_percpu_gfp() does it implicitly.
Fixes: 0fd7c5d43339 ("bpf: Optimize call_rcu in non-preallocated hash map.")
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20230215082132.3856544-2-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-02-15 16:21:31 +08:00
|
|
|
/* Zeroing special fields in the temp buffer */
|
2021-07-14 17:54:10 -07:00
|
|
|
check_and_init_map_value(map, value);
|
2021-05-11 23:00:04 +02:00
|
|
|
}
|
|
|
|
|
|
|
|
hlist_nulls_del_rcu(&l->hash_node);
|
|
|
|
if (!is_lru_map)
|
|
|
|
free_htab_elem(htab, l);
|
|
|
|
}
|
|
|
|
|
|
|
|
htab_unlock_bucket(htab, b, hash, bflags);
|
|
|
|
|
|
|
|
if (is_lru_map && l)
|
2021-07-14 17:54:10 -07:00
|
|
|
htab_lru_push_free(htab, l);
|
2021-05-11 23:00:04 +02:00
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
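The per-CPU branch of the function above copies one value per possible CPU into the caller's buffer, advancing the destination offset by round_up(value_size, 8) each time. The following is a minimal user-space sketch of that buffer layout, not the kernel implementation: round_up8(), percpu_buf_size() and flatten_percpu() are hypothetical names, and the sketch copies exactly value_size bytes and zero-fills the padding, which is a simplification of what copy_map_value_long() does.

```c
#include <stddef.h>
#include <string.h>

/* Round x up to the next multiple of 8 (hypothetical helper). */
static size_t round_up8(size_t x)
{
	return (x + 7) & ~(size_t)7;
}

/* Bytes a caller must provide to receive one value per CPU. */
static size_t percpu_buf_size(size_t value_size, unsigned int ncpus)
{
	return round_up8(value_size) * ncpus;
}

/*
 * Copy one value per CPU into a flat buffer. Each slot starts at a
 * multiple of round_up8(value_size); padding bytes are zeroed here,
 * which is an assumption of this sketch.
 */
static void flatten_percpu(void *dst, const void *const *src,
			   size_t value_size, unsigned int ncpus)
{
	unsigned char *out = dst;
	size_t stride = round_up8(value_size);
	size_t off = 0;
	unsigned int cpu;

	for (cpu = 0; cpu < ncpus; cpu++) {
		memset(out + off, 0, stride);
		memcpy(out + off, src[cpu], value_size);
		off += stride;
	}
}
```

With a 12-byte value and three CPUs, the slots start at offsets 0, 16 and 32, and the buffer must hold 48 bytes.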
|
|
|
|
|
|
|
|
static int htab_map_lookup_and_delete_elem(struct bpf_map *map, void *key,
					   void *value, u64 flags)
{
	return __htab_map_lookup_and_delete_elem(map, key, value, false, false,
						 flags);
}

static int htab_percpu_map_lookup_and_delete_elem(struct bpf_map *map,
						  void *key, void *value,
						  u64 flags)
{
	return __htab_map_lookup_and_delete_elem(map, key, value, false, true,
						 flags);
}

static int htab_lru_map_lookup_and_delete_elem(struct bpf_map *map, void *key,
					       void *value, u64 flags)
{
	return __htab_map_lookup_and_delete_elem(map, key, value, true, false,
						 flags);
}

static int htab_lru_percpu_map_lookup_and_delete_elem(struct bpf_map *map,
						      void *key, void *value,
						      u64 flags)
{
	return __htab_map_lookup_and_delete_elem(map, key, value, true, true,
						 flags);
}
|
|
|
|
|
2020-01-15 10:43:04 -08:00
|
|
|
static int
|
|
|
|
__htab_map_lookup_and_delete_batch(struct bpf_map *map,
|
|
|
|
const union bpf_attr *attr,
|
|
|
|
union bpf_attr __user *uattr,
|
|
|
|
bool do_delete, bool is_lru_map,
|
|
|
|
bool is_percpu)
|
|
|
|
{
|
|
|
|
struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
|
|
|
|
u32 bucket_cnt, total, key_size, value_size, roundup_key_size;
|
|
|
|
void *keys = NULL, *values = NULL, *value, *dst_key, *dst_val;
|
|
|
|
void __user *uvalues = u64_to_user_ptr(attr->batch.values);
|
|
|
|
void __user *ukeys = u64_to_user_ptr(attr->batch.keys);
|
2020-12-07 13:37:20 +01:00
|
|
|
void __user *ubatch = u64_to_user_ptr(attr->batch.in_batch);
|
2022-05-10 01:22:20 -07:00
|
|
|
u32 batch, max_count, size, bucket_size, map_id;
|
2020-02-19 15:47:57 -08:00
|
|
|
struct htab_elem *node_to_free = NULL;
|
2020-01-15 10:43:04 -08:00
|
|
|
u64 elem_map_flags, map_flags;
|
|
|
|
struct hlist_nulls_head *head;
|
|
|
|
struct hlist_nulls_node *n;
|
2020-02-18 09:25:52 -08:00
|
|
|
unsigned long flags = 0;
|
|
|
|
bool locked = false;
|
2020-01-15 10:43:04 -08:00
|
|
|
struct htab_elem *l;
|
|
|
|
struct bucket *b;
|
|
|
|
int ret = 0;
|
|
|
|
|
|
|
|
elem_map_flags = attr->batch.elem_flags;
|
|
|
|
if ((elem_map_flags & ~BPF_F_LOCK) ||
|
2022-11-04 00:39:56 +05:30
|
|
|
((elem_map_flags & BPF_F_LOCK) && !btf_record_has_field(map->record, BPF_SPIN_LOCK)))
|
2020-01-15 10:43:04 -08:00
|
|
|
return -EINVAL;
|
|
|
|
|
|
|
|
map_flags = attr->batch.flags;
|
|
|
|
if (map_flags)
|
|
|
|
return -EINVAL;
|
|
|
|
|
|
|
|
max_count = attr->batch.count;
|
|
|
|
if (!max_count)
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
if (put_user(0, &uattr->batch.count))
|
|
|
|
return -EFAULT;
|
|
|
|
|
|
|
|
batch = 0;
|
|
|
|
if (ubatch && copy_from_user(&batch, ubatch, sizeof(batch)))
|
|
|
|
return -EFAULT;
|
|
|
|
|
|
|
|
if (batch >= htab->n_buckets)
|
|
|
|
return -ENOENT;
|
|
|
|
|
|
|
|
key_size = htab->map.key_size;
|
|
|
|
roundup_key_size = round_up(htab->map.key_size, 8);
|
|
|
|
value_size = htab->map.value_size;
|
|
|
|
size = round_up(value_size, 8);
|
|
|
|
if (is_percpu)
|
|
|
|
value_size = size * num_possible_cpus();
|
|
|
|
total = 0;
|
|
|
|
/* while experimenting with hash tables with sizes ranging from 10 to
|
2022-02-20 10:40:55 -08:00
|
|
|
* 1000, it was observed that a bucket can have up to 5 entries.
|
2020-01-15 10:43:04 -08:00
|
|
|
*/
|
|
|
|
bucket_size = 5;
|
|
|
|
|
|
|
|
alloc:
|
|
|
|
/* We cannot do copy_from_user or copy_to_user inside
|
|
|
|
* the rcu_read_lock. Allocate enough space here.
|
|
|
|
*/
|
bpf: Fix integer overflow involving bucket_size
In __htab_map_lookup_and_delete_batch(), hash buckets are iterated
over to count the number of elements in each bucket (bucket_size).
If bucket_size is large enough, the multiplication to calculate
kvmalloc() size could overflow, resulting in out-of-bounds write
as reported by KASAN:
[...]
[ 104.986052] BUG: KASAN: vmalloc-out-of-bounds in __htab_map_lookup_and_delete_batch+0x5ce/0xb60
[ 104.986489] Write of size 4194224 at addr ffffc9010503be70 by task crash/112
[ 104.986889]
[ 104.987193] CPU: 0 PID: 112 Comm: crash Not tainted 5.14.0-rc4 #13
[ 104.987552] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014
[ 104.988104] Call Trace:
[ 104.988410] dump_stack_lvl+0x34/0x44
[ 104.988706] print_address_description.constprop.0+0x21/0x140
[ 104.988991] ? __htab_map_lookup_and_delete_batch+0x5ce/0xb60
[ 104.989327] ? __htab_map_lookup_and_delete_batch+0x5ce/0xb60
[ 104.989622] kasan_report.cold+0x7f/0x11b
[ 104.989881] ? __htab_map_lookup_and_delete_batch+0x5ce/0xb60
[ 104.990239] kasan_check_range+0x17c/0x1e0
[ 104.990467] memcpy+0x39/0x60
[ 104.990670] __htab_map_lookup_and_delete_batch+0x5ce/0xb60
[ 104.990982] ? __wake_up_common+0x4d/0x230
[ 104.991256] ? htab_of_map_free+0x130/0x130
[ 104.991541] bpf_map_do_batch+0x1fb/0x220
[...]
In hashtable, if the elements' keys have the same jhash() value, the
elements will be put into the same bucket. By putting a lot of elements
into a single bucket, the value of bucket_size can be increased to
trigger the integer overflow.
Triggering the overflow is possible for both callers with CAP_SYS_ADMIN
and callers without CAP_SYS_ADMIN.
It will be trivial for a caller with CAP_SYS_ADMIN to intentionally
reach this overflow by enabling BPF_F_ZERO_SEED. As this flag will set
the random seed passed to jhash() to 0, it will be easy for the caller
to prepare keys which will be hashed into the same value, and thus put
all the elements into the same bucket.
If the caller does not have CAP_SYS_ADMIN, BPF_F_ZERO_SEED cannot be
used. However, it will be still technically possible to trigger the
overflow, by guessing the random seed value passed to jhash() (32bit)
and repeating the attempt to trigger the overflow. In this case,
the probability to trigger the overflow will be low and will take
a very long time.
Fix the integer overflow by calling kvmalloc_array() instead of
kvmalloc() to allocate memory.
Fixes: 057996380a42 ("bpf: Add batch ops to all htab bpf map")
Signed-off-by: Tatsuhiko Yasumatsu <th.yasumatsu@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20210806150419.109658-1-th.yasumatsu@gmail.com
2021-08-07 00:04:18 +09:00
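The overflow described in the commit message above comes from multiplying two 32-bit sizes before allocating. A minimal user-space sketch of the difference between an unchecked product and an overflow-checked array allocation follows. checked_array_alloc() and wrapped_u32_product() are hypothetical names used only for illustration; the sketch relies on the GCC/Clang builtin __builtin_mul_overflow and uses malloc() as a stand-in for the kernel allocator.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Unchecked form: a 32-bit product wraps modulo 2^32, so a large
 * element count can yield a small (or zero) allocation size.
 */
static uint32_t wrapped_u32_product(uint32_t elem_size, uint32_t count)
{
	return elem_size * count;
}

/*
 * Checked form (hypothetical user-space stand-in): refuse the request
 * when n * size does not fit in size_t instead of allocating a short
 * buffer.
 */
static void *checked_array_alloc(size_t n, size_t size)
{
	size_t bytes;

	/* Returns true when the multiplication overflowed. */
	if (__builtin_mul_overflow(n, size, &bytes))
		return NULL;
	return malloc(bytes);
}
```

For example, wrapped_u32_product(0x10000, 0x10000) evaluates to 0, while checked_array_alloc(SIZE_MAX, 2) returns NULL rather than a buffer that is too small for the later memcpy().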
|
|
|
keys = kvmalloc_array(key_size, bucket_size, GFP_USER | __GFP_NOWARN);
|
|
|
|
values = kvmalloc_array(value_size, bucket_size, GFP_USER | __GFP_NOWARN);
|
2020-01-15 10:43:04 -08:00
|
|
|
if (!keys || !values) {
|
|
|
|
ret = -ENOMEM;
|
|
|
|
goto after_loop;
|
|
|
|
}
|
|
|
|
|
|
|
|
again:
|
2020-02-24 15:01:48 +01:00
|
|
|
bpf_disable_instrumentation();
|
2020-01-15 10:43:04 -08:00
|
|
|
rcu_read_lock();
|
|
|
|
again_nocopy:
|
|
|
|
dst_key = keys;
|
|
|
|
dst_val = values;
|
|
|
|
b = &htab->buckets[batch];
|
|
|
|
head = &b->head;
|
2020-02-18 09:25:52 -08:00
|
|
|
/* do not grab the lock unless we need it (bucket_cnt > 0). */
|
2020-10-29 00:19:25 -07:00
|
|
|
if (locked) {
|
|
|
|
ret = htab_lock_bucket(htab, b, batch, &flags);
|
2022-08-31 12:26:28 +08:00
|
|
|
if (ret) {
|
|
|
|
rcu_read_unlock();
|
|
|
|
bpf_enable_instrumentation();
|
|
|
|
goto after_loop;
|
|
|
|
}
|
2020-10-29 00:19:25 -07:00
|
|
|
}
|
2020-01-15 10:43:04 -08:00
|
|
|
|
|
|
|
bucket_cnt = 0;
|
|
|
|
hlist_nulls_for_each_entry_rcu(l, n, head, hash_node)
|
|
|
|
bucket_cnt++;
|
|
|
|
|
2020-02-18 09:25:52 -08:00
|
|
|
if (bucket_cnt && !locked) {
|
|
|
|
locked = true;
|
|
|
|
goto again_nocopy;
|
|
|
|
}
|
|
|
|
|
2020-01-15 10:43:04 -08:00
|
|
|
if (bucket_cnt > (max_count - total)) {
|
|
|
|
if (total == 0)
|
|
|
|
ret = -ENOSPC;
|
2020-02-18 09:25:52 -08:00
|
|
|
/* Note that since bucket_cnt > 0 here, it is implicit
|
|
|
|
* that the lock was grabbed, so release it.
|
|
|
|
*/
|
2020-10-29 00:19:25 -07:00
|
|
|
htab_unlock_bucket(htab, b, batch, flags);
|
2020-01-15 10:43:04 -08:00
|
|
|
rcu_read_unlock();
|
2020-02-24 15:01:48 +01:00
|
|
|
bpf_enable_instrumentation();
|
2020-01-15 10:43:04 -08:00
|
|
|
goto after_loop;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (bucket_cnt > bucket_size) {
|
|
|
|
bucket_size = bucket_cnt;
|
2020-02-18 09:25:52 -08:00
|
|
|
/* Note that since bucket_cnt > 0 here, it is implicit
|
|
|
|
* that the lock was grabbed, so release it.
|
|
|
|
*/
|
2020-10-29 00:19:25 -07:00
|
|
|
htab_unlock_bucket(htab, b, batch, flags);
|
2020-01-15 10:43:04 -08:00
|
|
|
rcu_read_unlock();
|
2020-02-24 15:01:48 +01:00
|
|
|
bpf_enable_instrumentation();
|
2020-01-15 10:43:04 -08:00
|
|
|
kvfree(keys);
|
|
|
|
kvfree(values);
|
|
|
|
goto alloc;
|
|
|
|
}
|
|
|
|
|
2020-02-18 09:25:52 -08:00
|
|
|
/* Next block is only safe to run if you have grabbed the lock */
|
|
|
|
if (!locked)
|
|
|
|
goto next_batch;
|
|
|
|
|
2020-01-15 10:43:04 -08:00
|
|
|
hlist_nulls_for_each_entry_safe(l, n, head, hash_node) {
|
|
|
|
memcpy(dst_key, l->key, key_size);
|
|
|
|
|
|
|
|
if (is_percpu) {
|
|
|
|
int off = 0, cpu;
|
|
|
|
void __percpu *pptr;
|
|
|
|
|
|
|
|
pptr = htab_elem_get_ptr(l, map->key_size);
|
|
|
|
for_each_possible_cpu(cpu) {
|
2023-02-25 16:40:08 +01:00
|
|
|
copy_map_value_long(&htab->map, dst_val + off, per_cpu_ptr(pptr, cpu));
|
|
|
|
check_and_init_map_value(&htab->map, dst_val + off);
|
2020-01-15 10:43:04 -08:00
|
|
|
off += size;
|
|
|
|
}
|
|
|
|
} else {
|
|
|
|
value = l->key + roundup_key_size;
|
2022-05-10 01:22:20 -07:00
|
|
|
if (map->map_type == BPF_MAP_TYPE_HASH_OF_MAPS) {
|
|
|
|
struct bpf_map **inner_map = value;
|
|
|
|
|
|
|
|
/* Actual value is the id of the inner map */
|
|
|
|
map_id = map->ops->map_fd_sys_lookup_elem(*inner_map);
|
|
|
|
value = &map_id;
|
|
|
|
}
|
|
|
|
|
2020-01-15 10:43:04 -08:00
|
|
|
if (elem_map_flags & BPF_F_LOCK)
|
|
|
|
copy_map_value_locked(map, dst_val, value,
|
|
|
|
true);
|
|
|
|
else
|
|
|
|
copy_map_value(map, dst_val, value);
|
2023-02-15 16:21:31 +08:00
|
|
|
/* Zeroing special fields in the temp buffer */
|
2021-07-14 17:54:10 -07:00
|
|
|
check_and_init_map_value(map, dst_val);
|
2020-01-15 10:43:04 -08:00
|
|
|
}
|
|
|
|
if (do_delete) {
|
|
|
|
hlist_nulls_del_rcu(&l->hash_node);
|
2020-02-19 15:47:57 -08:00
|
|
|
|
|
|
|
/* bpf_lru_push_free() will acquire lru_lock, which
|
|
|
|
* may cause deadlock. See comments in function
|
|
|
|
* prealloc_lru_pop(). Let us do bpf_lru_push_free()
|
|
|
|
* after releasing the bucket lock.
|
|
|
|
*/
|
|
|
|
if (is_lru_map) {
|
|
|
|
l->batch_flink = node_to_free;
|
|
|
|
node_to_free = l;
|
|
|
|
} else {
|
2020-01-15 10:43:04 -08:00
|
|
|
free_htab_elem(htab, l);
|
2020-02-19 15:47:57 -08:00
|
|
|
}
|
2020-01-15 10:43:04 -08:00
|
|
|
}
|
|
|
|
dst_key += key_size;
|
|
|
|
dst_val += value_size;
|
|
|
|
}
|
|
|
|
|
2020-10-29 00:19:25 -07:00
|
|
|
htab_unlock_bucket(htab, b, batch, flags);
|
2020-02-18 09:25:52 -08:00
|
|
|
locked = false;
|
2020-02-19 15:47:57 -08:00
|
|
|
|
|
|
|
while (node_to_free) {
|
|
|
|
l = node_to_free;
|
|
|
|
node_to_free = node_to_free->batch_flink;
|
2021-07-14 17:54:10 -07:00
|
|
|
htab_lru_push_free(htab, l);
|
2020-02-19 15:47:57 -08:00
|
|
|
}
|
|
|
|
|
2020-02-18 09:25:52 -08:00
|
|
|
next_batch:
|
2020-01-15 10:43:04 -08:00
|
|
|
/* If we are not copying data, we can go to next bucket and avoid
|
|
|
|
* unlocking the rcu.
|
|
|
|
*/
|
|
|
|
if (!bucket_cnt && (batch + 1 < htab->n_buckets)) {
|
|
|
|
batch++;
|
|
|
|
goto again_nocopy;
|
|
|
|
}
|
|
|
|
|
|
|
|
rcu_read_unlock();
|
2020-02-24 15:01:48 +01:00
|
|
|
bpf_enable_instrumentation();
|
2020-01-15 10:43:04 -08:00
|
|
|
if (bucket_cnt && (copy_to_user(ukeys + total * key_size, keys,
|
|
|
|
key_size * bucket_cnt) ||
|
|
|
|
copy_to_user(uvalues + total * value_size, values,
|
|
|
|
value_size * bucket_cnt))) {
|
|
|
|
ret = -EFAULT;
|
|
|
|
goto after_loop;
|
|
|
|
}
|
|
|
|
|
|
|
|
total += bucket_cnt;
|
|
|
|
batch++;
|
|
|
|
if (batch >= htab->n_buckets) {
|
|
|
|
ret = -ENOENT;
|
|
|
|
goto after_loop;
|
|
|
|
}
|
|
|
|
goto again;
|
|
|
|
|
|
|
|
after_loop:
|
|
|
|
if (ret == -EFAULT)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
/* copy # of entries and next batch */
|
|
|
|
ubatch = u64_to_user_ptr(attr->batch.out_batch);
|
|
|
|
if (copy_to_user(ubatch, &batch, sizeof(batch)) ||
|
|
|
|
put_user(total, &uattr->batch.count))
|
|
|
|
ret = -EFAULT;
|
|
|
|
|
|
|
|
out:
|
|
|
|
kvfree(keys);
|
|
|
|
kvfree(values);
|
|
|
|
return ret;
|
|
|
|
}
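The batch function above works at bucket granularity: a bucket is either copied out completely or not at all, and the value returned through out_batch is the index of the next bucket to visit. A simplified user-space sketch of that control flow is shown below. It models only the counting and return codes (no locking, copying or reallocation); batch_walk() and its parameters are hypothetical names, and bucket_cnt[] stands in for the number of elements found in each bucket.

```c
#include <errno.h>

/*
 * Walk buckets starting at *batch and account for whole buckets until
 * max_count would be exceeded or the table ends.
 *
 * Returns 0 when the walk stopped early with room exhausted, -ENOSPC
 * when the very first non-skippable bucket does not fit, and -ENOENT
 * when the end of the table was reached. *batch is updated to the next
 * bucket to visit and *total to the number of elements accounted for.
 */
static int batch_walk(const unsigned int *bucket_cnt, unsigned int n_buckets,
		      unsigned int *batch, unsigned int max_count,
		      unsigned int *total)
{
	unsigned int b = *batch;

	*total = 0;
	if (!max_count)
		return 0;
	if (b >= n_buckets)
		return -ENOENT;

	for (;;) {
		unsigned int cnt = bucket_cnt[b];

		/* A bucket is never split across two calls. */
		if (cnt > max_count - *total) {
			*batch = b;
			return *total == 0 ? -ENOSPC : 0;
		}

		*total += cnt;
		b++;
		if (b >= n_buckets) {
			*batch = b;
			return -ENOENT;
		}
	}
}
```

A caller would typically loop, feeding the returned batch index back in, and treat -ENOENT as the end of the table; -ENOSPC indicates that the supplied count is too small for a single bucket.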
|
|
|
|
|
|
|
|
static int
htab_percpu_map_lookup_batch(struct bpf_map *map, const union bpf_attr *attr,
			     union bpf_attr __user *uattr)
{
	return __htab_map_lookup_and_delete_batch(map, attr, uattr, false,
						  false, true);
}

static int
htab_percpu_map_lookup_and_delete_batch(struct bpf_map *map,
					const union bpf_attr *attr,
					union bpf_attr __user *uattr)
{
	return __htab_map_lookup_and_delete_batch(map, attr, uattr, true,
						  false, true);
}

static int
htab_map_lookup_batch(struct bpf_map *map, const union bpf_attr *attr,
		      union bpf_attr __user *uattr)
{
	return __htab_map_lookup_and_delete_batch(map, attr, uattr, false,
						  false, false);
}

static int
htab_map_lookup_and_delete_batch(struct bpf_map *map,
				 const union bpf_attr *attr,
				 union bpf_attr __user *uattr)
{
	return __htab_map_lookup_and_delete_batch(map, attr, uattr, true,
						  false, false);
}

static int
htab_lru_percpu_map_lookup_batch(struct bpf_map *map,
				 const union bpf_attr *attr,
				 union bpf_attr __user *uattr)
{
	return __htab_map_lookup_and_delete_batch(map, attr, uattr, false,
						  true, true);
}

static int
htab_lru_percpu_map_lookup_and_delete_batch(struct bpf_map *map,
					    const union bpf_attr *attr,
					    union bpf_attr __user *uattr)
{
	return __htab_map_lookup_and_delete_batch(map, attr, uattr, true,
						  true, true);
}

static int
htab_lru_map_lookup_batch(struct bpf_map *map, const union bpf_attr *attr,
			  union bpf_attr __user *uattr)
{
	return __htab_map_lookup_and_delete_batch(map, attr, uattr, false,
						  true, false);
}

static int
htab_lru_map_lookup_and_delete_batch(struct bpf_map *map,
				     const union bpf_attr *attr,
				     union bpf_attr __user *uattr)
{
	return __htab_map_lookup_and_delete_batch(map, attr, uattr, true,
						  true, false);
}
|
|
|
|
|
bpf: Implement bpf iterator for hash maps
The bpf iterators for hash, percpu hash, lru hash
and lru percpu hash are implemented. During link time,
bpf_iter_reg->check_target() will check map type
and ensure the program access key/value region is
within the map defined key/value size limit.
For percpu hash and lru hash maps, the bpf program
will receive values for all cpus. The map element
bpf iterator infrastructure will prepare value
properly before passing the value pointer to the
bpf program.
This patch set supports readonly map keys and
read/write map values. It does not support deleting
map elements, e.g., from hash tables. If there is
a use case for this, the following mechanism can
be used to support map deletion for hashtab, etc.
- permit a new bpf program return value, e.g., 2,
to let bpf iterator know the map element should
be removed.
- since bucket lock is taken, the map element will be
queued.
- once bucket lock is released after all elements under
this bucket are traversed, all to-be-deleted map
elements can be deleted.
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200723184114.590470-1-yhs@fb.com
2020-07-23 11:41:14 -07:00
|
|
|
struct bpf_iter_seq_hash_map_info {
	struct bpf_map *map;
	struct bpf_htab *htab;
	void *percpu_value_buf; // non-zero means percpu hash
	u32 bucket_id;
	u32 skip_elems;
};
|
|
|
|
|
|
|
|
static struct htab_elem *
|
|
|
|
bpf_hash_map_seq_find_next(struct bpf_iter_seq_hash_map_info *info,
|
|
|
|
struct htab_elem *prev_elem)
|
|
|
|
{
|
|
|
|
const struct bpf_htab *htab = info->htab;
|
|
|
|
u32 skip_elems = info->skip_elems;
|
|
|
|
u32 bucket_id = info->bucket_id;
|
|
|
|
struct hlist_nulls_head *head;
|
|
|
|
struct hlist_nulls_node *n;
|
|
|
|
struct htab_elem *elem;
|
|
|
|
struct bucket *b;
|
|
|
|
u32 i, count;
|
|
|
|
|
|
|
|
if (bucket_id >= htab->n_buckets)
|
|
|
|
return NULL;
|
|
|
|
|
|
|
|
/* try to find next elem in the same bucket */
|
|
|
|
if (prev_elem) {
|
|
|
|
/* no update/deletion on this bucket, prev_elem should be still valid
|
|
|
|
* and we won't skip elements.
|
|
|
|
*/
|
|
|
|
n = rcu_dereference_raw(hlist_nulls_next_rcu(&prev_elem->hash_node));
|
|
|
|
elem = hlist_nulls_entry_safe(n, struct htab_elem, hash_node);
|
|
|
|
if (elem)
|
|
|
|
return elem;
|
|
|
|
|
|
|
|
/* not found, unlock and go to the next bucket */
|
|
|
|
b = &htab->buckets[bucket_id++];
|
bpf: Do not use bucket_lock for hashmap iterator
Currently, for hashmap, the bpf iterator will grab a bucket lock, a
spinlock, before traversing the elements in the bucket. This can ensure
all bpf visited elements are valid. But this mechanism may cause
deadlock if update/deletion happens to the same bucket of the
visited map in the program. For example, if we added bpf_map_update_elem()
call to the same visited element in selftests bpf_iter_bpf_hash_map.c,
we will have the following deadlock:
============================================
WARNING: possible recursive locking detected
5.9.0-rc1+ #841 Not tainted
--------------------------------------------
test_progs/1750 is trying to acquire lock:
ffff9a5bb73c5e70 (&htab->buckets[i].raw_lock){....}-{2:2}, at: htab_map_update_elem+0x1cf/0x410
but task is already holding lock:
ffff9a5bb73c5e20 (&htab->buckets[i].raw_lock){....}-{2:2}, at: bpf_hash_map_seq_find_next+0x94/0x120
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(&htab->buckets[i].raw_lock);
lock(&htab->buckets[i].raw_lock);
*** DEADLOCK ***
...
Call Trace:
dump_stack+0x78/0xa0
__lock_acquire.cold.74+0x209/0x2e3
lock_acquire+0xba/0x380
? htab_map_update_elem+0x1cf/0x410
? __lock_acquire+0x639/0x20c0
_raw_spin_lock_irqsave+0x3b/0x80
? htab_map_update_elem+0x1cf/0x410
htab_map_update_elem+0x1cf/0x410
? lock_acquire+0xba/0x380
bpf_prog_ad6dab10433b135d_dump_bpf_hash_map+0x88/0xa9c
? find_held_lock+0x34/0xa0
bpf_iter_run_prog+0x81/0x16e
__bpf_hash_map_seq_show+0x145/0x180
bpf_seq_read+0xff/0x3d0
vfs_read+0xad/0x1c0
ksys_read+0x5f/0xe0
do_syscall_64+0x33/0x40
entry_SYSCALL_64_after_hwframe+0x44/0xa9
...
The bucket_lock is first grabbed in seq_ops->next() called by bpf_seq_read(),
and then grabbed again in htab_map_update_elem() in the bpf program, causing
deadlocks.
Actually, we do not need bucket_lock here, we can just use rcu_read_lock()
similar to netlink iterator where the rcu_read_{lock,unlock} looks like below:
  seq_ops->start():
      rcu_read_lock();
  seq_ops->next():
      rcu_read_unlock();
      /* next element */
      rcu_read_lock();
  seq_ops->stop():
      rcu_read_unlock();
Compared to old bucket_lock mechanism, if concurrent update/delete happens,
we may visit stale elements, miss some elements, or repeat some elements.
I think this is a reasonable compromise. For users wanting to avoid
stale, missing/repeated accesses, bpf_map batch access syscall interface
can be used.
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200902235340.2001375-1-yhs@fb.com
2020-09-02 16:53:40 -07:00
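The start/next/stop scheme from the commit message above keeps exactly one read-side critical section open while an element is being visited and none once the iteration has stopped. The sketch below models that balance in user space, under loud assumptions: rcu_read_lock_stub() and rcu_read_unlock_stub() only count nesting and provide no RCU semantics, and struct demo_iter with its demo_* callbacks is a hypothetical iterator over a plain array rather than the hash map iterator itself.

```c
#include <stddef.h>

/* Stand-in for read-side nesting depth; not real RCU. */
static int rcu_depth;

static void rcu_read_lock_stub(void)   { rcu_depth++; }
static void rcu_read_unlock_stub(void) { rcu_depth--; }

/* Hypothetical iterator over a plain array. */
struct demo_iter {
	const int *items;
	size_t n;
	size_t pos;
};

/* start(): enter the read-side section and return the first element. */
static const int *demo_start(struct demo_iter *it)
{
	rcu_read_lock_stub();
	it->pos = 0;
	return it->pos < it->n ? &it->items[it->pos] : NULL;
}

/* next(): leave the section, advance, then re-enter for the new element. */
static const int *demo_next(struct demo_iter *it)
{
	rcu_read_unlock_stub();
	it->pos++;
	rcu_read_lock_stub();
	return it->pos < it->n ? &it->items[it->pos] : NULL;
}

/* stop(): leave the section opened by start() or the last next(). */
static void demo_stop(struct demo_iter *it)
{
	(void)it;
	rcu_read_unlock_stub();
}
```

Because no spinlock is held between callbacks in this scheme, code invoked while visiting an element may take a per-bucket lock itself without self-deadlocking; the trade-off, as the commit message notes, is that concurrent changes can be observed as stale, missing, or repeated elements.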
|
|
|
rcu_read_unlock();
|
2020-07-23 11:41:14 -07:00
|
|
|
skip_elems = 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
for (i = bucket_id; i < htab->n_buckets; i++) {
|
|
|
|
b = &htab->buckets[i];
|
bpf: Do not use bucket_lock for hashmap iterator
Currently, for hashmap, the bpf iterator will grab a bucket lock, a
spinlock, before traversing the elements in the bucket. This can ensure
all bpf visted elements are valid. But this mechanism may cause
deadlock if update/deletion happens to the same bucket of the
visited map in the program. For example, if we added bpf_map_update_elem()
call to the same visited element in selftests bpf_iter_bpf_hash_map.c,
we will have the following deadlock:
============================================
WARNING: possible recursive locking detected
5.9.0-rc1+ #841 Not tainted
--------------------------------------------
test_progs/1750 is trying to acquire lock:
ffff9a5bb73c5e70 (&htab->buckets[i].raw_lock){....}-{2:2}, at: htab_map_update_elem+0x1cf/0x410
but task is already holding lock:
ffff9a5bb73c5e20 (&htab->buckets[i].raw_lock){....}-{2:2}, at: bpf_hash_map_seq_find_next+0x94/0x120
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(&htab->buckets[i].raw_lock);
lock(&htab->buckets[i].raw_lock);
*** DEADLOCK ***
...
Call Trace:
dump_stack+0x78/0xa0
__lock_acquire.cold.74+0x209/0x2e3
lock_acquire+0xba/0x380
? htab_map_update_elem+0x1cf/0x410
? __lock_acquire+0x639/0x20c0
_raw_spin_lock_irqsave+0x3b/0x80
? htab_map_update_elem+0x1cf/0x410
htab_map_update_elem+0x1cf/0x410
? lock_acquire+0xba/0x380
bpf_prog_ad6dab10433b135d_dump_bpf_hash_map+0x88/0xa9c
? find_held_lock+0x34/0xa0
bpf_iter_run_prog+0x81/0x16e
__bpf_hash_map_seq_show+0x145/0x180
bpf_seq_read+0xff/0x3d0
vfs_read+0xad/0x1c0
ksys_read+0x5f/0xe0
do_syscall_64+0x33/0x40
entry_SYSCALL_64_after_hwframe+0x44/0xa9
...
The bucket_lock is first grabbed in seq_ops->next(), called by bpf_seq_read(),
and then grabbed again in htab_map_update_elem() in the bpf program, causing
a deadlock.
Actually, we do not need the bucket_lock here; we can just use rcu_read_lock(),
similar to the netlink iterator, where the rcu_read_{lock,unlock} calls look like below:
seq_ops->start():
    rcu_read_lock();
seq_ops->next():
    rcu_read_unlock();
    /* next element */
    rcu_read_lock();
seq_ops->stop():
    rcu_read_unlock();
Compared to the old bucket_lock mechanism, if a concurrent update/delete happens,
we may visit stale elements, miss some elements, or repeat some elements.
I think this is a reasonable compromise. Users who want to avoid
stale, missing, or repeated accesses can use the bpf_map batch access syscall
interface.
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200902235340.2001375-1-yhs@fb.com
2020-09-02 16:53:40 -07:00
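The trade-off described in the commit message above can be reproduced in a small userspace model. The sketch below is not kernel code: `struct table`, `struct cursor`, `find_next()` and `delete_at()` are hypothetical names, buckets are plain arrays rather than RCU-protected lists, and the same-bucket fast path through the previous element is left out. It only illustrates how resuming from a (bucket_id, skip_elems) position can miss an element when an already-visited element is deleted between two calls.

```c
#include <assert.h>
#include <stddef.h>

#define N_BUCKETS  2
#define BUCKET_CAP 4

/* Hypothetical stand-in for the hash table: fixed-size array buckets. */
struct table {
	int elems[N_BUCKETS][BUCKET_CAP];
	size_t len[N_BUCKETS];
};

/* Resume position, modelled on the iterator's bucket_id/skip_elems pair. */
struct cursor {
	size_t bucket_id;
	size_t skip_elems;
};

/* Return the element at the cursor, or the first one after it; NULL at the end. */
static const int *find_next(const struct table *t, struct cursor *c)
{
	size_t skip = c->skip_elems;
	size_t i;

	for (i = c->bucket_id; i < N_BUCKETS; i++) {
		if (skip < t->len[i]) {
			c->bucket_id = i;
			c->skip_elems = skip;
			return &t->elems[i][skip];
		}
		skip = 0;	/* later buckets are scanned from their start */
	}
	c->bucket_id = i;
	c->skip_elems = 0;
	return NULL;
}

/* Models a concurrent delete: remove one element and close the gap. */
static void delete_at(struct table *t, size_t bucket, size_t idx)
{
	size_t j;

	assert(bucket < N_BUCKETS && idx < t->len[bucket]);
	for (j = idx + 1; j < t->len[bucket]; j++)
		t->elems[bucket][j - 1] = t->elems[bucket][j];
	t->len[bucket]--;
}
```

With bucket 0 holding {10, 11, 12}, visiting 10, deleting it, and then resuming at skip_elems = 1 lands on 12, so 11 is never shown. This is the "miss some elements" case the commit message accepts.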
		rcu_read_lock();
bpf: Implement bpf iterator for hash maps
The bpf iterators for hash, percpu hash, lru hash
and lru percpu hash are implemented. During link time,
bpf_iter_reg->check_target() will check the map type
and ensure the program's accesses to the key/value region are
within the key/value size limits defined by the map.
For percpu hash and lru percpu hash maps, the bpf program
will receive values for all cpus. The map element
bpf iterator infrastructure will prepare the value
properly before passing the value pointer to the
bpf program.
This patch set supports readonly map keys and
read/write map values. It does not support deleting
map elements, e.g., from hash tables. If there is
a use case for this, the following mechanism can
be used to support map deletion for hashtab, etc.
- permit a new bpf program return value, e.g., 2,
  to let the bpf iterator know the map element should
  be removed.
- since the bucket lock is taken, the map element will be
  queued.
- once the bucket lock is released after all elements under
  this bucket are traversed, all to-be-deleted map
  elements can be deleted.
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200723184114.590470-1-yhs@fb.com
2020-07-23 11:41:14 -07:00
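For the per-cpu case, the value handed to the bpf program is one flat buffer with a slot per possible CPU, each slot padded to a multiple of 8 bytes. A minimal userspace sketch of that layout follows. `flatten_percpu_values()` and `round_up8()` are hypothetical names, an array of pointers stands in for the kernel's per-cpu pointer, and zero-filling the padding is a choice made for this sketch, not something the commit states.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Round n up to the next multiple of 8. */
static size_t round_up8(size_t n)
{
	return (n + 7) & ~(size_t)7;
}

/*
 * Copy one value per CPU into buf, one 8-byte-aligned slot per CPU.
 * Returns the number of bytes of buf that were filled.
 */
static size_t flatten_percpu_values(unsigned char *buf, size_t buf_size,
				    const unsigned char *const *per_cpu_vals,
				    size_t value_size, unsigned int n_cpus)
{
	size_t stride = round_up8(value_size);
	size_t off = 0;
	unsigned int cpu;

	assert(buf_size >= stride * n_cpus);
	for (cpu = 0; cpu < n_cpus; cpu++) {
		memset(buf + off, 0, stride);	/* sketch choice: zero the padding */
		memcpy(buf + off, per_cpu_vals[cpu], value_size);
		off += stride;
	}
	return off;
}
```

A consumer of the buffer would find CPU n's value at offset n * round_up8(value_size), which is why the buffer is sized as the rounded value size times the CPU count.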
		count = 0;
		head = &b->head;
		hlist_nulls_for_each_entry_rcu(elem, n, head, hash_node) {
			if (count >= skip_elems) {
				info->bucket_id = i;
				info->skip_elems = count;
				return elem;
			}
			count++;
		}
2020-09-02 16:53:40 -07:00
		rcu_read_unlock();
2020-07-23 11:41:14 -07:00
		skip_elems = 0;
	}

	info->bucket_id = i;
	info->skip_elems = 0;
	return NULL;
}

static void *bpf_hash_map_seq_start(struct seq_file *seq, loff_t *pos)
{
	struct bpf_iter_seq_hash_map_info *info = seq->private;
	struct htab_elem *elem;

	elem = bpf_hash_map_seq_find_next(info, NULL);
	if (!elem)
		return NULL;

	if (*pos == 0)
		++*pos;
	return elem;
}

static void *bpf_hash_map_seq_next(struct seq_file *seq, void *v, loff_t *pos)
{
	struct bpf_iter_seq_hash_map_info *info = seq->private;

	++*pos;
	++info->skip_elems;
	return bpf_hash_map_seq_find_next(info, v);
}

static int __bpf_hash_map_seq_show(struct seq_file *seq, struct htab_elem *elem)
{
	struct bpf_iter_seq_hash_map_info *info = seq->private;
	u32 roundup_key_size, roundup_value_size;
	struct bpf_iter__bpf_map_elem ctx = {};
	struct bpf_map *map = info->map;
	struct bpf_iter_meta meta;
	int ret = 0, off = 0, cpu;
	struct bpf_prog *prog;
	void __percpu *pptr;

	meta.seq = seq;
	prog = bpf_iter_get_info(&meta, elem == NULL);
	if (prog) {
		ctx.meta = &meta;
		ctx.map = info->map;
		if (elem) {
			roundup_key_size = round_up(map->key_size, 8);
			ctx.key = elem->key;
			if (!info->percpu_value_buf) {
				ctx.value = elem->key + roundup_key_size;
			} else {
				roundup_value_size = round_up(map->value_size, 8);
				pptr = htab_elem_get_ptr(elem, map->key_size);
				for_each_possible_cpu(cpu) {
2023-02-25 16:40:08 +01:00
					copy_map_value_long(map, info->percpu_value_buf + off,
							    per_cpu_ptr(pptr, cpu));
					check_and_init_map_value(map, info->percpu_value_buf + off);
2020-07-23 11:41:14 -07:00
					off += roundup_value_size;
				}
				ctx.value = info->percpu_value_buf;
			}
		}
		ret = bpf_iter_run_prog(prog, &ctx);
	}

	return ret;
}

static int bpf_hash_map_seq_show(struct seq_file *seq, void *v)
{
	return __bpf_hash_map_seq_show(seq, v);
}

static void bpf_hash_map_seq_stop(struct seq_file *seq, void *v)
{
	if (!v)
		(void)__bpf_hash_map_seq_show(seq, NULL);
	else
2020-09-02 16:53:40 -07:00
		rcu_read_unlock();
2020-07-23 11:41:14 -07:00
}

static int bpf_iter_init_hash_map(void *priv_data,
				  struct bpf_iter_aux_info *aux)
{
	struct bpf_iter_seq_hash_map_info *seq_info = priv_data;
	struct bpf_map *map = aux->map;
	void *value_buf;
	u32 buf_size;

	if (map->map_type == BPF_MAP_TYPE_PERCPU_HASH ||
	    map->map_type == BPF_MAP_TYPE_LRU_PERCPU_HASH) {
		buf_size = round_up(map->value_size, 8) * num_possible_cpus();
		value_buf = kmalloc(buf_size, GFP_USER | __GFP_NOWARN);
		if (!value_buf)
			return -ENOMEM;

		seq_info->percpu_value_buf = value_buf;
	}
2022-08-10 16:05:31 +08:00
	bpf_map_inc_with_uref(map);
2020-07-23 11:41:14 -07:00
	seq_info->map = map;
	seq_info->htab = container_of(map, struct bpf_htab, map);
	return 0;
}

static void bpf_iter_fini_hash_map(void *priv_data)
{
	struct bpf_iter_seq_hash_map_info *seq_info = priv_data;
2022-08-10 16:05:31 +08:00
	bpf_map_put_with_uref(seq_info->map);
2020-07-23 11:41:14 -07:00
	kfree(seq_info->percpu_value_buf);
}

static const struct seq_operations bpf_hash_map_seq_ops = {
	.start = bpf_hash_map_seq_start,
	.next = bpf_hash_map_seq_next,
	.stop = bpf_hash_map_seq_stop,
	.show = bpf_hash_map_seq_show,
};

static const struct bpf_iter_seq_info iter_seq_info = {
	.seq_ops = &bpf_hash_map_seq_ops,
	.init_seq_private = bpf_iter_init_hash_map,
	.fini_seq_private = bpf_iter_fini_hash_map,
	.seq_priv_size = sizeof(struct bpf_iter_seq_hash_map_info),
};
bpf: return long from bpf_map_ops funcs
This patch changes the return types of bpf_map_ops functions to long, where
previously int was returned. Using long allows bpf programs to preserve
the sign bit in the absence of sign extension in situations where
inlined bpf helper funcs call the bpf_map_ops funcs and a negative
error is returned.
The definitions of the helper funcs are generated from comments in the bpf
uapi header at `include/uapi/linux/bpf.h`. The return type of these
helpers was previously changed from int to long in commit bdb7b79b4ce8. In
any case where one of the map helpers calls a bpf_map_ops func that
still returns a 32-bit int, a compiler might not emit the sign extension
instructions needed to convert the 32-bit negative value to a 64-bit
negative value.
For example:
bpf assembly excerpt of an inlined helper calling a kernel function and
checking for a specific error:
; err = bpf_map_update_elem(&mymap, &key, &val, BPF_NOEXIST);
...
46: call 0xffffffffe103291c ; htab_map_update_elem
; if (err && err != -EEXIST) {
4b: cmp $0xffffffffffffffef,%rax ; cmp -EEXIST,%rax
kernel function assembly excerpt of return value from
`htab_map_update_elem` returning 32-bit int:
movl $0xffffffef, %r9d
...
movl %r9d, %eax
...results in the comparison:
cmp $0xffffffffffffffef, $0x00000000ffffffef
Fixes: bdb7b79b4ce8 ("bpf: Switch most helper return values from 32-bit int to 64-bit long")
Tested-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
Link: https://lore.kernel.org/r/20230322194754.185781-3-inwardvessel@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-22 12:47:54 -07:00
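The failed comparison in the commit message above can be modelled in plain C. This is only a sketch of the arithmetic, not of the JIT or the kernel calling convention: `reg_after_int_return()`, `reg_after_long_return()` and `reg_matches_errno()` are hypothetical names, and the value 17 for EEXIST is taken from the 0x...ef constant shown in the excerpt.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_EEXIST 17	/* -17 is 0x...ef, as in the excerpt above */

/* Callee wrote only the low 32 bits; the upper half of the register is zero. */
static uint64_t reg_after_int_return(int32_t err)
{
	return (uint64_t)(uint32_t)err;
}

/* Callee returned a 64-bit long; the sign bit fills the upper half. */
static uint64_t reg_after_long_return(int64_t err)
{
	return (uint64_t)err;
}

/* The caller's 64-bit compare against a negative errno constant. */
static int reg_matches_errno(uint64_t reg, int errno_value)
{
	assert(errno_value > 0);
	return reg == (uint64_t)(-(int64_t)errno_value);
}
```

Under these assumptions, `reg_after_int_return(-SKETCH_EEXIST)` yields 0x00000000ffffffef, which does not match the 64-bit constant 0xffffffffffffffef, while `reg_after_long_return(-SKETCH_EEXIST)` does match.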
static long bpf_for_each_hash_elem(struct bpf_map *map, bpf_callback_t callback_fn,
				   void *callback_ctx, u64 flags)
2021-02-26 12:49:27 -08:00
{
	struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
	struct hlist_nulls_head *head;
	struct hlist_nulls_node *n;
	struct htab_elem *elem;
	u32 roundup_key_size;
	int i, num_elems = 0;
	void __percpu *pptr;
	struct bucket *b;
	void *key, *val;
	bool is_percpu;
	u64 ret = 0;

	if (flags != 0)
		return -EINVAL;

	is_percpu = htab_is_percpu(htab);

	roundup_key_size = round_up(map->key_size, 8);
	/* disable migration so percpu value prepared here will be the
	 * same as the one seen by the bpf program with bpf_map_lookup_elem().
	 */
	if (is_percpu)
		migrate_disable();
	for (i = 0; i < htab->n_buckets; i++) {
		b = &htab->buckets[i];
		rcu_read_lock();
		head = &b->head;
		hlist_nulls_for_each_entry_rcu(elem, n, head, hash_node) {
			key = elem->key;
			if (is_percpu) {
				/* current cpu value for percpu map */
				pptr = htab_elem_get_ptr(elem, map->key_size);
				val = this_cpu_ptr(pptr);
			} else {
				val = elem->key + roundup_key_size;
			}
			num_elems++;
2021-09-28 16:09:46 -07:00
			ret = callback_fn((u64)(long)map, (u64)(long)key,
					  (u64)(long)val, (u64)(long)callback_ctx, 0);
2021-02-26 12:49:27 -08:00
			/* return value: 0 - continue, 1 - stop and return */
			if (ret) {
				rcu_read_unlock();
				goto out;
			}
		}
		rcu_read_unlock();
	}
out:
	if (is_percpu)
		migrate_enable();
	return num_elems;
}
bpf: hashtab memory usage
htab_map_mem_usage() is introduced to calculate hashmap memory usage. In
this helper, some small memory allocations are ignored, as their size is
quite small compared with the total size. The inner_map_meta in
hash_of_map is also ignored.
The result for hashtab is as follows,
- before this change
1: hash name count_map flags 0x1 <<<< no prealloc, fully set
key 16B value 24B max_entries 1048576 memlock 41943040B
2: hash name count_map flags 0x1 <<<< no prealloc, none set
key 16B value 24B max_entries 1048576 memlock 41943040B
3: hash name count_map flags 0x0 <<<< prealloc
key 16B value 24B max_entries 1048576 memlock 41943040B
The memlock is always a fixed size, whether or not the map is preallocated
and whatever the count of allocated elements is.
- after this change
1: hash name count_map flags 0x1 <<<< no prealloc, fully set
key 16B value 24B max_entries 1048576 memlock 117441536B
2: hash name count_map flags 0x1 <<<< no prealloc, none set
key 16B value 24B max_entries 1048576 memlock 16778240B
3: hash name count_map flags 0x0 <<<< prealloc
key 16B value 24B max_entries 1048576 memlock 109056000B
The memlock now reflects what the hashtab has actually allocated.
The result for percpu hash map is as follows,
- before this change
4: percpu_hash name count_map flags 0x0 <<<< prealloc
key 16B value 24B max_entries 1048576 memlock 822083584B
5: percpu_hash name count_map flags 0x1 <<<< no prealloc
key 16B value 24B max_entries 1048576 memlock 822083584B
- after this change
4: percpu_hash name count_map flags 0x0
key 16B value 24B max_entries 1048576 memlock 897582080B
5: percpu_hash name count_map flags 0x1
key 16B value 24B max_entries 1048576 memlock 922748736B
At worst, the difference can be 10x, for example,
- before this change
6: hash name count_map flags 0x0
key 4B value 4B max_entries 1048576 memlock 8388608B
- after this change
6: hash name count_map flags 0x0
key 4B value 4B max_entries 1048576 memlock 83889408B
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20230305124615.12358-4-laoar.shao@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-05 12:46:00 +00:00
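The "before" figures quoted above are consistent with a simple estimate of max_entries * round_up(key_size + value_size, 8), where a per-cpu value counts as round_up(value_size, 8) for each CPU. The sketch below checks that arithmetic against the quoted numbers. The formula is inferred from the figures, not stated in the commit message; the CPU count of 32 is likewise an inference from the per-cpu figure; and `old_memlock_estimate()` is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/* Round x up to the next multiple of 8. */
static uint64_t round_up8(uint64_t x)
{
	return (x + 7) & ~(uint64_t)7;
}

/*
 * Assumed pre-change estimate: a fixed per-entry size times max_entries,
 * independent of preallocation and of how many entries are in use.
 */
static uint64_t old_memlock_estimate(uint32_t key_size, uint32_t value_size,
				     uint64_t max_entries, int percpu,
				     unsigned int n_cpus)
{
	uint64_t value_total = value_size;

	assert(n_cpus >= 1);
	if (percpu)
		value_total = round_up8(value_size) * n_cpus;
	return max_entries * round_up8(key_size + value_total);
}
```

For example, `old_memlock_estimate(16, 24, 1048576, 0, 1)` gives 41943040 and `old_memlock_estimate(4, 4, 1048576, 0, 1)` gives 8388608, matching entries 1 to 3 and entry 6. Dividing the "after" figure 83889408 by 8388608 gives roughly 10, the worst-case ratio the message mentions.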
static u64 htab_map_mem_usage(const struct bpf_map *map)
{
	struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
	u32 value_size = round_up(htab->map.value_size, 8);
	bool prealloc = htab_is_prealloc(htab);
	bool percpu = htab_is_percpu(htab);
	bool lru = htab_is_lru(htab);
	u64 num_entries;
	u64 usage = sizeof(struct bpf_htab);

	usage += sizeof(struct bucket) * htab->n_buckets;
	usage += sizeof(int) * num_possible_cpus() * HASHTAB_MAP_LOCK_COUNT;
	if (prealloc) {
		num_entries = map->max_entries;
		if (htab_has_extra_elems(htab))
			num_entries += num_possible_cpus();

		usage += htab->elem_size * num_entries;

		if (percpu)
			usage += value_size * num_possible_cpus() * num_entries;
		else if (!lru)
			usage += sizeof(struct htab_elem *) * num_possible_cpus();
	} else {
#define LLIST_NODE_SZ sizeof(struct llist_node)

		num_entries = htab->use_percpu_counter ?
					  percpu_counter_sum(&htab->pcount) :
					  atomic_read(&htab->count);
		usage += (htab->elem_size + LLIST_NODE_SZ) * num_entries;
		if (percpu) {
			usage += (LLIST_NODE_SZ + sizeof(void *)) * num_entries;
			usage += value_size * num_possible_cpus() * num_entries;
		}
	}
	return usage;
}
2022-04-25 21:32:47 +08:00
BTF_ID_LIST_SINGLE(htab_map_btf_ids, struct, bpf_htab)
2017-04-11 15:34:58 +02:00
const struct bpf_map_ops htab_map_ops = {
2020-08-27 18:18:06 -07:00
	.map_meta_equal = bpf_map_meta_equal,
2018-01-11 20:29:05 -08:00
	.map_alloc_check = htab_map_alloc_check,
2014-11-13 17:36:45 -08:00
	.map_alloc = htab_map_alloc,
	.map_free = htab_map_free,
	.map_get_next_key = htab_map_get_next_key,
2021-07-14 17:54:10 -07:00
	.map_release_uref = htab_map_free_timers,
2014-11-13 17:36:45 -08:00
	.map_lookup_elem = htab_map_lookup_elem,
2021-05-11 23:00:04 +02:00
	.map_lookup_and_delete_elem = htab_map_lookup_and_delete_elem,
2014-11-13 17:36:45 -08:00
	.map_update_elem = htab_map_update_elem,
	.map_delete_elem = htab_map_delete_elem,
2017-03-15 18:26:43 -07:00
	.map_gen_lookup = htab_map_gen_lookup,
2018-08-09 08:55:20 -07:00
	.map_seq_show_elem = htab_map_seq_show_elem,
2021-02-26 12:49:27 -08:00
	.map_set_for_each_callback_args = map_set_for_each_callback_args,
	.map_for_each_callback = bpf_for_each_hash_elem,
2023-03-05 12:46:00 +00:00
	.map_mem_usage = htab_map_mem_usage,
2020-01-15 10:43:04 -08:00
	BATCH_OPS(htab),
2022-04-25 21:32:47 +08:00
	.map_btf_id = &htab_map_btf_ids[0],
2020-07-23 11:41:14 -07:00
	.iter_seq_info = &iter_seq_info,
2014-11-13 17:36:45 -08:00
};

2017-04-11 15:34:58 +02:00
const struct bpf_map_ops htab_lru_map_ops = {
2020-08-27 18:18:06 -07:00
	.map_meta_equal = bpf_map_meta_equal,
2018-01-11 20:29:05 -08:00
	.map_alloc_check = htab_map_alloc_check,
2016-11-11 10:55:09 -08:00
	.map_alloc = htab_map_alloc,
	.map_free = htab_map_free,
	.map_get_next_key = htab_map_get_next_key,
2021-07-14 17:54:10 -07:00
	.map_release_uref = htab_map_free_timers,
2016-11-11 10:55:09 -08:00
	.map_lookup_elem = htab_lru_map_lookup_elem,
2021-05-11 23:00:04 +02:00
	.map_lookup_and_delete_elem = htab_lru_map_lookup_and_delete_elem,
2019-05-14 01:18:56 +02:00
	.map_lookup_elem_sys_only = htab_lru_map_lookup_elem_sys,
2016-11-11 10:55:09 -08:00
	.map_update_elem = htab_lru_map_update_elem,
	.map_delete_elem = htab_lru_map_delete_elem,
2017-08-31 23:27:12 -07:00
	.map_gen_lookup = htab_lru_map_gen_lookup,
2018-08-09 08:55:20 -07:00
	.map_seq_show_elem = htab_map_seq_show_elem,
2021-02-26 12:49:27 -08:00
	.map_set_for_each_callback_args = map_set_for_each_callback_args,
	.map_for_each_callback = bpf_for_each_hash_elem,
bpf: hashtab memory usage
htab_map_mem_usage() is introduced to calculate hashmap memory usage. In
this helper, some small memory allocations are ignore, as their size is
quite small compared with the total size. The inner_map_meta in
hash_of_map is also ignored.
The result for hashtab as follows,
- before this change
1: hash name count_map flags 0x1 <<<< no prealloc, fully set
key 16B value 24B max_entries 1048576 memlock 41943040B
2: hash name count_map flags 0x1 <<<< no prealloc, none set
key 16B value 24B max_entries 1048576 memlock 41943040B
3: hash name count_map flags 0x0 <<<< prealloc
key 16B value 24B max_entries 1048576 memlock 41943040B
The memlock is always a fixed size, whether or not the map is preallocated
and whatever the count of allocated elements is.
- after this change
1: hash name count_map flags 0x1 <<<< no prealloc, fully set
key 16B value 24B max_entries 1048576 memlock 117441536B
2: hash name count_map flags 0x1 <<<< no prealloc, none set
key 16B value 24B max_entries 1048576 memlock 16778240B
3: hash name count_map flags 0x0 <<<< prealloc
key 16B value 24B max_entries 1048576 memlock 109056000B
The memlock now reflects what the hashtab actually allocated.
The results for percpu hash map are as follows:
- before this change
4: percpu_hash name count_map flags 0x0 <<<< prealloc
key 16B value 24B max_entries 1048576 memlock 822083584B
5: percpu_hash name count_map flags 0x1 <<<< no prealloc
key 16B value 24B max_entries 1048576 memlock 822083584B
- after this change
4: percpu_hash name count_map flags 0x0
key 16B value 24B max_entries 1048576 memlock 897582080B
5: percpu_hash name count_map flags 0x1
key 16B value 24B max_entries 1048576 memlock 922748736B
At worst, the difference can be 10x, for example,
- before this change
6: hash name count_map flags 0x0
key 4B value 4B max_entries 1048576 memlock 8388608B
- after this change
6: hash name count_map flags 0x0
key 4B value 4B max_entries 1048576 memlock 83889408B
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20230305124615.12358-4-laoar.shao@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-05 12:46:00 +00:00
|
|
|
.map_mem_usage = htab_map_mem_usage,
|
2020-01-15 10:43:04 -08:00
|
|
|
BATCH_OPS(htab_lru),
|
2022-04-25 21:32:47 +08:00
|
|
|
.map_btf_id = &htab_map_btf_ids[0],
|
bpf: Implement bpf iterator for hash maps
The bpf iterators for hash, percpu hash, lru hash
and lru percpu hash are implemented. During link time,
bpf_iter_reg->check_target() will check map type
and ensure the program access key/value region is
within the map defined key/value size limit.
For percpu hash and lru hash maps, the bpf program
will receive values for all cpus. The map element
bpf iterator infrastructure will prepare value
properly before passing the value pointer to the
bpf program.
This patch set supports readonly map keys and
read/write map values. It does not support deleting
map elements, e.g., from hash tables. If there is
a use case for this, the following mechanism can
be used to support map deletion for hashtab, etc.
- permit a new bpf program return value, e.g., 2,
to let bpf iterator know the map element should
be removed.
- since bucket lock is taken, the map element will be
queued.
- once bucket lock is released after all elements under
this bucket are traversed, all to-be-deleted map
elements can be deleted.
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200723184114.590470-1-yhs@fb.com
2020-07-23 11:41:14 -07:00
|
|
|
.iter_seq_info = &iter_seq_info,
|
2016-11-11 10:55:09 -08:00
|
|
|
};
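The memlock figures quoted in the "bpf: hashtab memory usage" commit message can be checked with plain arithmetic. The sketch below is only an observation about those numbers, not the kernel's accounting code: the "before" values happen to equal max_entries * (key_size + value_size), and the key 4B / value 4B case grows by roughly 10x. The helper names demo_flat_estimate and demo_growth_factor are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: the flat estimate that matches the "before this
 * change" figures in the commit message (an observation, not kernel code).
 */
static uint64_t demo_flat_estimate(uint64_t max_entries, uint32_t key_size,
				   uint32_t value_size)
{
	return max_entries * ((uint64_t)key_size + value_size);
}

/* Hypothetical helper: integer growth factor after/before, rounded down. */
static uint64_t demo_growth_factor(uint64_t before, uint64_t after)
{
	assert(before != 0);
	return after / before;
}
```

For example, demo_flat_estimate(1048576, 4, 4) reproduces the 8388608B "before" figure, and dividing the 83889408B "after" figure by it gives the quoted factor of about 10.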
|
|
|
|
|
2016-02-01 22:39:53 -08:00
|
|
|
/* Called from eBPF program */
|
|
|
|
static void *htab_percpu_map_lookup_elem(struct bpf_map *map, void *key)
|
|
|
|
{
|
|
|
|
struct htab_elem *l = __htab_map_lookup_elem(map, key);
|
|
|
|
|
|
|
|
if (l)
|
|
|
|
return this_cpu_ptr(htab_elem_get_ptr(l, map->key_size));
|
|
|
|
else
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
2022-05-11 17:38:53 +08:00
|
|
|
static void *htab_percpu_map_lookup_percpu_elem(struct bpf_map *map, void *key, u32 cpu)
|
|
|
|
{
|
|
|
|
struct htab_elem *l;
|
|
|
|
|
|
|
|
if (cpu >= nr_cpu_ids)
|
|
|
|
return NULL;
|
|
|
|
|
|
|
|
l = __htab_map_lookup_elem(map, key);
|
|
|
|
if (l)
|
|
|
|
return per_cpu_ptr(htab_elem_get_ptr(l, map->key_size), cpu);
|
|
|
|
else
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
2016-11-11 10:55:10 -08:00
|
|
|
static void *htab_lru_percpu_map_lookup_elem(struct bpf_map *map, void *key)
|
|
|
|
{
|
|
|
|
struct htab_elem *l = __htab_map_lookup_elem(map, key);
|
|
|
|
|
|
|
|
if (l) {
|
|
|
|
bpf_lru_node_set_ref(&l->lru_node);
|
|
|
|
return this_cpu_ptr(htab_elem_get_ptr(l, map->key_size));
|
|
|
|
}
|
|
|
|
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
2022-05-11 17:38:53 +08:00
|
|
|
static void *htab_lru_percpu_map_lookup_percpu_elem(struct bpf_map *map, void *key, u32 cpu)
|
|
|
|
{
|
|
|
|
struct htab_elem *l;
|
|
|
|
|
|
|
|
if (cpu >= nr_cpu_ids)
|
|
|
|
return NULL;
|
|
|
|
|
|
|
|
l = __htab_map_lookup_elem(map, key);
|
|
|
|
if (l) {
|
|
|
|
bpf_lru_node_set_ref(&l->lru_node);
|
|
|
|
return per_cpu_ptr(htab_elem_get_ptr(l, map->key_size), cpu);
|
|
|
|
}
|
|
|
|
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
bpf: add lookup/update support for per-cpu hash and array maps
The functions bpf_map_lookup_elem(map, key, value) and
bpf_map_update_elem(map, key, value, flags) need to get/set
values from all-cpus for per-cpu hash and array maps,
so that user space can aggregate/update them as necessary.
Example of single counter aggregation in user space:
unsigned int nr_cpus = sysconf(_SC_NPROCESSORS_CONF);
long values[nr_cpus];
long value = 0;
bpf_lookup_elem(fd, key, values);
for (i = 0; i < nr_cpus; i++)
value += values[i];
User space must provide an array of round_up(value_size, 8) * nr_cpus bytes
to get/set values, since the kernel uses 'long'-sized copies
of per-cpu values to try to copy good counters atomically.
This is best-effort, since bpf programs and user space are racing
to access the same memory.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-01 22:39:55 -08:00
|
|
|
int bpf_percpu_hash_copy(struct bpf_map *map, void *key, void *value)
|
|
|
|
{
|
|
|
|
struct htab_elem *l;
|
|
|
|
void __percpu *pptr;
|
|
|
|
int ret = -ENOENT;
|
|
|
|
int cpu, off = 0;
|
|
|
|
u32 size;
|
|
|
|
|
|
|
|
/* per_cpu areas are zero-filled and bpf programs can only
|
|
|
|
* access 'value_size' of them, so copying rounded areas
|
|
|
|
* will not leak any kernel data
|
|
|
|
*/
|
|
|
|
size = round_up(map->value_size, 8);
|
|
|
|
rcu_read_lock();
|
|
|
|
l = __htab_map_lookup_elem(map, key);
|
|
|
|
if (!l)
|
|
|
|
goto out;
|
2019-05-14 01:18:56 +02:00
|
|
|
/* We do not mark LRU map element here in order to not mess up
|
|
|
|
* eviction heuristics when user space does a map walk.
|
|
|
|
*/
|
2016-02-01 22:39:55 -08:00
|
|
|
pptr = htab_elem_get_ptr(l, map->key_size);
|
|
|
|
for_each_possible_cpu(cpu) {
|
2023-02-25 16:40:08 +01:00
|
|
|
copy_map_value_long(map, value + off, per_cpu_ptr(pptr, cpu));
|
|
|
|
check_and_init_map_value(map, value + off);
|
2016-02-01 22:39:55 -08:00
|
|
|
off += size;
|
|
|
|
}
|
|
|
|
ret = 0;
|
|
|
|
out:
|
|
|
|
rcu_read_unlock();
|
|
|
|
return ret;
|
|
|
|
}
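The copy loop in bpf_percpu_hash_copy() places one value per possible CPU into the caller's buffer at a stride of round_up(value_size, 8). A minimal user-space sketch of that layout, and of the aggregation step from the commit message, might look as follows. Plain arrays stand in for per-CPU memory, and round_up8, pack_percpu_values and sum_u32_counters are hypothetical names, not kernel or libbpf API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: round n up to the next multiple of 8. */
static size_t round_up8(size_t n)
{
	return (n + 7) & ~(size_t)7;
}

/*
 * Hypothetical stand-in for the copy loop: 'percpu' holds nr_cpus values of
 * value_size bytes each, stored back to back. 'out' must have room for
 * round_up8(value_size) * nr_cpus bytes. Padding bytes are zeroed.
 */
static void pack_percpu_values(void *out, const void *percpu,
			       size_t value_size, unsigned int nr_cpus)
{
	size_t stride = round_up8(value_size);
	size_t off = 0;
	unsigned int cpu;

	assert(out && percpu);
	memset(out, 0, stride * nr_cpus);
	for (cpu = 0; cpu < nr_cpus; cpu++) {
		memcpy((char *)out + off,
		       (const char *)percpu + (size_t)cpu * value_size,
		       value_size);
		off += stride;
	}
}

/* Sum one 32-bit counter per CPU out of a buffer packed as above. */
static uint64_t sum_u32_counters(const void *packed, unsigned int nr_cpus)
{
	size_t stride = round_up8(sizeof(uint32_t));
	uint64_t total = 0;
	unsigned int cpu;

	for (cpu = 0; cpu < nr_cpus; cpu++) {
		uint32_t v;

		memcpy(&v, (const char *)packed + (size_t)cpu * stride, sizeof(v));
		total += v;
	}
	return total;
}
```

The point of the sketch is the stride: with a 4-byte value the reader must step 8 bytes per CPU, not 4, or it will read padding as counters.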
|
|
|
|
|
|
|
|
int bpf_percpu_hash_update(struct bpf_map *map, void *key, void *value,
|
|
|
|
u64 map_flags)
|
|
|
|
{
|
2016-11-11 10:55:10 -08:00
|
|
|
struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
|
2016-02-19 13:53:10 -05:00
|
|
|
int ret;
|
|
|
|
|
|
|
|
rcu_read_lock();
|
2016-11-11 10:55:10 -08:00
|
|
|
if (htab_is_lru(htab))
|
|
|
|
ret = __htab_lru_percpu_map_update_elem(map, key, value,
|
|
|
|
map_flags, true);
|
|
|
|
else
|
|
|
|
ret = __htab_percpu_map_update_elem(map, key, value, map_flags,
|
|
|
|
true);
|
2016-02-19 13:53:10 -05:00
|
|
|
rcu_read_unlock();
|
|
|
|
|
|
|
|
return ret;
|
2016-02-01 22:39:55 -08:00
|
|
|
}
|
|
|
|
|
bpf: add bpffs pretty print for percpu arraymap/hash/lru_hash
Added bpffs pretty print for percpu arraymap, percpu hashmap
and percpu lru hashmap.
For each map <key, value> pair, the format is:
<key_value>: {
cpu0: <value_on_cpu0>
cpu1: <value_on_cpu1>
...
cpun: <value_on_cpun>
}
For example, on my VM, there are 4 cpus, and
for test_btf test in the next patch:
cat /sys/fs/bpf/pprint_test_percpu_hash
You may get:
...
43602: {
cpu0: {43602,0,-43602,0x3,0xaa52,0x3,{43602|[82,170,0,0,0,0,0,0]},ENUM_TWO}
cpu1: {43602,0,-43602,0x3,0xaa52,0x3,{43602|[82,170,0,0,0,0,0,0]},ENUM_TWO}
cpu2: {43602,0,-43602,0x3,0xaa52,0x3,{43602|[82,170,0,0,0,0,0,0]},ENUM_TWO}
cpu3: {43602,0,-43602,0x3,0xaa52,0x3,{43602|[82,170,0,0,0,0,0,0]},ENUM_TWO}
}
72847: {
cpu0: {72847,0,-72847,0x3,0x11c8f,0x3,{72847|[143,28,1,0,0,0,0,0]},ENUM_THREE}
cpu1: {72847,0,-72847,0x3,0x11c8f,0x3,{72847|[143,28,1,0,0,0,0,0]},ENUM_THREE}
cpu2: {72847,0,-72847,0x3,0x11c8f,0x3,{72847|[143,28,1,0,0,0,0,0]},ENUM_THREE}
cpu3: {72847,0,-72847,0x3,0x11c8f,0x3,{72847|[143,28,1,0,0,0,0,0]},ENUM_THREE}
}
...
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-08-29 14:43:13 -07:00
|
|
|
static void htab_percpu_map_seq_show_elem(struct bpf_map *map, void *key,
|
|
|
|
struct seq_file *m)
|
|
|
|
{
|
|
|
|
struct htab_elem *l;
|
|
|
|
void __percpu *pptr;
|
|
|
|
int cpu;
|
|
|
|
|
|
|
|
rcu_read_lock();
|
|
|
|
|
|
|
|
l = __htab_map_lookup_elem(map, key);
|
|
|
|
if (!l) {
|
|
|
|
rcu_read_unlock();
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
btf_type_seq_show(map->btf, map->btf_key_type_id, key, m);
|
|
|
|
seq_puts(m, ": {\n");
|
|
|
|
pptr = htab_elem_get_ptr(l, map->key_size);
|
|
|
|
for_each_possible_cpu(cpu) {
|
|
|
|
seq_printf(m, "\tcpu%d: ", cpu);
|
|
|
|
btf_type_seq_show(map->btf, map->btf_value_type_id,
|
|
|
|
per_cpu_ptr(pptr, cpu), m);
|
|
|
|
seq_puts(m, "\n");
|
|
|
|
}
|
|
|
|
seq_puts(m, "}\n");
|
|
|
|
|
|
|
|
rcu_read_unlock();
|
|
|
|
}
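The output shape produced by htab_percpu_map_seq_show_elem() ("<key>: {", one tab-indented "cpuN: <value>" line per CPU, then "}") can be sketched in user space with snprintf(). This is an illustration under simplifying assumptions: the kernel renders keys and values through BTF, while the hypothetical format_percpu_elem below only handles an unsigned key and long values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical formatter: writes one element in the pretty-print layout into
 * buf. Returns the number of characters written (excluding the NUL), or -1
 * if the buffer is too small.
 */
static int format_percpu_elem(char *buf, size_t len, unsigned int key,
			      const long *vals, unsigned int nr_cpus)
{
	size_t used;
	unsigned int cpu;
	int n;

	assert(buf && vals);

	n = snprintf(buf, len, "%u: {\n", key);
	if (n < 0 || (size_t)n >= len)
		return -1;
	used = (size_t)n;

	for (cpu = 0; cpu < nr_cpus; cpu++) {
		n = snprintf(buf + used, len - used, "\tcpu%u: %ld\n",
			     cpu, vals[cpu]);
		if (n < 0 || (size_t)n >= len - used)
			return -1;
		used += (size_t)n;
	}

	n = snprintf(buf + used, len - used, "}\n");
	if (n < 0 || (size_t)n >= len - used)
		return -1;
	used += (size_t)n;

	return (int)used;
}
```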
|
|
|
|
|
2017-04-11 15:34:58 +02:00
|
|
|
const struct bpf_map_ops htab_percpu_map_ops = {
|
2020-08-27 18:18:06 -07:00
|
|
|
.map_meta_equal = bpf_map_meta_equal,
|
2018-01-11 20:29:05 -08:00
|
|
|
.map_alloc_check = htab_map_alloc_check,
|
2016-02-01 22:39:53 -08:00
|
|
|
.map_alloc = htab_map_alloc,
|
|
|
|
.map_free = htab_map_free,
|
|
|
|
.map_get_next_key = htab_map_get_next_key,
|
|
|
|
.map_lookup_elem = htab_percpu_map_lookup_elem,
|
2021-05-11 23:00:04 +02:00
|
|
|
.map_lookup_and_delete_elem = htab_percpu_map_lookup_and_delete_elem,
|
2016-02-01 22:39:53 -08:00
|
|
|
.map_update_elem = htab_percpu_map_update_elem,
|
|
|
|
.map_delete_elem = htab_map_delete_elem,
|
2022-05-11 17:38:53 +08:00
|
|
|
.map_lookup_percpu_elem = htab_percpu_map_lookup_percpu_elem,
|
2018-08-29 14:43:13 -07:00
|
|
|
.map_seq_show_elem = htab_percpu_map_seq_show_elem,
|
2021-02-26 12:49:27 -08:00
|
|
|
.map_set_for_each_callback_args = map_set_for_each_callback_args,
|
|
|
|
.map_for_each_callback = bpf_for_each_hash_elem,
|
2023-03-05 12:46:00 +00:00
|
|
|
.map_mem_usage = htab_map_mem_usage,
|
2020-01-15 10:43:04 -08:00
|
|
|
BATCH_OPS(htab_percpu),
|
2022-04-25 21:32:47 +08:00
|
|
|
.map_btf_id = &htab_map_btf_ids[0],
|
2020-07-23 11:41:14 -07:00
|
|
|
.iter_seq_info = &iter_seq_info,
|
2016-02-01 22:39:53 -08:00
|
|
|
};
|
|
|
|
|
2017-04-11 15:34:58 +02:00
|
|
|
const struct bpf_map_ops htab_lru_percpu_map_ops = {
|
2020-08-27 18:18:06 -07:00
|
|
|
.map_meta_equal = bpf_map_meta_equal,
|
2018-01-11 20:29:05 -08:00
|
|
|
.map_alloc_check = htab_map_alloc_check,
|
2016-11-11 10:55:10 -08:00
|
|
|
.map_alloc = htab_map_alloc,
|
|
|
|
.map_free = htab_map_free,
|
|
|
|
.map_get_next_key = htab_map_get_next_key,
|
|
|
|
.map_lookup_elem = htab_lru_percpu_map_lookup_elem,
|
2021-05-11 23:00:04 +02:00
|
|
|
.map_lookup_and_delete_elem = htab_lru_percpu_map_lookup_and_delete_elem,
|
2016-11-11 10:55:10 -08:00
|
|
|
.map_update_elem = htab_lru_percpu_map_update_elem,
|
|
|
|
.map_delete_elem = htab_lru_map_delete_elem,
|
2022-05-11 17:38:53 +08:00
|
|
|
.map_lookup_percpu_elem = htab_lru_percpu_map_lookup_percpu_elem,
|
2018-08-29 14:43:13 -07:00
|
|
|
.map_seq_show_elem = htab_percpu_map_seq_show_elem,
|
2021-02-26 12:49:27 -08:00
|
|
|
.map_set_for_each_callback_args = map_set_for_each_callback_args,
|
|
|
|
.map_for_each_callback = bpf_for_each_hash_elem,
|
2023-03-05 12:46:00 +00:00
|
|
|
.map_mem_usage = htab_map_mem_usage,
|
2020-01-15 10:43:04 -08:00
|
|
|
BATCH_OPS(htab_lru_percpu),
|
2022-04-25 21:32:47 +08:00
|
|
|
.map_btf_id = &htab_map_btf_ids[0],
|
2020-07-23 11:41:14 -07:00
|
|
|
.iter_seq_info = &iter_seq_info,
|
2016-11-11 10:55:10 -08:00
|
|
|
};
|
|
|
|
|
2018-01-11 20:29:05 -08:00
|
|
|
static int fd_htab_map_alloc_check(union bpf_attr *attr)
|
2017-03-22 10:00:34 -07:00
|
|
|
{
|
|
|
|
if (attr->value_size != sizeof(u32))
|
2018-01-11 20:29:05 -08:00
|
|
|
return -EINVAL;
|
|
|
|
return htab_map_alloc_check(attr);
|
2017-03-22 10:00:34 -07:00
|
|
|
}
|
|
|
|
|
|
|
|
static void fd_htab_map_free(struct bpf_map *map)
|
|
|
|
{
|
|
|
|
struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
|
|
|
|
struct hlist_nulls_node *n;
|
|
|
|
struct hlist_nulls_head *head;
|
|
|
|
struct htab_elem *l;
|
|
|
|
int i;
|
|
|
|
|
|
|
|
for (i = 0; i < htab->n_buckets; i++) {
|
|
|
|
head = select_bucket(htab, i);
|
|
|
|
|
|
|
|
hlist_nulls_for_each_entry_safe(l, n, head, hash_node) {
|
|
|
|
void *ptr = fd_htab_map_get_ptr(map, l);
|
|
|
|
|
|
|
|
map->ops->map_fd_put_ptr(ptr);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
htab_map_free(map);
|
|
|
|
}
|
|
|
|
|
2017-06-27 23:08:34 -07:00
|
|
|
/* only called from syscall */
|
|
|
|
int bpf_fd_htab_map_lookup_elem(struct bpf_map *map, void *key, u32 *value)
|
|
|
|
{
|
|
|
|
void **ptr;
|
|
|
|
int ret = 0;
|
|
|
|
|
|
|
|
if (!map->ops->map_fd_sys_lookup_elem)
|
|
|
|
return -ENOTSUPP;
|
|
|
|
|
|
|
|
rcu_read_lock();
|
|
|
|
ptr = htab_map_lookup_elem(map, key);
|
|
|
|
if (ptr)
|
|
|
|
*value = map->ops->map_fd_sys_lookup_elem(READ_ONCE(*ptr));
|
|
|
|
else
|
|
|
|
ret = -ENOENT;
|
|
|
|
rcu_read_unlock();
|
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2017-03-22 10:00:34 -07:00
|
|
|
/* only called from syscall */
|
|
|
|
int bpf_fd_htab_map_update_elem(struct bpf_map *map, struct file *map_file,
|
|
|
|
void *key, void *value, u64 map_flags)
|
|
|
|
{
|
|
|
|
void *ptr;
|
|
|
|
int ret;
|
|
|
|
u32 ufd = *(u32 *)value;
|
|
|
|
|
|
|
|
ptr = map->ops->map_fd_get_ptr(map, map_file, ufd);
|
|
|
|
if (IS_ERR(ptr))
|
|
|
|
return PTR_ERR(ptr);
|
|
|
|
|
|
|
|
ret = htab_map_update_elem(map, key, &ptr, map_flags);
|
|
|
|
if (ret)
|
|
|
|
map->ops->map_fd_put_ptr(ptr);
|
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
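bpf_fd_htab_map_update_elem() follows a common ownership pattern: take a reference on the object behind the fd, try to store the pointer, and drop the reference again if the store fails. The sketch below shows the same pattern in isolation. Everything in it is hypothetical (demo_obj, demo_table, demo_get, demo_put, demo_insert, demo_update); a two-slot array stands in for the hash table and a plain integer for the reference count.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical refcounted object standing in for an inner map. */
struct demo_obj {
	int refcnt;
};

/* Hypothetical fixed-size table standing in for the hash map. */
struct demo_table {
	struct demo_obj *slots[2];
	size_t used;
};

static struct demo_obj *demo_get(struct demo_obj *obj)
{
	obj->refcnt++;
	return obj;
}

static void demo_put(struct demo_obj *obj)
{
	assert(obj->refcnt > 0);
	obj->refcnt--;
}

/* Store the pointer, or fail with -E2BIG when the table is full. */
static int demo_insert(struct demo_table *t, struct demo_obj *obj)
{
	if (t->used == sizeof(t->slots) / sizeof(t->slots[0]))
		return -E2BIG;
	t->slots[t->used++] = obj;
	return 0;
}

/* Take a reference, try to store it, and drop the reference on failure. */
static int demo_update(struct demo_table *t, struct demo_obj *obj)
{
	struct demo_obj *ptr = demo_get(obj);
	int ret = demo_insert(t, ptr);

	if (ret)
		demo_put(ptr);
	return ret;
}
```

On success the table owns the reference; on failure the count returns to its previous value, so the caller never leaks a reference.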
|
|
|
|
|
|
|
|
static struct bpf_map *htab_of_map_alloc(union bpf_attr *attr)
|
|
|
|
{
|
|
|
|
struct bpf_map *map, *inner_map_meta;
|
|
|
|
|
|
|
|
inner_map_meta = bpf_map_meta_alloc(attr->inner_map_fd);
|
|
|
|
if (IS_ERR(inner_map_meta))
|
|
|
|
return inner_map_meta;
|
|
|
|
|
2018-01-11 20:29:05 -08:00
|
|
|
map = htab_map_alloc(attr);
|
2017-03-22 10:00:34 -07:00
|
|
|
if (IS_ERR(map)) {
|
|
|
|
bpf_map_meta_free(inner_map_meta);
|
|
|
|
return map;
|
|
|
|
}
|
|
|
|
|
|
|
|
map->inner_map_meta = inner_map_meta;
|
|
|
|
|
|
|
|
return map;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void *htab_of_map_lookup_elem(struct bpf_map *map, void *key)
|
|
|
|
{
|
|
|
|
struct bpf_map **inner_map = htab_map_lookup_elem(map, key);
|
|
|
|
|
|
|
|
if (!inner_map)
|
|
|
|
return NULL;
|
|
|
|
|
|
|
|
return READ_ONCE(*inner_map);
|
|
|
|
}
|
|
|
|
|
2020-10-11 01:40:03 +02:00
|
|
|
static int htab_of_map_gen_lookup(struct bpf_map *map,
|
2017-08-19 03:12:46 +02:00
|
|
|
struct bpf_insn *insn_buf)
|
|
|
|
{
|
|
|
|
struct bpf_insn *insn = insn_buf;
|
|
|
|
const int ret = BPF_REG_0;
|
|
|
|
|
2018-06-02 23:06:35 +02:00
|
|
|
BUILD_BUG_ON(!__same_type(&__htab_map_lookup_elem,
|
|
|
|
(void *(*)(struct bpf_map *map, void *key))NULL));
|
2021-09-28 16:09:45 -07:00
|
|
|
*insn++ = BPF_EMIT_CALL(__htab_map_lookup_elem);
|
2017-08-19 03:12:46 +02:00
|
|
|
*insn++ = BPF_JMP_IMM(BPF_JEQ, ret, 0, 2);
|
|
|
|
*insn++ = BPF_ALU64_IMM(BPF_ADD, ret,
|
|
|
|
offsetof(struct htab_elem, key) +
|
|
|
|
round_up(map->key_size, 8));
|
|
|
|
*insn++ = BPF_LDX_MEM(BPF_DW, ret, ret, 0);
|
|
|
|
|
|
|
|
return insn - insn_buf;
|
|
|
|
}
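The instruction sequence emitted by htab_of_map_gen_lookup() amounts to: look up the element, return NULL if it is missing, otherwise skip the header and the 8-byte-rounded key and load the pointer stored there. A user-space sketch of that address computation is shown below. The struct demo_elem layout and the helpers demo_value_offset and demo_load_inner are hypothetical and far simpler than the real struct htab_elem.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical element layout: header, then key, then 8-byte-aligned value. */
struct demo_elem {
	uint64_t hdr;		/* stand-in for hash-list bookkeeping */
	unsigned char key[];	/* key bytes, then padding, then the value */
};

/* Hypothetical helper: round n up to the next multiple of 8. */
static size_t demo_round_up8(size_t n)
{
	return (n + 7) & ~(size_t)7;
}

/* Offset of the value: start of key plus the key size rounded up to 8. */
static size_t demo_value_offset(size_t key_size)
{
	return offsetof(struct demo_elem, key) + demo_round_up8(key_size);
}

/*
 * Load the pointer stored as the element's value. A NULL element mirrors
 * the emitted "jump over the add and load" branch and yields NULL.
 */
static void *demo_load_inner(const struct demo_elem *e, size_t key_size)
{
	void *inner;

	if (!e)
		return NULL;
	memcpy(&inner, (const unsigned char *)e + demo_value_offset(key_size),
	       sizeof(inner));
	return inner;
}
```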
|
|
|
|
|
2017-03-22 10:00:34 -07:00
|
|
|
static void htab_of_map_free(struct bpf_map *map)
|
|
|
|
{
|
|
|
|
bpf_map_meta_free(map->inner_map_meta);
|
|
|
|
fd_htab_map_free(map);
|
|
|
|
}
|
|
|
|
|
2017-04-11 15:34:58 +02:00
|
|
|
const struct bpf_map_ops htab_of_maps_map_ops = {
|
2018-01-11 20:29:05 -08:00
|
|
|
.map_alloc_check = fd_htab_map_alloc_check,
|
2017-03-22 10:00:34 -07:00
|
|
|
.map_alloc = htab_of_map_alloc,
|
|
|
|
.map_free = htab_of_map_free,
|
|
|
|
.map_get_next_key = htab_map_get_next_key,
|
|
|
|
.map_lookup_elem = htab_of_map_lookup_elem,
|
|
|
|
.map_delete_elem = htab_map_delete_elem,
|
|
|
|
.map_fd_get_ptr = bpf_map_fd_get_ptr,
|
|
|
|
.map_fd_put_ptr = bpf_map_fd_put_ptr,
|
2017-06-27 23:08:34 -07:00
|
|
|
.map_fd_sys_lookup_elem = bpf_map_fd_sys_lookup_elem,
|
2017-08-19 03:12:46 +02:00
|
|
|
.map_gen_lookup = htab_of_map_gen_lookup,
|
2018-08-12 01:59:17 +02:00
|
|
|
.map_check_btf = map_check_no_btf,
|
2023-03-05 12:46:00 +00:00
|
|
|
.map_mem_usage = htab_map_mem_usage,
|
2022-05-10 01:22:20 -07:00
|
|
|
BATCH_OPS(htab),
|
2022-04-25 21:32:47 +08:00
|
|
|
.map_btf_id = &htab_map_btf_ids[0],
|
2017-03-22 10:00:34 -07:00
|
|
|
};
|