License cleanup: add SPDX GPL-2.0 license identifier to files with no license
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.
By default all files without license information are under the default
license of the kernel, which is GPL version 2.
Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.
This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.
How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information it it.
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information,
Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.
The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.
The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.
Criteria used to select files for SPDX license identifier tagging was:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).
All documentation files were explicitly excluded.
The following heuristics were used to determine which SPDX license
identifiers to apply.
- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.
For non */uapi/* files that summary was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 11139
and resulted in the first patch in this series.
If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note 930
and resulted in the second patch in this series.
- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:
SPDX license identifier # files
---------------------------------------------------|------
GPL-2.0 WITH Linux-syscall-note 270
GPL-2.0+ WITH Linux-syscall-note 169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17
LGPL-2.1+ WITH Linux-syscall-note 15
GPL-1.0+ WITH Linux-syscall-note 14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5
LGPL-2.0+ WITH Linux-syscall-note 4
LGPL-2.1 WITH Linux-syscall-note 3
((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3
((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1
and that resulted in the third patch in this series.
- when the two scanners agreed on the detected license(s), that became
the concluded license(s).
- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.
- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).
- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.
- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.
In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.
Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there was new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.
Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.
In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.
Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct
This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.
These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-01 15:07:57 +01:00
|
|
|
// SPDX-License-Identifier: GPL-2.0
|
2012-07-06 15:25:10 -05:00
|
|
|
/*
|
|
|
|
* Slab allocator functions that are independent of the allocator strategy
|
|
|
|
*
|
|
|
|
* (C) 2012 Christoph Lameter <cl@linux.com>
|
|
|
|
*/
|
|
|
|
#include <linux/slab.h>
|
|
|
|
|
|
|
|
#include <linux/mm.h>
|
|
|
|
#include <linux/poison.h>
|
|
|
|
#include <linux/interrupt.h>
|
|
|
|
#include <linux/memory.h>
|
2018-04-05 16:20:11 -07:00
|
|
|
#include <linux/cache.h>
|
2012-07-06 15:25:10 -05:00
|
|
|
#include <linux/compiler.h>
|
2021-02-25 17:19:11 -08:00
|
|
|
#include <linux/kfence.h>
|
2012-07-06 15:25:10 -05:00
|
|
|
#include <linux/module.h>
|
2012-07-06 15:25:13 -05:00
|
|
|
#include <linux/cpu.h>
|
|
|
|
#include <linux/uaccess.h>
|
2012-10-19 18:20:25 +04:00
|
|
|
#include <linux/seq_file.h>
|
2023-06-12 16:31:48 +01:00
|
|
|
#include <linux/dma-mapping.h>
|
2023-06-12 16:32:00 +01:00
|
|
|
#include <linux/swiotlb.h>
|
2012-10-19 18:20:25 +04:00
|
|
|
#include <linux/proc_fs.h>
|
2019-07-11 20:56:38 -07:00
|
|
|
#include <linux/debugfs.h>
|
2023-10-03 11:57:45 +02:00
|
|
|
#include <linux/kmemleak.h>
|
2020-12-22 12:03:31 -08:00
|
|
|
#include <linux/kasan.h>
|
2012-07-06 15:25:10 -05:00
|
|
|
#include <asm/cacheflush.h>
|
|
|
|
#include <asm/tlbflush.h>
|
|
|
|
#include <asm/page.h>
|
2012-12-18 14:22:34 -08:00
|
|
|
#include <linux/memcontrol.h>
|
2021-07-07 18:07:47 -07:00
|
|
|
#include <linux/stackdepot.h>
|
2014-08-06 16:04:44 -07:00
|
|
|
|
2020-08-06 23:18:28 -07:00
|
|
|
#include "internal.h"
|
2012-07-06 15:25:11 -05:00
|
|
|
#include "slab.h"
|
|
|
|
|
2022-06-03 06:21:49 +03:00
|
|
|
#define CREATE_TRACE_POINTS
|
|
|
|
#include <trace/events/kmem.h>
|
|
|
|
|
2012-07-06 15:25:11 -05:00
|
|
|
enum slab_state slab_state;
|
2012-07-06 15:25:12 -05:00
|
|
|
LIST_HEAD(slab_caches);
|
|
|
|
DEFINE_MUTEX(slab_mutex);
|
2012-09-05 00:20:33 +00:00
|
|
|
struct kmem_cache *kmem_cache;
|
2012-07-06 15:25:11 -05:00
|
|
|
|
2014-10-09 15:26:22 -07:00
|
|
|
/*
|
|
|
|
* Set of flags that will prevent slab merging
|
|
|
|
*/
|
|
|
|
#define SLAB_NEVER_MERGE (SLAB_RED_ZONE | SLAB_POISON | SLAB_STORE_USER | \
|
2017-01-18 02:53:44 -08:00
|
|
|
SLAB_TRACE | SLAB_TYPESAFE_BY_RCU | SLAB_NOLEAKTRACE | \
|
2024-02-23 19:27:19 +01:00
|
|
|
SLAB_FAILSLAB | SLAB_NO_MERGE)
|
2014-10-09 15:26:22 -07:00
|
|
|
|
2016-01-14 15:18:15 -08:00
|
|
|
#define SLAB_MERGE_SAME (SLAB_RECLAIM_ACCOUNT | SLAB_CACHE_DMA | \
|
mm: add support for kmem caches in DMA32 zone
Patch series "iommu/io-pgtable-arm-v7s: Use DMA32 zone for page tables",
v6.
This is a followup to the discussion in [1], [2].
IOMMUs using ARMv7 short-descriptor format require page tables (level 1
and 2) to be allocated within the first 4GB of RAM, even on 64-bit
systems.
For L1 tables that are bigger than a page, we can just use
__get_free_pages with GFP_DMA32 (on arm64 systems only, arm would still
use GFP_DMA).
For L2 tables that only take 1KB, it would be a waste to allocate a full
page, so we considered 3 approaches:
1. This series, adding support for GFP_DMA32 slab caches.
2. genalloc, which requires pre-allocating the maximum number of L2 page
tables (4096, so 4MB of memory).
3. page_frag, which is not very memory-efficient as it is unable to reuse
freed fragments until the whole page is freed. [3]
This series is the most memory-efficient approach.
stable@ note:
We confirmed that this is a regression, and IOMMU errors happen on 4.19
and linux-next/master on MT8173 (elm, Acer Chromebook R13). The issue
most likely starts from commit ad67f5a6545f ("arm64: replace ZONE_DMA
with ZONE_DMA32"), i.e. 4.15, and presumably breaks a number of Mediatek
platforms (and maybe others?).
[1] https://lists.linuxfoundation.org/pipermail/iommu/2018-November/030876.html
[2] https://lists.linuxfoundation.org/pipermail/iommu/2018-December/031696.html
[3] https://patchwork.codeaurora.org/patch/671639/
This patch (of 3):
IOMMUs using ARMv7 short-descriptor format require page tables to be
allocated within the first 4GB of RAM, even on 64-bit systems. On arm64,
this is done by passing GFP_DMA32 flag to memory allocation functions.
For IOMMU L2 tables that only take 1KB, it would be a waste to allocate
a full page using get_free_pages, so we considered 3 approaches:
1. This patch, adding support for GFP_DMA32 slab caches.
2. genalloc, which requires pre-allocating the maximum number of L2
page tables (4096, so 4MB of memory).
3. page_frag, which is not very memory-efficient as it is unable
to reuse freed fragments until the whole page is freed.
This change makes it possible to create a custom cache in DMA32 zone using
kmem_cache_create, then allocate memory using kmem_cache_alloc.
We do not create a DMA32 kmalloc cache array, as there are currently no
users of kmalloc(..., GFP_DMA32). These calls will continue to trigger a
warning, as we keep GFP_DMA32 in GFP_SLAB_BUG_MASK.
This implies that calls to kmem_cache_*alloc on a SLAB_CACHE_DMA32
kmem_cache must _not_ use GFP_DMA32 (it is anyway redundant and
unnecessary).
Link: http://lkml.kernel.org/r/20181210011504.122604-2-drinkcat@chromium.org
Signed-off-by: Nicolas Boichat <drinkcat@chromium.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Will Deacon <will.deacon@arm.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Sasha Levin <Alexander.Levin@microsoft.com>
Cc: Huaisheng Ye <yehs1@lenovo.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Yong Wu <yong.wu@mediatek.com>
Cc: Matthias Brugger <matthias.bgg@gmail.com>
Cc: Tomasz Figa <tfiga@google.com>
Cc: Yingjoe Chen <yingjoe.chen@mediatek.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Hsin-Yi Wang <hsinyi@chromium.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-28 20:43:42 -07:00
|
|
|
SLAB_CACHE_DMA32 | SLAB_ACCOUNT)
|
2014-10-09 15:26:22 -07:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Merge control. If this is set then no merging of slab caches will occur.
|
|
|
|
*/
|
2017-07-06 15:36:40 -07:00
|
|
|
static bool slab_nomerge = !IS_ENABLED(CONFIG_SLAB_MERGE_DEFAULT);
|
2014-10-09 15:26:22 -07:00
|
|
|
|
|
|
|
static int __init setup_slab_nomerge(char *str)
|
|
|
|
{
|
2017-07-06 15:36:40 -07:00
|
|
|
slab_nomerge = true;
|
2014-10-09 15:26:22 -07:00
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
|
2021-04-29 22:54:39 -07:00
|
|
|
static int __init setup_slab_merge(char *str)
|
|
|
|
{
|
|
|
|
slab_nomerge = false;
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
|
2014-10-09 15:26:22 -07:00
|
|
|
__setup_param("slub_nomerge", slub_nomerge, setup_slab_nomerge, 0);
|
2021-04-29 22:54:39 -07:00
|
|
|
__setup_param("slub_merge", slub_merge, setup_slab_merge, 0);
|
2014-10-09 15:26:22 -07:00
|
|
|
|
|
|
|
__setup("slab_nomerge", setup_slab_nomerge);
|
2021-04-29 22:54:39 -07:00
|
|
|
__setup("slab_merge", setup_slab_merge);
|
2014-10-09 15:26:22 -07:00
|
|
|
|
2014-10-09 15:26:00 -07:00
|
|
|
/*
|
|
|
|
* Determine the size of a slab object
|
|
|
|
*/
|
|
|
|
unsigned int kmem_cache_size(struct kmem_cache *s)
|
|
|
|
{
|
|
|
|
return s->object_size;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(kmem_cache_size);
|
|
|
|
|
2012-08-16 00:09:46 -07:00
|
|
|
#ifdef CONFIG_DEBUG_VM
|
2024-08-07 10:07:46 +01:00
|
|
|
|
|
|
|
static bool kmem_cache_is_duplicate_name(const char *name)
|
|
|
|
{
|
|
|
|
struct kmem_cache *s;
|
|
|
|
|
|
|
|
list_for_each_entry(s, &slab_caches, list) {
|
|
|
|
if (!strcmp(s->name, name))
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
|
2018-04-05 16:20:37 -07:00
|
|
|
static int kmem_cache_sanity_check(const char *name, unsigned int size)
|
2012-07-06 15:25:10 -05:00
|
|
|
{
|
2021-06-15 18:23:22 -07:00
|
|
|
if (!name || in_interrupt() || size > KMALLOC_MAX_SIZE) {
|
2012-08-16 00:09:46 -07:00
|
|
|
pr_err("kmem_cache_create(%s) integrity check failed\n", name);
|
|
|
|
return -EINVAL;
|
2012-07-06 15:25:10 -05:00
|
|
|
}
|
2012-08-16 10:12:18 +03:00
|
|
|
|
2024-08-07 10:07:46 +01:00
|
|
|
/* Duplicate names will confuse slabtop, et al */
|
|
|
|
WARN(kmem_cache_is_duplicate_name(name),
|
|
|
|
"kmem_cache of name '%s' already exists\n", name);
|
|
|
|
|
2012-07-06 15:25:13 -05:00
|
|
|
WARN_ON(strchr(name, ' ')); /* It confuses parsers */
|
2012-08-16 00:09:46 -07:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
#else
|
2018-04-05 16:20:37 -07:00
|
|
|
static inline int kmem_cache_sanity_check(const char *name, unsigned int size)
|
2012-08-16 00:09:46 -07:00
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
2012-07-06 15:25:13 -05:00
|
|
|
#endif
|
|
|
|
|
2018-01-31 16:15:36 -08:00
|
|
|
/*
|
|
|
|
* Figure out what the alignment of the objects will be given a set of
|
|
|
|
* flags, a user specified alignment and the size of the objects.
|
|
|
|
*/
|
2018-04-05 16:20:37 -07:00
|
|
|
static unsigned int calculate_alignment(slab_flags_t flags,
|
|
|
|
unsigned int align, unsigned int size)
|
2018-01-31 16:15:36 -08:00
|
|
|
{
|
|
|
|
/*
|
|
|
|
* If the user wants hardware cache aligned objects then follow that
|
|
|
|
* suggestion if the object is sufficiently large.
|
|
|
|
*
|
|
|
|
* The hardware cache alignment cannot override the specified
|
|
|
|
* alignment though. If that is greater then use it.
|
|
|
|
*/
|
|
|
|
if (flags & SLAB_HWCACHE_ALIGN) {
|
2018-04-05 16:20:37 -07:00
|
|
|
unsigned int ralign;
|
2018-01-31 16:15:36 -08:00
|
|
|
|
|
|
|
ralign = cache_line_size();
|
|
|
|
while (size <= ralign / 2)
|
|
|
|
ralign /= 2;
|
|
|
|
align = max(align, ralign);
|
|
|
|
}
|
|
|
|
|
2022-05-09 18:20:53 -07:00
|
|
|
align = max(align, arch_slab_minalign());
|
2018-01-31 16:15:36 -08:00
|
|
|
|
|
|
|
return ALIGN(align, sizeof(void *));
|
|
|
|
}
|
|
|
|
|
2014-10-09 15:26:22 -07:00
|
|
|
/*
|
|
|
|
* Find a mergeable slab cache
|
|
|
|
*/
|
|
|
|
int slab_unmergeable(struct kmem_cache *s)
|
|
|
|
{
|
|
|
|
if (slab_nomerge || (s->flags & SLAB_NEVER_MERGE))
|
|
|
|
return 1;
|
|
|
|
|
|
|
|
if (s->ctor)
|
|
|
|
return 1;
|
|
|
|
|
2022-11-16 15:56:32 +01:00
|
|
|
#ifdef CONFIG_HARDENED_USERCOPY
|
usercopy: Prepare for usercopy whitelisting
This patch prepares the slab allocator to handle caches having annotations
(useroffset and usersize) defining usercopy regions.
This patch is modified from Brad Spengler/PaX Team's PAX_USERCOPY
whitelisting code in the last public patch of grsecurity/PaX based on
my understanding of the code. Changes or omissions from the original
code are mine and don't reflect the original grsecurity/PaX code.
Currently, hardened usercopy performs dynamic bounds checking on slab
cache objects. This is good, but still leaves a lot of kernel memory
available to be copied to/from userspace in the face of bugs. To further
restrict what memory is available for copying, this creates a way to
whitelist specific areas of a given slab cache object for copying to/from
userspace, allowing much finer granularity of access control. Slab caches
that are never exposed to userspace can declare no whitelist for their
objects, thereby keeping them unavailable to userspace via dynamic copy
operations. (Note, an implicit form of whitelisting is the use of constant
sizes in usercopy operations and get_user()/put_user(); these bypass
hardened usercopy checks since these sizes cannot change at runtime.)
To support this whitelist annotation, usercopy region offset and size
members are added to struct kmem_cache. The slab allocator receives a
new function, kmem_cache_create_usercopy(), that creates a new cache
with a usercopy region defined, suitable for declaring spans of fields
within the objects that get copied to/from userspace.
In this patch, the default kmem_cache_create() marks the entire allocation
as whitelisted, leaving it semantically unchanged. Once all fine-grained
whitelists have been added (in subsequent patches), this will be changed
to a usersize of 0, making caches created with kmem_cache_create() not
copyable to/from userspace.
After the entire usercopy whitelist series is applied, less than 15%
of the slab cache memory remains exposed to potential usercopy bugs
after a fresh boot:
Total Slab Memory: 48074720
Usercopyable Memory: 6367532 13.2%
task_struct 0.2% 4480/1630720
RAW 0.3% 300/96000
RAWv6 2.1% 1408/64768
ext4_inode_cache 3.0% 269760/8740224
dentry 11.1% 585984/5273856
mm_struct 29.1% 54912/188448
kmalloc-8 100.0% 24576/24576
kmalloc-16 100.0% 28672/28672
kmalloc-32 100.0% 81920/81920
kmalloc-192 100.0% 96768/96768
kmalloc-128 100.0% 143360/143360
names_cache 100.0% 163840/163840
kmalloc-64 100.0% 167936/167936
kmalloc-256 100.0% 339968/339968
kmalloc-512 100.0% 350720/350720
kmalloc-96 100.0% 455616/455616
kmalloc-8192 100.0% 655360/655360
kmalloc-1024 100.0% 812032/812032
kmalloc-4096 100.0% 819200/819200
kmalloc-2048 100.0% 1310720/1310720
After some kernel build workloads, the percentage (mainly driven by
dentry and inode caches expanding) drops under 10%:
Total Slab Memory: 95516184
Usercopyable Memory: 8497452 8.8%
task_struct 0.2% 4000/1456000
RAW 0.3% 300/96000
RAWv6 2.1% 1408/64768
ext4_inode_cache 3.0% 1217280/39439872
dentry 11.1% 1623200/14608800
mm_struct 29.1% 73216/251264
kmalloc-8 100.0% 24576/24576
kmalloc-16 100.0% 28672/28672
kmalloc-32 100.0% 94208/94208
kmalloc-192 100.0% 96768/96768
kmalloc-128 100.0% 143360/143360
names_cache 100.0% 163840/163840
kmalloc-64 100.0% 245760/245760
kmalloc-256 100.0% 339968/339968
kmalloc-512 100.0% 350720/350720
kmalloc-96 100.0% 563520/563520
kmalloc-8192 100.0% 655360/655360
kmalloc-1024 100.0% 794624/794624
kmalloc-4096 100.0% 819200/819200
kmalloc-2048 100.0% 1257472/1257472
Signed-off-by: David Windsor <dave@nullcore.net>
[kees: adjust commit log, split out a few extra kmalloc hunks]
[kees: add field names to function declarations]
[kees: convert BUGs to WARNs and fail closed]
[kees: add attack surface reduction analysis to commit log]
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Cc: linux-xfs@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Christoph Lameter <cl@linux.com>
2017-06-10 22:50:28 -04:00
|
|
|
if (s->usersize)
|
|
|
|
return 1;
|
2022-11-16 15:56:32 +01:00
|
|
|
#endif
|
usercopy: Prepare for usercopy whitelisting
This patch prepares the slab allocator to handle caches having annotations
(useroffset and usersize) defining usercopy regions.
This patch is modified from Brad Spengler/PaX Team's PAX_USERCOPY
whitelisting code in the last public patch of grsecurity/PaX based on
my understanding of the code. Changes or omissions from the original
code are mine and don't reflect the original grsecurity/PaX code.
Currently, hardened usercopy performs dynamic bounds checking on slab
cache objects. This is good, but still leaves a lot of kernel memory
available to be copied to/from userspace in the face of bugs. To further
restrict what memory is available for copying, this creates a way to
whitelist specific areas of a given slab cache object for copying to/from
userspace, allowing much finer granularity of access control. Slab caches
that are never exposed to userspace can declare no whitelist for their
objects, thereby keeping them unavailable to userspace via dynamic copy
operations. (Note, an implicit form of whitelisting is the use of constant
sizes in usercopy operations and get_user()/put_user(); these bypass
hardened usercopy checks since these sizes cannot change at runtime.)
To support this whitelist annotation, usercopy region offset and size
members are added to struct kmem_cache. The slab allocator receives a
new function, kmem_cache_create_usercopy(), that creates a new cache
with a usercopy region defined, suitable for declaring spans of fields
within the objects that get copied to/from userspace.
In this patch, the default kmem_cache_create() marks the entire allocation
as whitelisted, leaving it semantically unchanged. Once all fine-grained
whitelists have been added (in subsequent patches), this will be changed
to a usersize of 0, making caches created with kmem_cache_create() not
copyable to/from userspace.
After the entire usercopy whitelist series is applied, less than 15%
of the slab cache memory remains exposed to potential usercopy bugs
after a fresh boot:
Total Slab Memory: 48074720
Usercopyable Memory: 6367532 13.2%
task_struct 0.2% 4480/1630720
RAW 0.3% 300/96000
RAWv6 2.1% 1408/64768
ext4_inode_cache 3.0% 269760/8740224
dentry 11.1% 585984/5273856
mm_struct 29.1% 54912/188448
kmalloc-8 100.0% 24576/24576
kmalloc-16 100.0% 28672/28672
kmalloc-32 100.0% 81920/81920
kmalloc-192 100.0% 96768/96768
kmalloc-128 100.0% 143360/143360
names_cache 100.0% 163840/163840
kmalloc-64 100.0% 167936/167936
kmalloc-256 100.0% 339968/339968
kmalloc-512 100.0% 350720/350720
kmalloc-96 100.0% 455616/455616
kmalloc-8192 100.0% 655360/655360
kmalloc-1024 100.0% 812032/812032
kmalloc-4096 100.0% 819200/819200
kmalloc-2048 100.0% 1310720/1310720
After some kernel build workloads, the percentage (mainly driven by
dentry and inode caches expanding) drops under 10%:
Total Slab Memory: 95516184
Usercopyable Memory: 8497452 8.8%
task_struct 0.2% 4000/1456000
RAW 0.3% 300/96000
RAWv6 2.1% 1408/64768
ext4_inode_cache 3.0% 1217280/39439872
dentry 11.1% 1623200/14608800
mm_struct 29.1% 73216/251264
kmalloc-8 100.0% 24576/24576
kmalloc-16 100.0% 28672/28672
kmalloc-32 100.0% 94208/94208
kmalloc-192 100.0% 96768/96768
kmalloc-128 100.0% 143360/143360
names_cache 100.0% 163840/163840
kmalloc-64 100.0% 245760/245760
kmalloc-256 100.0% 339968/339968
kmalloc-512 100.0% 350720/350720
kmalloc-96 100.0% 563520/563520
kmalloc-8192 100.0% 655360/655360
kmalloc-1024 100.0% 794624/794624
kmalloc-4096 100.0% 819200/819200
kmalloc-2048 100.0% 1257472/1257472
Signed-off-by: David Windsor <dave@nullcore.net>
[kees: adjust commit log, split out a few extra kmalloc hunks]
[kees: add field names to function declarations]
[kees: convert BUGs to WARNs and fail closed]
[kees: add attack surface reduction analysis to commit log]
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Cc: linux-xfs@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Christoph Lameter <cl@linux.com>
2017-06-10 22:50:28 -04:00
|
|
|
|
2014-10-09 15:26:22 -07:00
|
|
|
/*
|
|
|
|
* We may have set a slab to be unmergeable during bootstrap.
|
|
|
|
*/
|
|
|
|
if (s->refcount < 0)
|
|
|
|
return 1;
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2018-04-05 16:20:37 -07:00
|
|
|
struct kmem_cache *find_mergeable(unsigned int size, unsigned int align,
|
2017-11-15 17:32:18 -08:00
|
|
|
slab_flags_t flags, const char *name, void (*ctor)(void *))
|
2014-10-09 15:26:22 -07:00
|
|
|
{
|
|
|
|
struct kmem_cache *s;
|
|
|
|
|
2017-02-22 15:40:59 -08:00
|
|
|
if (slab_nomerge)
|
2014-10-09 15:26:22 -07:00
|
|
|
return NULL;
|
|
|
|
|
|
|
|
if (ctor)
|
|
|
|
return NULL;
|
|
|
|
|
2024-02-21 12:12:53 +00:00
|
|
|
flags = kmem_cache_flags(flags, name);
|
2014-10-09 15:26:22 -07:00
|
|
|
|
2017-02-22 15:40:59 -08:00
|
|
|
if (flags & SLAB_NEVER_MERGE)
|
|
|
|
return NULL;
|
|
|
|
|
2024-09-04 15:40:37 +08:00
|
|
|
size = ALIGN(size, sizeof(void *));
|
|
|
|
align = calculate_alignment(flags, align, size);
|
|
|
|
size = ALIGN(size, align);
|
|
|
|
|
2020-08-06 23:21:20 -07:00
|
|
|
list_for_each_entry_reverse(s, &slab_caches, list) {
|
2014-10-09 15:26:22 -07:00
|
|
|
if (slab_unmergeable(s))
|
|
|
|
continue;
|
|
|
|
|
|
|
|
if (size > s->size)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
if ((flags & SLAB_MERGE_SAME) != (s->flags & SLAB_MERGE_SAME))
|
|
|
|
continue;
|
|
|
|
/*
|
|
|
|
* Check if alignment is compatible.
|
|
|
|
* Courtesy of Adrian Drzewiecki
|
|
|
|
*/
|
|
|
|
if ((s->size & ~(align - 1)) != s->size)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
if (s->size - size >= sizeof(void *))
|
|
|
|
continue;
|
|
|
|
|
|
|
|
return s;
|
|
|
|
}
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
2015-11-05 18:45:08 -08:00
|
|
|
static struct kmem_cache *create_cache(const char *name,
|
2024-09-05 09:56:49 +02:00
|
|
|
unsigned int object_size,
|
|
|
|
struct kmem_cache_args *args,
|
|
|
|
slab_flags_t flags)
|
2014-04-07 15:39:26 -07:00
|
|
|
{
|
|
|
|
struct kmem_cache *s;
|
|
|
|
int err;
|
|
|
|
|
2024-08-28 12:56:24 +02:00
|
|
|
/* If a custom freelist pointer is requested make sure it's sane. */
|
|
|
|
err = -EINVAL;
|
2024-09-05 09:56:49 +02:00
|
|
|
if (args->use_freeptr_offset &&
|
|
|
|
(args->freeptr_offset >= object_size ||
|
|
|
|
!(flags & SLAB_TYPESAFE_BY_RCU) ||
|
2024-11-20 13:46:21 +01:00
|
|
|
!IS_ALIGNED(args->freeptr_offset, __alignof__(freeptr_t))))
|
2024-08-28 12:56:24 +02:00
|
|
|
goto out;
|
usercopy: Prepare for usercopy whitelisting
This patch prepares the slab allocator to handle caches having annotations
(useroffset and usersize) defining usercopy regions.
This patch is modified from Brad Spengler/PaX Team's PAX_USERCOPY
whitelisting code in the last public patch of grsecurity/PaX based on
my understanding of the code. Changes or omissions from the original
code are mine and don't reflect the original grsecurity/PaX code.
Currently, hardened usercopy performs dynamic bounds checking on slab
cache objects. This is good, but still leaves a lot of kernel memory
available to be copied to/from userspace in the face of bugs. To further
restrict what memory is available for copying, this creates a way to
whitelist specific areas of a given slab cache object for copying to/from
userspace, allowing much finer granularity of access control. Slab caches
that are never exposed to userspace can declare no whitelist for their
objects, thereby keeping them unavailable to userspace via dynamic copy
operations. (Note, an implicit form of whitelisting is the use of constant
sizes in usercopy operations and get_user()/put_user(); these bypass
hardened usercopy checks since these sizes cannot change at runtime.)
To support this whitelist annotation, usercopy region offset and size
members are added to struct kmem_cache. The slab allocator receives a
new function, kmem_cache_create_usercopy(), that creates a new cache
with a usercopy region defined, suitable for declaring spans of fields
within the objects that get copied to/from userspace.
In this patch, the default kmem_cache_create() marks the entire allocation
as whitelisted, leaving it semantically unchanged. Once all fine-grained
whitelists have been added (in subsequent patches), this will be changed
to a usersize of 0, making caches created with kmem_cache_create() not
copyable to/from userspace.
After the entire usercopy whitelist series is applied, less than 15%
of the slab cache memory remains exposed to potential usercopy bugs
after a fresh boot:
Total Slab Memory: 48074720
Usercopyable Memory: 6367532 13.2%
task_struct 0.2% 4480/1630720
RAW 0.3% 300/96000
RAWv6 2.1% 1408/64768
ext4_inode_cache 3.0% 269760/8740224
dentry 11.1% 585984/5273856
mm_struct 29.1% 54912/188448
kmalloc-8 100.0% 24576/24576
kmalloc-16 100.0% 28672/28672
kmalloc-32 100.0% 81920/81920
kmalloc-192 100.0% 96768/96768
kmalloc-128 100.0% 143360/143360
names_cache 100.0% 163840/163840
kmalloc-64 100.0% 167936/167936
kmalloc-256 100.0% 339968/339968
kmalloc-512 100.0% 350720/350720
kmalloc-96 100.0% 455616/455616
kmalloc-8192 100.0% 655360/655360
kmalloc-1024 100.0% 812032/812032
kmalloc-4096 100.0% 819200/819200
kmalloc-2048 100.0% 1310720/1310720
After some kernel build workloads, the percentage (mainly driven by
dentry and inode caches expanding) drops under 10%:
Total Slab Memory: 95516184
Usercopyable Memory: 8497452 8.8%
task_struct 0.2% 4000/1456000
RAW 0.3% 300/96000
RAWv6 2.1% 1408/64768
ext4_inode_cache 3.0% 1217280/39439872
dentry 11.1% 1623200/14608800
mm_struct 29.1% 73216/251264
kmalloc-8 100.0% 24576/24576
kmalloc-16 100.0% 28672/28672
kmalloc-32 100.0% 94208/94208
kmalloc-192 100.0% 96768/96768
kmalloc-128 100.0% 143360/143360
names_cache 100.0% 163840/163840
kmalloc-64 100.0% 245760/245760
kmalloc-256 100.0% 339968/339968
kmalloc-512 100.0% 350720/350720
kmalloc-96 100.0% 563520/563520
kmalloc-8192 100.0% 655360/655360
kmalloc-1024 100.0% 794624/794624
kmalloc-4096 100.0% 819200/819200
kmalloc-2048 100.0% 1257472/1257472
Signed-off-by: David Windsor <dave@nullcore.net>
[kees: adjust commit log, split out a few extra kmalloc hunks]
[kees: add field names to function declarations]
[kees: convert BUGs to WARNs and fail closed]
[kees: add attack surface reduction analysis to commit log]
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Cc: linux-xfs@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Christoph Lameter <cl@linux.com>
2017-06-10 22:50:28 -04:00
|
|
|
|
2014-04-07 15:39:26 -07:00
|
|
|
err = -ENOMEM;
|
|
|
|
s = kmem_cache_zalloc(kmem_cache, GFP_KERNEL);
|
|
|
|
if (!s)
|
|
|
|
goto out;
|
2024-09-05 09:56:51 +02:00
|
|
|
err = do_kmem_cache_create(s, name, object_size, args, flags);
|
2014-04-07 15:39:26 -07:00
|
|
|
if (err)
|
|
|
|
goto out_free_cache;
|
|
|
|
|
|
|
|
s->refcount = 1;
|
|
|
|
list_add(&s->list, &slab_caches);
|
|
|
|
return s;
|
|
|
|
|
|
|
|
out_free_cache:
|
2015-02-10 14:09:40 -08:00
|
|
|
kmem_cache_free(kmem_cache, s);
|
2023-06-06 14:55:43 +08:00
|
|
|
out:
|
|
|
|
return ERR_PTR(err);
|
2014-04-07 15:39:26 -07:00
|
|
|
}
|
2012-11-28 16:23:16 +00:00
|
|
|
|
2018-12-06 23:13:00 +02:00
|
|
|
/**
|
2024-09-13 10:15:56 +02:00
|
|
|
* __kmem_cache_create_args - Create a kmem cache.
|
2012-08-16 00:09:46 -07:00
|
|
|
* @name: A string which is used in /proc/slabinfo to identify this cache.
|
2024-09-05 09:56:45 +02:00
|
|
|
* @object_size: The size of objects to be created in this cache.
|
2024-09-13 10:15:56 +02:00
|
|
|
* @args: Additional arguments for the cache creation (see
|
|
|
|
* &struct kmem_cache_args).
|
2024-10-09 16:29:37 +02:00
|
|
|
* @flags: See the desriptions of individual flags. The common ones are listed
|
|
|
|
* in the description below.
|
2012-08-16 00:09:46 -07:00
|
|
|
*
|
2024-09-13 10:15:56 +02:00
|
|
|
* Not to be called directly, use the kmem_cache_create() wrapper with the same
|
|
|
|
* parameters.
|
2012-08-16 00:09:46 -07:00
|
|
|
*
|
2024-10-09 16:29:37 +02:00
|
|
|
* Commonly used @flags:
|
|
|
|
*
|
|
|
|
* &SLAB_ACCOUNT - Account allocations to memcg.
|
|
|
|
*
|
|
|
|
* &SLAB_HWCACHE_ALIGN - Align objects on cache line boundaries.
|
|
|
|
*
|
|
|
|
* &SLAB_RECLAIM_ACCOUNT - Objects are reclaimable.
|
|
|
|
*
|
|
|
|
* &SLAB_TYPESAFE_BY_RCU - Slab page (not individual objects) freeing delayed
|
|
|
|
* by a grace period - see the full description before using.
|
|
|
|
*
|
2024-09-13 10:15:56 +02:00
|
|
|
* Context: Cannot be called within a interrupt, but can be interrupted.
|
2018-12-06 23:13:00 +02:00
|
|
|
*
|
|
|
|
* Return: a pointer to the cache on success, NULL on failure.
|
2012-08-16 00:09:46 -07:00
|
|
|
*/
|
2024-09-05 09:56:45 +02:00
|
|
|
struct kmem_cache *__kmem_cache_create_args(const char *name,
|
|
|
|
unsigned int object_size,
|
|
|
|
struct kmem_cache_args *args,
|
|
|
|
slab_flags_t flags)
|
2012-08-16 00:09:46 -07:00
|
|
|
{
|
2015-11-05 18:45:43 -08:00
|
|
|
struct kmem_cache *s = NULL;
|
2015-02-13 14:36:38 -08:00
|
|
|
const char *cache_name;
|
2014-01-23 15:52:55 -08:00
|
|
|
int err;
|
2012-07-06 15:25:10 -05:00
|
|
|
|
2021-05-14 17:27:10 -07:00
|
|
|
#ifdef CONFIG_SLUB_DEBUG
|
|
|
|
/*
|
2023-12-15 11:41:48 +08:00
|
|
|
* If no slab_debug was enabled globally, the static key is not yet
|
2021-05-14 17:27:10 -07:00
|
|
|
* enabled by setup_slub_debug(). Enable it if the cache is being
|
|
|
|
* created with any of the debugging flags passed explicitly.
|
2021-07-07 18:07:47 -07:00
|
|
|
* It's also possible that this is the first cache created with
|
|
|
|
* SLAB_STORE_USER and we should init stack_depot for it.
|
2021-05-14 17:27:10 -07:00
|
|
|
*/
|
|
|
|
if (flags & SLAB_DEBUG_FLAGS)
|
|
|
|
static_branch_enable(&slub_debug_enabled);
|
2021-07-07 18:07:47 -07:00
|
|
|
if (flags & SLAB_STORE_USER)
|
|
|
|
stack_depot_init();
|
2021-05-14 17:27:10 -07:00
|
|
|
#endif
|
|
|
|
|
2012-08-16 00:09:46 -07:00
|
|
|
mutex_lock(&slab_mutex);
|
2012-09-05 00:20:33 +00:00
|
|
|
|
2024-09-05 09:56:45 +02:00
|
|
|
err = kmem_cache_sanity_check(name, object_size);
|
2014-10-09 15:25:58 -07:00
|
|
|
if (err) {
|
2014-01-23 15:52:55 -08:00
|
|
|
goto out_unlock;
|
2014-10-09 15:25:58 -07:00
|
|
|
}
|
2012-09-05 00:20:33 +00:00
|
|
|
|
2016-12-12 16:41:38 -08:00
|
|
|
/* Refuse requests with allocator specific flags */
|
|
|
|
if (flags & ~SLAB_FLAGS_PERMITTED) {
|
|
|
|
err = -EINVAL;
|
|
|
|
goto out_unlock;
|
|
|
|
}
|
|
|
|
|
2012-10-17 15:36:51 +04:00
|
|
|
/*
|
|
|
|
* Some allocators will constraint the set of valid flags to a subset
|
|
|
|
* of all flags. We expect them to define CACHE_CREATE_MASK in this
|
|
|
|
* case, and we'll just provide them with a sanitized version of the
|
|
|
|
* passed flags.
|
|
|
|
*/
|
|
|
|
flags &= CACHE_CREATE_MASK;
|
2012-09-05 00:20:33 +00:00
|
|
|
|
usercopy: Prepare for usercopy whitelisting
This patch prepares the slab allocator to handle caches having annotations
(useroffset and usersize) defining usercopy regions.
This patch is modified from Brad Spengler/PaX Team's PAX_USERCOPY
whitelisting code in the last public patch of grsecurity/PaX based on
my understanding of the code. Changes or omissions from the original
code are mine and don't reflect the original grsecurity/PaX code.
Currently, hardened usercopy performs dynamic bounds checking on slab
cache objects. This is good, but still leaves a lot of kernel memory
available to be copied to/from userspace in the face of bugs. To further
restrict what memory is available for copying, this creates a way to
whitelist specific areas of a given slab cache object for copying to/from
userspace, allowing much finer granularity of access control. Slab caches
that are never exposed to userspace can declare no whitelist for their
objects, thereby keeping them unavailable to userspace via dynamic copy
operations. (Note, an implicit form of whitelisting is the use of constant
sizes in usercopy operations and get_user()/put_user(); these bypass
hardened usercopy checks since these sizes cannot change at runtime.)
To support this whitelist annotation, usercopy region offset and size
members are added to struct kmem_cache. The slab allocator receives a
new function, kmem_cache_create_usercopy(), that creates a new cache
with a usercopy region defined, suitable for declaring spans of fields
within the objects that get copied to/from userspace.
In this patch, the default kmem_cache_create() marks the entire allocation
as whitelisted, leaving it semantically unchanged. Once all fine-grained
whitelists have been added (in subsequent patches), this will be changed
to a usersize of 0, making caches created with kmem_cache_create() not
copyable to/from userspace.
After the entire usercopy whitelist series is applied, less than 15%
of the slab cache memory remains exposed to potential usercopy bugs
after a fresh boot:
Total Slab Memory: 48074720
Usercopyable Memory: 6367532 13.2%
task_struct 0.2% 4480/1630720
RAW 0.3% 300/96000
RAWv6 2.1% 1408/64768
ext4_inode_cache 3.0% 269760/8740224
dentry 11.1% 585984/5273856
mm_struct 29.1% 54912/188448
kmalloc-8 100.0% 24576/24576
kmalloc-16 100.0% 28672/28672
kmalloc-32 100.0% 81920/81920
kmalloc-192 100.0% 96768/96768
kmalloc-128 100.0% 143360/143360
names_cache 100.0% 163840/163840
kmalloc-64 100.0% 167936/167936
kmalloc-256 100.0% 339968/339968
kmalloc-512 100.0% 350720/350720
kmalloc-96 100.0% 455616/455616
kmalloc-8192 100.0% 655360/655360
kmalloc-1024 100.0% 812032/812032
kmalloc-4096 100.0% 819200/819200
kmalloc-2048 100.0% 1310720/1310720
After some kernel build workloads, the percentage (mainly driven by
dentry and inode caches expanding) drops under 10%:
Total Slab Memory: 95516184
Usercopyable Memory: 8497452 8.8%
task_struct 0.2% 4000/1456000
RAW 0.3% 300/96000
RAWv6 2.1% 1408/64768
ext4_inode_cache 3.0% 1217280/39439872
dentry 11.1% 1623200/14608800
mm_struct 29.1% 73216/251264
kmalloc-8 100.0% 24576/24576
kmalloc-16 100.0% 28672/28672
kmalloc-32 100.0% 94208/94208
kmalloc-192 100.0% 96768/96768
kmalloc-128 100.0% 143360/143360
names_cache 100.0% 163840/163840
kmalloc-64 100.0% 245760/245760
kmalloc-256 100.0% 339968/339968
kmalloc-512 100.0% 350720/350720
kmalloc-96 100.0% 563520/563520
kmalloc-8192 100.0% 655360/655360
kmalloc-1024 100.0% 794624/794624
kmalloc-4096 100.0% 819200/819200
kmalloc-2048 100.0% 1257472/1257472
Signed-off-by: David Windsor <dave@nullcore.net>
[kees: adjust commit log, split out a few extra kmalloc hunks]
[kees: add field names to function declarations]
[kees: convert BUGs to WARNs and fail closed]
[kees: add attack surface reduction analysis to commit log]
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Cc: linux-xfs@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Christoph Lameter <cl@linux.com>
2017-06-10 22:50:28 -04:00
|
|
|
/* Fail closed on bad usersize of useroffset values. */
|
2022-11-16 15:56:32 +01:00
|
|
|
if (!IS_ENABLED(CONFIG_HARDENED_USERCOPY) ||
|
2024-09-05 09:56:45 +02:00
|
|
|
WARN_ON(!args->usersize && args->useroffset) ||
|
|
|
|
WARN_ON(object_size < args->usersize ||
|
|
|
|
object_size - args->usersize < args->useroffset))
|
|
|
|
args->usersize = args->useroffset = 0;
|
|
|
|
|
|
|
|
if (!args->usersize)
|
|
|
|
s = __kmem_cache_alias(name, object_size, args->align, flags,
|
|
|
|
args->ctor);
|
2014-04-07 15:39:26 -07:00
|
|
|
if (s)
|
2014-01-23 15:52:55 -08:00
|
|
|
goto out_unlock;
|
2012-12-18 14:22:34 -08:00
|
|
|
|
2015-02-13 14:36:38 -08:00
|
|
|
cache_name = kstrdup_const(name, GFP_KERNEL);
|
2014-04-07 15:39:26 -07:00
|
|
|
if (!cache_name) {
|
|
|
|
err = -ENOMEM;
|
|
|
|
goto out_unlock;
|
|
|
|
}
|
2012-09-04 23:38:33 +00:00
|
|
|
|
2024-09-05 09:56:49 +02:00
|
|
|
args->align = calculate_alignment(flags, args->align, object_size);
|
|
|
|
s = create_cache(cache_name, object_size, args, flags);
|
2014-04-07 15:39:26 -07:00
|
|
|
if (IS_ERR(s)) {
|
|
|
|
err = PTR_ERR(s);
|
2015-02-13 14:36:38 -08:00
|
|
|
kfree_const(cache_name);
|
2014-04-07 15:39:26 -07:00
|
|
|
}
|
2014-01-23 15:52:55 -08:00
|
|
|
|
|
|
|
out_unlock:
|
2012-07-06 15:25:13 -05:00
|
|
|
mutex_unlock(&slab_mutex);
|
slab: get_online_mems for kmem_cache_{create,destroy,shrink}
When we create a sl[au]b cache, we allocate kmem_cache_node structures
for each online NUMA node. To handle nodes taken online/offline, we
register memory hotplug notifier and allocate/free kmem_cache_node
corresponding to the node that changes its state for each kmem cache.
To synchronize between the two paths we hold the slab_mutex during both
the cache creationg/destruction path and while tuning per-node parts of
kmem caches in memory hotplug handler, but that's not quite right,
because it does not guarantee that a newly created cache will have all
kmem_cache_nodes initialized in case it races with memory hotplug. For
instance, in case of slub:
CPU0 CPU1
---- ----
kmem_cache_create: online_pages:
__kmem_cache_create: slab_memory_callback:
slab_mem_going_online_callback:
lock slab_mutex
for each slab_caches list entry
allocate kmem_cache node
unlock slab_mutex
lock slab_mutex
init_kmem_cache_nodes:
for_each_node_state(node, N_NORMAL_MEMORY)
allocate kmem_cache node
add kmem_cache to slab_caches list
unlock slab_mutex
online_pages (continued):
node_states_set_node
As a result we'll get a kmem cache with not all kmem_cache_nodes
allocated.
To avoid issues like that we should hold get/put_online_mems() during
the whole kmem cache creation/destruction/shrink paths, just like we
deal with cpu hotplug. This patch does the trick.
Note, that after it's applied, there is no need in taking the slab_mutex
for kmem_cache_shrink any more, so it is removed from there.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04 16:07:20 -07:00
|
|
|
|
slab: fix wrong retval on kmem_cache_create_memcg error path
On kmem_cache_create_memcg() error path we set 'err', but leave 's' (the
new cache ptr) undefined. The latter can be NULL if we could not
allocate the cache, or pointing to a freed area if we failed somewhere
later while trying to initialize it. Initially we checked 'err'
immediately before exiting the function and returned NULL if it was set
ignoring the value of 's':
out_unlock:
...
if (err) {
/* report error */
return NULL;
}
return s;
Recently this check was, in fact, broken by commit f717eb3abb5e ("slab:
do not panic if we fail to create memcg cache"), which turned it to:
out_unlock:
...
if (err && !memcg) {
/* report error */
return NULL;
}
return s;
As a result, if we are failing creating a cache for a memcg, we will
skip the check and return 's' that can contain crap. Obviously, commit
f717eb3abb5e intended not to return crap on error allocating a cache for
a memcg, but only to remove the error reporting in this case, so the
check should look like this:
out_unlock:
...
if (err) {
if (!memcg)
return NULL;
/* report error */
return NULL;
}
return s;
[rientjes@google.com: despaghettification]
[vdavydov@parallels.com: patch monkeying]
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Signed-off-by: Dave Jones <davej@redhat.com>
Reported-by: Dave Jones <davej@redhat.com>
Acked-by: Pekka Enberg <penberg@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-29 14:05:48 -08:00
|
|
|
if (err) {
|
2012-09-05 00:20:33 +00:00
|
|
|
if (flags & SLAB_PANIC)
|
2021-06-28 19:34:27 -07:00
|
|
|
panic("%s: Failed to create slab '%s'. Error %d\n",
|
|
|
|
__func__, name, err);
|
2012-09-05 00:20:33 +00:00
|
|
|
else {
|
2021-06-28 19:34:27 -07:00
|
|
|
pr_warn("%s(%s) failed with error %d\n",
|
|
|
|
__func__, name, err);
|
2012-09-05 00:20:33 +00:00
|
|
|
dump_stack();
|
|
|
|
}
|
|
|
|
return NULL;
|
|
|
|
}
|
2012-07-06 15:25:10 -05:00
|
|
|
return s;
|
|
|
|
}
|
2024-09-05 09:56:45 +02:00
|
|
|
EXPORT_SYMBOL(__kmem_cache_create_args);
|
2012-12-18 14:22:34 -08:00
|
|
|
|
mm/slab: Introduce kmem_buckets_create() and family
Dedicated caches are available for fixed size allocations via
kmem_cache_alloc(), but for dynamically sized allocations there is only
the global kmalloc API's set of buckets available. This means it isn't
possible to separate specific sets of dynamically sized allocations into
a separate collection of caches.
This leads to a use-after-free exploitation weakness in the Linux
kernel since many heap memory spraying/grooming attacks depend on using
userspace-controllable dynamically sized allocations to collide with
fixed size allocations that end up in same cache.
While CONFIG_RANDOM_KMALLOC_CACHES provides a probabilistic defense
against these kinds of "type confusion" attacks, including for fixed
same-size heap objects, we can create a complementary deterministic
defense for dynamically sized allocations that are directly user
controlled. Addressing these cases is limited in scope, so isolating these
kinds of interfaces will not become an unbounded game of whack-a-mole. For
example, many pass through memdup_user(), making isolation there very
effective.
In order to isolate user-controllable dynamically-sized
allocations from the common system kmalloc allocations, introduce
kmem_buckets_create(), which behaves like kmem_cache_create(). Introduce
kmem_buckets_alloc(), which behaves like kmem_cache_alloc(). Introduce
kmem_buckets_alloc_track_caller() for where caller tracking is
needed. Introduce kmem_buckets_valloc() for cases where vmalloc fallback
is needed. Note that these caches are specifically flagged with
SLAB_NO_MERGE, since merging would defeat the entire purpose of the
mitigation.
This can also be used in the future to extend allocation profiling's use
of code tagging to implement per-caller allocation cache isolation[1]
even for dynamic allocations.
Memory allocation pinning[2] is still needed to plug the Use-After-Free
cross-allocator weakness (where attackers can arrange to free an
entire slab page and have it reallocated to a different cache),
but that is an existing and separate issue which is complementary
to this improvement. Development continues for that feature via the
SLAB_VIRTUAL[3] series (which could also provide guard pages -- another
complementary improvement).
Link: https://lore.kernel.org/lkml/202402211449.401382D2AF@keescook [1]
Link: https://googleprojectzero.blogspot.com/2021/10/how-simple-linux-kernel-memory.html [2]
Link: https://lore.kernel.org/lkml/20230915105933.495735-1-matteorizzo@google.com/ [3]
Signed-off-by: Kees Cook <kees@kernel.org>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2024-07-01 12:13:01 -07:00
|
|
|
static struct kmem_cache *kmem_buckets_cache __ro_after_init;
|
|
|
|
|
|
|
|
/**
|
|
|
|
* kmem_buckets_create - Create a set of caches that handle dynamic sized
|
|
|
|
* allocations via kmem_buckets_alloc()
|
|
|
|
* @name: A prefix string which is used in /proc/slabinfo to identify this
|
|
|
|
* cache. The individual caches with have their sizes as the suffix.
|
|
|
|
* @flags: SLAB flags (see kmem_cache_create() for details).
|
|
|
|
* @useroffset: Starting offset within an allocation that may be copied
|
|
|
|
* to/from userspace.
|
|
|
|
* @usersize: How many bytes, starting at @useroffset, may be copied
|
|
|
|
* to/from userspace.
|
|
|
|
* @ctor: A constructor for the objects, run when new allocations are made.
|
|
|
|
*
|
|
|
|
* Cannot be called within an interrupt, but can be interrupted.
|
|
|
|
*
|
|
|
|
* Return: a pointer to the cache on success, NULL on failure. When
|
|
|
|
* CONFIG_SLAB_BUCKETS is not enabled, ZERO_SIZE_PTR is returned, and
|
|
|
|
* subsequent calls to kmem_buckets_alloc() will fall back to kmalloc().
|
|
|
|
* (i.e. callers only need to check for NULL on failure.)
|
|
|
|
*/
|
|
|
|
kmem_buckets *kmem_buckets_create(const char *name, slab_flags_t flags,
|
|
|
|
unsigned int useroffset,
|
|
|
|
unsigned int usersize,
|
|
|
|
void (*ctor)(void *))
|
|
|
|
{
|
2024-11-05 11:27:47 +09:00
|
|
|
unsigned long mask = 0;
|
|
|
|
unsigned int idx;
|
mm/slab: Introduce kmem_buckets_create() and family
Dedicated caches are available for fixed size allocations via
kmem_cache_alloc(), but for dynamically sized allocations there is only
the global kmalloc API's set of buckets available. This means it isn't
possible to separate specific sets of dynamically sized allocations into
a separate collection of caches.
This leads to a use-after-free exploitation weakness in the Linux
kernel since many heap memory spraying/grooming attacks depend on using
userspace-controllable dynamically sized allocations to collide with
fixed size allocations that end up in same cache.
While CONFIG_RANDOM_KMALLOC_CACHES provides a probabilistic defense
against these kinds of "type confusion" attacks, including for fixed
same-size heap objects, we can create a complementary deterministic
defense for dynamically sized allocations that are directly user
controlled. Addressing these cases is limited in scope, so isolating these
kinds of interfaces will not become an unbounded game of whack-a-mole. For
example, many pass through memdup_user(), making isolation there very
effective.
In order to isolate user-controllable dynamically-sized
allocations from the common system kmalloc allocations, introduce
kmem_buckets_create(), which behaves like kmem_cache_create(). Introduce
kmem_buckets_alloc(), which behaves like kmem_cache_alloc(). Introduce
kmem_buckets_alloc_track_caller() for where caller tracking is
needed. Introduce kmem_buckets_valloc() for cases where vmalloc fallback
is needed. Note that these caches are specifically flagged with
SLAB_NO_MERGE, since merging would defeat the entire purpose of the
mitigation.
This can also be used in the future to extend allocation profiling's use
of code tagging to implement per-caller allocation cache isolation[1]
even for dynamic allocations.
Memory allocation pinning[2] is still needed to plug the Use-After-Free
cross-allocator weakness (where attackers can arrange to free an
entire slab page and have it reallocated to a different cache),
but that is an existing and separate issue which is complementary
to this improvement. Development continues for that feature via the
SLAB_VIRTUAL[3] series (which could also provide guard pages -- another
complementary improvement).
Link: https://lore.kernel.org/lkml/202402211449.401382D2AF@keescook [1]
Link: https://googleprojectzero.blogspot.com/2021/10/how-simple-linux-kernel-memory.html [2]
Link: https://lore.kernel.org/lkml/20230915105933.495735-1-matteorizzo@google.com/ [3]
Signed-off-by: Kees Cook <kees@kernel.org>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2024-07-01 12:13:01 -07:00
|
|
|
kmem_buckets *b;
|
2024-11-05 11:27:47 +09:00
|
|
|
|
|
|
|
BUILD_BUG_ON(ARRAY_SIZE(kmalloc_caches[KMALLOC_NORMAL]) > BITS_PER_LONG);
|
mm/slab: Introduce kmem_buckets_create() and family
Dedicated caches are available for fixed size allocations via
kmem_cache_alloc(), but for dynamically sized allocations there is only
the global kmalloc API's set of buckets available. This means it isn't
possible to separate specific sets of dynamically sized allocations into
a separate collection of caches.
This leads to a use-after-free exploitation weakness in the Linux
kernel since many heap memory spraying/grooming attacks depend on using
userspace-controllable dynamically sized allocations to collide with
fixed size allocations that end up in same cache.
While CONFIG_RANDOM_KMALLOC_CACHES provides a probabilistic defense
against these kinds of "type confusion" attacks, including for fixed
same-size heap objects, we can create a complementary deterministic
defense for dynamically sized allocations that are directly user
controlled. Addressing these cases is limited in scope, so isolating these
kinds of interfaces will not become an unbounded game of whack-a-mole. For
example, many pass through memdup_user(), making isolation there very
effective.
In order to isolate user-controllable dynamically-sized
allocations from the common system kmalloc allocations, introduce
kmem_buckets_create(), which behaves like kmem_cache_create(). Introduce
kmem_buckets_alloc(), which behaves like kmem_cache_alloc(). Introduce
kmem_buckets_alloc_track_caller() for where caller tracking is
needed. Introduce kmem_buckets_valloc() for cases where vmalloc fallback
is needed. Note that these caches are specifically flagged with
SLAB_NO_MERGE, since merging would defeat the entire purpose of the
mitigation.
This can also be used in the future to extend allocation profiling's use
of code tagging to implement per-caller allocation cache isolation[1]
even for dynamic allocations.
Memory allocation pinning[2] is still needed to plug the Use-After-Free
cross-allocator weakness (where attackers can arrange to free an
entire slab page and have it reallocated to a different cache),
but that is an existing and separate issue which is complementary
to this improvement. Development continues for that feature via the
SLAB_VIRTUAL[3] series (which could also provide guard pages -- another
complementary improvement).
Link: https://lore.kernel.org/lkml/202402211449.401382D2AF@keescook [1]
Link: https://googleprojectzero.blogspot.com/2021/10/how-simple-linux-kernel-memory.html [2]
Link: https://lore.kernel.org/lkml/20230915105933.495735-1-matteorizzo@google.com/ [3]
Signed-off-by: Kees Cook <kees@kernel.org>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2024-07-01 12:13:01 -07:00
|
|
|
|
|
|
|
/*
|
|
|
|
* When the separate buckets API is not built in, just return
|
|
|
|
* a non-NULL value for the kmem_buckets pointer, which will be
|
|
|
|
* unused when performing allocations.
|
|
|
|
*/
|
|
|
|
if (!IS_ENABLED(CONFIG_SLAB_BUCKETS))
|
|
|
|
return ZERO_SIZE_PTR;
|
|
|
|
|
|
|
|
if (WARN_ON(!kmem_buckets_cache))
|
|
|
|
return NULL;
|
|
|
|
|
|
|
|
b = kmem_cache_alloc(kmem_buckets_cache, GFP_KERNEL|__GFP_ZERO);
|
|
|
|
if (WARN_ON(!b))
|
|
|
|
return NULL;
|
|
|
|
|
|
|
|
flags |= SLAB_NO_MERGE;
|
|
|
|
|
|
|
|
for (idx = 0; idx < ARRAY_SIZE(kmalloc_caches[KMALLOC_NORMAL]); idx++) {
|
|
|
|
char *short_size, *cache_name;
|
|
|
|
unsigned int cache_useroffset, cache_usersize;
|
2024-11-05 11:27:47 +09:00
|
|
|
unsigned int size, aligned_idx;
|
mm/slab: Introduce kmem_buckets_create() and family
Dedicated caches are available for fixed size allocations via
kmem_cache_alloc(), but for dynamically sized allocations there is only
the global kmalloc API's set of buckets available. This means it isn't
possible to separate specific sets of dynamically sized allocations into
a separate collection of caches.
This leads to a use-after-free exploitation weakness in the Linux
kernel since many heap memory spraying/grooming attacks depend on using
userspace-controllable dynamically sized allocations to collide with
fixed size allocations that end up in same cache.
While CONFIG_RANDOM_KMALLOC_CACHES provides a probabilistic defense
against these kinds of "type confusion" attacks, including for fixed
same-size heap objects, we can create a complementary deterministic
defense for dynamically sized allocations that are directly user
controlled. Addressing these cases is limited in scope, so isolating these
kinds of interfaces will not become an unbounded game of whack-a-mole. For
example, many pass through memdup_user(), making isolation there very
effective.
In order to isolate user-controllable dynamically-sized
allocations from the common system kmalloc allocations, introduce
kmem_buckets_create(), which behaves like kmem_cache_create(). Introduce
kmem_buckets_alloc(), which behaves like kmem_cache_alloc(). Introduce
kmem_buckets_alloc_track_caller() for where caller tracking is
needed. Introduce kmem_buckets_valloc() for cases where vmalloc fallback
is needed. Note that these caches are specifically flagged with
SLAB_NO_MERGE, since merging would defeat the entire purpose of the
mitigation.
This can also be used in the future to extend allocation profiling's use
of code tagging to implement per-caller allocation cache isolation[1]
even for dynamic allocations.
Memory allocation pinning[2] is still needed to plug the Use-After-Free
cross-allocator weakness (where attackers can arrange to free an
entire slab page and have it reallocated to a different cache),
but that is an existing and separate issue which is complementary
to this improvement. Development continues for that feature via the
SLAB_VIRTUAL[3] series (which could also provide guard pages -- another
complementary improvement).
Link: https://lore.kernel.org/lkml/202402211449.401382D2AF@keescook [1]
Link: https://googleprojectzero.blogspot.com/2021/10/how-simple-linux-kernel-memory.html [2]
Link: https://lore.kernel.org/lkml/20230915105933.495735-1-matteorizzo@google.com/ [3]
Signed-off-by: Kees Cook <kees@kernel.org>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2024-07-01 12:13:01 -07:00
|
|
|
|
|
|
|
if (!kmalloc_caches[KMALLOC_NORMAL][idx])
|
|
|
|
continue;
|
|
|
|
|
|
|
|
size = kmalloc_caches[KMALLOC_NORMAL][idx]->object_size;
|
|
|
|
if (!size)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
short_size = strchr(kmalloc_caches[KMALLOC_NORMAL][idx]->name, '-');
|
|
|
|
if (WARN_ON(!short_size))
|
|
|
|
goto fail;
|
|
|
|
|
|
|
|
if (useroffset >= size) {
|
|
|
|
cache_useroffset = 0;
|
|
|
|
cache_usersize = 0;
|
|
|
|
} else {
|
|
|
|
cache_useroffset = useroffset;
|
|
|
|
cache_usersize = min(size - cache_useroffset, usersize);
|
|
|
|
}
|
2024-11-05 11:27:47 +09:00
|
|
|
|
|
|
|
aligned_idx = __kmalloc_index(size, false);
|
|
|
|
if (!(*b)[aligned_idx]) {
|
|
|
|
cache_name = kasprintf(GFP_KERNEL, "%s-%s", name, short_size + 1);
|
|
|
|
if (WARN_ON(!cache_name))
|
|
|
|
goto fail;
|
|
|
|
(*b)[aligned_idx] = kmem_cache_create_usercopy(cache_name, size,
|
mm/slab: Introduce kmem_buckets_create() and family
Dedicated caches are available for fixed size allocations via
kmem_cache_alloc(), but for dynamically sized allocations there is only
the global kmalloc API's set of buckets available. This means it isn't
possible to separate specific sets of dynamically sized allocations into
a separate collection of caches.
This leads to a use-after-free exploitation weakness in the Linux
kernel since many heap memory spraying/grooming attacks depend on using
userspace-controllable dynamically sized allocations to collide with
fixed size allocations that end up in same cache.
While CONFIG_RANDOM_KMALLOC_CACHES provides a probabilistic defense
against these kinds of "type confusion" attacks, including for fixed
same-size heap objects, we can create a complementary deterministic
defense for dynamically sized allocations that are directly user
controlled. Addressing these cases is limited in scope, so isolating these
kinds of interfaces will not become an unbounded game of whack-a-mole. For
example, many pass through memdup_user(), making isolation there very
effective.
In order to isolate user-controllable dynamically-sized
allocations from the common system kmalloc allocations, introduce
kmem_buckets_create(), which behaves like kmem_cache_create(). Introduce
kmem_buckets_alloc(), which behaves like kmem_cache_alloc(). Introduce
kmem_buckets_alloc_track_caller() for where caller tracking is
needed. Introduce kmem_buckets_valloc() for cases where vmalloc fallback
is needed. Note that these caches are specifically flagged with
SLAB_NO_MERGE, since merging would defeat the entire purpose of the
mitigation.
This can also be used in the future to extend allocation profiling's use
of code tagging to implement per-caller allocation cache isolation[1]
even for dynamic allocations.
Memory allocation pinning[2] is still needed to plug the Use-After-Free
cross-allocator weakness (where attackers can arrange to free an
entire slab page and have it reallocated to a different cache),
but that is an existing and separate issue which is complementary
to this improvement. Development continues for that feature via the
SLAB_VIRTUAL[3] series (which could also provide guard pages -- another
complementary improvement).
Link: https://lore.kernel.org/lkml/202402211449.401382D2AF@keescook [1]
Link: https://googleprojectzero.blogspot.com/2021/10/how-simple-linux-kernel-memory.html [2]
Link: https://lore.kernel.org/lkml/20230915105933.495735-1-matteorizzo@google.com/ [3]
Signed-off-by: Kees Cook <kees@kernel.org>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2024-07-01 12:13:01 -07:00
|
|
|
0, flags, cache_useroffset,
|
|
|
|
cache_usersize, ctor);
|
2024-11-05 11:27:47 +09:00
|
|
|
kfree(cache_name);
|
|
|
|
if (WARN_ON(!(*b)[aligned_idx]))
|
|
|
|
goto fail;
|
|
|
|
set_bit(aligned_idx, &mask);
|
|
|
|
}
|
|
|
|
if (idx != aligned_idx)
|
|
|
|
(*b)[idx] = (*b)[aligned_idx];
|
mm/slab: Introduce kmem_buckets_create() and family
Dedicated caches are available for fixed size allocations via
kmem_cache_alloc(), but for dynamically sized allocations there is only
the global kmalloc API's set of buckets available. This means it isn't
possible to separate specific sets of dynamically sized allocations into
a separate collection of caches.
This leads to a use-after-free exploitation weakness in the Linux
kernel since many heap memory spraying/grooming attacks depend on using
userspace-controllable dynamically sized allocations to collide with
fixed size allocations that end up in same cache.
While CONFIG_RANDOM_KMALLOC_CACHES provides a probabilistic defense
against these kinds of "type confusion" attacks, including for fixed
same-size heap objects, we can create a complementary deterministic
defense for dynamically sized allocations that are directly user
controlled. Addressing these cases is limited in scope, so isolating these
kinds of interfaces will not become an unbounded game of whack-a-mole. For
example, many pass through memdup_user(), making isolation there very
effective.
In order to isolate user-controllable dynamically-sized
allocations from the common system kmalloc allocations, introduce
kmem_buckets_create(), which behaves like kmem_cache_create(). Introduce
kmem_buckets_alloc(), which behaves like kmem_cache_alloc(). Introduce
kmem_buckets_alloc_track_caller() for where caller tracking is
needed. Introduce kmem_buckets_valloc() for cases where vmalloc fallback
is needed. Note that these caches are specifically flagged with
SLAB_NO_MERGE, since merging would defeat the entire purpose of the
mitigation.
This can also be used in the future to extend allocation profiling's use
of code tagging to implement per-caller allocation cache isolation[1]
even for dynamic allocations.
Memory allocation pinning[2] is still needed to plug the Use-After-Free
cross-allocator weakness (where attackers can arrange to free an
entire slab page and have it reallocated to a different cache),
but that is an existing and separate issue which is complementary
to this improvement. Development continues for that feature via the
SLAB_VIRTUAL[3] series (which could also provide guard pages -- another
complementary improvement).
Link: https://lore.kernel.org/lkml/202402211449.401382D2AF@keescook [1]
Link: https://googleprojectzero.blogspot.com/2021/10/how-simple-linux-kernel-memory.html [2]
Link: https://lore.kernel.org/lkml/20230915105933.495735-1-matteorizzo@google.com/ [3]
Signed-off-by: Kees Cook <kees@kernel.org>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2024-07-01 12:13:01 -07:00
|
|
|
}
|
|
|
|
|
|
|
|
return b;
|
|
|
|
|
|
|
|
fail:
|
2024-11-05 11:27:47 +09:00
|
|
|
for_each_set_bit(idx, &mask, ARRAY_SIZE(kmalloc_caches[KMALLOC_NORMAL]))
|
mm/slab: Introduce kmem_buckets_create() and family
Dedicated caches are available for fixed size allocations via
kmem_cache_alloc(), but for dynamically sized allocations there is only
the global kmalloc API's set of buckets available. This means it isn't
possible to separate specific sets of dynamically sized allocations into
a separate collection of caches.
This leads to a use-after-free exploitation weakness in the Linux
kernel since many heap memory spraying/grooming attacks depend on using
userspace-controllable dynamically sized allocations to collide with
fixed size allocations that end up in same cache.
While CONFIG_RANDOM_KMALLOC_CACHES provides a probabilistic defense
against these kinds of "type confusion" attacks, including for fixed
same-size heap objects, we can create a complementary deterministic
defense for dynamically sized allocations that are directly user
controlled. Addressing these cases is limited in scope, so isolating these
kinds of interfaces will not become an unbounded game of whack-a-mole. For
example, many pass through memdup_user(), making isolation there very
effective.
In order to isolate user-controllable dynamically-sized
allocations from the common system kmalloc allocations, introduce
kmem_buckets_create(), which behaves like kmem_cache_create(). Introduce
kmem_buckets_alloc(), which behaves like kmem_cache_alloc(). Introduce
kmem_buckets_alloc_track_caller() for where caller tracking is
needed. Introduce kmem_buckets_valloc() for cases where vmalloc fallback
is needed. Note that these caches are specifically flagged with
SLAB_NO_MERGE, since merging would defeat the entire purpose of the
mitigation.
This can also be used in the future to extend allocation profiling's use
of code tagging to implement per-caller allocation cache isolation[1]
even for dynamic allocations.
Memory allocation pinning[2] is still needed to plug the Use-After-Free
cross-allocator weakness (where attackers can arrange to free an
entire slab page and have it reallocated to a different cache),
but that is an existing and separate issue which is complementary
to this improvement. Development continues for that feature via the
SLAB_VIRTUAL[3] series (which could also provide guard pages -- another
complementary improvement).
Link: https://lore.kernel.org/lkml/202402211449.401382D2AF@keescook [1]
Link: https://googleprojectzero.blogspot.com/2021/10/how-simple-linux-kernel-memory.html [2]
Link: https://lore.kernel.org/lkml/20230915105933.495735-1-matteorizzo@google.com/ [3]
Signed-off-by: Kees Cook <kees@kernel.org>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2024-07-01 12:13:01 -07:00
|
|
|
kmem_cache_destroy((*b)[idx]);
|
2024-08-22 10:27:04 +08:00
|
|
|
kmem_cache_free(kmem_buckets_cache, b);
|
mm/slab: Introduce kmem_buckets_create() and family
Dedicated caches are available for fixed size allocations via
kmem_cache_alloc(), but for dynamically sized allocations there is only
the global kmalloc API's set of buckets available. This means it isn't
possible to separate specific sets of dynamically sized allocations into
a separate collection of caches.
This leads to a use-after-free exploitation weakness in the Linux
kernel since many heap memory spraying/grooming attacks depend on using
userspace-controllable dynamically sized allocations to collide with
fixed size allocations that end up in same cache.
While CONFIG_RANDOM_KMALLOC_CACHES provides a probabilistic defense
against these kinds of "type confusion" attacks, including for fixed
same-size heap objects, we can create a complementary deterministic
defense for dynamically sized allocations that are directly user
controlled. Addressing these cases is limited in scope, so isolating these
kinds of interfaces will not become an unbounded game of whack-a-mole. For
example, many pass through memdup_user(), making isolation there very
effective.
In order to isolate user-controllable dynamically-sized
allocations from the common system kmalloc allocations, introduce
kmem_buckets_create(), which behaves like kmem_cache_create(). Introduce
kmem_buckets_alloc(), which behaves like kmem_cache_alloc(). Introduce
kmem_buckets_alloc_track_caller() for where caller tracking is
needed. Introduce kmem_buckets_valloc() for cases where vmalloc fallback
is needed. Note that these caches are specifically flagged with
SLAB_NO_MERGE, since merging would defeat the entire purpose of the
mitigation.
This can also be used in the future to extend allocation profiling's use
of code tagging to implement per-caller allocation cache isolation[1]
even for dynamic allocations.
Memory allocation pinning[2] is still needed to plug the Use-After-Free
cross-allocator weakness (where attackers can arrange to free an
entire slab page and have it reallocated to a different cache),
but that is an existing and separate issue which is complementary
to this improvement. Development continues for that feature via the
SLAB_VIRTUAL[3] series (which could also provide guard pages -- another
complementary improvement).
Link: https://lore.kernel.org/lkml/202402211449.401382D2AF@keescook [1]
Link: https://googleprojectzero.blogspot.com/2021/10/how-simple-linux-kernel-memory.html [2]
Link: https://lore.kernel.org/lkml/20230915105933.495735-1-matteorizzo@google.com/ [3]
Signed-off-by: Kees Cook <kees@kernel.org>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2024-07-01 12:13:01 -07:00
|
|
|
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(kmem_buckets_create);
|
|
|
|
|
2022-08-12 14:30:33 -04:00
|
|
|
/*
|
|
|
|
* For a given kmem_cache, kmem_cache_destroy() should only be called
|
|
|
|
* once or there will be a use-after-free problem. The actual deletion
|
|
|
|
* and release of the kobject does not need slab_mutex or cpu_hotplug_lock
|
|
|
|
* protection. So they are now done without holding those locks.
|
|
|
|
*/
|
|
|
|
static void kmem_cache_release(struct kmem_cache *s)
|
|
|
|
{
|
2024-08-07 12:31:16 +02:00
|
|
|
kfence_shutdown_cache(s);
|
2024-08-07 12:31:15 +02:00
|
|
|
if (__is_defined(SLAB_SUPPORTS_SYSFS) && slab_state >= FULL)
|
2024-02-28 11:04:08 +08:00
|
|
|
sysfs_slab_release(s);
|
2024-08-07 12:31:15 +02:00
|
|
|
else
|
2024-02-28 11:04:08 +08:00
|
|
|
slab_kmem_cache_release(s);
|
memcg: zap memcg_slab_caches and memcg_slab_mutex
mem_cgroup->memcg_slab_caches is a list of kmem caches corresponding to
the given cgroup. Currently, it is only used on css free in order to
destroy all caches corresponding to the memory cgroup being freed. The
list is protected by memcg_slab_mutex. The mutex is also used to protect
kmem_cache->memcg_params->memcg_caches arrays and synchronizes
kmem_cache_destroy vs memcg_unregister_all_caches.
However, we can perfectly get on without these two. To destroy all caches
corresponding to a memory cgroup, we can walk over the global list of kmem
caches, slab_caches, and we can do all the synchronization stuff using the
slab_mutex instead of the memcg_slab_mutex. This patch therefore gets rid
of the memcg_slab_caches and memcg_slab_mutex.
Apart from this nice cleanup, it also:
- assures that rcu_barrier() is called once at max when a root cache is
destroyed or a memory cgroup is freed, no matter how many caches have
SLAB_DESTROY_BY_RCU flag set;
- fixes the race between kmem_cache_destroy and kmem_cache_create that
exists, because memcg_cleanup_cache_params, which is called from
kmem_cache_destroy after checking that kmem_cache->refcount=0,
releases the slab_mutex, which gives kmem_cache_create a chance to
make an alias to a cache doomed to be destroyed.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-10 14:11:47 -08:00
|
|
|
}
|
|
|
|
|
2014-05-06 12:50:08 -07:00
|
|
|
void slab_kmem_cache_release(struct kmem_cache *s)
|
|
|
|
{
|
2016-02-17 13:11:37 -08:00
|
|
|
__kmem_cache_release(s);
|
2015-02-13 14:36:38 -08:00
|
|
|
kfree_const(s->name);
|
2014-05-06 12:50:08 -07:00
|
|
|
kmem_cache_free(kmem_cache, s);
|
|
|
|
}
|
|
|
|
|
2012-09-04 23:18:33 +00:00
|
|
|
void kmem_cache_destroy(struct kmem_cache *s)
|
|
|
|
{
|
2024-08-07 12:31:15 +02:00
|
|
|
int err;
|
2022-08-12 14:30:33 -04:00
|
|
|
|
2022-01-14 14:04:54 -08:00
|
|
|
if (unlikely(!s) || !kasan_check_byte(s))
|
2015-09-08 15:00:50 -07:00
|
|
|
return;
|
|
|
|
|
2024-08-07 12:31:19 +02:00
|
|
|
/* in-flight kfree_rcu()'s may include objects from our cache */
|
|
|
|
kvfree_rcu_barrier();
|
|
|
|
|
2024-08-09 17:36:56 +02:00
|
|
|
if (IS_ENABLED(CONFIG_SLUB_RCU_DEBUG) &&
|
|
|
|
(s->flags & SLAB_TYPESAFE_BY_RCU)) {
|
|
|
|
/*
|
|
|
|
* Under CONFIG_SLUB_RCU_DEBUG, when objects in a
|
|
|
|
* SLAB_TYPESAFE_BY_RCU slab are freed, SLUB will internally
|
|
|
|
* defer their freeing with call_rcu().
|
|
|
|
* Wait for such call_rcu() invocations here before actually
|
|
|
|
* destroying the cache.
|
|
|
|
*
|
|
|
|
* It doesn't matter that we haven't looked at the slab refcount
|
|
|
|
* yet - slabs with SLAB_TYPESAFE_BY_RCU can't be merged, so
|
|
|
|
* the refcount should be 1 here.
|
|
|
|
*/
|
|
|
|
rcu_barrier();
|
|
|
|
}
|
|
|
|
|
2021-02-26 17:11:55 +01:00
|
|
|
cpus_read_lock();
|
2012-09-04 23:18:33 +00:00
|
|
|
mutex_lock(&slab_mutex);
|
2014-04-07 15:39:28 -07:00
|
|
|
|
mm/slab_common: fix slab_caches list corruption after kmem_cache_destroy()
After the commit in Fixes:, if a module that created a slab cache does not
release all of its allocated objects before destroying the cache (at rmmod
time), we might end up releasing the kmem_cache object without removing it
from the slab_caches list thus corrupting the list as kmem_cache_destroy()
ignores the return value from shutdown_cache(), which in turn never removes
the kmem_cache object from slabs_list in case __kmem_cache_shutdown() fails
to release all of the cache's slabs.
This is easily observable on a kernel built with CONFIG_DEBUG_LIST=y
as after that ill release the system will immediately trip on list_add,
or list_del, assertions similar to the one shown below as soon as another
kmem_cache gets created, or destroyed:
[ 1041.213632] list_del corruption. next->prev should be ffff89f596fb5768, but was 52f1e5016aeee75d. (next=ffff89f595a1b268)
[ 1041.219165] ------------[ cut here ]------------
[ 1041.221517] kernel BUG at lib/list_debug.c:62!
[ 1041.223452] invalid opcode: 0000 [#1] PREEMPT SMP PTI
[ 1041.225408] CPU: 2 PID: 1852 Comm: rmmod Kdump: loaded Tainted: G B W OE 6.5.0 #15
[ 1041.228244] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS edk2-20230524-3.fc37 05/24/2023
[ 1041.231212] RIP: 0010:__list_del_entry_valid+0xae/0xb0
Another quick way to trigger this issue, in a kernel with CONFIG_SLUB=y,
is to set slub_debug to poison the released objects and then just run
cat /proc/slabinfo after removing the module that leaks slab objects,
in which case the kernel will panic:
[ 50.954843] general protection fault, probably for non-canonical address 0xa56b6b6b6b6b6b8b: 0000 [#1] PREEMPT SMP PTI
[ 50.961545] CPU: 2 PID: 1495 Comm: cat Kdump: loaded Tainted: G B W OE 6.5.0 #15
[ 50.966808] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS edk2-20230524-3.fc37 05/24/2023
[ 50.972663] RIP: 0010:get_slabinfo+0x42/0xf0
This patch fixes this issue by properly checking shutdown_cache()'s
return value before taking the kmem_cache_release() branch.
Fixes: 0495e337b703 ("mm/slab_common: Deleting kobject in kmem_cache_destroy() without holding slab_mutex/cpu_hotplug_lock")
Signed-off-by: Rafael Aquini <aquini@redhat.com>
Cc: stable@vger.kernel.org
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2023-09-08 19:06:49 -04:00
|
|
|
s->refcount--;
|
2024-08-07 12:31:15 +02:00
|
|
|
if (s->refcount) {
|
|
|
|
mutex_unlock(&slab_mutex);
|
|
|
|
cpus_read_unlock();
|
|
|
|
return;
|
|
|
|
}
|
2014-04-07 15:39:28 -07:00
|
|
|
|
2024-08-07 12:31:14 +02:00
|
|
|
/* free asan quarantined objects */
|
|
|
|
kasan_cache_shutdown(s);
|
|
|
|
|
|
|
|
err = __kmem_cache_shutdown(s);
|
2024-10-01 18:20:48 +02:00
|
|
|
if (!slab_in_kunit_test())
|
|
|
|
WARN(err, "%s %s: Slab cache still has objects when called from %pS",
|
|
|
|
__func__, s->name, (void *)_RET_IP_);
|
2024-08-07 12:31:14 +02:00
|
|
|
|
|
|
|
list_del(&s->list);
|
|
|
|
|
2014-04-07 15:39:28 -07:00
|
|
|
mutex_unlock(&slab_mutex);
|
2021-02-26 17:11:55 +01:00
|
|
|
cpus_read_unlock();
|
2024-08-07 12:31:15 +02:00
|
|
|
|
|
|
|
if (slab_state >= FULL)
|
|
|
|
sysfs_slab_unlink(s);
|
|
|
|
debugfs_slab_release(s);
|
|
|
|
|
|
|
|
if (err)
|
|
|
|
return;
|
|
|
|
|
2024-08-07 12:31:17 +02:00
|
|
|
if (s->flags & SLAB_TYPESAFE_BY_RCU)
|
|
|
|
rcu_barrier();
|
|
|
|
|
|
|
|
kmem_cache_release(s);
|
2012-09-04 23:18:33 +00:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(kmem_cache_destroy);
|
|
|
|
|
slab: get_online_mems for kmem_cache_{create,destroy,shrink}
When we create a sl[au]b cache, we allocate kmem_cache_node structures
for each online NUMA node. To handle nodes taken online/offline, we
register memory hotplug notifier and allocate/free kmem_cache_node
corresponding to the node that changes its state for each kmem cache.
To synchronize between the two paths we hold the slab_mutex during both
the cache creationg/destruction path and while tuning per-node parts of
kmem caches in memory hotplug handler, but that's not quite right,
because it does not guarantee that a newly created cache will have all
kmem_cache_nodes initialized in case it races with memory hotplug. For
instance, in case of slub:
CPU0 CPU1
---- ----
kmem_cache_create: online_pages:
__kmem_cache_create: slab_memory_callback:
slab_mem_going_online_callback:
lock slab_mutex
for each slab_caches list entry
allocate kmem_cache node
unlock slab_mutex
lock slab_mutex
init_kmem_cache_nodes:
for_each_node_state(node, N_NORMAL_MEMORY)
allocate kmem_cache node
add kmem_cache to slab_caches list
unlock slab_mutex
online_pages (continued):
node_states_set_node
As a result we'll get a kmem cache with not all kmem_cache_nodes
allocated.
To avoid issues like that we should hold get/put_online_mems() during
the whole kmem cache creation/destruction/shrink paths, just like we
deal with cpu hotplug. This patch does the trick.
Note, that after it's applied, there is no need in taking the slab_mutex
for kmem_cache_shrink any more, so it is removed from there.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04 16:07:20 -07:00
|
|
|
/**
|
|
|
|
* kmem_cache_shrink - Shrink a cache.
|
|
|
|
* @cachep: The cache to shrink.
|
|
|
|
*
|
|
|
|
* Releases as many slabs as possible for a cache.
|
|
|
|
* To help debugging, a zero exit status indicates all slabs were released.
|
2019-03-05 15:48:42 -08:00
|
|
|
*
|
|
|
|
* Return: %0 if all slabs were released, non-zero otherwise
|
slab: get_online_mems for kmem_cache_{create,destroy,shrink}
When we create a sl[au]b cache, we allocate kmem_cache_node structures
for each online NUMA node. To handle nodes taken online/offline, we
register memory hotplug notifier and allocate/free kmem_cache_node
corresponding to the node that changes its state for each kmem cache.
To synchronize between the two paths we hold the slab_mutex during both
the cache creationg/destruction path and while tuning per-node parts of
kmem caches in memory hotplug handler, but that's not quite right,
because it does not guarantee that a newly created cache will have all
kmem_cache_nodes initialized in case it races with memory hotplug. For
instance, in case of slub:
CPU0 CPU1
---- ----
kmem_cache_create: online_pages:
__kmem_cache_create: slab_memory_callback:
slab_mem_going_online_callback:
lock slab_mutex
for each slab_caches list entry
allocate kmem_cache node
unlock slab_mutex
lock slab_mutex
init_kmem_cache_nodes:
for_each_node_state(node, N_NORMAL_MEMORY)
allocate kmem_cache node
add kmem_cache to slab_caches list
unlock slab_mutex
online_pages (continued):
node_states_set_node
As a result we'll get a kmem cache with not all kmem_cache_nodes
allocated.
To avoid issues like that we should hold get/put_online_mems() during
the whole kmem cache creation/destruction/shrink paths, just like we
deal with cpu hotplug. This patch does the trick.
Note, that after it's applied, there is no need in taking the slab_mutex
for kmem_cache_shrink any more, so it is removed from there.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04 16:07:20 -07:00
|
|
|
*/
|
|
|
|
int kmem_cache_shrink(struct kmem_cache *cachep)
|
|
|
|
{
|
mm: kasan: initial memory quarantine implementation
Quarantine isolates freed objects in a separate queue. The objects are
returned to the allocator later, which helps to detect use-after-free
errors.
When the object is freed, its state changes from KASAN_STATE_ALLOC to
KASAN_STATE_QUARANTINE. The object is poisoned and put into quarantine
instead of being returned to the allocator, therefore every subsequent
access to that object triggers a KASAN error, and the error handler is
able to say where the object has been allocated and deallocated.
When it's time for the object to leave quarantine, its state becomes
KASAN_STATE_FREE and it's returned to the allocator. From now on the
allocator may reuse it for another allocation. Before that happens,
it's still possible to detect a use-after free on that object (it
retains the allocation/deallocation stacks).
When the allocator reuses this object, the shadow is unpoisoned and old
allocation/deallocation stacks are wiped. Therefore a use of this
object, even an incorrect one, won't trigger ASan warning.
Without the quarantine, it's not guaranteed that the objects aren't
reused immediately, that's why the probability of catching a
use-after-free is lower than with quarantine in place.
Quarantine isolates freed objects in a separate queue. The objects are
returned to the allocator later, which helps to detect use-after-free
errors.
Freed objects are first added to per-cpu quarantine queues. When a
cache is destroyed or memory shrinking is requested, the objects are
moved into the global quarantine queue. Whenever a kmalloc call allows
memory reclaiming, the oldest objects are popped out of the global queue
until the total size of objects in quarantine is less than 3/4 of the
maximum quarantine size (which is a fraction of installed physical
memory).
As long as an object remains in the quarantine, KASAN is able to report
accesses to it, so the chance of reporting a use-after-free is
increased. Once the object leaves quarantine, the allocator may reuse
it, in which case the object is unpoisoned and KASAN can't detect
incorrect accesses to it.
Right now quarantine support is only enabled in SLAB allocator.
Unification of KASAN features in SLAB and SLUB will be done later.
This patch is based on the "mm: kasan: quarantine" patch originally
prepared by Dmitry Chernenkov. A number of improvements have been
suggested by Andrey Ryabinin.
[glider@google.com: v9]
Link: http://lkml.kernel.org/r/1462987130-144092-1-git-send-email-glider@google.com
Signed-off-by: Alexander Potapenko <glider@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrey Konovalov <adech.fo@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Konstantin Serebryany <kcc@google.com>
Cc: Dmitry Chernenkov <dmitryc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20 16:59:11 -07:00
|
|
|
kasan_cache_shrink(cachep);
|
mm, slab, slub: stop taking memory hotplug lock
Since commit 03afc0e25f7f ("slab: get_online_mems for
kmem_cache_{create,destroy,shrink}") we are taking memory hotplug lock for
SLAB and SLUB when creating, destroying or shrinking a cache. It is quite
a heavy lock and it's best to avoid it if possible, as we had several
issues with lockdep complaining about ordering in the past, see e.g.
e4f8e513c3d3 ("mm/slub: fix a deadlock in show_slab_objects()").
The problem scenario in 03afc0e25f7f (solved by the memory hotplug lock)
can be summarized as follows: while there's slab_mutex synchronizing new
kmem cache creation and SLUB's MEM_GOING_ONLINE callback
slab_mem_going_online_callback(), we may miss creation of kmem_cache_node
for the hotplugged node in the new kmem cache, because the hotplug
callback doesn't yet see the new cache, and cache creation in
init_kmem_cache_nodes() only inits kmem_cache_node for nodes in the
N_NORMAL_MEMORY nodemask, which however may not yet include the new node,
as that happens only later after the MEM_GOING_ONLINE callback.
Instead of using get/put_online_mems(), the problem can be solved by SLUB
maintaining its own nodemask of nodes for which it has allocated the
per-node kmem_cache_node structures. This nodemask would generally mirror
the N_NORMAL_MEMORY nodemask, but would be updated only in under SLUB's
control in its memory hotplug callbacks under the slab_mutex. This patch
adds such nodemask and its handling.
Commit 03afc0e25f7f mentiones "issues like [the one above]", but there
don't appear to be further issues. All the paths (shared for SLAB and
SLUB) taking the memory hotplug locks are also taking the slab_mutex,
except kmem_cache_shrink() where 03afc0e25f7f replaced slab_mutex with
get/put_online_mems().
We however cannot simply restore slab_mutex in kmem_cache_shrink(), as
SLUB can enters the function from a write to sysfs 'shrink' file, thus
holding kernfs lock, and in kmem_cache_create() the kernfs lock is nested
within slab_mutex. But on closer inspection we don't actually need to
protect kmem_cache_shrink() from hotplug callbacks: While SLUB's
__kmem_cache_shrink() does for_each_kmem_cache_node(), missing a new node
added in parallel hotplug is not fatal, and parallel hotremove does not
free kmem_cache_node's anymore after the previous patch, so use-after free
cannot happen. The per-node shrinking itself is protected by
n->list_lock. Same is true for SLAB, and SLOB is no-op.
SLAB also doesn't need the memory hotplug locking, which it only gained by
03afc0e25f7f through the shared paths in slab_common.c. Its memory
hotplug callbacks are also protected by slab_mutex against races with
these paths. The problem of SLUB relying on N_NORMAL_MEMORY doesn't apply
to SLAB, as its setup_kmem_cache_nodes relies on N_ONLINE, and the new
node is already set there during the MEM_GOING_ONLINE callback, so no
special care is needed for SLAB.
As such, this patch removes all get/put_online_mems() usage by the slab
subsystem.
Link: https://lkml.kernel.org/r/20210113131634.3671-3-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Qian Cai <cai@redhat.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-24 12:01:12 -08:00
|
|
|
|
2022-08-23 07:52:41 +00:00
|
|
|
return __kmem_cache_shrink(cachep);
|
slab: get_online_mems for kmem_cache_{create,destroy,shrink}
When we create a sl[au]b cache, we allocate kmem_cache_node structures
for each online NUMA node. To handle nodes taken online/offline, we
register memory hotplug notifier and allocate/free kmem_cache_node
corresponding to the node that changes its state for each kmem cache.
To synchronize between the two paths we hold the slab_mutex during both
the cache creationg/destruction path and while tuning per-node parts of
kmem caches in memory hotplug handler, but that's not quite right,
because it does not guarantee that a newly created cache will have all
kmem_cache_nodes initialized in case it races with memory hotplug. For
instance, in case of slub:
CPU0 CPU1
---- ----
kmem_cache_create: online_pages:
__kmem_cache_create: slab_memory_callback:
slab_mem_going_online_callback:
lock slab_mutex
for each slab_caches list entry
allocate kmem_cache node
unlock slab_mutex
lock slab_mutex
init_kmem_cache_nodes:
for_each_node_state(node, N_NORMAL_MEMORY)
allocate kmem_cache node
add kmem_cache to slab_caches list
unlock slab_mutex
online_pages (continued):
node_states_set_node
As a result we'll get a kmem cache with not all kmem_cache_nodes
allocated.
To avoid issues like that we should hold get/put_online_mems() during
the whole kmem cache creation/destruction/shrink paths, just like we
deal with cpu hotplug. This patch does the trick.
Note, that after it's applied, there is no need in taking the slab_mutex
for kmem_cache_shrink any more, so it is removed from there.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04 16:07:20 -07:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(kmem_cache_shrink);
|
|
|
|
|
2015-11-05 18:44:59 -08:00
|
|
|
bool slab_is_available(void)
|
2012-07-06 15:25:11 -05:00
|
|
|
{
|
|
|
|
return slab_state >= UP;
|
|
|
|
}
|
2012-10-19 18:20:25 +04:00
|
|
|
|
2021-01-07 13:46:11 -08:00
|
|
|
#ifdef CONFIG_PRINTK
|
2022-04-14 19:13:40 -07:00
|
|
|
static void kmem_obj_info(struct kmem_obj_info *kpp, void *object, struct slab *slab)
|
|
|
|
{
|
|
|
|
if (__kfence_obj_info(kpp, object, slab))
|
|
|
|
return;
|
|
|
|
__kmem_obj_info(kpp, object, slab);
|
|
|
|
}
|
|
|
|
|
mm: Add mem_dump_obj() to print source of memory block
There are kernel facilities such as per-CPU reference counts that give
error messages in generic handlers or callbacks, whose messages are
unenlightening. In the case of per-CPU reference-count underflow, this
is not a problem when creating a new use of this facility because in that
case the bug is almost certainly in the code implementing that new use.
However, trouble arises when deploying across many systems, which might
exercise corner cases that were not seen during development and testing.
Here, it would be really nice to get some kind of hint as to which of
several uses the underflow was caused by.
This commit therefore exposes a mem_dump_obj() function that takes
a pointer to memory (which must still be allocated if it has been
dynamically allocated) and prints available information on where that
memory came from. This pointer can reference the middle of the block as
well as the beginning of the block, as needed by things like RCU callback
functions and timer handlers that might not know where the beginning of
the memory block is. These functions and handlers can use mem_dump_obj()
to print out better hints as to where the problem might lie.
The information printed can depend on kernel configuration. For example,
the allocation return address can be printed only for slab and slub,
and even then only when the necessary debug has been enabled. For slab,
build with CONFIG_DEBUG_SLAB=y, and either use sizes with ample space
to the next power of two or use the SLAB_STORE_USER when creating the
kmem_cache structure. For slub, build with CONFIG_SLUB_DEBUG=y and
boot with slub_debug=U, or pass SLAB_STORE_USER to kmem_cache_create()
if more focused use is desired. Also for slub, use CONFIG_STACKTRACE
to enable printing of the allocation-time stack trace.
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: <linux-mm@kvack.org>
Reported-by: Andrii Nakryiko <andrii@kernel.org>
[ paulmck: Convert to printing and change names per Joonsoo Kim. ]
[ paulmck: Move slab definition per Stephen Rothwell and kbuild test robot. ]
[ paulmck: Handle CONFIG_MMU=n case where vmalloc() is kmalloc(). ]
[ paulmck: Apply Vlastimil Babka feedback on slab.c kmem_provenance(). ]
[ paulmck: Extract more info from !SLUB_DEBUG per Joonsoo Kim. ]
[ paulmck: Explicitly check for small pointers per Naresh Kamboju. ]
Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-12-07 17:41:02 -08:00
|
|
|
/**
|
|
|
|
* kmem_dump_obj - Print available slab provenance information
|
|
|
|
* @object: slab object for which to find provenance information.
|
|
|
|
*
|
|
|
|
* This function uses pr_cont(), so that the caller is expected to have
|
|
|
|
* printed out whatever preamble is appropriate. The provenance information
|
|
|
|
* depends on the type of object and on how much debugging is enabled.
|
|
|
|
* For a slab-cache object, the fact that it is a slab object is printed,
|
|
|
|
* and, if available, the slab name, return address, and stack trace from
|
2021-03-16 16:07:11 +05:30
|
|
|
* the allocation and last free path of that object.
|
mm: Add mem_dump_obj() to print source of memory block
There are kernel facilities such as per-CPU reference counts that give
error messages in generic handlers or callbacks, whose messages are
unenlightening. In the case of per-CPU reference-count underflow, this
is not a problem when creating a new use of this facility because in that
case the bug is almost certainly in the code implementing that new use.
However, trouble arises when deploying across many systems, which might
exercise corner cases that were not seen during development and testing.
Here, it would be really nice to get some kind of hint as to which of
several uses the underflow was caused by.
This commit therefore exposes a mem_dump_obj() function that takes
a pointer to memory (which must still be allocated if it has been
dynamically allocated) and prints available information on where that
memory came from. This pointer can reference the middle of the block as
well as the beginning of the block, as needed by things like RCU callback
functions and timer handlers that might not know where the beginning of
the memory block is. These functions and handlers can use mem_dump_obj()
to print out better hints as to where the problem might lie.
The information printed can depend on kernel configuration. For example,
the allocation return address can be printed only for slab and slub,
and even then only when the necessary debug has been enabled. For slab,
build with CONFIG_DEBUG_SLAB=y, and either use sizes with ample space
to the next power of two or use the SLAB_STORE_USER when creating the
kmem_cache structure. For slub, build with CONFIG_SLUB_DEBUG=y and
boot with slub_debug=U, or pass SLAB_STORE_USER to kmem_cache_create()
if more focused use is desired. Also for slub, use CONFIG_STACKTRACE
to enable printing of the allocation-time stack trace.
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: <linux-mm@kvack.org>
Reported-by: Andrii Nakryiko <andrii@kernel.org>
[ paulmck: Convert to printing and change names per Joonsoo Kim. ]
[ paulmck: Move slab definition per Stephen Rothwell and kbuild test robot. ]
[ paulmck: Handle CONFIG_MMU=n case where vmalloc() is kmalloc(). ]
[ paulmck: Apply Vlastimil Babka feedback on slab.c kmem_provenance(). ]
[ paulmck: Extract more info from !SLUB_DEBUG per Joonsoo Kim. ]
[ paulmck: Explicitly check for small pointers per Naresh Kamboju. ]
Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-12-07 17:41:02 -08:00
|
|
|
*
|
2023-08-05 11:17:25 +08:00
|
|
|
* Return: %true if the pointer is to a not-yet-freed object from
|
|
|
|
* kmalloc() or kmem_cache_alloc(), either %true or %false if the pointer
|
|
|
|
* is to an already-freed object, and %false otherwise.
|
mm: Add mem_dump_obj() to print source of memory block
There are kernel facilities such as per-CPU reference counts that give
error messages in generic handlers or callbacks, whose messages are
unenlightening. In the case of per-CPU reference-count underflow, this
is not a problem when creating a new use of this facility because in that
case the bug is almost certainly in the code implementing that new use.
However, trouble arises when deploying across many systems, which might
exercise corner cases that were not seen during development and testing.
Here, it would be really nice to get some kind of hint as to which of
several uses the underflow was caused by.
This commit therefore exposes a mem_dump_obj() function that takes
a pointer to memory (which must still be allocated if it has been
dynamically allocated) and prints available information on where that
memory came from. This pointer can reference the middle of the block as
well as the beginning of the block, as needed by things like RCU callback
functions and timer handlers that might not know where the beginning of
the memory block is. These functions and handlers can use mem_dump_obj()
to print out better hints as to where the problem might lie.
The information printed can depend on kernel configuration. For example,
the allocation return address can be printed only for slab and slub,
and even then only when the necessary debug has been enabled. For slab,
build with CONFIG_DEBUG_SLAB=y, and either use sizes with ample space
to the next power of two or use the SLAB_STORE_USER when creating the
kmem_cache structure. For slub, build with CONFIG_SLUB_DEBUG=y and
boot with slub_debug=U, or pass SLAB_STORE_USER to kmem_cache_create()
if more focused use is desired. Also for slub, use CONFIG_STACKTRACE
to enable printing of the allocation-time stack trace.
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: <linux-mm@kvack.org>
Reported-by: Andrii Nakryiko <andrii@kernel.org>
[ paulmck: Convert to printing and change names per Joonsoo Kim. ]
[ paulmck: Move slab definition per Stephen Rothwell and kbuild test robot. ]
[ paulmck: Handle CONFIG_MMU=n case where vmalloc() is kmalloc(). ]
[ paulmck: Apply Vlastimil Babka feedback on slab.c kmem_provenance(). ]
[ paulmck: Extract more info from !SLUB_DEBUG per Joonsoo Kim. ]
[ paulmck: Explicitly check for small pointers per Naresh Kamboju. ]
Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-12-07 17:41:02 -08:00
|
|
|
*/
|
2023-08-05 11:17:25 +08:00
|
|
|
bool kmem_dump_obj(void *object)
|
mm: Add mem_dump_obj() to print source of memory block
There are kernel facilities such as per-CPU reference counts that give
error messages in generic handlers or callbacks, whose messages are
unenlightening. In the case of per-CPU reference-count underflow, this
is not a problem when creating a new use of this facility because in that
case the bug is almost certainly in the code implementing that new use.
However, trouble arises when deploying across many systems, which might
exercise corner cases that were not seen during development and testing.
Here, it would be really nice to get some kind of hint as to which of
several uses the underflow was caused by.
This commit therefore exposes a mem_dump_obj() function that takes
a pointer to memory (which must still be allocated if it has been
dynamically allocated) and prints available information on where that
memory came from. This pointer can reference the middle of the block as
well as the beginning of the block, as needed by things like RCU callback
functions and timer handlers that might not know where the beginning of
the memory block is. These functions and handlers can use mem_dump_obj()
to print out better hints as to where the problem might lie.
The information printed can depend on kernel configuration. For example,
the allocation return address can be printed only for slab and slub,
and even then only when the necessary debug has been enabled. For slab,
build with CONFIG_DEBUG_SLAB=y, and either use sizes with ample space
to the next power of two or use the SLAB_STORE_USER when creating the
kmem_cache structure. For slub, build with CONFIG_SLUB_DEBUG=y and
boot with slub_debug=U, or pass SLAB_STORE_USER to kmem_cache_create()
if more focused use is desired. Also for slub, use CONFIG_STACKTRACE
to enable printing of the allocation-time stack trace.
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: <linux-mm@kvack.org>
Reported-by: Andrii Nakryiko <andrii@kernel.org>
[ paulmck: Convert to printing and change names per Joonsoo Kim. ]
[ paulmck: Move slab definition per Stephen Rothwell and kbuild test robot. ]
[ paulmck: Handle CONFIG_MMU=n case where vmalloc() is kmalloc(). ]
[ paulmck: Apply Vlastimil Babka feedback on slab.c kmem_provenance(). ]
[ paulmck: Extract more info from !SLUB_DEBUG per Joonsoo Kim. ]
[ paulmck: Explicitly check for small pointers per Naresh Kamboju. ]
Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-12-07 17:41:02 -08:00
|
|
|
{
|
|
|
|
char *cp = IS_ENABLED(CONFIG_MMU) ? "" : "/vmalloc";
|
|
|
|
int i;
|
2021-10-04 14:45:55 +01:00
|
|
|
struct slab *slab;
|
mm: Add mem_dump_obj() to print source of memory block
There are kernel facilities such as per-CPU reference counts that give
error messages in generic handlers or callbacks, whose messages are
unenlightening. In the case of per-CPU reference-count underflow, this
is not a problem when creating a new use of this facility because in that
case the bug is almost certainly in the code implementing that new use.
However, trouble arises when deploying across many systems, which might
exercise corner cases that were not seen during development and testing.
Here, it would be really nice to get some kind of hint as to which of
several uses the underflow was caused by.
This commit therefore exposes a mem_dump_obj() function that takes
a pointer to memory (which must still be allocated if it has been
dynamically allocated) and prints available information on where that
memory came from. This pointer can reference the middle of the block as
well as the beginning of the block, as needed by things like RCU callback
functions and timer handlers that might not know where the beginning of
the memory block is. These functions and handlers can use mem_dump_obj()
to print out better hints as to where the problem might lie.
The information printed can depend on kernel configuration. For example,
the allocation return address can be printed only for slab and slub,
and even then only when the necessary debug has been enabled. For slab,
build with CONFIG_DEBUG_SLAB=y, and either use sizes with ample space
to the next power of two or use the SLAB_STORE_USER when creating the
kmem_cache structure. For slub, build with CONFIG_SLUB_DEBUG=y and
boot with slub_debug=U, or pass SLAB_STORE_USER to kmem_cache_create()
if more focused use is desired. Also for slub, use CONFIG_STACKTRACE
to enable printing of the allocation-time stack trace.
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: <linux-mm@kvack.org>
Reported-by: Andrii Nakryiko <andrii@kernel.org>
[ paulmck: Convert to printing and change names per Joonsoo Kim. ]
[ paulmck: Move slab definition per Stephen Rothwell and kbuild test robot. ]
[ paulmck: Handle CONFIG_MMU=n case where vmalloc() is kmalloc(). ]
[ paulmck: Apply Vlastimil Babka feedback on slab.c kmem_provenance(). ]
[ paulmck: Extract more info from !SLUB_DEBUG per Joonsoo Kim. ]
[ paulmck: Explicitly check for small pointers per Naresh Kamboju. ]
Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-12-07 17:41:02 -08:00
|
|
|
unsigned long ptroffset;
|
|
|
|
struct kmem_obj_info kp = { };
|
|
|
|
|
2023-08-05 11:17:25 +08:00
|
|
|
/* Some arches consider ZERO_SIZE_PTR to be a valid address. */
|
|
|
|
if (object < (void *)PAGE_SIZE || !virt_addr_valid(object))
|
|
|
|
return false;
|
2021-10-04 14:45:55 +01:00
|
|
|
slab = virt_to_slab(object);
|
2023-08-05 11:17:25 +08:00
|
|
|
if (!slab)
|
|
|
|
return false;
|
|
|
|
|
2021-10-04 14:45:55 +01:00
|
|
|
kmem_obj_info(&kp, object, slab);
|
mm: Add mem_dump_obj() to print source of memory block
There are kernel facilities such as per-CPU reference counts that give
error messages in generic handlers or callbacks, whose messages are
unenlightening. In the case of per-CPU reference-count underflow, this
is not a problem when creating a new use of this facility because in that
case the bug is almost certainly in the code implementing that new use.
However, trouble arises when deploying across many systems, which might
exercise corner cases that were not seen during development and testing.
Here, it would be really nice to get some kind of hint as to which of
several uses the underflow was caused by.
This commit therefore exposes a mem_dump_obj() function that takes
a pointer to memory (which must still be allocated if it has been
dynamically allocated) and prints available information on where that
memory came from. This pointer can reference the middle of the block as
well as the beginning of the block, as needed by things like RCU callback
functions and timer handlers that might not know where the beginning of
the memory block is. These functions and handlers can use mem_dump_obj()
to print out better hints as to where the problem might lie.
The information printed can depend on kernel configuration. For example,
the allocation return address can be printed only for slab and slub,
and even then only when the necessary debug has been enabled. For slab,
build with CONFIG_DEBUG_SLAB=y, and either use sizes with ample space
to the next power of two or use the SLAB_STORE_USER when creating the
kmem_cache structure. For slub, build with CONFIG_SLUB_DEBUG=y and
boot with slub_debug=U, or pass SLAB_STORE_USER to kmem_cache_create()
if more focused use is desired. Also for slub, use CONFIG_STACKTRACE
to enable printing of the allocation-time stack trace.
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: <linux-mm@kvack.org>
Reported-by: Andrii Nakryiko <andrii@kernel.org>
[ paulmck: Convert to printing and change names per Joonsoo Kim. ]
[ paulmck: Move slab definition per Stephen Rothwell and kbuild test robot. ]
[ paulmck: Handle CONFIG_MMU=n case where vmalloc() is kmalloc(). ]
[ paulmck: Apply Vlastimil Babka feedback on slab.c kmem_provenance(). ]
[ paulmck: Extract more info from !SLUB_DEBUG per Joonsoo Kim. ]
[ paulmck: Explicitly check for small pointers per Naresh Kamboju. ]
Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-12-07 17:41:02 -08:00
|
|
|
if (kp.kp_slab_cache)
|
|
|
|
pr_cont(" slab%s %s", cp, kp.kp_slab_cache->name);
|
|
|
|
else
|
|
|
|
pr_cont(" slab%s", cp);
|
2022-04-14 19:13:40 -07:00
|
|
|
if (is_kfence_address(object))
|
|
|
|
pr_cont(" (kfence)");
|
mm: Add mem_dump_obj() to print source of memory block
There are kernel facilities such as per-CPU reference counts that give
error messages in generic handlers or callbacks, whose messages are
unenlightening. In the case of per-CPU reference-count underflow, this
is not a problem when creating a new use of this facility because in that
case the bug is almost certainly in the code implementing that new use.
However, trouble arises when deploying across many systems, which might
exercise corner cases that were not seen during development and testing.
Here, it would be really nice to get some kind of hint as to which of
several uses the underflow was caused by.
This commit therefore exposes a mem_dump_obj() function that takes
a pointer to memory (which must still be allocated if it has been
dynamically allocated) and prints available information on where that
memory came from. This pointer can reference the middle of the block as
well as the beginning of the block, as needed by things like RCU callback
functions and timer handlers that might not know where the beginning of
the memory block is. These functions and handlers can use mem_dump_obj()
to print out better hints as to where the problem might lie.
The information printed can depend on kernel configuration. For example,
the allocation return address can be printed only for slab and slub,
and even then only when the necessary debug has been enabled. For slab,
build with CONFIG_DEBUG_SLAB=y, and either use sizes with ample space
to the next power of two or use the SLAB_STORE_USER when creating the
kmem_cache structure. For slub, build with CONFIG_SLUB_DEBUG=y and
boot with slub_debug=U, or pass SLAB_STORE_USER to kmem_cache_create()
if more focused use is desired. Also for slub, use CONFIG_STACKTRACE
to enable printing of the allocation-time stack trace.
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: <linux-mm@kvack.org>
Reported-by: Andrii Nakryiko <andrii@kernel.org>
[ paulmck: Convert to printing and change names per Joonsoo Kim. ]
[ paulmck: Move slab definition per Stephen Rothwell and kbuild test robot. ]
[ paulmck: Handle CONFIG_MMU=n case where vmalloc() is kmalloc(). ]
[ paulmck: Apply Vlastimil Babka feedback on slab.c kmem_provenance(). ]
[ paulmck: Extract more info from !SLUB_DEBUG per Joonsoo Kim. ]
[ paulmck: Explicitly check for small pointers per Naresh Kamboju. ]
Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-12-07 17:41:02 -08:00
|
|
|
if (kp.kp_objp)
|
|
|
|
pr_cont(" start %px", kp.kp_objp);
|
|
|
|
if (kp.kp_data_offset)
|
|
|
|
pr_cont(" data offset %lu", kp.kp_data_offset);
|
|
|
|
if (kp.kp_objp) {
|
|
|
|
ptroffset = ((char *)object - (char *)kp.kp_objp) - kp.kp_data_offset;
|
|
|
|
pr_cont(" pointer offset %lu", ptroffset);
|
|
|
|
}
|
2022-11-16 15:56:32 +01:00
|
|
|
if (kp.kp_slab_cache && kp.kp_slab_cache->object_size)
|
|
|
|
pr_cont(" size %u", kp.kp_slab_cache->object_size);
|
mm: Add mem_dump_obj() to print source of memory block
There are kernel facilities such as per-CPU reference counts that give
error messages in generic handlers or callbacks, whose messages are
unenlightening. In the case of per-CPU reference-count underflow, this
is not a problem when creating a new use of this facility because in that
case the bug is almost certainly in the code implementing that new use.
However, trouble arises when deploying across many systems, which might
exercise corner cases that were not seen during development and testing.
Here, it would be really nice to get some kind of hint as to which of
several uses the underflow was caused by.
This commit therefore exposes a mem_dump_obj() function that takes
a pointer to memory (which must still be allocated if it has been
dynamically allocated) and prints available information on where that
memory came from. This pointer can reference the middle of the block as
well as the beginning of the block, as needed by things like RCU callback
functions and timer handlers that might not know where the beginning of
the memory block is. These functions and handlers can use mem_dump_obj()
to print out better hints as to where the problem might lie.
The information printed can depend on kernel configuration. For example,
the allocation return address can be printed only for slab and slub,
and even then only when the necessary debug has been enabled. For slab,
build with CONFIG_DEBUG_SLAB=y, and either use sizes with ample space
to the next power of two or use the SLAB_STORE_USER when creating the
kmem_cache structure. For slub, build with CONFIG_SLUB_DEBUG=y and
boot with slub_debug=U, or pass SLAB_STORE_USER to kmem_cache_create()
if more focused use is desired. Also for slub, use CONFIG_STACKTRACE
to enable printing of the allocation-time stack trace.
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: <linux-mm@kvack.org>
Reported-by: Andrii Nakryiko <andrii@kernel.org>
[ paulmck: Convert to printing and change names per Joonsoo Kim. ]
[ paulmck: Move slab definition per Stephen Rothwell and kbuild test robot. ]
[ paulmck: Handle CONFIG_MMU=n case where vmalloc() is kmalloc(). ]
[ paulmck: Apply Vlastimil Babka feedback on slab.c kmem_provenance(). ]
[ paulmck: Extract more info from !SLUB_DEBUG per Joonsoo Kim. ]
[ paulmck: Explicitly check for small pointers per Naresh Kamboju. ]
Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-12-07 17:41:02 -08:00
|
|
|
if (kp.kp_ret)
|
|
|
|
pr_cont(" allocated at %pS\n", kp.kp_ret);
|
|
|
|
else
|
|
|
|
pr_cont("\n");
|
|
|
|
for (i = 0; i < ARRAY_SIZE(kp.kp_stack); i++) {
|
|
|
|
if (!kp.kp_stack[i])
|
|
|
|
break;
|
|
|
|
pr_info(" %pS\n", kp.kp_stack[i]);
|
|
|
|
}
|
2021-03-16 16:07:11 +05:30
|
|
|
|
|
|
|
if (kp.kp_free_stack[0])
|
|
|
|
pr_cont(" Free path:\n");
|
|
|
|
|
|
|
|
for (i = 0; i < ARRAY_SIZE(kp.kp_free_stack); i++) {
|
|
|
|
if (!kp.kp_free_stack[i])
|
|
|
|
break;
|
|
|
|
pr_info(" %pS\n", kp.kp_free_stack[i]);
|
|
|
|
}
|
|
|
|
|
2023-08-05 11:17:25 +08:00
|
|
|
return true;
|
mm: Add mem_dump_obj() to print source of memory block
There are kernel facilities such as per-CPU reference counts that give
error messages in generic handlers or callbacks, whose messages are
unenlightening. In the case of per-CPU reference-count underflow, this
is not a problem when creating a new use of this facility because in that
case the bug is almost certainly in the code implementing that new use.
However, trouble arises when deploying across many systems, which might
exercise corner cases that were not seen during development and testing.
Here, it would be really nice to get some kind of hint as to which of
several uses the underflow was caused by.
This commit therefore exposes a mem_dump_obj() function that takes
a pointer to memory (which must still be allocated if it has been
dynamically allocated) and prints available information on where that
memory came from. This pointer can reference the middle of the block as
well as the beginning of the block, as needed by things like RCU callback
functions and timer handlers that might not know where the beginning of
the memory block is. These functions and handlers can use mem_dump_obj()
to print out better hints as to where the problem might lie.
The information printed can depend on kernel configuration. For example,
the allocation return address can be printed only for slab and slub,
and even then only when the necessary debug has been enabled. For slab,
build with CONFIG_DEBUG_SLAB=y, and either use sizes with ample space
to the next power of two or use the SLAB_STORE_USER when creating the
kmem_cache structure. For slub, build with CONFIG_SLUB_DEBUG=y and
boot with slub_debug=U, or pass SLAB_STORE_USER to kmem_cache_create()
if more focused use is desired. Also for slub, use CONFIG_STACKTRACE
to enable printing of the allocation-time stack trace.
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: <linux-mm@kvack.org>
Reported-by: Andrii Nakryiko <andrii@kernel.org>
[ paulmck: Convert to printing and change names per Joonsoo Kim. ]
[ paulmck: Move slab definition per Stephen Rothwell and kbuild test robot. ]
[ paulmck: Handle CONFIG_MMU=n case where vmalloc() is kmalloc(). ]
[ paulmck: Apply Vlastimil Babka feedback on slab.c kmem_provenance(). ]
[ paulmck: Extract more info from !SLUB_DEBUG per Joonsoo Kim. ]
[ paulmck: Explicitly check for small pointers per Naresh Kamboju. ]
Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-12-07 17:41:02 -08:00
|
|
|
}
|
2020-12-07 21:23:36 -08:00
|
|
|
EXPORT_SYMBOL_GPL(kmem_dump_obj);
|
2021-01-07 13:46:11 -08:00
|
|
|
#endif
|
mm: Add mem_dump_obj() to print source of memory block
There are kernel facilities such as per-CPU reference counts that give
error messages in generic handlers or callbacks, whose messages are
unenlightening. In the case of per-CPU reference-count underflow, this
is not a problem when creating a new use of this facility because in that
case the bug is almost certainly in the code implementing that new use.
However, trouble arises when deploying across many systems, which might
exercise corner cases that were not seen during development and testing.
Here, it would be really nice to get some kind of hint as to which of
several uses the underflow was caused by.
This commit therefore exposes a mem_dump_obj() function that takes
a pointer to memory (which must still be allocated if it has been
dynamically allocated) and prints available information on where that
memory came from. This pointer can reference the middle of the block as
well as the beginning of the block, as needed by things like RCU callback
functions and timer handlers that might not know where the beginning of
the memory block is. These functions and handlers can use mem_dump_obj()
to print out better hints as to where the problem might lie.
The information printed can depend on kernel configuration. For example,
the allocation return address can be printed only for slab and slub,
and even then only when the necessary debug has been enabled. For slab,
build with CONFIG_DEBUG_SLAB=y, and either use sizes with ample space
to the next power of two or use the SLAB_STORE_USER when creating the
kmem_cache structure. For slub, build with CONFIG_SLUB_DEBUG=y and
boot with slub_debug=U, or pass SLAB_STORE_USER to kmem_cache_create()
if more focused use is desired. Also for slub, use CONFIG_STACKTRACE
to enable printing of the allocation-time stack trace.
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: <linux-mm@kvack.org>
Reported-by: Andrii Nakryiko <andrii@kernel.org>
[ paulmck: Convert to printing and change names per Joonsoo Kim. ]
[ paulmck: Move slab definition per Stephen Rothwell and kbuild test robot. ]
[ paulmck: Handle CONFIG_MMU=n case where vmalloc() is kmalloc(). ]
[ paulmck: Apply Vlastimil Babka feedback on slab.c kmem_provenance(). ]
[ paulmck: Extract more info from !SLUB_DEBUG per Joonsoo Kim. ]
[ paulmck: Explicitly check for small pointers per Naresh Kamboju. ]
Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-12-07 17:41:02 -08:00
|
|
|
|
2012-11-28 16:23:07 +00:00
|
|
|
/* Create a cache during boot when no slab services are available yet */
|
2018-04-05 16:20:33 -07:00
|
|
|
void __init create_boot_cache(struct kmem_cache *s, const char *name,
|
|
|
|
unsigned int size, slab_flags_t flags,
|
|
|
|
unsigned int useroffset, unsigned int usersize)
|
2012-11-28 16:23:07 +00:00
|
|
|
{
|
|
|
|
int err;
|
mm, sl[aou]b: guarantee natural alignment for kmalloc(power-of-two)
In most configurations, kmalloc() happens to return naturally aligned
(i.e. aligned to the block size itself) blocks for power of two sizes.
That means some kmalloc() users might unknowingly rely on that
alignment, until stuff breaks when the kernel is built with e.g.
CONFIG_SLUB_DEBUG or CONFIG_SLOB, and blocks stop being aligned. Then
developers have to devise workaround such as own kmem caches with
specified alignment [1], which is not always practical, as recently
evidenced in [2].
The topic has been discussed at LSF/MM 2019 [3]. Adding a
'kmalloc_aligned()' variant would not help with code unknowingly relying
on the implicit alignment. For slab implementations it would either
require creating more kmalloc caches, or allocate a larger size and only
give back part of it. That would be wasteful, especially with a generic
alignment parameter (in contrast with a fixed alignment to size).
Ideally we should provide to mm users what they need without difficult
workarounds or own reimplementations, so let's make the kmalloc()
alignment to size explicitly guaranteed for power-of-two sizes under all
configurations. What this means for the three available allocators?
* SLAB object layout happens to be mostly unchanged by the patch. The
implicitly provided alignment could be compromised with
CONFIG_DEBUG_SLAB due to redzoning, however SLAB disables redzoning for
caches with alignment larger than unsigned long long. Practically on at
least x86 this includes kmalloc caches as they use cache line alignment,
which is larger than that. Still, this patch ensures alignment on all
arches and cache sizes.
* SLUB layout is also unchanged unless redzoning is enabled through
CONFIG_SLUB_DEBUG and boot parameter for the particular kmalloc cache.
With this patch, explicit alignment is guaranteed with redzoning as
well. This will result in more memory being wasted, but that should be
acceptable in a debugging scenario.
* SLOB has no implicit alignment so this patch adds it explicitly for
kmalloc(). The potential downside is increased fragmentation. While
pathological allocation scenarios are certainly possible, in my testing,
after booting a x86_64 kernel+userspace with virtme, around 16MB memory
was consumed by slab pages both before and after the patch, with
difference in the noise.
[1] https://lore.kernel.org/linux-btrfs/c3157c8e8e0e7588312b40c853f65c02fe6c957a.1566399731.git.christophe.leroy@c-s.fr/
[2] https://lore.kernel.org/linux-fsdevel/20190225040904.5557-1-ming.lei@redhat.com/
[3] https://lwn.net/Articles/787740/
[akpm@linux-foundation.org: documentation fixlet, per Matthew]
Link: http://lkml.kernel.org/r/20190826111627.7505-3-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Cc: David Sterba <dsterba@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: "Darrick J . Wong" <darrick.wong@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-06 17:58:45 -07:00
|
|
|
unsigned int align = ARCH_KMALLOC_MINALIGN;
|
2024-09-05 09:56:51 +02:00
|
|
|
struct kmem_cache_args kmem_args = {};
|
mm, sl[aou]b: guarantee natural alignment for kmalloc(power-of-two)
In most configurations, kmalloc() happens to return naturally aligned
(i.e. aligned to the block size itself) blocks for power of two sizes.
That means some kmalloc() users might unknowingly rely on that
alignment, until stuff breaks when the kernel is built with e.g.
CONFIG_SLUB_DEBUG or CONFIG_SLOB, and blocks stop being aligned. Then
developers have to devise workaround such as own kmem caches with
specified alignment [1], which is not always practical, as recently
evidenced in [2].
The topic has been discussed at LSF/MM 2019 [3]. Adding a
'kmalloc_aligned()' variant would not help with code unknowingly relying
on the implicit alignment. For slab implementations it would either
require creating more kmalloc caches, or allocate a larger size and only
give back part of it. That would be wasteful, especially with a generic
alignment parameter (in contrast with a fixed alignment to size).
Ideally we should provide to mm users what they need without difficult
workarounds or own reimplementations, so let's make the kmalloc()
alignment to size explicitly guaranteed for power-of-two sizes under all
configurations. What this means for the three available allocators?
* SLAB object layout happens to be mostly unchanged by the patch. The
implicitly provided alignment could be compromised with
CONFIG_DEBUG_SLAB due to redzoning, however SLAB disables redzoning for
caches with alignment larger than unsigned long long. Practically on at
least x86 this includes kmalloc caches as they use cache line alignment,
which is larger than that. Still, this patch ensures alignment on all
arches and cache sizes.
* SLUB layout is also unchanged unless redzoning is enabled through
CONFIG_SLUB_DEBUG and boot parameter for the particular kmalloc cache.
With this patch, explicit alignment is guaranteed with redzoning as
well. This will result in more memory being wasted, but that should be
acceptable in a debugging scenario.
* SLOB has no implicit alignment so this patch adds it explicitly for
kmalloc(). The potential downside is increased fragmentation. While
pathological allocation scenarios are certainly possible, in my testing,
after booting a x86_64 kernel+userspace with virtme, around 16MB memory
was consumed by slab pages both before and after the patch, with
difference in the noise.
[1] https://lore.kernel.org/linux-btrfs/c3157c8e8e0e7588312b40c853f65c02fe6c957a.1566399731.git.christophe.leroy@c-s.fr/
[2] https://lore.kernel.org/linux-fsdevel/20190225040904.5557-1-ming.lei@redhat.com/
[3] https://lwn.net/Articles/787740/
[akpm@linux-foundation.org: documentation fixlet, per Matthew]
Link: http://lkml.kernel.org/r/20190826111627.7505-3-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Cc: David Sterba <dsterba@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: "Darrick J . Wong" <darrick.wong@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-06 17:58:45 -07:00
|
|
|
|
|
|
|
/*
|
slab, rust: extend kmalloc() alignment guarantees to remove Rust padding
Slab allocators have been guaranteeing natural alignment for
power-of-two sizes since commit 59bb47985c1d ("mm, sl[aou]b: guarantee
natural alignment for kmalloc(power-of-two)"), while any other sizes are
guaranteed to be aligned only to ARCH_KMALLOC_MINALIGN bytes (although
in practice are aligned more than that in non-debug scenarios).
Rust's allocator API specifies size and alignment per allocation, which
have to satisfy the following rules, per Alice Ryhl [1]:
1. The alignment is a power of two.
2. The size is non-zero.
3. When you round up the size to the next multiple of the alignment,
then it must not overflow the signed type isize / ssize_t.
In order to map this to kmalloc()'s guarantees, some requested
allocation sizes have to be padded to the next power-of-two size [2].
For example, an allocation of size 96 and alignment of 32 will be padded
to an allocation of size 128, because the existing kmalloc-96 bucket
doesn't guarantee alignent above ARCH_KMALLOC_MINALIGN. Without slab
debugging active, the layout of the kmalloc-96 slabs however naturally
align the objects to 32 bytes, so extending the size to 128 bytes is
wasteful.
To improve the situation we can extend the kmalloc() alignment
guarantees in a way that
1) doesn't change the current slab layout (and thus does not increase
internal fragmentation) when slab debugging is not active
2) reduces waste in the Rust allocator use case
3) is a superset of the current guarantee for power-of-two sizes.
The extended guarantee is that alignment is at least the largest
power-of-two divisor of the requested size. For power-of-two sizes the
largest divisor is the size itself, but let's keep this case documented
separately for clarity.
For current kmalloc size buckets, it means kmalloc-96 will guarantee
alignment of 32 bytes and kmalloc-196 will guarantee 64 bytes.
This covers the rules 1 and 2 above of Rust's API as long as the size is
a multiple of the alignment. The Rust layer should now only need to
round up the size to the next multiple if it isn't, while enforcing the
rule 3.
Implementation-wise, this changes the alignment calculation in
create_boot_cache(). While at it also do the calulation only for caches
with the SLAB_KMALLOC flag, because the function is also used to create
the initial kmem_cache and kmem_cache_node caches, where no alignment
guarantee is necessary.
In the Rust allocator's krealloc_aligned(), remove the code that padded
sizes to the next power of two (suggested by Alice Ryhl) as it's no
longer necessary with the new guarantees.
Reported-by: Alice Ryhl <aliceryhl@google.com>
Reported-by: Boqun Feng <boqun.feng@gmail.com>
Link: https://lore.kernel.org/all/CAH5fLggjrbdUuT-H-5vbQfMazjRDpp2%2Bk3%3DYhPyS17ezEqxwcw@mail.gmail.com/ [1]
Link: https://lore.kernel.org/all/CAH5fLghsZRemYUwVvhk77o6y1foqnCeDzW4WZv6ScEWna2+_jw@mail.gmail.com/ [2]
Reviewed-by: Boqun Feng <boqun.feng@gmail.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Reviewed-by: Alice Ryhl <aliceryhl@google.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2024-07-03 09:25:21 +02:00
|
|
|
* kmalloc caches guarantee alignment of at least the largest
|
|
|
|
* power-of-two divisor of the size. For power-of-two sizes,
|
|
|
|
* it is the size itself.
|
mm, sl[aou]b: guarantee natural alignment for kmalloc(power-of-two)
In most configurations, kmalloc() happens to return naturally aligned
(i.e. aligned to the block size itself) blocks for power of two sizes.
That means some kmalloc() users might unknowingly rely on that
alignment, until stuff breaks when the kernel is built with e.g.
CONFIG_SLUB_DEBUG or CONFIG_SLOB, and blocks stop being aligned. Then
developers have to devise workaround such as own kmem caches with
specified alignment [1], which is not always practical, as recently
evidenced in [2].
The topic has been discussed at LSF/MM 2019 [3]. Adding a
'kmalloc_aligned()' variant would not help with code unknowingly relying
on the implicit alignment. For slab implementations it would either
require creating more kmalloc caches, or allocate a larger size and only
give back part of it. That would be wasteful, especially with a generic
alignment parameter (in contrast with a fixed alignment to size).
Ideally we should provide to mm users what they need without difficult
workarounds or own reimplementations, so let's make the kmalloc()
alignment to size explicitly guaranteed for power-of-two sizes under all
configurations. What this means for the three available allocators?
* SLAB object layout happens to be mostly unchanged by the patch. The
implicitly provided alignment could be compromised with
CONFIG_DEBUG_SLAB due to redzoning, however SLAB disables redzoning for
caches with alignment larger than unsigned long long. Practically on at
least x86 this includes kmalloc caches as they use cache line alignment,
which is larger than that. Still, this patch ensures alignment on all
arches and cache sizes.
* SLUB layout is also unchanged unless redzoning is enabled through
CONFIG_SLUB_DEBUG and boot parameter for the particular kmalloc cache.
With this patch, explicit alignment is guaranteed with redzoning as
well. This will result in more memory being wasted, but that should be
acceptable in a debugging scenario.
* SLOB has no implicit alignment so this patch adds it explicitly for
kmalloc(). The potential downside is increased fragmentation. While
pathological allocation scenarios are certainly possible, in my testing,
after booting a x86_64 kernel+userspace with virtme, around 16MB memory
was consumed by slab pages both before and after the patch, with
difference in the noise.
[1] https://lore.kernel.org/linux-btrfs/c3157c8e8e0e7588312b40c853f65c02fe6c957a.1566399731.git.christophe.leroy@c-s.fr/
[2] https://lore.kernel.org/linux-fsdevel/20190225040904.5557-1-ming.lei@redhat.com/
[3] https://lwn.net/Articles/787740/
[akpm@linux-foundation.org: documentation fixlet, per Matthew]
Link: http://lkml.kernel.org/r/20190826111627.7505-3-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Cc: David Sterba <dsterba@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: "Darrick J . Wong" <darrick.wong@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-06 17:58:45 -07:00
|
|
|
*/
|
slab, rust: extend kmalloc() alignment guarantees to remove Rust padding
Slab allocators have been guaranteeing natural alignment for
power-of-two sizes since commit 59bb47985c1d ("mm, sl[aou]b: guarantee
natural alignment for kmalloc(power-of-two)"), while any other sizes are
guaranteed to be aligned only to ARCH_KMALLOC_MINALIGN bytes (although
in practice are aligned more than that in non-debug scenarios).
Rust's allocator API specifies size and alignment per allocation, which
have to satisfy the following rules, per Alice Ryhl [1]:
1. The alignment is a power of two.
2. The size is non-zero.
3. When you round up the size to the next multiple of the alignment,
then it must not overflow the signed type isize / ssize_t.
In order to map this to kmalloc()'s guarantees, some requested
allocation sizes have to be padded to the next power-of-two size [2].
For example, an allocation of size 96 and alignment of 32 will be padded
to an allocation of size 128, because the existing kmalloc-96 bucket
doesn't guarantee alignent above ARCH_KMALLOC_MINALIGN. Without slab
debugging active, the layout of the kmalloc-96 slabs however naturally
align the objects to 32 bytes, so extending the size to 128 bytes is
wasteful.
To improve the situation we can extend the kmalloc() alignment
guarantees in a way that
1) doesn't change the current slab layout (and thus does not increase
internal fragmentation) when slab debugging is not active
2) reduces waste in the Rust allocator use case
3) is a superset of the current guarantee for power-of-two sizes.
The extended guarantee is that alignment is at least the largest
power-of-two divisor of the requested size. For power-of-two sizes the
largest divisor is the size itself, but let's keep this case documented
separately for clarity.
For current kmalloc size buckets, it means kmalloc-96 will guarantee
alignment of 32 bytes and kmalloc-196 will guarantee 64 bytes.
This covers the rules 1 and 2 above of Rust's API as long as the size is
a multiple of the alignment. The Rust layer should now only need to
round up the size to the next multiple if it isn't, while enforcing the
rule 3.
Implementation-wise, this changes the alignment calculation in
create_boot_cache(). While at it also do the calulation only for caches
with the SLAB_KMALLOC flag, because the function is also used to create
the initial kmem_cache and kmem_cache_node caches, where no alignment
guarantee is necessary.
In the Rust allocator's krealloc_aligned(), remove the code that padded
sizes to the next power of two (suggested by Alice Ryhl) as it's no
longer necessary with the new guarantees.
Reported-by: Alice Ryhl <aliceryhl@google.com>
Reported-by: Boqun Feng <boqun.feng@gmail.com>
Link: https://lore.kernel.org/all/CAH5fLggjrbdUuT-H-5vbQfMazjRDpp2%2Bk3%3DYhPyS17ezEqxwcw@mail.gmail.com/ [1]
Link: https://lore.kernel.org/all/CAH5fLghsZRemYUwVvhk77o6y1foqnCeDzW4WZv6ScEWna2+_jw@mail.gmail.com/ [2]
Reviewed-by: Boqun Feng <boqun.feng@gmail.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Reviewed-by: Alice Ryhl <aliceryhl@google.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2024-07-03 09:25:21 +02:00
|
|
|
if (flags & SLAB_KMALLOC)
|
|
|
|
align = max(align, 1U << (ffs(size) - 1));
|
2024-09-05 09:56:51 +02:00
|
|
|
kmem_args.align = calculate_alignment(flags, align, size);
|
mm, sl[aou]b: guarantee natural alignment for kmalloc(power-of-two)
In most configurations, kmalloc() happens to return naturally aligned
(i.e. aligned to the block size itself) blocks for power of two sizes.
That means some kmalloc() users might unknowingly rely on that
alignment, until stuff breaks when the kernel is built with e.g.
CONFIG_SLUB_DEBUG or CONFIG_SLOB, and blocks stop being aligned. Then
developers have to devise workaround such as own kmem caches with
specified alignment [1], which is not always practical, as recently
evidenced in [2].
The topic has been discussed at LSF/MM 2019 [3]. Adding a
'kmalloc_aligned()' variant would not help with code unknowingly relying
on the implicit alignment. For slab implementations it would either
require creating more kmalloc caches, or allocate a larger size and only
give back part of it. That would be wasteful, especially with a generic
alignment parameter (in contrast with a fixed alignment to size).
Ideally we should provide to mm users what they need without difficult
workarounds or own reimplementations, so let's make the kmalloc()
alignment to size explicitly guaranteed for power-of-two sizes under all
configurations. What this means for the three available allocators?
* SLAB object layout happens to be mostly unchanged by the patch. The
implicitly provided alignment could be compromised with
CONFIG_DEBUG_SLAB due to redzoning, however SLAB disables redzoning for
caches with alignment larger than unsigned long long. Practically on at
least x86 this includes kmalloc caches as they use cache line alignment,
which is larger than that. Still, this patch ensures alignment on all
arches and cache sizes.
* SLUB layout is also unchanged unless redzoning is enabled through
CONFIG_SLUB_DEBUG and boot parameter for the particular kmalloc cache.
With this patch, explicit alignment is guaranteed with redzoning as
well. This will result in more memory being wasted, but that should be
acceptable in a debugging scenario.
* SLOB has no implicit alignment so this patch adds it explicitly for
kmalloc(). The potential downside is increased fragmentation. While
pathological allocation scenarios are certainly possible, in my testing,
after booting a x86_64 kernel+userspace with virtme, around 16MB memory
was consumed by slab pages both before and after the patch, with
difference in the noise.
[1] https://lore.kernel.org/linux-btrfs/c3157c8e8e0e7588312b40c853f65c02fe6c957a.1566399731.git.christophe.leroy@c-s.fr/
[2] https://lore.kernel.org/linux-fsdevel/20190225040904.5557-1-ming.lei@redhat.com/
[3] https://lwn.net/Articles/787740/
[akpm@linux-foundation.org: documentation fixlet, per Matthew]
Link: http://lkml.kernel.org/r/20190826111627.7505-3-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Cc: David Sterba <dsterba@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: "Darrick J . Wong" <darrick.wong@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-06 17:58:45 -07:00
|
|
|
|
2022-11-16 15:56:32 +01:00
|
|
|
#ifdef CONFIG_HARDENED_USERCOPY
|
2024-09-05 09:56:51 +02:00
|
|
|
kmem_args.useroffset = useroffset;
|
|
|
|
kmem_args.usersize = usersize;
|
2022-11-16 15:56:32 +01:00
|
|
|
#endif
|
2015-02-12 14:59:20 -08:00
|
|
|
|
2024-09-05 09:56:51 +02:00
|
|
|
err = do_kmem_cache_create(s, name, size, &kmem_args, flags);
|
2012-11-28 16:23:07 +00:00
|
|
|
|
|
|
|
if (err)
|
2018-04-05 16:20:33 -07:00
|
|
|
panic("Creation of kmalloc slab %s size=%u failed. Reason %d\n",
|
2012-11-28 16:23:07 +00:00
|
|
|
name, size, err);
|
|
|
|
|
|
|
|
s->refcount = -1; /* Exempt from merging for now */
|
|
|
|
}
|
|
|
|
|
2023-06-12 16:31:47 +01:00
|
|
|
static struct kmem_cache *__init create_kmalloc_cache(const char *name,
|
|
|
|
unsigned int size,
|
|
|
|
slab_flags_t flags)
|
2012-11-28 16:23:07 +00:00
|
|
|
{
|
|
|
|
struct kmem_cache *s = kmem_cache_zalloc(kmem_cache, GFP_NOWAIT);
|
|
|
|
|
|
|
|
if (!s)
|
|
|
|
panic("Out of memory when creating slab %s\n", name);
|
|
|
|
|
2023-06-12 16:31:47 +01:00
|
|
|
create_boot_cache(s, name, size, flags | SLAB_KMALLOC, 0, size);
|
2012-11-28 16:23:07 +00:00
|
|
|
list_add(&s->list, &slab_caches);
|
|
|
|
s->refcount = 1;
|
|
|
|
return s;
|
|
|
|
}
|
|
|
|
|
2024-07-01 12:12:58 -07:00
|
|
|
kmem_buckets kmalloc_caches[NR_KMALLOC_TYPES] __ro_after_init =
|
2024-01-09 15:16:31 -07:00
|
|
|
{ /* initialization for https://llvm.org/pr42570 */ };
|
2013-01-10 19:12:17 +00:00
|
|
|
EXPORT_SYMBOL(kmalloc_caches);
|
|
|
|
|
Randomized slab caches for kmalloc()
When exploiting memory vulnerabilities, "heap spraying" is a common
technique targeting those related to dynamic memory allocation (i.e. the
"heap"), and it plays an important role in a successful exploitation.
Basically, it is to overwrite the memory area of vulnerable object by
triggering allocation in other subsystems or modules and therefore
getting a reference to the targeted memory location. It's usable on
various types of vulnerablity including use after free (UAF), heap out-
of-bound write and etc.
There are (at least) two reasons why the heap can be sprayed: 1) generic
slab caches are shared among different subsystems and modules, and
2) dedicated slab caches could be merged with the generic ones.
Currently these two factors cannot be prevented at a low cost: the first
one is a widely used memory allocation mechanism, and shutting down slab
merging completely via `slub_nomerge` would be overkill.
To efficiently prevent heap spraying, we propose the following approach:
to create multiple copies of generic slab caches that will never be
merged, and random one of them will be used at allocation. The random
selection is based on the address of code that calls `kmalloc()`, which
means it is static at runtime (rather than dynamically determined at
each time of allocation, which could be bypassed by repeatedly spraying
in brute force). In other words, the randomness of cache selection will
be with respect to the code address rather than time, i.e. allocations
in different code paths would most likely pick different caches,
although kmalloc() at each place would use the same cache copy whenever
it is executed. In this way, the vulnerable object and memory allocated
in other subsystems and modules will (most probably) be on different
slab caches, which prevents the object from being sprayed.
Meanwhile, the static random selection is further enhanced with a
per-boot random seed, which prevents the attacker from finding a usable
kmalloc that happens to pick the same cache with the vulnerable
subsystem/module by analyzing the open source code. In other words, with
the per-boot seed, the random selection is static during each time the
system starts and runs, but not across different system startups.
The overhead of performance has been tested on a 40-core x86 server by
comparing the results of `perf bench all` between the kernels with and
without this patch based on the latest linux-next kernel, which shows
minor difference. A subset of benchmarks are listed below:
sched/ sched/ syscall/ mem/ mem/
messaging pipe basic memcpy memset
(sec) (sec) (sec) (GB/sec) (GB/sec)
control1 0.019 5.459 0.733 15.258789 51.398026
control2 0.019 5.439 0.730 16.009221 48.828125
control3 0.019 5.282 0.735 16.009221 48.828125
control_avg 0.019 5.393 0.733 15.759077 49.684759
experiment1 0.019 5.374 0.741 15.500992 46.502976
experiment2 0.019 5.440 0.746 16.276042 51.398026
experiment3 0.019 5.242 0.752 15.258789 51.398026
experiment_avg 0.019 5.352 0.746 15.678608 49.766343
The overhead of memory usage was measured by executing `free` after boot
on a QEMU VM with 1GB total memory, and as expected, it's positively
correlated with # of cache copies:
control 4 copies 8 copies 16 copies
total 969.8M 968.2M 968.2M 968.2M
used 20.0M 21.9M 24.1M 26.7M
free 936.9M 933.6M 931.4M 928.6M
available 932.2M 928.8M 926.6M 923.9M
Co-developed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: GONG, Ruiqi <gongruiqi@huaweicloud.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Acked-by: Dennis Zhou <dennis@kernel.org> # percpu
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2023-07-14 14:44:22 +08:00
|
|
|
#ifdef CONFIG_RANDOM_KMALLOC_CACHES
|
|
|
|
unsigned long random_kmalloc_seed __ro_after_init;
|
|
|
|
EXPORT_SYMBOL(random_kmalloc_seed);
|
|
|
|
#endif
|
|
|
|
|
2013-01-10 19:14:19 +00:00
|
|
|
/*
|
|
|
|
* Conversion table for small slabs sizes / 8 to the index in the
|
|
|
|
* kmalloc array. This is necessary for slabs < 192 since we have non power
|
|
|
|
* of two cache sizes there. The size of larger slabs can be determined using
|
|
|
|
* fls.
|
|
|
|
*/
|
2023-11-13 12:02:02 +01:00
|
|
|
u8 kmalloc_size_index[24] __ro_after_init = {
|
2013-01-10 19:14:19 +00:00
|
|
|
3, /* 8 */
|
|
|
|
4, /* 16 */
|
|
|
|
5, /* 24 */
|
|
|
|
5, /* 32 */
|
|
|
|
6, /* 40 */
|
|
|
|
6, /* 48 */
|
|
|
|
6, /* 56 */
|
|
|
|
6, /* 64 */
|
|
|
|
1, /* 72 */
|
|
|
|
1, /* 80 */
|
|
|
|
1, /* 88 */
|
|
|
|
1, /* 96 */
|
|
|
|
7, /* 104 */
|
|
|
|
7, /* 112 */
|
|
|
|
7, /* 120 */
|
|
|
|
7, /* 128 */
|
|
|
|
2, /* 136 */
|
|
|
|
2, /* 144 */
|
|
|
|
2, /* 152 */
|
|
|
|
2, /* 160 */
|
|
|
|
2, /* 168 */
|
|
|
|
2, /* 176 */
|
|
|
|
2, /* 184 */
|
|
|
|
2 /* 192 */
|
|
|
|
};
|
|
|
|
|
slab: Introduce kmalloc_size_roundup()
In the effort to help the compiler reason about buffer sizes, the
__alloc_size attribute was added to allocators. This improves the scope
of the compiler's ability to apply CONFIG_UBSAN_BOUNDS and (in the near
future) CONFIG_FORTIFY_SOURCE. For most allocations, this works well,
as the vast majority of callers are not expecting to use more memory
than what they asked for.
There is, however, one common exception to this: anticipatory resizing
of kmalloc allocations. These cases all use ksize() to determine the
actual bucket size of a given allocation (e.g. 128 when 126 was asked
for). This comes in two styles in the kernel:
1) An allocation has been determined to be too small, and needs to be
resized. Instead of the caller choosing its own next best size, it
wants to minimize the number of calls to krealloc(), so it just uses
ksize() plus some additional bytes, forcing the realloc into the next
bucket size, from which it can learn how large it is now. For example:
data = krealloc(data, ksize(data) + 1, gfp);
data_len = ksize(data);
2) The minimum size of an allocation is calculated, but since it may
grow in the future, just use all the space available in the chosen
bucket immediately, to avoid needing to reallocate later. A good
example of this is skbuff's allocators:
data = kmalloc_reserve(size, gfp_mask, node, &pfmemalloc);
...
/* kmalloc(size) might give us more room than requested.
* Put skb_shared_info exactly at the end of allocated zone,
* to allow max possible filling before reallocation.
*/
osize = ksize(data);
size = SKB_WITH_OVERHEAD(osize);
In both cases, the "how much was actually allocated?" question is answered
_after_ the allocation, where the compiler hinting is not in an easy place
to make the association any more. This mismatch between the compiler's
view of the buffer length and the code's intention about how much it is
going to actually use has already caused problems[1]. It is possible to
fix this by reordering the use of the "actual size" information.
We can serve the needs of users of ksize() and still have accurate buffer
length hinting for the compiler by doing the bucket size calculation
_before_ the allocation. Code can instead ask "how large an allocation
would I get for a given size?".
Introduce kmalloc_size_roundup(), to serve this function so we can start
replacing the "anticipatory resizing" uses of ksize().
[1] https://github.com/ClangBuiltLinux/linux/issues/1599
https://github.com/KSPP/linux/issues/183
[ vbabka@suse.cz: add SLOB version ]
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2022-09-23 13:28:08 -07:00
|
|
|
size_t kmalloc_size_roundup(size_t size)
|
|
|
|
{
|
2023-09-07 12:42:20 +00:00
|
|
|
if (size && size <= KMALLOC_MAX_CACHE_SIZE) {
|
|
|
|
/*
|
|
|
|
* The flags don't matter since size_index is common to all.
|
|
|
|
* Neither does the caller for just getting ->object_size.
|
|
|
|
*/
|
2024-07-01 12:12:59 -07:00
|
|
|
return kmalloc_slab(size, NULL, GFP_KERNEL, 0)->object_size;
|
2023-09-07 12:42:20 +00:00
|
|
|
}
|
|
|
|
|
slab: Introduce kmalloc_size_roundup()
In the effort to help the compiler reason about buffer sizes, the
__alloc_size attribute was added to allocators. This improves the scope
of the compiler's ability to apply CONFIG_UBSAN_BOUNDS and (in the near
future) CONFIG_FORTIFY_SOURCE. For most allocations, this works well,
as the vast majority of callers are not expecting to use more memory
than what they asked for.
There is, however, one common exception to this: anticipatory resizing
of kmalloc allocations. These cases all use ksize() to determine the
actual bucket size of a given allocation (e.g. 128 when 126 was asked
for). This comes in two styles in the kernel:
1) An allocation has been determined to be too small, and needs to be
resized. Instead of the caller choosing its own next best size, it
wants to minimize the number of calls to krealloc(), so it just uses
ksize() plus some additional bytes, forcing the realloc into the next
bucket size, from which it can learn how large it is now. For example:
data = krealloc(data, ksize(data) + 1, gfp);
data_len = ksize(data);
2) The minimum size of an allocation is calculated, but since it may
grow in the future, just use all the space available in the chosen
bucket immediately, to avoid needing to reallocate later. A good
example of this is skbuff's allocators:
data = kmalloc_reserve(size, gfp_mask, node, &pfmemalloc);
...
/* kmalloc(size) might give us more room than requested.
* Put skb_shared_info exactly at the end of allocated zone,
* to allow max possible filling before reallocation.
*/
osize = ksize(data);
size = SKB_WITH_OVERHEAD(osize);
In both cases, the "how much was actually allocated?" question is answered
_after_ the allocation, where the compiler hinting is not in an easy place
to make the association any more. This mismatch between the compiler's
view of the buffer length and the code's intention about how much it is
going to actually use has already caused problems[1]. It is possible to
fix this by reordering the use of the "actual size" information.
We can serve the needs of users of ksize() and still have accurate buffer
length hinting for the compiler by doing the bucket size calculation
_before_ the allocation. Code can instead ask "how large an allocation
would I get for a given size?".
Introduce kmalloc_size_roundup(), to serve this function so we can start
replacing the "anticipatory resizing" uses of ksize().
[1] https://github.com/ClangBuiltLinux/linux/issues/1599
https://github.com/KSPP/linux/issues/183
[ vbabka@suse.cz: add SLOB version ]
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2022-09-23 13:28:08 -07:00
|
|
|
/* Above the smaller buckets, size is a multiple of page size. */
|
2023-09-07 12:42:20 +00:00
|
|
|
if (size && size <= KMALLOC_MAX_SIZE)
|
slab: Introduce kmalloc_size_roundup()
In the effort to help the compiler reason about buffer sizes, the
__alloc_size attribute was added to allocators. This improves the scope
of the compiler's ability to apply CONFIG_UBSAN_BOUNDS and (in the near
future) CONFIG_FORTIFY_SOURCE. For most allocations, this works well,
as the vast majority of callers are not expecting to use more memory
than what they asked for.
There is, however, one common exception to this: anticipatory resizing
of kmalloc allocations. These cases all use ksize() to determine the
actual bucket size of a given allocation (e.g. 128 when 126 was asked
for). This comes in two styles in the kernel:
1) An allocation has been determined to be too small, and needs to be
resized. Instead of the caller choosing its own next best size, it
wants to minimize the number of calls to krealloc(), so it just uses
ksize() plus some additional bytes, forcing the realloc into the next
bucket size, from which it can learn how large it is now. For example:
data = krealloc(data, ksize(data) + 1, gfp);
data_len = ksize(data);
2) The minimum size of an allocation is calculated, but since it may
grow in the future, just use all the space available in the chosen
bucket immediately, to avoid needing to reallocate later. A good
example of this is skbuff's allocators:
data = kmalloc_reserve(size, gfp_mask, node, &pfmemalloc);
...
/* kmalloc(size) might give us more room than requested.
* Put skb_shared_info exactly at the end of allocated zone,
* to allow max possible filling before reallocation.
*/
osize = ksize(data);
size = SKB_WITH_OVERHEAD(osize);
In both cases, the "how much was actually allocated?" question is answered
_after_ the allocation, where the compiler hinting is not in an easy place
to make the association any more. This mismatch between the compiler's
view of the buffer length and the code's intention about how much it is
going to actually use has already caused problems[1]. It is possible to
fix this by reordering the use of the "actual size" information.
We can serve the needs of users of ksize() and still have accurate buffer
length hinting for the compiler by doing the bucket size calculation
_before_ the allocation. Code can instead ask "how large an allocation
would I get for a given size?".
Introduce kmalloc_size_roundup(), to serve this function so we can start
replacing the "anticipatory resizing" uses of ksize().
[1] https://github.com/ClangBuiltLinux/linux/issues/1599
https://github.com/KSPP/linux/issues/183
[ vbabka@suse.cz: add SLOB version ]
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2022-09-23 13:28:08 -07:00
|
|
|
return PAGE_SIZE << get_order(size);
|
|
|
|
|
Randomized slab caches for kmalloc()
When exploiting memory vulnerabilities, "heap spraying" is a common
technique targeting those related to dynamic memory allocation (i.e. the
"heap"), and it plays an important role in a successful exploitation.
Basically, it is to overwrite the memory area of vulnerable object by
triggering allocation in other subsystems or modules and therefore
getting a reference to the targeted memory location. It's usable on
various types of vulnerablity including use after free (UAF), heap out-
of-bound write and etc.
There are (at least) two reasons why the heap can be sprayed: 1) generic
slab caches are shared among different subsystems and modules, and
2) dedicated slab caches could be merged with the generic ones.
Currently these two factors cannot be prevented at a low cost: the first
one is a widely used memory allocation mechanism, and shutting down slab
merging completely via `slub_nomerge` would be overkill.
To efficiently prevent heap spraying, we propose the following approach:
to create multiple copies of generic slab caches that will never be
merged, and random one of them will be used at allocation. The random
selection is based on the address of code that calls `kmalloc()`, which
means it is static at runtime (rather than dynamically determined at
each time of allocation, which could be bypassed by repeatedly spraying
in brute force). In other words, the randomness of cache selection will
be with respect to the code address rather than time, i.e. allocations
in different code paths would most likely pick different caches,
although kmalloc() at each place would use the same cache copy whenever
it is executed. In this way, the vulnerable object and memory allocated
in other subsystems and modules will (most probably) be on different
slab caches, which prevents the object from being sprayed.
Meanwhile, the static random selection is further enhanced with a
per-boot random seed, which prevents the attacker from finding a usable
kmalloc that happens to pick the same cache with the vulnerable
subsystem/module by analyzing the open source code. In other words, with
the per-boot seed, the random selection is static during each time the
system starts and runs, but not across different system startups.
The overhead of performance has been tested on a 40-core x86 server by
comparing the results of `perf bench all` between the kernels with and
without this patch based on the latest linux-next kernel, which shows
minor difference. A subset of benchmarks are listed below:
sched/ sched/ syscall/ mem/ mem/
messaging pipe basic memcpy memset
(sec) (sec) (sec) (GB/sec) (GB/sec)
control1 0.019 5.459 0.733 15.258789 51.398026
control2 0.019 5.439 0.730 16.009221 48.828125
control3 0.019 5.282 0.735 16.009221 48.828125
control_avg 0.019 5.393 0.733 15.759077 49.684759
experiment1 0.019 5.374 0.741 15.500992 46.502976
experiment2 0.019 5.440 0.746 16.276042 51.398026
experiment3 0.019 5.242 0.752 15.258789 51.398026
experiment_avg 0.019 5.352 0.746 15.678608 49.766343
The overhead of memory usage was measured by executing `free` after boot
on a QEMU VM with 1GB total memory, and as expected, it's positively
correlated with # of cache copies:
control 4 copies 8 copies 16 copies
total 969.8M 968.2M 968.2M 968.2M
used 20.0M 21.9M 24.1M 26.7M
free 936.9M 933.6M 931.4M 928.6M
available 932.2M 928.8M 926.6M 923.9M
Co-developed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: GONG, Ruiqi <gongruiqi@huaweicloud.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Acked-by: Dennis Zhou <dennis@kernel.org> # percpu
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2023-07-14 14:44:22 +08:00
|
|
|
/*
|
2023-09-07 12:42:20 +00:00
|
|
|
* Return 'size' for 0 - kmalloc() returns ZERO_SIZE_PTR
|
|
|
|
* and very large size - kmalloc() may fail.
|
Randomized slab caches for kmalloc()
When exploiting memory vulnerabilities, "heap spraying" is a common
technique targeting those related to dynamic memory allocation (i.e. the
"heap"), and it plays an important role in a successful exploitation.
Basically, it is to overwrite the memory area of vulnerable object by
triggering allocation in other subsystems or modules and therefore
getting a reference to the targeted memory location. It's usable on
various types of vulnerablity including use after free (UAF), heap out-
of-bound write and etc.
There are (at least) two reasons why the heap can be sprayed: 1) generic
slab caches are shared among different subsystems and modules, and
2) dedicated slab caches could be merged with the generic ones.
Currently these two factors cannot be prevented at a low cost: the first
one is a widely used memory allocation mechanism, and shutting down slab
merging completely via `slub_nomerge` would be overkill.
To efficiently prevent heap spraying, we propose the following approach:
to create multiple copies of generic slab caches that will never be
merged, and random one of them will be used at allocation. The random
selection is based on the address of code that calls `kmalloc()`, which
means it is static at runtime (rather than dynamically determined at
each time of allocation, which could be bypassed by repeatedly spraying
in brute force). In other words, the randomness of cache selection will
be with respect to the code address rather than time, i.e. allocations
in different code paths would most likely pick different caches,
although kmalloc() at each place would use the same cache copy whenever
it is executed. In this way, the vulnerable object and memory allocated
in other subsystems and modules will (most probably) be on different
slab caches, which prevents the object from being sprayed.
Meanwhile, the static random selection is further enhanced with a
per-boot random seed, which prevents the attacker from finding a usable
kmalloc that happens to pick the same cache with the vulnerable
subsystem/module by analyzing the open source code. In other words, with
the per-boot seed, the random selection is static during each time the
system starts and runs, but not across different system startups.
The overhead of performance has been tested on a 40-core x86 server by
comparing the results of `perf bench all` between the kernels with and
without this patch based on the latest linux-next kernel, which shows
minor difference. A subset of benchmarks are listed below:
sched/ sched/ syscall/ mem/ mem/
messaging pipe basic memcpy memset
(sec) (sec) (sec) (GB/sec) (GB/sec)
control1 0.019 5.459 0.733 15.258789 51.398026
control2 0.019 5.439 0.730 16.009221 48.828125
control3 0.019 5.282 0.735 16.009221 48.828125
control_avg 0.019 5.393 0.733 15.759077 49.684759
experiment1 0.019 5.374 0.741 15.500992 46.502976
experiment2 0.019 5.440 0.746 16.276042 51.398026
experiment3 0.019 5.242 0.752 15.258789 51.398026
experiment_avg 0.019 5.352 0.746 15.678608 49.766343
The overhead of memory usage was measured by executing `free` after boot
on a QEMU VM with 1GB total memory, and as expected, it's positively
correlated with # of cache copies:
control 4 copies 8 copies 16 copies
total 969.8M 968.2M 968.2M 968.2M
used 20.0M 21.9M 24.1M 26.7M
free 936.9M 933.6M 931.4M 928.6M
available 932.2M 928.8M 926.6M 923.9M
Co-developed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: GONG, Ruiqi <gongruiqi@huaweicloud.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Acked-by: Dennis Zhou <dennis@kernel.org> # percpu
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2023-07-14 14:44:22 +08:00
|
|
|
*/
|
2023-09-07 12:42:20 +00:00
|
|
|
return size;
|
|
|
|
|
slab: Introduce kmalloc_size_roundup()
In the effort to help the compiler reason about buffer sizes, the
__alloc_size attribute was added to allocators. This improves the scope
of the compiler's ability to apply CONFIG_UBSAN_BOUNDS and (in the near
future) CONFIG_FORTIFY_SOURCE. For most allocations, this works well,
as the vast majority of callers are not expecting to use more memory
than what they asked for.
There is, however, one common exception to this: anticipatory resizing
of kmalloc allocations. These cases all use ksize() to determine the
actual bucket size of a given allocation (e.g. 128 when 126 was asked
for). This comes in two styles in the kernel:
1) An allocation has been determined to be too small, and needs to be
resized. Instead of the caller choosing its own next best size, it
wants to minimize the number of calls to krealloc(), so it just uses
ksize() plus some additional bytes, forcing the realloc into the next
bucket size, from which it can learn how large it is now. For example:
data = krealloc(data, ksize(data) + 1, gfp);
data_len = ksize(data);
2) The minimum size of an allocation is calculated, but since it may
grow in the future, just use all the space available in the chosen
bucket immediately, to avoid needing to reallocate later. A good
example of this is skbuff's allocators:
data = kmalloc_reserve(size, gfp_mask, node, &pfmemalloc);
...
/* kmalloc(size) might give us more room than requested.
* Put skb_shared_info exactly at the end of allocated zone,
* to allow max possible filling before reallocation.
*/
osize = ksize(data);
size = SKB_WITH_OVERHEAD(osize);
In both cases, the "how much was actually allocated?" question is answered
_after_ the allocation, where the compiler hinting is not in an easy place
to make the association any more. This mismatch between the compiler's
view of the buffer length and the code's intention about how much it is
going to actually use has already caused problems[1]. It is possible to
fix this by reordering the use of the "actual size" information.
We can serve the needs of users of ksize() and still have accurate buffer
length hinting for the compiler by doing the bucket size calculation
_before_ the allocation. Code can instead ask "how large an allocation
would I get for a given size?".
Introduce kmalloc_size_roundup(), to serve this function so we can start
replacing the "anticipatory resizing" uses of ksize().
[1] https://github.com/ClangBuiltLinux/linux/issues/1599
https://github.com/KSPP/linux/issues/183
[ vbabka@suse.cz: add SLOB version ]
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2022-09-23 13:28:08 -07:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(kmalloc_size_roundup);
|
|
|
|
|
mm, slab: make kmalloc_info[] contain all types of names
Patch series "mm, slab: Make kmalloc_info[] contain all types of names", v6.
There are three types of kmalloc, KMALLOC_NORMAL, KMALLOC_RECLAIM
and KMALLOC_DMA.
The name of KMALLOC_NORMAL is contained in kmalloc_info[].name,
but the names of KMALLOC_RECLAIM and KMALLOC_DMA are dynamically
generated by kmalloc_cache_name().
Patch1 predefines the names of all types of kmalloc to save
the time spent dynamically generating names.
These changes make sense, and the time spent by new_kmalloc_cache()
has been reduced by approximately 36.3%.
Time spent by new_kmalloc_cache()
(CPU cycles)
5.3-rc7 66264
5.3-rc7+patch 42188
This patch (of 3):
There are three types of kmalloc, KMALLOC_NORMAL, KMALLOC_RECLAIM and
KMALLOC_DMA.
The name of KMALLOC_NORMAL is contained in kmalloc_info[].name, but the
names of KMALLOC_RECLAIM and KMALLOC_DMA are dynamically generated by
kmalloc_cache_name().
This patch predefines the names of all types of kmalloc to save the time
spent dynamically generating names.
Besides, remove the kmalloc_cache_name() that is no longer used.
Link: http://lkml.kernel.org/r/1569241648-26908-2-git-send-email-lpf.vector@gmail.com
Signed-off-by: Pengfei Li <lpf.vector@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Roman Gushchin <guro@fb.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-11-30 17:49:21 -08:00
|
|
|
#ifdef CONFIG_ZONE_DMA
|
2021-06-28 19:37:38 -07:00
|
|
|
#define KMALLOC_DMA_NAME(sz) .name[KMALLOC_DMA] = "dma-kmalloc-" #sz,
|
|
|
|
#else
|
|
|
|
#define KMALLOC_DMA_NAME(sz)
|
|
|
|
#endif
|
|
|
|
|
2024-07-01 11:31:15 -04:00
|
|
|
#ifdef CONFIG_MEMCG
|
2021-06-28 19:37:38 -07:00
|
|
|
#define KMALLOC_CGROUP_NAME(sz) .name[KMALLOC_CGROUP] = "kmalloc-cg-" #sz,
|
mm, slab: make kmalloc_info[] contain all types of names
Patch series "mm, slab: Make kmalloc_info[] contain all types of names", v6.
There are three types of kmalloc, KMALLOC_NORMAL, KMALLOC_RECLAIM
and KMALLOC_DMA.
The name of KMALLOC_NORMAL is contained in kmalloc_info[].name,
but the names of KMALLOC_RECLAIM and KMALLOC_DMA are dynamically
generated by kmalloc_cache_name().
Patch1 predefines the names of all types of kmalloc to save
the time spent dynamically generating names.
These changes make sense, and the time spent by new_kmalloc_cache()
has been reduced by approximately 36.3%.
Time spent by new_kmalloc_cache()
(CPU cycles)
5.3-rc7 66264
5.3-rc7+patch 42188
This patch (of 3):
There are three types of kmalloc, KMALLOC_NORMAL, KMALLOC_RECLAIM and
KMALLOC_DMA.
The name of KMALLOC_NORMAL is contained in kmalloc_info[].name, but the
names of KMALLOC_RECLAIM and KMALLOC_DMA are dynamically generated by
kmalloc_cache_name().
This patch predefines the names of all types of kmalloc to save the time
spent dynamically generating names.
Besides, remove the kmalloc_cache_name() that is no longer used.
Link: http://lkml.kernel.org/r/1569241648-26908-2-git-send-email-lpf.vector@gmail.com
Signed-off-by: Pengfei Li <lpf.vector@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Roman Gushchin <guro@fb.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-11-30 17:49:21 -08:00
|
|
|
#else
|
2021-06-28 19:37:38 -07:00
|
|
|
#define KMALLOC_CGROUP_NAME(sz)
|
|
|
|
#endif
|
|
|
|
|
2022-11-15 18:19:28 +01:00
|
|
|
#ifndef CONFIG_SLUB_TINY
|
|
|
|
#define KMALLOC_RCL_NAME(sz) .name[KMALLOC_RECLAIM] = "kmalloc-rcl-" #sz,
|
|
|
|
#else
|
|
|
|
#define KMALLOC_RCL_NAME(sz)
|
|
|
|
#endif
|
|
|
|
|
Randomized slab caches for kmalloc()
When exploiting memory vulnerabilities, "heap spraying" is a common
technique targeting those related to dynamic memory allocation (i.e. the
"heap"), and it plays an important role in a successful exploitation.
Basically, it is to overwrite the memory area of vulnerable object by
triggering allocation in other subsystems or modules and therefore
getting a reference to the targeted memory location. It's usable on
various types of vulnerablity including use after free (UAF), heap out-
of-bound write and etc.
There are (at least) two reasons why the heap can be sprayed: 1) generic
slab caches are shared among different subsystems and modules, and
2) dedicated slab caches could be merged with the generic ones.
Currently these two factors cannot be prevented at a low cost: the first
one is a widely used memory allocation mechanism, and shutting down slab
merging completely via `slub_nomerge` would be overkill.
To efficiently prevent heap spraying, we propose the following approach:
to create multiple copies of generic slab caches that will never be
merged, and random one of them will be used at allocation. The random
selection is based on the address of code that calls `kmalloc()`, which
means it is static at runtime (rather than dynamically determined at
each time of allocation, which could be bypassed by repeatedly spraying
in brute force). In other words, the randomness of cache selection will
be with respect to the code address rather than time, i.e. allocations
in different code paths would most likely pick different caches,
although kmalloc() at each place would use the same cache copy whenever
it is executed. In this way, the vulnerable object and memory allocated
in other subsystems and modules will (most probably) be on different
slab caches, which prevents the object from being sprayed.
Meanwhile, the static random selection is further enhanced with a
per-boot random seed, which prevents the attacker from finding a usable
kmalloc that happens to pick the same cache with the vulnerable
subsystem/module by analyzing the open source code. In other words, with
the per-boot seed, the random selection is static during each time the
system starts and runs, but not across different system startups.
The overhead of performance has been tested on a 40-core x86 server by
comparing the results of `perf bench all` between the kernels with and
without this patch based on the latest linux-next kernel, which shows
minor difference. A subset of benchmarks are listed below:
sched/ sched/ syscall/ mem/ mem/
messaging pipe basic memcpy memset
(sec) (sec) (sec) (GB/sec) (GB/sec)
control1 0.019 5.459 0.733 15.258789 51.398026
control2 0.019 5.439 0.730 16.009221 48.828125
control3 0.019 5.282 0.735 16.009221 48.828125
control_avg 0.019 5.393 0.733 15.759077 49.684759
experiment1 0.019 5.374 0.741 15.500992 46.502976
experiment2 0.019 5.440 0.746 16.276042 51.398026
experiment3 0.019 5.242 0.752 15.258789 51.398026
experiment_avg 0.019 5.352 0.746 15.678608 49.766343
The overhead of memory usage was measured by executing `free` after boot
on a QEMU VM with 1GB total memory, and as expected, it's positively
correlated with # of cache copies:
control 4 copies 8 copies 16 copies
total 969.8M 968.2M 968.2M 968.2M
used 20.0M 21.9M 24.1M 26.7M
free 936.9M 933.6M 931.4M 928.6M
available 932.2M 928.8M 926.6M 923.9M
Co-developed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: GONG, Ruiqi <gongruiqi@huaweicloud.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Acked-by: Dennis Zhou <dennis@kernel.org> # percpu
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2023-07-14 14:44:22 +08:00
|
|
|
#ifdef CONFIG_RANDOM_KMALLOC_CACHES
|
|
|
|
#define __KMALLOC_RANDOM_CONCAT(a, b) a ## b
|
|
|
|
#define KMALLOC_RANDOM_NAME(N, sz) __KMALLOC_RANDOM_CONCAT(KMA_RAND_, N)(sz)
|
|
|
|
#define KMA_RAND_1(sz) .name[KMALLOC_RANDOM_START + 1] = "kmalloc-rnd-01-" #sz,
|
|
|
|
#define KMA_RAND_2(sz) KMA_RAND_1(sz) .name[KMALLOC_RANDOM_START + 2] = "kmalloc-rnd-02-" #sz,
|
|
|
|
#define KMA_RAND_3(sz) KMA_RAND_2(sz) .name[KMALLOC_RANDOM_START + 3] = "kmalloc-rnd-03-" #sz,
|
|
|
|
#define KMA_RAND_4(sz) KMA_RAND_3(sz) .name[KMALLOC_RANDOM_START + 4] = "kmalloc-rnd-04-" #sz,
|
|
|
|
#define KMA_RAND_5(sz) KMA_RAND_4(sz) .name[KMALLOC_RANDOM_START + 5] = "kmalloc-rnd-05-" #sz,
|
|
|
|
#define KMA_RAND_6(sz) KMA_RAND_5(sz) .name[KMALLOC_RANDOM_START + 6] = "kmalloc-rnd-06-" #sz,
|
|
|
|
#define KMA_RAND_7(sz) KMA_RAND_6(sz) .name[KMALLOC_RANDOM_START + 7] = "kmalloc-rnd-07-" #sz,
|
|
|
|
#define KMA_RAND_8(sz) KMA_RAND_7(sz) .name[KMALLOC_RANDOM_START + 8] = "kmalloc-rnd-08-" #sz,
|
|
|
|
#define KMA_RAND_9(sz) KMA_RAND_8(sz) .name[KMALLOC_RANDOM_START + 9] = "kmalloc-rnd-09-" #sz,
|
|
|
|
#define KMA_RAND_10(sz) KMA_RAND_9(sz) .name[KMALLOC_RANDOM_START + 10] = "kmalloc-rnd-10-" #sz,
|
|
|
|
#define KMA_RAND_11(sz) KMA_RAND_10(sz) .name[KMALLOC_RANDOM_START + 11] = "kmalloc-rnd-11-" #sz,
|
|
|
|
#define KMA_RAND_12(sz) KMA_RAND_11(sz) .name[KMALLOC_RANDOM_START + 12] = "kmalloc-rnd-12-" #sz,
|
|
|
|
#define KMA_RAND_13(sz) KMA_RAND_12(sz) .name[KMALLOC_RANDOM_START + 13] = "kmalloc-rnd-13-" #sz,
|
|
|
|
#define KMA_RAND_14(sz) KMA_RAND_13(sz) .name[KMALLOC_RANDOM_START + 14] = "kmalloc-rnd-14-" #sz,
|
|
|
|
#define KMA_RAND_15(sz) KMA_RAND_14(sz) .name[KMALLOC_RANDOM_START + 15] = "kmalloc-rnd-15-" #sz,
|
|
|
|
#else // CONFIG_RANDOM_KMALLOC_CACHES
|
|
|
|
#define KMALLOC_RANDOM_NAME(N, sz)
|
|
|
|
#endif
|
|
|
|
|
mm, slab: make kmalloc_info[] contain all types of names
Patch series "mm, slab: Make kmalloc_info[] contain all types of names", v6.
There are three types of kmalloc, KMALLOC_NORMAL, KMALLOC_RECLAIM
and KMALLOC_DMA.
The name of KMALLOC_NORMAL is contained in kmalloc_info[].name,
but the names of KMALLOC_RECLAIM and KMALLOC_DMA are dynamically
generated by kmalloc_cache_name().
Patch1 predefines the names of all types of kmalloc to save
the time spent dynamically generating names.
These changes make sense, and the time spent by new_kmalloc_cache()
has been reduced by approximately 36.3%.
Time spent by new_kmalloc_cache()
(CPU cycles)
5.3-rc7 66264
5.3-rc7+patch 42188
This patch (of 3):
There are three types of kmalloc, KMALLOC_NORMAL, KMALLOC_RECLAIM and
KMALLOC_DMA.
The name of KMALLOC_NORMAL is contained in kmalloc_info[].name, but the
names of KMALLOC_RECLAIM and KMALLOC_DMA are dynamically generated by
kmalloc_cache_name().
This patch predefines the names of all types of kmalloc to save the time
spent dynamically generating names.
Besides, remove the kmalloc_cache_name() that is no longer used.
Link: http://lkml.kernel.org/r/1569241648-26908-2-git-send-email-lpf.vector@gmail.com
Signed-off-by: Pengfei Li <lpf.vector@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Roman Gushchin <guro@fb.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-11-30 17:49:21 -08:00
|
|
|
#define INIT_KMALLOC_INFO(__size, __short_size) \
|
|
|
|
{ \
|
|
|
|
.name[KMALLOC_NORMAL] = "kmalloc-" #__short_size, \
|
2022-11-15 18:19:28 +01:00
|
|
|
KMALLOC_RCL_NAME(__short_size) \
|
2021-06-28 19:37:38 -07:00
|
|
|
KMALLOC_CGROUP_NAME(__short_size) \
|
|
|
|
KMALLOC_DMA_NAME(__short_size) \
|
Randomized slab caches for kmalloc()
When exploiting memory vulnerabilities, "heap spraying" is a common
technique targeting those related to dynamic memory allocation (i.e. the
"heap"), and it plays an important role in a successful exploitation.
Basically, it is to overwrite the memory area of vulnerable object by
triggering allocation in other subsystems or modules and therefore
getting a reference to the targeted memory location. It's usable on
various types of vulnerablity including use after free (UAF), heap out-
of-bound write and etc.
There are (at least) two reasons why the heap can be sprayed: 1) generic
slab caches are shared among different subsystems and modules, and
2) dedicated slab caches could be merged with the generic ones.
Currently these two factors cannot be prevented at a low cost: the first
one is a widely used memory allocation mechanism, and shutting down slab
merging completely via `slub_nomerge` would be overkill.
To efficiently prevent heap spraying, we propose the following approach:
to create multiple copies of generic slab caches that will never be
merged, and random one of them will be used at allocation. The random
selection is based on the address of code that calls `kmalloc()`, which
means it is static at runtime (rather than dynamically determined at
each time of allocation, which could be bypassed by repeatedly spraying
in brute force). In other words, the randomness of cache selection will
be with respect to the code address rather than time, i.e. allocations
in different code paths would most likely pick different caches,
although kmalloc() at each place would use the same cache copy whenever
it is executed. In this way, the vulnerable object and memory allocated
in other subsystems and modules will (most probably) be on different
slab caches, which prevents the object from being sprayed.
Meanwhile, the static random selection is further enhanced with a
per-boot random seed, which prevents the attacker from finding a usable
kmalloc that happens to pick the same cache with the vulnerable
subsystem/module by analyzing the open source code. In other words, with
the per-boot seed, the random selection is static during each time the
system starts and runs, but not across different system startups.
The overhead of performance has been tested on a 40-core x86 server by
comparing the results of `perf bench all` between the kernels with and
without this patch based on the latest linux-next kernel, which shows
minor difference. A subset of benchmarks are listed below:
sched/ sched/ syscall/ mem/ mem/
messaging pipe basic memcpy memset
(sec) (sec) (sec) (GB/sec) (GB/sec)
control1 0.019 5.459 0.733 15.258789 51.398026
control2 0.019 5.439 0.730 16.009221 48.828125
control3 0.019 5.282 0.735 16.009221 48.828125
control_avg 0.019 5.393 0.733 15.759077 49.684759
experiment1 0.019 5.374 0.741 15.500992 46.502976
experiment2 0.019 5.440 0.746 16.276042 51.398026
experiment3 0.019 5.242 0.752 15.258789 51.398026
experiment_avg 0.019 5.352 0.746 15.678608 49.766343
The overhead of memory usage was measured by executing `free` after boot
on a QEMU VM with 1GB total memory, and as expected, it's positively
correlated with # of cache copies:
control 4 copies 8 copies 16 copies
total 969.8M 968.2M 968.2M 968.2M
used 20.0M 21.9M 24.1M 26.7M
free 936.9M 933.6M 931.4M 928.6M
available 932.2M 928.8M 926.6M 923.9M
Co-developed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: GONG, Ruiqi <gongruiqi@huaweicloud.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Acked-by: Dennis Zhou <dennis@kernel.org> # percpu
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2023-07-14 14:44:22 +08:00
|
|
|
KMALLOC_RANDOM_NAME(RANDOM_KMALLOC_CACHES_NR, __short_size) \
|
mm, slab: make kmalloc_info[] contain all types of names
Patch series "mm, slab: Make kmalloc_info[] contain all types of names", v6.
There are three types of kmalloc, KMALLOC_NORMAL, KMALLOC_RECLAIM
and KMALLOC_DMA.
The name of KMALLOC_NORMAL is contained in kmalloc_info[].name,
but the names of KMALLOC_RECLAIM and KMALLOC_DMA are dynamically
generated by kmalloc_cache_name().
Patch1 predefines the names of all types of kmalloc to save
the time spent dynamically generating names.
These changes make sense, and the time spent by new_kmalloc_cache()
has been reduced by approximately 36.3%.
Time spent by new_kmalloc_cache()
(CPU cycles)
5.3-rc7 66264
5.3-rc7+patch 42188
This patch (of 3):
There are three types of kmalloc, KMALLOC_NORMAL, KMALLOC_RECLAIM and
KMALLOC_DMA.
The name of KMALLOC_NORMAL is contained in kmalloc_info[].name, but the
names of KMALLOC_RECLAIM and KMALLOC_DMA are dynamically generated by
kmalloc_cache_name().
This patch predefines the names of all types of kmalloc to save the time
spent dynamically generating names.
Besides, remove the kmalloc_cache_name() that is no longer used.
Link: http://lkml.kernel.org/r/1569241648-26908-2-git-send-email-lpf.vector@gmail.com
Signed-off-by: Pengfei Li <lpf.vector@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Roman Gushchin <guro@fb.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-11-30 17:49:21 -08:00
|
|
|
.size = __size, \
|
|
|
|
}
|
|
|
|
|
2015-06-24 16:55:54 -07:00
|
|
|
/*
|
2023-12-15 11:41:48 +08:00
|
|
|
* kmalloc_info[] is to make slab_debug=,kmalloc-xx option work at boot time.
|
2022-08-17 19:18:19 +09:00
|
|
|
* kmalloc_index() supports up to 2^21=2MB, so the final entry of the table is
|
|
|
|
* kmalloc-2M.
|
2015-06-24 16:55:54 -07:00
|
|
|
*/
|
2017-02-22 15:41:05 -08:00
|
|
|
const struct kmalloc_info_struct kmalloc_info[] __initconst = {
|
mm, slab: make kmalloc_info[] contain all types of names
Patch series "mm, slab: Make kmalloc_info[] contain all types of names", v6.
There are three types of kmalloc, KMALLOC_NORMAL, KMALLOC_RECLAIM
and KMALLOC_DMA.
The name of KMALLOC_NORMAL is contained in kmalloc_info[].name,
but the names of KMALLOC_RECLAIM and KMALLOC_DMA are dynamically
generated by kmalloc_cache_name().
Patch1 predefines the names of all types of kmalloc to save
the time spent dynamically generating names.
These changes make sense, and the time spent by new_kmalloc_cache()
has been reduced by approximately 36.3%.
Time spent by new_kmalloc_cache()
(CPU cycles)
5.3-rc7 66264
5.3-rc7+patch 42188
This patch (of 3):
There are three types of kmalloc, KMALLOC_NORMAL, KMALLOC_RECLAIM and
KMALLOC_DMA.
The name of KMALLOC_NORMAL is contained in kmalloc_info[].name, but the
names of KMALLOC_RECLAIM and KMALLOC_DMA are dynamically generated by
kmalloc_cache_name().
This patch predefines the names of all types of kmalloc to save the time
spent dynamically generating names.
Besides, remove the kmalloc_cache_name() that is no longer used.
Link: http://lkml.kernel.org/r/1569241648-26908-2-git-send-email-lpf.vector@gmail.com
Signed-off-by: Pengfei Li <lpf.vector@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Roman Gushchin <guro@fb.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-11-30 17:49:21 -08:00
|
|
|
INIT_KMALLOC_INFO(0, 0),
|
|
|
|
INIT_KMALLOC_INFO(96, 96),
|
|
|
|
INIT_KMALLOC_INFO(192, 192),
|
|
|
|
INIT_KMALLOC_INFO(8, 8),
|
|
|
|
INIT_KMALLOC_INFO(16, 16),
|
|
|
|
INIT_KMALLOC_INFO(32, 32),
|
|
|
|
INIT_KMALLOC_INFO(64, 64),
|
|
|
|
INIT_KMALLOC_INFO(128, 128),
|
|
|
|
INIT_KMALLOC_INFO(256, 256),
|
|
|
|
INIT_KMALLOC_INFO(512, 512),
|
|
|
|
INIT_KMALLOC_INFO(1024, 1k),
|
|
|
|
INIT_KMALLOC_INFO(2048, 2k),
|
|
|
|
INIT_KMALLOC_INFO(4096, 4k),
|
|
|
|
INIT_KMALLOC_INFO(8192, 8k),
|
|
|
|
INIT_KMALLOC_INFO(16384, 16k),
|
|
|
|
INIT_KMALLOC_INFO(32768, 32k),
|
|
|
|
INIT_KMALLOC_INFO(65536, 64k),
|
|
|
|
INIT_KMALLOC_INFO(131072, 128k),
|
|
|
|
INIT_KMALLOC_INFO(262144, 256k),
|
|
|
|
INIT_KMALLOC_INFO(524288, 512k),
|
|
|
|
INIT_KMALLOC_INFO(1048576, 1M),
|
2022-08-17 19:18:19 +09:00
|
|
|
INIT_KMALLOC_INFO(2097152, 2M)
|
2015-06-24 16:55:54 -07:00
|
|
|
};
|
|
|
|
|
2013-01-10 19:12:17 +00:00
|
|
|
/*
|
2015-06-24 16:55:57 -07:00
|
|
|
* Patch up the size_index table if we have strange large alignment
|
|
|
|
* requirements for the kmalloc array. This is only the case for
|
|
|
|
* MIPS it seems. The standard arches will not generate any code here.
|
|
|
|
*
|
|
|
|
* Largest permitted alignment is 256 bytes due to the way we
|
|
|
|
* handle the index determination for the smaller caches.
|
|
|
|
*
|
|
|
|
* Make sure that nothing crazy happens if someone starts tinkering
|
|
|
|
* around with ARCH_KMALLOC_MINALIGN
|
2013-01-10 19:12:17 +00:00
|
|
|
*/
|
2015-06-24 16:55:57 -07:00
|
|
|
void __init setup_kmalloc_cache_index_table(void)
|
2013-01-10 19:12:17 +00:00
|
|
|
{
|
2018-04-05 16:20:44 -07:00
|
|
|
unsigned int i;
|
2013-01-10 19:12:17 +00:00
|
|
|
|
2013-01-10 19:14:19 +00:00
|
|
|
BUILD_BUG_ON(KMALLOC_MIN_SIZE > 256 ||
|
2022-02-17 17:16:09 +08:00
|
|
|
!is_power_of_2(KMALLOC_MIN_SIZE));
|
2013-01-10 19:14:19 +00:00
|
|
|
|
|
|
|
for (i = 8; i < KMALLOC_MIN_SIZE; i += 8) {
|
2018-04-05 16:20:44 -07:00
|
|
|
unsigned int elem = size_index_elem(i);
|
2013-01-10 19:14:19 +00:00
|
|
|
|
2023-11-13 12:02:02 +01:00
|
|
|
if (elem >= ARRAY_SIZE(kmalloc_size_index))
|
2013-01-10 19:14:19 +00:00
|
|
|
break;
|
2023-11-13 12:02:02 +01:00
|
|
|
kmalloc_size_index[elem] = KMALLOC_SHIFT_LOW;
|
2013-01-10 19:14:19 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
if (KMALLOC_MIN_SIZE >= 64) {
|
|
|
|
/*
|
2022-01-14 14:09:25 -08:00
|
|
|
* The 96 byte sized cache is not used if the alignment
|
2013-01-10 19:14:19 +00:00
|
|
|
* is 64 byte.
|
|
|
|
*/
|
|
|
|
for (i = 64 + 8; i <= 96; i += 8)
|
2023-11-13 12:02:02 +01:00
|
|
|
kmalloc_size_index[size_index_elem(i)] = 7;
|
2013-01-10 19:14:19 +00:00
|
|
|
|
|
|
|
}
|
|
|
|
|
|
|
|
if (KMALLOC_MIN_SIZE >= 128) {
|
|
|
|
/*
|
|
|
|
* The 192 byte sized cache is not used if the alignment
|
|
|
|
* is 128 byte. Redirect kmalloc to use the 256 byte cache
|
|
|
|
* instead.
|
|
|
|
*/
|
|
|
|
for (i = 128 + 8; i <= 192; i += 8)
|
2023-11-13 12:02:02 +01:00
|
|
|
kmalloc_size_index[size_index_elem(i)] = 8;
|
2013-01-10 19:14:19 +00:00
|
|
|
}
|
2015-06-24 16:55:57 -07:00
|
|
|
}
|
|
|
|
|
2023-06-12 16:31:48 +01:00
|
|
|
static unsigned int __kmalloc_minalign(void)
|
|
|
|
{
|
2023-10-06 17:39:34 +01:00
|
|
|
unsigned int minalign = dma_get_cache_alignment();
|
|
|
|
|
2023-08-01 08:23:57 +02:00
|
|
|
if (IS_ENABLED(CONFIG_DMA_BOUNCE_UNALIGNED_KMALLOC) &&
|
|
|
|
is_swiotlb_allocated())
|
2023-10-06 17:39:34 +01:00
|
|
|
minalign = ARCH_KMALLOC_MINALIGN;
|
|
|
|
|
|
|
|
return max(minalign, arch_slab_minalign());
|
2023-06-12 16:31:48 +01:00
|
|
|
}
|
|
|
|
|
2024-01-30 09:41:07 +08:00
|
|
|
static void __init
|
|
|
|
new_kmalloc_cache(int idx, enum kmalloc_cache_type type)
|
2015-06-29 09:28:08 -05:00
|
|
|
{
|
2024-01-30 09:41:07 +08:00
|
|
|
slab_flags_t flags = 0;
|
2023-06-12 16:31:48 +01:00
|
|
|
unsigned int minalign = __kmalloc_minalign();
|
|
|
|
unsigned int aligned_size = kmalloc_info[idx].size;
|
|
|
|
int aligned_idx = idx;
|
|
|
|
|
2022-11-15 18:19:28 +01:00
|
|
|
if ((KMALLOC_RECLAIM != KMALLOC_NORMAL) && (type == KMALLOC_RECLAIM)) {
|
mm, slab/slub: introduce kmalloc-reclaimable caches
Kmem caches can be created with a SLAB_RECLAIM_ACCOUNT flag, which
indicates they contain objects which can be reclaimed under memory
pressure (typically through a shrinker). This makes the slab pages
accounted as NR_SLAB_RECLAIMABLE in vmstat, which is reflected also the
MemAvailable meminfo counter and in overcommit decisions. The slab pages
are also allocated with __GFP_RECLAIMABLE, which is good for
anti-fragmentation through grouping pages by mobility.
The generic kmalloc-X caches are created without this flag, but sometimes
are used also for objects that can be reclaimed, which due to varying size
cannot have a dedicated kmem cache with SLAB_RECLAIM_ACCOUNT flag. A
prominent example are dcache external names, which prompted the creation
of a new, manually managed vmstat counter NR_INDIRECTLY_RECLAIMABLE_BYTES
in commit f1782c9bc547 ("dcache: account external names as indirectly
reclaimable memory").
To better handle this and any other similar cases, this patch introduces
SLAB_RECLAIM_ACCOUNT variants of kmalloc caches, named kmalloc-rcl-X.
They are used whenever the kmalloc() call passes __GFP_RECLAIMABLE among
gfp flags. They are added to the kmalloc_caches array as a new type.
Allocations with both __GFP_DMA and __GFP_RECLAIMABLE will use a dma type
cache.
This change only applies to SLAB and SLUB, not SLOB. This is fine, since
SLOB's target are tiny system and this patch does add some overhead of
kmem management objects.
Link: http://lkml.kernel.org/r/20180731090649.16028-3-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: Roman Gushchin <guro@fb.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: Vijayanand Jitta <vjitta@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 15:05:38 -07:00
|
|
|
flags |= SLAB_RECLAIM_ACCOUNT;
|
2024-07-01 11:31:15 -04:00
|
|
|
} else if (IS_ENABLED(CONFIG_MEMCG) && (type == KMALLOC_CGROUP)) {
|
2022-01-14 14:05:29 -08:00
|
|
|
if (mem_cgroup_kmem_disabled()) {
|
2021-06-28 19:37:38 -07:00
|
|
|
kmalloc_caches[type][idx] = kmalloc_caches[KMALLOC_NORMAL][idx];
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
flags |= SLAB_ACCOUNT;
|
mm/slab_common: move dma-kmalloc caches creation into new_kmalloc_cache()
There are four types of kmalloc_caches: KMALLOC_NORMAL, KMALLOC_CGROUP,
KMALLOC_RECLAIM, and KMALLOC_DMA. While the first three types are
created using new_kmalloc_cache(), KMALLOC_DMA caches are created in a
separate logic. Let KMALLOC_DMA caches be also created using
new_kmalloc_cache(), to enhance readability.
Historically, there were only KMALLOC_NORMAL caches and KMALLOC_DMA
caches in the first place, and they were initialized in two separate
logics. However, when KMALLOC_RECLAIM was introduced in v4.20 via
commit 1291523f2c1d ("mm, slab/slub: introduce kmalloc-reclaimable
caches") and KMALLOC_CGROUP was introduced in v5.14 via
commit 494c1dfe855e ("mm: memcg/slab: create a new set of kmalloc-cg-<n>
caches"), their creations were merged with KMALLOC_NORMAL's only.
KMALLOC_DMA creation logic should be merged with them, too.
By merging KMALLOC_DMA initialization with other types, the following
two changes might occur:
1. The order dma-kmalloc-<n> caches added in slab_cache list may be
sorted by size. i.e. the order they appear in /proc/slabinfo may change
as well.
2. slab_state will be set to UP after KMALLOC_DMA is created.
In case of slub, freelist randomization is dependent on slab_state>=UP,
and therefore KMALLOC_DMA cache's freelist will not be randomized in
creation, but will be deferred to init_freelist_randomization().
Co-developed-by: JaeSang Yoo <jsyoo5b@gmail.com>
Signed-off-by: JaeSang Yoo <jsyoo5b@gmail.com>
Signed-off-by: Ohhoon Kwon <ohkwon1043@gmail.com>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Link: https://lore.kernel.org/r/20220410162511.656541-1-ohkwon1043@gmail.com
2022-04-11 01:25:11 +09:00
|
|
|
} else if (IS_ENABLED(CONFIG_ZONE_DMA) && (type == KMALLOC_DMA)) {
|
|
|
|
flags |= SLAB_CACHE_DMA;
|
2021-06-28 19:37:38 -07:00
|
|
|
}
|
mm, slab/slub: introduce kmalloc-reclaimable caches
Kmem caches can be created with a SLAB_RECLAIM_ACCOUNT flag, which
indicates they contain objects which can be reclaimed under memory
pressure (typically through a shrinker). This makes the slab pages
accounted as NR_SLAB_RECLAIMABLE in vmstat, which is reflected also the
MemAvailable meminfo counter and in overcommit decisions. The slab pages
are also allocated with __GFP_RECLAIMABLE, which is good for
anti-fragmentation through grouping pages by mobility.
The generic kmalloc-X caches are created without this flag, but sometimes
are used also for objects that can be reclaimed, which due to varying size
cannot have a dedicated kmem cache with SLAB_RECLAIM_ACCOUNT flag. A
prominent example are dcache external names, which prompted the creation
of a new, manually managed vmstat counter NR_INDIRECTLY_RECLAIMABLE_BYTES
in commit f1782c9bc547 ("dcache: account external names as indirectly
reclaimable memory").
To better handle this and any other similar cases, this patch introduces
SLAB_RECLAIM_ACCOUNT variants of kmalloc caches, named kmalloc-rcl-X.
They are used whenever the kmalloc() call passes __GFP_RECLAIMABLE among
gfp flags. They are added to the kmalloc_caches array as a new type.
Allocations with both __GFP_DMA and __GFP_RECLAIMABLE will use a dma type
cache.
This change only applies to SLAB and SLUB, not SLOB. This is fine, since
SLOB's target are tiny system and this patch does add some overhead of
kmem management objects.
Link: http://lkml.kernel.org/r/20180731090649.16028-3-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: Roman Gushchin <guro@fb.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: Vijayanand Jitta <vjitta@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 15:05:38 -07:00
|
|
|
|
Randomized slab caches for kmalloc()
When exploiting memory vulnerabilities, "heap spraying" is a common
technique targeting those related to dynamic memory allocation (i.e. the
"heap"), and it plays an important role in a successful exploitation.
Basically, it is to overwrite the memory area of vulnerable object by
triggering allocation in other subsystems or modules and therefore
getting a reference to the targeted memory location. It's usable on
various types of vulnerablity including use after free (UAF), heap out-
of-bound write and etc.
There are (at least) two reasons why the heap can be sprayed: 1) generic
slab caches are shared among different subsystems and modules, and
2) dedicated slab caches could be merged with the generic ones.
Currently these two factors cannot be prevented at a low cost: the first
one is a widely used memory allocation mechanism, and shutting down slab
merging completely via `slub_nomerge` would be overkill.
To efficiently prevent heap spraying, we propose the following approach:
to create multiple copies of generic slab caches that will never be
merged, and random one of them will be used at allocation. The random
selection is based on the address of code that calls `kmalloc()`, which
means it is static at runtime (rather than dynamically determined at
each time of allocation, which could be bypassed by repeatedly spraying
in brute force). In other words, the randomness of cache selection will
be with respect to the code address rather than time, i.e. allocations
in different code paths would most likely pick different caches,
although kmalloc() at each place would use the same cache copy whenever
it is executed. In this way, the vulnerable object and memory allocated
in other subsystems and modules will (most probably) be on different
slab caches, which prevents the object from being sprayed.
Meanwhile, the static random selection is further enhanced with a
per-boot random seed, which prevents the attacker from finding a usable
kmalloc that happens to pick the same cache with the vulnerable
subsystem/module by analyzing the open source code. In other words, with
the per-boot seed, the random selection is static during each time the
system starts and runs, but not across different system startups.
The overhead of performance has been tested on a 40-core x86 server by
comparing the results of `perf bench all` between the kernels with and
without this patch based on the latest linux-next kernel, which shows
minor difference. A subset of benchmarks are listed below:
sched/ sched/ syscall/ mem/ mem/
messaging pipe basic memcpy memset
(sec) (sec) (sec) (GB/sec) (GB/sec)
control1 0.019 5.459 0.733 15.258789 51.398026
control2 0.019 5.439 0.730 16.009221 48.828125
control3 0.019 5.282 0.735 16.009221 48.828125
control_avg 0.019 5.393 0.733 15.759077 49.684759
experiment1 0.019 5.374 0.741 15.500992 46.502976
experiment2 0.019 5.440 0.746 16.276042 51.398026
experiment3 0.019 5.242 0.752 15.258789 51.398026
experiment_avg 0.019 5.352 0.746 15.678608 49.766343
The overhead of memory usage was measured by executing `free` after boot
on a QEMU VM with 1GB total memory, and as expected, it's positively
correlated with # of cache copies:
control 4 copies 8 copies 16 copies
total 969.8M 968.2M 968.2M 968.2M
used 20.0M 21.9M 24.1M 26.7M
free 936.9M 933.6M 931.4M 928.6M
available 932.2M 928.8M 926.6M 923.9M
Co-developed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: GONG, Ruiqi <gongruiqi@huaweicloud.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Acked-by: Dennis Zhou <dennis@kernel.org> # percpu
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2023-07-14 14:44:22 +08:00
|
|
|
#ifdef CONFIG_RANDOM_KMALLOC_CACHES
|
|
|
|
if (type >= KMALLOC_RANDOM_START && type <= KMALLOC_RANDOM_END)
|
|
|
|
flags |= SLAB_NO_MERGE;
|
|
|
|
#endif
|
|
|
|
|
2021-06-28 19:37:41 -07:00
|
|
|
/*
|
2024-07-01 11:31:15 -04:00
|
|
|
* If CONFIG_MEMCG is enabled, disable cache merging for
|
2021-06-28 19:37:41 -07:00
|
|
|
* KMALLOC_NORMAL caches.
|
|
|
|
*/
|
2024-07-01 11:31:15 -04:00
|
|
|
if (IS_ENABLED(CONFIG_MEMCG) && (type == KMALLOC_NORMAL))
|
2023-06-13 12:28:21 +02:00
|
|
|
flags |= SLAB_NO_MERGE;
|
|
|
|
|
2023-06-12 16:31:48 +01:00
|
|
|
if (minalign > ARCH_KMALLOC_MINALIGN) {
|
|
|
|
aligned_size = ALIGN(aligned_size, minalign);
|
|
|
|
aligned_idx = __kmalloc_index(aligned_size, false);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (!kmalloc_caches[type][aligned_idx])
|
|
|
|
kmalloc_caches[type][aligned_idx] = create_kmalloc_cache(
|
|
|
|
kmalloc_info[aligned_idx].name[type],
|
|
|
|
aligned_size, flags);
|
|
|
|
if (idx != aligned_idx)
|
|
|
|
kmalloc_caches[type][idx] = kmalloc_caches[type][aligned_idx];
|
2015-06-29 09:28:08 -05:00
|
|
|
}
|
|
|
|
|
2015-06-24 16:55:57 -07:00
|
|
|
/*
|
|
|
|
* Create the kmalloc array. Some of the regular kmalloc arrays
|
|
|
|
* may already have been created because they were needed to
|
|
|
|
* enable allocations for slab creation.
|
|
|
|
*/
|
2024-01-30 09:41:07 +08:00
|
|
|
void __init create_kmalloc_caches(void)
|
2015-06-24 16:55:57 -07:00
|
|
|
{
|
2019-11-30 17:49:28 -08:00
|
|
|
int i;
|
|
|
|
enum kmalloc_cache_type type;
|
2015-06-24 16:55:57 -07:00
|
|
|
|
2021-06-28 19:37:38 -07:00
|
|
|
/*
|
2024-07-01 11:31:15 -04:00
|
|
|
* Including KMALLOC_CGROUP if CONFIG_MEMCG defined
|
2021-06-28 19:37:38 -07:00
|
|
|
*/
|
mm/slab_common: move dma-kmalloc caches creation into new_kmalloc_cache()
There are four types of kmalloc_caches: KMALLOC_NORMAL, KMALLOC_CGROUP,
KMALLOC_RECLAIM, and KMALLOC_DMA. While the first three types are
created using new_kmalloc_cache(), KMALLOC_DMA caches are created in a
separate logic. Let KMALLOC_DMA caches be also created using
new_kmalloc_cache(), to enhance readability.
Historically, there were only KMALLOC_NORMAL caches and KMALLOC_DMA
caches in the first place, and they were initialized in two separate
logics. However, when KMALLOC_RECLAIM was introduced in v4.20 via
commit 1291523f2c1d ("mm, slab/slub: introduce kmalloc-reclaimable
caches") and KMALLOC_CGROUP was introduced in v5.14 via
commit 494c1dfe855e ("mm: memcg/slab: create a new set of kmalloc-cg-<n>
caches"), their creations were merged with KMALLOC_NORMAL's only.
KMALLOC_DMA creation logic should be merged with them, too.
By merging KMALLOC_DMA initialization with other types, the following
two changes might occur:
1. The order dma-kmalloc-<n> caches added in slab_cache list may be
sorted by size. i.e. the order they appear in /proc/slabinfo may change
as well.
2. slab_state will be set to UP after KMALLOC_DMA is created.
In case of slub, freelist randomization is dependent on slab_state>=UP,
and therefore KMALLOC_DMA cache's freelist will not be randomized in
creation, but will be deferred to init_freelist_randomization().
Co-developed-by: JaeSang Yoo <jsyoo5b@gmail.com>
Signed-off-by: JaeSang Yoo <jsyoo5b@gmail.com>
Signed-off-by: Ohhoon Kwon <ohkwon1043@gmail.com>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Link: https://lore.kernel.org/r/20220410162511.656541-1-ohkwon1043@gmail.com
2022-04-11 01:25:11 +09:00
|
|
|
for (type = KMALLOC_NORMAL; type < NR_KMALLOC_TYPES; type++) {
|
2024-04-24 23:04:21 +09:00
|
|
|
/* Caches that are NOT of the two-to-the-power-of size. */
|
2024-04-24 23:04:22 +09:00
|
|
|
if (KMALLOC_MIN_SIZE <= 32)
|
2024-04-24 23:04:21 +09:00
|
|
|
new_kmalloc_cache(1, type);
|
2024-04-24 23:04:22 +09:00
|
|
|
if (KMALLOC_MIN_SIZE <= 64)
|
2024-04-24 23:04:21 +09:00
|
|
|
new_kmalloc_cache(2, type);
|
|
|
|
|
|
|
|
/* Caches that are of the two-to-the-power-of size. */
|
2024-04-24 23:04:22 +09:00
|
|
|
for (i = KMALLOC_SHIFT_LOW; i <= KMALLOC_SHIFT_HIGH; i++)
|
|
|
|
new_kmalloc_cache(i, type);
|
2013-05-03 18:04:18 +00:00
|
|
|
}
|
Randomized slab caches for kmalloc()
When exploiting memory vulnerabilities, "heap spraying" is a common
technique targeting those related to dynamic memory allocation (i.e. the
"heap"), and it plays an important role in a successful exploitation.
Basically, it is to overwrite the memory area of vulnerable object by
triggering allocation in other subsystems or modules and therefore
getting a reference to the targeted memory location. It's usable on
various types of vulnerablity including use after free (UAF), heap out-
of-bound write and etc.
There are (at least) two reasons why the heap can be sprayed: 1) generic
slab caches are shared among different subsystems and modules, and
2) dedicated slab caches could be merged with the generic ones.
Currently these two factors cannot be prevented at a low cost: the first
one is a widely used memory allocation mechanism, and shutting down slab
merging completely via `slub_nomerge` would be overkill.
To efficiently prevent heap spraying, we propose the following approach:
to create multiple copies of generic slab caches that will never be
merged, and random one of them will be used at allocation. The random
selection is based on the address of code that calls `kmalloc()`, which
means it is static at runtime (rather than dynamically determined at
each time of allocation, which could be bypassed by repeatedly spraying
in brute force). In other words, the randomness of cache selection will
be with respect to the code address rather than time, i.e. allocations
in different code paths would most likely pick different caches,
although kmalloc() at each place would use the same cache copy whenever
it is executed. In this way, the vulnerable object and memory allocated
in other subsystems and modules will (most probably) be on different
slab caches, which prevents the object from being sprayed.
Meanwhile, the static random selection is further enhanced with a
per-boot random seed, which prevents the attacker from finding a usable
kmalloc that happens to pick the same cache with the vulnerable
subsystem/module by analyzing the open source code. In other words, with
the per-boot seed, the random selection is static during each time the
system starts and runs, but not across different system startups.
The overhead of performance has been tested on a 40-core x86 server by
comparing the results of `perf bench all` between the kernels with and
without this patch based on the latest linux-next kernel, which shows
minor difference. A subset of benchmarks are listed below:
sched/ sched/ syscall/ mem/ mem/
messaging pipe basic memcpy memset
(sec) (sec) (sec) (GB/sec) (GB/sec)
control1 0.019 5.459 0.733 15.258789 51.398026
control2 0.019 5.439 0.730 16.009221 48.828125
control3 0.019 5.282 0.735 16.009221 48.828125
control_avg 0.019 5.393 0.733 15.759077 49.684759
experiment1 0.019 5.374 0.741 15.500992 46.502976
experiment2 0.019 5.440 0.746 16.276042 51.398026
experiment3 0.019 5.242 0.752 15.258789 51.398026
experiment_avg 0.019 5.352 0.746 15.678608 49.766343
The overhead of memory usage was measured by executing `free` after boot
on a QEMU VM with 1GB total memory, and as expected, it's positively
correlated with # of cache copies:
control 4 copies 8 copies 16 copies
total 969.8M 968.2M 968.2M 968.2M
used 20.0M 21.9M 24.1M 26.7M
free 936.9M 933.6M 931.4M 928.6M
available 932.2M 928.8M 926.6M 923.9M
Co-developed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: GONG, Ruiqi <gongruiqi@huaweicloud.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Acked-by: Dennis Zhou <dennis@kernel.org> # percpu
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2023-07-14 14:44:22 +08:00
|
|
|
#ifdef CONFIG_RANDOM_KMALLOC_CACHES
|
|
|
|
random_kmalloc_seed = get_random_u64();
|
|
|
|
#endif
|
2013-05-03 18:04:18 +00:00
|
|
|
|
2013-01-10 19:12:17 +00:00
|
|
|
/* Kmalloc array is now usable */
|
|
|
|
slab_state = UP;
|
mm/slab: Introduce kmem_buckets_create() and family
Dedicated caches are available for fixed size allocations via
kmem_cache_alloc(), but for dynamically sized allocations there is only
the global kmalloc API's set of buckets available. This means it isn't
possible to separate specific sets of dynamically sized allocations into
a separate collection of caches.
This leads to a use-after-free exploitation weakness in the Linux
kernel since many heap memory spraying/grooming attacks depend on using
userspace-controllable dynamically sized allocations to collide with
fixed size allocations that end up in same cache.
While CONFIG_RANDOM_KMALLOC_CACHES provides a probabilistic defense
against these kinds of "type confusion" attacks, including for fixed
same-size heap objects, we can create a complementary deterministic
defense for dynamically sized allocations that are directly user
controlled. Addressing these cases is limited in scope, so isolating these
kinds of interfaces will not become an unbounded game of whack-a-mole. For
example, many pass through memdup_user(), making isolation there very
effective.
In order to isolate user-controllable dynamically-sized
allocations from the common system kmalloc allocations, introduce
kmem_buckets_create(), which behaves like kmem_cache_create(). Introduce
kmem_buckets_alloc(), which behaves like kmem_cache_alloc(). Introduce
kmem_buckets_alloc_track_caller() for where caller tracking is
needed. Introduce kmem_buckets_valloc() for cases where vmalloc fallback
is needed. Note that these caches are specifically flagged with
SLAB_NO_MERGE, since merging would defeat the entire purpose of the
mitigation.
This can also be used in the future to extend allocation profiling's use
of code tagging to implement per-caller allocation cache isolation[1]
even for dynamic allocations.
Memory allocation pinning[2] is still needed to plug the Use-After-Free
cross-allocator weakness (where attackers can arrange to free an
entire slab page and have it reallocated to a different cache),
but that is an existing and separate issue which is complementary
to this improvement. Development continues for that feature via the
SLAB_VIRTUAL[3] series (which could also provide guard pages -- another
complementary improvement).
Link: https://lore.kernel.org/lkml/202402211449.401382D2AF@keescook [1]
Link: https://googleprojectzero.blogspot.com/2021/10/how-simple-linux-kernel-memory.html [2]
Link: https://lore.kernel.org/lkml/20230915105933.495735-1-matteorizzo@google.com/ [3]
Signed-off-by: Kees Cook <kees@kernel.org>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2024-07-01 12:13:01 -07:00
|
|
|
|
|
|
|
if (IS_ENABLED(CONFIG_SLAB_BUCKETS))
|
|
|
|
kmem_buckets_cache = kmem_cache_create("kmalloc_buckets",
|
|
|
|
sizeof(kmem_buckets),
|
|
|
|
0, SLAB_NO_MERGE, NULL);
|
2013-01-10 19:12:17 +00:00
|
|
|
}
|
2022-08-17 19:18:19 +09:00
|
|
|
|
2022-09-29 11:30:55 +02:00
|
|
|
/**
|
|
|
|
* __ksize -- Report full size of underlying allocation
|
2022-10-31 10:29:20 +01:00
|
|
|
* @object: pointer to the object
|
2022-09-29 11:30:55 +02:00
|
|
|
*
|
|
|
|
* This should only be used internally to query the true size of allocations.
|
|
|
|
* It is not meant to be a way to discover the usable size of an allocation
|
|
|
|
* after the fact. Instead, use kmalloc_size_roundup(). Using memory beyond
|
|
|
|
* the originally requested allocation size may trigger KASAN, UBSAN_BOUNDS,
|
|
|
|
* and/or FORTIFY_SOURCE.
|
|
|
|
*
|
2022-10-31 10:29:20 +01:00
|
|
|
* Return: size of the actual memory used by @object in bytes
|
2022-09-29 11:30:55 +02:00
|
|
|
*/
|
2022-08-17 19:18:21 +09:00
|
|
|
size_t __ksize(const void *object)
|
|
|
|
{
|
|
|
|
struct folio *folio;
|
|
|
|
|
|
|
|
if (unlikely(object == ZERO_SIZE_PTR))
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
folio = virt_to_folio(object);
|
|
|
|
|
2022-08-17 19:18:26 +09:00
|
|
|
if (unlikely(!folio_test_slab(folio))) {
|
|
|
|
if (WARN_ON(folio_size(folio) <= KMALLOC_MAX_CACHE_SIZE))
|
|
|
|
return 0;
|
|
|
|
if (WARN_ON(object != folio_address(folio)))
|
|
|
|
return 0;
|
2022-08-17 19:18:21 +09:00
|
|
|
return folio_size(folio);
|
2022-08-17 19:18:26 +09:00
|
|
|
}
|
2022-08-17 19:18:21 +09:00
|
|
|
|
mm/slub: extend redzone check to extra allocated kmalloc space than requested
kmalloc will round up the request size to a fixed size (mostly power
of 2), so there could be a extra space than what is requested, whose
size is the actual buffer size minus original request size.
To better detect out of bound access or abuse of this space, add
redzone sanity check for it.
In current kernel, some kmalloc user already knows the existence of
the space and utilizes it after calling 'ksize()' to know the real
size of the allocated buffer. So we skip the sanity check for objects
which have been called with ksize(), as treating them as legitimate
users. Kees Cook is working on sanitizing all these user cases,
by using kmalloc_size_roundup() to avoid ambiguous usages. And after
this is done, this special handling for ksize() can be removed.
In some cases, the free pointer could be saved inside the latter
part of object data area, which may overlap the redzone part(for
small sizes of kmalloc objects). As suggested by Hyeonggon Yoo,
force the free pointer to be in meta data area when kmalloc redzone
debug is enabled, to make all kmalloc objects covered by redzone
check.
Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Feng Tang <feng.tang@intel.com>
Acked-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2022-10-21 11:24:05 +08:00
|
|
|
#ifdef CONFIG_SLUB_DEBUG
|
|
|
|
skip_orig_size_check(folio_slab(folio)->slab_cache, object);
|
|
|
|
#endif
|
|
|
|
|
2022-08-17 19:18:21 +09:00
|
|
|
return slab_ksize(folio_slab(folio)->slab_cache);
|
|
|
|
}
|
2022-08-17 19:18:22 +09:00
|
|
|
|
2020-08-06 23:18:28 -07:00
|
|
|
gfp_t kmalloc_fix_flags(gfp_t flags)
|
|
|
|
{
|
|
|
|
gfp_t invalid_mask = flags & GFP_SLAB_BUG_MASK;
|
|
|
|
|
|
|
|
flags &= ~GFP_SLAB_BUG_MASK;
|
|
|
|
pr_warn("Unexpected gfp: %#x (%pGg). Fixing up to gfp: %#x (%pGg). Fix your code!\n",
|
|
|
|
invalid_mask, &invalid_mask, flags, &flags);
|
|
|
|
dump_stack();
|
|
|
|
|
|
|
|
return flags;
|
|
|
|
}
|
|
|
|
|
2016-07-26 15:21:56 -07:00
|
|
|
#ifdef CONFIG_SLAB_FREELIST_RANDOM
|
|
|
|
/* Randomize a generic freelist */
|
2023-04-16 20:22:55 +03:00
|
|
|
static void freelist_randomize(unsigned int *list,
|
2018-04-05 16:21:46 -07:00
|
|
|
unsigned int count)
|
2016-07-26 15:21:56 -07:00
|
|
|
{
|
|
|
|
unsigned int rand;
|
2018-04-05 16:21:46 -07:00
|
|
|
unsigned int i;
|
2016-07-26 15:21:56 -07:00
|
|
|
|
|
|
|
for (i = 0; i < count; i++)
|
|
|
|
list[i] = i;
|
|
|
|
|
|
|
|
/* Fisher-Yates shuffle */
|
|
|
|
for (i = count - 1; i > 0; i--) {
|
2023-04-16 20:22:55 +03:00
|
|
|
rand = get_random_u32_below(i + 1);
|
2016-07-26 15:21:56 -07:00
|
|
|
swap(list[i], list[rand]);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Create a random sequence per cache */
|
|
|
|
int cache_random_seq_create(struct kmem_cache *cachep, unsigned int count,
|
|
|
|
gfp_t gfp)
|
|
|
|
{
|
|
|
|
|
|
|
|
if (count < 2 || cachep->random_seq)
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
cachep->random_seq = kcalloc(count, sizeof(unsigned int), gfp);
|
|
|
|
if (!cachep->random_seq)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
2023-04-16 20:22:55 +03:00
|
|
|
freelist_randomize(cachep->random_seq, count);
|
2016-07-26 15:21:56 -07:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Destroy the per-cache random freelist sequence */
|
|
|
|
void cache_random_seq_destroy(struct kmem_cache *cachep)
|
|
|
|
{
|
|
|
|
kfree(cachep->random_seq);
|
|
|
|
cachep->random_seq = NULL;
|
|
|
|
}
|
|
|
|
#endif /* CONFIG_SLAB_FREELIST_RANDOM */
|
|
|
|
|
2023-10-02 17:43:38 +02:00
|
|
|
#ifdef CONFIG_SLUB_DEBUG
|
2018-06-14 15:27:58 -07:00
|
|
|
#define SLABINFO_RIGHTS (0400)
|
2013-07-04 08:33:24 +08:00
|
|
|
|
2014-12-10 15:44:19 -08:00
|
|
|
static void print_slabinfo_header(struct seq_file *m)
|
2012-10-19 18:20:26 +04:00
|
|
|
{
|
|
|
|
/*
|
|
|
|
* Output format version, so at least we can change it
|
|
|
|
* without _too_ many complaints.
|
|
|
|
*/
|
|
|
|
seq_puts(m, "slabinfo - version: 2.1\n");
|
2016-03-17 14:19:47 -07:00
|
|
|
seq_puts(m, "# name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab>");
|
2012-10-19 18:20:26 +04:00
|
|
|
seq_puts(m, " : tunables <limit> <batchcount> <sharedfactor>");
|
|
|
|
seq_puts(m, " : slabdata <active_slabs> <num_slabs> <sharedavail>");
|
|
|
|
seq_putc(m, '\n');
|
|
|
|
}
|
|
|
|
|
2022-01-14 14:04:01 -08:00
|
|
|
static void *slab_start(struct seq_file *m, loff_t *pos)
|
2012-10-19 18:20:25 +04:00
|
|
|
{
|
|
|
|
mutex_lock(&slab_mutex);
|
2020-08-06 23:21:20 -07:00
|
|
|
return seq_list_start(&slab_caches, *pos);
|
2012-10-19 18:20:25 +04:00
|
|
|
}
|
|
|
|
|
2022-01-14 14:04:01 -08:00
|
|
|
static void *slab_next(struct seq_file *m, void *p, loff_t *pos)
|
2012-10-19 18:20:25 +04:00
|
|
|
{
|
2020-08-06 23:21:20 -07:00
|
|
|
return seq_list_next(p, &slab_caches, pos);
|
2012-10-19 18:20:25 +04:00
|
|
|
}
|
|
|
|
|
2022-01-14 14:04:01 -08:00
|
|
|
static void slab_stop(struct seq_file *m, void *p)
|
2012-10-19 18:20:25 +04:00
|
|
|
{
|
|
|
|
mutex_unlock(&slab_mutex);
|
|
|
|
}
|
|
|
|
|
2014-12-10 15:44:19 -08:00
|
|
|
static void cache_show(struct kmem_cache *s, struct seq_file *m)
|
2012-10-19 18:20:25 +04:00
|
|
|
{
|
2012-10-19 18:20:27 +04:00
|
|
|
struct slabinfo sinfo;
|
|
|
|
|
|
|
|
memset(&sinfo, 0, sizeof(sinfo));
|
|
|
|
get_slabinfo(s, &sinfo);
|
|
|
|
|
|
|
|
seq_printf(m, "%-17s %6lu %6lu %6u %4u %4d",
|
2020-08-06 23:21:27 -07:00
|
|
|
s->name, sinfo.active_objs, sinfo.num_objs, s->size,
|
2012-10-19 18:20:27 +04:00
|
|
|
sinfo.objects_per_slab, (1 << sinfo.cache_order));
|
|
|
|
|
|
|
|
seq_printf(m, " : tunables %4u %4u %4u",
|
|
|
|
sinfo.limit, sinfo.batchcount, sinfo.shared);
|
|
|
|
seq_printf(m, " : slabdata %6lu %6lu %6lu",
|
|
|
|
sinfo.active_slabs, sinfo.num_slabs, sinfo.shared_avail);
|
|
|
|
seq_putc(m, '\n');
|
2012-10-19 18:20:25 +04:00
|
|
|
}
|
|
|
|
|
2014-12-10 15:42:16 -08:00
|
|
|
static int slab_show(struct seq_file *m, void *p)
|
2012-12-18 14:23:01 -08:00
|
|
|
{
|
2020-08-06 23:21:20 -07:00
|
|
|
struct kmem_cache *s = list_entry(p, struct kmem_cache, list);
|
2012-12-18 14:23:01 -08:00
|
|
|
|
2020-08-06 23:21:20 -07:00
|
|
|
if (p == slab_caches.next)
|
2014-12-10 15:42:16 -08:00
|
|
|
print_slabinfo_header(m);
|
2020-08-06 23:21:27 -07:00
|
|
|
cache_show(s, m);
|
2014-12-10 15:44:19 -08:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2017-11-15 17:32:07 -08:00
|
|
|
void dump_unreclaimable_slab(void)
|
|
|
|
{
|
2020-12-14 19:03:47 -08:00
|
|
|
struct kmem_cache *s;
|
2017-11-15 17:32:07 -08:00
|
|
|
struct slabinfo sinfo;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Here acquiring slab_mutex is risky since we don't prefer to get
|
|
|
|
* sleep in oom path. But, without mutex hold, it may introduce a
|
|
|
|
* risk of crash.
|
|
|
|
* Use mutex_trylock to protect the list traverse, dump nothing
|
|
|
|
* without acquiring the mutex.
|
|
|
|
*/
|
|
|
|
if (!mutex_trylock(&slab_mutex)) {
|
|
|
|
pr_warn("excessive unreclaimable slab but cannot dump stats\n");
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
pr_info("Unreclaimable slab info:\n");
|
|
|
|
pr_info("Name Used Total\n");
|
|
|
|
|
2020-12-14 19:03:47 -08:00
|
|
|
list_for_each_entry(s, &slab_caches, list) {
|
2020-08-06 23:21:27 -07:00
|
|
|
if (s->flags & SLAB_RECLAIM_ACCOUNT)
|
2017-11-15 17:32:07 -08:00
|
|
|
continue;
|
|
|
|
|
|
|
|
get_slabinfo(s, &sinfo);
|
|
|
|
|
|
|
|
if (sinfo.num_objs > 0)
|
2020-08-06 23:21:27 -07:00
|
|
|
pr_info("%-17s %10luKB %10luKB\n", s->name,
|
2017-11-15 17:32:07 -08:00
|
|
|
(sinfo.active_objs * s->size) / 1024,
|
|
|
|
(sinfo.num_objs * s->size) / 1024);
|
|
|
|
}
|
|
|
|
mutex_unlock(&slab_mutex);
|
|
|
|
}
|
|
|
|
|
2012-10-19 18:20:25 +04:00
|
|
|
/*
|
|
|
|
* slabinfo_op - iterator that generates /proc/slabinfo
|
|
|
|
*
|
|
|
|
* Output layout:
|
|
|
|
* cache-name
|
|
|
|
* num-active-objs
|
|
|
|
* total-objs
|
|
|
|
* object size
|
|
|
|
* num-active-slabs
|
|
|
|
* total-slabs
|
|
|
|
* num-pages-per-slab
|
|
|
|
* + further values on SMP and with statistics enabled
|
|
|
|
*/
|
|
|
|
static const struct seq_operations slabinfo_op = {
|
2014-12-10 15:42:16 -08:00
|
|
|
.start = slab_start,
|
2013-07-08 08:08:28 +08:00
|
|
|
.next = slab_next,
|
|
|
|
.stop = slab_stop,
|
2014-12-10 15:42:16 -08:00
|
|
|
.show = slab_show,
|
2012-10-19 18:20:25 +04:00
|
|
|
};
|
|
|
|
|
|
|
|
static int slabinfo_open(struct inode *inode, struct file *file)
|
|
|
|
{
|
|
|
|
return seq_open(file, &slabinfo_op);
|
|
|
|
}
|
|
|
|
|
2020-02-03 17:37:17 -08:00
|
|
|
static const struct proc_ops slabinfo_proc_ops = {
|
proc: faster open/read/close with "permanent" files
Now that "struct proc_ops" exist we can start putting there stuff which
could not fly with VFS "struct file_operations"...
Most of fs/proc/inode.c file is dedicated to make open/read/.../close
reliable in the event of disappearing /proc entries which usually happens
if module is getting removed. Files like /proc/cpuinfo which never
disappear simply do not need such protection.
Save 2 atomic ops, 1 allocation, 1 free per open/read/close sequence for such
"permanent" files.
Enable "permanent" flag for
/proc/cpuinfo
/proc/kmsg
/proc/modules
/proc/slabinfo
/proc/stat
/proc/sysvipc/*
/proc/swaps
More will come once I figure out foolproof way to prevent out module
authors from marking their stuff "permanent" for performance reasons
when it is not.
This should help with scalability: benchmark is "read /proc/cpuinfo R times
by N threads scattered over the system".
N R t, s (before) t, s (after)
-----------------------------------------------------
64 4096 1.582458 1.530502 -3.2%
256 4096 6.371926 6.125168 -3.9%
1024 4096 25.64888 24.47528 -4.6%
Benchmark source:
#include <chrono>
#include <iostream>
#include <thread>
#include <vector>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
const int NR_CPUS = sysconf(_SC_NPROCESSORS_ONLN);
int N;
const char *filename;
int R;
int xxx = 0;
int glue(int n)
{
cpu_set_t m;
CPU_ZERO(&m);
CPU_SET(n, &m);
return sched_setaffinity(0, sizeof(cpu_set_t), &m);
}
void f(int n)
{
glue(n % NR_CPUS);
while (*(volatile int *)&xxx == 0) {
}
for (int i = 0; i < R; i++) {
int fd = open(filename, O_RDONLY);
char buf[4096];
ssize_t rv = read(fd, buf, sizeof(buf));
asm volatile ("" :: "g" (rv));
close(fd);
}
}
int main(int argc, char *argv[])
{
if (argc < 4) {
std::cerr << "usage: " << argv[0] << ' ' << "N /proc/filename R
";
return 1;
}
N = atoi(argv[1]);
filename = argv[2];
R = atoi(argv[3]);
for (int i = 0; i < NR_CPUS; i++) {
if (glue(i) == 0)
break;
}
std::vector<std::thread> T;
T.reserve(N);
for (int i = 0; i < N; i++) {
T.emplace_back(f, i);
}
auto t0 = std::chrono::system_clock::now();
{
*(volatile int *)&xxx = 1;
for (auto& t: T) {
t.join();
}
}
auto t1 = std::chrono::system_clock::now();
std::chrono::duration<double> dt = t1 - t0;
std::cout << dt.count() << '
';
return 0;
}
P.S.:
Explicit randomization marker is added because adding non-function pointer
will silently disable structure layout randomization.
[akpm@linux-foundation.org: coding style fixes]
Reported-by: kbuild test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Joe Perches <joe@perches.com>
Link: http://lkml.kernel.org/r/20200222201539.GA22576@avx2
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-06 20:09:01 -07:00
|
|
|
.proc_flags = PROC_ENTRY_PERMANENT,
|
2020-02-03 17:37:17 -08:00
|
|
|
.proc_open = slabinfo_open,
|
|
|
|
.proc_read = seq_read,
|
|
|
|
.proc_lseek = seq_lseek,
|
|
|
|
.proc_release = seq_release,
|
2012-10-19 18:20:25 +04:00
|
|
|
};
|
|
|
|
|
|
|
|
static int __init slab_proc_init(void)
|
|
|
|
{
|
2020-02-03 17:37:17 -08:00
|
|
|
proc_create("slabinfo", SLABINFO_RIGHTS, NULL, &slabinfo_proc_ops);
|
2012-10-19 18:20:25 +04:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
module_init(slab_proc_init);
|
2019-07-11 20:56:38 -07:00
|
|
|
|
2023-10-02 17:43:38 +02:00
|
|
|
#endif /* CONFIG_SLUB_DEBUG */
|
2014-08-06 16:04:44 -07:00
|
|
|
|
|
|
|
/**
|
2020-08-06 23:18:13 -07:00
|
|
|
* kfree_sensitive - Clear sensitive information in memory before freeing
|
2014-08-06 16:04:44 -07:00
|
|
|
* @p: object to free memory of
|
|
|
|
*
|
|
|
|
* The memory of the object @p points to is zeroed before freed.
|
2020-08-06 23:18:13 -07:00
|
|
|
* If @p is %NULL, kfree_sensitive() does nothing.
|
2014-08-06 16:04:44 -07:00
|
|
|
*
|
|
|
|
* Note: this function zeroes the whole allocated buffer which can be a good
|
|
|
|
* deal bigger than the requested buffer size passed to kmalloc(). So be
|
|
|
|
* careful when using this function in performance sensitive code.
|
|
|
|
*/
|
2020-08-06 23:18:13 -07:00
|
|
|
void kfree_sensitive(const void *p)
|
2014-08-06 16:04:44 -07:00
|
|
|
{
|
|
|
|
size_t ks;
|
|
|
|
void *mem = (void *)p;
|
|
|
|
|
|
|
|
ks = ksize(mem);
|
2022-09-22 13:08:16 -07:00
|
|
|
if (ks) {
|
|
|
|
kasan_unpoison_range(mem, ks);
|
2020-08-06 23:18:17 -07:00
|
|
|
memzero_explicit(mem, ks);
|
2022-09-22 13:08:16 -07:00
|
|
|
}
|
2014-08-06 16:04:44 -07:00
|
|
|
kfree(mem);
|
|
|
|
}
|
2020-08-06 23:18:13 -07:00
|
|
|
EXPORT_SYMBOL(kfree_sensitive);
|
2014-08-06 16:04:44 -07:00
|
|
|
|
2019-07-11 20:54:14 -07:00
|
|
|
size_t ksize(const void *objp)
|
|
|
|
{
|
2019-07-11 20:54:18 -07:00
|
|
|
/*
|
2022-09-22 13:08:16 -07:00
|
|
|
* We need to first check that the pointer to the object is valid.
|
|
|
|
* The KASAN report printed from ksize() is more useful, then when
|
|
|
|
* it's printed later when the behaviour could be undefined due to
|
|
|
|
* a potential use-after-free or double-free.
|
2019-07-11 20:54:18 -07:00
|
|
|
*
|
2021-02-24 12:05:50 -08:00
|
|
|
* We use kasan_check_byte(), which is supported for the hardware
|
|
|
|
* tag-based KASAN mode, unlike kasan_check_read/write().
|
|
|
|
*
|
|
|
|
* If the pointed to memory is invalid, we return 0 to avoid users of
|
2019-07-11 20:54:18 -07:00
|
|
|
* ksize() writing to and potentially corrupting the memory region.
|
|
|
|
*
|
|
|
|
* We want to perform the check before __ksize(), to avoid potentially
|
|
|
|
* crashing in __ksize() due to accessing invalid metadata.
|
|
|
|
*/
|
2021-02-24 12:05:50 -08:00
|
|
|
if (unlikely(ZERO_OR_NULL_PTR(objp)) || !kasan_check_byte(objp))
|
2019-07-11 20:54:18 -07:00
|
|
|
return 0;
|
|
|
|
|
2022-09-22 13:08:16 -07:00
|
|
|
return kfence_ksize(objp) ?: __ksize(objp);
|
2019-07-11 20:54:14 -07:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(ksize);
|
|
|
|
|
2024-10-10 16:25:04 -07:00
|
|
|
#ifdef CONFIG_BPF_SYSCALL
|
|
|
|
#include <linux/btf.h>
|
|
|
|
|
|
|
|
__bpf_kfunc_start_defs();
|
|
|
|
|
|
|
|
__bpf_kfunc struct kmem_cache *bpf_get_kmem_cache(u64 addr)
|
|
|
|
{
|
|
|
|
struct slab *slab;
|
|
|
|
|
|
|
|
if (!virt_addr_valid((void *)(long)addr))
|
|
|
|
return NULL;
|
|
|
|
|
|
|
|
slab = virt_to_slab((void *)(long)addr);
|
|
|
|
return slab ? slab->slab_cache : NULL;
|
|
|
|
}
|
|
|
|
|
|
|
|
__bpf_kfunc_end_defs();
|
|
|
|
#endif /* CONFIG_BPF_SYSCALL */
|
|
|
|
|
2014-08-06 16:04:44 -07:00
|
|
|
/* Tracepoints definitions. */
|
|
|
|
EXPORT_TRACEPOINT_SYMBOL(kmalloc);
|
|
|
|
EXPORT_TRACEPOINT_SYMBOL(kmem_cache_alloc);
|
|
|
|
EXPORT_TRACEPOINT_SYMBOL(kfree);
|
|
|
|
EXPORT_TRACEPOINT_SYMBOL(kmem_cache_free);
|
2018-04-05 16:23:57 -07:00
|
|
|
|