License cleanup: add SPDX GPL-2.0 license identifier to files with no license
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.
By default all files without license information are under the default
license of the kernel, which is GPL version 2.
Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.
This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.
How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information it it.
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information,
Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.
The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.
The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.
Criteria used to select files for SPDX license identifier tagging was:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).
All documentation files were explicitly excluded.
The following heuristics were used to determine which SPDX license
identifiers to apply.
- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.
For non */uapi/* files that summary was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 11139
and resulted in the first patch in this series.
If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note 930
and resulted in the second patch in this series.
- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:
SPDX license identifier # files
---------------------------------------------------|------
GPL-2.0 WITH Linux-syscall-note 270
GPL-2.0+ WITH Linux-syscall-note 169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17
LGPL-2.1+ WITH Linux-syscall-note 15
GPL-1.0+ WITH Linux-syscall-note 14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5
LGPL-2.0+ WITH Linux-syscall-note 4
LGPL-2.1 WITH Linux-syscall-note 3
((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3
((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1
and that resulted in the third patch in this series.
- when the two scanners agreed on the detected license(s), that became
the concluded license(s).
- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.
- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).
- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.
- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.
In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.
Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there was new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.
Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.
In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.
Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct
This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.
These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-01 14:07:57 +00:00
|
|
|
/* SPDX-License-Identifier: GPL-2.0 */
|
2020-06-09 04:32:38 +00:00
|
|
|
#ifndef _LINUX_PGTABLE_H
|
|
|
|
#define _LINUX_PGTABLE_H
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2016-01-16 00:56:43 +00:00
|
|
|
#include <linux/pfn.h>
|
2020-06-09 04:32:38 +00:00
|
|
|
#include <asm/pgtable.h>
|
2016-01-16 00:56:43 +00:00
|
|
|
|
2023-08-18 20:23:33 +00:00
|
|
|
#define PMD_ORDER (PMD_SHIFT - PAGE_SHIFT)
|
|
|
|
#define PUD_ORDER (PUD_SHIFT - PAGE_SHIFT)
|
|
|
|
|
2006-09-26 06:32:29 +00:00
|
|
|
#ifndef __ASSEMBLY__
|
2007-08-10 20:01:20 +00:00
|
|
|
#ifdef CONFIG_MMU
|
2006-09-26 06:32:29 +00:00
|
|
|
|
2011-02-27 05:41:35 +00:00
|
|
|
#include <linux/mm_types.h>
|
2011-11-24 01:12:59 +00:00
|
|
|
#include <linux/bug.h>
|
2015-04-14 22:47:23 +00:00
|
|
|
#include <linux/errno.h>
|
2020-04-07 03:05:33 +00:00
|
|
|
#include <asm-generic/pgtable_uffd.h>
|
2022-05-13 03:23:06 +00:00
|
|
|
#include <linux/page_table_check.h>
|
2011-02-27 05:41:35 +00:00
|
|
|
|
2017-03-09 14:24:07 +00:00
|
|
|
#if 5 - defined(__PAGETABLE_P4D_FOLDED) - defined(__PAGETABLE_PUD_FOLDED) - \
|
|
|
|
defined(__PAGETABLE_PMD_FOLDED) != CONFIG_PGTABLE_LEVELS
|
|
|
|
#error CONFIG_PGTABLE_LEVELS is not consistent with __PAGETABLE_{P4D,PUD,PMD}_FOLDED
|
2015-04-14 22:46:17 +00:00
|
|
|
#endif
|
|
|
|
|
2013-04-29 22:07:44 +00:00
|
|
|
/*
|
|
|
|
* On almost all architectures and configurations, 0 can be used as the
|
|
|
|
* upper ceiling to free_pgtables(): on many architectures it has the same
|
|
|
|
* effect as using TASK_SIZE. However, there is one configuration which
|
|
|
|
* must impose a more careful limit, to avoid freeing kernel pgtables.
|
|
|
|
*/
|
|
|
|
#ifndef USER_PGTABLES_CEILING
|
|
|
|
#define USER_PGTABLES_CEILING 0UL
|
|
|
|
#endif
|
2021-07-01 01:53:13 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* This defines the first usable user address. Platforms
|
|
|
|
* can override its value with custom FIRST_USER_ADDRESS
|
|
|
|
* defined in their respective <asm/pgtable.h>.
|
|
|
|
*/
|
|
|
|
#ifndef FIRST_USER_ADDRESS
|
|
|
|
#define FIRST_USER_ADDRESS 0UL
|
|
|
|
#endif
|
2021-07-01 01:53:59 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* This defines the generic helper for accessing PMD page
|
|
|
|
* table page. Although platforms can still override this
|
|
|
|
* via their respective <asm/pgtable.h>.
|
|
|
|
*/
|
|
|
|
#ifndef pmd_pgtable
|
|
|
|
#define pmd_pgtable(pmd) pmd_page(pmd)
|
|
|
|
#endif
|
2013-04-29 22:07:44 +00:00
|
|
|
|
2024-03-26 20:28:23 +00:00
|
|
|
#define pmd_folio(pmd) page_folio(pmd_page(pmd))
|
|
|
|
|
2020-06-09 04:33:10 +00:00
|
|
|
/*
|
|
|
|
* A page table page can be thought of an array like this: pXd_t[PTRS_PER_PxD]
|
|
|
|
*
|
|
|
|
* The pXx_index() functions return the index of the entry in the page
|
|
|
|
* table page which would control the given virtual address
|
|
|
|
*
|
|
|
|
* As these functions may be used by the same code for different levels of
|
|
|
|
* the page table folding, they are always available, regardless of
|
|
|
|
* CONFIG_PGTABLE_LEVELS value. For the folded levels they simply return 0
|
|
|
|
* because in such cases PTRS_PER_PxD equals 1.
|
|
|
|
*/
|
|
|
|
|
|
|
|
static inline unsigned long pte_index(unsigned long address)
|
|
|
|
{
|
|
|
|
return (address >> PAGE_SHIFT) & (PTRS_PER_PTE - 1);
|
|
|
|
}
|
|
|
|
|
|
|
|
#ifndef pmd_index
|
|
|
|
static inline unsigned long pmd_index(unsigned long address)
|
|
|
|
{
|
|
|
|
return (address >> PMD_SHIFT) & (PTRS_PER_PMD - 1);
|
|
|
|
}
|
|
|
|
#define pmd_index pmd_index
|
|
|
|
#endif
|
|
|
|
|
|
|
|
#ifndef pud_index
|
|
|
|
static inline unsigned long pud_index(unsigned long address)
|
|
|
|
{
|
|
|
|
return (address >> PUD_SHIFT) & (PTRS_PER_PUD - 1);
|
|
|
|
}
|
|
|
|
#define pud_index pud_index
|
|
|
|
#endif
|
|
|
|
|
|
|
|
#ifndef pgd_index
|
|
|
|
/* Must be a compile-time constant, so implement it as a macro */
|
|
|
|
#define pgd_index(a) (((a) >> PGDIR_SHIFT) & (PTRS_PER_PGD - 1))
|
|
|
|
#endif
|
|
|
|
|
2024-11-04 07:07:12 +00:00
|
|
|
#ifndef kernel_pte_init
|
|
|
|
static inline void kernel_pte_init(void *addr)
|
|
|
|
{
|
|
|
|
}
|
|
|
|
#define kernel_pte_init kernel_pte_init
|
|
|
|
#endif
|
|
|
|
|
|
|
|
#ifndef pmd_init
|
|
|
|
static inline void pmd_init(void *addr)
|
|
|
|
{
|
|
|
|
}
|
|
|
|
#define pmd_init pmd_init
|
|
|
|
#endif
|
|
|
|
|
|
|
|
#ifndef pud_init
|
|
|
|
static inline void pud_init(void *addr)
|
|
|
|
{
|
|
|
|
}
|
|
|
|
#define pud_init pud_init
|
|
|
|
#endif
|
|
|
|
|
2020-06-09 04:33:10 +00:00
|
|
|
#ifndef pte_offset_kernel
|
|
|
|
static inline pte_t *pte_offset_kernel(pmd_t *pmd, unsigned long address)
|
|
|
|
{
|
|
|
|
return (pte_t *)pmd_page_vaddr(*pmd) + pte_index(address);
|
|
|
|
}
|
|
|
|
#define pte_offset_kernel pte_offset_kernel
|
|
|
|
#endif
|
|
|
|
|
mm/pgtable: allow pte_offset_map[_lock]() to fail
Make pte_offset_map() a wrapper for __pte_offset_map() (optionally outputs
pmdval), pte_offset_map_lock() a sparse __cond_lock wrapper for
__pte_offset_map_lock(): those __funcs added in mm/pgtable-generic.c.
__pte_offset_map() do pmdval validation (including pmd_clear_bad() when
pmd_bad()), returning NULL if pmdval is not for a page table.
__pte_offset_map_lock() verify pmdval unchanged after getting the lock,
trying again if it changed.
No #ifdef CONFIG_TRANSPARENT_HUGEPAGE around them: that could be done to
cover the imminent case, but we expect to generalize it later, and it
makes a mess of where to do the pmd_bad() clearing.
Add pte_offset_map_nolock(): outputs ptl like pte_offset_map_lock(),
without actually taking the lock. This will be preferred to open uses of
pte_lockptr(), because (when split ptlock is in page table's struct page)
it points to the right lock for the returned pte pointer, even if *pmd
gets changed racily afterwards.
Update corresponding Documentation.
Do not add the anticipated rcu_read_lock() and rcu_read_unlock()s yet:
they have to wait until all architectures are balancing pte_offset_map()s
with pte_unmap()s (as in the arch series posted earlier). But comment
where they will go, so that it's easy to add them for experiments. And
only when those are in place can transient racy failure cases be enabled.
Add more safety for the PAE mismatched pmd_low pmd_high case at that time.
Link: https://lkml.kernel.org/r/2929bfd-9893-a374-e463-4c3127ff9b9d@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Song Liu <song@kernel.org>
Cc: Steven Price <steven.price@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Zack Rusin <zackr@vmware.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 01:10:32 +00:00
|
|
|
#ifdef CONFIG_HIGHPTE
|
|
|
|
#define __pte_map(pmd, address) \
|
|
|
|
((pte_t *)kmap_local_page(pmd_page(*(pmd))) + pte_index((address)))
|
|
|
|
#define pte_unmap(pte) do { \
|
|
|
|
kunmap_local((pte)); \
|
2023-07-12 04:30:40 +00:00
|
|
|
rcu_read_unlock(); \
|
mm/pgtable: allow pte_offset_map[_lock]() to fail
Make pte_offset_map() a wrapper for __pte_offset_map() (optionally outputs
pmdval), pte_offset_map_lock() a sparse __cond_lock wrapper for
__pte_offset_map_lock(): those __funcs added in mm/pgtable-generic.c.
__pte_offset_map() do pmdval validation (including pmd_clear_bad() when
pmd_bad()), returning NULL if pmdval is not for a page table.
__pte_offset_map_lock() verify pmdval unchanged after getting the lock,
trying again if it changed.
No #ifdef CONFIG_TRANSPARENT_HUGEPAGE around them: that could be done to
cover the imminent case, but we expect to generalize it later, and it
makes a mess of where to do the pmd_bad() clearing.
Add pte_offset_map_nolock(): outputs ptl like pte_offset_map_lock(),
without actually taking the lock. This will be preferred to open uses of
pte_lockptr(), because (when split ptlock is in page table's struct page)
it points to the right lock for the returned pte pointer, even if *pmd
gets changed racily afterwards.
Update corresponding Documentation.
Do not add the anticipated rcu_read_lock() and rcu_read_unlock()s yet:
they have to wait until all architectures are balancing pte_offset_map()s
with pte_unmap()s (as in the arch series posted earlier). But comment
where they will go, so that it's easy to add them for experiments. And
only when those are in place can transient racy failure cases be enabled.
Add more safety for the PAE mismatched pmd_low pmd_high case at that time.
Link: https://lkml.kernel.org/r/2929bfd-9893-a374-e463-4c3127ff9b9d@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Song Liu <song@kernel.org>
Cc: Steven Price <steven.price@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Zack Rusin <zackr@vmware.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 01:10:32 +00:00
|
|
|
} while (0)
|
2020-06-09 04:33:10 +00:00
|
|
|
#else
|
mm/pgtable: allow pte_offset_map[_lock]() to fail
Make pte_offset_map() a wrapper for __pte_offset_map() (optionally outputs
pmdval), pte_offset_map_lock() a sparse __cond_lock wrapper for
__pte_offset_map_lock(): those __funcs added in mm/pgtable-generic.c.
__pte_offset_map() do pmdval validation (including pmd_clear_bad() when
pmd_bad()), returning NULL if pmdval is not for a page table.
__pte_offset_map_lock() verify pmdval unchanged after getting the lock,
trying again if it changed.
No #ifdef CONFIG_TRANSPARENT_HUGEPAGE around them: that could be done to
cover the imminent case, but we expect to generalize it later, and it
makes a mess of where to do the pmd_bad() clearing.
Add pte_offset_map_nolock(): outputs ptl like pte_offset_map_lock(),
without actually taking the lock. This will be preferred to open uses of
pte_lockptr(), because (when split ptlock is in page table's struct page)
it points to the right lock for the returned pte pointer, even if *pmd
gets changed racily afterwards.
Update corresponding Documentation.
Do not add the anticipated rcu_read_lock() and rcu_read_unlock()s yet:
they have to wait until all architectures are balancing pte_offset_map()s
with pte_unmap()s (as in the arch series posted earlier). But comment
where they will go, so that it's easy to add them for experiments. And
only when those are in place can transient racy failure cases be enabled.
Add more safety for the PAE mismatched pmd_low pmd_high case at that time.
Link: https://lkml.kernel.org/r/2929bfd-9893-a374-e463-4c3127ff9b9d@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Song Liu <song@kernel.org>
Cc: Steven Price <steven.price@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Zack Rusin <zackr@vmware.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 01:10:32 +00:00
|
|
|
static inline pte_t *__pte_map(pmd_t *pmd, unsigned long address)
|
|
|
|
{
|
|
|
|
return pte_offset_kernel(pmd, address);
|
|
|
|
}
|
|
|
|
static inline void pte_unmap(pte_t *pte)
|
|
|
|
{
|
2023-07-12 04:30:40 +00:00
|
|
|
rcu_read_unlock();
|
mm/pgtable: allow pte_offset_map[_lock]() to fail
Make pte_offset_map() a wrapper for __pte_offset_map() (optionally outputs
pmdval), pte_offset_map_lock() a sparse __cond_lock wrapper for
__pte_offset_map_lock(): those __funcs added in mm/pgtable-generic.c.
__pte_offset_map() do pmdval validation (including pmd_clear_bad() when
pmd_bad()), returning NULL if pmdval is not for a page table.
__pte_offset_map_lock() verify pmdval unchanged after getting the lock,
trying again if it changed.
No #ifdef CONFIG_TRANSPARENT_HUGEPAGE around them: that could be done to
cover the imminent case, but we expect to generalize it later, and it
makes a mess of where to do the pmd_bad() clearing.
Add pte_offset_map_nolock(): outputs ptl like pte_offset_map_lock(),
without actually taking the lock. This will be preferred to open uses of
pte_lockptr(), because (when split ptlock is in page table's struct page)
it points to the right lock for the returned pte pointer, even if *pmd
gets changed racily afterwards.
Update corresponding Documentation.
Do not add the anticipated rcu_read_lock() and rcu_read_unlock()s yet:
they have to wait until all architectures are balancing pte_offset_map()s
with pte_unmap()s (as in the arch series posted earlier). But comment
where they will go, so that it's easy to add them for experiments. And
only when those are in place can transient racy failure cases be enabled.
Add more safety for the PAE mismatched pmd_low pmd_high case at that time.
Link: https://lkml.kernel.org/r/2929bfd-9893-a374-e463-4c3127ff9b9d@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Song Liu <song@kernel.org>
Cc: Steven Price <steven.price@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Zack Rusin <zackr@vmware.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 01:10:32 +00:00
|
|
|
}
|
2020-06-09 04:33:10 +00:00
|
|
|
#endif
|
|
|
|
|
2023-07-12 04:39:48 +00:00
|
|
|
void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable);
|
|
|
|
|
2020-06-09 04:33:10 +00:00
|
|
|
/* Find an entry in the second-level page table.. */
|
|
|
|
#ifndef pmd_offset
|
|
|
|
static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
|
|
|
|
{
|
2021-07-08 01:09:53 +00:00
|
|
|
return pud_pgtable(*pud) + pmd_index(address);
|
2020-06-09 04:33:10 +00:00
|
|
|
}
|
|
|
|
#define pmd_offset pmd_offset
|
|
|
|
#endif
|
|
|
|
|
|
|
|
#ifndef pud_offset
|
|
|
|
static inline pud_t *pud_offset(p4d_t *p4d, unsigned long address)
|
|
|
|
{
|
2021-07-08 01:09:56 +00:00
|
|
|
return p4d_pgtable(*p4d) + pud_index(address);
|
2020-06-09 04:33:10 +00:00
|
|
|
}
|
|
|
|
#define pud_offset pud_offset
|
|
|
|
#endif
|
|
|
|
|
|
|
|
static inline pgd_t *pgd_offset_pgd(pgd_t *pgd, unsigned long address)
|
|
|
|
{
|
|
|
|
return (pgd + pgd_index(address));
|
|
|
|
};
|
|
|
|
|
|
|
|
/*
|
|
|
|
* a shortcut to get a pgd_t in a given mm
|
|
|
|
*/
|
|
|
|
#ifndef pgd_offset
|
|
|
|
#define pgd_offset(mm, address) pgd_offset_pgd((mm)->pgd, (address))
|
|
|
|
#endif
|
|
|
|
|
|
|
|
/*
|
|
|
|
* a shortcut which implies the use of the kernel's pgd, instead
|
|
|
|
* of a process's
|
|
|
|
*/
|
|
|
|
#define pgd_offset_k(address) pgd_offset(&init_mm, (address))
|
|
|
|
|
2020-06-09 04:33:05 +00:00
|
|
|
/*
|
|
|
|
* In many cases it is known that a virtual address is mapped at PMD or PTE
|
|
|
|
* level, so instead of traversing all the page table levels, we can get a
|
|
|
|
* pointer to the PMD entry in user or kernel page table or translate a virtual
|
|
|
|
* address to the pointer in the PTE in the kernel page tables with simple
|
|
|
|
* helpers.
|
|
|
|
*/
|
|
|
|
static inline pmd_t *pmd_off(struct mm_struct *mm, unsigned long va)
|
|
|
|
{
|
|
|
|
return pmd_offset(pud_offset(p4d_offset(pgd_offset(mm, va), va), va), va);
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline pmd_t *pmd_off_k(unsigned long va)
|
|
|
|
{
|
|
|
|
return pmd_offset(pud_offset(p4d_offset(pgd_offset_k(va), va), va), va);
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline pte_t *virt_to_kpte(unsigned long vaddr)
|
|
|
|
{
|
|
|
|
pmd_t *pmd = pmd_off_k(vaddr);
|
|
|
|
|
|
|
|
return pmd_none(*pmd) ? NULL : pte_offset_kernel(pmd, vaddr);
|
|
|
|
}
|
|
|
|
|
2022-11-30 22:49:41 +00:00
|
|
|
#ifndef pmd_young
|
|
|
|
static inline int pmd_young(pmd_t pmd)
|
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
2023-12-27 14:12:04 +00:00
|
|
|
#ifndef pmd_dirty
|
|
|
|
static inline int pmd_dirty(pmd_t pmd)
|
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
2023-08-02 15:13:34 +00:00
|
|
|
/*
|
|
|
|
* A facility to provide lazy MMU batching. This allows PTE updates and
|
|
|
|
* page invalidations to be delayed until a call to leave lazy MMU mode
|
|
|
|
* is issued. Some architectures may benefit from doing this, and it is
|
|
|
|
* beneficial for both shadow and direct mode hypervisors, which may batch
|
|
|
|
* the PTE updates which happen during this window. Note that using this
|
|
|
|
* interface requires that read hazards be removed from the code. A read
|
|
|
|
* hazard could result in the direct mode hypervisor case, since the actual
|
|
|
|
* write to the page tables may not yet have taken place, so reads though
|
|
|
|
* a raw PTE pointer after it has been modified are not guaranteed to be
|
|
|
|
* up to date. This mode can only be entered and left under the protection of
|
|
|
|
* the page table locks for all page tables which may be modified. In the UP
|
|
|
|
* case, this is required so that preemption is disabled, and in the SMP case,
|
|
|
|
* it must synchronize the delayed page table writes properly on other CPUs.
|
|
|
|
*/
|
|
|
|
#ifndef __HAVE_ARCH_ENTER_LAZY_MMU_MODE
|
|
|
|
#define arch_enter_lazy_mmu_mode() do {} while (0)
|
|
|
|
#define arch_leave_lazy_mmu_mode() do {} while (0)
|
|
|
|
#define arch_flush_lazy_mmu_mode() do {} while (0)
|
|
|
|
#endif
|
|
|
|
|
2024-02-15 10:32:02 +00:00
|
|
|
#ifndef pte_batch_hint
|
|
|
|
/**
|
|
|
|
* pte_batch_hint - Number of pages that can be added to batch without scanning.
|
|
|
|
* @ptep: Page table pointer for the entry.
|
|
|
|
* @pte: Page table entry.
|
|
|
|
*
|
|
|
|
* Some architectures know that a set of contiguous ptes all map the same
|
|
|
|
* contiguous memory with the same permissions. In this case, it can provide a
|
|
|
|
* hint to aid pte batching without the core code needing to scan every pte.
|
|
|
|
*
|
|
|
|
* An architecture implementation may ignore the PTE accessed state. Further,
|
|
|
|
* the dirty state must apply atomically to all the PTEs described by the hint.
|
|
|
|
*
|
|
|
|
* May be overridden by the architecture, else pte_batch_hint is always 1.
|
|
|
|
*/
|
|
|
|
static inline unsigned int pte_batch_hint(pte_t *ptep, pte_t pte)
|
|
|
|
{
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
2024-02-15 10:31:50 +00:00
|
|
|
#ifndef pte_advance_pfn
|
|
|
|
static inline pte_t pte_advance_pfn(pte_t pte, unsigned long nr)
|
2023-09-20 04:09:58 +00:00
|
|
|
{
|
2024-02-15 10:31:50 +00:00
|
|
|
return __pte(pte_val(pte) + (nr << PFN_PTE_SHIFT));
|
2023-09-20 04:09:58 +00:00
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
2024-02-15 10:31:50 +00:00
|
|
|
#define pte_next_pfn(pte) pte_advance_pfn(pte, 1)
|
|
|
|
|
2024-01-29 12:46:42 +00:00
|
|
|
#ifndef set_ptes
|
2023-08-02 15:13:34 +00:00
|
|
|
/**
|
|
|
|
* set_ptes - Map consecutive pages to a contiguous range of addresses.
|
|
|
|
* @mm: Address space to map the pages into.
|
|
|
|
* @addr: Address to map the first page at.
|
|
|
|
* @ptep: Page table pointer for the first entry.
|
|
|
|
* @pte: Page table entry for the first page.
|
|
|
|
* @nr: Number of pages to map.
|
|
|
|
*
|
mm: clarify the spec for set_ptes()
Patch series "Transparent Contiguous PTEs for User Mappings", v6.
This is a series to opportunistically and transparently use contpte
mappings (set the contiguous bit in ptes) for user memory when those
mappings meet the requirements. The change benefits arm64, but there is
some (very) minor refactoring for x86 to enable its integration with
core-mm.
It is part of a wider effort to improve performance by allocating and
mapping variable-sized blocks of memory (folios). One aim is for the 4K
kernel to approach the performance of the 16K kernel, but without breaking
compatibility and without the associated increase in memory. Another aim
is to benefit the 16K and 64K kernels by enabling 2M THP, since this is
the contpte size for those kernels. We have good performance data that
demonstrates both aims are being met (see below).
Of course this is only one half of the change. We require the mapped
physical memory to be the correct size and alignment for this to actually
be useful (i.e. 64K for 4K pages, or 2M for 16K/64K pages). Fortunately
folios are solving this problem for us. Filesystems that support it (XFS,
AFS, EROFS, tmpfs, ...) will allocate large folios up to the PMD size
today, and more filesystems are coming. And for anonymous memory,
"multi-size THP" is now upstream.
Patch Layout
============
In this version, I've split the patches to better show each optimization:
- 1-2: mm prep: misc code and docs cleanups
- 3-6: mm,arm64,x86 prep: Add pte_advance_pfn() and make pte_next_pfn() a
generic wrapper around it
- 7-11: arm64 prep: Refactor ptep helpers into new layer
- 12: functional contpte implementation
- 23-18: various optimizations on top of the contpte implementation
Testing
=======
I've tested this series on both Ampere Altra (bare metal) and Apple M2 (VM):
- mm selftests (inc new tests written for multi-size THP); no regressions
- Speedometer Java script benchmark in Chromium web browser; no issues
- Kernel compilation; no issues
- Various tests under high memory pressure with swap enabled; no issues
Performance
===========
High Level Use Cases
~~~~~~~~~~~~~~~~~~~~
First some high level use cases (kernel compilation and speedometer JavaScript
benchmarks). These are running on Ampere Altra (I've seen similar improvements
on Android/Pixel 6).
baseline: mm-unstable (mTHP switched off)
mTHP: + enable 16K, 32K, 64K mTHP sizes "always"
mTHP + contpte: + this series
mTHP + contpte + exefolio: + patch at [6], which series supports
Kernel Compilation with -j8 (negative is faster):
| kernel | real-time | kern-time | user-time |
|---------------------------|-----------|-----------|-----------|
| baseline | 0.0% | 0.0% | 0.0% |
| mTHP | -5.0% | -39.1% | -0.7% |
| mTHP + contpte | -6.0% | -41.4% | -1.5% |
| mTHP + contpte + exefolio | -7.8% | -43.1% | -3.4% |
Kernel Compilation with -j80 (negative is faster):
| kernel | real-time | kern-time | user-time |
|---------------------------|-----------|-----------|-----------|
| baseline | 0.0% | 0.0% | 0.0% |
| mTHP | -5.0% | -36.6% | -0.6% |
| mTHP + contpte | -6.1% | -38.2% | -1.6% |
| mTHP + contpte + exefolio | -7.4% | -39.2% | -3.2% |
Speedometer (positive is faster):
| kernel | runs_per_min |
|:--------------------------|--------------|
| baseline | 0.0% |
| mTHP | 1.5% |
| mTHP + contpte | 3.2% |
| mTHP + contpte + exefolio | 4.5% |
Micro Benchmarks
~~~~~~~~~~~~~~~~
The following microbenchmarks are intended to demonstrate the performance of
fork() and munmap() do not regress. I'm showing results for order-0 (4K)
mappings, and for order-9 (2M) PTE-mapped THP. Thanks to David for sharing his
benchmarks.
baseline: mm-unstable + batch zap [7] series
contpte-basic: + patches 0-19; functional contpte implementation
contpte-batch: + patches 20-23; implement new batched APIs
contpte-inline: + patch 24; __always_inline to help compiler
contpte-fold: + patch 25; fold contpte mapping when sensible
Primary platform is Ampere Altra bare metal. I'm also showing results for M2 VM
(on top of MacOS) for reference, although experience suggests this might not be
the most reliable for performance numbers of this sort:
| FORK | order-0 | order-9 |
| Ampere Altra |------------------------|------------------------|
| (pte-map) | mean | stdev | mean | stdev |
|----------------|------------|-----------|------------|-----------|
| baseline | 0.0% | 2.7% | 0.0% | 0.2% |
| contpte-basic | 6.3% | 1.4% | 1948.7% | 0.2% |
| contpte-batch | 7.6% | 2.0% | -1.9% | 0.4% |
| contpte-inline | 3.6% | 1.5% | -1.0% | 0.2% |
| contpte-fold | 4.6% | 2.1% | -1.8% | 0.2% |
| MUNMAP | order-0 | order-9 |
| Ampere Altra |------------------------|------------------------|
| (pte-map) | mean | stdev | mean | stdev |
|----------------|------------|-----------|------------|-----------|
| baseline | 0.0% | 0.5% | 0.0% | 0.3% |
| contpte-basic | 1.8% | 0.3% | 1104.8% | 0.1% |
| contpte-batch | -0.3% | 0.4% | 2.7% | 0.1% |
| contpte-inline | -0.1% | 0.6% | 0.9% | 0.1% |
| contpte-fold | 0.1% | 0.6% | 0.8% | 0.1% |
| FORK | order-0 | order-9 |
| Apple M2 VM |------------------------|------------------------|
| (pte-map) | mean | stdev | mean | stdev |
|----------------|------------|-----------|------------|-----------|
| baseline | 0.0% | 1.4% | 0.0% | 0.8% |
| contpte-basic | 6.8% | 1.2% | 469.4% | 1.4% |
| contpte-batch | -7.7% | 2.0% | -8.9% | 0.7% |
| contpte-inline | -6.0% | 2.1% | -6.0% | 2.0% |
| contpte-fold | 5.9% | 1.4% | -6.4% | 1.4% |
| MUNMAP | order-0 | order-9 |
| Apple M2 VM |------------------------|------------------------|
| (pte-map) | mean | stdev | mean | stdev |
|----------------|------------|-----------|------------|-----------|
| baseline | 0.0% | 0.6% | 0.0% | 0.4% |
| contpte-basic | 1.6% | 0.6% | 233.6% | 0.7% |
| contpte-batch | 1.9% | 0.3% | -3.9% | 0.4% |
| contpte-inline | 2.2% | 0.8% | -1.6% | 0.9% |
| contpte-fold | 1.5% | 0.7% | -1.7% | 0.7% |
Misc
~~~~
John Hubbard at Nvidia has indicated dramatic 10x performance improvements
for some workloads at [8], when using 64K base page kernel.
[1] https://lore.kernel.org/linux-arm-kernel/20230622144210.2623299-1-ryan.roberts@arm.com/
[2] https://lore.kernel.org/linux-arm-kernel/20231115163018.1303287-1-ryan.roberts@arm.com/
[3] https://lore.kernel.org/linux-arm-kernel/20231204105440.61448-1-ryan.roberts@arm.com/
[4] https://lore.kernel.org/lkml/20231218105100.172635-1-ryan.roberts@arm.com/
[5] https://lore.kernel.org/linux-mm/633af0a7-0823-424f-b6ef-374d99483f05@arm.com/
[6] https://lore.kernel.org/lkml/08c16f7d-f3b3-4f22-9acc-da943f647dc3@arm.com/
[7] https://lore.kernel.org/linux-mm/20240214204435.167852-1-david@redhat.com/
[8] https://lore.kernel.org/linux-mm/c507308d-bdd4-5f9e-d4ff-e96e4520be85@nvidia.com/
[9] https://gitlab.arm.com/linux-arm/linux-rr/-/tree/features/granule_perf/contpte-lkml_v6
This patch (of 18):
set_ptes() spec implies that it can only be used to set a present pte
because it interprets the PFN field to increment it. However,
set_pte_at() has been implemented on top of set_ptes() since set_ptes()
was introduced, and set_pte_at() allows setting a pte to a not-present
state. So clarify the spec to state that when nr==1, new state of pte may
be present or not present. When nr>1, new state of all ptes must be
present.
While we are at it, tighten the spec to set requirements around the
initial state of ptes; when nr==1 it may be either present or not-present.
But when nr>1 all ptes must initially be not-present. All set_ptes()
callsites already conform to this requirement. Stating it explicitly is
useful because it allows for a simplification to the upcoming arm64
contpte implementation.
Link: https://lkml.kernel.org/r/20240215103205.2607016-1-ryan.roberts@arm.com
Link: https://lkml.kernel.org/r/20240215103205.2607016-2-ryan.roberts@arm.com
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Barry Song <21cnbao@gmail.com>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Morse <james.morse@arm.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-15 10:31:48 +00:00
|
|
|
* When nr==1, initial state of pte may be present or not present, and new state
|
|
|
|
* may be present or not present. When nr>1, initial state of all ptes must be
|
|
|
|
* not present, and new state must be present.
|
|
|
|
*
|
2023-08-02 15:13:34 +00:00
|
|
|
* May be overridden by the architecture, or the architecture can define
|
|
|
|
* set_pte() and PFN_PTE_SHIFT.
|
|
|
|
*
|
|
|
|
* Context: The caller holds the page table lock. The pages all belong
|
|
|
|
* to the same folio. The PTEs are all in the same PMD.
|
|
|
|
*/
|
|
|
|
static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
|
|
|
|
pte_t *ptep, pte_t pte, unsigned int nr)
|
|
|
|
{
|
|
|
|
page_table_check_ptes_set(mm, ptep, pte, nr);
|
|
|
|
|
|
|
|
arch_enter_lazy_mmu_mode();
|
|
|
|
for (;;) {
|
|
|
|
set_pte(ptep, pte);
|
|
|
|
if (--nr == 0)
|
|
|
|
break;
|
|
|
|
ptep++;
|
2023-09-20 04:09:58 +00:00
|
|
|
pte = pte_next_pfn(pte);
|
2023-08-02 15:13:34 +00:00
|
|
|
}
|
|
|
|
arch_leave_lazy_mmu_mode();
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
#define set_pte_at(mm, addr, ptep, pte) set_ptes(mm, addr, ptep, pte, 1)
|
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
#ifndef __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
|
2011-01-13 23:46:40 +00:00
|
|
|
extern int ptep_set_access_flags(struct vm_area_struct *vma,
|
|
|
|
unsigned long address, pte_t *ptep,
|
|
|
|
pte_t entry, int dirty);
|
|
|
|
#endif
|
|
|
|
|
|
|
|
#ifndef __HAVE_ARCH_PMDP_SET_ACCESS_FLAGS
|
2015-07-09 11:52:44 +00:00
|
|
|
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
|
2011-01-13 23:46:40 +00:00
|
|
|
extern int pmdp_set_access_flags(struct vm_area_struct *vma,
|
|
|
|
unsigned long address, pmd_t *pmdp,
|
|
|
|
pmd_t entry, int dirty);
|
2017-02-24 22:57:02 +00:00
|
|
|
extern int pudp_set_access_flags(struct vm_area_struct *vma,
|
|
|
|
unsigned long address, pud_t *pudp,
|
|
|
|
pud_t entry, int dirty);
|
2015-07-09 11:52:44 +00:00
|
|
|
#else
|
|
|
|
static inline int pmdp_set_access_flags(struct vm_area_struct *vma,
|
|
|
|
unsigned long address, pmd_t *pmdp,
|
|
|
|
pmd_t entry, int dirty)
|
|
|
|
{
|
|
|
|
BUILD_BUG();
|
|
|
|
return 0;
|
|
|
|
}
|
2017-02-24 22:57:02 +00:00
|
|
|
static inline int pudp_set_access_flags(struct vm_area_struct *vma,
|
|
|
|
unsigned long address, pud_t *pudp,
|
|
|
|
pud_t entry, int dirty)
|
|
|
|
{
|
|
|
|
BUILD_BUG();
|
|
|
|
return 0;
|
|
|
|
}
|
2015-07-09 11:52:44 +00:00
|
|
|
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
|
2005-04-16 22:20:36 +00:00
|
|
|
#endif
|
|
|
|
|
2023-06-12 15:15:44 +00:00
|
|
|
#ifndef ptep_get
|
|
|
|
static inline pte_t ptep_get(pte_t *ptep)
|
|
|
|
{
|
|
|
|
return READ_ONCE(*ptep);
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
|
|
|
#ifndef pmdp_get
|
|
|
|
static inline pmd_t pmdp_get(pmd_t *pmdp)
|
|
|
|
{
|
|
|
|
return READ_ONCE(*pmdp);
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
2023-12-13 20:29:59 +00:00
|
|
|
#ifndef pudp_get
|
|
|
|
static inline pud_t pudp_get(pud_t *pudp)
|
|
|
|
{
|
|
|
|
return READ_ONCE(*pudp);
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
|
|
|
#ifndef p4dp_get
|
|
|
|
static inline p4d_t p4dp_get(p4d_t *p4dp)
|
|
|
|
{
|
|
|
|
return READ_ONCE(*p4dp);
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
|
|
|
#ifndef pgdp_get
|
|
|
|
static inline pgd_t pgdp_get(pgd_t *pgdp)
|
|
|
|
{
|
|
|
|
return READ_ONCE(*pgdp);
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
#ifndef __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
|
2011-01-13 23:46:40 +00:00
|
|
|
static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
|
|
|
|
unsigned long address,
|
|
|
|
pte_t *ptep)
|
|
|
|
{
|
mm: ptep_get() conversion
Convert all instances of direct pte_t* dereferencing to instead use
ptep_get() helper. This means that by default, the accesses change from a
C dereference to a READ_ONCE(). This is technically the correct thing to
do since where pgtables are modified by HW (for access/dirty) they are
volatile and therefore we should always ensure READ_ONCE() semantics.
But more importantly, by always using the helper, it can be overridden by
the architecture to fully encapsulate the contents of the pte. Arch code
is deliberately not converted, as the arch code knows best. It is
intended that arch code (arm64) will override the default with its own
implementation that can (e.g.) hide certain bits from the core code, or
determine young/dirty status by mixing in state from another source.
Conversion was done using Coccinelle:
----
// $ make coccicheck \
// COCCI=ptepget.cocci \
// SPFLAGS="--include-headers" \
// MODE=patch
virtual patch
@ depends on patch @
pte_t *v;
@@
- *v
+ ptep_get(v)
----
Then reviewed and hand-edited to avoid multiple unnecessary calls to
ptep_get(), instead opting to store the result of a single call in a
variable, where it is correct to do so. This aims to negate any cost of
READ_ONCE() and will benefit arch-overrides that may be more complex.
Included is a fix for an issue in an earlier version of this patch that
was pointed out by kernel test robot. The issue arose because config
MMU=n elides definition of the ptep helper functions, including
ptep_get(). HUGETLB_PAGE=n configs still define a simple
huge_ptep_clear_flush() for linking purposes, which dereferences the ptep.
So when both configs are disabled, this caused a build error because
ptep_get() is not defined. Fix by continuing to do a direct dereference
when MMU=n. This is safe because for this config the arch code cannot be
trying to virtualize the ptes because none of the ptep helpers are
defined.
Link: https://lkml.kernel.org/r/20230612151545.3317766-4-ryan.roberts@arm.com
Reported-by: kernel test robot <lkp@intel.com>
Link: https://lore.kernel.org/oe-kbuild-all/202305120142.yXsNEo6H-lkp@intel.com/
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Dave Airlie <airlied@gmail.com>
Cc: Dimitri Sivanich <dimitri.sivanich@hpe.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: SeongJae Park <sj@kernel.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-12 15:15:45 +00:00
|
|
|
pte_t pte = ptep_get(ptep);
|
2011-01-13 23:46:40 +00:00
|
|
|
int r = 1;
|
|
|
|
if (!pte_young(pte))
|
|
|
|
r = 0;
|
|
|
|
else
|
|
|
|
set_pte_at(vma->vm_mm, address, ptep, pte_mkold(pte));
|
|
|
|
return r;
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
|
|
|
#ifndef __HAVE_ARCH_PMDP_TEST_AND_CLEAR_YOUNG
|
2022-09-18 07:59:59 +00:00
|
|
|
#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG)
|
2011-01-13 23:46:40 +00:00
|
|
|
static inline int pmdp_test_and_clear_young(struct vm_area_struct *vma,
|
|
|
|
unsigned long address,
|
|
|
|
pmd_t *pmdp)
|
|
|
|
{
|
|
|
|
pmd_t pmd = *pmdp;
|
|
|
|
int r = 1;
|
|
|
|
if (!pmd_young(pmd))
|
|
|
|
r = 0;
|
|
|
|
else
|
|
|
|
set_pmd_at(vma->vm_mm, address, pmdp, pmd_mkold(pmd));
|
|
|
|
return r;
|
|
|
|
}
|
2015-07-09 11:52:44 +00:00
|
|
|
#else
|
2011-01-13 23:46:40 +00:00
|
|
|
static inline int pmdp_test_and_clear_young(struct vm_area_struct *vma,
|
|
|
|
unsigned long address,
|
|
|
|
pmd_t *pmdp)
|
|
|
|
{
|
2015-07-09 11:52:44 +00:00
|
|
|
BUILD_BUG();
|
2011-01-13 23:46:40 +00:00
|
|
|
return 0;
|
|
|
|
}
|
2022-09-18 07:59:59 +00:00
|
|
|
#endif /* CONFIG_TRANSPARENT_HUGEPAGE || CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG */
|
2005-04-16 22:20:36 +00:00
|
|
|
#endif
|
|
|
|
|
|
|
|
#ifndef __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
|
2011-01-13 23:46:40 +00:00
|
|
|
int ptep_clear_flush_young(struct vm_area_struct *vma,
|
|
|
|
unsigned long address, pte_t *ptep);
|
|
|
|
#endif
|
|
|
|
|
|
|
|
#ifndef __HAVE_ARCH_PMDP_CLEAR_YOUNG_FLUSH
|
2015-07-09 11:52:44 +00:00
|
|
|
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
|
|
|
|
extern int pmdp_clear_flush_young(struct vm_area_struct *vma,
|
|
|
|
unsigned long address, pmd_t *pmdp);
|
|
|
|
#else
|
|
|
|
/*
|
|
|
|
* Despite relevant to THP only, this API is called from generic rmap code
|
|
|
|
* under PageTransHuge(), hence needs a dummy implementation for !THP
|
|
|
|
*/
|
|
|
|
static inline int pmdp_clear_flush_young(struct vm_area_struct *vma,
|
|
|
|
unsigned long address, pmd_t *pmdp)
|
|
|
|
{
|
|
|
|
BUILD_BUG();
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
|
2005-04-16 22:20:36 +00:00
|
|
|
#endif
|
|
|
|
|
2022-11-23 06:45:10 +00:00
|
|
|
#ifndef arch_has_hw_nonleaf_pmd_young
|
|
|
|
/*
|
|
|
|
* Return whether the accessed bit in non-leaf PMD entries is supported on the
|
|
|
|
* local CPU.
|
|
|
|
*/
|
|
|
|
static inline bool arch_has_hw_nonleaf_pmd_young(void)
|
|
|
|
{
|
|
|
|
return IS_ENABLED(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG);
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
mm: x86, arm64: add arch_has_hw_pte_young()
Patch series "Multi-Gen LRU Framework", v14.
What's new
==========
1. OpenWrt, in addition to Android, Arch Linux Zen, Armbian, ChromeOS,
Liquorix, post-factum and XanMod, is now shipping MGLRU on 5.15.
2. Fixed long-tailed direct reclaim latency seen on high-memory (TBs)
machines. The old direct reclaim backoff, which tries to enforce a
minimum fairness among all eligible memcgs, over-swapped by about
(total_mem>>DEF_PRIORITY)-nr_to_reclaim. The new backoff, which
pulls the plug on swapping once the target is met, trades some
fairness for curtailed latency:
https://lore.kernel.org/r/20220918080010.2920238-10-yuzhao@google.com/
3. Fixed minior build warnings and conflicts. More comments and nits.
TLDR
====
The current page reclaim is too expensive in terms of CPU usage and it
often makes poor choices about what to evict. This patchset offers an
alternative solution that is performant, versatile and
straightforward.
Patchset overview
=================
The design and implementation overview is in patch 14:
https://lore.kernel.org/r/20220918080010.2920238-15-yuzhao@google.com/
01. mm: x86, arm64: add arch_has_hw_pte_young()
02. mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG
Take advantage of hardware features when trying to clear the accessed
bit in many PTEs.
03. mm/vmscan.c: refactor shrink_node()
04. Revert "include/linux/mm_inline.h: fold __update_lru_size() into
its sole caller"
Minor refactors to improve readability for the following patches.
05. mm: multi-gen LRU: groundwork
Adds the basic data structure and the functions that insert pages to
and remove pages from the multi-gen LRU (MGLRU) lists.
06. mm: multi-gen LRU: minimal implementation
A minimal implementation without optimizations.
07. mm: multi-gen LRU: exploit locality in rmap
Exploits spatial locality to improve efficiency when using the rmap.
08. mm: multi-gen LRU: support page table walks
Further exploits spatial locality by optionally scanning page tables.
09. mm: multi-gen LRU: optimize multiple memcgs
Optimizes the overall performance for multiple memcgs running mixed
types of workloads.
10. mm: multi-gen LRU: kill switch
Adds a kill switch to enable or disable MGLRU at runtime.
11. mm: multi-gen LRU: thrashing prevention
12. mm: multi-gen LRU: debugfs interface
Provide userspace with features like thrashing prevention, working set
estimation and proactive reclaim.
13. mm: multi-gen LRU: admin guide
14. mm: multi-gen LRU: design doc
Add an admin guide and a design doc.
Benchmark results
=================
Independent lab results
-----------------------
Based on the popularity of searches [01] and the memory usage in
Google's public cloud, the most popular open-source memory-hungry
applications, in alphabetical order, are:
Apache Cassandra Memcached
Apache Hadoop MongoDB
Apache Spark PostgreSQL
MariaDB (MySQL) Redis
An independent lab evaluated MGLRU with the most widely used benchmark
suites for the above applications. They posted 960 data points along
with kernel metrics and perf profiles collected over more than 500
hours of total benchmark time. Their final reports show that, with 95%
confidence intervals (CIs), the above applications all performed
significantly better for at least part of their benchmark matrices.
On 5.14:
1. Apache Spark [02] took 95% CIs [9.28, 11.19]% and [12.20, 14.93]%
less wall time to sort three billion random integers, respectively,
under the medium- and the high-concurrency conditions, when
overcommitting memory. There were no statistically significant
changes in wall time for the rest of the benchmark matrix.
2. MariaDB [03] achieved 95% CIs [5.24, 10.71]% and [20.22, 25.97]%
more transactions per minute (TPM), respectively, under the medium-
and the high-concurrency conditions, when overcommitting memory.
There were no statistically significant changes in TPM for the rest
of the benchmark matrix.
3. Memcached [04] achieved 95% CIs [23.54, 32.25]%, [20.76, 41.61]%
and [21.59, 30.02]% more operations per second (OPS), respectively,
for sequential access, random access and Gaussian (distribution)
access, when THP=always; 95% CIs [13.85, 15.97]% and
[23.94, 29.92]% more OPS, respectively, for random access and
Gaussian access, when THP=never. There were no statistically
significant changes in OPS for the rest of the benchmark matrix.
4. MongoDB [05] achieved 95% CIs [2.23, 3.44]%, [6.97, 9.73]% and
[2.16, 3.55]% more operations per second (OPS), respectively, for
exponential (distribution) access, random access and Zipfian
(distribution) access, when underutilizing memory; 95% CIs
[8.83, 10.03]%, [21.12, 23.14]% and [5.53, 6.46]% more OPS,
respectively, for exponential access, random access and Zipfian
access, when overcommitting memory.
On 5.15:
5. Apache Cassandra [06] achieved 95% CIs [1.06, 4.10]%, [1.94, 5.43]%
and [4.11, 7.50]% more operations per second (OPS), respectively,
for exponential (distribution) access, random access and Zipfian
(distribution) access, when swap was off; 95% CIs [0.50, 2.60]%,
[6.51, 8.77]% and [3.29, 6.75]% more OPS, respectively, for
exponential access, random access and Zipfian access, when swap was
on.
6. Apache Hadoop [07] took 95% CIs [5.31, 9.69]% and [2.02, 7.86]%
less average wall time to finish twelve parallel TeraSort jobs,
respectively, under the medium- and the high-concurrency
conditions, when swap was on. There were no statistically
significant changes in average wall time for the rest of the
benchmark matrix.
7. PostgreSQL [08] achieved 95% CI [1.75, 6.42]% more transactions per
minute (TPM) under the high-concurrency condition, when swap was
off; 95% CIs [12.82, 18.69]% and [22.70, 46.86]% more TPM,
respectively, under the medium- and the high-concurrency
conditions, when swap was on. There were no statistically
significant changes in TPM for the rest of the benchmark matrix.
8. Redis [09] achieved 95% CIs [0.58, 5.94]%, [6.55, 14.58]% and
[11.47, 19.36]% more total operations per second (OPS),
respectively, for sequential access, random access and Gaussian
(distribution) access, when THP=always; 95% CIs [1.27, 3.54]%,
[10.11, 14.81]% and [8.75, 13.64]% more total OPS, respectively,
for sequential access, random access and Gaussian access, when
THP=never.
Our lab results
---------------
To supplement the above results, we ran the following benchmark suites
on 5.16-rc7 and found no regressions [10].
fs_fio_bench_hdd_mq pft
fs_lmbench pgsql-hammerdb
fs_parallelio redis
fs_postmark stream
hackbench sysbenchthread
kernbench tpcc_spark
memcached unixbench
multichase vm-scalability
mutilate will-it-scale
nginx
[01] https://trends.google.com
[02] https://lore.kernel.org/r/20211102002002.92051-1-bot@edi.works/
[03] https://lore.kernel.org/r/20211009054315.47073-1-bot@edi.works/
[04] https://lore.kernel.org/r/20211021194103.65648-1-bot@edi.works/
[05] https://lore.kernel.org/r/20211109021346.50266-1-bot@edi.works/
[06] https://lore.kernel.org/r/20211202062806.80365-1-bot@edi.works/
[07] https://lore.kernel.org/r/20211209072416.33606-1-bot@edi.works/
[08] https://lore.kernel.org/r/20211218071041.24077-1-bot@edi.works/
[09] https://lore.kernel.org/r/20211122053248.57311-1-bot@edi.works/
[10] https://lore.kernel.org/r/20220104202247.2903702-1-yuzhao@google.com/
Read-world applications
=======================
Third-party testimonials
------------------------
Konstantin reported [11]:
I have Archlinux with 8G RAM + zswap + swap. While developing, I
have lots of apps opened such as multiple LSP-servers for different
langs, chats, two browsers, etc... Usually, my system gets quickly
to a point of SWAP-storms, where I have to kill LSP-servers,
restart browsers to free memory, etc, otherwise the system lags
heavily and is barely usable.
1.5 day ago I migrated from 5.11.15 kernel to 5.12 + the LRU
patchset, and I started up by opening lots of apps to create memory
pressure, and worked for a day like this. Till now I had not a
single SWAP-storm, and mind you I got 3.4G in SWAP. I was never
getting to the point of 3G in SWAP before without a single
SWAP-storm.
Vaibhav from IBM reported [12]:
In a synthetic MongoDB Benchmark, seeing an average of ~19%
throughput improvement on POWER10(Radix MMU + 64K Page Size) with
MGLRU patches on top of 5.16 kernel for MongoDB + YCSB across
three different request distributions, namely, Exponential, Uniform
and Zipfan.
Shuang from U of Rochester reported [13]:
With the MGLRU, fio achieved 95% CIs [38.95, 40.26]%, [4.12, 6.64]%
and [9.26, 10.36]% higher throughput, respectively, for random
access, Zipfian (distribution) access and Gaussian (distribution)
access, when the average number of jobs per CPU is 1; 95% CIs
[42.32, 49.15]%, [9.44, 9.89]% and [20.99, 22.86]% higher
throughput, respectively, for random access, Zipfian access and
Gaussian access, when the average number of jobs per CPU is 2.
Daniel from Michigan Tech reported [14]:
With Memcached allocating ~100GB of byte-addressable Optante,
performance improvement in terms of throughput (measured as queries
per second) was about 10% for a series of workloads.
Large-scale deployments
-----------------------
We've rolled out MGLRU to tens of millions of ChromeOS users and
about a million Android users. Google's fleetwide profiling [15] shows
an overall 40% decrease in kswapd CPU usage, in addition to
improvements in other UX metrics, e.g., an 85% decrease in the number
of low-memory kills at the 75th percentile and an 18% decrease in
app launch time at the 50th percentile.
The downstream kernels that have been using MGLRU include:
1. Android [16]
2. Arch Linux Zen [17]
3. Armbian [18]
4. ChromeOS [19]
5. Liquorix [20]
6. OpenWrt [21]
7. post-factum [22]
8. XanMod [23]
[11] https://lore.kernel.org/r/140226722f2032c86301fbd326d91baefe3d7d23.camel@yandex.ru/
[12] https://lore.kernel.org/r/87czj3mux0.fsf@vajain21.in.ibm.com/
[13] https://lore.kernel.org/r/20220105024423.26409-1-szhai2@cs.rochester.edu/
[14] https://lore.kernel.org/r/CA+4-3vksGvKd18FgRinxhqHetBS1hQekJE2gwco8Ja-bJWKtFw@mail.gmail.com/
[15] https://dl.acm.org/doi/10.1145/2749469.2750392
[16] https://android.com
[17] https://archlinux.org
[18] https://armbian.com
[19] https://chromium.org
[20] https://liquorix.net
[21] https://openwrt.org
[22] https://codeberg.org/pf-kernel
[23] https://xanmod.org
Summary
=======
The facts are:
1. The independent lab results and the real-world applications
indicate substantial improvements; there are no known regressions.
2. Thrashing prevention, working set estimation and proactive reclaim
work out of the box; there are no equivalent solutions.
3. There is a lot of new code; no smaller changes have been
demonstrated similar effects.
Our options, accordingly, are:
1. Given the amount of evidence, the reported improvements will likely
materialize for a wide range of workloads.
2. Gauging the interest from the past discussions, the new features
will likely be put to use for both personal computers and data
centers.
3. Based on Google's track record, the new code will likely be well
maintained in the long term. It'd be more difficult if not
impossible to achieve similar effects with other approaches.
This patch (of 14):
Some architectures automatically set the accessed bit in PTEs, e.g., x86
and arm64 v8.2. On architectures that do not have this capability,
clearing the accessed bit in a PTE usually triggers a page fault following
the TLB miss of this PTE (to emulate the accessed bit).
Being aware of this capability can help make better decisions, e.g.,
whether to spread the work out over a period of time to reduce bursty page
faults when trying to clear the accessed bit in many PTEs.
Note that theoretically this capability can be unreliable, e.g.,
hotplugged CPUs might be different from builtin ones. Therefore it should
not be used in architecture-independent code that involves correctness,
e.g., to determine whether TLB flushes are required (in combination with
the accessed bit).
Link: https://lkml.kernel.org/r/20220918080010.2920238-1-yuzhao@google.com
Link: https://lkml.kernel.org/r/20220918080010.2920238-2-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Reviewed-by: Barry Song <baohua@kernel.org>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Acked-by: Will Deacon <will@kernel.org>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-arm-kernel@lists.infradead.org
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michael Larabel <Michael@MichaelLarabel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-18 07:59:58 +00:00
|
|
|
#ifndef arch_has_hw_pte_young
|
|
|
|
/*
|
|
|
|
* Return whether the accessed bit is supported on the local CPU.
|
|
|
|
*
|
|
|
|
* This stub assumes accessing through an old PTE triggers a page fault.
|
|
|
|
* Architectures that automatically set the access bit should overwrite it.
|
|
|
|
*/
|
|
|
|
static inline bool arch_has_hw_pte_young(void)
|
|
|
|
{
|
2023-12-27 14:12:01 +00:00
|
|
|
return IS_ENABLED(CONFIG_ARCH_HAS_HW_PTE_YOUNG);
|
mm: x86, arm64: add arch_has_hw_pte_young()
Patch series "Multi-Gen LRU Framework", v14.
What's new
==========
1. OpenWrt, in addition to Android, Arch Linux Zen, Armbian, ChromeOS,
Liquorix, post-factum and XanMod, is now shipping MGLRU on 5.15.
2. Fixed long-tailed direct reclaim latency seen on high-memory (TBs)
machines. The old direct reclaim backoff, which tries to enforce a
minimum fairness among all eligible memcgs, over-swapped by about
(total_mem>>DEF_PRIORITY)-nr_to_reclaim. The new backoff, which
pulls the plug on swapping once the target is met, trades some
fairness for curtailed latency:
https://lore.kernel.org/r/20220918080010.2920238-10-yuzhao@google.com/
3. Fixed minior build warnings and conflicts. More comments and nits.
TLDR
====
The current page reclaim is too expensive in terms of CPU usage and it
often makes poor choices about what to evict. This patchset offers an
alternative solution that is performant, versatile and
straightforward.
Patchset overview
=================
The design and implementation overview is in patch 14:
https://lore.kernel.org/r/20220918080010.2920238-15-yuzhao@google.com/
01. mm: x86, arm64: add arch_has_hw_pte_young()
02. mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG
Take advantage of hardware features when trying to clear the accessed
bit in many PTEs.
03. mm/vmscan.c: refactor shrink_node()
04. Revert "include/linux/mm_inline.h: fold __update_lru_size() into
its sole caller"
Minor refactors to improve readability for the following patches.
05. mm: multi-gen LRU: groundwork
Adds the basic data structure and the functions that insert pages to
and remove pages from the multi-gen LRU (MGLRU) lists.
06. mm: multi-gen LRU: minimal implementation
A minimal implementation without optimizations.
07. mm: multi-gen LRU: exploit locality in rmap
Exploits spatial locality to improve efficiency when using the rmap.
08. mm: multi-gen LRU: support page table walks
Further exploits spatial locality by optionally scanning page tables.
09. mm: multi-gen LRU: optimize multiple memcgs
Optimizes the overall performance for multiple memcgs running mixed
types of workloads.
10. mm: multi-gen LRU: kill switch
Adds a kill switch to enable or disable MGLRU at runtime.
11. mm: multi-gen LRU: thrashing prevention
12. mm: multi-gen LRU: debugfs interface
Provide userspace with features like thrashing prevention, working set
estimation and proactive reclaim.
13. mm: multi-gen LRU: admin guide
14. mm: multi-gen LRU: design doc
Add an admin guide and a design doc.
Benchmark results
=================
Independent lab results
-----------------------
Based on the popularity of searches [01] and the memory usage in
Google's public cloud, the most popular open-source memory-hungry
applications, in alphabetical order, are:
Apache Cassandra Memcached
Apache Hadoop MongoDB
Apache Spark PostgreSQL
MariaDB (MySQL) Redis
An independent lab evaluated MGLRU with the most widely used benchmark
suites for the above applications. They posted 960 data points along
with kernel metrics and perf profiles collected over more than 500
hours of total benchmark time. Their final reports show that, with 95%
confidence intervals (CIs), the above applications all performed
significantly better for at least part of their benchmark matrices.
On 5.14:
1. Apache Spark [02] took 95% CIs [9.28, 11.19]% and [12.20, 14.93]%
less wall time to sort three billion random integers, respectively,
under the medium- and the high-concurrency conditions, when
overcommitting memory. There were no statistically significant
changes in wall time for the rest of the benchmark matrix.
2. MariaDB [03] achieved 95% CIs [5.24, 10.71]% and [20.22, 25.97]%
more transactions per minute (TPM), respectively, under the medium-
and the high-concurrency conditions, when overcommitting memory.
There were no statistically significant changes in TPM for the rest
of the benchmark matrix.
3. Memcached [04] achieved 95% CIs [23.54, 32.25]%, [20.76, 41.61]%
and [21.59, 30.02]% more operations per second (OPS), respectively,
for sequential access, random access and Gaussian (distribution)
access, when THP=always; 95% CIs [13.85, 15.97]% and
[23.94, 29.92]% more OPS, respectively, for random access and
Gaussian access, when THP=never. There were no statistically
significant changes in OPS for the rest of the benchmark matrix.
4. MongoDB [05] achieved 95% CIs [2.23, 3.44]%, [6.97, 9.73]% and
[2.16, 3.55]% more operations per second (OPS), respectively, for
exponential (distribution) access, random access and Zipfian
(distribution) access, when underutilizing memory; 95% CIs
[8.83, 10.03]%, [21.12, 23.14]% and [5.53, 6.46]% more OPS,
respectively, for exponential access, random access and Zipfian
access, when overcommitting memory.
On 5.15:
5. Apache Cassandra [06] achieved 95% CIs [1.06, 4.10]%, [1.94, 5.43]%
and [4.11, 7.50]% more operations per second (OPS), respectively,
for exponential (distribution) access, random access and Zipfian
(distribution) access, when swap was off; 95% CIs [0.50, 2.60]%,
[6.51, 8.77]% and [3.29, 6.75]% more OPS, respectively, for
exponential access, random access and Zipfian access, when swap was
on.
6. Apache Hadoop [07] took 95% CIs [5.31, 9.69]% and [2.02, 7.86]%
less average wall time to finish twelve parallel TeraSort jobs,
respectively, under the medium- and the high-concurrency
conditions, when swap was on. There were no statistically
significant changes in average wall time for the rest of the
benchmark matrix.
7. PostgreSQL [08] achieved 95% CI [1.75, 6.42]% more transactions per
minute (TPM) under the high-concurrency condition, when swap was
off; 95% CIs [12.82, 18.69]% and [22.70, 46.86]% more TPM,
respectively, under the medium- and the high-concurrency
conditions, when swap was on. There were no statistically
significant changes in TPM for the rest of the benchmark matrix.
8. Redis [09] achieved 95% CIs [0.58, 5.94]%, [6.55, 14.58]% and
[11.47, 19.36]% more total operations per second (OPS),
respectively, for sequential access, random access and Gaussian
(distribution) access, when THP=always; 95% CIs [1.27, 3.54]%,
[10.11, 14.81]% and [8.75, 13.64]% more total OPS, respectively,
for sequential access, random access and Gaussian access, when
THP=never.
Our lab results
---------------
To supplement the above results, we ran the following benchmark suites
on 5.16-rc7 and found no regressions [10].
fs_fio_bench_hdd_mq pft
fs_lmbench pgsql-hammerdb
fs_parallelio redis
fs_postmark stream
hackbench sysbenchthread
kernbench tpcc_spark
memcached unixbench
multichase vm-scalability
mutilate will-it-scale
nginx
[01] https://trends.google.com
[02] https://lore.kernel.org/r/20211102002002.92051-1-bot@edi.works/
[03] https://lore.kernel.org/r/20211009054315.47073-1-bot@edi.works/
[04] https://lore.kernel.org/r/20211021194103.65648-1-bot@edi.works/
[05] https://lore.kernel.org/r/20211109021346.50266-1-bot@edi.works/
[06] https://lore.kernel.org/r/20211202062806.80365-1-bot@edi.works/
[07] https://lore.kernel.org/r/20211209072416.33606-1-bot@edi.works/
[08] https://lore.kernel.org/r/20211218071041.24077-1-bot@edi.works/
[09] https://lore.kernel.org/r/20211122053248.57311-1-bot@edi.works/
[10] https://lore.kernel.org/r/20220104202247.2903702-1-yuzhao@google.com/
Read-world applications
=======================
Third-party testimonials
------------------------
Konstantin reported [11]:
I have Archlinux with 8G RAM + zswap + swap. While developing, I
have lots of apps opened such as multiple LSP-servers for different
langs, chats, two browsers, etc... Usually, my system gets quickly
to a point of SWAP-storms, where I have to kill LSP-servers,
restart browsers to free memory, etc, otherwise the system lags
heavily and is barely usable.
1.5 day ago I migrated from 5.11.15 kernel to 5.12 + the LRU
patchset, and I started up by opening lots of apps to create memory
pressure, and worked for a day like this. Till now I had not a
single SWAP-storm, and mind you I got 3.4G in SWAP. I was never
getting to the point of 3G in SWAP before without a single
SWAP-storm.
Vaibhav from IBM reported [12]:
In a synthetic MongoDB Benchmark, seeing an average of ~19%
throughput improvement on POWER10(Radix MMU + 64K Page Size) with
MGLRU patches on top of 5.16 kernel for MongoDB + YCSB across
three different request distributions, namely, Exponential, Uniform
and Zipfan.
Shuang from U of Rochester reported [13]:
With the MGLRU, fio achieved 95% CIs [38.95, 40.26]%, [4.12, 6.64]%
and [9.26, 10.36]% higher throughput, respectively, for random
access, Zipfian (distribution) access and Gaussian (distribution)
access, when the average number of jobs per CPU is 1; 95% CIs
[42.32, 49.15]%, [9.44, 9.89]% and [20.99, 22.86]% higher
throughput, respectively, for random access, Zipfian access and
Gaussian access, when the average number of jobs per CPU is 2.
Daniel from Michigan Tech reported [14]:
With Memcached allocating ~100GB of byte-addressable Optante,
performance improvement in terms of throughput (measured as queries
per second) was about 10% for a series of workloads.
Large-scale deployments
-----------------------
We've rolled out MGLRU to tens of millions of ChromeOS users and
about a million Android users. Google's fleetwide profiling [15] shows
an overall 40% decrease in kswapd CPU usage, in addition to
improvements in other UX metrics, e.g., an 85% decrease in the number
of low-memory kills at the 75th percentile and an 18% decrease in
app launch time at the 50th percentile.
The downstream kernels that have been using MGLRU include:
1. Android [16]
2. Arch Linux Zen [17]
3. Armbian [18]
4. ChromeOS [19]
5. Liquorix [20]
6. OpenWrt [21]
7. post-factum [22]
8. XanMod [23]
[11] https://lore.kernel.org/r/140226722f2032c86301fbd326d91baefe3d7d23.camel@yandex.ru/
[12] https://lore.kernel.org/r/87czj3mux0.fsf@vajain21.in.ibm.com/
[13] https://lore.kernel.org/r/20220105024423.26409-1-szhai2@cs.rochester.edu/
[14] https://lore.kernel.org/r/CA+4-3vksGvKd18FgRinxhqHetBS1hQekJE2gwco8Ja-bJWKtFw@mail.gmail.com/
[15] https://dl.acm.org/doi/10.1145/2749469.2750392
[16] https://android.com
[17] https://archlinux.org
[18] https://armbian.com
[19] https://chromium.org
[20] https://liquorix.net
[21] https://openwrt.org
[22] https://codeberg.org/pf-kernel
[23] https://xanmod.org
Summary
=======
The facts are:
1. The independent lab results and the real-world applications
indicate substantial improvements; there are no known regressions.
2. Thrashing prevention, working set estimation and proactive reclaim
work out of the box; there are no equivalent solutions.
3. There is a lot of new code; no smaller changes have been
demonstrated similar effects.
Our options, accordingly, are:
1. Given the amount of evidence, the reported improvements will likely
materialize for a wide range of workloads.
2. Gauging the interest from the past discussions, the new features
will likely be put to use for both personal computers and data
centers.
3. Based on Google's track record, the new code will likely be well
maintained in the long term. It'd be more difficult if not
impossible to achieve similar effects with other approaches.
This patch (of 14):
Some architectures automatically set the accessed bit in PTEs, e.g., x86
and arm64 v8.2. On architectures that do not have this capability,
clearing the accessed bit in a PTE usually triggers a page fault following
the TLB miss of this PTE (to emulate the accessed bit).
Being aware of this capability can help make better decisions, e.g.,
whether to spread the work out over a period of time to reduce bursty page
faults when trying to clear the accessed bit in many PTEs.
Note that theoretically this capability can be unreliable, e.g.,
hotplugged CPUs might be different from builtin ones. Therefore it should
not be used in architecture-independent code that involves correctness,
e.g., to determine whether TLB flushes are required (in combination with
the accessed bit).
Link: https://lkml.kernel.org/r/20220918080010.2920238-1-yuzhao@google.com
Link: https://lkml.kernel.org/r/20220918080010.2920238-2-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Reviewed-by: Barry Song <baohua@kernel.org>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Acked-by: Will Deacon <will@kernel.org>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-arm-kernel@lists.infradead.org
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michael Larabel <Michael@MichaelLarabel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-18 07:59:58 +00:00
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
2023-06-13 00:10:43 +00:00
|
|
|
#ifndef arch_check_zapped_pte
|
|
|
|
static inline void arch_check_zapped_pte(struct vm_area_struct *vma,
|
|
|
|
pte_t pte)
|
|
|
|
{
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
|
|
|
#ifndef arch_check_zapped_pmd
|
|
|
|
static inline void arch_check_zapped_pmd(struct vm_area_struct *vma,
|
|
|
|
pmd_t pmd)
|
|
|
|
{
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
2024-08-12 18:12:23 +00:00
|
|
|
#ifndef arch_check_zapped_pud
|
|
|
|
static inline void arch_check_zapped_pud(struct vm_area_struct *vma, pud_t pud)
|
|
|
|
{
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
#ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR
|
2011-01-13 23:46:40 +00:00
|
|
|
static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
|
|
|
|
unsigned long address,
|
|
|
|
pte_t *ptep)
|
|
|
|
{
|
mm: ptep_get() conversion
Convert all instances of direct pte_t* dereferencing to instead use
ptep_get() helper. This means that by default, the accesses change from a
C dereference to a READ_ONCE(). This is technically the correct thing to
do since where pgtables are modified by HW (for access/dirty) they are
volatile and therefore we should always ensure READ_ONCE() semantics.
But more importantly, by always using the helper, it can be overridden by
the architecture to fully encapsulate the contents of the pte. Arch code
is deliberately not converted, as the arch code knows best. It is
intended that arch code (arm64) will override the default with its own
implementation that can (e.g.) hide certain bits from the core code, or
determine young/dirty status by mixing in state from another source.
Conversion was done using Coccinelle:
----
// $ make coccicheck \
// COCCI=ptepget.cocci \
// SPFLAGS="--include-headers" \
// MODE=patch
virtual patch
@ depends on patch @
pte_t *v;
@@
- *v
+ ptep_get(v)
----
Then reviewed and hand-edited to avoid multiple unnecessary calls to
ptep_get(), instead opting to store the result of a single call in a
variable, where it is correct to do so. This aims to negate any cost of
READ_ONCE() and will benefit arch-overrides that may be more complex.
Included is a fix for an issue in an earlier version of this patch that
was pointed out by kernel test robot. The issue arose because config
MMU=n elides definition of the ptep helper functions, including
ptep_get(). HUGETLB_PAGE=n configs still define a simple
huge_ptep_clear_flush() for linking purposes, which dereferences the ptep.
So when both configs are disabled, this caused a build error because
ptep_get() is not defined. Fix by continuing to do a direct dereference
when MMU=n. This is safe because for this config the arch code cannot be
trying to virtualize the ptes because none of the ptep helpers are
defined.
Link: https://lkml.kernel.org/r/20230612151545.3317766-4-ryan.roberts@arm.com
Reported-by: kernel test robot <lkp@intel.com>
Link: https://lore.kernel.org/oe-kbuild-all/202305120142.yXsNEo6H-lkp@intel.com/
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Dave Airlie <airlied@gmail.com>
Cc: Dimitri Sivanich <dimitri.sivanich@hpe.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: SeongJae Park <sj@kernel.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-12 15:15:45 +00:00
|
|
|
pte_t pte = ptep_get(ptep);
|
2011-01-13 23:46:40 +00:00
|
|
|
pte_clear(mm, address, ptep);
|
2023-07-13 17:26:31 +00:00
|
|
|
page_table_check_pte_clear(mm, pte);
|
2011-01-13 23:46:40 +00:00
|
|
|
return pte;
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
2024-04-18 13:44:32 +00:00
|
|
|
#ifndef clear_young_dirty_ptes
|
|
|
|
/**
|
|
|
|
* clear_young_dirty_ptes - Mark PTEs that map consecutive pages of the
|
|
|
|
* same folio as old/clean.
|
|
|
|
* @mm: Address space the pages are mapped into.
|
|
|
|
* @addr: Address the first page is mapped at.
|
|
|
|
* @ptep: Page table pointer for the first entry.
|
|
|
|
* @nr: Number of entries to mark old/clean.
|
|
|
|
* @flags: Flags to modify the PTE batch semantics.
|
|
|
|
*
|
|
|
|
* May be overridden by the architecture; otherwise, implemented by
|
|
|
|
* get_and_clear/modify/set for each pte in the range.
|
|
|
|
*
|
|
|
|
* Note that PTE bits in the PTE range besides the PFN can differ. For example,
|
|
|
|
* some PTEs might be write-protected.
|
|
|
|
*
|
|
|
|
* Context: The caller holds the page table lock. The PTEs map consecutive
|
|
|
|
* pages that belong to the same folio. The PTEs are all in the same PMD.
|
|
|
|
*/
|
|
|
|
static inline void clear_young_dirty_ptes(struct vm_area_struct *vma,
|
|
|
|
unsigned long addr, pte_t *ptep,
|
|
|
|
unsigned int nr, cydp_t flags)
|
|
|
|
{
|
|
|
|
pte_t pte;
|
|
|
|
|
|
|
|
for (;;) {
|
|
|
|
if (flags == CYDP_CLEAR_YOUNG)
|
|
|
|
ptep_test_and_clear_young(vma, addr, ptep);
|
|
|
|
else {
|
|
|
|
pte = ptep_get_and_clear(vma->vm_mm, addr, ptep);
|
|
|
|
if (flags & CYDP_CLEAR_YOUNG)
|
|
|
|
pte = pte_mkold(pte);
|
|
|
|
if (flags & CYDP_CLEAR_DIRTY)
|
|
|
|
pte = pte_mkclean(pte);
|
|
|
|
set_pte_at(vma->vm_mm, addr, ptep, pte);
|
|
|
|
}
|
|
|
|
if (--nr == 0)
|
|
|
|
break;
|
|
|
|
ptep++;
|
|
|
|
addr += PAGE_SIZE;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
2022-05-13 03:23:06 +00:00
|
|
|
static inline void ptep_clear(struct mm_struct *mm, unsigned long addr,
|
|
|
|
pte_t *ptep)
|
|
|
|
{
|
mm: pgtable: make ptep_clear() non-atomic
In the generic ptep_get_and_clear() implementation, it is just a simple
combination of ptep_get() and pte_clear(). But for some architectures
(such as x86 and arm64, etc), the hardware will modify the A/D bits of the
page table entry, so the ptep_get_and_clear() needs to be overwritten
and implemented as an atomic operation to avoid contention, which has a
performance cost.
The commit d283d422c6c4 ("x86: mm: add x86_64 support for page table
check") adds the ptep_clear() on the x86, and makes it call
ptep_get_and_clear() when CONFIG_PAGE_TABLE_CHECK is enabled. The page
table check feature does not actually care about the A/D bits, so only
ptep_get() + pte_clear() should be called. But considering that the page
table check is a debug option, this should not have much of an impact.
But then the commit de8c8e52836d ("mm: page_table_check: add hooks to
public helpers") changed ptep_clear() to unconditionally call
ptep_get_and_clear(), so that the CONFIG_PAGE_TABLE_CHECK check can be
put into the page table check stubs (in include/linux/page_table_check.h).
This also cause performance loss to the kernel without
CONFIG_PAGE_TABLE_CHECK enabled, which doesn't make sense.
Currently ptep_clear() is only used in debug code and in khugepaged
collapse paths, which are fairly expensive. So the cost of an extra atomic
RMW operation does not matter. But this may be used for other paths in the
future. After all, for the present pte entry, we need to call ptep_clear()
instead of pte_clear() to ensure that PAGE_TABLE_CHECK works properly.
So to be more precise, just calling ptep_get() and pte_clear() in the
ptep_clear().
Link: https://lkml.kernel.org/r/20241122073652.54030-1-zhengqi.arch@bytedance.com
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Jann Horn <jannh@google.com>
Reviewed-by: Muchun Song <muchun.song@linux.dev>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Tong Tiangen <tongtiangen@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-22 07:36:52 +00:00
|
|
|
pte_t pte = ptep_get(ptep);
|
|
|
|
|
|
|
|
pte_clear(mm, addr, ptep);
|
|
|
|
/*
|
|
|
|
* No need for ptep_get_and_clear(): page table check doesn't care about
|
|
|
|
* any bits that could have been set by HW concurrently.
|
|
|
|
*/
|
|
|
|
page_table_check_pte_clear(mm, pte);
|
2022-05-13 03:23:06 +00:00
|
|
|
}
|
|
|
|
|
2022-10-21 12:51:44 +00:00
|
|
|
#ifdef CONFIG_GUP_GET_PXX_LOW_HIGH
|
2020-11-13 10:41:40 +00:00
|
|
|
/*
|
2020-11-26 13:04:46 +00:00
|
|
|
* For walking the pagetables without holding any locks. Some architectures
|
|
|
|
* (eg x86-32 PAE) cannot load the entries atomically without using expensive
|
|
|
|
* instructions. We are guaranteed that a PTE will only either go from not
|
|
|
|
* present to present, or present to not present -- it will not switch to a
|
|
|
|
* completely different present page without a TLB flush inbetween; which we
|
|
|
|
* are blocking by holding interrupts off.
|
2020-11-13 10:41:40 +00:00
|
|
|
*
|
|
|
|
* Setting ptes from not present to present goes:
|
|
|
|
*
|
|
|
|
* ptep->pte_high = h;
|
|
|
|
* smp_wmb();
|
|
|
|
* ptep->pte_low = l;
|
|
|
|
*
|
|
|
|
* And present to not present goes:
|
|
|
|
*
|
|
|
|
* ptep->pte_low = 0;
|
|
|
|
* smp_wmb();
|
|
|
|
* ptep->pte_high = 0;
|
|
|
|
*
|
|
|
|
* We must ensure here that the load of pte_low sees 'l' IFF pte_high sees 'h'.
|
|
|
|
* We load pte_high *after* loading pte_low, which ensures we don't see an older
|
|
|
|
* value of pte_high. *Then* we recheck pte_low, which ensures that we haven't
|
|
|
|
* picked up a changed pte high. We might have gotten rubbish values from
|
|
|
|
* pte_low and pte_high, but we are guaranteed that pte_low will not have the
|
|
|
|
* present bit set *unless* it is 'l'. Because get_user_pages_fast() only
|
|
|
|
* operates on present ptes we're safe.
|
|
|
|
*/
|
|
|
|
static inline pte_t ptep_get_lockless(pte_t *ptep)
|
|
|
|
{
|
|
|
|
pte_t pte;
|
|
|
|
|
|
|
|
do {
|
|
|
|
pte.pte_low = ptep->pte_low;
|
|
|
|
smp_rmb();
|
|
|
|
pte.pte_high = ptep->pte_high;
|
|
|
|
smp_rmb();
|
|
|
|
} while (unlikely(pte.pte_low != ptep->pte_low));
|
|
|
|
|
|
|
|
return pte;
|
|
|
|
}
|
2020-11-26 16:16:22 +00:00
|
|
|
#define ptep_get_lockless ptep_get_lockless
|
|
|
|
|
|
|
|
#if CONFIG_PGTABLE_LEVELS > 2
|
|
|
|
static inline pmd_t pmdp_get_lockless(pmd_t *pmdp)
|
|
|
|
{
|
|
|
|
pmd_t pmd;
|
|
|
|
|
|
|
|
do {
|
|
|
|
pmd.pmd_low = pmdp->pmd_low;
|
|
|
|
smp_rmb();
|
|
|
|
pmd.pmd_high = pmdp->pmd_high;
|
|
|
|
smp_rmb();
|
|
|
|
} while (unlikely(pmd.pmd_low != pmdp->pmd_low));
|
|
|
|
|
|
|
|
return pmd;
|
|
|
|
}
|
|
|
|
#define pmdp_get_lockless pmdp_get_lockless
|
mm/pgtable: add PAE safety to __pte_offset_map()
There is a faint risk that __pte_offset_map(), on a 32-bit architecture
with a 64-bit pmd_t e.g. x86-32 with CONFIG_X86_PAE=y, would succeed on a
pmdval assembled from a pmd_low and a pmd_high which never belonged
together: their combination not pointing to a page table at all, perhaps
not even a valid pfn. pmdp_get_lockless() is not enough to prevent that.
Guard against that (on such configs) by local_irq_save() blocking TLB
flush between present updates, as linux/pgtable.h suggests. It's only
needed around the pmdp_get_lockless() in __pte_offset_map(): a race when
__pte_offset_map_lock() repeats the pmdp_get_lockless() after getting the
lock, would just send it back to __pte_offset_map() again.
Complement this pmdp_get_lockless_start() and pmdp_get_lockless_end(),
used only locally in __pte_offset_map(), with a pmdp_get_lockless_sync()
synonym for tlb_remove_table_sync_one(): to send the necessary interrupt
at the right moment on those configs which do not already send it.
CONFIG_GUP_GET_PXX_LOW_HIGH is enabled when required by mips, sh and x86.
It is not enabled by arm-32 CONFIG_ARM_LPAE: my understanding is that Will
Deacon's 2020 enhancements to READ_ONCE() are sufficient for arm. It is
not enabled by arc, but its pmd_t is 32-bit even when pte_t 64-bit.
Limit the IRQ disablement to CONFIG_HIGHPTE? Perhaps, but would need a
little more work, to retry if pmd_low good for page table, but pmd_high
non-zero from THP (and that might be making x86-specific assumptions).
Link: https://lkml.kernel.org/r/3adcd8f-9191-2df1-d7ea-c4877698aad@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Huang, Ying <ying.huang@intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: SeongJae Park <sj@kernel.org>
Cc: Song Liu <song@kernel.org>
Cc: Steven Price <steven.price@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Zack Rusin <zackr@vmware.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-07-12 04:32:05 +00:00
|
|
|
#define pmdp_get_lockless_sync() tlb_remove_table_sync_one()
|
2020-11-26 16:16:22 +00:00
|
|
|
#endif /* CONFIG_PGTABLE_LEVELS > 2 */
|
2022-10-21 12:51:44 +00:00
|
|
|
#endif /* CONFIG_GUP_GET_PXX_LOW_HIGH */
|
2020-11-26 16:16:22 +00:00
|
|
|
|
2020-11-13 10:41:40 +00:00
|
|
|
/*
|
|
|
|
* We require that the PTE can be read atomically.
|
|
|
|
*/
|
2020-11-26 16:16:22 +00:00
|
|
|
#ifndef ptep_get_lockless
|
2020-11-13 10:41:40 +00:00
|
|
|
static inline pte_t ptep_get_lockless(pte_t *ptep)
|
|
|
|
{
|
|
|
|
return ptep_get(ptep);
|
|
|
|
}
|
2020-11-26 16:16:22 +00:00
|
|
|
#endif
|
|
|
|
|
|
|
|
#ifndef pmdp_get_lockless
|
|
|
|
static inline pmd_t pmdp_get_lockless(pmd_t *pmdp)
|
|
|
|
{
|
|
|
|
return pmdp_get(pmdp);
|
|
|
|
}
|
mm/pgtable: add PAE safety to __pte_offset_map()
There is a faint risk that __pte_offset_map(), on a 32-bit architecture
with a 64-bit pmd_t e.g. x86-32 with CONFIG_X86_PAE=y, would succeed on a
pmdval assembled from a pmd_low and a pmd_high which never belonged
together: their combination not pointing to a page table at all, perhaps
not even a valid pfn. pmdp_get_lockless() is not enough to prevent that.
Guard against that (on such configs) by local_irq_save() blocking TLB
flush between present updates, as linux/pgtable.h suggests. It's only
needed around the pmdp_get_lockless() in __pte_offset_map(): a race when
__pte_offset_map_lock() repeats the pmdp_get_lockless() after getting the
lock, would just send it back to __pte_offset_map() again.
Complement this pmdp_get_lockless_start() and pmdp_get_lockless_end(),
used only locally in __pte_offset_map(), with a pmdp_get_lockless_sync()
synonym for tlb_remove_table_sync_one(): to send the necessary interrupt
at the right moment on those configs which do not already send it.
CONFIG_GUP_GET_PXX_LOW_HIGH is enabled when required by mips, sh and x86.
It is not enabled by arm-32 CONFIG_ARM_LPAE: my understanding is that Will
Deacon's 2020 enhancements to READ_ONCE() are sufficient for arm. It is
not enabled by arc, but its pmd_t is 32-bit even when pte_t 64-bit.
Limit the IRQ disablement to CONFIG_HIGHPTE? Perhaps, but would need a
little more work, to retry if pmd_low good for page table, but pmd_high
non-zero from THP (and that might be making x86-specific assumptions).
Link: https://lkml.kernel.org/r/3adcd8f-9191-2df1-d7ea-c4877698aad@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Huang, Ying <ying.huang@intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: SeongJae Park <sj@kernel.org>
Cc: Song Liu <song@kernel.org>
Cc: Steven Price <steven.price@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Zack Rusin <zackr@vmware.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-07-12 04:32:05 +00:00
|
|
|
static inline void pmdp_get_lockless_sync(void)
|
|
|
|
{
|
|
|
|
}
|
2020-11-26 16:16:22 +00:00
|
|
|
#endif
|
2020-11-13 10:41:40 +00:00
|
|
|
|
2011-01-13 23:46:40 +00:00
|
|
|
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
|
2017-02-24 22:57:02 +00:00
|
|
|
#ifndef __HAVE_ARCH_PMDP_HUGE_GET_AND_CLEAR
|
2015-06-24 23:57:44 +00:00
|
|
|
static inline pmd_t pmdp_huge_get_and_clear(struct mm_struct *mm,
|
|
|
|
unsigned long address,
|
|
|
|
pmd_t *pmdp)
|
2011-01-13 23:46:40 +00:00
|
|
|
{
|
|
|
|
pmd_t pmd = *pmdp;
|
2022-05-13 03:23:06 +00:00
|
|
|
|
2012-10-08 23:32:59 +00:00
|
|
|
pmd_clear(pmdp);
|
2023-07-13 17:26:32 +00:00
|
|
|
page_table_check_pmd_clear(mm, pmd);
|
2022-05-13 03:23:06 +00:00
|
|
|
|
2011-01-13 23:46:40 +00:00
|
|
|
return pmd;
|
2011-06-15 22:08:34 +00:00
|
|
|
}
|
2017-02-24 22:57:02 +00:00
|
|
|
#endif /* __HAVE_ARCH_PMDP_HUGE_GET_AND_CLEAR */
|
|
|
|
#ifndef __HAVE_ARCH_PUDP_HUGE_GET_AND_CLEAR
|
|
|
|
static inline pud_t pudp_huge_get_and_clear(struct mm_struct *mm,
|
|
|
|
unsigned long address,
|
|
|
|
pud_t *pudp)
|
|
|
|
{
|
|
|
|
pud_t pud = *pudp;
|
|
|
|
|
|
|
|
pud_clear(pudp);
|
2023-07-13 17:26:33 +00:00
|
|
|
page_table_check_pud_clear(mm, pud);
|
2022-05-13 03:23:06 +00:00
|
|
|
|
2017-02-24 22:57:02 +00:00
|
|
|
return pud;
|
|
|
|
}
|
|
|
|
#endif /* __HAVE_ARCH_PUDP_HUGE_GET_AND_CLEAR */
|
2011-01-13 23:46:40 +00:00
|
|
|
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2014-10-24 08:52:29 +00:00
|
|
|
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
|
2017-02-24 22:57:02 +00:00
|
|
|
#ifndef __HAVE_ARCH_PMDP_HUGE_GET_AND_CLEAR_FULL
|
2020-05-05 07:17:28 +00:00
|
|
|
static inline pmd_t pmdp_huge_get_and_clear_full(struct vm_area_struct *vma,
|
2014-10-24 08:52:29 +00:00
|
|
|
unsigned long address, pmd_t *pmdp,
|
|
|
|
int full)
|
|
|
|
{
|
2020-05-05 07:17:28 +00:00
|
|
|
return pmdp_huge_get_and_clear(vma->vm_mm, address, pmdp);
|
2014-10-24 08:52:29 +00:00
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
2017-02-24 22:57:02 +00:00
|
|
|
#ifndef __HAVE_ARCH_PUDP_HUGE_GET_AND_CLEAR_FULL
|
2023-07-24 19:07:48 +00:00
|
|
|
static inline pud_t pudp_huge_get_and_clear_full(struct vm_area_struct *vma,
|
2017-02-24 22:57:02 +00:00
|
|
|
unsigned long address, pud_t *pudp,
|
|
|
|
int full)
|
|
|
|
{
|
2023-07-24 19:07:48 +00:00
|
|
|
return pudp_huge_get_and_clear(vma->vm_mm, address, pudp);
|
2017-02-24 22:57:02 +00:00
|
|
|
}
|
|
|
|
#endif
|
|
|
|
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
|
|
|
|
|
[PATCH] x86: ptep_clear optimization
Add a new accessor for PTEs, which passes the full hint from the mmu_gather
struct; this allows architectures with hardware pagetables to optimize away
atomic PTE operations when destroying an address space. Removing the
locked operation should allow better pipelining of memory access in this
loop. I measured an average savings of 30-35 cycles per zap_pte_range on
the first 500 destructions on Pentium-M, but I believe the optimization
would win more on older processors which still assert the bus lock on xchg
for an exclusive cacheline.
Update: I made some new measurements, and this saves exactly 26 cycles over
ptep_get_and_clear on Pentium M. On P4, with a PAE kernel, this saves 180
cycles per ptep_get_and_clear, for a whopping 92160 cycles savings for a
full address space destruction.
pte_clear_full is not yet used, but is provided for future optimizations
(in particular, when running inside of a hypervisor that queues page table
updates, the full hint allows us to avoid queueing unnecessary page table
update for an address space in the process of being destroyed.
This is not a huge win, but it does help a bit, and sets the stage for
further hypervisor optimization of the mm layer on all architectures.
Signed-off-by: Zachary Amsden <zach@vmware.com>
Cc: Christoph Lameter <christoph@lameter.com>
Cc: <linux-mm@kvack.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-03 22:55:04 +00:00
|
|
|
#ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR_FULL
|
2011-01-13 23:46:40 +00:00
|
|
|
static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
|
|
|
|
unsigned long address, pte_t *ptep,
|
|
|
|
int full)
|
|
|
|
{
|
2022-11-28 13:07:43 +00:00
|
|
|
return ptep_get_and_clear(mm, address, ptep);
|
2011-01-13 23:46:40 +00:00
|
|
|
}
|
[PATCH] x86: ptep_clear optimization
Add a new accessor for PTEs, which passes the full hint from the mmu_gather
struct; this allows architectures with hardware pagetables to optimize away
atomic PTE operations when destroying an address space. Removing the
locked operation should allow better pipelining of memory access in this
loop. I measured an average savings of 30-35 cycles per zap_pte_range on
the first 500 destructions on Pentium-M, but I believe the optimization
would win more on older processors which still assert the bus lock on xchg
for an exclusive cacheline.
Update: I made some new measurements, and this saves exactly 26 cycles over
ptep_get_and_clear on Pentium M. On P4, with a PAE kernel, this saves 180
cycles per ptep_get_and_clear, for a whopping 92160 cycles savings for a
full address space destruction.
pte_clear_full is not yet used, but is provided for future optimizations
(in particular, when running inside of a hypervisor that queues page table
updates, the full hint allows us to avoid queueing unnecessary page table
update for an address space in the process of being destroyed.
This is not a huge win, but it does help a bit, and sets the stage for
further hypervisor optimization of the mm layer on all architectures.
Signed-off-by: Zachary Amsden <zach@vmware.com>
Cc: Christoph Lameter <christoph@lameter.com>
Cc: <linux-mm@kvack.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-03 22:55:04 +00:00
|
|
|
#endif
|
|
|
|
|
2024-02-14 20:44:35 +00:00
|
|
|
#ifndef get_and_clear_full_ptes
|
|
|
|
/**
|
|
|
|
* get_and_clear_full_ptes - Clear present PTEs that map consecutive pages of
|
|
|
|
* the same folio, collecting dirty/accessed bits.
|
|
|
|
* @mm: Address space the pages are mapped into.
|
|
|
|
* @addr: Address the first page is mapped at.
|
|
|
|
* @ptep: Page table pointer for the first entry.
|
|
|
|
* @nr: Number of entries to clear.
|
|
|
|
* @full: Whether we are clearing a full mm.
|
|
|
|
*
|
|
|
|
* May be overridden by the architecture; otherwise, implemented as a simple
|
|
|
|
* loop over ptep_get_and_clear_full(), merging dirty/accessed bits into the
|
|
|
|
* returned PTE.
|
|
|
|
*
|
|
|
|
* Note that PTE bits in the PTE range besides the PFN can differ. For example,
|
|
|
|
* some PTEs might be write-protected.
|
|
|
|
*
|
|
|
|
* Context: The caller holds the page table lock. The PTEs map consecutive
|
|
|
|
* pages that belong to the same folio. The PTEs are all in the same PMD.
|
|
|
|
*/
|
|
|
|
static inline pte_t get_and_clear_full_ptes(struct mm_struct *mm,
|
|
|
|
unsigned long addr, pte_t *ptep, unsigned int nr, int full)
|
|
|
|
{
|
|
|
|
pte_t pte, tmp_pte;
|
|
|
|
|
|
|
|
pte = ptep_get_and_clear_full(mm, addr, ptep, full);
|
|
|
|
while (--nr) {
|
|
|
|
ptep++;
|
|
|
|
addr += PAGE_SIZE;
|
|
|
|
tmp_pte = ptep_get_and_clear_full(mm, addr, ptep, full);
|
|
|
|
if (pte_dirty(tmp_pte))
|
|
|
|
pte = pte_mkdirty(pte);
|
|
|
|
if (pte_young(tmp_pte))
|
|
|
|
pte = pte_mkyoung(pte);
|
|
|
|
}
|
|
|
|
return pte;
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
|
|
|
#ifndef clear_full_ptes
|
|
|
|
/**
|
|
|
|
* clear_full_ptes - Clear present PTEs that map consecutive pages of the same
|
|
|
|
* folio.
|
|
|
|
* @mm: Address space the pages are mapped into.
|
|
|
|
* @addr: Address the first page is mapped at.
|
|
|
|
* @ptep: Page table pointer for the first entry.
|
|
|
|
* @nr: Number of entries to clear.
|
|
|
|
* @full: Whether we are clearing a full mm.
|
|
|
|
*
|
|
|
|
* May be overridden by the architecture; otherwise, implemented as a simple
|
|
|
|
* loop over ptep_get_and_clear_full().
|
|
|
|
*
|
|
|
|
* Note that PTE bits in the PTE range besides the PFN can differ. For example,
|
|
|
|
* some PTEs might be write-protected.
|
|
|
|
*
|
|
|
|
* Context: The caller holds the page table lock. The PTEs map consecutive
|
|
|
|
* pages that belong to the same folio. The PTEs are all in the same PMD.
|
|
|
|
*/
|
|
|
|
static inline void clear_full_ptes(struct mm_struct *mm, unsigned long addr,
|
|
|
|
pte_t *ptep, unsigned int nr, int full)
|
|
|
|
{
|
|
|
|
for (;;) {
|
|
|
|
ptep_get_and_clear_full(mm, addr, ptep, full);
|
|
|
|
if (--nr == 0)
|
|
|
|
break;
|
|
|
|
ptep++;
|
|
|
|
addr += PAGE_SIZE;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
#endif
|
2020-05-27 02:25:18 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* If two threads concurrently fault at the same page, the thread that
|
|
|
|
* won the race updates the PTE and its local TLB/Cache. The other thread
|
|
|
|
* gives up, simply does nothing, and continues; on architectures where
|
|
|
|
* software can update TLB, local TLB can be updated here to avoid next page
|
|
|
|
* fault. This function updates TLB only, do nothing with cache or others.
|
|
|
|
* It is the difference with function update_mmu_cache.
|
|
|
|
*/
|
2024-05-22 06:12:02 +00:00
|
|
|
#ifndef update_mmu_tlb_range
|
|
|
|
static inline void update_mmu_tlb_range(struct vm_area_struct *vma,
|
|
|
|
unsigned long address, pte_t *ptep, unsigned int nr)
|
|
|
|
{
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
2020-05-27 02:25:18 +00:00
|
|
|
static inline void update_mmu_tlb(struct vm_area_struct *vma,
|
|
|
|
unsigned long address, pte_t *ptep)
|
|
|
|
{
|
2024-05-22 06:12:03 +00:00
|
|
|
update_mmu_tlb_range(vma, address, ptep, 1);
|
2020-05-27 02:25:18 +00:00
|
|
|
}
|
|
|
|
|
2006-10-01 06:29:31 +00:00
|
|
|
/*
|
|
|
|
* Some architectures may be able to avoid expensive synchronization
|
|
|
|
* primitives when modifications are made to PTE's which are already
|
|
|
|
* not present, or in the process of an address space destruction.
|
|
|
|
*/
|
|
|
|
#ifndef __HAVE_ARCH_PTE_CLEAR_NOT_PRESENT_FULL
|
2011-01-13 23:46:40 +00:00
|
|
|
static inline void pte_clear_not_present_full(struct mm_struct *mm,
|
|
|
|
unsigned long address,
|
|
|
|
pte_t *ptep,
|
|
|
|
int full)
|
|
|
|
{
|
|
|
|
pte_clear(mm, address, ptep);
|
|
|
|
}
|
[PATCH] x86: ptep_clear optimization
Add a new accessor for PTEs, which passes the full hint from the mmu_gather
struct; this allows architectures with hardware pagetables to optimize away
atomic PTE operations when destroying an address space. Removing the
locked operation should allow better pipelining of memory access in this
loop. I measured an average savings of 30-35 cycles per zap_pte_range on
the first 500 destructions on Pentium-M, but I believe the optimization
would win more on older processors which still assert the bus lock on xchg
for an exclusive cacheline.
Update: I made some new measurements, and this saves exactly 26 cycles over
ptep_get_and_clear on Pentium M. On P4, with a PAE kernel, this saves 180
cycles per ptep_get_and_clear, for a whopping 92160 cycles savings for a
full address space destruction.
pte_clear_full is not yet used, but is provided for future optimizations
(in particular, when running inside of a hypervisor that queues page table
updates, the full hint allows us to avoid queueing unnecessary page table
update for an address space in the process of being destroyed.
This is not a huge win, but it does help a bit, and sets the stage for
further hypervisor optimization of the mm layer on all architectures.
Signed-off-by: Zachary Amsden <zach@vmware.com>
Cc: Christoph Lameter <christoph@lameter.com>
Cc: <linux-mm@kvack.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-03 22:55:04 +00:00
|
|
|
#endif
|
|
|
|
|
mm: swap: free_swap_and_cache_nr() as batched free_swap_and_cache()
Now that we no longer have a convenient flag in the cluster to determine
if a folio is large, free_swap_and_cache() will take a reference and lock
a large folio much more often, which could lead to contention and (e.g.)
failure to split large folios, etc.
Let's solve that problem by batch freeing swap and cache with a new
function, free_swap_and_cache_nr(), to free a contiguous range of swap
entries together. This allows us to first drop a reference to each swap
slot before we try to release the cache folio. This means we only try to
release the folio once, only taking the reference and lock once - much
better than the previous 512 times for the 2M THP case.
Contiguous swap entries are gathered in zap_pte_range() and
madvise_free_pte_range() in a similar way to how present ptes are already
gathered in zap_pte_range().
While we are at it, let's simplify by converting the return type of both
functions to void. The return value was used only by zap_pte_range() to
print a bad pte, and was ignored by everyone else, so the extra reporting
wasn't exactly guaranteed. We will still get the warning with most of the
information from get_swap_device(). With the batch version, we wouldn't
know which pte was bad anyway so could print the wrong one.
[ryan.roberts@arm.com: fix a build warning on parisc]
Link: https://lkml.kernel.org/r/20240409111840.3173122-1-ryan.roberts@arm.com
Link: https://lkml.kernel.org/r/20240408183946.2991168-3-ryan.roberts@arm.com
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Barry Song <21cnbao@gmail.com>
Cc: Barry Song <v-songbaohua@oppo.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: Gao Xiang <xiang@kernel.org>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Lance Yang <ioworker0@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-08 18:39:41 +00:00
|
|
|
#ifndef clear_not_present_full_ptes
|
|
|
|
/**
|
|
|
|
* clear_not_present_full_ptes - Clear multiple not present PTEs which are
|
|
|
|
* consecutive in the pgtable.
|
|
|
|
* @mm: Address space the ptes represent.
|
|
|
|
* @addr: Address of the first pte.
|
|
|
|
* @ptep: Page table pointer for the first entry.
|
|
|
|
* @nr: Number of entries to clear.
|
|
|
|
* @full: Whether we are clearing a full mm.
|
|
|
|
*
|
|
|
|
* May be overridden by the architecture; otherwise, implemented as a simple
|
|
|
|
* loop over pte_clear_not_present_full().
|
|
|
|
*
|
|
|
|
* Context: The caller holds the page table lock. The PTEs are all not present.
|
|
|
|
* The PTEs are all in the same PMD.
|
|
|
|
*/
|
|
|
|
static inline void clear_not_present_full_ptes(struct mm_struct *mm,
|
|
|
|
unsigned long addr, pte_t *ptep, unsigned int nr, int full)
|
|
|
|
{
|
|
|
|
for (;;) {
|
|
|
|
pte_clear_not_present_full(mm, addr, ptep, full);
|
|
|
|
if (--nr == 0)
|
|
|
|
break;
|
|
|
|
ptep++;
|
|
|
|
addr += PAGE_SIZE;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
#ifndef __HAVE_ARCH_PTEP_CLEAR_FLUSH
|
2011-01-13 23:46:40 +00:00
|
|
|
extern pte_t ptep_clear_flush(struct vm_area_struct *vma,
|
|
|
|
unsigned long address,
|
|
|
|
pte_t *ptep);
|
|
|
|
#endif
|
|
|
|
|
2015-06-24 23:57:44 +00:00
|
|
|
#ifndef __HAVE_ARCH_PMDP_HUGE_CLEAR_FLUSH
|
|
|
|
extern pmd_t pmdp_huge_clear_flush(struct vm_area_struct *vma,
|
2011-01-13 23:46:40 +00:00
|
|
|
unsigned long address,
|
|
|
|
pmd_t *pmdp);
|
2017-02-24 22:57:02 +00:00
|
|
|
extern pud_t pudp_huge_clear_flush(struct vm_area_struct *vma,
|
|
|
|
unsigned long address,
|
|
|
|
pud_t *pudp);
|
2005-04-16 22:20:36 +00:00
|
|
|
#endif
|
|
|
|
|
2023-06-13 00:10:27 +00:00
|
|
|
#ifndef pte_mkwrite
|
2023-06-13 00:10:29 +00:00
|
|
|
static inline pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
|
2023-06-13 00:10:27 +00:00
|
|
|
{
|
|
|
|
return pte_mkwrite_novma(pte);
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
|
|
|
#if defined(CONFIG_ARCH_WANT_PMD_MKWRITE) && !defined(pmd_mkwrite)
|
2023-06-13 00:10:29 +00:00
|
|
|
static inline pmd_t pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
|
2023-06-13 00:10:27 +00:00
|
|
|
{
|
|
|
|
return pmd_mkwrite_novma(pmd);
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
#ifndef __HAVE_ARCH_PTEP_SET_WRPROTECT
|
2005-11-07 08:59:43 +00:00
|
|
|
struct mm_struct;
|
2005-04-16 22:20:36 +00:00
|
|
|
static inline void ptep_set_wrprotect(struct mm_struct *mm, unsigned long address, pte_t *ptep)
|
|
|
|
{
|
mm: ptep_get() conversion
Convert all instances of direct pte_t* dereferencing to instead use
ptep_get() helper. This means that by default, the accesses change from a
C dereference to a READ_ONCE(). This is technically the correct thing to
do since where pgtables are modified by HW (for access/dirty) they are
volatile and therefore we should always ensure READ_ONCE() semantics.
But more importantly, by always using the helper, it can be overridden by
the architecture to fully encapsulate the contents of the pte. Arch code
is deliberately not converted, as the arch code knows best. It is
intended that arch code (arm64) will override the default with its own
implementation that can (e.g.) hide certain bits from the core code, or
determine young/dirty status by mixing in state from another source.
Conversion was done using Coccinelle:
----
// $ make coccicheck \
// COCCI=ptepget.cocci \
// SPFLAGS="--include-headers" \
// MODE=patch
virtual patch
@ depends on patch @
pte_t *v;
@@
- *v
+ ptep_get(v)
----
Then reviewed and hand-edited to avoid multiple unnecessary calls to
ptep_get(), instead opting to store the result of a single call in a
variable, where it is correct to do so. This aims to negate any cost of
READ_ONCE() and will benefit arch-overrides that may be more complex.
Included is a fix for an issue in an earlier version of this patch that
was pointed out by kernel test robot. The issue arose because config
MMU=n elides definition of the ptep helper functions, including
ptep_get(). HUGETLB_PAGE=n configs still define a simple
huge_ptep_clear_flush() for linking purposes, which dereferences the ptep.
So when both configs are disabled, this caused a build error because
ptep_get() is not defined. Fix by continuing to do a direct dereference
when MMU=n. This is safe because for this config the arch code cannot be
trying to virtualize the ptes because none of the ptep helpers are
defined.
Link: https://lkml.kernel.org/r/20230612151545.3317766-4-ryan.roberts@arm.com
Reported-by: kernel test robot <lkp@intel.com>
Link: https://lore.kernel.org/oe-kbuild-all/202305120142.yXsNEo6H-lkp@intel.com/
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Dave Airlie <airlied@gmail.com>
Cc: Dimitri Sivanich <dimitri.sivanich@hpe.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: SeongJae Park <sj@kernel.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-12 15:15:45 +00:00
|
|
|
pte_t old_pte = ptep_get(ptep);
|
2005-04-16 22:20:36 +00:00
|
|
|
set_pte_at(mm, address, ptep, pte_wrprotect(old_pte));
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
2024-01-29 12:46:47 +00:00
|
|
|
#ifndef wrprotect_ptes
|
|
|
|
/**
|
|
|
|
* wrprotect_ptes - Write-protect PTEs that map consecutive pages of the same
|
|
|
|
* folio.
|
|
|
|
* @mm: Address space the pages are mapped into.
|
|
|
|
* @addr: Address the first page is mapped at.
|
|
|
|
* @ptep: Page table pointer for the first entry.
|
|
|
|
* @nr: Number of entries to write-protect.
|
|
|
|
*
|
|
|
|
* May be overridden by the architecture; otherwise, implemented as a simple
|
|
|
|
* loop over ptep_set_wrprotect().
|
|
|
|
*
|
|
|
|
* Note that PTE bits in the PTE range besides the PFN can differ. For example,
|
|
|
|
* some PTEs might be write-protected.
|
|
|
|
*
|
|
|
|
* Context: The caller holds the page table lock. The PTEs map consecutive
|
|
|
|
* pages that belong to the same folio. The PTEs are all in the same PMD.
|
|
|
|
*/
|
|
|
|
static inline void wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
|
|
|
|
pte_t *ptep, unsigned int nr)
|
|
|
|
{
|
|
|
|
for (;;) {
|
|
|
|
ptep_set_wrprotect(mm, addr, ptep);
|
|
|
|
if (--nr == 0)
|
|
|
|
break;
|
|
|
|
ptep++;
|
|
|
|
addr += PAGE_SIZE;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
2020-05-27 02:25:19 +00:00
|
|
|
/*
|
|
|
|
* On some architectures hardware does not set page access bit when accessing
|
2021-05-07 01:06:24 +00:00
|
|
|
* memory page, it is responsibility of software setting this bit. It brings
|
2020-05-27 02:25:19 +00:00
|
|
|
* out extra page fault penalty to track page access bit. For optimization page
|
|
|
|
* access bit can be set during all page fault flow on these arches.
|
|
|
|
* To be differentiate with macro pte_mkyoung, this macro is used on platforms
|
|
|
|
* where software maintains page access bit.
|
|
|
|
*/
|
2021-06-05 03:01:08 +00:00
|
|
|
#ifndef pte_sw_mkyoung
|
|
|
|
static inline pte_t pte_sw_mkyoung(pte_t pte)
|
|
|
|
{
|
|
|
|
return pte;
|
|
|
|
}
|
|
|
|
#define pte_sw_mkyoung pte_sw_mkyoung
|
|
|
|
#endif
|
|
|
|
|
2011-01-13 23:46:40 +00:00
|
|
|
#ifndef __HAVE_ARCH_PMDP_SET_WRPROTECT
|
|
|
|
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
|
|
|
|
static inline void pmdp_set_wrprotect(struct mm_struct *mm,
|
|
|
|
unsigned long address, pmd_t *pmdp)
|
|
|
|
{
|
|
|
|
pmd_t old_pmd = *pmdp;
|
|
|
|
set_pmd_at(mm, address, pmdp, pmd_wrprotect(old_pmd));
|
|
|
|
}
|
2015-07-09 11:52:44 +00:00
|
|
|
#else
|
2011-01-13 23:46:40 +00:00
|
|
|
static inline void pmdp_set_wrprotect(struct mm_struct *mm,
|
|
|
|
unsigned long address, pmd_t *pmdp)
|
|
|
|
{
|
2015-07-09 11:52:44 +00:00
|
|
|
BUILD_BUG();
|
2011-01-13 23:46:40 +00:00
|
|
|
}
|
|
|
|
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
|
|
|
|
#endif
|
2017-02-24 22:57:02 +00:00
|
|
|
#ifndef __HAVE_ARCH_PUDP_SET_WRPROTECT
|
|
|
|
#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
|
2023-07-24 19:07:52 +00:00
|
|
|
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
|
2017-02-24 22:57:02 +00:00
|
|
|
static inline void pudp_set_wrprotect(struct mm_struct *mm,
|
|
|
|
unsigned long address, pud_t *pudp)
|
|
|
|
{
|
|
|
|
pud_t old_pud = *pudp;
|
|
|
|
|
|
|
|
set_pud_at(mm, address, pudp, pud_wrprotect(old_pud));
|
|
|
|
}
|
|
|
|
#else
|
|
|
|
static inline void pudp_set_wrprotect(struct mm_struct *mm,
|
|
|
|
unsigned long address, pud_t *pudp)
|
|
|
|
{
|
|
|
|
BUILD_BUG();
|
|
|
|
}
|
2023-07-24 19:07:52 +00:00
|
|
|
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
|
2017-02-24 22:57:02 +00:00
|
|
|
#endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
|
|
|
|
#endif
|
2011-01-13 23:46:40 +00:00
|
|
|
|
2015-06-24 23:57:39 +00:00
|
|
|
#ifndef pmdp_collapse_flush
|
|
|
|
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
|
2015-06-24 23:57:42 +00:00
|
|
|
extern pmd_t pmdp_collapse_flush(struct vm_area_struct *vma,
|
|
|
|
unsigned long address, pmd_t *pmdp);
|
2015-06-24 23:57:39 +00:00
|
|
|
#else
|
|
|
|
static inline pmd_t pmdp_collapse_flush(struct vm_area_struct *vma,
|
|
|
|
unsigned long address,
|
|
|
|
pmd_t *pmdp)
|
|
|
|
{
|
|
|
|
BUILD_BUG();
|
|
|
|
return *pmdp;
|
|
|
|
}
|
|
|
|
#define pmdp_collapse_flush pmdp_collapse_flush
|
|
|
|
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
|
|
|
|
#endif
|
|
|
|
|
2012-10-08 23:30:07 +00:00
|
|
|
#ifndef __HAVE_ARCH_PGTABLE_DEPOSIT
|
2013-06-06 00:14:02 +00:00
|
|
|
extern void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
|
|
|
|
pgtable_t pgtable);
|
2012-10-08 23:30:07 +00:00
|
|
|
#endif
|
|
|
|
|
|
|
|
#ifndef __HAVE_ARCH_PGTABLE_WITHDRAW
|
2013-06-06 00:14:02 +00:00
|
|
|
extern pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp);
|
2012-10-08 23:30:07 +00:00
|
|
|
#endif
|
|
|
|
|
mm/pgtable: delete pmd_trans_unstable() and friends
Delete pmd_trans_unstable, pmd_none_or_trans_huge_or_clear_bad() and
pmd_devmap_trans_unstable(), all now unused.
With mixed feelings, delete all the comments on pmd_trans_unstable().
That was very good documentation of a subtle state, and this series does
not even eliminate that state: but rather, normalizes and extends it,
asking pte_offset_map[_lock]() callers to anticipate failure, without
regard for whether mmap_read_lock() or mmap_write_lock() is held.
Retain pud_trans_unstable(), which has one use in __handle_mm_fault(), but
delete its equivalent pud_none_or_trans_huge_or_dev_or_clear_bad(). While
there, move the default arch_needs_pgtable_deposit() definition up near
where pgtable_trans_huge_deposit() and withdraw() are declared.
Link: https://lkml.kernel.org/r/5abdab3-3136-b42e-274d-9c6281bfb79@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Song Liu <song@kernel.org>
Cc: Steven Price <steven.price@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Zack Rusin <zackr@vmware.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 01:50:37 +00:00
|
|
|
#ifndef arch_needs_pgtable_deposit
|
|
|
|
#define arch_needs_pgtable_deposit() (false)
|
|
|
|
#endif
|
|
|
|
|
2018-02-01 00:17:43 +00:00
|
|
|
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
|
|
|
|
/*
|
|
|
|
* This is an implementation of pmdp_establish() that is only suitable for an
|
|
|
|
* architecture that doesn't have hardware dirty/accessed bits. In this case we
|
2021-05-07 01:06:24 +00:00
|
|
|
* can't race with CPU which sets these bits and non-atomic approach is fine.
|
2018-02-01 00:17:43 +00:00
|
|
|
*/
|
|
|
|
static inline pmd_t generic_pmdp_establish(struct vm_area_struct *vma,
|
|
|
|
unsigned long address, pmd_t *pmdp, pmd_t pmd)
|
|
|
|
{
|
|
|
|
pmd_t old_pmd = *pmdp;
|
|
|
|
set_pmd_at(vma->vm_mm, address, pmdp, pmd);
|
|
|
|
return old_pmd;
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
2012-10-08 23:30:09 +00:00
|
|
|
#ifndef __HAVE_ARCH_PMDP_INVALIDATE
|
2018-02-01 00:18:16 +00:00
|
|
|
extern pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
|
2012-10-08 23:30:09 +00:00
|
|
|
pmd_t *pmdp);
|
|
|
|
#endif
|
|
|
|
|
2022-05-10 01:20:50 +00:00
|
|
|
#ifndef __HAVE_ARCH_PMDP_INVALIDATE_AD
|
|
|
|
|
|
|
|
/*
|
|
|
|
* pmdp_invalidate_ad() invalidates the PMD while changing a transparent
|
|
|
|
* hugepage mapping in the page tables. This function is similar to
|
|
|
|
* pmdp_invalidate(), but should only be used if the access and dirty bits would
|
|
|
|
* not be cleared by the software in the new PMD value. The function ensures
|
|
|
|
* that hardware changes of the access and dirty bits updates would not be lost.
|
|
|
|
*
|
|
|
|
* Doing so can allow in certain architectures to avoid a TLB flush in most
|
|
|
|
* cases. Yet, another TLB flush might be necessary later if the PMD update
|
|
|
|
* itself requires such flush (e.g., if protection was set to be stricter). Yet,
|
|
|
|
* even when a TLB flush is needed because of the update, the caller may be able
|
|
|
|
* to batch these TLB flushing operations, so fewer TLB flush operations are
|
|
|
|
* needed.
|
|
|
|
*/
|
|
|
|
extern pmd_t pmdp_invalidate_ad(struct vm_area_struct *vma,
|
|
|
|
unsigned long address, pmd_t *pmdp);
|
|
|
|
#endif
|
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
#ifndef __HAVE_ARCH_PTE_SAME
|
2011-01-13 23:46:40 +00:00
|
|
|
static inline int pte_same(pte_t pte_a, pte_t pte_b)
|
|
|
|
{
|
|
|
|
return pte_val(pte_a) == pte_val(pte_b);
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
2013-04-17 11:59:32 +00:00
|
|
|
#ifndef __HAVE_ARCH_PTE_UNUSED
|
|
|
|
/*
|
|
|
|
* Some architectures provide facilities to virtualization guests
|
|
|
|
* so that they can flag allocated pages as unused. This allows the
|
|
|
|
* host to transparently reclaim unused pages. This function returns
|
|
|
|
* whether the pte's page is unused.
|
|
|
|
*/
|
|
|
|
static inline int pte_unused(pte_t pte)
|
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
2017-03-16 15:26:50 +00:00
|
|
|
#ifndef pte_access_permitted
|
|
|
|
#define pte_access_permitted(pte, write) \
|
|
|
|
(pte_present(pte) && (!(write) || pte_write(pte)))
|
|
|
|
#endif
|
|
|
|
|
|
|
|
#ifndef pmd_access_permitted
|
|
|
|
#define pmd_access_permitted(pmd, write) \
|
|
|
|
(pmd_present(pmd) && (!(write) || pmd_write(pmd)))
|
|
|
|
#endif
|
|
|
|
|
|
|
|
#ifndef pud_access_permitted
|
|
|
|
#define pud_access_permitted(pud, write) \
|
|
|
|
(pud_present(pud) && (!(write) || pud_write(pud)))
|
|
|
|
#endif
|
|
|
|
|
|
|
|
#ifndef p4d_access_permitted
|
|
|
|
#define p4d_access_permitted(p4d, write) \
|
|
|
|
(p4d_present(p4d) && (!(write) || p4d_write(p4d)))
|
|
|
|
#endif
|
|
|
|
|
|
|
|
#ifndef pgd_access_permitted
|
|
|
|
#define pgd_access_permitted(pgd, write) \
|
|
|
|
(pgd_present(pgd) && (!(write) || pgd_write(pgd)))
|
|
|
|
#endif
|
|
|
|
|
2011-01-13 23:46:40 +00:00
|
|
|
#ifndef __HAVE_ARCH_PMD_SAME
|
|
|
|
static inline int pmd_same(pmd_t pmd_a, pmd_t pmd_b)
|
|
|
|
{
|
|
|
|
return pmd_val(pmd_a) == pmd_val(pmd_b);
|
|
|
|
}
|
2023-07-24 19:07:51 +00:00
|
|
|
#endif
|
2017-02-24 22:57:02 +00:00
|
|
|
|
2023-07-24 19:07:51 +00:00
|
|
|
#ifndef pud_same
|
2017-02-24 22:57:02 +00:00
|
|
|
static inline int pud_same(pud_t pud_a, pud_t pud_b)
|
|
|
|
{
|
|
|
|
return pud_val(pud_a) == pud_val(pud_b);
|
|
|
|
}
|
2023-07-24 19:07:51 +00:00
|
|
|
#define pud_same pud_same
|
2005-04-16 22:20:36 +00:00
|
|
|
#endif
|
|
|
|
|
2018-12-04 21:37:11 +00:00
|
|
|
#ifndef __HAVE_ARCH_P4D_SAME
|
|
|
|
static inline int p4d_same(p4d_t p4d_a, p4d_t p4d_b)
|
|
|
|
{
|
|
|
|
return p4d_val(p4d_a) == p4d_val(p4d_b);
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
|
|
|
#ifndef __HAVE_ARCH_PGD_SAME
|
|
|
|
static inline int pgd_same(pgd_t pgd_a, pgd_t pgd_b)
|
|
|
|
{
|
|
|
|
return pgd_val(pgd_a) == pgd_val(pgd_b);
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
2018-02-21 17:15:44 +00:00
|
|
|
#ifndef __HAVE_ARCH_DO_SWAP_PAGE
|
2024-05-29 08:28:22 +00:00
|
|
|
static inline void arch_do_swap_page_nr(struct mm_struct *mm,
|
|
|
|
struct vm_area_struct *vma,
|
|
|
|
unsigned long addr,
|
|
|
|
pte_t pte, pte_t oldpte,
|
|
|
|
int nr)
|
|
|
|
{
|
|
|
|
|
|
|
|
}
|
|
|
|
#else
|
2018-02-21 17:15:44 +00:00
|
|
|
/*
|
|
|
|
* Some architectures support metadata associated with a page. When a
|
|
|
|
* page is being swapped out, this metadata must be saved so it can be
|
|
|
|
* restored when the page is swapped back in. SPARC M7 and newer
|
|
|
|
* processors support an ADI (Application Data Integrity) tag for the
|
|
|
|
* page as metadata for the page. arch_do_swap_page() can restore this
|
|
|
|
* metadata when a page is swapped back in.
|
|
|
|
*/
|
2024-05-29 08:28:22 +00:00
|
|
|
static inline void arch_do_swap_page_nr(struct mm_struct *mm,
|
|
|
|
struct vm_area_struct *vma,
|
|
|
|
unsigned long addr,
|
|
|
|
pte_t pte, pte_t oldpte,
|
|
|
|
int nr)
|
|
|
|
{
|
|
|
|
for (int i = 0; i < nr; i++) {
|
|
|
|
arch_do_swap_page(vma->vm_mm, vma, addr + i * PAGE_SIZE,
|
|
|
|
pte_advance_pfn(pte, i),
|
|
|
|
pte_advance_pfn(oldpte, i));
|
|
|
|
}
|
2018-02-21 17:15:44 +00:00
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
|
|
|
#ifndef __HAVE_ARCH_UNMAP_ONE
|
|
|
|
/*
|
|
|
|
* Some architectures support metadata associated with a page. When a
|
|
|
|
* page is being swapped out, this metadata must be saved so it can be
|
|
|
|
* restored when the page is swapped back in. SPARC M7 and newer
|
|
|
|
* processors support an ADI (Application Data Integrity) tag for the
|
|
|
|
* page as metadata for the page. arch_unmap_one() can save this
|
|
|
|
* metadata on a swap-out of a page.
|
|
|
|
*/
|
|
|
|
static inline int arch_unmap_one(struct mm_struct *mm,
|
|
|
|
struct vm_area_struct *vma,
|
|
|
|
unsigned long addr,
|
|
|
|
pte_t orig_pte)
|
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
2020-05-13 15:37:49 +00:00
|
|
|
/*
|
|
|
|
* Allow architectures to preserve additional metadata associated with
|
|
|
|
* swapped-out pages. The corresponding __HAVE_ARCH_SWAP_* macros and function
|
|
|
|
* prototypes must be defined in the arch-specific asm/pgtable.h file.
|
|
|
|
*/
|
|
|
|
#ifndef __HAVE_ARCH_PREPARE_TO_SWAP
|
arm64: mm: swap: support THP_SWAP on hardware with MTE
Commit d0637c505f8a1 ("arm64: enable THP_SWAP for arm64") brings up
THP_SWAP on ARM64, but it doesn't enable THP_SWP on hardware with MTE as
the MTE code works with the assumption tags save/restore is always
handling a folio with only one page.
The limitation should be removed as more and more ARM64 SoCs have this
feature. Co-existence of MTE and THP_SWAP becomes more and more
important.
This patch makes MTE tags saving support large folios, then we don't need
to split large folios into base pages for swapping out on ARM64 SoCs with
MTE any more.
arch_prepare_to_swap() should take folio rather than page as parameter
because we support THP swap-out as a whole. It saves tags for all pages
in a large folio.
As now we are restoring tags based-on folio, in arch_swap_restore(), we
may increase some extra loops and early-exitings while refaulting a large
folio which is still in swapcache in do_swap_page(). In case a large
folio has nr pages, do_swap_page() will only set the PTE of the particular
page which is causing the page fault. Thus do_swap_page() runs nr times,
and each time, arch_swap_restore() will loop nr times for those subpages
in the folio. So right now the algorithmic complexity becomes O(nr^2).
Once we support mapping large folios in do_swap_page(), extra loops and
early-exitings will decrease while not being completely removed as a large
folio might get partially tagged in corner cases such as, 1. a large
folio in swapcache can be partially unmapped, thus, MTE tags for the
unmapped pages will be invalidated; 2. users might use mprotect() to set
MTEs on a part of a large folio.
arch_thp_swp_supported() is dropped since ARM64 MTE was the only one who
needed it.
Link: https://lkml.kernel.org/r/20240322114136.61386-2-21cnbao@gmail.com
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
Reviewed-by: Steven Price <steven.price@arm.com>
Acked-by: Chris Li <chrisl@kernel.org>
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Kemeng Shi <shikemeng@huaweicloud.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Peter Collingbourne <pcc@google.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: "Mike Rapoport (IBM)" <rppt@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-22 11:41:36 +00:00
|
|
|
static inline int arch_prepare_to_swap(struct folio *folio)
|
2020-05-13 15:37:49 +00:00
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
|
|
|
#ifndef __HAVE_ARCH_SWAP_INVALIDATE
|
|
|
|
static inline void arch_swap_invalidate_page(int type, pgoff_t offset)
|
|
|
|
{
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline void arch_swap_invalidate_area(int type)
|
|
|
|
{
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
|
|
|
#ifndef __HAVE_ARCH_SWAP_RESTORE
|
2022-05-13 03:23:05 +00:00
|
|
|
static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
|
2020-05-13 15:37:49 +00:00
|
|
|
{
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
#ifndef __HAVE_ARCH_PGD_OFFSET_GATE
|
|
|
|
#define pgd_offset_gate(mm, addr) pgd_offset(mm, addr)
|
|
|
|
#endif
|
|
|
|
|
2006-06-02 00:47:25 +00:00
|
|
|
#ifndef __HAVE_ARCH_MOVE_PTE
|
2024-03-27 14:33:01 +00:00
|
|
|
#define move_pte(pte, old_addr, new_addr) (pte)
|
2005-09-28 04:45:18 +00:00
|
|
|
#endif
|
|
|
|
|
2012-10-09 13:31:12 +00:00
|
|
|
#ifndef pte_accessible
|
mm: fix TLB flush race between migration, and change_protection_range
There are a few subtle races, between change_protection_range (used by
mprotect and change_prot_numa) on one side, and NUMA page migration and
compaction on the other side.
The basic race is that there is a time window between when the PTE gets
made non-present (PROT_NONE or NUMA), and the TLB is flushed.
During that time, a CPU may continue writing to the page.
This is fine most of the time, however compaction or the NUMA migration
code may come in, and migrate the page away.
When that happens, the CPU may continue writing, through the cached
translation, to what is no longer the current memory location of the
process.
This only affects x86, which has a somewhat optimistic pte_accessible.
All other architectures appear to be safe, and will either always flush,
or flush whenever there is a valid mapping, even with no permissions
(SPARC).
The basic race looks like this:
CPU A CPU B CPU C
load TLB entry
make entry PTE/PMD_NUMA
fault on entry
read/write old page
start migrating page
change PTE/PMD to new page
read/write old page [*]
flush TLB
reload TLB from new entry
read/write new page
lose data
[*] the old page may belong to a new user at this point!
The obvious fix is to flush remote TLB entries, by making sure that
pte_accessible aware of the fact that PROT_NONE and PROT_NUMA memory may
still be accessible if there is a TLB flush pending for the mm.
This should fix both NUMA migration and compaction.
[mgorman@suse.de: fix build]
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Alex Thorlton <athorlton@sgi.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-12-19 01:08:44 +00:00
|
|
|
# define pte_accessible(mm, pte) ((void)(pte), 1)
|
2012-10-09 13:31:12 +00:00
|
|
|
#endif
|
|
|
|
|
x86, mm: Avoid unnecessary TLB flush
In x86, access and dirty bits are set automatically by CPU when CPU accesses
memory. When we go into the code path of below flush_tlb_fix_spurious_fault(),
we already set dirty bit for pte and don't need flush tlb. This might mean
tlb entry in some CPUs hasn't dirty bit set, but this doesn't matter. When
the CPUs do page write, they will automatically check the bit and no software
involved.
On the other hand, flush tlb in below position is harmful. Test creates CPU
number of threads, each thread writes to a same but random address in same vma
range and we measure the total time. Under a 4 socket system, original time is
1.96s, while with the patch, the time is 0.8s. Under a 2 socket system, there is
20% time cut too. perf shows a lot of time are taking to send ipi/handle ipi for
tlb flush.
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
LKML-Reference: <20100816011655.GA362@sli10-desk.sh.intel.com>
Acked-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Andrea Archangeli <aarcange@redhat.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2010-08-16 01:16:55 +00:00
|
|
|
#ifndef flush_tlb_fix_spurious_fault
|
2023-03-06 16:15:48 +00:00
|
|
|
#define flush_tlb_fix_spurious_fault(vma, address, ptep) flush_tlb_page(vma, address)
|
x86, mm: Avoid unnecessary TLB flush
In x86, access and dirty bits are set automatically by CPU when CPU accesses
memory. When we go into the code path of below flush_tlb_fix_spurious_fault(),
we already set dirty bit for pte and don't need flush tlb. This might mean
tlb entry in some CPUs hasn't dirty bit set, but this doesn't matter. When
the CPUs do page write, they will automatically check the bit and no software
involved.
On the other hand, flush tlb in below position is harmful. Test creates CPU
number of threads, each thread writes to a same but random address in same vma
range and we measure the total time. Under a 4 socket system, original time is
1.96s, while with the patch, the time is 0.8s. Under a 2 socket system, there is
20% time cut too. perf shows a lot of time are taking to send ipi/handle ipi for
tlb flush.
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
LKML-Reference: <20100816011655.GA362@sli10-desk.sh.intel.com>
Acked-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Andrea Archangeli <aarcange@redhat.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2010-08-16 01:16:55 +00:00
|
|
|
#endif
|
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
/*
|
2005-04-19 20:29:17 +00:00
|
|
|
* When walking page tables, get the address of the next boundary,
|
|
|
|
* or the end address of the range if that comes earlier. Although no
|
|
|
|
* vma end wraps to 0, rounded up __boundary may wrap to 0 throughout.
|
2005-04-16 22:20:36 +00:00
|
|
|
*/
|
|
|
|
|
|
|
|
#define pgd_addr_end(addr, end) \
|
|
|
|
({ unsigned long __boundary = ((addr) + PGDIR_SIZE) & PGDIR_MASK; \
|
|
|
|
(__boundary - 1 < (end) - 1)? __boundary: (end); \
|
|
|
|
})
|
|
|
|
|
2017-03-09 14:24:07 +00:00
|
|
|
#ifndef p4d_addr_end
|
|
|
|
#define p4d_addr_end(addr, end) \
|
|
|
|
({ unsigned long __boundary = ((addr) + P4D_SIZE) & P4D_MASK; \
|
|
|
|
(__boundary - 1 < (end) - 1)? __boundary: (end); \
|
|
|
|
})
|
|
|
|
#endif
|
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
#ifndef pud_addr_end
|
|
|
|
#define pud_addr_end(addr, end) \
|
|
|
|
({ unsigned long __boundary = ((addr) + PUD_SIZE) & PUD_MASK; \
|
|
|
|
(__boundary - 1 < (end) - 1)? __boundary: (end); \
|
|
|
|
})
|
|
|
|
#endif
|
|
|
|
|
|
|
|
#ifndef pmd_addr_end
|
|
|
|
#define pmd_addr_end(addr, end) \
|
|
|
|
({ unsigned long __boundary = ((addr) + PMD_SIZE) & PMD_MASK; \
|
|
|
|
(__boundary - 1 < (end) - 1)? __boundary: (end); \
|
|
|
|
})
|
|
|
|
#endif
|
|
|
|
|
|
|
|
/*
|
|
|
|
* When walking page tables, we usually want to skip any p?d_none entries;
|
|
|
|
* and any p?d_bad entries - reporting the error before resetting to none.
|
|
|
|
* Do the tests inline, but report and clear the bad entry in mm/memory.c.
|
|
|
|
*/
|
|
|
|
void pgd_clear_bad(pgd_t *);
|
2019-12-01 01:51:20 +00:00
|
|
|
|
|
|
|
#ifndef __PAGETABLE_P4D_FOLDED
|
2017-03-09 14:24:07 +00:00
|
|
|
void p4d_clear_bad(p4d_t *);
|
2019-12-01 01:51:20 +00:00
|
|
|
#else
|
|
|
|
#define p4d_clear_bad(p4d) do { } while (0)
|
|
|
|
#endif
|
|
|
|
|
|
|
|
#ifndef __PAGETABLE_PUD_FOLDED
|
2005-04-16 22:20:36 +00:00
|
|
|
void pud_clear_bad(pud_t *);
|
2019-12-01 01:51:20 +00:00
|
|
|
#else
|
|
|
|
#define pud_clear_bad(p4d) do { } while (0)
|
|
|
|
#endif
|
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
void pmd_clear_bad(pmd_t *);
|
|
|
|
|
|
|
|
static inline int pgd_none_or_clear_bad(pgd_t *pgd)
|
|
|
|
{
|
|
|
|
if (pgd_none(*pgd))
|
|
|
|
return 1;
|
|
|
|
if (unlikely(pgd_bad(*pgd))) {
|
|
|
|
pgd_clear_bad(pgd);
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2017-03-09 14:24:07 +00:00
|
|
|
static inline int p4d_none_or_clear_bad(p4d_t *p4d)
|
|
|
|
{
|
|
|
|
if (p4d_none(*p4d))
|
|
|
|
return 1;
|
|
|
|
if (unlikely(p4d_bad(*p4d))) {
|
|
|
|
p4d_clear_bad(p4d);
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
static inline int pud_none_or_clear_bad(pud_t *pud)
|
|
|
|
{
|
|
|
|
if (pud_none(*pud))
|
|
|
|
return 1;
|
|
|
|
if (unlikely(pud_bad(*pud))) {
|
|
|
|
pud_clear_bad(pud);
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline int pmd_none_or_clear_bad(pmd_t *pmd)
|
|
|
|
{
|
|
|
|
if (pmd_none(*pmd))
|
|
|
|
return 1;
|
|
|
|
if (unlikely(pmd_bad(*pmd))) {
|
|
|
|
pmd_clear_bad(pmd);
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
return 0;
|
|
|
|
}
|
2007-08-10 20:01:20 +00:00
|
|
|
|
2019-03-05 23:46:26 +00:00
|
|
|
static inline pte_t __ptep_modify_prot_start(struct vm_area_struct *vma,
|
mm: add a ptep_modify_prot transaction abstraction
This patch adds an API for doing read-modify-write updates to a pte's
protection bits which may race against hardware updates to the pte.
After reading the pte, the hardware may asynchonously set the accessed
or dirty bits on a pte, which would be lost when writing back the
modified pte value.
The existing technique to handle this race is to use
ptep_get_and_clear() atomically fetch the old pte value and clear it
in memory. This has the effect of marking the pte as non-present,
which will prevent the hardware from updating its state. When the new
value is written back, the pte will be present again, and the hardware
can resume updating the access/dirty flags.
When running in a virtualized environment, pagetable updates are
relatively expensive, since they generally involve some trap into the
hypervisor. To mitigate the cost of these updates, we tend to batch
them.
However, because of the atomic nature of ptep_get_and_clear(), it is
inherently non-batchable. This new interface allows batching by
giving the underlying implementation enough information to open a
transaction between the read and write phases:
ptep_modify_prot_start() returns the current pte value, and puts the
pte entry into a state where either the hardware will not update the
pte, or if it does, the updates will be preserved on commit.
ptep_modify_prot_commit() writes back the updated pte, makes sure that
any hardware updates made since ptep_modify_prot_start() are
preserved.
ptep_modify_prot_start() and _commit() must be exactly paired, and
used while holding the appropriate pte lock. They do not protect
against other software updates of the pte in any way.
The current implementations of ptep_modify_prot_start and _commit are
functionally unchanged from before: _start() uses ptep_get_and_clear()
fetch the pte and zero the entry, preventing any hardware updates.
_commit() simply writes the new pte value back knowing that the
hardware has not updated the pte in the meantime.
The only current user of this interface is mprotect
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-16 11:30:00 +00:00
|
|
|
unsigned long addr,
|
|
|
|
pte_t *ptep)
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* Get the current pte state, but zero it out to make it
|
|
|
|
* non-present, preventing the hardware from asynchronously
|
|
|
|
* updating it.
|
|
|
|
*/
|
2019-03-05 23:46:26 +00:00
|
|
|
return ptep_get_and_clear(vma->vm_mm, addr, ptep);
|
mm: add a ptep_modify_prot transaction abstraction
This patch adds an API for doing read-modify-write updates to a pte's
protection bits which may race against hardware updates to the pte.
After reading the pte, the hardware may asynchonously set the accessed
or dirty bits on a pte, which would be lost when writing back the
modified pte value.
The existing technique to handle this race is to use
ptep_get_and_clear() atomically fetch the old pte value and clear it
in memory. This has the effect of marking the pte as non-present,
which will prevent the hardware from updating its state. When the new
value is written back, the pte will be present again, and the hardware
can resume updating the access/dirty flags.
When running in a virtualized environment, pagetable updates are
relatively expensive, since they generally involve some trap into the
hypervisor. To mitigate the cost of these updates, we tend to batch
them.
However, because of the atomic nature of ptep_get_and_clear(), it is
inherently non-batchable. This new interface allows batching by
giving the underlying implementation enough information to open a
transaction between the read and write phases:
ptep_modify_prot_start() returns the current pte value, and puts the
pte entry into a state where either the hardware will not update the
pte, or if it does, the updates will be preserved on commit.
ptep_modify_prot_commit() writes back the updated pte, makes sure that
any hardware updates made since ptep_modify_prot_start() are
preserved.
ptep_modify_prot_start() and _commit() must be exactly paired, and
used while holding the appropriate pte lock. They do not protect
against other software updates of the pte in any way.
The current implementations of ptep_modify_prot_start and _commit are
functionally unchanged from before: _start() uses ptep_get_and_clear()
fetch the pte and zero the entry, preventing any hardware updates.
_commit() simply writes the new pte value back knowing that the
hardware has not updated the pte in the meantime.
The only current user of this interface is mprotect
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-16 11:30:00 +00:00
|
|
|
}
|
|
|
|
|
2019-03-05 23:46:26 +00:00
|
|
|
static inline void __ptep_modify_prot_commit(struct vm_area_struct *vma,
|
mm: add a ptep_modify_prot transaction abstraction
This patch adds an API for doing read-modify-write updates to a pte's
protection bits which may race against hardware updates to the pte.
After reading the pte, the hardware may asynchonously set the accessed
or dirty bits on a pte, which would be lost when writing back the
modified pte value.
The existing technique to handle this race is to use
ptep_get_and_clear() atomically fetch the old pte value and clear it
in memory. This has the effect of marking the pte as non-present,
which will prevent the hardware from updating its state. When the new
value is written back, the pte will be present again, and the hardware
can resume updating the access/dirty flags.
When running in a virtualized environment, pagetable updates are
relatively expensive, since they generally involve some trap into the
hypervisor. To mitigate the cost of these updates, we tend to batch
them.
However, because of the atomic nature of ptep_get_and_clear(), it is
inherently non-batchable. This new interface allows batching by
giving the underlying implementation enough information to open a
transaction between the read and write phases:
ptep_modify_prot_start() returns the current pte value, and puts the
pte entry into a state where either the hardware will not update the
pte, or if it does, the updates will be preserved on commit.
ptep_modify_prot_commit() writes back the updated pte, makes sure that
any hardware updates made since ptep_modify_prot_start() are
preserved.
ptep_modify_prot_start() and _commit() must be exactly paired, and
used while holding the appropriate pte lock. They do not protect
against other software updates of the pte in any way.
The current implementations of ptep_modify_prot_start and _commit are
functionally unchanged from before: _start() uses ptep_get_and_clear()
fetch the pte and zero the entry, preventing any hardware updates.
_commit() simply writes the new pte value back knowing that the
hardware has not updated the pte in the meantime.
The only current user of this interface is mprotect
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-16 11:30:00 +00:00
|
|
|
unsigned long addr,
|
|
|
|
pte_t *ptep, pte_t pte)
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* The pte is non-present, so there's no hardware state to
|
|
|
|
* preserve.
|
|
|
|
*/
|
2019-03-05 23:46:26 +00:00
|
|
|
set_pte_at(vma->vm_mm, addr, ptep, pte);
|
mm: add a ptep_modify_prot transaction abstraction
This patch adds an API for doing read-modify-write updates to a pte's
protection bits which may race against hardware updates to the pte.
After reading the pte, the hardware may asynchonously set the accessed
or dirty bits on a pte, which would be lost when writing back the
modified pte value.
The existing technique to handle this race is to use
ptep_get_and_clear() atomically fetch the old pte value and clear it
in memory. This has the effect of marking the pte as non-present,
which will prevent the hardware from updating its state. When the new
value is written back, the pte will be present again, and the hardware
can resume updating the access/dirty flags.
When running in a virtualized environment, pagetable updates are
relatively expensive, since they generally involve some trap into the
hypervisor. To mitigate the cost of these updates, we tend to batch
them.
However, because of the atomic nature of ptep_get_and_clear(), it is
inherently non-batchable. This new interface allows batching by
giving the underlying implementation enough information to open a
transaction between the read and write phases:
ptep_modify_prot_start() returns the current pte value, and puts the
pte entry into a state where either the hardware will not update the
pte, or if it does, the updates will be preserved on commit.
ptep_modify_prot_commit() writes back the updated pte, makes sure that
any hardware updates made since ptep_modify_prot_start() are
preserved.
ptep_modify_prot_start() and _commit() must be exactly paired, and
used while holding the appropriate pte lock. They do not protect
against other software updates of the pte in any way.
The current implementations of ptep_modify_prot_start and _commit are
functionally unchanged from before: _start() uses ptep_get_and_clear()
fetch the pte and zero the entry, preventing any hardware updates.
_commit() simply writes the new pte value back knowing that the
hardware has not updated the pte in the meantime.
The only current user of this interface is mprotect
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-16 11:30:00 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
#ifndef __HAVE_ARCH_PTEP_MODIFY_PROT_TRANSACTION
|
|
|
|
/*
|
|
|
|
* Start a pte protection read-modify-write transaction, which
|
|
|
|
* protects against asynchronous hardware modifications to the pte.
|
|
|
|
* The intention is not to prevent the hardware from making pte
|
|
|
|
* updates, but to prevent any updates it may make from being lost.
|
|
|
|
*
|
|
|
|
* This does not protect against other software modifications of the
|
2021-05-07 01:06:24 +00:00
|
|
|
* pte; the appropriate pte lock must be held over the transaction.
|
mm: add a ptep_modify_prot transaction abstraction
This patch adds an API for doing read-modify-write updates to a pte's
protection bits which may race against hardware updates to the pte.
After reading the pte, the hardware may asynchonously set the accessed
or dirty bits on a pte, which would be lost when writing back the
modified pte value.
The existing technique to handle this race is to use
ptep_get_and_clear() atomically fetch the old pte value and clear it
in memory. This has the effect of marking the pte as non-present,
which will prevent the hardware from updating its state. When the new
value is written back, the pte will be present again, and the hardware
can resume updating the access/dirty flags.
When running in a virtualized environment, pagetable updates are
relatively expensive, since they generally involve some trap into the
hypervisor. To mitigate the cost of these updates, we tend to batch
them.
However, because of the atomic nature of ptep_get_and_clear(), it is
inherently non-batchable. This new interface allows batching by
giving the underlying implementation enough information to open a
transaction between the read and write phases:
ptep_modify_prot_start() returns the current pte value, and puts the
pte entry into a state where either the hardware will not update the
pte, or if it does, the updates will be preserved on commit.
ptep_modify_prot_commit() writes back the updated pte, makes sure that
any hardware updates made since ptep_modify_prot_start() are
preserved.
ptep_modify_prot_start() and _commit() must be exactly paired, and
used while holding the appropriate pte lock. They do not protect
against other software updates of the pte in any way.
The current implementations of ptep_modify_prot_start and _commit are
functionally unchanged from before: _start() uses ptep_get_and_clear()
fetch the pte and zero the entry, preventing any hardware updates.
_commit() simply writes the new pte value back knowing that the
hardware has not updated the pte in the meantime.
The only current user of this interface is mprotect
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-16 11:30:00 +00:00
|
|
|
*
|
|
|
|
* Note that this interface is intended to be batchable, meaning that
|
|
|
|
* ptep_modify_prot_commit may not actually update the pte, but merely
|
|
|
|
* queue the update to be done at some later time. The update must be
|
|
|
|
* actually committed before the pte lock is released, however.
|
|
|
|
*/
|
2019-03-05 23:46:26 +00:00
|
|
|
static inline pte_t ptep_modify_prot_start(struct vm_area_struct *vma,
|
mm: add a ptep_modify_prot transaction abstraction
This patch adds an API for doing read-modify-write updates to a pte's
protection bits which may race against hardware updates to the pte.
After reading the pte, the hardware may asynchonously set the accessed
or dirty bits on a pte, which would be lost when writing back the
modified pte value.
The existing technique to handle this race is to use
ptep_get_and_clear() atomically fetch the old pte value and clear it
in memory. This has the effect of marking the pte as non-present,
which will prevent the hardware from updating its state. When the new
value is written back, the pte will be present again, and the hardware
can resume updating the access/dirty flags.
When running in a virtualized environment, pagetable updates are
relatively expensive, since they generally involve some trap into the
hypervisor. To mitigate the cost of these updates, we tend to batch
them.
However, because of the atomic nature of ptep_get_and_clear(), it is
inherently non-batchable. This new interface allows batching by
giving the underlying implementation enough information to open a
transaction between the read and write phases:
ptep_modify_prot_start() returns the current pte value, and puts the
pte entry into a state where either the hardware will not update the
pte, or if it does, the updates will be preserved on commit.
ptep_modify_prot_commit() writes back the updated pte, makes sure that
any hardware updates made since ptep_modify_prot_start() are
preserved.
ptep_modify_prot_start() and _commit() must be exactly paired, and
used while holding the appropriate pte lock. They do not protect
against other software updates of the pte in any way.
The current implementations of ptep_modify_prot_start and _commit are
functionally unchanged from before: _start() uses ptep_get_and_clear()
fetch the pte and zero the entry, preventing any hardware updates.
_commit() simply writes the new pte value back knowing that the
hardware has not updated the pte in the meantime.
The only current user of this interface is mprotect
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-16 11:30:00 +00:00
|
|
|
unsigned long addr,
|
|
|
|
pte_t *ptep)
|
|
|
|
{
|
2019-03-05 23:46:26 +00:00
|
|
|
return __ptep_modify_prot_start(vma, addr, ptep);
|
mm: add a ptep_modify_prot transaction abstraction
This patch adds an API for doing read-modify-write updates to a pte's
protection bits which may race against hardware updates to the pte.
After reading the pte, the hardware may asynchonously set the accessed
or dirty bits on a pte, which would be lost when writing back the
modified pte value.
The existing technique to handle this race is to use
ptep_get_and_clear() atomically fetch the old pte value and clear it
in memory. This has the effect of marking the pte as non-present,
which will prevent the hardware from updating its state. When the new
value is written back, the pte will be present again, and the hardware
can resume updating the access/dirty flags.
When running in a virtualized environment, pagetable updates are
relatively expensive, since they generally involve some trap into the
hypervisor. To mitigate the cost of these updates, we tend to batch
them.
However, because of the atomic nature of ptep_get_and_clear(), it is
inherently non-batchable. This new interface allows batching by
giving the underlying implementation enough information to open a
transaction between the read and write phases:
ptep_modify_prot_start() returns the current pte value, and puts the
pte entry into a state where either the hardware will not update the
pte, or if it does, the updates will be preserved on commit.
ptep_modify_prot_commit() writes back the updated pte, makes sure that
any hardware updates made since ptep_modify_prot_start() are
preserved.
ptep_modify_prot_start() and _commit() must be exactly paired, and
used while holding the appropriate pte lock. They do not protect
against other software updates of the pte in any way.
The current implementations of ptep_modify_prot_start and _commit are
functionally unchanged from before: _start() uses ptep_get_and_clear()
fetch the pte and zero the entry, preventing any hardware updates.
_commit() simply writes the new pte value back knowing that the
hardware has not updated the pte in the meantime.
The only current user of this interface is mprotect
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-16 11:30:00 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Commit an update to a pte, leaving any hardware-controlled bits in
|
|
|
|
* the PTE unmodified.
|
|
|
|
*/
|
2019-03-05 23:46:26 +00:00
|
|
|
static inline void ptep_modify_prot_commit(struct vm_area_struct *vma,
|
mm: add a ptep_modify_prot transaction abstraction
This patch adds an API for doing read-modify-write updates to a pte's
protection bits which may race against hardware updates to the pte.
After reading the pte, the hardware may asynchonously set the accessed
or dirty bits on a pte, which would be lost when writing back the
modified pte value.
The existing technique to handle this race is to use
ptep_get_and_clear() atomically fetch the old pte value and clear it
in memory. This has the effect of marking the pte as non-present,
which will prevent the hardware from updating its state. When the new
value is written back, the pte will be present again, and the hardware
can resume updating the access/dirty flags.
When running in a virtualized environment, pagetable updates are
relatively expensive, since they generally involve some trap into the
hypervisor. To mitigate the cost of these updates, we tend to batch
them.
However, because of the atomic nature of ptep_get_and_clear(), it is
inherently non-batchable. This new interface allows batching by
giving the underlying implementation enough information to open a
transaction between the read and write phases:
ptep_modify_prot_start() returns the current pte value, and puts the
pte entry into a state where either the hardware will not update the
pte, or if it does, the updates will be preserved on commit.
ptep_modify_prot_commit() writes back the updated pte, makes sure that
any hardware updates made since ptep_modify_prot_start() are
preserved.
ptep_modify_prot_start() and _commit() must be exactly paired, and
used while holding the appropriate pte lock. They do not protect
against other software updates of the pte in any way.
The current implementations of ptep_modify_prot_start and _commit are
functionally unchanged from before: _start() uses ptep_get_and_clear()
fetch the pte and zero the entry, preventing any hardware updates.
_commit() simply writes the new pte value back knowing that the
hardware has not updated the pte in the meantime.
The only current user of this interface is mprotect
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-16 11:30:00 +00:00
|
|
|
unsigned long addr,
|
2019-03-05 23:46:29 +00:00
|
|
|
pte_t *ptep, pte_t old_pte, pte_t pte)
|
mm: add a ptep_modify_prot transaction abstraction
This patch adds an API for doing read-modify-write updates to a pte's
protection bits which may race against hardware updates to the pte.
After reading the pte, the hardware may asynchonously set the accessed
or dirty bits on a pte, which would be lost when writing back the
modified pte value.
The existing technique to handle this race is to use
ptep_get_and_clear() atomically fetch the old pte value and clear it
in memory. This has the effect of marking the pte as non-present,
which will prevent the hardware from updating its state. When the new
value is written back, the pte will be present again, and the hardware
can resume updating the access/dirty flags.
When running in a virtualized environment, pagetable updates are
relatively expensive, since they generally involve some trap into the
hypervisor. To mitigate the cost of these updates, we tend to batch
them.
However, because of the atomic nature of ptep_get_and_clear(), it is
inherently non-batchable. This new interface allows batching by
giving the underlying implementation enough information to open a
transaction between the read and write phases:
ptep_modify_prot_start() returns the current pte value, and puts the
pte entry into a state where either the hardware will not update the
pte, or if it does, the updates will be preserved on commit.
ptep_modify_prot_commit() writes back the updated pte, makes sure that
any hardware updates made since ptep_modify_prot_start() are
preserved.
ptep_modify_prot_start() and _commit() must be exactly paired, and
used while holding the appropriate pte lock. They do not protect
against other software updates of the pte in any way.
The current implementations of ptep_modify_prot_start and _commit are
functionally unchanged from before: _start() uses ptep_get_and_clear()
fetch the pte and zero the entry, preventing any hardware updates.
_commit() simply writes the new pte value back knowing that the
hardware has not updated the pte in the meantime.
The only current user of this interface is mprotect
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-16 11:30:00 +00:00
|
|
|
{
|
2019-03-05 23:46:26 +00:00
|
|
|
__ptep_modify_prot_commit(vma, addr, ptep, pte);
|
mm: add a ptep_modify_prot transaction abstraction
This patch adds an API for doing read-modify-write updates to a pte's
protection bits which may race against hardware updates to the pte.
After reading the pte, the hardware may asynchonously set the accessed
or dirty bits on a pte, which would be lost when writing back the
modified pte value.
The existing technique to handle this race is to use
ptep_get_and_clear() atomically fetch the old pte value and clear it
in memory. This has the effect of marking the pte as non-present,
which will prevent the hardware from updating its state. When the new
value is written back, the pte will be present again, and the hardware
can resume updating the access/dirty flags.
When running in a virtualized environment, pagetable updates are
relatively expensive, since they generally involve some trap into the
hypervisor. To mitigate the cost of these updates, we tend to batch
them.
However, because of the atomic nature of ptep_get_and_clear(), it is
inherently non-batchable. This new interface allows batching by
giving the underlying implementation enough information to open a
transaction between the read and write phases:
ptep_modify_prot_start() returns the current pte value, and puts the
pte entry into a state where either the hardware will not update the
pte, or if it does, the updates will be preserved on commit.
ptep_modify_prot_commit() writes back the updated pte, makes sure that
any hardware updates made since ptep_modify_prot_start() are
preserved.
ptep_modify_prot_start() and _commit() must be exactly paired, and
used while holding the appropriate pte lock. They do not protect
against other software updates of the pte in any way.
The current implementations of ptep_modify_prot_start and _commit are
functionally unchanged from before: _start() uses ptep_get_and_clear()
fetch the pte and zero the entry, preventing any hardware updates.
_commit() simply writes the new pte value back knowing that the
hardware has not updated the pte in the meantime.
The only current user of this interface is mprotect
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-16 11:30:00 +00:00
|
|
|
}
|
|
|
|
#endif /* __HAVE_ARCH_PTEP_MODIFY_PROT_TRANSACTION */
|
2008-07-15 20:28:46 +00:00
|
|
|
#endif /* CONFIG_MMU */
|
mm: add a ptep_modify_prot transaction abstraction
This patch adds an API for doing read-modify-write updates to a pte's
protection bits which may race against hardware updates to the pte.
After reading the pte, the hardware may asynchonously set the accessed
or dirty bits on a pte, which would be lost when writing back the
modified pte value.
The existing technique to handle this race is to use
ptep_get_and_clear() atomically fetch the old pte value and clear it
in memory. This has the effect of marking the pte as non-present,
which will prevent the hardware from updating its state. When the new
value is written back, the pte will be present again, and the hardware
can resume updating the access/dirty flags.
When running in a virtualized environment, pagetable updates are
relatively expensive, since they generally involve some trap into the
hypervisor. To mitigate the cost of these updates, we tend to batch
them.
However, because of the atomic nature of ptep_get_and_clear(), it is
inherently non-batchable. This new interface allows batching by
giving the underlying implementation enough information to open a
transaction between the read and write phases:
ptep_modify_prot_start() returns the current pte value, and puts the
pte entry into a state where either the hardware will not update the
pte, or if it does, the updates will be preserved on commit.
ptep_modify_prot_commit() writes back the updated pte, makes sure that
any hardware updates made since ptep_modify_prot_start() are
preserved.
ptep_modify_prot_start() and _commit() must be exactly paired, and
used while holding the appropriate pte lock. They do not protect
against other software updates of the pte in any way.
The current implementations of ptep_modify_prot_start and _commit are
functionally unchanged from before: _start() uses ptep_get_and_clear()
fetch the pte and zero the entry, preventing any hardware updates.
_commit() simply writes the new pte value back knowing that the
hardware has not updated the pte in the meantime.
The only current user of this interface is mprotect
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-06-16 11:30:00 +00:00
|
|
|
|
2017-07-17 21:10:07 +00:00
|
|
|
/*
|
|
|
|
* No-op macros that just return the current protection value. Defined here
|
2020-08-12 01:32:27 +00:00
|
|
|
* because these macros can be used even if CONFIG_MMU is not defined.
|
2017-07-17 21:10:07 +00:00
|
|
|
*/
|
2020-07-15 05:33:39 +00:00
|
|
|
|
|
|
|
#ifndef pgprot_nx
|
|
|
|
#define pgprot_nx(prot) (prot)
|
|
|
|
#endif
|
|
|
|
|
|
|
|
#ifndef pgprot_noncached
|
|
|
|
#define pgprot_noncached(prot) (prot)
|
|
|
|
#endif
|
|
|
|
|
|
|
|
#ifndef pgprot_writecombine
|
|
|
|
#define pgprot_writecombine pgprot_noncached
|
|
|
|
#endif
|
|
|
|
|
|
|
|
#ifndef pgprot_writethrough
|
|
|
|
#define pgprot_writethrough pgprot_noncached
|
|
|
|
#endif
|
|
|
|
|
|
|
|
#ifndef pgprot_device
|
|
|
|
#define pgprot_device pgprot_noncached
|
|
|
|
#endif
|
|
|
|
|
2021-03-09 12:26:01 +00:00
|
|
|
#ifndef pgprot_mhp
|
|
|
|
#define pgprot_mhp(prot) (prot)
|
|
|
|
#endif
|
|
|
|
|
2020-07-15 05:33:39 +00:00
|
|
|
#ifdef CONFIG_MMU
|
|
|
|
#ifndef pgprot_modify
|
|
|
|
#define pgprot_modify pgprot_modify
|
|
|
|
static inline pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot)
|
|
|
|
{
|
|
|
|
if (pgprot_val(oldprot) == pgprot_val(pgprot_noncached(oldprot)))
|
|
|
|
newprot = pgprot_noncached(newprot);
|
|
|
|
if (pgprot_val(oldprot) == pgprot_val(pgprot_writecombine(oldprot)))
|
|
|
|
newprot = pgprot_writecombine(newprot);
|
|
|
|
if (pgprot_val(oldprot) == pgprot_val(pgprot_device(oldprot)))
|
|
|
|
newprot = pgprot_device(newprot);
|
|
|
|
return newprot;
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
#endif /* CONFIG_MMU */
|
|
|
|
|
2017-07-17 21:10:07 +00:00
|
|
|
#ifndef pgprot_encrypted
|
|
|
|
#define pgprot_encrypted(prot) (prot)
|
|
|
|
#endif
|
|
|
|
|
|
|
|
#ifndef pgprot_decrypted
|
|
|
|
#define pgprot_decrypted(prot) (prot)
|
|
|
|
#endif
|
|
|
|
|
2007-08-10 20:01:20 +00:00
|
|
|
/*
|
2009-02-18 07:24:03 +00:00
|
|
|
* A facility to provide batching of the reload of page tables and
|
|
|
|
* other process state with the actual context switch code for
|
|
|
|
* paravirtualized guests. By convention, only one of the batched
|
|
|
|
* update (lazy) modes (CPU, MMU) should be active at any given time,
|
|
|
|
* entry should never be nested, and entry and exits should always be
|
|
|
|
* paired. This is for sanity of maintaining and reasoning about the
|
|
|
|
* kernel code. In this case, the exit (end of the context switch) is
|
|
|
|
* in architecture-specific code, and so doesn't need a generic
|
|
|
|
* definition.
|
2007-08-10 20:01:20 +00:00
|
|
|
*/
|
2009-02-18 07:24:03 +00:00
|
|
|
#ifndef __HAVE_ARCH_START_CONTEXT_SWITCH
|
2009-02-18 19:18:57 +00:00
|
|
|
#define arch_start_context_switch(prev) do {} while (0)
|
2007-08-10 20:01:20 +00:00
|
|
|
#endif
|
|
|
|
|
2017-09-08 23:11:04 +00:00
|
|
|
#ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY
|
|
|
|
#ifndef CONFIG_ARCH_ENABLE_THP_MIGRATION
|
|
|
|
static inline pmd_t pmd_swp_mksoft_dirty(pmd_t pmd)
|
|
|
|
{
|
|
|
|
return pmd;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline int pmd_swp_soft_dirty(pmd_t pmd)
|
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline pmd_t pmd_swp_clear_soft_dirty(pmd_t pmd)
|
|
|
|
{
|
|
|
|
return pmd;
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
#else /* !CONFIG_HAVE_ARCH_SOFT_DIRTY */
|
mm: soft-dirty bits for user memory changes tracking
The soft-dirty is a bit on a PTE which helps to track which pages a task
writes to. In order to do this tracking one should
1. Clear soft-dirty bits from PTEs ("echo 4 > /proc/PID/clear_refs)
2. Wait some time.
3. Read soft-dirty bits (55'th in /proc/PID/pagemap2 entries)
To do this tracking, the writable bit is cleared from PTEs when the
soft-dirty bit is. Thus, after this, when the task tries to modify a
page at some virtual address the #PF occurs and the kernel sets the
soft-dirty bit on the respective PTE.
Note, that although all the task's address space is marked as r/o after
the soft-dirty bits clear, the #PF-s that occur after that are processed
fast. This is so, since the pages are still mapped to physical memory,
and thus all the kernel does is finds this fact out and puts back
writable, dirty and soft-dirty bits on the PTE.
Another thing to note, is that when mremap moves PTEs they are marked
with soft-dirty as well, since from the user perspective mremap modifies
the virtual memory at mremap's new address.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03 22:01:20 +00:00
|
|
|
static inline int pte_soft_dirty(pte_t pte)
|
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline int pmd_soft_dirty(pmd_t pmd)
|
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline pte_t pte_mksoft_dirty(pte_t pte)
|
|
|
|
{
|
|
|
|
return pte;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline pmd_t pmd_mksoft_dirty(pmd_t pmd)
|
|
|
|
{
|
|
|
|
return pmd;
|
|
|
|
}
|
2013-08-13 23:00:49 +00:00
|
|
|
|
2015-04-22 12:20:47 +00:00
|
|
|
static inline pte_t pte_clear_soft_dirty(pte_t pte)
|
|
|
|
{
|
|
|
|
return pte;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline pmd_t pmd_clear_soft_dirty(pmd_t pmd)
|
|
|
|
{
|
|
|
|
return pmd;
|
|
|
|
}
|
|
|
|
|
2013-08-13 23:00:49 +00:00
|
|
|
static inline pte_t pte_swp_mksoft_dirty(pte_t pte)
|
|
|
|
{
|
|
|
|
return pte;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline int pte_swp_soft_dirty(pte_t pte)
|
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline pte_t pte_swp_clear_soft_dirty(pte_t pte)
|
|
|
|
{
|
|
|
|
return pte;
|
|
|
|
}
|
2017-09-08 23:11:04 +00:00
|
|
|
|
|
|
|
static inline pmd_t pmd_swp_mksoft_dirty(pmd_t pmd)
|
|
|
|
{
|
|
|
|
return pmd;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline int pmd_swp_soft_dirty(pmd_t pmd)
|
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline pmd_t pmd_swp_clear_soft_dirty(pmd_t pmd)
|
|
|
|
{
|
|
|
|
return pmd;
|
|
|
|
}
|
mm: soft-dirty bits for user memory changes tracking
The soft-dirty is a bit on a PTE which helps to track which pages a task
writes to. In order to do this tracking one should
1. Clear soft-dirty bits from PTEs ("echo 4 > /proc/PID/clear_refs)
2. Wait some time.
3. Read soft-dirty bits (55'th in /proc/PID/pagemap2 entries)
To do this tracking, the writable bit is cleared from PTEs when the
soft-dirty bit is. Thus, after this, when the task tries to modify a
page at some virtual address the #PF occurs and the kernel sets the
soft-dirty bit on the respective PTE.
Note, that although all the task's address space is marked as r/o after
the soft-dirty bits clear, the #PF-s that occur after that are processed
fast. This is so, since the pages are still mapped to physical memory,
and thus all the kernel does is finds this fact out and puts back
writable, dirty and soft-dirty bits on the PTE.
Another thing to note, is that when mremap moves PTEs they are marked
with soft-dirty as well, since from the user perspective mremap modifies
the virtual memory at mremap's new address.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03 22:01:20 +00:00
|
|
|
#endif
|
|
|
|
|
2008-12-19 21:47:29 +00:00
|
|
|
#ifndef __HAVE_PFNMAP_TRACKING
|
|
|
|
/*
|
2012-10-08 23:28:29 +00:00
|
|
|
* Interfaces that can be used by architecture code to keep track of
|
|
|
|
* memory type of pfn mappings specified by the remap_pfn_range,
|
2018-10-26 22:04:26 +00:00
|
|
|
* vmf_insert_pfn.
|
2012-10-08 23:28:29 +00:00
|
|
|
*/
|
|
|
|
|
|
|
|
/*
|
|
|
|
* track_pfn_remap is called when a _new_ pfn mapping is being established
|
|
|
|
* by remap_pfn_range() for physical range indicated by pfn and size.
|
2008-12-19 21:47:29 +00:00
|
|
|
*/
|
2012-10-08 23:28:29 +00:00
|
|
|
static inline int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot,
|
2012-10-08 23:28:34 +00:00
|
|
|
unsigned long pfn, unsigned long addr,
|
|
|
|
unsigned long size)
|
2008-12-19 21:47:29 +00:00
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
2012-10-08 23:28:29 +00:00
|
|
|
* track_pfn_insert is called when a _new_ single pfn is established
|
2018-10-26 22:04:26 +00:00
|
|
|
* by vmf_insert_pfn().
|
2012-10-08 23:28:29 +00:00
|
|
|
*/
|
2016-10-26 17:43:43 +00:00
|
|
|
static inline void track_pfn_insert(struct vm_area_struct *vma, pgprot_t *prot,
|
|
|
|
pfn_t pfn)
|
2012-10-08 23:28:29 +00:00
|
|
|
{
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* track_pfn_copy is called when vma that is covering the pfnmap gets
|
2008-12-19 21:47:29 +00:00
|
|
|
* copied through copy_page_range().
|
|
|
|
*/
|
2012-10-08 23:28:29 +00:00
|
|
|
static inline int track_pfn_copy(struct vm_area_struct *vma)
|
2008-12-19 21:47:29 +00:00
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
2015-12-23 00:54:23 +00:00
|
|
|
* untrack_pfn is called while unmapping a pfnmap for a region.
|
2008-12-19 21:47:29 +00:00
|
|
|
* untrack can be called for a specific region indicated by pfn and size or
|
2012-10-08 23:28:29 +00:00
|
|
|
* can be for the entire vma (in which case pfn, size are zero).
|
2008-12-19 21:47:29 +00:00
|
|
|
*/
|
2012-10-08 23:28:29 +00:00
|
|
|
static inline void untrack_pfn(struct vm_area_struct *vma,
|
2023-01-26 19:37:51 +00:00
|
|
|
unsigned long pfn, unsigned long size,
|
|
|
|
bool mm_wr_locked)
|
2008-12-19 21:47:29 +00:00
|
|
|
{
|
|
|
|
}
|
2015-12-23 00:54:23 +00:00
|
|
|
|
|
|
|
/*
|
2023-02-17 02:56:15 +00:00
|
|
|
* untrack_pfn_clear is called while mremapping a pfnmap for a new region
|
|
|
|
* or fails to copy pgtable during duplicate vm area.
|
2015-12-23 00:54:23 +00:00
|
|
|
*/
|
2023-02-17 02:56:15 +00:00
|
|
|
static inline void untrack_pfn_clear(struct vm_area_struct *vma)
|
2015-12-23 00:54:23 +00:00
|
|
|
{
|
|
|
|
}
|
2008-12-19 21:47:29 +00:00
|
|
|
#else
|
2012-10-08 23:28:29 +00:00
|
|
|
extern int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot,
|
2012-10-08 23:28:34 +00:00
|
|
|
unsigned long pfn, unsigned long addr,
|
|
|
|
unsigned long size);
|
2016-10-26 17:43:43 +00:00
|
|
|
extern void track_pfn_insert(struct vm_area_struct *vma, pgprot_t *prot,
|
|
|
|
pfn_t pfn);
|
2012-10-08 23:28:29 +00:00
|
|
|
extern int track_pfn_copy(struct vm_area_struct *vma);
|
|
|
|
extern void untrack_pfn(struct vm_area_struct *vma, unsigned long pfn,
|
2023-01-26 19:37:51 +00:00
|
|
|
unsigned long size, bool mm_wr_locked);
|
2023-02-17 02:56:15 +00:00
|
|
|
extern void untrack_pfn_clear(struct vm_area_struct *vma);
|
2008-12-19 21:47:29 +00:00
|
|
|
#endif
|
|
|
|
|
2021-05-05 01:39:04 +00:00
|
|
|
#ifdef CONFIG_MMU
|
2012-12-12 21:52:36 +00:00
|
|
|
#ifdef __HAVE_COLOR_ZERO_PAGE
|
|
|
|
static inline int is_zero_pfn(unsigned long pfn)
|
|
|
|
{
|
|
|
|
extern unsigned long zero_pfn;
|
|
|
|
unsigned long offset_from_zero_pfn = pfn - zero_pfn;
|
|
|
|
return offset_from_zero_pfn <= (zero_page_mask >> PAGE_SHIFT);
|
|
|
|
}
|
|
|
|
|
2012-12-26 00:19:55 +00:00
|
|
|
#define my_zero_pfn(addr) page_to_pfn(ZERO_PAGE(addr))
|
|
|
|
|
2012-12-12 21:52:36 +00:00
|
|
|
#else
|
|
|
|
static inline int is_zero_pfn(unsigned long pfn)
|
|
|
|
{
|
|
|
|
extern unsigned long zero_pfn;
|
|
|
|
return pfn == zero_pfn;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline unsigned long my_zero_pfn(unsigned long addr)
|
|
|
|
{
|
|
|
|
extern unsigned long zero_pfn;
|
|
|
|
return zero_pfn;
|
|
|
|
}
|
|
|
|
#endif
|
2021-05-05 01:39:04 +00:00
|
|
|
#else
|
|
|
|
static inline int is_zero_pfn(unsigned long pfn)
|
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline unsigned long my_zero_pfn(unsigned long addr)
|
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
#endif /* CONFIG_MMU */
|
2012-12-12 21:52:36 +00:00
|
|
|
|
mm: thp: fix pmd_bad() triggering in code paths holding mmap_sem read mode
In some cases it may happen that pmd_none_or_clear_bad() is called with
the mmap_sem hold in read mode. In those cases the huge page faults can
allocate hugepmds under pmd_none_or_clear_bad() and that can trigger a
false positive from pmd_bad() that will not like to see a pmd
materializing as trans huge.
It's not khugepaged causing the problem, khugepaged holds the mmap_sem
in write mode (and all those sites must hold the mmap_sem in read mode
to prevent pagetables to go away from under them, during code review it
seems vm86 mode on 32bit kernels requires that too unless it's
restricted to 1 thread per process or UP builds). The race is only with
the huge pagefaults that can convert a pmd_none() into a
pmd_trans_huge().
Effectively all these pmd_none_or_clear_bad() sites running with
mmap_sem in read mode are somewhat speculative with the page faults, and
the result is always undefined when they run simultaneously. This is
probably why it wasn't common to run into this. For example if the
madvise(MADV_DONTNEED) runs zap_page_range() shortly before the page
fault, the hugepage will not be zapped, if the page fault runs first it
will be zapped.
Altering pmd_bad() not to error out if it finds hugepmds won't be enough
to fix this, because zap_pmd_range would then proceed to call
zap_pte_range (which would be incorrect if the pmd become a
pmd_trans_huge()).
The simplest way to fix this is to read the pmd in the local stack
(regardless of what we read, no need of actual CPU barriers, only
compiler barrier needed), and be sure it is not changing under the code
that computes its value. Even if the real pmd is changing under the
value we hold on the stack, we don't care. If we actually end up in
zap_pte_range it means the pmd was not none already and it was not huge,
and it can't become huge from under us (khugepaged locking explained
above).
All we need is to enforce that there is no way anymore that in a code
path like below, pmd_trans_huge can be false, but pmd_none_or_clear_bad
can run into a hugepmd. The overhead of a barrier() is just a compiler
tweak and should not be measurable (I only added it for THP builds). I
don't exclude different compiler versions may have prevented the race
too by caching the value of *pmd on the stack (that hasn't been
verified, but it wouldn't be impossible considering
pmd_none_or_clear_bad, pmd_bad, pmd_trans_huge, pmd_none are all inlines
and there's no external function called in between pmd_trans_huge and
pmd_none_or_clear_bad).
if (pmd_trans_huge(*pmd)) {
if (next-addr != HPAGE_PMD_SIZE) {
VM_BUG_ON(!rwsem_is_locked(&tlb->mm->mmap_sem));
split_huge_page_pmd(vma->vm_mm, pmd);
} else if (zap_huge_pmd(tlb, vma, pmd, addr))
continue;
/* fall through */
}
if (pmd_none_or_clear_bad(pmd))
Because this race condition could be exercised without special
privileges this was reported in CVE-2012-1179.
The race was identified and fully explained by Ulrich who debugged it.
I'm quoting his accurate explanation below, for reference.
====== start quote =======
mapcount 0 page_mapcount 1
kernel BUG at mm/huge_memory.c:1384!
At some point prior to the panic, a "bad pmd ..." message similar to the
following is logged on the console:
mm/memory.c:145: bad pmd ffff8800376e1f98(80000000314000e7).
The "bad pmd ..." message is logged by pmd_clear_bad() before it clears
the page's PMD table entry.
143 void pmd_clear_bad(pmd_t *pmd)
144 {
-> 145 pmd_ERROR(*pmd);
146 pmd_clear(pmd);
147 }
After the PMD table entry has been cleared, there is an inconsistency
between the actual number of PMD table entries that are mapping the page
and the page's map count (_mapcount field in struct page). When the page
is subsequently reclaimed, __split_huge_page() detects this inconsistency.
1381 if (mapcount != page_mapcount(page))
1382 printk(KERN_ERR "mapcount %d page_mapcount %d\n",
1383 mapcount, page_mapcount(page));
-> 1384 BUG_ON(mapcount != page_mapcount(page));
The root cause of the problem is a race of two threads in a multithreaded
process. Thread B incurs a page fault on a virtual address that has never
been accessed (PMD entry is zero) while Thread A is executing an madvise()
system call on a virtual address within the same 2 MB (huge page) range.
virtual address space
.---------------------.
| |
| |
.-|---------------------|
| | |
| | |<-- B(fault)
| | |
2 MB | |/////////////////////|-.
huge < |/////////////////////| > A(range)
page | |/////////////////////|-'
| | |
| | |
'-|---------------------|
| |
| |
'---------------------'
- Thread A is executing an madvise(..., MADV_DONTNEED) system call
on the virtual address range "A(range)" shown in the picture.
sys_madvise
// Acquire the semaphore in shared mode.
down_read(¤t->mm->mmap_sem)
...
madvise_vma
switch (behavior)
case MADV_DONTNEED:
madvise_dontneed
zap_page_range
unmap_vmas
unmap_page_range
zap_pud_range
zap_pmd_range
//
// Assume that this huge page has never been accessed.
// I.e. content of the PMD entry is zero (not mapped).
//
if (pmd_trans_huge(*pmd)) {
// We don't get here due to the above assumption.
}
//
// Assume that Thread B incurred a page fault and
.---------> // sneaks in here as shown below.
| //
| if (pmd_none_or_clear_bad(pmd))
| {
| if (unlikely(pmd_bad(*pmd)))
| pmd_clear_bad
| {
| pmd_ERROR
| // Log "bad pmd ..." message here.
| pmd_clear
| // Clear the page's PMD entry.
| // Thread B incremented the map count
| // in page_add_new_anon_rmap(), but
| // now the page is no longer mapped
| // by a PMD entry (-> inconsistency).
| }
| }
|
v
- Thread B is handling a page fault on virtual address "B(fault)" shown
in the picture.
...
do_page_fault
__do_page_fault
// Acquire the semaphore in shared mode.
down_read_trylock(&mm->mmap_sem)
...
handle_mm_fault
if (pmd_none(*pmd) && transparent_hugepage_enabled(vma))
// We get here due to the above assumption (PMD entry is zero).
do_huge_pmd_anonymous_page
alloc_hugepage_vma
// Allocate a new transparent huge page here.
...
__do_huge_pmd_anonymous_page
...
spin_lock(&mm->page_table_lock)
...
page_add_new_anon_rmap
// Here we increment the page's map count (starts at -1).
atomic_set(&page->_mapcount, 0)
set_pmd_at
// Here we set the page's PMD entry which will be cleared
// when Thread A calls pmd_clear_bad().
...
spin_unlock(&mm->page_table_lock)
The mmap_sem does not prevent the race because both threads are acquiring
it in shared mode (down_read). Thread B holds the page_table_lock while
the page's map count and PMD table entry are updated. However, Thread A
does not synchronize on that lock.
====== end quote =======
[akpm@linux-foundation.org: checkpatch fixes]
Reported-by: Ulrich Obergfell <uobergfe@redhat.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Jones <davej@redhat.com>
Acked-by: Larry Woodman <lwoodman@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: <stable@vger.kernel.org> [2.6.38+]
Cc: Mark Salter <msalter@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-21 23:33:42 +00:00
|
|
|
#ifdef CONFIG_MMU
|
|
|
|
|
2011-01-13 23:46:40 +00:00
|
|
|
#ifndef CONFIG_TRANSPARENT_HUGEPAGE
|
|
|
|
static inline int pmd_trans_huge(pmd_t pmd)
|
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
2017-11-30 00:10:10 +00:00
|
|
|
#ifndef pmd_write
|
2011-01-13 23:46:40 +00:00
|
|
|
static inline int pmd_write(pmd_t pmd)
|
|
|
|
{
|
|
|
|
BUG();
|
|
|
|
return 0;
|
|
|
|
}
|
2017-11-30 00:10:10 +00:00
|
|
|
#endif /* pmd_write */
|
mm: thp: fix pmd_bad() triggering in code paths holding mmap_sem read mode
In some cases it may happen that pmd_none_or_clear_bad() is called with
the mmap_sem hold in read mode. In those cases the huge page faults can
allocate hugepmds under pmd_none_or_clear_bad() and that can trigger a
false positive from pmd_bad() that will not like to see a pmd
materializing as trans huge.
It's not khugepaged causing the problem, khugepaged holds the mmap_sem
in write mode (and all those sites must hold the mmap_sem in read mode
to prevent pagetables to go away from under them, during code review it
seems vm86 mode on 32bit kernels requires that too unless it's
restricted to 1 thread per process or UP builds). The race is only with
the huge pagefaults that can convert a pmd_none() into a
pmd_trans_huge().
Effectively all these pmd_none_or_clear_bad() sites running with
mmap_sem in read mode are somewhat speculative with the page faults, and
the result is always undefined when they run simultaneously. This is
probably why it wasn't common to run into this. For example if the
madvise(MADV_DONTNEED) runs zap_page_range() shortly before the page
fault, the hugepage will not be zapped, if the page fault runs first it
will be zapped.
Altering pmd_bad() not to error out if it finds hugepmds won't be enough
to fix this, because zap_pmd_range would then proceed to call
zap_pte_range (which would be incorrect if the pmd become a
pmd_trans_huge()).
The simplest way to fix this is to read the pmd in the local stack
(regardless of what we read, no need of actual CPU barriers, only
compiler barrier needed), and be sure it is not changing under the code
that computes its value. Even if the real pmd is changing under the
value we hold on the stack, we don't care. If we actually end up in
zap_pte_range it means the pmd was not none already and it was not huge,
and it can't become huge from under us (khugepaged locking explained
above).
All we need is to enforce that there is no way anymore that in a code
path like below, pmd_trans_huge can be false, but pmd_none_or_clear_bad
can run into a hugepmd. The overhead of a barrier() is just a compiler
tweak and should not be measurable (I only added it for THP builds). I
don't exclude different compiler versions may have prevented the race
too by caching the value of *pmd on the stack (that hasn't been
verified, but it wouldn't be impossible considering
pmd_none_or_clear_bad, pmd_bad, pmd_trans_huge, pmd_none are all inlines
and there's no external function called in between pmd_trans_huge and
pmd_none_or_clear_bad).
if (pmd_trans_huge(*pmd)) {
if (next-addr != HPAGE_PMD_SIZE) {
VM_BUG_ON(!rwsem_is_locked(&tlb->mm->mmap_sem));
split_huge_page_pmd(vma->vm_mm, pmd);
} else if (zap_huge_pmd(tlb, vma, pmd, addr))
continue;
/* fall through */
}
if (pmd_none_or_clear_bad(pmd))
Because this race condition could be exercised without special
privileges this was reported in CVE-2012-1179.
The race was identified and fully explained by Ulrich who debugged it.
I'm quoting his accurate explanation below, for reference.
====== start quote =======
mapcount 0 page_mapcount 1
kernel BUG at mm/huge_memory.c:1384!
At some point prior to the panic, a "bad pmd ..." message similar to the
following is logged on the console:
mm/memory.c:145: bad pmd ffff8800376e1f98(80000000314000e7).
The "bad pmd ..." message is logged by pmd_clear_bad() before it clears
the page's PMD table entry.
143 void pmd_clear_bad(pmd_t *pmd)
144 {
-> 145 pmd_ERROR(*pmd);
146 pmd_clear(pmd);
147 }
After the PMD table entry has been cleared, there is an inconsistency
between the actual number of PMD table entries that are mapping the page
and the page's map count (_mapcount field in struct page). When the page
is subsequently reclaimed, __split_huge_page() detects this inconsistency.
1381 if (mapcount != page_mapcount(page))
1382 printk(KERN_ERR "mapcount %d page_mapcount %d\n",
1383 mapcount, page_mapcount(page));
-> 1384 BUG_ON(mapcount != page_mapcount(page));
The root cause of the problem is a race of two threads in a multithreaded
process. Thread B incurs a page fault on a virtual address that has never
been accessed (PMD entry is zero) while Thread A is executing an madvise()
system call on a virtual address within the same 2 MB (huge page) range.
virtual address space
.---------------------.
| |
| |
.-|---------------------|
| | |
| | |<-- B(fault)
| | |
2 MB | |/////////////////////|-.
huge < |/////////////////////| > A(range)
page | |/////////////////////|-'
| | |
| | |
'-|---------------------|
| |
| |
'---------------------'
- Thread A is executing an madvise(..., MADV_DONTNEED) system call
on the virtual address range "A(range)" shown in the picture.
sys_madvise
// Acquire the semaphore in shared mode.
down_read(¤t->mm->mmap_sem)
...
madvise_vma
switch (behavior)
case MADV_DONTNEED:
madvise_dontneed
zap_page_range
unmap_vmas
unmap_page_range
zap_pud_range
zap_pmd_range
//
// Assume that this huge page has never been accessed.
// I.e. content of the PMD entry is zero (not mapped).
//
if (pmd_trans_huge(*pmd)) {
// We don't get here due to the above assumption.
}
//
// Assume that Thread B incurred a page fault and
.---------> // sneaks in here as shown below.
| //
| if (pmd_none_or_clear_bad(pmd))
| {
| if (unlikely(pmd_bad(*pmd)))
| pmd_clear_bad
| {
| pmd_ERROR
| // Log "bad pmd ..." message here.
| pmd_clear
| // Clear the page's PMD entry.
| // Thread B incremented the map count
| // in page_add_new_anon_rmap(), but
| // now the page is no longer mapped
| // by a PMD entry (-> inconsistency).
| }
| }
|
v
- Thread B is handling a page fault on virtual address "B(fault)" shown
in the picture.
...
do_page_fault
__do_page_fault
// Acquire the semaphore in shared mode.
down_read_trylock(&mm->mmap_sem)
...
handle_mm_fault
if (pmd_none(*pmd) && transparent_hugepage_enabled(vma))
// We get here due to the above assumption (PMD entry is zero).
do_huge_pmd_anonymous_page
alloc_hugepage_vma
// Allocate a new transparent huge page here.
...
__do_huge_pmd_anonymous_page
...
spin_lock(&mm->page_table_lock)
...
page_add_new_anon_rmap
// Here we increment the page's map count (starts at -1).
atomic_set(&page->_mapcount, 0)
set_pmd_at
// Here we set the page's PMD entry which will be cleared
// when Thread A calls pmd_clear_bad().
...
spin_unlock(&mm->page_table_lock)
The mmap_sem does not prevent the race because both threads are acquiring
it in shared mode (down_read). Thread B holds the page_table_lock while
the page's map count and PMD table entry are updated. However, Thread A
does not synchronize on that lock.
====== end quote =======
[akpm@linux-foundation.org: checkpatch fixes]
Reported-by: Ulrich Obergfell <uobergfe@redhat.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Jones <davej@redhat.com>
Acked-by: Larry Woodman <lwoodman@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: <stable@vger.kernel.org> [2.6.38+]
Cc: Mark Salter <msalter@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-21 23:33:42 +00:00
|
|
|
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
|
|
|
|
|
2017-11-30 00:10:06 +00:00
|
|
|
#ifndef pud_write
|
|
|
|
static inline int pud_write(pud_t pud)
|
|
|
|
{
|
|
|
|
BUG();
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
#endif /* pud_write */
|
|
|
|
|
2019-12-01 01:51:29 +00:00
|
|
|
#if !defined(CONFIG_ARCH_HAS_PTE_DEVMAP) || !defined(CONFIG_TRANSPARENT_HUGEPAGE)
|
|
|
|
static inline int pmd_devmap(pmd_t pmd)
|
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
static inline int pud_devmap(pud_t pud)
|
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
static inline int pgd_devmap(pgd_t pgd)
|
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
2017-02-24 22:57:02 +00:00
|
|
|
#if !defined(CONFIG_TRANSPARENT_HUGEPAGE) || \
|
2022-08-29 09:51:25 +00:00
|
|
|
!defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)
|
2017-02-24 22:57:02 +00:00
|
|
|
static inline int pud_trans_huge(pud_t pud)
|
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
mm/pgtable: delete pmd_trans_unstable() and friends
Delete pmd_trans_unstable, pmd_none_or_trans_huge_or_clear_bad() and
pmd_devmap_trans_unstable(), all now unused.
With mixed feelings, delete all the comments on pmd_trans_unstable().
That was very good documentation of a subtle state, and this series does
not even eliminate that state: but rather, normalizes and extends it,
asking pte_offset_map[_lock]() callers to anticipate failure, without
regard for whether mmap_read_lock() or mmap_write_lock() is held.
Retain pud_trans_unstable(), which has one use in __handle_mm_fault(), but
delete its equivalent pud_none_or_trans_huge_or_dev_or_clear_bad(). While
there, move the default arch_needs_pgtable_deposit() definition up near
where pgtable_trans_huge_deposit() and withdraw() are declared.
Link: https://lkml.kernel.org/r/5abdab3-3136-b42e-274d-9c6281bfb79@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Song Liu <song@kernel.org>
Cc: Steven Price <steven.price@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Zack Rusin <zackr@vmware.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 01:50:37 +00:00
|
|
|
static inline int pud_trans_unstable(pud_t *pud)
|
2019-12-01 01:51:32 +00:00
|
|
|
{
|
mm/pgtable: delete pmd_trans_unstable() and friends
Delete pmd_trans_unstable, pmd_none_or_trans_huge_or_clear_bad() and
pmd_devmap_trans_unstable(), all now unused.
With mixed feelings, delete all the comments on pmd_trans_unstable().
That was very good documentation of a subtle state, and this series does
not even eliminate that state: but rather, normalizes and extends it,
asking pte_offset_map[_lock]() callers to anticipate failure, without
regard for whether mmap_read_lock() or mmap_write_lock() is held.
Retain pud_trans_unstable(), which has one use in __handle_mm_fault(), but
delete its equivalent pud_none_or_trans_huge_or_dev_or_clear_bad(). While
there, move the default arch_needs_pgtable_deposit() definition up near
where pgtable_trans_huge_deposit() and withdraw() are declared.
Link: https://lkml.kernel.org/r/5abdab3-3136-b42e-274d-9c6281bfb79@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Song Liu <song@kernel.org>
Cc: Steven Price <steven.price@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Zack Rusin <zackr@vmware.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 01:50:37 +00:00
|
|
|
#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && \
|
|
|
|
defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)
|
2019-12-01 01:51:32 +00:00
|
|
|
pud_t pudval = READ_ONCE(*pud);
|
|
|
|
|
|
|
|
if (pud_none(pudval) || pud_trans_huge(pudval) || pud_devmap(pudval))
|
|
|
|
return 1;
|
|
|
|
if (unlikely(pud_bad(pudval))) {
|
|
|
|
pud_clear_bad(pud);
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
#endif
|
mm: thp: fix pmd_bad() triggering in code paths holding mmap_sem read mode
In some cases it may happen that pmd_none_or_clear_bad() is called with
the mmap_sem hold in read mode. In those cases the huge page faults can
allocate hugepmds under pmd_none_or_clear_bad() and that can trigger a
false positive from pmd_bad() that will not like to see a pmd
materializing as trans huge.
It's not khugepaged causing the problem, khugepaged holds the mmap_sem
in write mode (and all those sites must hold the mmap_sem in read mode
to prevent pagetables to go away from under them, during code review it
seems vm86 mode on 32bit kernels requires that too unless it's
restricted to 1 thread per process or UP builds). The race is only with
the huge pagefaults that can convert a pmd_none() into a
pmd_trans_huge().
Effectively all these pmd_none_or_clear_bad() sites running with
mmap_sem in read mode are somewhat speculative with the page faults, and
the result is always undefined when they run simultaneously. This is
probably why it wasn't common to run into this. For example if the
madvise(MADV_DONTNEED) runs zap_page_range() shortly before the page
fault, the hugepage will not be zapped, if the page fault runs first it
will be zapped.
Altering pmd_bad() not to error out if it finds hugepmds won't be enough
to fix this, because zap_pmd_range would then proceed to call
zap_pte_range (which would be incorrect if the pmd become a
pmd_trans_huge()).
The simplest way to fix this is to read the pmd in the local stack
(regardless of what we read, no need of actual CPU barriers, only
compiler barrier needed), and be sure it is not changing under the code
that computes its value. Even if the real pmd is changing under the
value we hold on the stack, we don't care. If we actually end up in
zap_pte_range it means the pmd was not none already and it was not huge,
and it can't become huge from under us (khugepaged locking explained
above).
All we need is to enforce that there is no way anymore that in a code
path like below, pmd_trans_huge can be false, but pmd_none_or_clear_bad
can run into a hugepmd. The overhead of a barrier() is just a compiler
tweak and should not be measurable (I only added it for THP builds). I
don't exclude different compiler versions may have prevented the race
too by caching the value of *pmd on the stack (that hasn't been
verified, but it wouldn't be impossible considering
pmd_none_or_clear_bad, pmd_bad, pmd_trans_huge, pmd_none are all inlines
and there's no external function called in between pmd_trans_huge and
pmd_none_or_clear_bad).
if (pmd_trans_huge(*pmd)) {
if (next-addr != HPAGE_PMD_SIZE) {
VM_BUG_ON(!rwsem_is_locked(&tlb->mm->mmap_sem));
split_huge_page_pmd(vma->vm_mm, pmd);
} else if (zap_huge_pmd(tlb, vma, pmd, addr))
continue;
/* fall through */
}
if (pmd_none_or_clear_bad(pmd))
Because this race condition could be exercised without special
privileges this was reported in CVE-2012-1179.
The race was identified and fully explained by Ulrich who debugged it.
I'm quoting his accurate explanation below, for reference.
====== start quote =======
mapcount 0 page_mapcount 1
kernel BUG at mm/huge_memory.c:1384!
At some point prior to the panic, a "bad pmd ..." message similar to the
following is logged on the console:
mm/memory.c:145: bad pmd ffff8800376e1f98(80000000314000e7).
The "bad pmd ..." message is logged by pmd_clear_bad() before it clears
the page's PMD table entry.
143 void pmd_clear_bad(pmd_t *pmd)
144 {
-> 145 pmd_ERROR(*pmd);
146 pmd_clear(pmd);
147 }
After the PMD table entry has been cleared, there is an inconsistency
between the actual number of PMD table entries that are mapping the page
and the page's map count (_mapcount field in struct page). When the page
is subsequently reclaimed, __split_huge_page() detects this inconsistency.
1381 if (mapcount != page_mapcount(page))
1382 printk(KERN_ERR "mapcount %d page_mapcount %d\n",
1383 mapcount, page_mapcount(page));
-> 1384 BUG_ON(mapcount != page_mapcount(page));
The root cause of the problem is a race of two threads in a multithreaded
process. Thread B incurs a page fault on a virtual address that has never
been accessed (PMD entry is zero) while Thread A is executing an madvise()
system call on a virtual address within the same 2 MB (huge page) range.
virtual address space
.---------------------.
| |
| |
.-|---------------------|
| | |
| | |<-- B(fault)
| | |
2 MB | |/////////////////////|-.
huge < |/////////////////////| > A(range)
page | |/////////////////////|-'
| | |
| | |
'-|---------------------|
| |
| |
'---------------------'
- Thread A is executing an madvise(..., MADV_DONTNEED) system call
on the virtual address range "A(range)" shown in the picture.
sys_madvise
// Acquire the semaphore in shared mode.
down_read(¤t->mm->mmap_sem)
...
madvise_vma
switch (behavior)
case MADV_DONTNEED:
madvise_dontneed
zap_page_range
unmap_vmas
unmap_page_range
zap_pud_range
zap_pmd_range
//
// Assume that this huge page has never been accessed.
// I.e. content of the PMD entry is zero (not mapped).
//
if (pmd_trans_huge(*pmd)) {
// We don't get here due to the above assumption.
}
//
// Assume that Thread B incurred a page fault and
.---------> // sneaks in here as shown below.
| //
| if (pmd_none_or_clear_bad(pmd))
| {
| if (unlikely(pmd_bad(*pmd)))
| pmd_clear_bad
| {
| pmd_ERROR
| // Log "bad pmd ..." message here.
| pmd_clear
| // Clear the page's PMD entry.
| // Thread B incremented the map count
| // in page_add_new_anon_rmap(), but
| // now the page is no longer mapped
| // by a PMD entry (-> inconsistency).
| }
| }
|
v
- Thread B is handling a page fault on virtual address "B(fault)" shown
in the picture.
...
do_page_fault
__do_page_fault
// Acquire the semaphore in shared mode.
down_read_trylock(&mm->mmap_sem)
...
handle_mm_fault
if (pmd_none(*pmd) && transparent_hugepage_enabled(vma))
// We get here due to the above assumption (PMD entry is zero).
do_huge_pmd_anonymous_page
alloc_hugepage_vma
// Allocate a new transparent huge page here.
...
__do_huge_pmd_anonymous_page
...
spin_lock(&mm->page_table_lock)
...
page_add_new_anon_rmap
// Here we increment the page's map count (starts at -1).
atomic_set(&page->_mapcount, 0)
set_pmd_at
// Here we set the page's PMD entry which will be cleared
// when Thread A calls pmd_clear_bad().
...
spin_unlock(&mm->page_table_lock)
The mmap_sem does not prevent the race because both threads are acquiring
it in shared mode (down_read). Thread B holds the page_table_lock while
the page's map count and PMD table entry are updated. However, Thread A
does not synchronize on that lock.
====== end quote =======
[akpm@linux-foundation.org: checkpatch fixes]
Reported-by: Ulrich Obergfell <uobergfe@redhat.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Jones <davej@redhat.com>
Acked-by: Larry Woodman <lwoodman@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: <stable@vger.kernel.org> [2.6.38+]
Cc: Mark Salter <msalter@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-21 23:33:42 +00:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2015-02-12 22:58:19 +00:00
|
|
|
#ifndef CONFIG_NUMA_BALANCING
|
|
|
|
/*
|
2023-08-03 14:32:06 +00:00
|
|
|
* In an inaccessible (PROT_NONE) VMA, pte_protnone() may indicate "yes". It is
|
|
|
|
* perfectly valid to indicate "no" in that case, which is why our default
|
|
|
|
* implementation defaults to "always no".
|
|
|
|
*
|
|
|
|
* In an accessible VMA, however, pte_protnone() reliably indicates PROT_NONE
|
|
|
|
* page protection due to NUMA hinting. NUMA hinting faults only apply in
|
|
|
|
* accessible VMAs.
|
|
|
|
*
|
|
|
|
* So, to reliably identify PROT_NONE PTEs that require a NUMA hinting fault,
|
|
|
|
* looking at the VMA accessibility is sufficient.
|
2015-02-12 22:58:19 +00:00
|
|
|
*/
|
|
|
|
static inline int pte_protnone(pte_t pte)
|
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline int pmd_protnone(pmd_t pmd)
|
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
#endif /* CONFIG_NUMA_BALANCING */
|
|
|
|
|
mm: thp: fix pmd_bad() triggering in code paths holding mmap_sem read mode
In some cases it may happen that pmd_none_or_clear_bad() is called with
the mmap_sem hold in read mode. In those cases the huge page faults can
allocate hugepmds under pmd_none_or_clear_bad() and that can trigger a
false positive from pmd_bad() that will not like to see a pmd
materializing as trans huge.
It's not khugepaged causing the problem, khugepaged holds the mmap_sem
in write mode (and all those sites must hold the mmap_sem in read mode
to prevent pagetables to go away from under them, during code review it
seems vm86 mode on 32bit kernels requires that too unless it's
restricted to 1 thread per process or UP builds). The race is only with
the huge pagefaults that can convert a pmd_none() into a
pmd_trans_huge().
Effectively all these pmd_none_or_clear_bad() sites running with
mmap_sem in read mode are somewhat speculative with the page faults, and
the result is always undefined when they run simultaneously. This is
probably why it wasn't common to run into this. For example if the
madvise(MADV_DONTNEED) runs zap_page_range() shortly before the page
fault, the hugepage will not be zapped, if the page fault runs first it
will be zapped.
Altering pmd_bad() not to error out if it finds hugepmds won't be enough
to fix this, because zap_pmd_range would then proceed to call
zap_pte_range (which would be incorrect if the pmd become a
pmd_trans_huge()).
The simplest way to fix this is to read the pmd in the local stack
(regardless of what we read, no need of actual CPU barriers, only
compiler barrier needed), and be sure it is not changing under the code
that computes its value. Even if the real pmd is changing under the
value we hold on the stack, we don't care. If we actually end up in
zap_pte_range it means the pmd was not none already and it was not huge,
and it can't become huge from under us (khugepaged locking explained
above).
All we need is to enforce that there is no way anymore that in a code
path like below, pmd_trans_huge can be false, but pmd_none_or_clear_bad
can run into a hugepmd. The overhead of a barrier() is just a compiler
tweak and should not be measurable (I only added it for THP builds). I
don't exclude different compiler versions may have prevented the race
too by caching the value of *pmd on the stack (that hasn't been
verified, but it wouldn't be impossible considering
pmd_none_or_clear_bad, pmd_bad, pmd_trans_huge, pmd_none are all inlines
and there's no external function called in between pmd_trans_huge and
pmd_none_or_clear_bad).
if (pmd_trans_huge(*pmd)) {
if (next-addr != HPAGE_PMD_SIZE) {
VM_BUG_ON(!rwsem_is_locked(&tlb->mm->mmap_sem));
split_huge_page_pmd(vma->vm_mm, pmd);
} else if (zap_huge_pmd(tlb, vma, pmd, addr))
continue;
/* fall through */
}
if (pmd_none_or_clear_bad(pmd))
Because this race condition could be exercised without special
privileges this was reported in CVE-2012-1179.
The race was identified and fully explained by Ulrich who debugged it.
I'm quoting his accurate explanation below, for reference.
====== start quote =======
mapcount 0 page_mapcount 1
kernel BUG at mm/huge_memory.c:1384!
At some point prior to the panic, a "bad pmd ..." message similar to the
following is logged on the console:
mm/memory.c:145: bad pmd ffff8800376e1f98(80000000314000e7).
The "bad pmd ..." message is logged by pmd_clear_bad() before it clears
the page's PMD table entry.
143 void pmd_clear_bad(pmd_t *pmd)
144 {
-> 145 pmd_ERROR(*pmd);
146 pmd_clear(pmd);
147 }
After the PMD table entry has been cleared, there is an inconsistency
between the actual number of PMD table entries that are mapping the page
and the page's map count (_mapcount field in struct page). When the page
is subsequently reclaimed, __split_huge_page() detects this inconsistency.
1381 if (mapcount != page_mapcount(page))
1382 printk(KERN_ERR "mapcount %d page_mapcount %d\n",
1383 mapcount, page_mapcount(page));
-> 1384 BUG_ON(mapcount != page_mapcount(page));
The root cause of the problem is a race of two threads in a multithreaded
process. Thread B incurs a page fault on a virtual address that has never
been accessed (PMD entry is zero) while Thread A is executing an madvise()
system call on a virtual address within the same 2 MB (huge page) range.
virtual address space
.---------------------.
| |
| |
.-|---------------------|
| | |
| | |<-- B(fault)
| | |
2 MB | |/////////////////////|-.
huge < |/////////////////////| > A(range)
page | |/////////////////////|-'
| | |
| | |
'-|---------------------|
| |
| |
'---------------------'
- Thread A is executing an madvise(..., MADV_DONTNEED) system call
on the virtual address range "A(range)" shown in the picture.
sys_madvise
// Acquire the semaphore in shared mode.
down_read(¤t->mm->mmap_sem)
...
madvise_vma
switch (behavior)
case MADV_DONTNEED:
madvise_dontneed
zap_page_range
unmap_vmas
unmap_page_range
zap_pud_range
zap_pmd_range
//
// Assume that this huge page has never been accessed.
// I.e. content of the PMD entry is zero (not mapped).
//
if (pmd_trans_huge(*pmd)) {
// We don't get here due to the above assumption.
}
//
// Assume that Thread B incurred a page fault and
.---------> // sneaks in here as shown below.
| //
| if (pmd_none_or_clear_bad(pmd))
| {
| if (unlikely(pmd_bad(*pmd)))
| pmd_clear_bad
| {
| pmd_ERROR
| // Log "bad pmd ..." message here.
| pmd_clear
| // Clear the page's PMD entry.
| // Thread B incremented the map count
| // in page_add_new_anon_rmap(), but
| // now the page is no longer mapped
| // by a PMD entry (-> inconsistency).
| }
| }
|
v
- Thread B is handling a page fault on virtual address "B(fault)" shown
in the picture.
...
do_page_fault
__do_page_fault
// Acquire the semaphore in shared mode.
down_read_trylock(&mm->mmap_sem)
...
handle_mm_fault
if (pmd_none(*pmd) && transparent_hugepage_enabled(vma))
// We get here due to the above assumption (PMD entry is zero).
do_huge_pmd_anonymous_page
alloc_hugepage_vma
// Allocate a new transparent huge page here.
...
__do_huge_pmd_anonymous_page
...
spin_lock(&mm->page_table_lock)
...
page_add_new_anon_rmap
// Here we increment the page's map count (starts at -1).
atomic_set(&page->_mapcount, 0)
set_pmd_at
// Here we set the page's PMD entry which will be cleared
// when Thread A calls pmd_clear_bad().
...
spin_unlock(&mm->page_table_lock)
The mmap_sem does not prevent the race because both threads are acquiring
it in shared mode (down_read). Thread B holds the page_table_lock while
the page's map count and PMD table entry are updated. However, Thread A
does not synchronize on that lock.
====== end quote =======
[akpm@linux-foundation.org: checkpatch fixes]
Reported-by: Ulrich Obergfell <uobergfe@redhat.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Jones <davej@redhat.com>
Acked-by: Larry Woodman <lwoodman@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: <stable@vger.kernel.org> [2.6.38+]
Cc: Mark Salter <msalter@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-21 23:33:42 +00:00
|
|
|
#endif /* CONFIG_MMU */
|
2011-01-13 23:46:40 +00:00
|
|
|
|
2015-04-14 22:47:23 +00:00
|
|
|
#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
|
2017-03-09 14:24:07 +00:00
|
|
|
|
|
|
|
#ifndef __PAGETABLE_P4D_FOLDED
|
|
|
|
int p4d_set_huge(p4d_t *p4d, phys_addr_t addr, pgprot_t prot);
|
2022-05-13 03:23:07 +00:00
|
|
|
void p4d_clear_huge(p4d_t *p4d);
|
2017-03-09 14:24:07 +00:00
|
|
|
#else
|
|
|
|
static inline int p4d_set_huge(p4d_t *p4d, phys_addr_t addr, pgprot_t prot)
|
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
2022-05-13 03:23:07 +00:00
|
|
|
static inline void p4d_clear_huge(p4d_t *p4d) { }
|
2017-03-09 14:24:07 +00:00
|
|
|
#endif /* !__PAGETABLE_P4D_FOLDED */
|
|
|
|
|
2015-04-14 22:47:23 +00:00
|
|
|
int pud_set_huge(pud_t *pud, phys_addr_t addr, pgprot_t prot);
|
2021-07-01 01:48:03 +00:00
|
|
|
int pmd_set_huge(pmd_t *pmd, phys_addr_t addr, pgprot_t prot);
|
2021-07-21 07:02:13 +00:00
|
|
|
int pud_clear_huge(pud_t *pud);
|
2015-04-14 22:47:26 +00:00
|
|
|
int pmd_clear_huge(pmd_t *pmd);
|
2018-12-28 08:37:53 +00:00
|
|
|
int p4d_free_pud_page(p4d_t *p4d, unsigned long addr);
|
2018-06-27 14:13:47 +00:00
|
|
|
int pud_free_pmd_page(pud_t *pud, unsigned long addr);
|
|
|
|
int pmd_free_pte_page(pmd_t *pmd, unsigned long addr);
|
2015-04-14 22:47:23 +00:00
|
|
|
#else /* !CONFIG_HAVE_ARCH_HUGE_VMAP */
|
2017-03-09 14:24:07 +00:00
|
|
|
static inline int p4d_set_huge(p4d_t *p4d, phys_addr_t addr, pgprot_t prot)
|
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
2015-04-14 22:47:23 +00:00
|
|
|
static inline int pud_set_huge(pud_t *pud, phys_addr_t addr, pgprot_t prot)
|
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
static inline int pmd_set_huge(pmd_t *pmd, phys_addr_t addr, pgprot_t prot)
|
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
2022-05-13 03:23:07 +00:00
|
|
|
static inline void p4d_clear_huge(p4d_t *p4d) { }
|
2015-04-14 22:47:26 +00:00
|
|
|
static inline int pud_clear_huge(pud_t *pud)
|
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
static inline int pmd_clear_huge(pmd_t *pmd)
|
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
2018-12-28 08:37:53 +00:00
|
|
|
static inline int p4d_free_pud_page(p4d_t *p4d, unsigned long addr)
|
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
2018-06-27 14:13:47 +00:00
|
|
|
static inline int pud_free_pmd_page(pud_t *pud, unsigned long addr)
|
mm/vmalloc: add interfaces to free unmapped page table
On architectures with CONFIG_HAVE_ARCH_HUGE_VMAP set, ioremap() may
create pud/pmd mappings. A kernel panic was observed on arm64 systems
with Cortex-A75 in the following steps as described by Hanjun Guo.
1. ioremap a 4K size, valid page table will build,
2. iounmap it, pte0 will set to 0;
3. ioremap the same address with 2M size, pgd/pmd is unchanged,
then set the a new value for pmd;
4. pte0 is leaked;
5. CPU may meet exception because the old pmd is still in TLB,
which will lead to kernel panic.
This panic is not reproducible on x86. INVLPG, called from iounmap,
purges all levels of entries associated with purged address on x86. x86
still has memory leak.
The patch changes the ioremap path to free unmapped page table(s) since
doing so in the unmap path has the following issues:
- The iounmap() path is shared with vunmap(). Since vmap() only
supports pte mappings, making vunmap() to free a pte page is an
overhead for regular vmap users as they do not need a pte page freed
up.
- Checking if all entries in a pte page are cleared in the unmap path
is racy, and serializing this check is expensive.
- The unmap path calls free_vmap_area_noflush() to do lazy TLB purges.
Clearing a pud/pmd entry before the lazy TLB purges needs extra TLB
purge.
Add two interfaces, pud_free_pmd_page() and pmd_free_pte_page(), which
clear a given pud/pmd entry and free up a page for the lower level
entries.
This patch implements their stub functions on x86 and arm64, which work
as workaround.
[akpm@linux-foundation.org: fix typo in pmd_free_pte_page() stub]
Link: http://lkml.kernel.org/r/20180314180155.19492-2-toshi.kani@hpe.com
Fixes: e61ce6ade404e ("mm: change ioremap to set up huge I/O mappings")
Reported-by: Lei Li <lious.lilei@hisilicon.com>
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Wang Xuefeng <wxf.wang@hisilicon.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Hanjun Guo <guohanjun@huawei.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Chintan Pandya <cpandya@codeaurora.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-03-22 23:17:20 +00:00
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
2018-06-27 14:13:47 +00:00
|
|
|
static inline int pmd_free_pte_page(pmd_t *pmd, unsigned long addr)
|
mm/vmalloc: add interfaces to free unmapped page table
On architectures with CONFIG_HAVE_ARCH_HUGE_VMAP set, ioremap() may
create pud/pmd mappings. A kernel panic was observed on arm64 systems
with Cortex-A75 in the following steps as described by Hanjun Guo.
1. ioremap a 4K size, valid page table will build,
2. iounmap it, pte0 will set to 0;
3. ioremap the same address with 2M size, pgd/pmd is unchanged,
then set the a new value for pmd;
4. pte0 is leaked;
5. CPU may meet exception because the old pmd is still in TLB,
which will lead to kernel panic.
This panic is not reproducible on x86. INVLPG, called from iounmap,
purges all levels of entries associated with purged address on x86. x86
still has memory leak.
The patch changes the ioremap path to free unmapped page table(s) since
doing so in the unmap path has the following issues:
- The iounmap() path is shared with vunmap(). Since vmap() only
supports pte mappings, making vunmap() to free a pte page is an
overhead for regular vmap users as they do not need a pte page freed
up.
- Checking if all entries in a pte page are cleared in the unmap path
is racy, and serializing this check is expensive.
- The unmap path calls free_vmap_area_noflush() to do lazy TLB purges.
Clearing a pud/pmd entry before the lazy TLB purges needs extra TLB
purge.
Add two interfaces, pud_free_pmd_page() and pmd_free_pte_page(), which
clear a given pud/pmd entry and free up a page for the lower level
entries.
This patch implements their stub functions on x86 and arm64, which work
as workaround.
[akpm@linux-foundation.org: fix typo in pmd_free_pte_page() stub]
Link: http://lkml.kernel.org/r/20180314180155.19492-2-toshi.kani@hpe.com
Fixes: e61ce6ade404e ("mm: change ioremap to set up huge I/O mappings")
Reported-by: Lei Li <lious.lilei@hisilicon.com>
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Wang Xuefeng <wxf.wang@hisilicon.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Hanjun Guo <guohanjun@huawei.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Chintan Pandya <cpandya@codeaurora.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-03-22 23:17:20 +00:00
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
2015-04-14 22:47:23 +00:00
|
|
|
#endif /* CONFIG_HAVE_ARCH_HUGE_VMAP */
|
|
|
|
|
2016-03-17 21:18:56 +00:00
|
|
|
#ifndef __HAVE_ARCH_FLUSH_PMD_TLB_RANGE
|
|
|
|
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
|
|
|
|
/*
|
|
|
|
* ARCHes with special requirements for evicting THP backing TLB entries can
|
|
|
|
* implement this. Otherwise also, it can help optimize normal TLB flush in
|
2020-08-12 01:32:27 +00:00
|
|
|
* THP regime. Stock flush_tlb_range() typically has optimization to nuke the
|
|
|
|
* entire TLB if flush span is greater than a threshold, which will
|
|
|
|
* likely be true for a single huge page. Thus a single THP flush will
|
|
|
|
* invalidate the entire TLB which is not desirable.
|
2016-03-17 21:18:56 +00:00
|
|
|
* e.g. see arch/arc: flush_pmd_tlb_range
|
|
|
|
*/
|
|
|
|
#define flush_pmd_tlb_range(vma, addr, end) flush_tlb_range(vma, addr, end)
|
2017-02-24 22:57:02 +00:00
|
|
|
#define flush_pud_tlb_range(vma, addr, end) flush_tlb_range(vma, addr, end)
|
2016-03-17 21:18:56 +00:00
|
|
|
#else
|
|
|
|
#define flush_pmd_tlb_range(vma, addr, end) BUILD_BUG()
|
2017-02-24 22:57:02 +00:00
|
|
|
#define flush_pud_tlb_range(vma, addr, end) BUILD_BUG()
|
2016-03-17 21:18:56 +00:00
|
|
|
#endif
|
|
|
|
#endif
|
|
|
|
|
2016-10-08 00:00:55 +00:00
|
|
|
struct file;
|
|
|
|
int phys_mem_access_prot_allowed(struct file *file, unsigned long pfn,
|
|
|
|
unsigned long size, pgprot_t *vma_prot);
|
2017-12-17 09:56:29 +00:00
|
|
|
|
|
|
|
#ifndef CONFIG_X86_ESPFIX64
|
|
|
|
static inline void init_espfix_bsp(void) { }
|
|
|
|
#endif
|
|
|
|
|
2019-09-23 22:35:31 +00:00
|
|
|
extern void __init pgtable_cache_init(void);
|
2019-05-05 01:11:24 +00:00
|
|
|
|
2018-07-14 19:56:13 +00:00
|
|
|
#ifndef __HAVE_ARCH_PFN_MODIFY_ALLOWED
|
|
|
|
static inline bool pfn_modify_allowed(unsigned long pfn, pgprot_t prot)
|
|
|
|
{
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline bool arch_has_pfn_modify_check(void)
|
|
|
|
{
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
#endif /* !_HAVE_ARCH_PFN_MODIFY_ALLOWED */
|
|
|
|
|
2018-08-17 22:46:29 +00:00
|
|
|
/*
|
|
|
|
* Architecture PAGE_KERNEL_* fallbacks
|
|
|
|
*
|
|
|
|
* Some architectures don't define certain PAGE_KERNEL_* flags. This is either
|
|
|
|
* because they really don't support them, or the port needs to be updated to
|
|
|
|
* reflect the required functionality. Below are a set of relatively safe
|
|
|
|
* fallbacks, as best effort, which we can count on in lieu of the architectures
|
|
|
|
* not defining them on their own yet.
|
|
|
|
*/
|
|
|
|
|
|
|
|
#ifndef PAGE_KERNEL_RO
|
|
|
|
# define PAGE_KERNEL_RO PAGE_KERNEL
|
|
|
|
#endif
|
|
|
|
|
2018-08-17 22:46:32 +00:00
|
|
|
#ifndef PAGE_KERNEL_EXEC
|
|
|
|
# define PAGE_KERNEL_EXEC PAGE_KERNEL
|
|
|
|
#endif
|
|
|
|
|
mm: add functions to track page directory modifications
Patch series "mm: Get rid of vmalloc_sync_(un)mappings()", v3.
After the recent issue with vmalloc and tracing code[1] on x86 and a
long history of previous issues related to the vmalloc_sync_mappings()
interface, I thought the time has come to remove it. Please see [2],
[3], and [4] for some other issues in the past.
The patches add tracking of page-table directory changes to the vmalloc
and ioremap code. Depending on which page-table levels changes have
been made, a new per-arch function is called:
arch_sync_kernel_mappings().
On x86-64 with 4-level paging, this function will not be called more
than 64 times in a systems runtime (because vmalloc-space takes 64 PGD
entries which are only populated, but never cleared).
As a side effect this also allows to get rid of vmalloc faults on x86,
making it safe to touch vmalloc'ed memory in the page-fault handler.
Note that this potentially includes per-cpu memory.
This patch (of 7):
Add page-table allocation functions which will keep track of changed
directory entries. They are needed for new PGD, P4D, PUD, and PMD
entries and will be used in vmalloc and ioremap code to decide whether
any changes in the kernel mappings need to be synchronized between
page-tables in the system.
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Andy Lutomirski <luto@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Christoph Hellwig <hch@lst.de>
Link: http://lkml.kernel.org/r/20200515140023.25469-1-joro@8bytes.org
Link: http://lkml.kernel.org/r/20200515140023.25469-2-joro@8bytes.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 04:52:18 +00:00
|
|
|
/*
|
|
|
|
* Page Table Modification bits for pgtbl_mod_mask.
|
|
|
|
*
|
|
|
|
* These are used by the p?d_alloc_track*() set of functions an in the generic
|
|
|
|
* vmalloc/ioremap code to track at which page-table levels entries have been
|
|
|
|
* modified. Based on that the code can better decide when vmalloc and ioremap
|
|
|
|
* mapping changes need to be synchronized to other page-tables in the system.
|
|
|
|
*/
|
|
|
|
#define __PGTBL_PGD_MODIFIED 0
|
|
|
|
#define __PGTBL_P4D_MODIFIED 1
|
|
|
|
#define __PGTBL_PUD_MODIFIED 2
|
|
|
|
#define __PGTBL_PMD_MODIFIED 3
|
|
|
|
#define __PGTBL_PTE_MODIFIED 4
|
|
|
|
|
|
|
|
#define PGTBL_PGD_MODIFIED BIT(__PGTBL_PGD_MODIFIED)
|
|
|
|
#define PGTBL_P4D_MODIFIED BIT(__PGTBL_P4D_MODIFIED)
|
|
|
|
#define PGTBL_PUD_MODIFIED BIT(__PGTBL_PUD_MODIFIED)
|
|
|
|
#define PGTBL_PMD_MODIFIED BIT(__PGTBL_PMD_MODIFIED)
|
|
|
|
#define PGTBL_PTE_MODIFIED BIT(__PGTBL_PTE_MODIFIED)
|
|
|
|
|
|
|
|
/* Page-Table Modification Mask */
|
|
|
|
typedef unsigned int pgtbl_mod_mask;
|
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
#endif /* !__ASSEMBLY__ */
|
|
|
|
|
arch: pgtable: define MAX_POSSIBLE_PHYSMEM_BITS where needed
Stefan Agner reported a bug when using zsram on 32-bit Arm machines
with RAM above the 4GB address boundary:
Unable to handle kernel NULL pointer dereference at virtual address 00000000
pgd = a27bd01c
[00000000] *pgd=236a0003, *pmd=1ffa64003
Internal error: Oops: 207 [#1] SMP ARM
Modules linked in: mdio_bcm_unimac(+) brcmfmac cfg80211 brcmutil raspberrypi_hwmon hci_uart crc32_arm_ce bcm2711_thermal phy_generic genet
CPU: 0 PID: 123 Comm: mkfs.ext4 Not tainted 5.9.6 #1
Hardware name: BCM2711
PC is at zs_map_object+0x94/0x338
LR is at zram_bvec_rw.constprop.0+0x330/0xa64
pc : [<c0602b38>] lr : [<c0bda6a0>] psr: 60000013
sp : e376bbe0 ip : 00000000 fp : c1e2921c
r10: 00000002 r9 : c1dda730 r8 : 00000000
r7 : e8ff7a00 r6 : 00000000 r5 : 02f9ffa0 r4 : e3710000
r3 : 000fdffe r2 : c1e0ce80 r1 : ebf979a0 r0 : 00000000
Flags: nZCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment user
Control: 30c5383d Table: 235c2a80 DAC: fffffffd
Process mkfs.ext4 (pid: 123, stack limit = 0x495a22e6)
Stack: (0xe376bbe0 to 0xe376c000)
As it turns out, zsram needs to know the maximum memory size, which
is defined in MAX_PHYSMEM_BITS when CONFIG_SPARSEMEM is set, or in
MAX_POSSIBLE_PHYSMEM_BITS on the x86 architecture.
The same problem will be hit on all 32-bit architectures that have a
physical address space larger than 4GB and happen to not enable sparsemem
and include asm/sparsemem.h from asm/pgtable.h.
After the initial discussion, I suggested just always defining
MAX_POSSIBLE_PHYSMEM_BITS whenever CONFIG_PHYS_ADDR_T_64BIT is
set, or provoking a build error otherwise. This addresses all
configurations that can currently have this runtime bug, but
leaves all other configurations unchanged.
I looked up the possible number of bits in source code and
datasheets, here is what I found:
- on ARC, CONFIG_ARC_HAS_PAE40 controls whether 32 or 40 bits are used
- on ARM, CONFIG_LPAE enables 40 bit addressing, without it we never
support more than 32 bits, even though supersections in theory allow
up to 40 bits as well.
- on MIPS, some MIPS32r1 or later chips support 36 bits, and MIPS32r5
XPA supports up to 60 bits in theory, but 40 bits are more than
anyone will ever ship
- On PowerPC, there are three different implementations of 36 bit
addressing, but 32-bit is used without CONFIG_PTE_64BIT
- On RISC-V, the normal page table format can support 34 bit
addressing. There is no highmem support on RISC-V, so anything
above 2GB is unused, but it might be useful to eventually support
CONFIG_ZRAM for high pages.
Fixes: 61989a80fb3a ("staging: zsmalloc: zsmalloc memory allocation library")
Fixes: 02390b87a945 ("mm/zsmalloc: Prepare to variable MAX_PHYSMEM_BITS")
Acked-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Reviewed-by: Stefan Agner <stefan@agner.ch>
Tested-by: Stefan Agner <stefan@agner.ch>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Link: https://lore.kernel.org/linux-mm/bdfa44bf1c570b05d6c70898e2bbb0acf234ecdf.1604762181.git.stefan@agner.ch/
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2020-11-11 16:52:58 +00:00
|
|
|
#if !defined(MAX_POSSIBLE_PHYSMEM_BITS) && !defined(CONFIG_64BIT)
|
|
|
|
#ifdef CONFIG_PHYS_ADDR_T_64BIT
|
|
|
|
/*
|
|
|
|
* ZSMALLOC needs to know the highest PFN on 32-bit architectures
|
|
|
|
* with physical address space extension, but falls back to
|
|
|
|
* BITS_PER_LONG otherwise.
|
|
|
|
*/
|
|
|
|
#error Missing MAX_POSSIBLE_PHYSMEM_BITS definition
|
|
|
|
#else
|
|
|
|
#define MAX_POSSIBLE_PHYSMEM_BITS 32
|
|
|
|
#endif
|
|
|
|
#endif
|
|
|
|
|
arch: fix has_transparent_hugepage()
I've just discovered that the useful-sounding has_transparent_hugepage()
is actually an architecture-dependent minefield: on some arches it only
builds if CONFIG_TRANSPARENT_HUGEPAGE=y, on others it's also there when
not, but on some of those (arm and arm64) it then gives the wrong
answer; and on mips alone it's marked __init, which would crash if
called later (but so far it has not been called later).
Straighten this out: make it available to all configs, with a sensible
default in asm-generic/pgtable.h, removing its definitions from those
arches (arc, arm, arm64, sparc, tile) which are served by the default,
adding #define has_transparent_hugepage has_transparent_hugepage to
those (mips, powerpc, s390, x86) which need to override the default at
runtime, and removing the __init from mips (but maybe that kind of code
should be avoided after init: set a static variable the first time it's
called).
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Yang Shi <yang.shi@linaro.org>
Cc: Ning Qu <quning@gmail.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Vineet Gupta <vgupta@synopsys.com> [arch/arc]
Acked-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> [arch/s390]
Acked-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20 00:13:00 +00:00
|
|
|
#ifndef has_transparent_hugepage
|
2022-08-29 09:57:09 +00:00
|
|
|
#define has_transparent_hugepage() IS_BUILTIN(CONFIG_TRANSPARENT_HUGEPAGE)
|
arch: fix has_transparent_hugepage()
I've just discovered that the useful-sounding has_transparent_hugepage()
is actually an architecture-dependent minefield: on some arches it only
builds if CONFIG_TRANSPARENT_HUGEPAGE=y, on others it's also there when
not, but on some of those (arm and arm64) it then gives the wrong
answer; and on mips alone it's marked __init, which would crash if
called later (but so far it has not been called later).
Straighten this out: make it available to all configs, with a sensible
default in asm-generic/pgtable.h, removing its definitions from those
arches (arc, arm, arm64, sparc, tile) which are served by the default,
adding #define has_transparent_hugepage has_transparent_hugepage to
those (mips, powerpc, s390, x86) which need to override the default at
runtime, and removing the __init from mips (but maybe that kind of code
should be avoided after init: set a static variable the first time it's
called).
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Yang Shi <yang.shi@linaro.org>
Cc: Ning Qu <quning@gmail.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Vineet Gupta <vgupta@synopsys.com> [arch/arc]
Acked-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> [arch/s390]
Acked-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20 00:13:00 +00:00
|
|
|
#endif
|
|
|
|
|
2023-07-24 19:07:47 +00:00
|
|
|
#ifndef has_transparent_pud_hugepage
|
|
|
|
#define has_transparent_pud_hugepage() IS_BUILTIN(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)
|
|
|
|
#endif
|
2018-10-15 08:25:57 +00:00
|
|
|
/*
|
|
|
|
* On some architectures it depends on the mm if the p4d/pud or pmd
|
|
|
|
* layer of the page table hierarchy is folded or not.
|
|
|
|
*/
|
|
|
|
#ifndef mm_p4d_folded
|
|
|
|
#define mm_p4d_folded(mm) __is_defined(__PAGETABLE_P4D_FOLDED)
|
|
|
|
#endif
|
|
|
|
|
|
|
|
#ifndef mm_pud_folded
|
|
|
|
#define mm_pud_folded(mm) __is_defined(__PAGETABLE_PUD_FOLDED)
|
|
|
|
#endif
|
|
|
|
|
|
|
|
#ifndef mm_pmd_folded
|
|
|
|
#define mm_pmd_folded(mm) __is_defined(__PAGETABLE_PMD_FOLDED)
|
|
|
|
#endif
|
|
|
|
|
mm/gup: fix gup_fast with dynamic page table folding
Currently to make sure that every page table entry is read just once
gup_fast walks perform READ_ONCE and pass pXd value down to the next
gup_pXd_range function by value e.g.:
static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
unsigned int flags, struct page **pages, int *nr)
...
pudp = pud_offset(&p4d, addr);
This function passes a reference on that local value copy to pXd_offset,
and might get the very same pointer in return. This happens when the
level is folded (on most arches), and that pointer should not be
iterated.
On s390 due to the fact that each task might have different 5,4 or
3-level address translation and hence different levels folded the logic
is more complex and non-iteratable pointer to a local copy leads to
severe problems.
Here is an example of what happens with gup_fast on s390, for a task
with 3-level paging, crossing a 2 GB pud boundary:
// addr = 0x1007ffff000, end = 0x10080001000
static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
unsigned int flags, struct page **pages, int *nr)
{
unsigned long next;
pud_t *pudp;
// pud_offset returns &p4d itself (a pointer to a value on stack)
pudp = pud_offset(&p4d, addr);
do {
// on second iteratation reading "random" stack value
pud_t pud = READ_ONCE(*pudp);
// next = 0x10080000000, due to PUD_SIZE/MASK != PGDIR_SIZE/MASK on s390
next = pud_addr_end(addr, end);
...
} while (pudp++, addr = next, addr != end); // pudp++ iterating over stack
return 1;
}
This happens since s390 moved to common gup code with commit
d1874a0c2805 ("s390/mm: make the pxd_offset functions more robust") and
commit 1a42010cdc26 ("s390/mm: convert to the generic
get_user_pages_fast code").
s390 tried to mimic static level folding by changing pXd_offset
primitives to always calculate top level page table offset in pgd_offset
and just return the value passed when pXd_offset has to act as folded.
What is crucial for gup_fast and what has been overlooked is that
PxD_SIZE/MASK and thus pXd_addr_end should also change correspondingly.
And the latter is not possible with dynamic folding.
To fix the issue in addition to pXd values pass original pXdp pointers
down to gup_pXd_range functions. And introduce pXd_offset_lockless
helpers, which take an additional pXd entry value parameter. This has
already been discussed in
https://lkml.kernel.org/r/20190418100218.0a4afd51@mschwideX1
Fixes: 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast code")
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
Cc: <stable@vger.kernel.org> [5.2+]
Link: https://lkml.kernel.org/r/patch.git-943f1e5dcff2.your-ad-here.call-01599856292-ext-8676@work.hours
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-09-26 04:19:10 +00:00
|
|
|
#ifndef p4d_offset_lockless
|
|
|
|
#define p4d_offset_lockless(pgdp, pgd, address) p4d_offset(&(pgd), address)
|
|
|
|
#endif
|
|
|
|
#ifndef pud_offset_lockless
|
|
|
|
#define pud_offset_lockless(p4dp, p4d, address) pud_offset(&(p4d), address)
|
|
|
|
#endif
|
|
|
|
#ifndef pmd_offset_lockless
|
|
|
|
#define pmd_offset_lockless(pudp, pud, address) pmd_offset(&(pud), address)
|
|
|
|
#endif
|
|
|
|
|
2020-02-04 01:35:01 +00:00
|
|
|
/*
|
2024-03-18 20:04:04 +00:00
|
|
|
* pXd_leaf() is the API to check whether a pgtable entry is a huge page
|
|
|
|
* mapping. It should work globally across all archs, without any
|
|
|
|
* dependency on CONFIG_* options. For architectures that do not support
|
|
|
|
* huge mappings on specific levels, below fallbacks will be used.
|
|
|
|
*
|
|
|
|
* A leaf pgtable entry should always imply the following:
|
|
|
|
*
|
|
|
|
* - It is a "present" entry. IOW, before using this API, please check it
|
|
|
|
* with pXd_present() first. NOTE: it may not always mean the "present
|
|
|
|
* bit" is set. For example, PROT_NONE entries are always "present".
|
|
|
|
*
|
|
|
|
* - It should _never_ be a swap entry of any type. Above "present" check
|
|
|
|
* should have guarded this, but let's be crystal clear on this.
|
|
|
|
*
|
|
|
|
* - It should contain a huge PFN, which points to a huge page larger than
|
|
|
|
* PAGE_SIZE of the platform. The PFN format isn't important here.
|
|
|
|
*
|
|
|
|
* - It should cover all kinds of huge mappings (e.g., pXd_trans_huge(),
|
|
|
|
* pXd_devmap(), or hugetlb mappings).
|
2020-02-04 01:35:01 +00:00
|
|
|
*/
|
|
|
|
#ifndef pgd_leaf
|
2024-03-05 04:37:50 +00:00
|
|
|
#define pgd_leaf(x) false
|
2020-02-04 01:35:01 +00:00
|
|
|
#endif
|
|
|
|
#ifndef p4d_leaf
|
2024-03-05 04:37:50 +00:00
|
|
|
#define p4d_leaf(x) false
|
2020-02-04 01:35:01 +00:00
|
|
|
#endif
|
|
|
|
#ifndef pud_leaf
|
2024-03-05 04:37:50 +00:00
|
|
|
#define pud_leaf(x) false
|
2020-02-04 01:35:01 +00:00
|
|
|
#endif
|
|
|
|
#ifndef pmd_leaf
|
2024-03-05 04:37:50 +00:00
|
|
|
#define pmd_leaf(x) false
|
2020-02-04 01:35:01 +00:00
|
|
|
#endif
|
|
|
|
|
2020-11-13 10:45:36 +00:00
|
|
|
#ifndef pgd_leaf_size
|
|
|
|
#define pgd_leaf_size(x) (1ULL << PGDIR_SHIFT)
|
|
|
|
#endif
|
|
|
|
#ifndef p4d_leaf_size
|
|
|
|
#define p4d_leaf_size(x) P4D_SIZE
|
|
|
|
#endif
|
|
|
|
#ifndef pud_leaf_size
|
|
|
|
#define pud_leaf_size(x) PUD_SIZE
|
|
|
|
#endif
|
|
|
|
#ifndef pmd_leaf_size
|
|
|
|
#define pmd_leaf_size(x) PMD_SIZE
|
|
|
|
#endif
|
2024-07-02 13:51:19 +00:00
|
|
|
#ifndef __pte_leaf_size
|
2020-11-13 10:45:36 +00:00
|
|
|
#ifndef pte_leaf_size
|
|
|
|
#define pte_leaf_size(x) PAGE_SIZE
|
|
|
|
#endif
|
2024-07-02 13:51:19 +00:00
|
|
|
#define __pte_leaf_size(x,y) pte_leaf_size(y)
|
|
|
|
#endif
|
2020-11-13 10:45:36 +00:00
|
|
|
|
2024-03-27 15:23:24 +00:00
|
|
|
/*
|
|
|
|
* We always define pmd_pfn for all archs as it's used in lots of generic
|
|
|
|
* code. Now it happens too for pud_pfn (and can happen for larger
|
|
|
|
* mappings too in the future; we're not there yet). Instead of defining
|
|
|
|
* it for all archs (like pmd_pfn), provide a fallback.
|
|
|
|
*
|
|
|
|
* Note that returning 0 here means any arch that didn't define this can
|
|
|
|
* get severely wrong when it hits a real pud leaf. It's arch's
|
|
|
|
* responsibility to properly define it when a huge pud is possible.
|
|
|
|
*/
|
|
|
|
#ifndef pud_pfn
|
|
|
|
#define pud_pfn(x) 0
|
|
|
|
#endif
|
|
|
|
|
2021-06-29 02:40:46 +00:00
|
|
|
/*
|
|
|
|
* Some architectures have MMUs that are configurable or selectable at boot
|
|
|
|
* time. These lead to variable PTRS_PER_x. For statically allocated arrays it
|
|
|
|
* helps to have a static maximum value.
|
|
|
|
*/
|
|
|
|
|
|
|
|
#ifndef MAX_PTRS_PER_PTE
|
|
|
|
#define MAX_PTRS_PER_PTE PTRS_PER_PTE
|
|
|
|
#endif
|
|
|
|
|
|
|
|
#ifndef MAX_PTRS_PER_PMD
|
|
|
|
#define MAX_PTRS_PER_PMD PTRS_PER_PMD
|
|
|
|
#endif
|
|
|
|
|
|
|
|
#ifndef MAX_PTRS_PER_PUD
|
|
|
|
#define MAX_PTRS_PER_PUD PTRS_PER_PUD
|
|
|
|
#endif
|
|
|
|
|
|
|
|
#ifndef MAX_PTRS_PER_P4D
|
|
|
|
#define MAX_PTRS_PER_P4D PTRS_PER_P4D
|
|
|
|
#endif
|
|
|
|
|
2024-08-26 20:43:42 +00:00
|
|
|
#ifndef pte_pgprot
|
|
|
|
#define pte_pgprot(x) ((pgprot_t) {0})
|
|
|
|
#endif
|
|
|
|
|
|
|
|
#ifndef pmd_pgprot
|
|
|
|
#define pmd_pgprot(x) ((pgprot_t) {0})
|
|
|
|
#endif
|
|
|
|
|
|
|
|
#ifndef pud_pgprot
|
|
|
|
#define pud_pgprot(x) ((pgprot_t) {0})
|
|
|
|
#endif
|
|
|
|
|
2022-07-11 07:05:36 +00:00
|
|
|
/* description of effects of mapping type and prot in current implementation.
|
|
|
|
* this is due to the limited x86 page protection hardware. The expected
|
|
|
|
* behavior is in parens:
|
|
|
|
*
|
|
|
|
* map_type prot
|
|
|
|
* PROT_NONE PROT_READ PROT_WRITE PROT_EXEC
|
|
|
|
* MAP_SHARED r: (no) no r: (yes) yes r: (no) yes r: (no) yes
|
|
|
|
* w: (no) no w: (no) no w: (yes) yes w: (no) no
|
|
|
|
* x: (no) no x: (no) yes x: (no) yes x: (yes) yes
|
|
|
|
*
|
|
|
|
* MAP_PRIVATE r: (no) no r: (yes) yes r: (no) yes r: (no) yes
|
|
|
|
* w: (no) no w: (no) no w: (copy) copy w: (no) no
|
|
|
|
* x: (no) no x: (no) yes x: (no) yes x: (yes) yes
|
|
|
|
*
|
|
|
|
* On arm64, PROT_EXEC has the following behaviour for both MAP_SHARED and
|
|
|
|
* MAP_PRIVATE (with Enhanced PAN supported):
|
|
|
|
* r: (no) no
|
|
|
|
* w: (no) no
|
|
|
|
* x: (yes) yes
|
|
|
|
*/
|
|
|
|
#define DECLARE_VM_GET_PAGE_PROT \
|
|
|
|
pgprot_t vm_get_page_prot(unsigned long vm_flags) \
|
|
|
|
{ \
|
|
|
|
return protection_map[vm_flags & \
|
|
|
|
(VM_READ | VM_WRITE | VM_EXEC | VM_SHARED)]; \
|
|
|
|
} \
|
|
|
|
EXPORT_SYMBOL(vm_get_page_prot);
|
|
|
|
|
2020-06-09 04:32:38 +00:00
|
|
|
#endif /* _LINUX_PGTABLE_H */
|