License cleanup: add SPDX GPL-2.0 license identifier to files with no license
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.
By default all files without license information are under the default
license of the kernel, which is GPL version 2.
Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.
This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.
How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information it it.
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information,
Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.
The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.
The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.
Criteria used to select files for SPDX license identifier tagging was:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).
All documentation files were explicitly excluded.
The following heuristics were used to determine which SPDX license
identifiers to apply.
- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.
For non */uapi/* files that summary was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 11139
and resulted in the first patch in this series.
If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note 930
and resulted in the second patch in this series.
- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:
SPDX license identifier # files
---------------------------------------------------|------
GPL-2.0 WITH Linux-syscall-note 270
GPL-2.0+ WITH Linux-syscall-note 169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17
LGPL-2.1+ WITH Linux-syscall-note 15
GPL-1.0+ WITH Linux-syscall-note 14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5
LGPL-2.0+ WITH Linux-syscall-note 4
LGPL-2.1 WITH Linux-syscall-note 3
((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3
((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1
and that resulted in the third patch in this series.
- when the two scanners agreed on the detected license(s), that became
the concluded license(s).
- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.
- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).
- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.
- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.
In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.
Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there was new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.
Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.
In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.
Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct
This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.
These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-01 14:07:57 +00:00
|
|
|
// SPDX-License-Identifier: GPL-2.0
|
2011-01-13 23:46:40 +00:00
|
|
|
/*
|
|
|
|
* mm/pgtable-generic.c
|
|
|
|
*
|
2020-06-09 04:32:38 +00:00
|
|
|
* Generic pgtable methods declared in linux/pgtable.h
|
2011-01-13 23:46:40 +00:00
|
|
|
*
|
|
|
|
* Copyright (C) 2010 Linus Torvalds
|
|
|
|
*/
|
|
|
|
|
2011-01-25 23:07:11 +00:00
|
|
|
#include <linux/pagemap.h>
|
2018-08-26 10:56:48 +00:00
|
|
|
#include <linux/hugetlb.h>
|
2020-06-09 04:32:38 +00:00
|
|
|
#include <linux/pgtable.h>
|
mm/pgtable: allow pte_offset_map[_lock]() to fail
Make pte_offset_map() a wrapper for __pte_offset_map() (optionally outputs
pmdval), pte_offset_map_lock() a sparse __cond_lock wrapper for
__pte_offset_map_lock(): those __funcs added in mm/pgtable-generic.c.
__pte_offset_map() do pmdval validation (including pmd_clear_bad() when
pmd_bad()), returning NULL if pmdval is not for a page table.
__pte_offset_map_lock() verify pmdval unchanged after getting the lock,
trying again if it changed.
No #ifdef CONFIG_TRANSPARENT_HUGEPAGE around them: that could be done to
cover the imminent case, but we expect to generalize it later, and it
makes a mess of where to do the pmd_bad() clearing.
Add pte_offset_map_nolock(): outputs ptl like pte_offset_map_lock(),
without actually taking the lock. This will be preferred to open uses of
pte_lockptr(), because (when split ptlock is in page table's struct page)
it points to the right lock for the returned pte pointer, even if *pmd
gets changed racily afterwards.
Update corresponding Documentation.
Do not add the anticipated rcu_read_lock() and rcu_read_unlock()s yet:
they have to wait until all architectures are balancing pte_offset_map()s
with pte_unmap()s (as in the arch series posted earlier). But comment
where they will go, so that it's easy to add them for experiments. And
only when those are in place can transient racy failure cases be enabled.
Add more safety for the PAE mismatched pmd_low pmd_high case at that time.
Link: https://lkml.kernel.org/r/2929bfd-9893-a374-e463-4c3127ff9b9d@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Song Liu <song@kernel.org>
Cc: Steven Price <steven.price@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Zack Rusin <zackr@vmware.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 01:10:32 +00:00
|
|
|
#include <linux/swap.h>
|
|
|
|
#include <linux/swapops.h>
|
2022-01-14 22:06:10 +00:00
|
|
|
#include <linux/mm_inline.h>
|
2023-07-12 04:39:48 +00:00
|
|
|
#include <asm/pgalloc.h>
|
2011-01-13 23:46:40 +00:00
|
|
|
#include <asm/tlb.h>
|
|
|
|
|
2013-09-11 21:21:28 +00:00
|
|
|
/*
|
|
|
|
* If a p?d_bad entry is found while walking page tables, report
|
|
|
|
* the error, before resetting entry to p?d_none. Usually (but
|
|
|
|
* very seldom) called out from the p?d_none_or_clear_bad macros.
|
|
|
|
*/
|
|
|
|
|
|
|
|
void pgd_clear_bad(pgd_t *pgd)
|
|
|
|
{
|
|
|
|
pgd_ERROR(*pgd);
|
|
|
|
pgd_clear(pgd);
|
|
|
|
}
|
|
|
|
|
2019-12-01 01:51:20 +00:00
|
|
|
#ifndef __PAGETABLE_P4D_FOLDED
|
2017-03-09 14:24:07 +00:00
|
|
|
void p4d_clear_bad(p4d_t *p4d)
|
|
|
|
{
|
|
|
|
p4d_ERROR(*p4d);
|
|
|
|
p4d_clear(p4d);
|
|
|
|
}
|
2019-12-01 01:51:20 +00:00
|
|
|
#endif
|
2017-03-09 14:24:07 +00:00
|
|
|
|
2019-12-01 01:51:20 +00:00
|
|
|
#ifndef __PAGETABLE_PUD_FOLDED
|
2013-09-11 21:21:28 +00:00
|
|
|
void pud_clear_bad(pud_t *pud)
|
|
|
|
{
|
|
|
|
pud_ERROR(*pud);
|
|
|
|
pud_clear(pud);
|
|
|
|
}
|
2019-12-01 01:51:20 +00:00
|
|
|
#endif
|
2013-09-11 21:21:28 +00:00
|
|
|
|
2019-12-01 01:51:20 +00:00
|
|
|
/*
|
|
|
|
* Note that the pmd variant below can't be stub'ed out just as for p4d/pud
|
|
|
|
* above. pmd folding is special and typically pmd_* macros refer to upper
|
|
|
|
* level even when folded
|
|
|
|
*/
|
2013-09-11 21:21:28 +00:00
|
|
|
void pmd_clear_bad(pmd_t *pmd)
|
|
|
|
{
|
|
|
|
pmd_ERROR(*pmd);
|
|
|
|
pmd_clear(pmd);
|
|
|
|
}
|
|
|
|
|
2011-01-13 23:46:40 +00:00
|
|
|
#ifndef __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
|
|
|
|
/*
|
2020-06-09 04:32:38 +00:00
|
|
|
* Only sets the access flags (dirty, accessed), as well as write
|
2012-11-06 09:56:01 +00:00
|
|
|
* permission. Furthermore, we know it always gets set to a "more
|
2011-01-13 23:46:40 +00:00
|
|
|
* permissive" setting, which allows most architectures to optimize
|
|
|
|
* this. We return whether the PTE actually changed, which in turn
|
|
|
|
* instructs the caller to do things like update__mmu_cache. This
|
|
|
|
* used to be done in the caller, but sparc needs minor faults to
|
|
|
|
* force that call on sun4c so we changed this macro slightly
|
|
|
|
*/
|
|
|
|
int ptep_set_access_flags(struct vm_area_struct *vma,
|
|
|
|
unsigned long address, pte_t *ptep,
|
|
|
|
pte_t entry, int dirty)
|
|
|
|
{
|
mm: ptep_get() conversion
Convert all instances of direct pte_t* dereferencing to instead use
ptep_get() helper. This means that by default, the accesses change from a
C dereference to a READ_ONCE(). This is technically the correct thing to
do since where pgtables are modified by HW (for access/dirty) they are
volatile and therefore we should always ensure READ_ONCE() semantics.
But more importantly, by always using the helper, it can be overridden by
the architecture to fully encapsulate the contents of the pte. Arch code
is deliberately not converted, as the arch code knows best. It is
intended that arch code (arm64) will override the default with its own
implementation that can (e.g.) hide certain bits from the core code, or
determine young/dirty status by mixing in state from another source.
Conversion was done using Coccinelle:
----
// $ make coccicheck \
// COCCI=ptepget.cocci \
// SPFLAGS="--include-headers" \
// MODE=patch
virtual patch
@ depends on patch @
pte_t *v;
@@
- *v
+ ptep_get(v)
----
Then reviewed and hand-edited to avoid multiple unnecessary calls to
ptep_get(), instead opting to store the result of a single call in a
variable, where it is correct to do so. This aims to negate any cost of
READ_ONCE() and will benefit arch-overrides that may be more complex.
Included is a fix for an issue in an earlier version of this patch that
was pointed out by kernel test robot. The issue arose because config
MMU=n elides definition of the ptep helper functions, including
ptep_get(). HUGETLB_PAGE=n configs still define a simple
huge_ptep_clear_flush() for linking purposes, which dereferences the ptep.
So when both configs are disabled, this caused a build error because
ptep_get() is not defined. Fix by continuing to do a direct dereference
when MMU=n. This is safe because for this config the arch code cannot be
trying to virtualize the ptes because none of the ptep helpers are
defined.
Link: https://lkml.kernel.org/r/20230612151545.3317766-4-ryan.roberts@arm.com
Reported-by: kernel test robot <lkp@intel.com>
Link: https://lore.kernel.org/oe-kbuild-all/202305120142.yXsNEo6H-lkp@intel.com/
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Dave Airlie <airlied@gmail.com>
Cc: Dimitri Sivanich <dimitri.sivanich@hpe.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: SeongJae Park <sj@kernel.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-12 15:15:45 +00:00
|
|
|
int changed = !pte_same(ptep_get(ptep), entry);
|
2011-01-13 23:46:40 +00:00
|
|
|
if (changed) {
|
|
|
|
set_pte_at(vma->vm_mm, address, ptep, entry);
|
2023-03-06 16:15:48 +00:00
|
|
|
flush_tlb_fix_spurious_fault(vma, address, ptep);
|
2011-01-13 23:46:40 +00:00
|
|
|
}
|
|
|
|
return changed;
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
2015-07-09 11:49:30 +00:00
|
|
|
#ifndef __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
|
|
|
|
int ptep_clear_flush_young(struct vm_area_struct *vma,
|
|
|
|
unsigned long address, pte_t *ptep)
|
|
|
|
{
|
|
|
|
int young;
|
|
|
|
young = ptep_test_and_clear_young(vma, address, ptep);
|
|
|
|
if (young)
|
|
|
|
flush_tlb_page(vma, address);
|
|
|
|
return young;
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
|
|
|
#ifndef __HAVE_ARCH_PTEP_CLEAR_FLUSH
|
|
|
|
pte_t ptep_clear_flush(struct vm_area_struct *vma, unsigned long address,
|
|
|
|
pte_t *ptep)
|
|
|
|
{
|
|
|
|
struct mm_struct *mm = (vma)->vm_mm;
|
|
|
|
pte_t pte;
|
|
|
|
pte = ptep_get_and_clear(mm, address, ptep);
|
|
|
|
if (pte_accessible(mm, pte))
|
|
|
|
flush_tlb_page(vma, address);
|
|
|
|
return pte;
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
2015-07-09 11:52:44 +00:00
|
|
|
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
|
|
|
|
|
2011-01-13 23:46:40 +00:00
|
|
|
#ifndef __HAVE_ARCH_PMDP_SET_ACCESS_FLAGS
|
|
|
|
int pmdp_set_access_flags(struct vm_area_struct *vma,
|
|
|
|
unsigned long address, pmd_t *pmdp,
|
|
|
|
pmd_t entry, int dirty)
|
|
|
|
{
|
|
|
|
int changed = !pmd_same(*pmdp, entry);
|
|
|
|
VM_BUG_ON(address & ~HPAGE_PMD_MASK);
|
|
|
|
if (changed) {
|
|
|
|
set_pmd_at(vma->vm_mm, address, pmdp, entry);
|
2015-02-20 05:06:28 +00:00
|
|
|
flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
|
2011-01-13 23:46:40 +00:00
|
|
|
}
|
|
|
|
return changed;
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
|
|
|
#ifndef __HAVE_ARCH_PMDP_CLEAR_YOUNG_FLUSH
|
|
|
|
int pmdp_clear_flush_young(struct vm_area_struct *vma,
|
|
|
|
unsigned long address, pmd_t *pmdp)
|
|
|
|
{
|
|
|
|
int young;
|
2012-03-21 23:34:27 +00:00
|
|
|
VM_BUG_ON(address & ~HPAGE_PMD_MASK);
|
2011-01-13 23:46:40 +00:00
|
|
|
young = pmdp_test_and_clear_young(vma, address, pmdp);
|
|
|
|
if (young)
|
2015-02-20 05:06:28 +00:00
|
|
|
flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
|
2011-01-13 23:46:40 +00:00
|
|
|
return young;
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
2015-06-24 23:57:44 +00:00
|
|
|
#ifndef __HAVE_ARCH_PMDP_HUGE_CLEAR_FLUSH
|
|
|
|
pmd_t pmdp_huge_clear_flush(struct vm_area_struct *vma, unsigned long address,
|
|
|
|
pmd_t *pmdp)
|
2011-01-13 23:46:40 +00:00
|
|
|
{
|
|
|
|
pmd_t pmd;
|
|
|
|
VM_BUG_ON(address & ~HPAGE_PMD_MASK);
|
mm/thp: fix __split_huge_pmd_locked() on shmem migration entry
Patch series "mm/thp: fix THP splitting unmap BUGs and related", v10.
Here is v2 batch of long-standing THP bug fixes that I had not got
around to sending before, but prompted now by Wang Yugui's report
https://lore.kernel.org/linux-mm/20210412180659.B9E3.409509F4@e16-tech.com/
Wang Yugui has tested a rollup of these fixes applied to 5.10.39, and
they have done no harm, but have *not* fixed that issue: something more
is needed and I have no idea of what.
This patch (of 7):
Stressing huge tmpfs page migration racing hole punch often crashed on
the VM_BUG_ON(!pmd_present) in pmdp_huge_clear_flush(), with DEBUG_VM=y
kernel; or shortly afterwards, on a bad dereference in
__split_huge_pmd_locked() when DEBUG_VM=n. They forgot to allow for pmd
migration entries in the non-anonymous case.
Full disclosure: those particular experiments were on a kernel with more
relaxed mmap_lock and i_mmap_rwsem locking, and were not repeated on the
vanilla kernel: it is conceivable that stricter locking happens to avoid
those cases, or makes them less likely; but __split_huge_pmd_locked()
already allowed for pmd migration entries when handling anonymous THPs,
so this commit brings the shmem and file THP handling into line.
And while there: use old_pmd rather than _pmd, as in the following
blocks; and make it clearer to the eye that the !vma_is_anonymous()
block is self-contained, making an early return after accounting for
unmapping.
Link: https://lkml.kernel.org/r/af88612-1473-2eaa-903-8d1a448b26@google.com
Link: https://lkml.kernel.org/r/dd221a99-efb3-cd1d-6256-7e646af29314@google.com
Fixes: e71769ae5260 ("mm: enable thp migration for shmem thp")
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Wang Yugui <wangyugui@e16-tech.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Jue Wang <juew@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-16 01:23:45 +00:00
|
|
|
VM_BUG_ON(pmd_present(*pmdp) && !pmd_trans_huge(*pmdp) &&
|
|
|
|
!pmd_devmap(*pmdp));
|
2015-06-24 23:57:44 +00:00
|
|
|
pmd = pmdp_huge_get_and_clear(vma->vm_mm, address, pmdp);
|
2015-02-20 05:06:28 +00:00
|
|
|
flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
|
2011-01-13 23:46:40 +00:00
|
|
|
return pmd;
|
|
|
|
}
|
2017-02-24 22:57:02 +00:00
|
|
|
|
|
|
|
#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
|
|
|
|
pud_t pudp_huge_clear_flush(struct vm_area_struct *vma, unsigned long address,
|
|
|
|
pud_t *pudp)
|
|
|
|
{
|
|
|
|
pud_t pud;
|
|
|
|
|
|
|
|
VM_BUG_ON(address & ~HPAGE_PUD_MASK);
|
|
|
|
VM_BUG_ON(!pud_trans_huge(*pudp) && !pud_devmap(*pudp));
|
|
|
|
pud = pudp_huge_get_and_clear(vma->vm_mm, address, pudp);
|
|
|
|
flush_pud_tlb_range(vma, address, address + HPAGE_PUD_SIZE);
|
|
|
|
return pud;
|
|
|
|
}
|
|
|
|
#endif
|
2011-01-13 23:46:40 +00:00
|
|
|
#endif
|
|
|
|
|
2012-10-08 23:30:07 +00:00
|
|
|
#ifndef __HAVE_ARCH_PGTABLE_DEPOSIT
|
2013-06-06 00:14:02 +00:00
|
|
|
void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
|
|
|
|
pgtable_t pgtable)
|
2012-10-08 23:30:07 +00:00
|
|
|
{
|
2013-11-14 22:31:04 +00:00
|
|
|
assert_spin_locked(pmd_lockptr(mm, pmdp));
|
2012-10-08 23:30:07 +00:00
|
|
|
|
|
|
|
/* FIFO */
|
2013-11-14 22:30:59 +00:00
|
|
|
if (!pmd_huge_pte(mm, pmdp))
|
2012-10-08 23:30:07 +00:00
|
|
|
INIT_LIST_HEAD(&pgtable->lru);
|
|
|
|
else
|
2013-11-14 22:30:59 +00:00
|
|
|
list_add(&pgtable->lru, &pmd_huge_pte(mm, pmdp)->lru);
|
|
|
|
pmd_huge_pte(mm, pmdp) = pgtable;
|
2012-10-08 23:30:07 +00:00
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
|
|
|
#ifndef __HAVE_ARCH_PGTABLE_WITHDRAW
|
|
|
|
/* no "address" argument so destroys page coloring of some arch */
|
2013-06-06 00:14:02 +00:00
|
|
|
pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp)
|
2012-10-08 23:30:07 +00:00
|
|
|
{
|
|
|
|
pgtable_t pgtable;
|
|
|
|
|
2013-11-14 22:31:04 +00:00
|
|
|
assert_spin_locked(pmd_lockptr(mm, pmdp));
|
2012-10-08 23:30:07 +00:00
|
|
|
|
|
|
|
/* FIFO */
|
2013-11-14 22:30:59 +00:00
|
|
|
pgtable = pmd_huge_pte(mm, pmdp);
|
2016-01-14 23:19:32 +00:00
|
|
|
pmd_huge_pte(mm, pmdp) = list_first_entry_or_null(&pgtable->lru,
|
|
|
|
struct page, lru);
|
|
|
|
if (pmd_huge_pte(mm, pmdp))
|
2012-10-08 23:30:07 +00:00
|
|
|
list_del(&pgtable->lru);
|
|
|
|
return pgtable;
|
|
|
|
}
|
|
|
|
#endif
|
2012-10-08 23:30:09 +00:00
|
|
|
|
|
|
|
#ifndef __HAVE_ARCH_PMDP_INVALIDATE
|
2018-02-01 00:18:16 +00:00
|
|
|
pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
|
2012-10-08 23:30:09 +00:00
|
|
|
pmd_t *pmdp)
|
|
|
|
{
|
2024-05-01 14:33:10 +00:00
|
|
|
VM_WARN_ON_ONCE(!pmd_present(*pmdp));
|
2020-06-03 23:03:45 +00:00
|
|
|
pmd_t old = pmdp_establish(vma, address, pmdp, pmd_mkinvalid(*pmdp));
|
2015-02-20 05:06:28 +00:00
|
|
|
flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
|
2018-02-01 00:18:16 +00:00
|
|
|
return old;
|
2012-10-08 23:30:09 +00:00
|
|
|
}
|
|
|
|
#endif
|
2015-06-24 23:57:42 +00:00
|
|
|
|
2022-05-10 01:20:50 +00:00
|
|
|
#ifndef __HAVE_ARCH_PMDP_INVALIDATE_AD
|
|
|
|
pmd_t pmdp_invalidate_ad(struct vm_area_struct *vma, unsigned long address,
|
|
|
|
pmd_t *pmdp)
|
|
|
|
{
|
2024-05-01 14:33:10 +00:00
|
|
|
VM_WARN_ON_ONCE(!pmd_present(*pmdp));
|
2022-05-10 01:20:50 +00:00
|
|
|
return pmdp_invalidate(vma, address, pmdp);
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
2015-06-24 23:57:42 +00:00
|
|
|
#ifndef pmdp_collapse_flush
|
|
|
|
pmd_t pmdp_collapse_flush(struct vm_area_struct *vma, unsigned long address,
|
|
|
|
pmd_t *pmdp)
|
|
|
|
{
|
2015-06-24 23:57:44 +00:00
|
|
|
/*
|
|
|
|
* pmd and hugepage pte format are same. So we could
|
|
|
|
* use the same function.
|
|
|
|
*/
|
2015-06-24 23:57:42 +00:00
|
|
|
pmd_t pmd;
|
|
|
|
|
|
|
|
VM_BUG_ON(address & ~HPAGE_PMD_MASK);
|
|
|
|
VM_BUG_ON(pmd_trans_huge(*pmdp));
|
2015-06-24 23:57:44 +00:00
|
|
|
pmd = pmdp_huge_get_and_clear(vma->vm_mm, address, pmdp);
|
2016-02-12 00:13:09 +00:00
|
|
|
|
|
|
|
/* collapse entails shooting down ptes not pmd */
|
|
|
|
flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
|
2015-06-24 23:57:42 +00:00
|
|
|
return pmd;
|
|
|
|
}
|
|
|
|
#endif
|
2023-07-12 04:39:48 +00:00
|
|
|
|
|
|
|
/* arch define pte_free_defer in asm/pgalloc.h for its own implementation */
|
|
|
|
#ifndef pte_free_defer
|
|
|
|
static void pte_free_now(struct rcu_head *head)
|
|
|
|
{
|
|
|
|
struct page *page;
|
|
|
|
|
|
|
|
page = container_of(head, struct page, rcu_head);
|
|
|
|
pte_free(NULL /* mm not passed and not used */, (pgtable_t)page);
|
|
|
|
}
|
|
|
|
|
|
|
|
void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
|
|
|
|
{
|
|
|
|
struct page *page;
|
|
|
|
|
|
|
|
page = pgtable;
|
|
|
|
call_rcu(&page->rcu_head, pte_free_now);
|
|
|
|
}
|
|
|
|
#endif /* pte_free_defer */
|
2015-07-09 11:52:44 +00:00
|
|
|
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
|
mm/pgtable: allow pte_offset_map[_lock]() to fail
Make pte_offset_map() a wrapper for __pte_offset_map() (optionally outputs
pmdval), pte_offset_map_lock() a sparse __cond_lock wrapper for
__pte_offset_map_lock(): those __funcs added in mm/pgtable-generic.c.
__pte_offset_map() do pmdval validation (including pmd_clear_bad() when
pmd_bad()), returning NULL if pmdval is not for a page table.
__pte_offset_map_lock() verify pmdval unchanged after getting the lock,
trying again if it changed.
No #ifdef CONFIG_TRANSPARENT_HUGEPAGE around them: that could be done to
cover the imminent case, but we expect to generalize it later, and it
makes a mess of where to do the pmd_bad() clearing.
Add pte_offset_map_nolock(): outputs ptl like pte_offset_map_lock(),
without actually taking the lock. This will be preferred to open uses of
pte_lockptr(), because (when split ptlock is in page table's struct page)
it points to the right lock for the returned pte pointer, even if *pmd
gets changed racily afterwards.
Update corresponding Documentation.
Do not add the anticipated rcu_read_lock() and rcu_read_unlock()s yet:
they have to wait until all architectures are balancing pte_offset_map()s
with pte_unmap()s (as in the arch series posted earlier). But comment
where they will go, so that it's easy to add them for experiments. And
only when those are in place can transient racy failure cases be enabled.
Add more safety for the PAE mismatched pmd_low pmd_high case at that time.
Link: https://lkml.kernel.org/r/2929bfd-9893-a374-e463-4c3127ff9b9d@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Song Liu <song@kernel.org>
Cc: Steven Price <steven.price@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Zack Rusin <zackr@vmware.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 01:10:32 +00:00
|
|
|
|
mm/pgtable: add PAE safety to __pte_offset_map()
There is a faint risk that __pte_offset_map(), on a 32-bit architecture
with a 64-bit pmd_t e.g. x86-32 with CONFIG_X86_PAE=y, would succeed on a
pmdval assembled from a pmd_low and a pmd_high which never belonged
together: their combination not pointing to a page table at all, perhaps
not even a valid pfn. pmdp_get_lockless() is not enough to prevent that.
Guard against that (on such configs) by local_irq_save() blocking TLB
flush between present updates, as linux/pgtable.h suggests. It's only
needed around the pmdp_get_lockless() in __pte_offset_map(): a race when
__pte_offset_map_lock() repeats the pmdp_get_lockless() after getting the
lock, would just send it back to __pte_offset_map() again.
Complement this pmdp_get_lockless_start() and pmdp_get_lockless_end(),
used only locally in __pte_offset_map(), with a pmdp_get_lockless_sync()
synonym for tlb_remove_table_sync_one(): to send the necessary interrupt
at the right moment on those configs which do not already send it.
CONFIG_GUP_GET_PXX_LOW_HIGH is enabled when required by mips, sh and x86.
It is not enabled by arm-32 CONFIG_ARM_LPAE: my understanding is that Will
Deacon's 2020 enhancements to READ_ONCE() are sufficient for arm. It is
not enabled by arc, but its pmd_t is 32-bit even when pte_t 64-bit.
Limit the IRQ disablement to CONFIG_HIGHPTE? Perhaps, but would need a
little more work, to retry if pmd_low good for page table, but pmd_high
non-zero from THP (and that might be making x86-specific assumptions).
Link: https://lkml.kernel.org/r/3adcd8f-9191-2df1-d7ea-c4877698aad@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Huang, Ying <ying.huang@intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: SeongJae Park <sj@kernel.org>
Cc: Song Liu <song@kernel.org>
Cc: Steven Price <steven.price@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Zack Rusin <zackr@vmware.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-07-12 04:32:05 +00:00
|
|
|
#if defined(CONFIG_GUP_GET_PXX_LOW_HIGH) && \
|
|
|
|
(defined(CONFIG_SMP) || defined(CONFIG_PREEMPT_RCU))
|
|
|
|
/*
|
|
|
|
* See the comment above ptep_get_lockless() in include/linux/pgtable.h:
|
|
|
|
* the barriers in pmdp_get_lockless() cannot guarantee that the value in
|
|
|
|
* pmd_high actually belongs with the value in pmd_low; but holding interrupts
|
|
|
|
* off blocks the TLB flush between present updates, which guarantees that a
|
|
|
|
* successful __pte_offset_map() points to a page from matched halves.
|
|
|
|
*/
|
|
|
|
static unsigned long pmdp_get_lockless_start(void)
|
|
|
|
{
|
|
|
|
unsigned long irqflags;
|
|
|
|
|
|
|
|
local_irq_save(irqflags);
|
|
|
|
return irqflags;
|
|
|
|
}
|
|
|
|
static void pmdp_get_lockless_end(unsigned long irqflags)
|
|
|
|
{
|
|
|
|
local_irq_restore(irqflags);
|
|
|
|
}
|
|
|
|
#else
|
|
|
|
static unsigned long pmdp_get_lockless_start(void) { return 0; }
|
|
|
|
static void pmdp_get_lockless_end(unsigned long irqflags) { }
|
|
|
|
#endif
|
|
|
|
|
2024-12-10 00:06:04 +00:00
|
|
|
pte_t *___pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp)
|
mm/pgtable: allow pte_offset_map[_lock]() to fail
Make pte_offset_map() a wrapper for __pte_offset_map() (optionally outputs
pmdval), pte_offset_map_lock() a sparse __cond_lock wrapper for
__pte_offset_map_lock(): those __funcs added in mm/pgtable-generic.c.
__pte_offset_map() do pmdval validation (including pmd_clear_bad() when
pmd_bad()), returning NULL if pmdval is not for a page table.
__pte_offset_map_lock() verify pmdval unchanged after getting the lock,
trying again if it changed.
No #ifdef CONFIG_TRANSPARENT_HUGEPAGE around them: that could be done to
cover the imminent case, but we expect to generalize it later, and it
makes a mess of where to do the pmd_bad() clearing.
Add pte_offset_map_nolock(): outputs ptl like pte_offset_map_lock(),
without actually taking the lock. This will be preferred to open uses of
pte_lockptr(), because (when split ptlock is in page table's struct page)
it points to the right lock for the returned pte pointer, even if *pmd
gets changed racily afterwards.
Update corresponding Documentation.
Do not add the anticipated rcu_read_lock() and rcu_read_unlock()s yet:
they have to wait until all architectures are balancing pte_offset_map()s
with pte_unmap()s (as in the arch series posted earlier). But comment
where they will go, so that it's easy to add them for experiments. And
only when those are in place can transient racy failure cases be enabled.
Add more safety for the PAE mismatched pmd_low pmd_high case at that time.
Link: https://lkml.kernel.org/r/2929bfd-9893-a374-e463-4c3127ff9b9d@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Song Liu <song@kernel.org>
Cc: Steven Price <steven.price@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Zack Rusin <zackr@vmware.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 01:10:32 +00:00
|
|
|
{
|
mm/pgtable: add PAE safety to __pte_offset_map()
There is a faint risk that __pte_offset_map(), on a 32-bit architecture
with a 64-bit pmd_t e.g. x86-32 with CONFIG_X86_PAE=y, would succeed on a
pmdval assembled from a pmd_low and a pmd_high which never belonged
together: their combination not pointing to a page table at all, perhaps
not even a valid pfn. pmdp_get_lockless() is not enough to prevent that.
Guard against that (on such configs) by local_irq_save() blocking TLB
flush between present updates, as linux/pgtable.h suggests. It's only
needed around the pmdp_get_lockless() in __pte_offset_map(): a race when
__pte_offset_map_lock() repeats the pmdp_get_lockless() after getting the
lock, would just send it back to __pte_offset_map() again.
Complement this pmdp_get_lockless_start() and pmdp_get_lockless_end(),
used only locally in __pte_offset_map(), with a pmdp_get_lockless_sync()
synonym for tlb_remove_table_sync_one(): to send the necessary interrupt
at the right moment on those configs which do not already send it.
CONFIG_GUP_GET_PXX_LOW_HIGH is enabled when required by mips, sh and x86.
It is not enabled by arm-32 CONFIG_ARM_LPAE: my understanding is that Will
Deacon's 2020 enhancements to READ_ONCE() are sufficient for arm. It is
not enabled by arc, but its pmd_t is 32-bit even when pte_t 64-bit.
Limit the IRQ disablement to CONFIG_HIGHPTE? Perhaps, but would need a
little more work, to retry if pmd_low good for page table, but pmd_high
non-zero from THP (and that might be making x86-specific assumptions).
Link: https://lkml.kernel.org/r/3adcd8f-9191-2df1-d7ea-c4877698aad@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Huang, Ying <ying.huang@intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: SeongJae Park <sj@kernel.org>
Cc: Song Liu <song@kernel.org>
Cc: Steven Price <steven.price@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Zack Rusin <zackr@vmware.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-07-12 04:32:05 +00:00
|
|
|
unsigned long irqflags;
|
mm/pgtable: allow pte_offset_map[_lock]() to fail
Make pte_offset_map() a wrapper for __pte_offset_map() (optionally outputs
pmdval), pte_offset_map_lock() a sparse __cond_lock wrapper for
__pte_offset_map_lock(): those __funcs added in mm/pgtable-generic.c.
__pte_offset_map() do pmdval validation (including pmd_clear_bad() when
pmd_bad()), returning NULL if pmdval is not for a page table.
__pte_offset_map_lock() verify pmdval unchanged after getting the lock,
trying again if it changed.
No #ifdef CONFIG_TRANSPARENT_HUGEPAGE around them: that could be done to
cover the imminent case, but we expect to generalize it later, and it
makes a mess of where to do the pmd_bad() clearing.
Add pte_offset_map_nolock(): outputs ptl like pte_offset_map_lock(),
without actually taking the lock. This will be preferred to open uses of
pte_lockptr(), because (when split ptlock is in page table's struct page)
it points to the right lock for the returned pte pointer, even if *pmd
gets changed racily afterwards.
Update corresponding Documentation.
Do not add the anticipated rcu_read_lock() and rcu_read_unlock()s yet:
they have to wait until all architectures are balancing pte_offset_map()s
with pte_unmap()s (as in the arch series posted earlier). But comment
where they will go, so that it's easy to add them for experiments. And
only when those are in place can transient racy failure cases be enabled.
Add more safety for the PAE mismatched pmd_low pmd_high case at that time.
Link: https://lkml.kernel.org/r/2929bfd-9893-a374-e463-4c3127ff9b9d@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Song Liu <song@kernel.org>
Cc: Steven Price <steven.price@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Zack Rusin <zackr@vmware.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 01:10:32 +00:00
|
|
|
pmd_t pmdval;
|
|
|
|
|
2023-07-12 04:30:40 +00:00
|
|
|
rcu_read_lock();
|
mm/pgtable: add PAE safety to __pte_offset_map()
There is a faint risk that __pte_offset_map(), on a 32-bit architecture
with a 64-bit pmd_t e.g. x86-32 with CONFIG_X86_PAE=y, would succeed on a
pmdval assembled from a pmd_low and a pmd_high which never belonged
together: their combination not pointing to a page table at all, perhaps
not even a valid pfn. pmdp_get_lockless() is not enough to prevent that.
Guard against that (on such configs) by local_irq_save() blocking TLB
flush between present updates, as linux/pgtable.h suggests. It's only
needed around the pmdp_get_lockless() in __pte_offset_map(): a race when
__pte_offset_map_lock() repeats the pmdp_get_lockless() after getting the
lock, would just send it back to __pte_offset_map() again.
Complement this pmdp_get_lockless_start() and pmdp_get_lockless_end(),
used only locally in __pte_offset_map(), with a pmdp_get_lockless_sync()
synonym for tlb_remove_table_sync_one(): to send the necessary interrupt
at the right moment on those configs which do not already send it.
CONFIG_GUP_GET_PXX_LOW_HIGH is enabled when required by mips, sh and x86.
It is not enabled by arm-32 CONFIG_ARM_LPAE: my understanding is that Will
Deacon's 2020 enhancements to READ_ONCE() are sufficient for arm. It is
not enabled by arc, but its pmd_t is 32-bit even when pte_t 64-bit.
Limit the IRQ disablement to CONFIG_HIGHPTE? Perhaps, but would need a
little more work, to retry if pmd_low good for page table, but pmd_high
non-zero from THP (and that might be making x86-specific assumptions).
Link: https://lkml.kernel.org/r/3adcd8f-9191-2df1-d7ea-c4877698aad@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Huang, Ying <ying.huang@intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: SeongJae Park <sj@kernel.org>
Cc: Song Liu <song@kernel.org>
Cc: Steven Price <steven.price@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Zack Rusin <zackr@vmware.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-07-12 04:32:05 +00:00
|
|
|
irqflags = pmdp_get_lockless_start();
|
mm/pgtable: allow pte_offset_map[_lock]() to fail
Make pte_offset_map() a wrapper for __pte_offset_map() (optionally outputs
pmdval), pte_offset_map_lock() a sparse __cond_lock wrapper for
__pte_offset_map_lock(): those __funcs added in mm/pgtable-generic.c.
__pte_offset_map() do pmdval validation (including pmd_clear_bad() when
pmd_bad()), returning NULL if pmdval is not for a page table.
__pte_offset_map_lock() verify pmdval unchanged after getting the lock,
trying again if it changed.
No #ifdef CONFIG_TRANSPARENT_HUGEPAGE around them: that could be done to
cover the imminent case, but we expect to generalize it later, and it
makes a mess of where to do the pmd_bad() clearing.
Add pte_offset_map_nolock(): outputs ptl like pte_offset_map_lock(),
without actually taking the lock. This will be preferred to open uses of
pte_lockptr(), because (when split ptlock is in page table's struct page)
it points to the right lock for the returned pte pointer, even if *pmd
gets changed racily afterwards.
Update corresponding Documentation.
Do not add the anticipated rcu_read_lock() and rcu_read_unlock()s yet:
they have to wait until all architectures are balancing pte_offset_map()s
with pte_unmap()s (as in the arch series posted earlier). But comment
where they will go, so that it's easy to add them for experiments. And
only when those are in place can transient racy failure cases be enabled.
Add more safety for the PAE mismatched pmd_low pmd_high case at that time.
Link: https://lkml.kernel.org/r/2929bfd-9893-a374-e463-4c3127ff9b9d@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Song Liu <song@kernel.org>
Cc: Steven Price <steven.price@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Zack Rusin <zackr@vmware.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 01:10:32 +00:00
|
|
|
pmdval = pmdp_get_lockless(pmd);
|
mm/pgtable: add PAE safety to __pte_offset_map()
There is a faint risk that __pte_offset_map(), on a 32-bit architecture
with a 64-bit pmd_t e.g. x86-32 with CONFIG_X86_PAE=y, would succeed on a
pmdval assembled from a pmd_low and a pmd_high which never belonged
together: their combination not pointing to a page table at all, perhaps
not even a valid pfn. pmdp_get_lockless() is not enough to prevent that.
Guard against that (on such configs) by local_irq_save() blocking TLB
flush between present updates, as linux/pgtable.h suggests. It's only
needed around the pmdp_get_lockless() in __pte_offset_map(): a race when
__pte_offset_map_lock() repeats the pmdp_get_lockless() after getting the
lock, would just send it back to __pte_offset_map() again.
Complement this pmdp_get_lockless_start() and pmdp_get_lockless_end(),
used only locally in __pte_offset_map(), with a pmdp_get_lockless_sync()
synonym for tlb_remove_table_sync_one(): to send the necessary interrupt
at the right moment on those configs which do not already send it.
CONFIG_GUP_GET_PXX_LOW_HIGH is enabled when required by mips, sh and x86.
It is not enabled by arm-32 CONFIG_ARM_LPAE: my understanding is that Will
Deacon's 2020 enhancements to READ_ONCE() are sufficient for arm. It is
not enabled by arc, but its pmd_t is 32-bit even when pte_t 64-bit.
Limit the IRQ disablement to CONFIG_HIGHPTE? Perhaps, but would need a
little more work, to retry if pmd_low good for page table, but pmd_high
non-zero from THP (and that might be making x86-specific assumptions).
Link: https://lkml.kernel.org/r/3adcd8f-9191-2df1-d7ea-c4877698aad@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Huang, Ying <ying.huang@intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: SeongJae Park <sj@kernel.org>
Cc: Song Liu <song@kernel.org>
Cc: Steven Price <steven.price@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Zack Rusin <zackr@vmware.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-07-12 04:32:05 +00:00
|
|
|
pmdp_get_lockless_end(irqflags);
|
|
|
|
|
mm/pgtable: allow pte_offset_map[_lock]() to fail
Make pte_offset_map() a wrapper for __pte_offset_map() (optionally outputs
pmdval), pte_offset_map_lock() a sparse __cond_lock wrapper for
__pte_offset_map_lock(): those __funcs added in mm/pgtable-generic.c.
__pte_offset_map() do pmdval validation (including pmd_clear_bad() when
pmd_bad()), returning NULL if pmdval is not for a page table.
__pte_offset_map_lock() verify pmdval unchanged after getting the lock,
trying again if it changed.
No #ifdef CONFIG_TRANSPARENT_HUGEPAGE around them: that could be done to
cover the imminent case, but we expect to generalize it later, and it
makes a mess of where to do the pmd_bad() clearing.
Add pte_offset_map_nolock(): outputs ptl like pte_offset_map_lock(),
without actually taking the lock. This will be preferred to open uses of
pte_lockptr(), because (when split ptlock is in page table's struct page)
it points to the right lock for the returned pte pointer, even if *pmd
gets changed racily afterwards.
Update corresponding Documentation.
Do not add the anticipated rcu_read_lock() and rcu_read_unlock()s yet:
they have to wait until all architectures are balancing pte_offset_map()s
with pte_unmap()s (as in the arch series posted earlier). But comment
where they will go, so that it's easy to add them for experiments. And
only when those are in place can transient racy failure cases be enabled.
Add more safety for the PAE mismatched pmd_low pmd_high case at that time.
Link: https://lkml.kernel.org/r/2929bfd-9893-a374-e463-4c3127ff9b9d@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Song Liu <song@kernel.org>
Cc: Steven Price <steven.price@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Zack Rusin <zackr@vmware.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 01:10:32 +00:00
|
|
|
if (pmdvalp)
|
|
|
|
*pmdvalp = pmdval;
|
|
|
|
if (unlikely(pmd_none(pmdval) || is_pmd_migration_entry(pmdval)))
|
|
|
|
goto nomap;
|
|
|
|
if (unlikely(pmd_trans_huge(pmdval) || pmd_devmap(pmdval)))
|
|
|
|
goto nomap;
|
|
|
|
if (unlikely(pmd_bad(pmdval))) {
|
|
|
|
pmd_clear_bad(pmd);
|
|
|
|
goto nomap;
|
|
|
|
}
|
|
|
|
return __pte_map(&pmdval, addr);
|
|
|
|
nomap:
|
2023-07-12 04:30:40 +00:00
|
|
|
rcu_read_unlock();
|
mm/pgtable: allow pte_offset_map[_lock]() to fail
Make pte_offset_map() a wrapper for __pte_offset_map() (optionally outputs
pmdval), pte_offset_map_lock() a sparse __cond_lock wrapper for
__pte_offset_map_lock(): those __funcs added in mm/pgtable-generic.c.
__pte_offset_map() do pmdval validation (including pmd_clear_bad() when
pmd_bad()), returning NULL if pmdval is not for a page table.
__pte_offset_map_lock() verify pmdval unchanged after getting the lock,
trying again if it changed.
No #ifdef CONFIG_TRANSPARENT_HUGEPAGE around them: that could be done to
cover the imminent case, but we expect to generalize it later, and it
makes a mess of where to do the pmd_bad() clearing.
Add pte_offset_map_nolock(): outputs ptl like pte_offset_map_lock(),
without actually taking the lock. This will be preferred to open uses of
pte_lockptr(), because (when split ptlock is in page table's struct page)
it points to the right lock for the returned pte pointer, even if *pmd
gets changed racily afterwards.
Update corresponding Documentation.
Do not add the anticipated rcu_read_lock() and rcu_read_unlock()s yet:
they have to wait until all architectures are balancing pte_offset_map()s
with pte_unmap()s (as in the arch series posted earlier). But comment
where they will go, so that it's easy to add them for experiments. And
only when those are in place can transient racy failure cases be enabled.
Add more safety for the PAE mismatched pmd_low pmd_high case at that time.
Link: https://lkml.kernel.org/r/2929bfd-9893-a374-e463-4c3127ff9b9d@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Song Liu <song@kernel.org>
Cc: Steven Price <steven.price@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Zack Rusin <zackr@vmware.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 01:10:32 +00:00
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
mm: pgtable: introduce pte_offset_map_{ro|rw}_nolock()
Patch series "introduce pte_offset_map_{ro|rw}_nolock()", v5.
As proposed by David Hildenbrand [1], this series introduces the following
two new helper functions to replace pte_offset_map_nolock().
1. pte_offset_map_ro_nolock()
2. pte_offset_map_rw_nolock()
As the name suggests, pte_offset_map_ro_nolock() is used for read-only
case. In this case, only read-only operations will be performed on PTE
page after the PTL is held. The RCU lock in pte_offset_map_nolock() will
ensure that the PTE page will not be freed, and there is no need to worry
about whether the pmd entry is modified. Therefore
pte_offset_map_ro_nolock() is just a renamed version of
pte_offset_map_nolock().
pte_offset_map_rw_nolock() is used for may-write case. In this case, the
pte or pmd entry may be modified after the PTL is held, so we need to
ensure that the pmd entry has not been modified concurrently. So in
addition to the name change, it also outputs the pmdval when successful.
The users should make sure the page table is stable like checking
pte_same() or checking pmd_same() by using the output pmdval before
performing the write operations.
This series will convert all pte_offset_map_nolock() into the above two
helper functions one by one, and finally completely delete it.
This also a preparation for reclaiming the empty user PTE page table
pages.
This patch (of 13):
Currently, the usage of pte_offset_map_nolock() can be divided into the
following two cases:
1) After acquiring PTL, only read-only operations are performed on the PTE
page. In this case, the RCU lock in pte_offset_map_nolock() will ensure
that the PTE page will not be freed, and there is no need to worry
about whether the pmd entry is modified.
2) After acquiring PTL, the pte or pmd entries may be modified. At this
time, we need to ensure that the pmd entry has not been modified
concurrently.
To more clearing distinguish between these two cases, this commit
introduces two new helper functions to replace pte_offset_map_nolock().
For 1), just rename it to pte_offset_map_ro_nolock(). For 2), in addition
to changing the name to pte_offset_map_rw_nolock(), it also outputs the
pmdval when successful. It is applicable for may-write cases where any
modification operations to the page table may happen after the
corresponding spinlock is held afterwards. But the users should make sure
the page table is stable like checking pte_same() or checking pmd_same()
by using the output pmdval before performing the write operations.
Note: "RO" / "RW" expresses the intended semantics, not that the *kmap*
will be read-only/read-write protected.
Subsequent commits will convert pte_offset_map_nolock() into the above
two functions one by one, and finally completely delete it.
Link: https://lkml.kernel.org/r/cover.1727332572.git.zhengqi.arch@bytedance.com
Link: https://lkml.kernel.org/r/5aeecfa131600a454b1f3a038a1a54282ca3b856.1727332572.git.zhengqi.arch@bytedance.com
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Reviewed-by: Muchun Song <muchun.song@linux.dev>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-26 06:46:14 +00:00
|
|
|
pte_t *pte_offset_map_ro_nolock(struct mm_struct *mm, pmd_t *pmd,
|
|
|
|
unsigned long addr, spinlock_t **ptlp)
|
|
|
|
{
|
|
|
|
pmd_t pmdval;
|
|
|
|
pte_t *pte;
|
|
|
|
|
|
|
|
pte = __pte_offset_map(pmd, addr, &pmdval);
|
|
|
|
if (likely(pte))
|
|
|
|
*ptlp = pte_lockptr(mm, &pmdval);
|
|
|
|
return pte;
|
|
|
|
}
|
|
|
|
|
|
|
|
pte_t *pte_offset_map_rw_nolock(struct mm_struct *mm, pmd_t *pmd,
|
|
|
|
unsigned long addr, pmd_t *pmdvalp,
|
|
|
|
spinlock_t **ptlp)
|
|
|
|
{
|
|
|
|
pte_t *pte;
|
|
|
|
|
|
|
|
VM_WARN_ON_ONCE(!pmdvalp);
|
|
|
|
pte = __pte_offset_map(pmd, addr, pmdvalp);
|
|
|
|
if (likely(pte))
|
|
|
|
*ptlp = pte_lockptr(mm, pmdvalp);
|
|
|
|
return pte;
|
|
|
|
}
|
|
|
|
|
2023-07-12 04:46:23 +00:00
|
|
|
/*
|
|
|
|
* pte_offset_map_lock(mm, pmd, addr, ptlp), and its internal implementation
|
|
|
|
* __pte_offset_map_lock() below, is usually called with the pmd pointer for
|
|
|
|
* addr, reached by walking down the mm's pgd, p4d, pud for addr: either while
|
|
|
|
* holding mmap_lock or vma lock for read or for write; or in truncate or rmap
|
|
|
|
* context, while holding file's i_mmap_lock or anon_vma lock for read (or for
|
|
|
|
* write). In a few cases, it may be used with pmd pointing to a pmd_t already
|
|
|
|
* copied to or constructed on the stack.
|
|
|
|
*
|
|
|
|
* When successful, it returns the pte pointer for addr, with its page table
|
|
|
|
* kmapped if necessary (when CONFIG_HIGHPTE), and locked against concurrent
|
|
|
|
* modification by software, with a pointer to that spinlock in ptlp (in some
|
|
|
|
* configs mm->page_table_lock, in SPLIT_PTLOCK configs a spinlock in table's
|
|
|
|
* struct page). pte_unmap_unlock(pte, ptl) to unlock and unmap afterwards.
|
|
|
|
*
|
|
|
|
* But it is unsuccessful, returning NULL with *ptlp unchanged, if there is no
|
|
|
|
* page table at *pmd: if, for example, the page table has just been removed,
|
|
|
|
* or replaced by the huge pmd of a THP. (When successful, *pmd is rechecked
|
|
|
|
* after acquiring the ptlock, and retried internally if it changed: so that a
|
|
|
|
* page table can be safely removed or replaced by THP while holding its lock.)
|
|
|
|
*
|
|
|
|
* pte_offset_map(pmd, addr), and its internal helper __pte_offset_map() above,
|
|
|
|
* just returns the pte pointer for addr, its page table kmapped if necessary;
|
|
|
|
* or NULL if there is no page table at *pmd. It does not attempt to lock the
|
|
|
|
* page table, so cannot normally be used when the page table is to be updated,
|
|
|
|
* or when entries read must be stable. But it does take rcu_read_lock(): so
|
|
|
|
* that even when page table is racily removed, it remains a valid though empty
|
|
|
|
* and disconnected table. Until pte_unmap(pte) unmaps and rcu_read_unlock()s
|
|
|
|
* afterwards.
|
|
|
|
*
|
mm: pgtable: introduce pte_offset_map_{ro|rw}_nolock()
Patch series "introduce pte_offset_map_{ro|rw}_nolock()", v5.
As proposed by David Hildenbrand [1], this series introduces the following
two new helper functions to replace pte_offset_map_nolock().
1. pte_offset_map_ro_nolock()
2. pte_offset_map_rw_nolock()
As the name suggests, pte_offset_map_ro_nolock() is used for read-only
case. In this case, only read-only operations will be performed on PTE
page after the PTL is held. The RCU lock in pte_offset_map_nolock() will
ensure that the PTE page will not be freed, and there is no need to worry
about whether the pmd entry is modified. Therefore
pte_offset_map_ro_nolock() is just a renamed version of
pte_offset_map_nolock().
pte_offset_map_rw_nolock() is used for may-write case. In this case, the
pte or pmd entry may be modified after the PTL is held, so we need to
ensure that the pmd entry has not been modified concurrently. So in
addition to the name change, it also outputs the pmdval when successful.
The users should make sure the page table is stable like checking
pte_same() or checking pmd_same() by using the output pmdval before
performing the write operations.
This series will convert all pte_offset_map_nolock() into the above two
helper functions one by one, and finally completely delete it.
This also a preparation for reclaiming the empty user PTE page table
pages.
This patch (of 13):
Currently, the usage of pte_offset_map_nolock() can be divided into the
following two cases:
1) After acquiring PTL, only read-only operations are performed on the PTE
page. In this case, the RCU lock in pte_offset_map_nolock() will ensure
that the PTE page will not be freed, and there is no need to worry
about whether the pmd entry is modified.
2) After acquiring PTL, the pte or pmd entries may be modified. At this
time, we need to ensure that the pmd entry has not been modified
concurrently.
To more clearing distinguish between these two cases, this commit
introduces two new helper functions to replace pte_offset_map_nolock().
For 1), just rename it to pte_offset_map_ro_nolock(). For 2), in addition
to changing the name to pte_offset_map_rw_nolock(), it also outputs the
pmdval when successful. It is applicable for may-write cases where any
modification operations to the page table may happen after the
corresponding spinlock is held afterwards. But the users should make sure
the page table is stable like checking pte_same() or checking pmd_same()
by using the output pmdval before performing the write operations.
Note: "RO" / "RW" expresses the intended semantics, not that the *kmap*
will be read-only/read-write protected.
Subsequent commits will convert pte_offset_map_nolock() into the above
two functions one by one, and finally completely delete it.
Link: https://lkml.kernel.org/r/cover.1727332572.git.zhengqi.arch@bytedance.com
Link: https://lkml.kernel.org/r/5aeecfa131600a454b1f3a038a1a54282ca3b856.1727332572.git.zhengqi.arch@bytedance.com
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Reviewed-by: Muchun Song <muchun.song@linux.dev>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-26 06:46:14 +00:00
|
|
|
* pte_offset_map_ro_nolock(mm, pmd, addr, ptlp), above, is like pte_offset_map();
|
|
|
|
* but when successful, it also outputs a pointer to the spinlock in ptlp - as
|
|
|
|
* pte_offset_map_lock() does, but in this case without locking it. This helps
|
|
|
|
* the caller to avoid a later pte_lockptr(mm, *pmd), which might by that time
|
|
|
|
* act on a changed *pmd: pte_offset_map_ro_nolock() provides the correct spinlock
|
|
|
|
* pointer for the page table that it returns. Even after grabbing the spinlock,
|
|
|
|
* we might be looking either at a page table that is still mapped or one that
|
|
|
|
* was unmapped and is about to get freed. But for R/O access this is sufficient.
|
|
|
|
* So it is only applicable for read-only cases where any modification operations
|
|
|
|
* to the page table are not allowed even if the corresponding spinlock is held
|
|
|
|
* afterwards.
|
|
|
|
*
|
|
|
|
* pte_offset_map_rw_nolock(mm, pmd, addr, pmdvalp, ptlp), above, is like
|
|
|
|
* pte_offset_map_ro_nolock(); but when successful, it also outputs the pdmval.
|
|
|
|
* It is applicable for may-write cases where any modification operations to the
|
|
|
|
* page table may happen after the corresponding spinlock is held afterwards.
|
|
|
|
* But the users should make sure the page table is stable like checking pte_same()
|
|
|
|
* or checking pmd_same() by using the output pmdval before performing the write
|
|
|
|
* operations.
|
|
|
|
*
|
|
|
|
* Note: "RO" / "RW" expresses the intended semantics, not that the *kmap* will
|
|
|
|
* be read-only/read-write protected.
|
|
|
|
*
|
2023-07-12 04:46:23 +00:00
|
|
|
* Note that free_pgtables(), used after unmapping detached vmas, or when
|
|
|
|
* exiting the whole mm, does not take page table lock before freeing a page
|
|
|
|
* table, and may not use RCU at all: "outsiders" like khugepaged should avoid
|
|
|
|
* pte_offset_map() and co once the vma is detached from mm or mm_users is zero.
|
|
|
|
*/
|
mm/pgtable: allow pte_offset_map[_lock]() to fail
Make pte_offset_map() a wrapper for __pte_offset_map() (optionally outputs
pmdval), pte_offset_map_lock() a sparse __cond_lock wrapper for
__pte_offset_map_lock(): those __funcs added in mm/pgtable-generic.c.
__pte_offset_map() do pmdval validation (including pmd_clear_bad() when
pmd_bad()), returning NULL if pmdval is not for a page table.
__pte_offset_map_lock() verify pmdval unchanged after getting the lock,
trying again if it changed.
No #ifdef CONFIG_TRANSPARENT_HUGEPAGE around them: that could be done to
cover the imminent case, but we expect to generalize it later, and it
makes a mess of where to do the pmd_bad() clearing.
Add pte_offset_map_nolock(): outputs ptl like pte_offset_map_lock(),
without actually taking the lock. This will be preferred to open uses of
pte_lockptr(), because (when split ptlock is in page table's struct page)
it points to the right lock for the returned pte pointer, even if *pmd
gets changed racily afterwards.
Update corresponding Documentation.
Do not add the anticipated rcu_read_lock() and rcu_read_unlock()s yet:
they have to wait until all architectures are balancing pte_offset_map()s
with pte_unmap()s (as in the arch series posted earlier). But comment
where they will go, so that it's easy to add them for experiments. And
only when those are in place can transient racy failure cases be enabled.
Add more safety for the PAE mismatched pmd_low pmd_high case at that time.
Link: https://lkml.kernel.org/r/2929bfd-9893-a374-e463-4c3127ff9b9d@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Song Liu <song@kernel.org>
Cc: Steven Price <steven.price@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Zack Rusin <zackr@vmware.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 01:10:32 +00:00
|
|
|
pte_t *__pte_offset_map_lock(struct mm_struct *mm, pmd_t *pmd,
|
|
|
|
unsigned long addr, spinlock_t **ptlp)
|
|
|
|
{
|
|
|
|
spinlock_t *ptl;
|
|
|
|
pmd_t pmdval;
|
|
|
|
pte_t *pte;
|
|
|
|
again:
|
|
|
|
pte = __pte_offset_map(pmd, addr, &pmdval);
|
|
|
|
if (unlikely(!pte))
|
|
|
|
return pte;
|
|
|
|
ptl = pte_lockptr(mm, &pmdval);
|
|
|
|
spin_lock(ptl);
|
|
|
|
if (likely(pmd_same(pmdval, pmdp_get_lockless(pmd)))) {
|
|
|
|
*ptlp = ptl;
|
|
|
|
return pte;
|
|
|
|
}
|
|
|
|
pte_unmap_unlock(pte, ptl);
|
|
|
|
goto again;
|
|
|
|
}
|