- Rewrite how munlock works to massively reduce the contention
on i_mmap_rwsem (Hugh Dickins):
https://lore.kernel.org/linux-mm/8e4356d-9622-a7f0-b2c-f116b5f2efea@google.com/
- Sort out the page refcount mess for ZONE_DEVICE pages (Christoph Hellwig):
https://lore.kernel.org/linux-mm/20220210072828.2930359-1-hch@lst.de/
- Convert GUP to use folios and make pincount available for order-1
pages. (Matthew Wilcox)
- Convert a few more truncation functions to use folios (Matthew Wilcox)
- Convert page_vma_mapped_walk to use PFNs instead of pages (Matthew Wilcox)
- Convert rmap_walk to use folios (Matthew Wilcox)
- Convert most of shrink_page_list() to use a folio (Matthew Wilcox)
- Add support for creating large folios in readahead (Matthew Wilcox)
-----BEGIN PGP SIGNATURE-----
iQEzBAABCgAdFiEEejHryeLBw/spnjHrDpNsjXcpgj4FAmI4ucgACgkQDpNsjXcp
gj69Wgf6AwqwmO5Tmy+fLScDPqWxmXJofbocae1kyoGHf7Ui91OK4U2j6IpvAr+g
P/vLIK+JAAcTQcrSCjymuEkf4HkGZOR03QQn7maPIEe4eLrZRQDEsmHC1L9gpeJp
s/GMvDWiGE0Tnxu0EOzfVi/yT+qjIl/S8VvqtCoJv1HdzxitZ7+1RDuqImaMC5MM
Qi3uHag78vLmCltLXpIOdpgZhdZexCdL2Y/1npf+b6FVkAJRRNUnA0gRbS7YpoVp
CbxEJcmAl9cpJLuj5i5kIfS9trr+/QcvbUlzRxh4ggC58iqnmF2V09l2MJ7YU3XL
v1O/Elq4lRhXninZFQEm9zjrri7LDQ==
=n9Ad
-----END PGP SIGNATURE-----
Merge tag 'folio-5.18c' of git://git.infradead.org/users/willy/pagecache
Pull folio updates from Matthew Wilcox:
- Rewrite how munlock works to massively reduce the contention on
i_mmap_rwsem (Hugh Dickins):
https://lore.kernel.org/linux-mm/8e4356d-9622-a7f0-b2c-f116b5f2efea@google.com/
- Sort out the page refcount mess for ZONE_DEVICE pages (Christoph
Hellwig):
https://lore.kernel.org/linux-mm/20220210072828.2930359-1-hch@lst.de/
- Convert GUP to use folios and make pincount available for order-1
pages. (Matthew Wilcox)
- Convert a few more truncation functions to use folios (Matthew
Wilcox)
- Convert page_vma_mapped_walk to use PFNs instead of pages (Matthew
Wilcox)
- Convert rmap_walk to use folios (Matthew Wilcox)
- Convert most of shrink_page_list() to use a folio (Matthew Wilcox)
- Add support for creating large folios in readahead (Matthew Wilcox)
* tag 'folio-5.18c' of git://git.infradead.org/users/willy/pagecache: (114 commits)
mm/damon: minor cleanup for damon_pa_young
selftests/vm/transhuge-stress: Support file-backed PMD folios
mm/filemap: Support VM_HUGEPAGE for file mappings
mm/readahead: Switch to page_cache_ra_order
mm/readahead: Align file mappings for non-DAX
mm/readahead: Add large folio readahead
mm: Support arbitrary THP sizes
mm: Make large folios depend on THP
mm: Fix READ_ONLY_THP warning
mm/filemap: Allow large folios to be added to the page cache
mm: Turn can_split_huge_page() into can_split_folio()
mm/vmscan: Convert pageout() to take a folio
mm/vmscan: Turn page_check_references() into folio_check_references()
mm/vmscan: Account large folios correctly
mm/vmscan: Optimise shrink_page_list for non-PMD-sized folios
mm/vmscan: Free non-shmem folios without splitting them
mm/rmap: Constify the rmap_walk_control argument
mm/rmap: Convert rmap_walk() to take a folio
mm: Turn page_anon_vma() into folio_anon_vma()
mm/rmap: Turn page_lock_anon_vma_read() into folio_lock_anon_vma_read()
...
PageDoubleMap is maintained differently for anon and for shmem+file: the
shmem+file one was never cleared, because a safe place to do so could
not be found; so it would blight future use of the cached hugepage until
evicted.
See https://lore.kernel.org/lkml/1571938066-29031-1-git-send-email-yang.shi@linux.alibaba.com/
But page_add_file_rmap() does provide a safe place to do so (though later
than one might wish): allowing testing to return to an initial state
without a damaging drop_caches.
Link: https://lkml.kernel.org/r/61c5cf99-a962-9a25-597a-53ab1bd8fbc0@google.com
Fixes: 9a73f61bdb ("thp, mlock: do not mlock PTE-mapped file huge pages")
Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hardware poison is tracked on a per-page basis, not on the head page.
Link: https://lkml.kernel.org/r/20220130013042.1906881-1-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The rmap walking functions do not modify the rmap_walk_control, and
page_idle_clear_pte_refs() takes advantage of that to move construction
of the rmap_walk_control to compile time. This lets us remove an
unclean cast.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Add back page_lock_anon_vma_read() as a wrapper. This saves a few calls
to compound_head(). If any callers were passing a tail page before,
this would have failed to lock the anon VMA as page->mapping is not
valid for tail pages.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Move the PageTail check earlier so we can avoid even taking the folio
lock on tail pages. Otherwise, this is a straightforward use of
folios throughout.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Convert the callers to pass a folio and the try_to_migrate_one()
worker to use a folio throughout. Fixes an assumption that a
folio must be <= PMD size.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Convert split_huge_pmd_address() at the same time since it only passes
the folio through, and its two callers already have a folio on hand.
Removes numerous calls to compound_head() and removes an assumption
that a page cannot be larger than a PMD.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Both its callers pass a page which was previously on an LRU list,
so were passing a folio by definition. Use the type system to enforce
that and remove a few calls to compound_head().
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
folio_mkclean() already passes down a head page, so convert it
back to a folio.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
page_mapped_in_vma() really just wants to walk one page, but as the
code stands, if passed the head page of a compound page, it will
walk every page in the compound page. Extract pfn/nr_pages/pgoff
from the struct page early, so they can be overridden by
page_mapped_in_vma().
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Instead of declaring a struct page_vma_mapped_walk directly,
use these helpers to allow us to transition to a PFN approach in the
following patches.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Move compound_pincount from the third page to the second page, which
means it's available for all compound pages. That lets us delete
hpage_pincount_available().
On 32-bit systems, there isn't enough space for both compound_pincount
and compound_nr in the second page (it would collide with page->private,
which is in use for pages in the swap cache), so revert the optimisation
of storing both compound_order and compound_nr on 32-bit systems.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
4.8 commit 7751b2da6b ("vmscan: split file huge pages before paging
them out") inserted a split_huge_page_to_list() into shrink_page_list()
without considering the mlock case: no problem if the page has already
been marked as Mlocked (the !page_evictable check much higher up will
have skipped all this), but it has always been the case that races or
omissions in setting Mlocked can rely on page reclaim to detect this
and correct it before actually reclaiming - and that remains so, but
what a shame if a hugepage is needlessly split before discovering it.
It is surprising that page_check_references() returns PAGEREF_RECLAIM
when VM_LOCKED, but there was a good reason for that: try_to_unmap_one()
is where the condition is detected and corrected; and until now it could
not be done in page_referenced_one(), because that does not always have
the page locked. Now that mlock's requirement for page lock has gone,
copy try_to_unmap_one()'s mlock restoration into page_referenced_one(),
and let page_check_references() return PAGEREF_ACTIVATE in this case.
But page_referenced_one() may find a pte mapping one part of a hugepage:
what hold should a pte mapped in a VM_LOCKED area exert over the entire
huge page? That's debatable. The approach taken here is to treat that
pte mapping in page_referenced_one() as if not VM_LOCKED, and if no
VM_LOCKED pmd mapping is found later in the walk, and lack of reference
permits, then PAGEREF_RECLAIM take it to attempted splitting as before.
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Page migration of a VM_LOCKED page tends to fail, because when the old
page is unmapped, it is put on the mlock pagevec with raised refcount,
which then fails the freeze.
At first I thought this would be fixed by a local mlock_page_drain() at
the upper rmap_walk() level - which would have nicely batched all the
munlocks of that page; but tests show that the task can too easily move
to another cpu, leaving pagevec residue behind which fails the migration.
So try_to_migrate_one() drain the local pagevec after page_remove_rmap()
from a VM_LOCKED vma; and do the same in try_to_unmap_one(), whose
TTU_IGNORE_MLOCK users would want the same treatment; and do the same
in remove_migration_pte() - not important when successfully inserting
a new page, but necessary when hoping to retry after failure.
Any new pagevec runs the risk of adding a new way of stranding, and we
might discover other corners where mlock_page_drain() or lru_add_drain()
would now help.
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Placing munlock_vma_page() at the end of page_remove_rmap() shifts most
of the munlocking to clear_page_mlock(), since PageMlocked is typically
still set when mapcount has fallen to 0. That is not what we want: we
want /proc/vmstat's unevictable_pgs_cleared to remain as a useful check
on the integrity of of the mlock/munlock protocol - small numbers are
not surprising, but big numbers mean the protocol is not working.
That could be easily fixed by placing munlock_vma_page() at the start of
page_remove_rmap(); but later in the series we shall want to batch the
munlocking, and that too would tend to leave PageMlocked still set at
the point when it is checked.
So delete clear_page_mlock() now: leave it instead to release_pages()
(and __page_cache_release()) to do this backstop clearing of Mlocked,
when page refcount has fallen to 0. If a pinned page occasionally gets
counted as Mlocked and Unevictable until it is unpinned, that's okay.
A slightly regrettable side-effect of this change is that, since
release_pages() and __page_cache_release() may be called at interrupt
time, those places which update NR_MLOCK with interrupts enabled
had better use mod_zone_page_state() than __mod_zone_page_state()
(but holding the lruvec lock always has interrupts disabled).
This change, forcing Mlocked off when refcount 0 instead of earlier
when mapcount 0, is not fundamental: it can be reversed if performance
or something else is found to suffer; but this is the easiest way to
separate the stats - let's not complicate that without good reason.
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Add vma argument to mlock_vma_page() and munlock_vma_page(), make them
inline functions which check (vma->vm_flags & VM_LOCKED) before calling
mlock_page() and munlock_page() in mm/mlock.c.
Add bool compound to mlock_vma_page() and munlock_vma_page(): this is
because we have understandable difficulty in accounting pte maps of THPs,
and if passed a PageHead page, mlock_page() and munlock_page() cannot
tell whether it's a pmd map to be counted or a pte map to be ignored.
Add vma arg to page_add_file_rmap() and page_remove_rmap(), like the
others, and use that to call mlock_vma_page() at the end of the page
adds, and munlock_vma_page() at the end of page_remove_rmap() (end or
beginning? unimportant, but end was easier for assertions in testing).
No page lock is required (although almost all adds happen to hold it):
delete the "Serialize with page migration" BUG_ON(!PageLocked(page))s.
Certainly page lock did serialize with page migration, but I'm having
difficulty explaining why that was ever important.
Mlock accounting on THPs has been hard to define, differed between anon
and file, involved PageDoubleMap in some places and not others, required
clear_page_mlock() at some points. Keep it simple now: just count the
pmds and ignore the ptes, there is no reason for ptes to undo pmd mlocks.
page_add_new_anon_rmap() callers unchanged: they have long been calling
lru_cache_add_inactive_or_unevictable(), which does its own VM_LOCKED
handling (it also checks for not VM_SPECIAL: I think that's overcautious,
and inconsistent with other checks, that mmap_region() already prevents
VM_LOCKED on VM_SPECIAL; but haven't quite convinced myself to change it).
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
We have recommended some applications to mlock their userspace, but that
turns out to be counter-productive: when many processes mlock the same
file, contention on rmap's i_mmap_rwsem can become intolerable at exit: it
is needed for write, to remove any vma mapping that file from rmap's tree;
but hogged for read by those with mlocks calling page_mlock() (formerly
known as try_to_munlock()) on *each* page mapped from the file (the
purpose being to find out whether another process has the page mlocked,
so therefore it should not be unmlocked yet).
Several optimizations have been made in the past: one is to skip
page_mlock() when mapcount tells that nothing else has this page
mapped; but that doesn't help at all when others do have it mapped.
This time around, I initially intended to add a preliminary search
of the rmap tree for overlapping VM_LOCKED ranges; but that gets
messy with locking order, when in doubt whether a page is actually
present; and risks adding even more contention on the i_mmap_rwsem.
A solution would be much easier, if only there were space in struct page
for an mlock_count... but actually, most of the time, there is space for
it - an mlocked page spends most of its life on an unevictable LRU, but
since 3.18 removed the scan_unevictable_pages sysctl, that "LRU" has
been redundant. Let's try to reuse its page->lru.
But leave that until a later patch: in this patch, clear the ground by
removing page_mlock(), and all the infrastructure that has gathered
around it - which mostly hinders understanding, and will make reviewing
new additions harder. Don't mind those old comments about THPs, they
date from before 4.5's refcounting rework: splitting is not a risk here.
Just keep a minimal version of munlock_vma_page(), as reminder of what it
should attend to (in particular, the odd way PGSTRANDED is counted out of
PGMUNLOCKED), and likewise a stub for munlock_vma_pages_range(). Move
unchanged __mlock_posix_error_return() out of the way, down to above its
caller: this series then makes no further change after mlock_fixup().
After this and each following commit, the kernel builds, boots and runs;
but with deficiencies which may show up in testing of mlock and munlock.
The system calls succeed or fail as before, and mlock remains effective
in preventing page reclaim; but meminfo's Unevictable and Mlocked amounts
may be shown too low after mlock, grow, then stay too high after munlock:
with previously mlocked pages remaining unevictable for too long, until
finally unmapped and freed and counts corrected. Normal service will be
resumed in "mm/munlock: mlock_pte_range() when mlocking or munlocking".
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
In theory, the following race is possible for batched TLB flushing.
CPU0 CPU1
---- ----
shrink_page_list()
unmap
zap_pte_range()
flush_tlb_batched_pending()
flush_tlb_mm()
try_to_unmap()
set_tlb_ubc_flush_pending()
mm->tlb_flush_batched = true
mm->tlb_flush_batched = false
After the TLB is flushed on CPU1 via flush_tlb_mm() and before
mm->tlb_flush_batched is set to false, some PTE is unmapped on CPU0 and
the TLB flushing is pended. Then the pended TLB flushing will be lost.
Although both set_tlb_ubc_flush_pending() and
flush_tlb_batched_pending() are called with PTL locked, different PTL
instances may be used.
Because the race window is really small, and the lost TLB flushing will
cause problem only if a TLB entry is inserted before the unmapping in
the race window, the race is only theoretical. But the fix is simple
and cheap too.
Syzbot has reported this too as follows:
==================================================================
BUG: KCSAN: data-race in flush_tlb_batched_pending / try_to_unmap_one
write to 0xffff8881072cfbbc of 1 bytes by task 17406 on cpu 1:
flush_tlb_batched_pending+0x5f/0x80 mm/rmap.c:691
madvise_free_pte_range+0xee/0x7d0 mm/madvise.c:594
walk_pmd_range mm/pagewalk.c:128 [inline]
walk_pud_range mm/pagewalk.c:205 [inline]
walk_p4d_range mm/pagewalk.c:240 [inline]
walk_pgd_range mm/pagewalk.c:277 [inline]
__walk_page_range+0x981/0x1160 mm/pagewalk.c:379
walk_page_range+0x131/0x300 mm/pagewalk.c:475
madvise_free_single_vma mm/madvise.c:734 [inline]
madvise_dontneed_free mm/madvise.c:822 [inline]
madvise_vma mm/madvise.c:996 [inline]
do_madvise+0xe4a/0x1140 mm/madvise.c:1202
__do_sys_madvise mm/madvise.c:1228 [inline]
__se_sys_madvise mm/madvise.c:1226 [inline]
__x64_sys_madvise+0x5d/0x70 mm/madvise.c:1226
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
write to 0xffff8881072cfbbc of 1 bytes by task 71 on cpu 0:
set_tlb_ubc_flush_pending mm/rmap.c:636 [inline]
try_to_unmap_one+0x60e/0x1220 mm/rmap.c:1515
rmap_walk_anon+0x2fb/0x470 mm/rmap.c:2301
try_to_unmap+0xec/0x110
shrink_page_list+0xe91/0x2620 mm/vmscan.c:1719
shrink_inactive_list+0x3fb/0x730 mm/vmscan.c:2394
shrink_list mm/vmscan.c:2621 [inline]
shrink_lruvec+0x3c9/0x710 mm/vmscan.c:2940
shrink_node_memcgs+0x23e/0x410 mm/vmscan.c:3129
shrink_node+0x8f6/0x1190 mm/vmscan.c:3252
kswapd_shrink_node mm/vmscan.c:4022 [inline]
balance_pgdat+0x702/0xd30 mm/vmscan.c:4213
kswapd+0x200/0x340 mm/vmscan.c:4473
kthread+0x2c7/0x2e0 kernel/kthread.c:327
ret_from_fork+0x1f/0x30
value changed: 0x01 -> 0x00
Reported by Kernel Concurrency Sanitizer on:
CPU: 0 PID: 71 Comm: kswapd0 Not tainted 5.16.0-rc1-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
==================================================================
[akpm@linux-foundation.org: tweak comments]
Link: https://lkml.kernel.org/r/20211201021104.126469-1-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reported-by: syzbot+aa5bebed695edaccf0df@syzkaller.appspotmail.com
Cc: Nadav Amit <namit@vmware.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Marco Elver <elver@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge misc updates from Andrew Morton:
"257 patches.
Subsystems affected by this patch series: scripts, ocfs2, vfs, and
mm (slab-generic, slab, slub, kconfig, dax, kasan, debug, pagecache,
gup, swap, memcg, pagemap, mprotect, mremap, iomap, tracing, vmalloc,
pagealloc, memory-failure, hugetlb, userfaultfd, vmscan, tools,
memblock, oom-kill, hugetlbfs, migration, thp, readahead, nommu, ksm,
vmstat, madvise, memory-hotplug, rmap, zsmalloc, highmem, zram,
cleanups, kfence, and damon)"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (257 commits)
mm/damon: remove return value from before_terminate callback
mm/damon: fix a few spelling mistakes in comments and a pr_debug message
mm/damon: simplify stop mechanism
Docs/admin-guide/mm/pagemap: wordsmith page flags descriptions
Docs/admin-guide/mm/damon/start: simplify the content
Docs/admin-guide/mm/damon/start: fix a wrong link
Docs/admin-guide/mm/damon/start: fix wrong example commands
mm/damon/dbgfs: add adaptive_targets list check before enable monitor_on
mm/damon: remove unnecessary variable initialization
Documentation/admin-guide/mm/damon: add a document for DAMON_RECLAIM
mm/damon: introduce DAMON-based Reclamation (DAMON_RECLAIM)
selftests/damon: support watermarks
mm/damon/dbgfs: support watermarks
mm/damon/schemes: activate schemes based on a watermarks mechanism
tools/selftests/damon: update for regions prioritization of schemes
mm/damon/dbgfs: support prioritization weights
mm/damon/vaddr,paddr: support pageout prioritization
mm/damon/schemes: prioritize regions within the quotas
mm/damon/selftests: support schemes quotas
mm/damon/dbgfs: support quotas of schemes
...
During migration special page table entries are installed for each page
being migrated. These entries store the pfn and associated permissions
of ptes mapping the page being migarted.
Device-private pages use special swap pte entries to distinguish
read-only vs. writeable pages which the migration code checks when
creating migration entries. Normally this follows a fast path in
migrate_vma_collect_pmd() which correctly copies the permissions of
device-private pages over to migration entries when migrating pages back
to the CPU.
However the slow-path falls back to using try_to_migrate() which
unconditionally creates read-only migration entries for device-private
pages. This leads to unnecessary double faults on the CPU as the new
pages are always mapped read-only even when they could be mapped
writeable. Fix this by correctly copying device-private permissions in
try_to_migrate_one().
Link: https://lkml.kernel.org/r/20211018045247.3128058-1-apopple@nvidia.com
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Reported-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Transform page_mkclean() into folio_mkclean() and add a page_mkclean()
wrapper around folio_mkclean().
folio_mkclean is 15 bytes smaller than page_mkclean, but the kernel
is enlarged by 33 bytes due to inlining page_folio() into each caller.
This will go away once the callers are converted to use folio_mkclean().
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Howells <dhowells@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
These are the folio equivalents of lock_page_lruvec() and similar
functions. Also convert lruvec_memcg_debug() to take a folio.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Howells <dhowells@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Merge more updates from Andrew Morton:
"147 patches, based on 7d2a07b769.
Subsystems affected by this patch series: mm (memory-hotplug, rmap,
ioremap, highmem, cleanups, secretmem, kfence, damon, and vmscan),
alpha, percpu, procfs, misc, core-kernel, MAINTAINERS, lib,
checkpatch, epoll, init, nilfs2, coredump, fork, pids, criu, kconfig,
selftests, ipc, and scripts"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (94 commits)
scripts: check_extable: fix typo in user error message
mm/workingset: correct kernel-doc notations
ipc: replace costly bailout check in sysvipc_find_ipc()
selftests/memfd: remove unused variable
Kconfig.debug: drop selecting non-existing HARDLOCKUP_DETECTOR_ARCH
configs: remove the obsolete CONFIG_INPUT_POLLDEV
prctl: allow to setup brk for et_dyn executables
pid: cleanup the stale comment mentioning pidmap_init().
kernel/fork.c: unexport get_{mm,task}_exe_file
coredump: fix memleak in dump_vma_snapshot()
fs/coredump.c: log if a core dump is aborted due to changed file permissions
nilfs2: use refcount_dec_and_lock() to fix potential UAF
nilfs2: fix memory leak in nilfs_sysfs_delete_snapshot_group
nilfs2: fix memory leak in nilfs_sysfs_create_snapshot_group
nilfs2: fix memory leak in nilfs_sysfs_delete_##name##_group
nilfs2: fix memory leak in nilfs_sysfs_create_##name##_group
nilfs2: fix NULL pointer in nilfs_##name##_attr_release
nilfs2: fix memory leak in nilfs_sysfs_create_device_group
trap: cleanup trap_init()
init: move usermodehelper_enable() to populate_rootfs()
...
There is a READ_ONCE() in the macro of compound_head(), which will prevent
compiler from optimizing the code when there are more than once calling of
it in a function. Remove the redundant calling of compound_head() from
page_to_index() and page_add_file_rmap() for better code generation.
Link: https://lkml.kernel.org/r/20210811101431.83940-1-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: David Howells <dhowells@redhat.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: William Kucharski <william.kucharski@oracle.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAmEmTZcACgkQnJ2qBz9k
QNkkmAgArW6XoF1CePds/ZaC9vfg/nk66/zVo0n+J8xXjMWAPxcKbWFfV0uWVixq
yk4lcLV47a2Mu/B/1oLNd3vrSmhwU+srWqNwOFn1nv+lP/6wJqr8oztRHn/0L9Q3
ZSRrukSejbQ6AvTL/WzTNnCjjCc2ne3Kyko6W41aU6uyJuzhSM32wbx7qlV6t54Z
iint9OrB4gM0avLohNafTUq6I+tEGzBMNwpCG/tqCmkcvDcv3rTDVAnPSCTm0Tx2
hdrYDcY/rLxo93pDBaW1rYA/fohR+mIVye6k2TjkPAL6T1x+rxeT5qnc+YijH5yF
sFPDhlD+ZsfOLi8stWXLOJ+8+gLODg==
=pDBR
-----END PGP SIGNATURE-----
Merge tag 'hole_punch_for_v5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull fs hole punching vs cache filling race fixes from Jan Kara:
"Fix races leading to possible data corruption or stale data exposure
in multiple filesystems when hole punching races with operations such
as readahead.
This is the series I was sending for the last merge window but with
your objection fixed - now filemap_fault() has been modified to take
invalidate_lock only when we need to create new page in the page cache
and / or bring it uptodate"
* tag 'hole_punch_for_v5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
filesystems/locking: fix Malformed table warning
cifs: Fix race between hole punch and page fault
ceph: Fix race between hole punch and page fault
fuse: Convert to using invalidate_lock
f2fs: Convert to using invalidate_lock
zonefs: Convert to using invalidate_lock
xfs: Convert double locking of MMAPLOCK to use VFS helpers
xfs: Convert to use invalidate_lock
xfs: Refactor xfs_isilocked()
ext2: Convert to using invalidate_lock
ext4: Convert to use mapping->invalidate_lock
mm: Add functions to lock invalidate_lock for two mappings
mm: Protect operations adding pages to page cache with invalidate_lock
documentation: Sync file_operations members with reality
mm: Fix comments mentioning i_mutex
Currently, serializing operations such as page fault, read, or readahead
against hole punching is rather difficult. The basic race scheme is
like:
fallocate(FALLOC_FL_PUNCH_HOLE) read / fault / ..
truncate_inode_pages_range()
<create pages in page
cache here>
<update fs block mapping and free blocks>
Now the problem is in this way read / page fault / readahead can
instantiate pages in page cache with potentially stale data (if blocks
get quickly reused). Avoiding this race is not simple - page locks do
not work because we want to make sure there are *no* pages in given
range. inode->i_rwsem does not work because page fault happens under
mmap_sem which ranks below inode->i_rwsem. Also using it for reads makes
the performance for mixed read-write workloads suffer.
So create a new rw_semaphore in the address_space - invalidate_lock -
that protects adding of pages to page cache for page faults / reads /
readahead.
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Many thanks to Kirill for reminding that PageDoubleMap cannot be relied on
to warn of pte mappings in the Anon THP case; and a scan of subpages does
not seem appropriate here. Note how follow_trans_huge_pmd() does not even
mark an Anon THP as mlocked when compound_mapcount != 1: multiple mlocking
of Anon THP is avoided, so simply return from page_mlock() in this case.
Link: https://lore.kernel.org/lkml/cfa154c-d595-406-eb7d-eb9df730f944@google.com/
Fixes: d9770fcc1c ("mm/rmap: fix old bug: munlocking THP missed other mlocks")
Reported-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
inode->i_mutex has been replaced with inode->i_rwsem long ago. Fix
comments still mentioning i_mutex.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Jan Kara <jack@suse.cz>
I know nothing about zone_device pages and !device_private pages; but if
try_to_migrate_one() will do nothing for them, then it's better that
try_to_migrate() filter them first, than trawl through all their vmas.
Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Alistair Popple <apopple@nvidia.com>
Link: https://lore.kernel.org/lkml/1241d356-8ec9-f47b-a5ec-9b2bf66d242@google.com/
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In the unlikely race case that page_mlock_one() finds VM_LOCKED has been
cleared by the time it got page table lock, page_vma_mapped_walk_done()
must be called before returning, either explicitly, or by a final call
to page_vma_mapped_walk() - otherwise the page table remains locked.
Fixes: cd62734ca6 ("mm/rmap: split try_to_munlock from try_to_unmap")
Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Reported-by: kernel test robot <oliver.sang@intel.com>
Link: https://lore.kernel.org/lkml/20210711151446.GB4070@xsang-OptiPlex-9020/
Link: https://lore.kernel.org/lkml/f71f8523-cba7-3342-40a7-114abc5d1f51@google.com/
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The kernel recovers in due course from missing Mlocked pages: but there
was no point in calling page_mlock() (formerly known as
try_to_munlock()) on a THP, because nothing got done even when it was
found to be mapped in another VM_LOCKED vma.
It's true that we need to be careful: Mlocked accounting of pte-mapped
THPs is too difficult (so consistently avoided); but Mlocked accounting
of only-pmd-mapped THPs is supposed to work, even when multiple mappings
are mlocked and munlocked or munmapped. Refine the tests.
There is already a VM_BUG_ON_PAGE(PageDoubleMap) in page_mlock(), so
page_mlock_one() does not even have to worry about that complication.
(I said the kernel recovers: but would page reclaim be likely to split
THP before rediscovering that it's VM_LOCKED? I've not followed that up)
Fixes: 9a73f61bdb ("thp, mlock: do not mlock PTE-mapped file huge pages")
Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Link: https://lore.kernel.org/lkml/cfa154c-d595-406-eb7d-eb9df730f944@google.com/
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Parallel developments in mm/rmap.c have left behind some out-of-date
comments: try_to_migrate_one() also accepts TTU_SYNC (already commented
in try_to_migrate() itself), and try_to_migrate() returns nothing at
all.
TTU_SPLIT_FREEZE has just been deleted, so reword the comment about it
in mm/huge_memory.c; and TTU_IGNORE_ACCESS was removed in 5.11, so
delete the "recently referenced" comment from try_to_unmap_one() (once
upon a time the comment was near the removed codeblock, but they drifted
apart).
Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Alistair Popple <apopple@nvidia.com>
Link: https://lore.kernel.org/lkml/563ce5b2-7a44-5b4d-1dfd-59a0e65932a9@google.com/
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Some devices require exclusive write access to shared virtual memory (SVM)
ranges to perform atomic operations on that memory. This requires CPU
page tables to be updated to deny access whilst atomic operations are
occurring.
In order to do this introduce a new swap entry type
(SWP_DEVICE_EXCLUSIVE). When a SVM range needs to be marked for exclusive
access by a device all page table mappings for the particular range are
replaced with device exclusive swap entries. This causes any CPU access
to the page to result in a fault.
Faults are resovled by replacing the faulting entry with the original
mapping. This results in MMU notifiers being called which a driver uses
to update access permissions such as revoking atomic access. After
notifiers have been called the device will no longer have exclusive access
to the region.
Walking of the page tables to find the target pages is handled by
get_user_pages() rather than a direct page table walk. A direct page
table walk similar to what migrate_vma_collect()/unmap() does could also
have been utilised. However this resulted in more code similar in
functionality to what get_user_pages() provides as page faulting is
required to make the PTEs present and to break COW.
[dan.carpenter@oracle.com: fix signedness bug in make_device_exclusive_range()]
Link: https://lkml.kernel.org/r/YNIz5NVnZ5GiZ3u1@mwanda
Link: https://lkml.kernel.org/r/20210616105937.23201-8-apopple@nvidia.com
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Migration is currently implemented as a mode of operation for
try_to_unmap_one() generally specified by passing the TTU_MIGRATION flag
or in the case of splitting a huge anonymous page TTU_SPLIT_FREEZE.
However it does not have much in common with the rest of the unmap
functionality of try_to_unmap_one() and thus splitting it into a separate
function reduces the complexity of try_to_unmap_one() making it more
readable.
Several simplifications can also be made in try_to_migrate_one() based on
the following observations:
- All users of TTU_MIGRATION also set TTU_IGNORE_MLOCK.
- No users of TTU_MIGRATION ever set TTU_IGNORE_HWPOISON.
- No users of TTU_MIGRATION ever set TTU_BATCH_FLUSH.
TTU_SPLIT_FREEZE is a special case of migration used when splitting an
anonymous page. This is most easily dealt with by calling the correct
function from unmap_page() in mm/huge_memory.c - either try_to_migrate()
for PageAnon or try_to_unmap().
Link: https://lkml.kernel.org/r/20210616105937.23201-5-apopple@nvidia.com
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The behaviour of try_to_unmap_one() is difficult to follow because it
performs different operations based on a fairly large set of flags used in
different combinations.
TTU_MUNLOCK is one such flag. However it is exclusively used by
try_to_munlock() which specifies no other flags. Therefore rather than
overload try_to_unmap_one() with unrelated behaviour split this out into
it's own function and remove the flag.
Link: https://lkml.kernel.org/r/20210616105937.23201-4-apopple@nvidia.com
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Both migration and device private pages use special swap entries that are
manipluated by a range of inline functions. The arguments to these are
somewhat inconsistent so rework them to remove flag type arguments and to
make the arguments similar for both read and write entry creation.
Link: https://lkml.kernel.org/r/20210616105937.23201-3-apopple@nvidia.com
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently try_to_unmap() return bool value by checking page_mapcount(),
however this may return false positive since page_mapcount() doesn't check
all subpages of compound page. The total_mapcount() could be used
instead, but its cost is higher since it traverses all subpages.
Actually the most callers of try_to_unmap() don't care about the return
value at all. So just need check if page is still mapped by page_mapped()
when necessary. And page_mapped() does bail out early when it finds
mapped subpage.
Link: https://lkml.kernel.org/r/bb27e3fe-6036-b637-5086-272befbfe3da@google.com
Suggested-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Yang Shi <shy828301@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jue Wang <juew@google.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Wang Yugui <wangyugui@e16-tech.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Anon THP tails were already supported, but memory-failure may need to
use page_address_in_vma() on file THP tails, which its page->mapping
check did not permit: fix it.
hughd adds: no current usage is known to hit the issue, but this does
fix a subtle trap in a general helper: best fixed in stable sooner than
later.
Link: https://lkml.kernel.org/r/a0d9b53-bf5d-8bab-ac5-759dc61819c1@google.com
Fixes: 800d8c63b2 ("shmem: add huge pages support")
Signed-off-by: Jue Wang <juew@google.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Wang Yugui <wangyugui@e16-tech.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Running certain tests with a DEBUG_VM kernel would crash within hours,
on the total_mapcount BUG() in split_huge_page_to_list(), while trying
to free up some memory by punching a hole in a shmem huge page: split's
try_to_unmap() was unable to find all the mappings of the page (which,
on a !DEBUG_VM kernel, would then keep the huge page pinned in memory).
When that BUG() was changed to a WARN(), it would later crash on the
VM_BUG_ON_VMA(end < vma->vm_start || start >= vma->vm_end, vma) in
mm/internal.h:vma_address(), used by rmap_walk_file() for
try_to_unmap().
vma_address() is usually correct, but there's a wraparound case when the
vm_start address is unusually low, but vm_pgoff not so low:
vma_address() chooses max(start, vma->vm_start), but that decides on the
wrong address, because start has become almost ULONG_MAX.
Rewrite vma_address() to be more careful about vm_pgoff; move the
VM_BUG_ON_VMA() out of it, returning -EFAULT for errors, so that it can
be safely used from page_mapped_in_vma() and page_address_in_vma() too.
Add vma_address_end() to apply similar care to end address calculation,
in page_vma_mapped_walk() and page_mkclean_one() and try_to_unmap_one();
though it raises a question of whether callers would do better to supply
pvmw->end to page_vma_mapped_walk() - I chose not, for a smaller patch.
An irritation is that their apparent generality breaks down on KSM
pages, which cannot be located by the page->index that page_to_pgoff()
uses: as commit 4b0ece6fa0 ("mm: migrate: fix remove_migration_pte()
for ksm pages") once discovered. I dithered over the best thing to do
about that, and have ended up with a VM_BUG_ON_PAGE(PageKsm) in both
vma_address() and vma_address_end(); though the only place in danger of
using it on them was try_to_unmap_one().
Sidenote: vma_address() and vma_address_end() now use compound_nr() on a
head page, instead of thp_size(): to make the right calculation on a
hugetlbfs page, whether or not THPs are configured. try_to_unmap() is
used on hugetlbfs pages, but perhaps the wrong calculation never
mattered.
Link: https://lkml.kernel.org/r/caf1c1a3-7cfb-7f8f-1beb-ba816e932825@google.com
Fixes: a8fa41ad2f ("mm, rmap: check all VMAs that PTE-mapped THP can be part of")
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jue Wang <juew@google.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Wang Yugui <wangyugui@e16-tech.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Stressing huge tmpfs often crashed on unmap_page()'s VM_BUG_ON_PAGE
(!unmap_success): with dump_page() showing mapcount:1, but then its raw
struct page output showing _mapcount ffffffff i.e. mapcount 0.
And even if that particular VM_BUG_ON_PAGE(!unmap_success) is removed,
it is immediately followed by a VM_BUG_ON_PAGE(compound_mapcount(head)),
and further down an IS_ENABLED(CONFIG_DEBUG_VM) total_mapcount BUG():
all indicative of some mapcount difficulty in development here perhaps.
But the !CONFIG_DEBUG_VM path handles the failures correctly and
silently.
I believe the problem is that once a racing unmap has cleared pte or
pmd, try_to_unmap_one() may skip taking the page table lock, and emerge
from try_to_unmap() before the racing task has reached decrementing
mapcount.
Instead of abandoning the unsafe VM_BUG_ON_PAGE(), and the ones that
follow, use PVMW_SYNC in try_to_unmap_one() in this case: adding
TTU_SYNC to the options, and passing that from unmap_page().
When CONFIG_DEBUG_VM, or for non-debug too? Consensus is to do the same
for both: the slight overhead added should rarely matter, except perhaps
if splitting sparsely-populated multiply-mapped shmem. Once confident
that bugs are fixed, TTU_SYNC here can be removed, and the race
tolerated.
Link: https://lkml.kernel.org/r/c1e95853-8bcd-d8fd-55fa-e7f2488e78f@google.com
Fixes: fec89c109f ("thp: rewrite freeze_page()/unfreeze_page() with generic rmap walkers")
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jue Wang <juew@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Wang Yugui <wangyugui@e16-tech.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since commit 746b18d421 ("mm: use refcounts for page_lock_anon_vma()"),
page_lock_anon_vma() is renamed to page_get_anon_vma() and converted to
return a refcount increased anon_vma. But it forgot to change the
relevant comment.
Link: https://lkml.kernel.org/r/20210203093215.31990-1-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
page_mapcount_is_zero() calculates accurately how many mappings a hugepage
has in order to check against 0 only. This is a waste of cpu time. We
can do this via page_not_mapped() to save some possible atomic_read
cycles. Remove the function page_mapcount_is_zero() as it's not used
anymore and move page_not_mapped() above try_to_unmap() to avoid
identifier undeclared compilation error.
Link: https://lkml.kernel.org/r/20210130084904.35307-1-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 21333b2b66 ("ksm: no debug in page_dup_rmap()") has reverted
page_dup_rmap() to an inline atomic_inc of mapcount. So page_dup_rmap()
does not call __page_check_anon_rmap() anymore.
Link: https://lkml.kernel.org/r/20210128110209.50857-1-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove extra semicolon without any functional change intended.
Link: https://lkml.kernel.org/r/20210127093425.39640-1-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
commit 2b575eb64f ("mm: convert anon_vma->lock to a mutex") changed
spinlock used to serialize access to vma list to mutex. And further, the
commit 5a505085f0 ("mm/rmap: Convert the struct anon_vma::mutex to an
rwsem") converted the mutex to an rwsem for solving scalability problem.
So replace spinlock with rwsem to make comment uptodate.
Link: https://lkml.kernel.org/r/20210123072459.25903-1-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>