commit 3822a7c409

Merge tag 'mm-stable-2023-02-20-13-37' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:

- Daniel Verkamp has contributed a memfd series ("mm/memfd: add F_SEAL_EXEC") which permits the setting of the memfd execute bit at memfd creation time, with the option of sealing the state of the X bit.

- Peter Xu adds a patch series ("mm/hugetlb: Make huge_pte_offset() thread-safe for pmd unshare") which addresses a rare race condition related to PMD unsharing.

- Several folioification patch series from Matthew Wilcox, Vishal Moola, Sidhartha Kumar and Lorenzo Stoakes.

- Johannes Weiner has a series ("mm: push down lock_page_memcg()") which performs some memcg maintenance and cleanup work.

- SeongJae Park has added DAMOS filtering to DAMON, with the series "mm/damon/core: implement damos filter". These filters provide users with finer-grained control over DAMOS's actions. SeongJae has also done some DAMON cleanup work.

- Kairui Song adds a series ("Clean up and fixes for swap").

- Vernon Yang contributed the series "Clean up and refinement for maple tree".

- Yu Zhao has contributed the "mm: multi-gen LRU: memcg LRU" series. It adds to MGLRU an LRU of memcgs, to improve the scalability of global reclaim.

- David Hildenbrand has added some userfaultfd cleanup work in the series "mm: uffd-wp + change_protection() cleanups".

- Christoph Hellwig has removed the generic_writepages() library function in the series "remove generic_writepages".

- Baolin Wang has performed some maintenance on the compaction code in his series "Some small improvements for compaction".

- Sidhartha Kumar is doing some maintenance work on struct page in his series "Get rid of tail page fields".

- David Hildenbrand contributed some cleanup, bugfixing and generalization of pte management and of pte debugging in his series "mm: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE on all architectures with swap PTEs".

- Mel Gorman and Neil Brown have removed the __GFP_ATOMIC allocation flag in the series "Discard __GFP_ATOMIC".

- Sergey Senozhatsky has improved zsmalloc's memory utilization with his series "zsmalloc: make zspage chain size configurable".

- Joey Gouly has added prctl() support for prohibiting the creation of writeable+executable mappings. The previous BPF-based approach had shortcomings. See "mm: In-kernel support for memory-deny-write-execute (MDWE)".

- Waiman Long did some kmemleak cleanup and bugfixing in the series "mm/kmemleak: Simplify kmemleak_cond_resched() & fix UAF".

- T.J. Alumbaugh has contributed some MGLRU cleanup work in his series "mm: multi-gen LRU: improve".

- Jiaqi Yan has provided some enhancements to our memory error statistics reporting, mainly by presenting the statistics on a per-node basis. See the series "Introduce per NUMA node memory error statistics".

- Mel Gorman has a second and hopefully final shot at fixing a CPU-hog regression in compaction via his series "Fix excessive CPU usage during compaction".

- Christoph Hellwig does some vmalloc maintenance work in the series "cleanup vfree and vunmap".

- Christoph Hellwig has removed block_device_operations.rw_page() in the series "remove ->rw_page".

- We get some maple_tree improvements and cleanups in Liam Howlett's series "VMA tree type safety and remove __vma_adjust()".

- Suren Baghdasaryan has done some work on the maintainability of our vm_flags handling in the series "introduce vm_flags modifier functions".
- Some pagemap cleanup and generalization work in Mike Rapoport's series "mm, arch: add generic implementation of pfn_valid() for FLATMEM" and "fixups for generic implementation of pfn_valid()".

- Baoquan He has done some work to make /proc/vmallocinfo and /proc/kcore better represent the real state of things in his series "mm/vmalloc.c: allow vread() to read out vm_map_ram areas".

- Jason Gunthorpe rationalized the GUP system's interface to the rest of the kernel in the series "Simplify the external interface for GUP".

- SeongJae Park wishes to migrate people from DAMON's debugfs interface over to its sysfs interface. To support this, we'll temporarily be printing warnings when people use the debugfs interface. See the series "mm/damon: deprecate DAMON debugfs interface".

- Andrey Konovalov provided the accurately named "lib/stackdepot: fixes and clean-ups" series.

- Huang Ying has provided a dramatic reduction in migration's TLB flush IPI rates with the series "migrate_pages(): batch TLB flushing".

- Arnd Bergmann has some objtool fixups in "objtool warning fixes".

* tag 'mm-stable-2023-02-20-13-37' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (505 commits)
  include/linux/migrate.h: remove unneeded externs
  mm/memory_hotplug: cleanup return value handing in do_migrate_range()
  mm/uffd: fix comment in handling pte markers
  mm: change to return bool for isolate_movable_page()
  mm: hugetlb: change to return bool for isolate_hugetlb()
  mm: change to return bool for isolate_lru_page()
  mm: change to return bool for folio_isolate_lru()
  objtool: add UACCESS exceptions for __tsan_volatile_read/write
  kmsan: disable ftrace in kmsan core code
  kasan: mark addr_has_metadata __always_inline
  mm: memcontrol: rename memcg_kmem_enabled()
  sh: initialize max_mapnr
  m68k/nommu: add missing definition of ARCH_PFN_OFFSET
  mm: percpu: fix incorrect size in pcpu_obj_full_size()
  maple_tree: reduce stack usage with gcc-9 and earlier
  mm: page_alloc: call panic() when memoryless node allocation fails
  mm: multi-gen LRU: avoid futile retries
  migrate_pages: move THP/hugetlb migration support check to simplify code
  migrate_pages: batch flushing TLB
  migrate_pages: share more code between _unmap and _move
  ...
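As an illustration of the MDWE feature mentioned above, the following minimal sketch (not part of the commit message) shows how a process might opt in via prctl() and then be refused a writable+executable mapping. PR_SET_MDWE and PR_MDWE_REFUSE_EXEC_GAIN come from the new UAPI headers; the numeric fallback values below are assumptions for older headers::

    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/prctl.h>

    #ifndef PR_SET_MDWE
    #define PR_SET_MDWE 65                 /* assumed value if headers predate the series */
    #endif
    #ifndef PR_MDWE_REFUSE_EXEC_GAIN
    #define PR_MDWE_REFUSE_EXEC_GAIN 1     /* assumed value if headers predate the series */
    #endif

    int main(void)
    {
            /* Opt this process in to memory-deny-write-execute. */
            if (prctl(PR_SET_MDWE, PR_MDWE_REFUSE_EXEC_GAIN, 0, 0, 0))
                    perror("prctl(PR_SET_MDWE)");

            /* A writable+executable anonymous mapping should now be refused. */
            void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE | PROT_EXEC,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (p == MAP_FAILED)
                    printf("W+X mmap refused as expected: %s\n", strerror(errno));
            else
                    printf("W+X mmap unexpectedly succeeded\n");
            return 0;
    }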
@@ -182,3 +182,42 @@ Date: November 2021
Contact:	Jarkko Sakkinen <jarkko@kernel.org>
Description:
		The total amount of SGX physical memory in bytes.

What:		/sys/devices/system/node/nodeX/memory_failure/total
Date:		January 2023
Contact:	Jiaqi Yan <jiaqiyan@google.com>
Description:
		The total number of raw poisoned pages (pages containing
		corrupted data due to memory errors) on a NUMA node.

What:		/sys/devices/system/node/nodeX/memory_failure/ignored
Date:		January 2023
Contact:	Jiaqi Yan <jiaqiyan@google.com>
Description:
		Of the raw poisoned pages on a NUMA node, how many pages are
		ignored by the memory error recovery attempt, usually because
		support for this type of pages is unavailable and the kernel
		gives up the recovery.

What:		/sys/devices/system/node/nodeX/memory_failure/failed
Date:		January 2023
Contact:	Jiaqi Yan <jiaqiyan@google.com>
Description:
		Of the raw poisoned pages on a NUMA node, how many pages are
		failed by the memory error recovery attempt. This usually means
		a key recovery operation failed.

What:		/sys/devices/system/node/nodeX/memory_failure/delayed
Date:		January 2023
Contact:	Jiaqi Yan <jiaqiyan@google.com>
Description:
		Of the raw poisoned pages on a NUMA node, how many pages are
		delayed by the memory error recovery attempt. Delayed poisoned
		pages will usually be retried by the kernel.

What:		/sys/devices/system/node/nodeX/memory_failure/recovered
Date:		January 2023
Contact:	Jiaqi Yan <jiaqiyan@google.com>
Description:
		Of the raw poisoned pages on a NUMA node, how many pages are
		recovered by the memory error recovery attempt.
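As a small illustrative sketch (not part of the patch), user space could read the per-node counters documented above like this; node 0 and minimal error handling are assumptions::

    #include <stdio.h>

    /* Print the per-node hard memory error statistics added by this series. */
    int main(void)
    {
            static const char *names[] = {
                    "total", "ignored", "failed", "delayed", "recovered"
            };
            char path[128];

            for (int i = 0; i < 5; i++) {
                    snprintf(path, sizeof(path),
                             "/sys/devices/system/node/node0/memory_failure/%s",
                             names[i]);
                    FILE *f = fopen(path, "r");
                    if (!f)
                            continue;       /* kernel or node lacks the file */
                    unsigned long long val;
                    if (fscanf(f, "%llu", &val) == 1)
                            printf("%-9s %llu\n", names[i], val);
                    fclose(f);
            }
            return 0;
    }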
@@ -258,6 +258,35 @@ Contact: SeongJae Park <sj@kernel.org>
Description:	Writing to and reading from this file sets and gets the low
		watermark of the scheme in permil.

What:		/sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/filters/nr_filters
Date:		Dec 2022
Contact:	SeongJae Park <sj@kernel.org>
Description:	Writing a number 'N' to this file creates that number of
		directories, named '0' to 'N-1', for setting the filters of
		the scheme under the filters/ directory.

What:		/sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/filters/<F>/type
Date:		Dec 2022
Contact:	SeongJae Park <sj@kernel.org>
Description:	Writing to and reading from this file sets and gets the type of
		the memory of interest. 'anon' for anonymous pages, or 'memcg'
		for a specific memory cgroup, can be written and read.

What:		/sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/filters/<F>/memcg_path
Date:		Dec 2022
Contact:	SeongJae Park <sj@kernel.org>
Description:	If 'memcg' is written to the 'type' file, writing to and
		reading from this file sets and gets the path to the memory
		cgroup of interest.

What:		/sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/filters/<F>/matching
Date:		Dec 2022
Contact:	SeongJae Park <sj@kernel.org>
Description:	Writing 'Y' or 'N' to this file sets whether to filter out
		pages that do or do not match the 'type' and 'memcg_path',
		respectively. "Filter out" means the action of the scheme
		will not be applied to those pages.

What:		/sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/stats/nr_tried
Date:		Mar 2022
Contact:	SeongJae Park <sj@kernel.org>
@ -87,6 +87,8 @@ Brief summary of control files.
|
||||
memory.swappiness set/show swappiness parameter of vmscan
|
||||
(See sysctl's vm.swappiness)
|
||||
memory.move_charge_at_immigrate set/show controls of moving charges
|
||||
This knob is deprecated and shouldn't be
|
||||
used.
|
||||
memory.oom_control set/show oom controls.
|
||||
memory.numa_stat show the number of memory usage per numa
|
||||
node
|
||||
@ -727,8 +729,15 @@ If we want to change this to 1G, we can at any time use::
|
||||
|
||||
.. _cgroup-v1-memory-move-charges:
|
||||
|
||||
8. Move charges at task migration
|
||||
=================================
|
||||
8. Move charges at task migration (DEPRECATED!)
|
||||
===============================================
|
||||
|
||||
THIS IS DEPRECATED!
|
||||
|
||||
It's expensive and unreliable! It's better practice to launch workload
|
||||
tasks directly from inside their target cgroup. Use dedicated workload
|
||||
cgroups to allow fine-grained policy adjustments without having to
|
||||
move physical pages between control domains.
|
||||
|
||||
Users can move charges associated with a task along with task migration, that
|
||||
is, uncharge task's pages from the old cgroup and charge them to the new cgroup.
|
||||
|
@ -205,6 +205,15 @@ The end physical address of memory region that DAMON_RECLAIM will do work
|
||||
against. That is, DAMON_RECLAIM will find cold memory regions in this region
|
||||
and reclaims. By default, biggest System RAM is used as the region.
|
||||
|
||||
skip_anon
|
||||
---------
|
||||
|
||||
Skip anonymous pages reclamation.
|
||||
|
||||
If this parameter is set as ``Y``, DAMON_RECLAIM does not reclaim anonymous
|
||||
pages. By default, ``N``.
|
||||
|
||||
|
||||
kdamond_pid
|
||||
-----------
|
||||
|
||||
|
@ -25,10 +25,12 @@ DAMON provides below interfaces for different users.
|
||||
interface provides only simple :ref:`statistics <damos_stats>` for the
|
||||
monitoring results. For detailed monitoring results, DAMON provides a
|
||||
:ref:`tracepoint <tracepoint>`.
|
||||
- *debugfs interface.*
|
||||
- *debugfs interface. (DEPRECATED!)*
|
||||
:ref:`This <debugfs_interface>` is almost identical to :ref:`sysfs interface
|
||||
<sysfs_interface>`. This will be removed after next LTS kernel is released,
|
||||
so users should move to the :ref:`sysfs interface <sysfs_interface>`.
|
||||
<sysfs_interface>`. This is deprecated, so users should move to the
|
||||
:ref:`sysfs interface <sysfs_interface>`. If you depend on this and cannot
|
||||
move, please report your usecase to damon@lists.linux.dev and
|
||||
linux-mm@kvack.org.
|
||||
- *Kernel Space Programming Interface.*
|
||||
:doc:`This </mm/damon/api>` is for kernel space programmers. Using this,
|
||||
users can utilize every feature of DAMON most flexibly and efficiently by
|
||||
@ -87,6 +89,8 @@ comma (","). ::
|
||||
│ │ │ │ │ │ │ quotas/ms,bytes,reset_interval_ms
|
||||
│ │ │ │ │ │ │ │ weights/sz_permil,nr_accesses_permil,age_permil
|
||||
│ │ │ │ │ │ │ watermarks/metric,interval_us,high,mid,low
|
||||
│ │ │ │ │ │ │ filters/nr_filters
|
||||
│ │ │ │ │ │ │ │ 0/type,matching,memcg_id
|
||||
│ │ │ │ │ │ │ stats/nr_tried,sz_tried,nr_applied,sz_applied,qt_exceeds
|
||||
│ │ │ │ │ │ │ tried_regions/
|
||||
│ │ │ │ │ │ │ │ 0/start,end,nr_accesses,age
|
||||
@ -151,6 +155,8 @@ number (``N``) to the file creates the number of child directories named as
|
||||
moment, only one context per kdamond is supported, so only ``0`` or ``1`` can
|
||||
be written to the file.
|
||||
|
||||
.. _sysfs_contexts:
|
||||
|
||||
contexts/<N>/
|
||||
-------------
|
||||
|
||||
@ -268,21 +274,32 @@ schemes/<N>/
|
||||
------------
|
||||
|
||||
In each scheme directory, five directories (``access_pattern``, ``quotas``,
|
||||
``watermarks``, ``stats``, and ``tried_regions``) and one file (``action``)
|
||||
exist.
|
||||
``watermarks``, ``filters``, ``stats``, and ``tried_regions``) and one file
|
||||
(``action``) exist.
|
||||
|
||||
The ``action`` file is for setting and getting what action you want to apply to
|
||||
memory regions having specific access pattern of the interest. The keywords
|
||||
that can be written to and read from the file and their meaning are as below.
|
||||
|
||||
- ``willneed``: Call ``madvise()`` for the region with ``MADV_WILLNEED``
|
||||
- ``cold``: Call ``madvise()`` for the region with ``MADV_COLD``
|
||||
- ``pageout``: Call ``madvise()`` for the region with ``MADV_PAGEOUT``
|
||||
- ``hugepage``: Call ``madvise()`` for the region with ``MADV_HUGEPAGE``
|
||||
- ``nohugepage``: Call ``madvise()`` for the region with ``MADV_NOHUGEPAGE``
|
||||
Note that support of each action depends on the running DAMON operations set
|
||||
`implementation <sysfs_contexts>`.
|
||||
|
||||
- ``willneed``: Call ``madvise()`` for the region with ``MADV_WILLNEED``.
|
||||
Supported by ``vaddr`` and ``fvaddr`` operations set.
|
||||
- ``cold``: Call ``madvise()`` for the region with ``MADV_COLD``.
|
||||
Supported by ``vaddr`` and ``fvaddr`` operations set.
|
||||
- ``pageout``: Call ``madvise()`` for the region with ``MADV_PAGEOUT``.
|
||||
Supported by ``vaddr``, ``fvaddr`` and ``paddr`` operations set.
|
||||
- ``hugepage``: Call ``madvise()`` for the region with ``MADV_HUGEPAGE``.
|
||||
Supported by ``vaddr`` and ``fvaddr`` operations set.
|
||||
- ``nohugepage``: Call ``madvise()`` for the region with ``MADV_NOHUGEPAGE``.
|
||||
Supported by ``vaddr`` and ``fvaddr`` operations set.
|
||||
- ``lru_prio``: Prioritize the region on its LRU lists.
|
||||
Supported by ``paddr`` operations set.
|
||||
- ``lru_deprio``: Deprioritize the region on its LRU lists.
|
||||
- ``stat``: Do nothing but count the statistics
|
||||
Supported by ``paddr`` operations set.
|
||||
- ``stat``: Do nothing but count the statistics.
|
||||
Supported by all operations sets.
|
||||
|
||||
schemes/<N>/access_pattern/
|
||||
---------------------------
|
||||
@@ -347,6 +364,46 @@ as below.

The ``interval`` should be written in microseconds unit.

schemes/<N>/filters/
--------------------

Users could know something more than the kernel for specific types of memory.
In that case, users could do their own management for that memory and hence
would not want DAMOS to bother it.  Users could limit DAMOS by setting the
access pattern of the scheme and/or the monitoring regions for the purpose,
but that can be inefficient in some cases.  In such cases, users could set
non-access-pattern-driven filters using files in this directory.

In the beginning, this directory has only one file, ``nr_filters``.  Writing a
number (``N``) to the file creates that number of child directories named
``0`` to ``N-1``.  Each directory represents one filter.  The filters are
evaluated in numeric order.

Each filter directory contains three files, namely ``type``, ``matching``, and
``memcg_path``.  You can write one of two special keywords, ``anon`` for
anonymous pages, or ``memcg`` for specific memory cgroup filtering.  In the
case of memory cgroup filtering, you can specify the memory cgroup of interest
by writing the path of the memory cgroup from the cgroups mount point to the
``memcg_path`` file.  You can write ``Y`` or ``N`` to the ``matching`` file to
filter out pages that do or do not match the type, respectively.  Then, the
scheme's action will not be applied to the pages specified to be filtered out.

For example, below restricts a DAMOS action to be applied to only non-anonymous
pages of all memory cgroups except ``/having_care_already``::

    # echo 2 > nr_filters
    # # filter out anonymous pages
    echo anon > 0/type
    echo Y > 0/matching
    # # further filter out all cgroups except one at '/having_care_already'
    echo memcg > 1/type
    echo /having_care_already > 1/memcg_path
    echo N > 1/matching

Note that filters are currently supported only when the ``paddr``
`implementation <sysfs_contexts>` is being used.

.. _sysfs_schemes_stats:

schemes/<N>/stats/
@ -432,13 +489,17 @@ the files as above. Above is only for an example.
|
||||
|
||||
.. _debugfs_interface:
|
||||
|
||||
debugfs Interface
|
||||
=================
|
||||
debugfs Interface (DEPRECATED!)
|
||||
===============================
|
||||
|
||||
.. note::
|
||||
|
||||
DAMON debugfs interface will be removed after next LTS kernel is released, so
|
||||
users should move to the :ref:`sysfs interface <sysfs_interface>`.
|
||||
THIS IS DEPRECATED!
|
||||
|
||||
DAMON debugfs interface is deprecated, so users should move to the
|
||||
:ref:`sysfs interface <sysfs_interface>`. If you depend on this and cannot
|
||||
move, please report your usecase to damon@lists.linux.dev and
|
||||
linux-mm@kvack.org.
|
||||
|
||||
DAMON exports eight files, ``attrs``, ``target_ids``, ``init_regions``,
|
||||
``schemes``, ``monitor_on``, ``kdamond_pid``, ``mk_contexts`` and
|
||||
@ -574,11 +635,15 @@ The ``<action>`` is a predefined integer for memory management actions, which
|
||||
DAMON will apply to the regions having the target access pattern. The
|
||||
supported numbers and their meanings are as below.
|
||||
|
||||
- 0: Call ``madvise()`` for the region with ``MADV_WILLNEED``
|
||||
- 1: Call ``madvise()`` for the region with ``MADV_COLD``
|
||||
- 2: Call ``madvise()`` for the region with ``MADV_PAGEOUT``
|
||||
- 3: Call ``madvise()`` for the region with ``MADV_HUGEPAGE``
|
||||
- 4: Call ``madvise()`` for the region with ``MADV_NOHUGEPAGE``
|
||||
- 0: Call ``madvise()`` for the region with ``MADV_WILLNEED``. Ignored if
|
||||
``target`` is ``paddr``.
|
||||
- 1: Call ``madvise()`` for the region with ``MADV_COLD``. Ignored if
|
||||
``target`` is ``paddr``.
|
||||
- 2: Call ``madvise()`` for the region with ``MADV_PAGEOUT``.
|
||||
- 3: Call ``madvise()`` for the region with ``MADV_HUGEPAGE``. Ignored if
|
||||
``target`` is ``paddr``.
|
||||
- 4: Call ``madvise()`` for the region with ``MADV_NOHUGEPAGE``. Ignored if
|
||||
``target`` is ``paddr``.
|
||||
- 5: Do nothing but count the statistics
|
||||
|
||||
Quota
|
||||
|
@ -459,13 +459,13 @@ Examples
|
||||
.. _map_hugetlb:
|
||||
|
||||
``map_hugetlb``
|
||||
see tools/testing/selftests/vm/map_hugetlb.c
|
||||
see tools/testing/selftests/mm/map_hugetlb.c
|
||||
|
||||
``hugepage-shm``
|
||||
see tools/testing/selftests/vm/hugepage-shm.c
|
||||
see tools/testing/selftests/mm/hugepage-shm.c
|
||||
|
||||
``hugepage-mmap``
|
||||
see tools/testing/selftests/vm/hugepage-mmap.c
|
||||
see tools/testing/selftests/mm/hugepage-mmap.c
|
||||
|
||||
The `libhugetlbfs`_ library provides a wide range of userspace tools
|
||||
to help with huge page usability, environment setup, and control.
|
||||
|
@ -63,7 +63,7 @@ workload one should:
|
||||
are not reclaimable, he or she can filter them out using
|
||||
``/proc/kpageflags``.
|
||||
|
||||
The page-types tool in the tools/vm directory can be used to assist in this.
|
||||
The page-types tool in the tools/mm directory can be used to assist in this.
|
||||
If the tool is run initially with the appropriate option, it will mark all the
|
||||
queried pages as idle. Subsequent runs of the tool can then show which pages have
|
||||
their idle flag cleared in the interim.
|
||||
|
@ -1,4 +1,7 @@
|
||||
=============
|
||||
=======================
|
||||
NUMA Memory Performance
|
||||
=======================
|
||||
|
||||
NUMA Locality
|
||||
=============
|
||||
|
||||
@ -59,7 +62,6 @@ that are CPUs and hence suitable for generic task scheduling, and
|
||||
IO initiators such as GPUs and NICs. Unlike access class 0, only
|
||||
nodes containing CPUs are considered.
|
||||
|
||||
================
|
||||
NUMA Performance
|
||||
================
|
||||
|
||||
@ -94,7 +96,6 @@ for the platform.
|
||||
Access class 1 takes the same form but only includes values for CPU to
|
||||
memory activity.
|
||||
|
||||
==========
|
||||
NUMA Cache
|
||||
==========
|
||||
|
||||
@ -168,7 +169,6 @@ The "size" is the number of bytes provided by this cache level.
|
||||
The "write_policy" will be 0 for write-back, and non-zero for
|
||||
write-through caching.
|
||||
|
||||
========
|
||||
See Also
|
||||
========
|
||||
|
||||
|
@ -44,7 +44,7 @@ There are four components to pagemap:
|
||||
* ``/proc/kpagecount``. This file contains a 64-bit count of the number of
|
||||
times each page is mapped, indexed by PFN.
|
||||
|
||||
The page-types tool in the tools/vm directory can be used to query the
|
||||
The page-types tool in the tools/mm directory can be used to query the
|
||||
number of times a page is mapped.
|
||||
|
||||
* ``/proc/kpageflags``. This file contains a 64-bit set of flags for each
|
||||
@ -170,7 +170,7 @@ LRU related page flags
|
||||
14 - SWAPBACKED
|
||||
The page is backed by swap/RAM.
|
||||
|
||||
The page-types tool in the tools/vm directory can be used to query the
|
||||
The page-types tool in the tools/mm directory can be used to query the
|
||||
above flags.
|
||||
|
||||
Using pagemap to do something useful
|
||||
|
@ -55,18 +55,17 @@ flags the caller provides. The caller is required to pass in a non-null struct
|
||||
pages* array, and the function then pins pages by incrementing each by a special
|
||||
value: GUP_PIN_COUNTING_BIAS.
|
||||
|
||||
For compound pages, the GUP_PIN_COUNTING_BIAS scheme is not used. Instead,
|
||||
an exact form of pin counting is achieved, by using the 2nd struct page
|
||||
in the compound page. A new struct page field, compound_pincount, has
|
||||
been added in order to support this.
|
||||
For large folios, the GUP_PIN_COUNTING_BIAS scheme is not used. Instead,
|
||||
the extra space available in the struct folio is used to store the
|
||||
pincount directly.
|
||||
|
||||
This approach for compound pages avoids the counting upper limit problems that
|
||||
are discussed below. Those limitations would have been aggravated severely by
|
||||
huge pages, because each tail page adds a refcount to the head page. And in
|
||||
fact, testing revealed that, without a separate compound_pincount field,
|
||||
page overflows were seen in some huge page stress tests.
|
||||
This approach for large folios avoids the counting upper limit problems
|
||||
that are discussed below. Those limitations would have been aggravated
|
||||
severely by huge pages, because each tail page adds a refcount to the
|
||||
head page. And in fact, testing revealed that, without a separate pincount
|
||||
field, refcount overflows were seen in some huge page stress tests.
|
||||
|
||||
This also means that huge pages and compound pages do not suffer
|
||||
This also means that huge pages and large folios do not suffer
|
||||
from the false positives problem that is mentioned below.::
|
||||
|
||||
Function
|
||||
@ -221,7 +220,7 @@ Unit testing
|
||||
============
|
||||
This file::
|
||||
|
||||
tools/testing/selftests/vm/gup_test.c
|
||||
tools/testing/selftests/mm/gup_test.c
|
||||
|
||||
has the following new calls to exercise the new pin*() wrapper functions:
|
||||
|
||||
@ -264,9 +263,9 @@ place.)
|
||||
Other diagnostics
|
||||
=================
|
||||
|
||||
dump_page() has been enhanced slightly, to handle these new counting
|
||||
fields, and to better report on compound pages in general. Specifically,
|
||||
for compound pages, the exact (compound_pincount) pincount is reported.
|
||||
dump_page() has been enhanced slightly to handle these new counting
|
||||
fields, and to better report on large folios in general. Specifically,
|
||||
for large folios, the exact pincount is reported.
|
||||
|
||||
References
|
||||
==========
|
||||
|
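To make the pin_user_pages*() discussion above concrete, here is a minimal, hedged kernel-style sketch (not taken from the patch) of long-term pinning a single user page and releasing it; the function name is illustrative and error handling is reduced to the essentials::

    #include <linux/mm.h>

    /* Pin one user page for long-term access, then release it. */
    static int pin_one_page(unsigned long uaddr, struct page **page)
    {
            int ret;

            ret = pin_user_pages_fast(uaddr, 1, FOLL_WRITE | FOLL_LONGTERM, page);
            if (ret != 1)
                    return ret < 0 ? ret : -EFAULT;

            /* ... access the page, e.g. via kmap_local_page(*page) ... */

            unpin_user_page(*page);
            return 0;
    }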
@@ -140,6 +140,23 @@ disabling KASAN altogether or controlling its features:
- ``kasan.vmalloc=off`` or ``=on`` disables or enables tagging of vmalloc
  allocations (default: ``on``).

- ``kasan.page_alloc.sample=<sampling interval>`` makes KASAN tag only every
  Nth page_alloc allocation with the order equal or greater than
  ``kasan.page_alloc.sample.order``, where N is the value of the ``sample``
  parameter (default: ``1``, or tag every such allocation).
  This parameter is intended to mitigate the performance overhead introduced
  by KASAN.
  Note that enabling this parameter makes Hardware Tag-Based KASAN skip checks
  of allocations chosen by sampling and thus miss bad accesses to these
  allocations.  Use the default value for accurate bug detection.

- ``kasan.page_alloc.sample.order=<minimum page order>`` specifies the minimum
  order of allocations that are affected by sampling (default: ``3``).
  Only applies when ``kasan.page_alloc.sample`` is set to a value greater
  than ``1``.
  This parameter is intended to allow sampling only large page_alloc
  allocations, which is the biggest source of the performance overhead.
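For illustration only (not part of the patch), a kernel command line that samples every 16th page_alloc allocation of order 4 or larger might look like the following; the specific values are arbitrary examples::

    kasan.page_alloc.sample=16 kasan.page_alloc.sample.order=4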
Error reports
|
||||
~~~~~~~~~~~~~
|
||||
|
||||
|
@ -4,7 +4,7 @@ Memory Balancing
|
||||
|
||||
Started Jan 2000 by Kanoj Sarcar <kanoj@sgi.com>
|
||||
|
||||
Memory balancing is needed for !__GFP_ATOMIC and !__GFP_KSWAPD_RECLAIM as
|
||||
Memory balancing is needed for !__GFP_HIGH and !__GFP_KSWAPD_RECLAIM as
|
||||
well as for non __GFP_IO allocations.
|
||||
|
||||
The first reason why a caller may avoid reclaim is that the caller can not
|
||||
|
@ -4,8 +4,9 @@
|
||||
DAMON: Data Access MONitor
|
||||
==========================
|
||||
|
||||
DAMON is a data access monitoring framework subsystem for the Linux kernel.
|
||||
The core mechanisms of DAMON (refer to :doc:`design` for the detail) make it
|
||||
DAMON is a Linux kernel subsystem that provides a framework for data access
|
||||
monitoring and the monitoring results based system operations. The core
|
||||
monitoring mechanisms of DAMON (refer to :doc:`design` for the detail) make it
|
||||
|
||||
- *accurate* (the monitoring output is useful enough for DRAM level memory
|
||||
management; It might not appropriate for CPU Cache levels, though),
|
||||
@ -14,12 +15,16 @@ The core mechanisms of DAMON (refer to :doc:`design` for the detail) make it
|
||||
- *scalable* (the upper-bound of the overhead is in constant range regardless
|
||||
of the size of target workloads).
|
||||
|
||||
Using this framework, therefore, the kernel's memory management mechanisms can
|
||||
make advanced decisions. Experimental memory management optimization works
|
||||
that incurring high data accesses monitoring overhead could implemented again.
|
||||
In user space, meanwhile, users who have some special workloads can write
|
||||
personalized applications for better understanding and optimizations of their
|
||||
workloads and systems.
|
||||
Using this framework, therefore, the kernel can operate system in an
|
||||
access-aware fashion. Because the features are also exposed to the user space,
|
||||
users who have special information about their workloads can write personalized
|
||||
applications for better understanding and optimizations of their workloads and
|
||||
systems.
|
||||
|
||||
For easier development of such systems, DAMON provides a feature called DAMOS
|
||||
(DAMon-based Operation Schemes) in addition to the monitoring. Using the
|
||||
feature, DAMON users in both kernel and user spaces can do access-aware system
|
||||
operations with no code but simple configurations.
|
||||
|
||||
.. toctree::
|
||||
:maxdepth: 2
|
||||
@ -27,3 +32,4 @@ workloads and systems.
|
||||
faq
|
||||
design
|
||||
api
|
||||
maintainer-profile
|
||||
|
62
Documentation/mm/damon/maintainer-profile.rst
Normal file
62
Documentation/mm/damon/maintainer-profile.rst
Normal file
@ -0,0 +1,62 @@
|
||||
.. SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
DAMON Maintainer Entry Profile
|
||||
==============================
|
||||
|
||||
The DAMON subsystem covers the files that are listed in the 'DATA ACCESS
MONITOR' section of the 'MAINTAINERS' file.

The mailing lists for the subsystem are damon@lists.linux.dev and
linux-mm@kvack.org.  Patches should be made against the mm-unstable tree [1]_
whenever possible and posted to the mailing lists.

SCM Trees
---------

There are multiple Linux trees for DAMON development.  Patches under
development or testing are queued in damon/next [2]_ by the DAMON maintainer.
Sufficiently reviewed patches will be queued in mm-unstable [1]_ by the memory
management subsystem maintainer.  After sufficient testing, the patches will
be queued in mm-stable [3]_, and finally pull-requested to the mainline by the
memory management subsystem maintainer.

Note again that patches for review should be made against the mm-unstable
tree [1]_ whenever possible.  damon/next is only for previewing others' work
in progress.
|
||||
|
||||
Submit checklist addendum
|
||||
-------------------------
|
||||
|
||||
When making DAMON changes, you should do the following.

- Build the outputs related to the changes, including the kernel and documents.
- Ensure the builds introduce no new errors or warnings.
- Run the DAMON selftests [4]_ and kunit tests [5]_ and ensure there are no new
  failures.

Additionally, doing the following and sharing the results will be helpful.

- Run damon-tests/corr [6]_ for normal changes.
- Run damon-tests/perf [7]_ for performance changes.
|
||||
|
||||
Key cycle dates
|
||||
---------------
|
||||
|
||||
Patches can be sent anytime.  Key cycle dates of the mm-unstable [1]_ and
mm-stable [3]_ trees depend on the memory management subsystem maintainer.

Review cadence
--------------

The DAMON maintainer works during the usual work hours (09:00 to 17:00,
Mon-Fri) in PST.  The response to patches will occasionally be slow.  Do not
hesitate to send a ping if you have not heard back within a week of sending a
patch.
|
||||
|
||||
|
||||
.. [1] https://git.kernel.org/akpm/mm/h/mm-unstable
|
||||
.. [2] https://git.kernel.org/sj/h/damon/next
|
||||
.. [3] https://git.kernel.org/akpm/mm/h/mm-stable
|
||||
.. [4] https://github.com/awslabs/damon-tests/blob/master/corr/run.sh#L49
|
||||
.. [5] https://github.com/awslabs/damon-tests/blob/master/corr/tests/kunit.sh
|
||||
.. [6] https://github.com/awslabs/damon-tests/tree/master/corr
|
||||
.. [7] https://github.com/awslabs/damon-tests/tree/master/perf
|
@ -55,7 +55,8 @@ list shows them in order of preference of use.
|
||||
It can be invoked from any context (including interrupts) but the mappings
|
||||
can only be used in the context which acquired them.
|
||||
|
||||
This function should be preferred, where feasible, over all the others.
|
||||
This function should always be used, whereas kmap_atomic() and kmap() have
|
||||
been deprecated.
|
||||
|
||||
These mappings are thread-local and CPU-local, meaning that the mapping
|
||||
can only be accessed from within this thread and the thread is bound to the
|
||||
@ -80,7 +81,7 @@ list shows them in order of preference of use.
|
||||
for pages which are known to not come from ZONE_HIGHMEM. However, it is
|
||||
always safe to use kmap_local_page() / kunmap_local().
|
||||
|
||||
While it is significantly faster than kmap(), for the higmem case it
|
||||
While it is significantly faster than kmap(), for the highmem case it
|
||||
comes with restrictions about the pointers validity. Contrary to kmap()
|
||||
mappings, the local mappings are only valid in the context of the caller
|
||||
and cannot be handed to other contexts. This implies that users must
|
||||
@ -98,10 +99,21 @@ list shows them in order of preference of use.
|
||||
(included in the "Functions" section) for details on how to manage nested
|
||||
mappings.
|
||||
|
||||
* kmap_atomic(). This permits a very short duration mapping of a single
|
||||
page. Since the mapping is restricted to the CPU that issued it, it
|
||||
performs well, but the issuing task is therefore required to stay on that
|
||||
CPU until it has finished, lest some other task displace its mappings.
|
||||
* kmap_atomic(). This function has been deprecated; use kmap_local_page().
|
||||
|
||||
NOTE: Conversions to kmap_local_page() must take care to follow the mapping
|
||||
restrictions imposed on kmap_local_page(). Furthermore, the code between
|
||||
calls to kmap_atomic() and kunmap_atomic() may implicitly depend on the side
|
||||
effects of atomic mappings, i.e. disabling page faults or preemption, or both.
|
||||
In that case, explicit calls to pagefault_disable() or preempt_disable() or
|
||||
both must be made in conjunction with the use of kmap_local_page().
|
||||
|
||||
[Legacy documentation]
|
||||
|
||||
This permits a very short duration mapping of a single page. Since the
|
||||
mapping is restricted to the CPU that issued it, it performs well, but
|
||||
the issuing task is therefore required to stay on that CPU until it has
|
||||
finished, lest some other task displace its mappings.
|
||||
|
||||
kmap_atomic() may also be used by interrupt contexts, since it does not
|
||||
sleep and the callers too may not sleep until after kunmap_atomic() is
|
||||
@ -113,11 +125,20 @@ list shows them in order of preference of use.
|
||||
|
||||
It is assumed that k[un]map_atomic() won't fail.
|
||||
|
||||
* kmap(). This should be used to make short duration mapping of a single
|
||||
page with no restrictions on preemption or migration. It comes with an
|
||||
overhead as mapping space is restricted and protected by a global lock
|
||||
for synchronization. When mapping is no longer needed, the address that
|
||||
the page was mapped to must be released with kunmap().
|
||||
* kmap(). This function has been deprecated; use kmap_local_page().
|
||||
|
||||
NOTE: Conversions to kmap_local_page() must take care to follow the mapping
|
||||
restrictions imposed on kmap_local_page(). In particular, it is necessary to
|
||||
make sure that the kernel virtual memory pointer is only valid in the thread
|
||||
that obtained it.
|
||||
|
||||
[Legacy documentation]
|
||||
|
||||
This should be used to make short duration mapping of a single page with no
|
||||
restrictions on preemption or migration. It comes with an overhead as mapping
|
||||
space is restricted and protected by a global lock for synchronization. When
|
||||
mapping is no longer needed, the address that the page was mapped to must be
|
||||
released with kunmap().
|
||||
|
||||
Mapping changes must be propagated across all the CPUs. kmap() also
|
||||
requires global TLB invalidation when the kmap's pool wraps and it might
|
||||
|
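As a hedged sketch (not part of the patch) of the kmap_atomic() to kmap_local_page() conversion described above, a typical highmem copy helper looks like this; the function name is illustrative::

    #include <linux/highmem.h>
    #include <linux/string.h>

    /* Copy a buffer into a (possibly highmem) page using a local mapping. */
    static void copy_to_page_local(struct page *page, const void *src, size_t len)
    {
            void *dst = kmap_local_page(page);   /* replaces kmap_atomic(page) */

            memcpy(dst, src, len);
            kunmap_local(dst);                   /* replaces kunmap_atomic(dst) */
    }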
@ -179,14 +179,14 @@ Consuming Reservations/Allocating a Huge Page
|
||||
|
||||
Reservations are consumed when huge pages associated with the reservations
|
||||
are allocated and instantiated in the corresponding mapping. The allocation
|
||||
is performed within the routine alloc_huge_page()::
|
||||
is performed within the routine alloc_hugetlb_folio()::
|
||||
|
||||
struct page *alloc_huge_page(struct vm_area_struct *vma,
|
||||
struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
|
||||
unsigned long addr, int avoid_reserve)
|
||||
|
||||
alloc_huge_page is passed a VMA pointer and a virtual address, so it can
|
||||
alloc_hugetlb_folio is passed a VMA pointer and a virtual address, so it can
|
||||
consult the reservation map to determine if a reservation exists. In addition,
|
||||
alloc_huge_page takes the argument avoid_reserve which indicates reserves
|
||||
alloc_hugetlb_folio takes the argument avoid_reserve which indicates reserves
|
||||
should not be used even if it appears they have been set aside for the
|
||||
specified address. The avoid_reserve argument is most often used in the case
|
||||
of Copy on Write and Page Migration where additional copies of an existing
|
||||
@ -206,7 +206,8 @@ a reservation for the allocation. After determining whether a reservation
|
||||
exists and can be used for the allocation, the routine dequeue_huge_page_vma()
|
||||
is called. This routine takes two arguments related to reservations:
|
||||
|
||||
- avoid_reserve, this is the same value/argument passed to alloc_huge_page()
|
||||
- avoid_reserve, this is the same value/argument passed to
|
||||
alloc_hugetlb_folio().
|
||||
- chg, even though this argument is of type long only the values 0 or 1 are
|
||||
passed to dequeue_huge_page_vma. If the value is 0, it indicates a
|
||||
reservation exists (see the section "Memory Policy and Reservations" for
|
||||
@ -231,9 +232,9 @@ the scope reservations. Even if a surplus page is allocated, the same
|
||||
reservation based adjustments as above will be made: SetPagePrivate(page) and
|
||||
resv_huge_pages--.
|
||||
|
||||
After obtaining a new huge page, (page)->private is set to the value of
|
||||
the subpool associated with the page if it exists. This will be used for
|
||||
subpool accounting when the page is freed.
|
||||
After obtaining a new hugetlb folio, (folio)->_hugetlb_subpool is set to the
|
||||
value of the subpool associated with the page if it exists. This will be used
|
||||
for subpool accounting when the folio is freed.
|
||||
|
||||
The routine vma_commit_reservation() is then called to adjust the reserve
|
||||
map based on the consumption of the reservation. In general, this involves
|
||||
@ -244,8 +245,8 @@ was no reservation in a shared mapping or this was a private mapping a new
|
||||
entry must be created.
|
||||
|
||||
It is possible that the reserve map could have been changed between the call
|
||||
to vma_needs_reservation() at the beginning of alloc_huge_page() and the
|
||||
call to vma_commit_reservation() after the page was allocated. This would
|
||||
to vma_needs_reservation() at the beginning of alloc_hugetlb_folio() and the
|
||||
call to vma_commit_reservation() after the folio was allocated. This would
|
||||
be possible if hugetlb_reserve_pages was called for the same page in a shared
|
||||
mapping. In such cases, the reservation count and subpool free page count
|
||||
will be off by one. This rare condition can be identified by comparing the
|
||||
|
@ -89,15 +89,15 @@ variables are monotonically increasing.
|
||||
|
||||
Generation numbers are truncated into ``order_base_2(MAX_NR_GENS+1)``
|
||||
bits in order to fit into the gen counter in ``folio->flags``. Each
|
||||
truncated generation number is an index to ``lrugen->lists[]``. The
|
||||
truncated generation number is an index to ``lrugen->folios[]``. The
|
||||
sliding window technique is used to track at least ``MIN_NR_GENS`` and
|
||||
at most ``MAX_NR_GENS`` generations. The gen counter stores a value
|
||||
within ``[1, MAX_NR_GENS]`` while a page is on one of
|
||||
``lrugen->lists[]``; otherwise it stores zero.
|
||||
``lrugen->folios[]``; otherwise it stores zero.
|
||||
|
||||
Each generation is divided into multiple tiers. A page accessed ``N``
|
||||
times through file descriptors is in tier ``order_base_2(N)``. Unlike
|
||||
generations, tiers do not have dedicated ``lrugen->lists[]``. In
|
||||
generations, tiers do not have dedicated ``lrugen->folios[]``. In
|
||||
contrast to moving across generations, which requires the LRU lock,
|
||||
moving across tiers only involves atomic operations on
|
||||
``folio->flags`` and therefore has a negligible cost. A feedback loop
|
||||
@ -127,7 +127,7 @@ page mapped by this PTE to ``(max_seq%MAX_NR_GENS)+1``.
|
||||
Eviction
|
||||
--------
|
||||
The eviction consumes old generations. Given an ``lruvec``, it
|
||||
increments ``min_seq`` when ``lrugen->lists[]`` indexed by
|
||||
increments ``min_seq`` when ``lrugen->folios[]`` indexed by
|
||||
``min_seq%MAX_NR_GENS`` becomes empty. To select a type and a tier to
|
||||
evict from, it first compares ``min_seq[]`` to select the older type.
|
||||
If both types are equally old, it selects the one whose first tier has
|
||||
@ -141,9 +141,85 @@ loop has detected outlying refaults from the tier this page is in. To
|
||||
this end, the feedback loop uses the first tier as the baseline, for
|
||||
the reason stated earlier.
|
||||
|
||||
Working set protection
|
||||
----------------------
|
||||
Each generation is timestamped at birth. If ``lru_gen_min_ttl`` is
|
||||
set, an ``lruvec`` is protected from the eviction when its oldest
|
||||
generation was born within ``lru_gen_min_ttl`` milliseconds. In other
|
||||
words, it prevents the working set of ``lru_gen_min_ttl`` milliseconds
|
||||
from getting evicted. The OOM killer is triggered if this working set
|
||||
cannot be kept in memory.
|
||||
|
||||
This time-based approach has the following advantages:
|
||||
|
||||
1. It is easier to configure because it is agnostic to applications
|
||||
and memory sizes.
|
||||
2. It is more reliable because it is directly wired to the OOM killer.
|
||||
|
||||
Rmap/PT walk feedback
|
||||
---------------------
|
||||
Searching the rmap for PTEs mapping each page on an LRU list (to test
|
||||
and clear the accessed bit) can be expensive because pages from
|
||||
different VMAs (PA space) are not cache friendly to the rmap (VA
|
||||
space). For workloads mostly using mapped pages, searching the rmap
|
||||
can incur the highest CPU cost in the reclaim path.
|
||||
|
||||
``lru_gen_look_around()`` exploits spatial locality to reduce the
|
||||
trips into the rmap. It scans the adjacent PTEs of a young PTE and
|
||||
promotes hot pages. If the scan was done cacheline efficiently, it
|
||||
adds the PMD entry pointing to the PTE table to the Bloom filter. This
|
||||
forms a feedback loop between the eviction and the aging.
|
||||
|
||||
Bloom Filters
|
||||
-------------
|
||||
Bloom filters are a space and memory efficient data structure for set
|
||||
membership test, i.e., test if an element is not in the set or may be
|
||||
in the set.
|
||||
|
||||
In the eviction path, specifically, in ``lru_gen_look_around()``, if a
|
||||
PMD has a sufficient number of hot pages, its address is placed in the
|
||||
filter. In the aging path, set membership means that the PTE range
|
||||
will be scanned for young pages.
|
||||
|
||||
Note that Bloom filters are probabilistic on set membership. If a test
|
||||
is false positive, the cost is an additional scan of a range of PTEs,
|
||||
which may yield hot pages anyway. Parameters of the filter itself can
|
||||
control the false positive rate in the limit.
|
||||
|
||||
Memcg LRU
|
||||
---------
|
||||
An memcg LRU is a per-node LRU of memcgs. It is also an LRU of LRUs,
|
||||
since each node and memcg combination has an LRU of folios (see
|
||||
``mem_cgroup_lruvec()``). Its goal is to improve the scalability of
|
||||
global reclaim, which is critical to system-wide memory overcommit in
|
||||
data centers. Note that memcg LRU only applies to global reclaim.
|
||||
|
||||
The basic structure of an memcg LRU can be understood by an analogy to
|
||||
the active/inactive LRU (of folios):
|
||||
|
||||
1. It has the young and the old (generations), i.e., the counterparts
|
||||
to the active and the inactive;
|
||||
2. The increment of ``max_seq`` triggers promotion, i.e., the
|
||||
counterpart to activation;
|
||||
3. Other events trigger similar operations, e.g., offlining an memcg
|
||||
triggers demotion, i.e., the counterpart to deactivation.
|
||||
|
||||
In terms of global reclaim, it has two distinct features:
|
||||
|
||||
1. Sharding, which allows each thread to start at a random memcg (in
|
||||
the old generation) and improves parallelism;
|
||||
2. Eventual fairness, which allows direct reclaim to bail out at will
|
||||
and reduces latency without affecting fairness over some time.
|
||||
|
||||
In terms of traversing memcgs during global reclaim, it improves the
|
||||
best-case complexity from O(n) to O(1) and does not affect the
|
||||
worst-case complexity O(n). Therefore, on average, it has a sublinear
|
||||
complexity.
|
||||
|
||||
Summary
|
||||
-------
|
||||
The multi-gen LRU can be disassembled into the following parts:
|
||||
The multi-gen LRU (of folios) can be disassembled into the following
|
||||
parts:
|
||||
|
||||
* Generations
|
||||
* Rmap walks
|
||||
|
@ -59,7 +59,7 @@ Usage
|
||||
|
||||
1) Build user-space helper::
|
||||
|
||||
cd tools/vm
|
||||
cd tools/mm
|
||||
make page_owner_sort
|
||||
|
||||
2) Enable page owner: add "page_owner=on" to boot cmdline.
|
||||
|
@ -19,7 +19,7 @@ slabs that have data in them. See "slabinfo -h" for more options when
|
||||
running the command. ``slabinfo`` can be compiled with
|
||||
::
|
||||
|
||||
gcc -o slabinfo tools/vm/slabinfo.c
|
||||
gcc -o slabinfo tools/mm/slabinfo.c
|
||||
|
||||
Some of the modes of operation of ``slabinfo`` require that slub debugging
|
||||
be enabled on the command line. F.e. no tracking information will be
|
||||
|
@ -110,20 +110,20 @@ Refcounts and transparent huge pages
|
||||
Refcounting on THP is mostly consistent with refcounting on other compound
|
||||
pages:
|
||||
|
||||
- get_page()/put_page() and GUP operate on head page's ->_refcount.
|
||||
- get_page()/put_page() and GUP operate on the folio->_refcount.
|
||||
|
||||
- ->_refcount in tail pages is always zero: get_page_unless_zero() never
|
||||
succeeds on tail pages.
|
||||
|
||||
- map/unmap of PMD entry for the whole compound page increment/decrement
|
||||
->compound_mapcount, stored in the first tail page of the compound page;
|
||||
and also increment/decrement ->subpages_mapcount (also in the first tail)
|
||||
by COMPOUND_MAPPED when compound_mapcount goes from -1 to 0 or 0 to -1.
|
||||
- map/unmap of a PMD entry for the whole THP increment/decrement
|
||||
folio->_entire_mapcount and also increment/decrement
|
||||
folio->_nr_pages_mapped by COMPOUND_MAPPED when _entire_mapcount
|
||||
goes from -1 to 0 or 0 to -1.
|
||||
|
||||
- map/unmap of sub-pages with PTE entry increment/decrement ->_mapcount
|
||||
on relevant sub-page of the compound page, and also increment/decrement
|
||||
->subpages_mapcount, stored in first tail page of the compound page, when
|
||||
_mapcount goes from -1 to 0 or 0 to -1: counting sub-pages mapped by PTE.
|
||||
- map/unmap of individual pages with PTE entry increment/decrement
|
||||
page->_mapcount and also increment/decrement folio->_nr_pages_mapped
|
||||
when page->_mapcount goes from -1 to 0 or 0 to -1 as this counts
|
||||
the number of pages mapped by PTE.
|
||||
|
||||
split_huge_page internally has to distribute the refcounts in the head
|
||||
page to the tail pages before clearing all PG_head/tail bits from the page
|
||||
@ -151,8 +151,8 @@ clear where references should go after split: it will stay on the head page.
|
||||
Note that split_huge_pmd() doesn't have any limitations on refcounting:
|
||||
pmd can be split at any point and never fails.
|
||||
|
||||
Partial unmap and deferred_split_huge_page()
|
||||
============================================
|
||||
Partial unmap and deferred_split_folio()
|
||||
========================================
|
||||
|
||||
Unmapping part of THP (with munmap() or other way) is not going to free
|
||||
memory immediately. Instead, we detect that a subpage of THP is not in use
|
||||
@ -164,6 +164,6 @@ the place where we can detect partial unmap. It also might be
|
||||
counterproductive since in many cases partial unmap happens during exit(2) if
|
||||
a THP crosses a VMA boundary.
|
||||
|
||||
The function deferred_split_huge_page() is used to queue a page for splitting.
|
||||
The function deferred_split_folio() is used to queue a folio for splitting.
|
||||
The splitting itself will happen when we get memory pressure via shrinker
|
||||
interface.
|
||||
|
@@ -10,7 +10,7 @@ Introduction

This document describes the Linux memory manager's "Unevictable LRU"
infrastructure and the use of this to manage several types of "unevictable"
pages.
folios.

The document attempts to provide the overall rationale behind this mechanism
and the rationale for some of the design decisions that drove the
@@ -25,8 +25,8 @@ The Unevictable LRU
===================

The Unevictable LRU facility adds an additional LRU list to track unevictable
pages and to hide these pages from vmscan. This mechanism is based on a patch
by Larry Woodman of Red Hat to address several scalability problems with page
folios and to hide these folios from vmscan. This mechanism is based on a patch
by Larry Woodman of Red Hat to address several scalability problems with folio
reclaim in Linux. The problems have been observed at customer sites on large
memory x86_64 systems.

@@ -50,40 +50,41 @@ The infrastructure may also be able to handle other conditions that make pages
unevictable, either by definition or by circumstance, in the future.


The Unevictable LRU Page List
-----------------------------
The Unevictable LRU Folio List
------------------------------

The Unevictable LRU page list is a lie. It was never an LRU-ordered list, but a
companion to the LRU-ordered anonymous and file, active and inactive page lists;
and now it is not even a page list. But following familiar convention, here in
this document and in the source, we often imagine it as a fifth LRU page list.
The Unevictable LRU folio list is a lie. It was never an LRU-ordered
list, but a companion to the LRU-ordered anonymous and file, active and
inactive folio lists; and now it is not even a folio list. But following
familiar convention, here in this document and in the source, we often
imagine it as a fifth LRU folio list.

The Unevictable LRU infrastructure consists of an additional, per-node, LRU list
called the "unevictable" list and an associated page flag, PG_unevictable, to
indicate that the page is being managed on the unevictable list.
called the "unevictable" list and an associated folio flag, PG_unevictable, to
indicate that the folio is being managed on the unevictable list.

The PG_unevictable flag is analogous to, and mutually exclusive with, the
PG_active flag in that it indicates on which LRU list a page resides when
PG_active flag in that it indicates on which LRU list a folio resides when
PG_lru is set.

The Unevictable LRU infrastructure maintains unevictable pages as if they were
The Unevictable LRU infrastructure maintains unevictable folios as if they were
on an additional LRU list for a few reasons:

(1) We get to "treat unevictable pages just like we treat other pages in the
(1) We get to "treat unevictable folios just like we treat other folios in the
    system - which means we get to use the same code to manipulate them, the
    same code to isolate them (for migrate, etc.), the same code to keep track
    of the statistics, etc..." [Rik van Riel]

(2) We want to be able to migrate unevictable pages between nodes for memory
(2) We want to be able to migrate unevictable folios between nodes for memory
    defragmentation, workload management and memory hotplug. The Linux kernel
    can only migrate pages that it can successfully isolate from the LRU
    can only migrate folios that it can successfully isolate from the LRU
    lists (or "Movable" pages: outside of consideration here). If we were to
    maintain pages elsewhere than on an LRU-like list, where they can be
    detected by isolate_lru_page(), we would prevent their migration.
    maintain folios elsewhere than on an LRU-like list, where they can be
    detected by folio_isolate_lru(), we would prevent their migration.

The unevictable list does not differentiate between file-backed and anonymous,
swap-backed pages. This differentiation is only important while the pages are,
in fact, evictable.
The unevictable list does not differentiate between file-backed and
anonymous, swap-backed folios. This differentiation is only important
while the folios are, in fact, evictable.

The unevictable list benefits from the "arrayification" of the per-node LRU
lists and statistics originally proposed and posted by Christoph Lameter.
@@ -156,7 +157,7 @@ These are currently used in three places in the kernel:
Detecting Unevictable Pages
---------------------------

The function page_evictable() in mm/internal.h determines whether a page is
The function folio_evictable() in mm/internal.h determines whether a folio is
evictable or not using the query function outlined above [see section
:ref:`Marking address spaces unevictable <mark_addr_space_unevict>`]
to check the AS_UNEVICTABLE flag.
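
As a rough sketch (simplified from the real mm/internal.h helper, and
ignoring the RCU protection taken around the mapping lookup), the test
amounts to::

  static inline bool sketch_folio_evictable(struct folio *folio)
  {
          /* evictable iff the backing mapping is not AS_UNEVICTABLE
           * and the folio itself is not mlocked
           */
          return !mapping_unevictable(folio_mapping(folio)) &&
                 !folio_test_mlocked(folio);
  }
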
@@ -165,7 +166,7 @@ For address spaces that are so marked after being populated (as SHM regions
might be), the lock action (e.g. SHM_LOCK) can be lazy, and need not populate
the page tables for the region as does, for example, mlock(), nor need it make
any special effort to push any pages in the SHM_LOCK'd area to the unevictable
list. Instead, vmscan will do this if and when it encounters the pages during
list. Instead, vmscan will do this if and when it encounters the folios during
a reclamation scan.

On an unlock action (such as SHM_UNLOCK), the unlocker (e.g. shmctl()) must scan
@@ -174,41 +175,43 @@ condition is keeping them unevictable. If an unevictable region is destroyed,
the pages are also "rescued" from the unevictable list in the process of
freeing them.

page_evictable() also checks for mlocked pages by testing an additional page
flag, PG_mlocked (as wrapped by PageMlocked()), which is set when a page is
faulted into a VM_LOCKED VMA, or found in a VMA being VM_LOCKED.
folio_evictable() also checks for mlocked folios by calling
folio_test_mlocked(), which is set when a folio is faulted into a
VM_LOCKED VMA, or found in a VMA being VM_LOCKED.


Vmscan's Handling of Unevictable Pages
--------------------------------------
Vmscan's Handling of Unevictable Folios
---------------------------------------

If unevictable pages are culled in the fault path, or moved to the unevictable
list at mlock() or mmap() time, vmscan will not encounter the pages until they
If unevictable folios are culled in the fault path, or moved to the unevictable
list at mlock() or mmap() time, vmscan will not encounter the folios until they
have become evictable again (via munlock() for example) and have been "rescued"
from the unevictable list. However, there may be situations where we decide,
for the sake of expediency, to leave an unevictable page on one of the regular
for the sake of expediency, to leave an unevictable folio on one of the regular
active/inactive LRU lists for vmscan to deal with. vmscan checks for such
pages in all of the shrink_{active|inactive|page}_list() functions and will
"cull" such pages that it encounters: that is, it diverts those pages to the
folios in all of the shrink_{active|inactive|page}_list() functions and will
"cull" such folios that it encounters: that is, it diverts those folios to the
unevictable list for the memory cgroup and node being scanned.

There may be situations where a page is mapped into a VM_LOCKED VMA, but the
page is not marked as PG_mlocked. Such pages will make it all the way to
shrink_active_list() or shrink_page_list() where they will be detected when
vmscan walks the reverse map in folio_referenced() or try_to_unmap(). The page
is culled to the unevictable list when it is released by the shrinker.
There may be situations where a folio is mapped into a VM_LOCKED VMA,
but the folio does not have the mlocked flag set. Such folios will make
it all the way to shrink_active_list() or shrink_page_list() where they
will be detected when vmscan walks the reverse map in folio_referenced()
or try_to_unmap(). The folio is culled to the unevictable list when it
is released by the shrinker.

To "cull" an unevictable page, vmscan simply puts the page back on the LRU list
using putback_lru_page() - the inverse operation to isolate_lru_page() - after
dropping the page lock. Because the condition which makes the page unevictable
may change once the page is unlocked, __pagevec_lru_add_fn() will recheck the
unevictable state of a page before placing it on the unevictable list.
To "cull" an unevictable folio, vmscan simply puts the folio back on
the LRU list using folio_putback_lru() - the inverse operation to
folio_isolate_lru() - after dropping the folio lock. Because the
condition which makes the folio unevictable may change once the folio
is unlocked, __pagevec_lru_add_fn() will recheck the unevictable state
of a folio before placing it on the unevictable list.


MLOCKED Pages
=============

The unevictable page list is also useful for mlock(), in addition to ramfs and
The unevictable folio list is also useful for mlock(), in addition to ramfs and
SYSV SHM. Note that mlock() is only available in CONFIG_MMU=y situations; in
NOMMU situations, all mappings are effectively mlocked.

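From userspace the interface is simply mlock()/munlock() (or mlock2() with
MLOCK_ONFAULT); a minimal example of pinning a buffer looks like::

  #include <stdlib.h>
  #include <string.h>
  #include <sys/mman.h>

  int main(void)
  {
          size_t len = 16 * 4096;
          void *buf = aligned_alloc(4096, len);

          if (!buf || mlock(buf, len))    /* fails if RLIMIT_MEMLOCK is too low */
                  return 1;
          memset(buf, 0, len);            /* pages are now resident and mlocked */
          munlock(buf, len);
          free(buf);
          return 0;
  }
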
@@ -293,7 +296,7 @@ treated as a no-op and mlock_fixup() simply returns.
If the VMA passes some filtering as described in "Filtering Special VMAs"
below, mlock_fixup() will attempt to merge the VMA with its neighbors or split
off a subset of the VMA if the range does not cover the entire VMA. Any pages
already present in the VMA are then marked as mlocked by mlock_page() via
already present in the VMA are then marked as mlocked by mlock_folio() via
mlock_pte_range() via walk_page_range() via mlock_vma_pages_range().

Before returning from the system call, do_mlock() or mlockall() will call
@@ -306,22 +309,22 @@ do end up getting faulted into this VM_LOCKED VMA, they will be handled in the
fault path - which is also how mlock2()'s MLOCK_ONFAULT areas are handled.

For each PTE (or PMD) being faulted into a VMA, the page add rmap function
calls mlock_vma_page(), which calls mlock_page() when the VMA is VM_LOCKED
calls mlock_vma_folio(), which calls mlock_folio() when the VMA is VM_LOCKED
(unless it is a PTE mapping of a part of a transparent huge page). Or when
it is a newly allocated anonymous page, lru_cache_add_inactive_or_unevictable()
calls mlock_new_page() instead: similar to mlock_page(), but can make better
it is a newly allocated anonymous page, folio_add_lru_vma() calls
mlock_new_folio() instead: similar to mlock_folio(), but can make better
judgments, since this page is held exclusively and known not to be on LRU yet.

mlock_page() sets PageMlocked immediately, then places the page on the CPU's
mlock pagevec, to batch up the rest of the work to be done under lru_lock by
__mlock_page(). __mlock_page() sets PageUnevictable, initializes mlock_count
mlock_folio() sets PG_mlocked immediately, then places the page on the CPU's
mlock folio batch, to batch up the rest of the work to be done under lru_lock by
__mlock_folio(). __mlock_folio() sets PG_unevictable, initializes mlock_count
and moves the page to unevictable state ("the unevictable LRU", but with
mlock_count in place of LRU threading). Or if the page was already PageLRU
and PageUnevictable and PageMlocked, it simply increments the mlock_count.
mlock_count in place of LRU threading). Or if the page was already PG_lru
and PG_unevictable and PG_mlocked, it simply increments the mlock_count.

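A condensed sketch of that batching logic (illustrative only, with the
per-CPU batch handling, isolation and statistics left out) is roughly::

  /* front end: mark the folio, defer the LRU work to the batch flush */
  static void sketch_mlock_folio(struct folio *folio)
  {
          folio_set_mlocked(folio);
          /* queue the folio on this CPU's mlock folio batch; the batch is
           * drained under lru_lock, calling the back end below
           */
  }

  /* back end: runs under lru_lock when the batch is flushed */
  static void sketch___mlock_folio(struct folio *folio, struct lruvec *lruvec)
  {
          if (folio_test_lru(folio) && folio_test_unevictable(folio)) {
                  folio->mlock_count++;   /* already on "the unevictable LRU" */
          } else if (folio_test_lru(folio)) {
                  folio_set_unevictable(folio);
                  folio->mlock_count = 1;
                  /* move the folio to the unevictable state */
          }
          /* otherwise the folio is off-LRU and mlock_count cannot be
           * touched yet; see the caveat below
           */
  }
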
But in practice that may not work ideally: the page may not yet be on an LRU, or
it may have been temporarily isolated from LRU. In such cases the mlock_count
field cannot be touched, but will be set to 0 later when __pagevec_lru_add_fn()
field cannot be touched, but will be set to 0 later when __munlock_folio()
returns the page to "LRU". Races prohibit mlock_count from being set to 1 then:
rather than risk stranding a page indefinitely as unevictable, always err with
mlock_count on the low side, so that when munlocked the page will be rescued to
@@ -368,20 +371,21 @@ Because of the VMA filtering discussed above, VM_LOCKED will not be set in
any "special" VMAs. So, those VMAs will be ignored for munlock.

If the VMA is VM_LOCKED, mlock_fixup() again attempts to merge or split off the
specified range. All pages in the VMA are then munlocked by munlock_page() via
specified range. All pages in the VMA are then munlocked by munlock_folio() via
mlock_pte_range() via walk_page_range() via mlock_vma_pages_range() - the same
function used when mlocking a VMA range, with new flags for the VMA indicating
that it is munlock() being performed.

munlock_page() uses the mlock pagevec to batch up work to be done under
lru_lock by __munlock_page(). __munlock_page() decrements the page's
mlock_count, and when that reaches 0 it clears PageMlocked and clears
PageUnevictable, moving the page from unevictable state to inactive LRU.
munlock_folio() uses the mlock pagevec to batch up work to be done
under lru_lock by __munlock_folio(). __munlock_folio() decrements the
folio's mlock_count, and when that reaches 0 it clears the mlocked flag
and clears the unevictable flag, moving the folio from unevictable state
to the inactive LRU.

But in practice that may not work ideally: the page may not yet have reached
But in practice that may not work ideally: the folio may not yet have reached
"the unevictable LRU", or it may have been temporarily isolated from it. In
those cases its mlock_count field is unusable and must be assumed to be 0: so
that the page will be rescued to an evictable LRU, then perhaps be mlocked
that the folio will be rescued to an evictable LRU, then perhaps be mlocked
again later if vmscan finds it in a VM_LOCKED VMA.


@@ -408,7 +412,7 @@ However, since mlock_vma_pages_range() starts by setting VM_LOCKED on a VMA,
before mlocking any pages already present, if one of those pages were migrated
before mlock_pte_range() reached it, it would get counted twice in mlock_count.
To prevent that, mlock_vma_pages_range() temporarily marks the VMA as VM_IO,
so that mlock_vma_page() will skip it.
so that mlock_vma_folio() will skip it.

To complete page migration, we place the old and new pages back onto the LRU
afterwards. The "unneeded" page - old page on success, new page on failure -
@@ -481,18 +485,19 @@ Before the unevictable/mlock changes, mlocking did not mark the pages in any
way, so unmapping them required no processing.

For each PTE (or PMD) being unmapped from a VMA, page_remove_rmap() calls
munlock_vma_page(), which calls munlock_page() when the VMA is VM_LOCKED
munlock_vma_folio(), which calls munlock_folio() when the VMA is VM_LOCKED
(unless it was a PTE mapping of a part of a transparent huge page).

munlock_page() uses the mlock pagevec to batch up work to be done under
lru_lock by __munlock_page(). __munlock_page() decrements the page's
mlock_count, and when that reaches 0 it clears PageMlocked and clears
PageUnevictable, moving the page from unevictable state to inactive LRU.
munlock_folio() uses the mlock pagevec to batch up work to be done
under lru_lock by __munlock_folio(). __munlock_folio() decrements the
folio's mlock_count, and when that reaches 0 it clears the mlocked flag
and clears the unevictable flag, moving the folio from unevictable state
to the inactive LRU.

But in practice that may not work ideally: the page may not yet have reached
But in practice that may not work ideally: the folio may not yet have reached
"the unevictable LRU", or it may have been temporarily isolated from it. In
those cases its mlock_count field is unusable and must be assumed to be 0: so
that the page will be rescued to an evictable LRU, then perhaps be mlocked
that the folio will be rescued to an evictable LRU, then perhaps be mlocked
again later if vmscan finds it in a VM_LOCKED VMA.


@@ -505,7 +510,7 @@ which had been Copied-On-Write from the file pages now being truncated.

Mlocked pages can be munlocked and deleted in this way: like with munmap(),
for each PTE (or PMD) being unmapped from a VMA, page_remove_rmap() calls
munlock_vma_page(), which calls munlock_page() when the VMA is VM_LOCKED
munlock_vma_folio(), which calls munlock_folio() when the VMA is VM_LOCKED
(unless it was a PTE mapping of a part of a transparent huge page).

However, if there is a racing munlock(), since mlock_vma_pages_range() starts
@@ -513,7 +518,7 @@ munlocking by clearing VM_LOCKED from a VMA, before munlocking all the pages
present, if one of those pages were unmapped by truncation or hole punch before
mlock_pte_range() reached it, it would not be recognized as mlocked by this VMA,
and would not be counted out of mlock_count. In this rare case, a page may
still appear as PageMlocked after it has been fully unmapped: and it is left to
still appear as PG_mlocked after it has been fully unmapped: and it is left to
release_pages() (or __page_cache_release()) to clear it and update statistics
before freeing (this event is counted in /proc/vmstat unevictable_pgs_cleared,
which is usually 0).
@@ -525,7 +530,7 @@ Page Reclaim in shrink_*_list()
vmscan's shrink_active_list() culls any obviously unevictable pages -
i.e. !page_evictable(page) pages - diverting those to the unevictable list.
However, shrink_active_list() only sees unevictable pages that made it onto the
active/inactive LRU lists. Note that these pages do not have PageUnevictable
active/inactive LRU lists. Note that these pages do not have PG_unevictable
set - otherwise they would be on the unevictable list and shrink_active_list()
would never see them.

@@ -547,6 +552,6 @@ and node unevictable list.

rmap's folio_referenced_one(), called via vmscan's shrink_active_list() or
shrink_page_list(), and rmap's try_to_unmap_one() called via shrink_page_list(),
check for (3) pages still mapped into VM_LOCKED VMAs, and call mlock_vma_page()
check for (3) pages still mapped into VM_LOCKED VMAs, and call mlock_vma_folio()
to correct them. Such pages are culled to the unevictable list when released
by the shrinker.
@@ -78,3 +78,171 @@ Similarly, we assign zspage to:
* ZS_ALMOST_FULL when n > N / f
* ZS_EMPTY when n == 0
* ZS_FULL when n == N


Internals
=========

zsmalloc has 255 size classes, each of which can hold a number of zspages.
Each zspage can contain up to ZSMALLOC_CHAIN_SIZE physical (0-order) pages.
The optimal zspage chain size for each size class is calculated during the
creation of the zsmalloc pool (see calculate_zspage_chain_size()).

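The selection criterion is essentially "pick the chain length that leaves the
fewest bytes unused per zspage". A standalone approximation of that
calculation (user-space C for illustration, not the kernel function itself)::

  #include <stdio.h>

  #define PAGE_SIZE 4096
  #define MAX_CHAIN 8     /* stands in for CONFIG_ZSMALLOC_CHAIN_SIZE */

  /* pick the number of 0-order pages that minimizes per-zspage waste */
  static int chain_size_for(int class_size)
  {
          int best = 1, min_waste = PAGE_SIZE;

          for (int pages = 1; pages <= MAX_CHAIN; pages++) {
                  int waste = (pages * PAGE_SIZE) % class_size;

                  if (waste < min_waste) {
                          min_waste = waste;
                          best = pages;
                  }
          }
          return best;
  }

  int main(void)
  {
          /* for a 1568-byte class this prints 5, matching the table below */
          printf("class 1568: %d pages per zspage\n", chain_size_for(1568));
          return 0;
  }
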
As an optimization, zsmalloc merges size classes that have similar
characteristics in terms of the number of pages per zspage and the number
of objects that each zspage can store.

For instance, consider the following size classes:::

  class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable
  ...
  94 1536 0 0 0 0 0 3 0
  100 1632 0 0 0 0 0 2 0
  ...


Size classes #95-99 are merged with size class #100. This means that when we
need to store an object of size, say, 1568 bytes, we end up using size class
#100 instead of size class #96. Size class #100 is meant for objects of size
1632 bytes, so each object of size 1568 bytes wastes 1632-1568=64 bytes.

Size class #100 consists of zspages with 2 physical pages each, which can
hold a total of 5 objects. If we need to store 13 objects of size 1568, we
end up allocating three zspages, or 6 physical pages.

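The arithmetic behind those numbers, with a 4 KiB page size assumed, can be
checked directly::

  #include <stdio.h>

  int main(void)
  {
          int page_size = 4096, pages_per_zspage = 2, class_size = 1632;
          int objs = 13;

          /* a 2-page zspage holds 8192 / 1632 = 5 objects */
          int objs_per_zspage = (pages_per_zspage * page_size) / class_size;
          /* 13 objects therefore need ceil(13 / 5) = 3 zspages = 6 pages */
          int zspages = (objs + objs_per_zspage - 1) / objs_per_zspage;

          printf("%d objects/zspage, %d zspages, %d physical pages\n",
                 objs_per_zspage, zspages, zspages * pages_per_zspage);
          return 0;
  }
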
However, if we take a closer look at size class #96 (which is meant for
objects of size 1568 bytes) and trace `calculate_zspage_chain_size()`, we
find that the optimal zspage configuration for this class is a chain
of 5 physical pages:::

  pages per zspage    wasted bytes    used%
  1                   960             76
  2                   352             95
  3                   1312            89
  4                   704             95
  5                   96              99

This means that a class #96 configuration with 5 physical pages can store 13
objects of size 1568 in a single zspage, using a total of 5 physical pages.
This is more efficient than the class #100 configuration, which would use 6
physical pages to store the same number of objects.

As the zspage chain size for class #96 increases, its key characteristics
such as pages per-zspage and objects per-zspage also change. This leads to
fewer class mergers, resulting in a more compact grouping of classes, which
reduces memory wastage.

Let's take a closer look at the bottom of `/sys/kernel/debug/zsmalloc/zramX/classes`:::

  class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable
  ...
  202 3264 0 0 0 0 0 4 0
  254 4096 0 0 0 0 0 1 0
  ...

Size class #202 stores objects of size 3264 bytes and has a maximum of 4 pages
per zspage. Any object larger than 3264 bytes is considered huge and belongs
to size class #254, which stores each object in its own physical page (objects
in huge classes do not share pages).

Increasing the size of the chain of zspages also results in a higher watermark
for the huge size class and fewer huge classes overall. This allows for more
efficient storage of large objects.

For zspage chain size of 8, huge class watermark becomes 3632 bytes:::

  class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable
  ...
  202 3264 0 0 0 0 0 4 0
  211 3408 0 0 0 0 0 5 0
  217 3504 0 0 0 0 0 6 0
  222 3584 0 0 0 0 0 7 0
  225 3632 0 0 0 0 0 8 0
  254 4096 0 0 0 0 0 1 0
  ...

For zspage chain size of 16, huge class watermark becomes 3840 bytes:::

  class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable
  ...
  202 3264 0 0 0 0 0 4 0
  206 3328 0 0 0 0 0 13 0
  207 3344 0 0 0 0 0 9 0
  208 3360 0 0 0 0 0 14 0
  211 3408 0 0 0 0 0 5 0
  212 3424 0 0 0 0 0 16 0
  214 3456 0 0 0 0 0 11 0
  217 3504 0 0 0 0 0 6 0
  219 3536 0 0 0 0 0 13 0
  222 3584 0 0 0 0 0 7 0
  223 3600 0 0 0 0 0 15 0
  225 3632 0 0 0 0 0 8 0
  228 3680 0 0 0 0 0 9 0
  230 3712 0 0 0 0 0 10 0
  232 3744 0 0 0 0 0 11 0
  234 3776 0 0 0 0 0 12 0
  235 3792 0 0 0 0 0 13 0
  236 3808 0 0 0 0 0 14 0
  238 3840 0 0 0 0 0 15 0
  254 4096 0 0 0 0 0 1 0
  ...

Overall the combined zspage chain size effect on zsmalloc pool configuration:::

  pages per zspage   number of size classes (clusters)   huge size class watermark
  4                  69                                   3264
  5                  86                                   3408
  6                  93                                   3504
  7                  112                                  3584
  8                  123                                  3632
  9                  140                                  3680
  10                 143                                  3712
  11                 159                                  3744
  12                 164                                  3776
  13                 180                                  3792
  14                 183                                  3808
  15                 188                                  3840
  16                 191                                  3840


A synthetic test
----------------

zram used as storage for build artifacts (Linux kernel compilation).

* `CONFIG_ZSMALLOC_CHAIN_SIZE=4`

zsmalloc classes stats:::

  class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable
  ...
  Total 13 51 413836 412973 159955 3

zram mm_stat:::

  1691783168 628083717 655175680 0 655175680 60 0 34048 34049


* `CONFIG_ZSMALLOC_CHAIN_SIZE=8`

zsmalloc classes stats:::

  class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable
  ...
  Total 18 87 414852 412978 156666 0

zram mm_stat:::

  1691803648 627793930 641703936 0 641703936 60 0 33591 33591

Using larger zspage chains may result in using fewer physical pages, as seen
in the example above, where the number of physical pages used decreased from
159955 to 156666; at the same time, the maximum zsmalloc pool memory usage
went down from 655175680 to 641703936 bytes.

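For reference, the first three mm_stat columns are orig_data_size,
compr_data_size and mem_used_total (in bytes), so the two runs can be
compared directly; a throwaway helper using the values quoted above::

  #include <stdio.h>

  /* hypothetical helper, not part of the kernel or the zram tooling */
  static void report(const char *cfg, double orig, double compr, double used)
  {
          printf("%s: compression ratio %.2f, pool overhead %.1f%%\n",
                 cfg, orig / compr, 100.0 * (used - compr) / compr);
  }

  int main(void)
  {
          report("CHAIN_SIZE=4", 1691783168, 628083717, 655175680);
          report("CHAIN_SIZE=8", 1691803648, 627793930, 641703936);
          return 0;
  }
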
However, this advantage may be offset by the potential for increased system
memory pressure (as some zspages have larger chain sizes) in cases where there
is heavy internal fragmentation and zspool compaction is unable to relocate
objects and release zspages. In these cases, it is recommended to decrease
the limit on the size of the zspage chains (as specified by the
CONFIG_ZSMALLOC_CHAIN_SIZE option).

@@ -142,14 +142,14 @@ the HPAGE_RESV_OWNER flag is set to indicate that the VMA owns the reservations.
Consuming Reservations / Allocating a Huge Page
===============================================

A reservation is consumed when a huge page associated with it is allocated and instantiated in the corresponding mapping. The allocation is performed in the function alloc_huge_page()
A reservation is consumed when a huge page associated with it is allocated and instantiated in the corresponding mapping. The allocation is performed in the function alloc_hugetlb_folio()
as follows::

  struct page *alloc_huge_page(struct vm_area_struct *vma,
  struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
                                    unsigned long addr, int avoid_reserve)

alloc_huge_page is passed a VMA pointer and a virtual address, so it can consult the reservation map to determine whether a reservation exists.
In addition, alloc_huge_page takes an argument, avoid_reserve, which indicates that the reservation should not be used even though one appears to have been
alloc_hugetlb_folio is passed a VMA pointer and a virtual address, so it can consult the reservation map to determine whether a reservation exists.
In addition, alloc_hugetlb_folio takes an argument, avoid_reserve, which indicates that the reservation should not be used even though one appears to have been
set aside for the specified address. The avoid_reserve argument is most often used in the cases of copy-on-write and page migration, where an extra
copy of an existing page is being allocated.

@@ -162,7 +162,7 @@ The value returned by vma_needs_reservation() is usually 0 or 1. It is 0 if a reservation exists for the address
After determining whether a reservation exists and can be used for the allocation, the function dequeue_huge_page_vma() is called. It takes two reservation-related
arguments:

- avoid_reserve, the same value/argument that was passed to alloc_huge_page().
- avoid_reserve, the same value/argument that was passed to alloc_hugetlb_folio().
- chg, although this argument is of type long, only the values 0 or 1 are passed to dequeue_huge_page_vma. If the value is 0,
  it indicates that a reservation exists (see the section "Reservations and Memory Policy" for possible issues). If the value
  is 1, it indicates that no reservation exists and the page must, if possible, be taken from the global free pool.
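
The allocation path described above and continued below can be condensed
into roughly the following pseudocode (a sketch of the description, with
simplified helper signatures, not the real alloc_hugetlb_folio())::

  struct folio *sketch_alloc_hugetlb_folio(struct hstate *h,
                                           struct vm_area_struct *vma,
                                           unsigned long addr, int avoid_reserve)
  {
          /* 0 means a reservation exists for this address */
          long chg = vma_needs_reservation(h, vma, addr);
          struct folio *folio;

          /* may consume the reservation, or dip into the global free pool */
          folio = dequeue_huge_page_vma(h, vma, addr, avoid_reserve, chg);
          if (folio)
                  /* reconcile the reservation map with what was consumed */
                  vma_commit_reservation(h, vma, addr);
          return folio;
  }
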
@@ -179,7 +179,7 @@ the value of free_huge_pages is decremented. If there is a reservation associated with the page,
remaining huge pages and over-allocation. Even if a surplus page is allocated, the same reservation-based adjustments as above are performed:
SetPagePrivate(page) and resv_huge_pages--.

After obtaining a new huge page, (page)->private is set to the value of the subpool associated with the page, if one exists. This will be used
After obtaining a new huge page, (folio)->_hugetlb_subpool is set to the value of the subpool associated with the page, if one exists. This will be used
for subpool accounting when the page is freed.

The function vma_commit_reservation() is then called to adjust the reservation map based on how the reservation was consumed. In general, this involves
@@ -199,7 +199,7 @@ SetPagePrivate(page) and resv_huge_pages--.
already exists, so no change is made. However, if there was no reservation in a shared mapping, or if this is a private mapping, a new entry
must be created.

It is possible for the reservation map to be changed between the call to vma_needs_reservation() at the start of alloc_huge_page() and the call to
It is possible for the reservation map to be changed between the call to vma_needs_reservation() at the start of alloc_hugetlb_folio() and the call to
vma_commit_reservation() after the page has been allocated. This can happen when hugetlb_reserve_pages is called for the same page in a shared
mapping. In that case, the reservation count and the subpool free-page count will be off by one.
This rare case can be identified by comparing the return values of vma_needs_reservation and vma_commit_reservation

@@ -51,7 +51,7 @@ page owner is disabled by default. So, if you want to use it, you

1) Build the user-space helper::

     cd tools/vm
     cd tools/mm
     make page_owner_sort

2) Enable page owner: add "page_owner=on" to the boot cmdline.

MAINTAINERS
@ -5657,6 +5657,11 @@ M: SeongJae Park <sj@kernel.org>
|
||||
L: damon@lists.linux.dev
|
||||
L: linux-mm@kvack.org
|
||||
S: Maintained
|
||||
W: https://damonitor.github.io
|
||||
P: Documentation/mm/damon/maintainer-profile.rst
|
||||
T: git git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
|
||||
T: quilt git://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new
|
||||
T: git git://git.kernel.org/pub/scm/linux/kernel/git/sj/linux.git damon/next
|
||||
F: Documentation/ABI/testing/sysfs-kernel-mm-damon
|
||||
F: Documentation/admin-guide/mm/damon/
|
||||
F: Documentation/mm/damon/
|
||||
@ -9340,7 +9345,7 @@ F: Documentation/mm/hmm.rst
|
||||
F: include/linux/hmm*
|
||||
F: lib/test_hmm*
|
||||
F: mm/hmm*
|
||||
F: tools/testing/selftests/vm/*hmm*
|
||||
F: tools/testing/selftests/mm/*hmm*
|
||||
|
||||
HOST AP DRIVER
|
||||
M: Jouni Malinen <j@w1.fi>
|
||||
@ -13378,7 +13383,7 @@ M: Andrew Morton <akpm@linux-foundation.org>
|
||||
L: linux-mm@kvack.org
|
||||
S: Maintained
|
||||
W: http://www.linux-mm.org
|
||||
T: git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
|
||||
T: git git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
|
||||
T: quilt git://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new
|
||||
F: include/linux/gfp.h
|
||||
F: include/linux/gfp_types.h
|
||||
@ -13387,7 +13392,8 @@ F: include/linux/mm.h
|
||||
F: include/linux/mmzone.h
|
||||
F: include/linux/pagewalk.h
|
||||
F: mm/
|
||||
F: tools/testing/selftests/vm/
|
||||
F: tools/mm/
|
||||
F: tools/testing/selftests/mm/
|
||||
|
||||
VMALLOC
|
||||
M: Andrew Morton <akpm@linux-foundation.org>
|
||||
@ -13396,7 +13402,7 @@ R: Christoph Hellwig <hch@infradead.org>
|
||||
L: linux-mm@kvack.org
|
||||
S: Maintained
|
||||
W: http://www.linux-mm.org
|
||||
T: git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
|
||||
T: git git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
|
||||
F: include/linux/vmalloc.h
|
||||
F: mm/vmalloc.c
|
||||
|
||||
|
@ -17,9 +17,8 @@
|
||||
extern void clear_page(void *page);
|
||||
#define clear_user_page(page, vaddr, pg) clear_page(page)
|
||||
|
||||
#define alloc_zeroed_user_highpage_movable(vma, vaddr) \
|
||||
alloc_page_vma(GFP_HIGHUSER_MOVABLE | __GFP_ZERO, vma, vaddr)
|
||||
#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE_MOVABLE
|
||||
#define vma_alloc_zeroed_movable_folio(vma, vaddr) \
|
||||
vma_alloc_folio(GFP_HIGHUSER_MOVABLE | __GFP_ZERO, 0, vma, vaddr, false)
|
||||
|
||||
extern void copy_page(void * _to, void * _from);
|
||||
#define copy_user_page(to, from, vaddr, pg) copy_page(to, from)
|
||||
@ -87,10 +86,6 @@ typedef struct page *pgtable_t;
|
||||
#define virt_to_page(kaddr) pfn_to_page(__pa(kaddr) >> PAGE_SHIFT)
|
||||
#define virt_addr_valid(kaddr) pfn_valid((__pa(kaddr) >> PAGE_SHIFT))
|
||||
|
||||
#ifdef CONFIG_FLATMEM
|
||||
#define pfn_valid(pfn) ((pfn) < max_mapnr)
|
||||
#endif /* CONFIG_FLATMEM */
|
||||
|
||||
#include <asm-generic/memory_model.h>
|
||||
#include <asm-generic/getorder.h>
|
||||
|
||||
|
@ -74,6 +74,9 @@ struct vm_area_struct;
|
||||
#define _PAGE_DIRTY 0x20000
|
||||
#define _PAGE_ACCESSED 0x40000
|
||||
|
||||
/* We borrow bit 39 to store the exclusive marker in swap PTEs. */
|
||||
#define _PAGE_SWP_EXCLUSIVE 0x8000000000UL
|
||||
|
||||
/*
|
||||
* NOTE! The "accessed" bit isn't necessarily exact: it can be kept exactly
|
||||
* by software (use the KRE/URE/KWE/UWE bits appropriately), but I'll fake it.
|
||||
@ -301,18 +304,47 @@ extern inline void update_mmu_cache(struct vm_area_struct * vma,
|
||||
}
|
||||
|
||||
/*
|
||||
* Non-present pages: high 24 bits are offset, next 8 bits type,
|
||||
* low 32 bits zero.
|
||||
* Encode/decode swap entries and swap PTEs. Swap PTEs are all PTEs that
|
||||
* are !pte_none() && !pte_present().
|
||||
*
|
||||
* Format of swap PTEs:
|
||||
*
|
||||
* 6 6 6 6 5 5 5 5 5 5 5 5 5 5 4 4 4 4 4 4 4 4 4 4 3 3 3 3 3 3 3 3
|
||||
* 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2
|
||||
* <------------------- offset ------------------> E <--- type -->
|
||||
*
|
||||
* 3 3 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1
|
||||
* 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0
|
||||
* <--------------------------- zeroes -------------------------->
|
||||
*
|
||||
* E is the exclusive marker that is not stored in swap entries.
|
||||
*/
|
||||
extern inline pte_t mk_swap_pte(unsigned long type, unsigned long offset)
|
||||
{ pte_t pte; pte_val(pte) = (type << 32) | (offset << 40); return pte; }
|
||||
{ pte_t pte; pte_val(pte) = ((type & 0x7f) << 32) | (offset << 40); return pte; }
|
||||
|
||||
#define __swp_type(x) (((x).val >> 32) & 0xff)
|
||||
#define __swp_type(x) (((x).val >> 32) & 0x7f)
|
||||
#define __swp_offset(x) ((x).val >> 40)
|
||||
#define __swp_entry(type, off) ((swp_entry_t) { pte_val(mk_swap_pte((type), (off))) })
|
||||
#define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val(pte) })
|
||||
#define __swp_entry_to_pte(x) ((pte_t) { (x).val })
|
||||
|
||||
static inline int pte_swp_exclusive(pte_t pte)
|
||||
{
|
||||
return pte_val(pte) & _PAGE_SWP_EXCLUSIVE;
|
||||
}
|
||||
|
||||
static inline pte_t pte_swp_mkexclusive(pte_t pte)
|
||||
{
|
||||
pte_val(pte) |= _PAGE_SWP_EXCLUSIVE;
|
||||
return pte;
|
||||
}
|
||||
|
||||
static inline pte_t pte_swp_clear_exclusive(pte_t pte)
|
||||
{
|
||||
pte_val(pte) &= ~_PAGE_SWP_EXCLUSIVE;
|
||||
return pte;
|
||||
}
|
||||
|
||||
#define pte_ERROR(e) \
|
||||
printk("%s:%d: bad pte %016lx.\n", __FILE__, __LINE__, pte_val(e))
|
||||
#define pmd_ERROR(e) \
|
||||
|
@ -109,7 +109,6 @@ extern int pfn_valid(unsigned long pfn);
|
||||
#else /* CONFIG_HIGHMEM */
|
||||
|
||||
#define ARCH_PFN_OFFSET virt_to_pfn(CONFIG_LINUX_RAM_BASE)
|
||||
#define pfn_valid(pfn) (((pfn) - ARCH_PFN_OFFSET) < max_mapnr)
|
||||
|
||||
#endif /* CONFIG_HIGHMEM */
|
||||
|
||||
|
@ -26,6 +26,9 @@
|
||||
#define _PAGE_GLOBAL (1 << 8) /* ASID agnostic (H) */
|
||||
#define _PAGE_PRESENT (1 << 9) /* PTE/TLB Valid (H) */
|
||||
|
||||
/* We borrow bit 5 to store the exclusive marker in swap PTEs. */
|
||||
#define _PAGE_SWP_EXCLUSIVE _PAGE_DIRTY
|
||||
|
||||
#ifdef CONFIG_ARC_MMU_V4
|
||||
#define _PAGE_HW_SZ (1 << 10) /* Normal/super (H) */
|
||||
#else
|
||||
@ -106,9 +109,18 @@ static inline void set_pte_at(struct mm_struct *mm, unsigned long addr,
|
||||
void update_mmu_cache(struct vm_area_struct *vma, unsigned long address,
|
||||
pte_t *ptep);
|
||||
|
||||
/* Encode swap {type,off} tuple into PTE
|
||||
* We reserve 13 bits for 5-bit @type, keeping bits 12-5 zero, ensuring that
|
||||
* PAGE_PRESENT is zero in a PTE holding swap "identifier"
|
||||
/*
|
||||
* Encode/decode swap entries and swap PTEs. Swap PTEs are all PTEs that
|
||||
* are !pte_none() && !pte_present().
|
||||
*
|
||||
* Format of swap PTEs:
|
||||
*
|
||||
* 3 3 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1
|
||||
* 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0
|
||||
* <-------------- offset -------------> <--- zero --> E < type ->
|
||||
*
|
||||
* E is the exclusive marker that is not stored in swap entries.
|
||||
* The zero'ed bits include _PAGE_PRESENT.
|
||||
*/
|
||||
#define __swp_entry(type, off) ((swp_entry_t) \
|
||||
{ ((type) & 0x1f) | ((off) << 13) })
|
||||
@ -120,6 +132,14 @@ void update_mmu_cache(struct vm_area_struct *vma, unsigned long address,
|
||||
#define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val(pte) })
|
||||
#define __swp_entry_to_pte(x) ((pte_t) { (x).val })
|
||||
|
||||
static inline int pte_swp_exclusive(pte_t pte)
|
||||
{
|
||||
return pte_val(pte) & _PAGE_SWP_EXCLUSIVE;
|
||||
}
|
||||
|
||||
PTE_BIT_FUNC(swp_mkexclusive, |= (_PAGE_SWP_EXCLUSIVE));
|
||||
PTE_BIT_FUNC(swp_clear_exclusive, &= ~(_PAGE_SWP_EXCLUSIVE));
|
||||
|
||||
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
|
||||
#include <asm/hugepage.h>
|
||||
#endif
|
||||
|
@ -386,6 +386,4 @@ static inline unsigned long __virt_to_idmap(unsigned long x)
|
||||
|
||||
#endif
|
||||
|
||||
#include <asm-generic/memory_model.h>
|
||||
|
||||
#endif
|
||||
|
@ -158,6 +158,7 @@ typedef struct page *pgtable_t;
|
||||
|
||||
#ifdef CONFIG_HAVE_ARCH_PFN_VALID
|
||||
extern int pfn_valid(unsigned long);
|
||||
#define pfn_valid pfn_valid
|
||||
#endif
|
||||
|
||||
#include <asm/memory.h>
|
||||
@ -167,5 +168,6 @@ extern int pfn_valid(unsigned long);
|
||||
#define VM_DATA_DEFAULT_FLAGS VM_DATA_FLAGS_TSK_EXEC
|
||||
|
||||
#include <asm-generic/getorder.h>
|
||||
#include <asm-generic/memory_model.h>
|
||||
|
||||
#endif
|
||||
|
@ -126,6 +126,9 @@
|
||||
#define L_PTE_SHARED (_AT(pteval_t, 1) << 10) /* shared(v6), coherent(xsc3) */
|
||||
#define L_PTE_NONE (_AT(pteval_t, 1) << 11)
|
||||
|
||||
/* We borrow bit 7 to store the exclusive marker in swap PTEs. */
|
||||
#define L_PTE_SWP_EXCLUSIVE L_PTE_RDONLY
|
||||
|
||||
/*
|
||||
* These are the memory types, defined to be compatible with
|
||||
* pre-ARMv6 CPUs cacheable and bufferable bits: n/a,n/a,C,B
|
||||
|
@ -76,6 +76,9 @@
|
||||
#define L_PTE_NONE (_AT(pteval_t, 1) << 57) /* PROT_NONE */
|
||||
#define L_PTE_RDONLY (_AT(pteval_t, 1) << 58) /* READ ONLY */
|
||||
|
||||
/* We borrow bit 7 to store the exclusive marker in swap PTEs. */
|
||||
#define L_PTE_SWP_EXCLUSIVE (_AT(pteval_t, 1) << 7)
|
||||
|
||||
#define L_PMD_SECT_VALID (_AT(pmdval_t, 1) << 0)
|
||||
#define L_PMD_SECT_DIRTY (_AT(pmdval_t, 1) << 55)
|
||||
#define L_PMD_SECT_NONE (_AT(pmdval_t, 1) << 57)
|
||||
|
@ -271,27 +271,47 @@ static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
|
||||
}
|
||||
|
||||
/*
|
||||
* Encode and decode a swap entry. Swap entries are stored in the Linux
|
||||
* page tables as follows:
|
||||
* Encode/decode swap entries and swap PTEs. Swap PTEs are all PTEs that
|
||||
* are !pte_none() && !pte_present().
|
||||
*
|
||||
* Format of swap PTEs:
|
||||
*
|
||||
* 3 3 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1
|
||||
* 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0
|
||||
* <--------------- offset ------------------------> < type -> 0 0
|
||||
* <------------------- offset ------------------> E < type -> 0 0
|
||||
*
|
||||
* This gives us up to 31 swap files and 128GB per swap file. Note that
|
||||
* E is the exclusive marker that is not stored in swap entries.
|
||||
*
|
||||
* This gives us up to 31 swap files and 64GB per swap file. Note that
|
||||
* the offset field is always non-zero.
|
||||
*/
|
||||
#define __SWP_TYPE_SHIFT 2
|
||||
#define __SWP_TYPE_BITS 5
|
||||
#define __SWP_TYPE_MASK ((1 << __SWP_TYPE_BITS) - 1)
|
||||
#define __SWP_OFFSET_SHIFT (__SWP_TYPE_BITS + __SWP_TYPE_SHIFT)
|
||||
#define __SWP_OFFSET_SHIFT (__SWP_TYPE_BITS + __SWP_TYPE_SHIFT + 1)
|
||||
|
||||
#define __swp_type(x) (((x).val >> __SWP_TYPE_SHIFT) & __SWP_TYPE_MASK)
|
||||
#define __swp_offset(x) ((x).val >> __SWP_OFFSET_SHIFT)
|
||||
#define __swp_entry(type,offset) ((swp_entry_t) { ((type) << __SWP_TYPE_SHIFT) | ((offset) << __SWP_OFFSET_SHIFT) })
|
||||
#define __swp_entry(type, offset) ((swp_entry_t) { (((type) & __SWP_TYPE_MASK) << __SWP_TYPE_SHIFT) | \
|
||||
((offset) << __SWP_OFFSET_SHIFT) })
|
||||
|
||||
#define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val(pte) })
|
||||
#define __swp_entry_to_pte(swp) __pte((swp).val | PTE_TYPE_FAULT)
|
||||
#define __swp_entry_to_pte(swp) __pte((swp).val)
|
||||
|
||||
static inline int pte_swp_exclusive(pte_t pte)
|
||||
{
|
||||
return pte_isset(pte, L_PTE_SWP_EXCLUSIVE);
|
||||
}
|
||||
|
||||
static inline pte_t pte_swp_mkexclusive(pte_t pte)
|
||||
{
|
||||
return set_pte_bit(pte, __pgprot(L_PTE_SWP_EXCLUSIVE));
|
||||
}
|
||||
|
||||
static inline pte_t pte_swp_clear_exclusive(pte_t pte)
|
||||
{
|
||||
return clear_pte_bit(pte, __pgprot(L_PTE_SWP_EXCLUSIVE));
|
||||
}
|
||||
|
||||
/*
|
||||
* It is an error for the kernel to have more swap files than we can
|
||||
|
@ -315,7 +315,7 @@ static int __init gate_vma_init(void)
|
||||
gate_vma.vm_page_prot = PAGE_READONLY_EXEC;
|
||||
gate_vma.vm_start = 0xffff0000;
|
||||
gate_vma.vm_end = 0xffff0000 + PAGE_SIZE;
|
||||
gate_vma.vm_flags = VM_READ | VM_EXEC | VM_MAYREAD | VM_MAYEXEC;
|
||||
vm_flags_init(&gate_vma, VM_READ | VM_EXEC | VM_MAYREAD | VM_MAYEXEC);
|
||||
return 0;
|
||||
}
|
||||
arch_initcall(gate_vma_init);
|
||||
|
@ -29,9 +29,9 @@ void copy_user_highpage(struct page *to, struct page *from,
|
||||
void copy_highpage(struct page *to, struct page *from);
|
||||
#define __HAVE_ARCH_COPY_HIGHPAGE
|
||||
|
||||
struct page *alloc_zeroed_user_highpage_movable(struct vm_area_struct *vma,
|
||||
struct folio *vma_alloc_zeroed_movable_folio(struct vm_area_struct *vma,
|
||||
unsigned long vaddr);
|
||||
#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE_MOVABLE
|
||||
#define vma_alloc_zeroed_movable_folio vma_alloc_zeroed_movable_folio
|
||||
|
||||
void tag_clear_highpage(struct page *to);
|
||||
#define __HAVE_ARCH_TAG_CLEAR_HIGHPAGE
|
||||
|
@ -421,7 +421,6 @@ static inline pgprot_t mk_pmd_sect_prot(pgprot_t prot)
|
||||
return __pgprot((pgprot_val(prot) & ~PMD_TABLE_BIT) | PMD_TYPE_SECT);
|
||||
}
|
||||
|
||||
#define __HAVE_ARCH_PTE_SWP_EXCLUSIVE
|
||||
static inline pte_t pte_swp_mkexclusive(pte_t pte)
|
||||
{
|
||||
return set_pte_bit(pte, __pgprot(PTE_SWP_EXCLUSIVE));
|
||||
|
@ -138,13 +138,11 @@ int vdso_join_timens(struct task_struct *task, struct time_namespace *ns)
|
||||
mmap_read_lock(mm);
|
||||
|
||||
for_each_vma(vmi, vma) {
|
||||
unsigned long size = vma->vm_end - vma->vm_start;
|
||||
|
||||
if (vma_is_special_mapping(vma, vdso_info[VDSO_ABI_AA64].dm))
|
||||
zap_page_range(vma, vma->vm_start, size);
|
||||
zap_vma_pages(vma);
|
||||
#ifdef CONFIG_COMPAT_VDSO
|
||||
if (vma_is_special_mapping(vma, vdso_info[VDSO_ABI_AA32].dm))
|
||||
zap_page_range(vma, vma->vm_start, size);
|
||||
zap_vma_pages(vma);
|
||||
#endif
|
||||
}
|
||||
|
||||
|
@ -925,7 +925,7 @@ NOKPROBE_SYMBOL(do_debug_exception);
|
||||
/*
|
||||
* Used during anonymous page fault handling.
|
||||
*/
|
||||
struct page *alloc_zeroed_user_highpage_movable(struct vm_area_struct *vma,
|
||||
struct folio *vma_alloc_zeroed_movable_folio(struct vm_area_struct *vma,
|
||||
unsigned long vaddr)
|
||||
{
|
||||
gfp_t flags = GFP_HIGHUSER_MOVABLE | __GFP_ZERO;
|
||||
@ -938,7 +938,7 @@ struct page *alloc_zeroed_user_highpage_movable(struct vm_area_struct *vma,
|
||||
if (vma->vm_flags & VM_MTE)
|
||||
flags |= __GFP_ZEROTAGS;
|
||||
|
||||
return alloc_page_vma(flags, vma, vaddr);
|
||||
return vma_alloc_folio(flags, 0, vma, vaddr, false);
|
||||
}
|
||||
|
||||
void tag_clear_highpage(struct page *page)
|
||||
|
@ -10,6 +10,9 @@
|
||||
#define _PAGE_ACCESSED (1<<3)
|
||||
#define _PAGE_MODIFIED (1<<4)
|
||||
|
||||
/* We borrow bit 9 to store the exclusive marker in swap PTEs. */
|
||||
#define _PAGE_SWP_EXCLUSIVE (1<<9)
|
||||
|
||||
/* implemented in hardware */
|
||||
#define _PAGE_GLOBAL (1<<6)
|
||||
#define _PAGE_VALID (1<<7)
|
||||
@ -26,7 +29,8 @@
|
||||
#define _PAGE_PROT_NONE _PAGE_READ
|
||||
|
||||
/*
|
||||
* Encode and decode a swap entry
|
||||
* Encode/decode swap entries and swap PTEs. Swap PTEs are all PTEs that
|
||||
* are !pte_none() && !pte_present().
|
||||
*
|
||||
* Format of swap PTE:
|
||||
* bit 0: _PAGE_PRESENT (zero)
|
||||
@ -35,15 +39,16 @@
|
||||
* bit 6: _PAGE_GLOBAL (zero)
|
||||
* bit 7: _PAGE_VALID (zero)
|
||||
* bit 8: swap type[4]
|
||||
* bit 9 - 31: swap offset
|
||||
* bit 9: exclusive marker
|
||||
* bit 10 - 31: swap offset
|
||||
*/
|
||||
#define __swp_type(x) ((((x).val >> 2) & 0xf) | \
|
||||
(((x).val >> 4) & 0x10))
|
||||
#define __swp_offset(x) ((x).val >> 9)
|
||||
#define __swp_offset(x) ((x).val >> 10)
|
||||
#define __swp_entry(type, offset) ((swp_entry_t) { \
|
||||
((type & 0xf) << 2) | \
|
||||
((type & 0x10) << 4) | \
|
||||
((offset) << 9)})
|
||||
((offset) << 10)})
|
||||
|
||||
#define HAVE_ARCH_UNMAPPED_AREA
|
||||
|
||||
|
@ -10,6 +10,9 @@
|
||||
#define _PAGE_PRESENT (1<<10)
|
||||
#define _PAGE_MODIFIED (1<<11)
|
||||
|
||||
/* We borrow bit 7 to store the exclusive marker in swap PTEs. */
|
||||
#define _PAGE_SWP_EXCLUSIVE (1<<7)
|
||||
|
||||
/* implemented in hardware */
|
||||
#define _PAGE_GLOBAL (1<<0)
|
||||
#define _PAGE_VALID (1<<1)
|
||||
@ -26,23 +29,25 @@
|
||||
#define _PAGE_PROT_NONE _PAGE_WRITE
|
||||
|
||||
/*
|
||||
* Encode and decode a swap entry
|
||||
* Encode/decode swap entries and swap PTEs. Swap PTEs are all PTEs that
|
||||
* are !pte_none() && !pte_present().
|
||||
*
|
||||
* Format of swap PTE:
|
||||
* bit 0: _PAGE_GLOBAL (zero)
|
||||
* bit 1: _PAGE_VALID (zero)
|
||||
* bit 2 - 6: swap type
|
||||
* bit 7 - 8: swap offset[0 - 1]
|
||||
* bit 7: exclusive marker
|
||||
* bit 8: swap offset[0]
|
||||
* bit 9: _PAGE_WRITE (zero)
|
||||
* bit 10: _PAGE_PRESENT (zero)
|
||||
* bit 11 - 31: swap offset[2 - 22]
|
||||
* bit 11 - 31: swap offset[1 - 21]
|
||||
*/
|
||||
#define __swp_type(x) (((x).val >> 2) & 0x1f)
|
||||
#define __swp_offset(x) ((((x).val >> 7) & 0x3) | \
|
||||
(((x).val >> 9) & 0x7ffffc))
|
||||
#define __swp_offset(x) ((((x).val >> 8) & 0x1) | \
|
||||
(((x).val >> 10) & 0x3ffffe))
|
||||
#define __swp_entry(type, offset) ((swp_entry_t) { \
|
||||
((type & 0x1f) << 2) | \
|
||||
((offset & 0x3) << 7) | \
|
||||
((offset & 0x7ffffc) << 9)})
|
||||
((offset & 0x1) << 8) | \
|
||||
((offset & 0x3ffffe) << 10)})
|
||||
|
||||
#endif /* __ASM_CSKY_PGTABLE_BITS_H */
|
||||
|
@ -39,7 +39,6 @@
|
||||
|
||||
#define virt_addr_valid(kaddr) ((void *)(kaddr) >= (void *)PAGE_OFFSET && \
|
||||
(void *)(kaddr) < high_memory)
|
||||
#define pfn_valid(pfn) ((pfn) >= ARCH_PFN_OFFSET && ((pfn) - ARCH_PFN_OFFSET) < max_mapnr)
|
||||
|
||||
extern void *memset(void *dest, int c, size_t l);
|
||||
extern void *memcpy(void *to, const void *from, size_t l);
|
||||
|
@ -200,6 +200,23 @@ static inline pte_t pte_mkyoung(pte_t pte)
|
||||
return pte;
|
||||
}
|
||||
|
||||
static inline int pte_swp_exclusive(pte_t pte)
|
||||
{
|
||||
return pte_val(pte) & _PAGE_SWP_EXCLUSIVE;
|
||||
}
|
||||
|
||||
static inline pte_t pte_swp_mkexclusive(pte_t pte)
|
||||
{
|
||||
pte_val(pte) |= _PAGE_SWP_EXCLUSIVE;
|
||||
return pte;
|
||||
}
|
||||
|
||||
static inline pte_t pte_swp_clear_exclusive(pte_t pte)
|
||||
{
|
||||
pte_val(pte) &= ~_PAGE_SWP_EXCLUSIVE;
|
||||
return pte;
|
||||
}
|
||||
|
||||
#define __HAVE_PHYS_MEM_ACCESS_PROT
|
||||
struct file;
|
||||
extern pgprot_t phys_mem_access_prot(struct file *file, unsigned long pfn,
|
||||
|
@ -95,7 +95,6 @@ struct page;
|
||||
/* Default vm area behavior is non-executable. */
|
||||
#define VM_DATA_DEFAULT_FLAGS VM_DATA_FLAGS_NON_EXEC
|
||||
|
||||
#define pfn_valid(pfn) ((pfn) < max_mapnr)
|
||||
#define virt_addr_valid(kaddr) pfn_valid(__pa(kaddr) >> PAGE_SHIFT)
|
||||
|
||||
/* Need to not use a define for linesize; may move this to another file. */
|
||||
|
@ -61,6 +61,9 @@ extern unsigned long empty_zero_page;
|
||||
* So we'll put up with a bit of inefficiency for now...
|
||||
*/
|
||||
|
||||
/* We borrow bit 6 to store the exclusive marker in swap PTEs. */
|
||||
#define _PAGE_SWP_EXCLUSIVE (1<<6)
|
||||
|
||||
/*
|
||||
* Top "FOURTH" level (pgd), which for the Hexagon VM is really
|
||||
* only the second from the bottom, pgd and pud both being collapsed.
|
||||
@ -359,9 +362,12 @@ static inline unsigned long pmd_page_vaddr(pmd_t pmd)
|
||||
#define ZERO_PAGE(vaddr) (virt_to_page(&empty_zero_page))
|
||||
|
||||
/*
|
||||
* Encode/decode swap entries and swap PTEs. Swap PTEs are all PTEs that
|
||||
* are !pte_none() && !pte_present().
|
||||
*
|
||||
* Swap/file PTE definitions. If _PAGE_PRESENT is zero, the rest of the PTE is
|
||||
* interpreted as swap information. The remaining free bits are interpreted as
|
||||
* swap type/offset tuple. Rather than have the TLB fill handler test
|
||||
* listed below. Rather than have the TLB fill handler test
|
||||
* _PAGE_PRESENT, we're going to reserve the permissions bits and set them to
|
||||
* all zeros for swap entries, which speeds up the miss handler at the cost of
|
||||
* 3 bits of offset. That trade-off can be revisited if necessary, but Hexagon
|
||||
@ -371,9 +377,10 @@ static inline unsigned long pmd_page_vaddr(pmd_t pmd)
|
||||
* Format of swap PTE:
|
||||
* bit 0: Present (zero)
|
||||
* bits 1-5: swap type (arch independent layer uses 5 bits max)
|
||||
* bits 6-9: bits 3:0 of offset
|
||||
* bit 6: exclusive marker
|
||||
* bits 7-9: bits 2:0 of offset
|
||||
* bits 10-12: effectively _PAGE_PROTNONE (all zero)
|
||||
* bits 13-31: bits 22:4 of swap offset
|
||||
* bits 13-31: bits 21:3 of swap offset
|
||||
*
|
||||
* The split offset makes some of the following macros a little gnarly,
|
||||
* but there's plenty of precedent for this sort of thing.
|
||||
@ -383,11 +390,28 @@ static inline unsigned long pmd_page_vaddr(pmd_t pmd)
|
||||
#define __swp_type(swp_pte) (((swp_pte).val >> 1) & 0x1f)
|
||||
|
||||
#define __swp_offset(swp_pte) \
|
||||
((((swp_pte).val >> 6) & 0xf) | (((swp_pte).val >> 9) & 0x7ffff0))
|
||||
((((swp_pte).val >> 7) & 0x7) | (((swp_pte).val >> 10) & 0x3ffff8))
|
||||
|
||||
#define __swp_entry(type, offset) \
|
||||
((swp_entry_t) { \
|
||||
((type << 1) | \
|
||||
((offset & 0x7ffff0) << 9) | ((offset & 0xf) << 6)) })
|
||||
(((type & 0x1f) << 1) | \
|
||||
((offset & 0x3ffff8) << 10) | ((offset & 0x7) << 7)) })
|
||||
|
||||
static inline int pte_swp_exclusive(pte_t pte)
|
||||
{
|
||||
return pte_val(pte) & _PAGE_SWP_EXCLUSIVE;
|
||||
}
|
||||
|
||||
static inline pte_t pte_swp_mkexclusive(pte_t pte)
|
||||
{
|
||||
pte_val(pte) |= _PAGE_SWP_EXCLUSIVE;
|
||||
return pte;
|
||||
}
|
||||
|
||||
static inline pte_t pte_swp_clear_exclusive(pte_t pte)
|
||||
{
|
||||
pte_val(pte) &= ~_PAGE_SWP_EXCLUSIVE;
|
||||
return pte;
|
||||
}
|
||||
|
||||
#endif
|
||||
|
@ -82,25 +82,19 @@ do { \
|
||||
} while (0)
|
||||
|
||||
|
||||
#define alloc_zeroed_user_highpage_movable(vma, vaddr) \
|
||||
#define vma_alloc_zeroed_movable_folio(vma, vaddr) \
|
||||
({ \
|
||||
struct page *page = alloc_page_vma( \
|
||||
GFP_HIGHUSER_MOVABLE | __GFP_ZERO, vma, vaddr); \
|
||||
if (page) \
|
||||
flush_dcache_page(page); \
|
||||
page; \
|
||||
struct folio *folio = vma_alloc_folio( \
|
||||
GFP_HIGHUSER_MOVABLE | __GFP_ZERO, 0, vma, vaddr, false); \
|
||||
if (folio) \
|
||||
flush_dcache_folio(folio); \
|
||||
folio; \
|
||||
})
|
||||
|
||||
#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE_MOVABLE
|
||||
|
||||
#define virt_addr_valid(kaddr) pfn_valid(__pa(kaddr) >> PAGE_SHIFT)
|
||||
|
||||
#include <asm-generic/memory_model.h>
|
||||
|
||||
#ifdef CONFIG_FLATMEM
|
||||
# define pfn_valid(pfn) ((pfn) < max_mapnr)
|
||||
#endif
|
||||
|
||||
#define page_to_phys(page) (page_to_pfn(page) << PAGE_SHIFT)
|
||||
#define virt_to_page(kaddr) pfn_to_page(__pa(kaddr) >> PAGE_SHIFT)
|
||||
#define pfn_to_kaddr(pfn) __va((pfn) << PAGE_SHIFT)
|
||||
|
@ -58,6 +58,9 @@
|
||||
#define _PAGE_ED (__IA64_UL(1) << 52) /* exception deferral */
|
||||
#define _PAGE_PROTNONE (__IA64_UL(1) << 63)
|
||||
|
||||
/* We borrow bit 7 to store the exclusive marker in swap PTEs. */
|
||||
#define _PAGE_SWP_EXCLUSIVE (1 << 7)
|
||||
|
||||
#define _PFN_MASK _PAGE_PPN_MASK
|
||||
/* Mask of bits which may be changed by pte_modify(); the odd bits are there for _PAGE_PROTNONE */
|
||||
#define _PAGE_CHG_MASK (_PAGE_P | _PAGE_PROTNONE | _PAGE_PL_MASK | _PAGE_AR_MASK | _PAGE_ED)
|
||||
@ -399,6 +402,9 @@ extern pgd_t swapper_pg_dir[PTRS_PER_PGD];
|
||||
extern void paging_init (void);
|
||||
|
||||
/*
|
||||
* Encode/decode swap entries and swap PTEs. Swap PTEs are all PTEs that
|
||||
* are !pte_none() && !pte_present().
|
||||
*
|
||||
* Note: The macros below rely on the fact that MAX_SWAPFILES_SHIFT <= number of
|
||||
* bits in the swap-type field of the swap pte. It would be nice to
|
||||
* enforce that, but we can't easily include <linux/swap.h> here.
|
||||
@ -406,16 +412,35 @@ extern void paging_init (void);
|
||||
*
|
||||
* Format of swap pte:
|
||||
* bit 0 : present bit (must be zero)
|
||||
* bits 1- 7: swap-type
|
||||
* bits 1- 6: swap type
|
||||
* bit 7 : exclusive marker
|
||||
* bits 8-62: swap offset
|
||||
* bit 63 : _PAGE_PROTNONE bit
|
||||
*/
|
||||
#define __swp_type(entry) (((entry).val >> 1) & 0x7f)
|
||||
#define __swp_type(entry) (((entry).val >> 1) & 0x3f)
|
||||
#define __swp_offset(entry) (((entry).val << 1) >> 9)
|
||||
#define __swp_entry(type,offset) ((swp_entry_t) { ((type) << 1) | ((long) (offset) << 8) })
|
||||
#define __swp_entry(type, offset) ((swp_entry_t) { ((type & 0x3f) << 1) | \
|
||||
((long) (offset) << 8) })
|
||||
#define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val(pte) })
|
||||
#define __swp_entry_to_pte(x) ((pte_t) { (x).val })
|
||||
|
||||
static inline int pte_swp_exclusive(pte_t pte)
|
||||
{
|
||||
return pte_val(pte) & _PAGE_SWP_EXCLUSIVE;
|
||||
}
|
||||
|
||||
static inline pte_t pte_swp_mkexclusive(pte_t pte)
|
||||
{
|
||||
pte_val(pte) |= _PAGE_SWP_EXCLUSIVE;
|
||||
return pte;
|
||||
}
|
||||
|
||||
static inline pte_t pte_swp_clear_exclusive(pte_t pte)
|
||||
{
|
||||
pte_val(pte) &= ~_PAGE_SWP_EXCLUSIVE;
|
||||
return pte;
|
||||
}
|
||||
|
||||
/*
|
||||
* ZERO_PAGE is a global shared page that is always zero: used
|
||||
* for zero-mapped memory areas etc..
|
||||
|
@ -109,7 +109,7 @@ ia64_init_addr_space (void)
|
||||
vma_set_anonymous(vma);
|
||||
vma->vm_start = current->thread.rbs_bot & PAGE_MASK;
|
||||
vma->vm_end = vma->vm_start + PAGE_SIZE;
|
||||
vma->vm_flags = VM_DATA_DEFAULT_FLAGS|VM_GROWSUP|VM_ACCOUNT;
|
||||
vm_flags_init(vma, VM_DATA_DEFAULT_FLAGS|VM_GROWSUP|VM_ACCOUNT);
|
||||
vma->vm_page_prot = vm_get_page_prot(vma->vm_flags);
|
||||
mmap_write_lock(current->mm);
|
||||
if (insert_vm_struct(current->mm, vma)) {
|
||||
@ -127,8 +127,8 @@ ia64_init_addr_space (void)
|
||||
vma_set_anonymous(vma);
|
||||
vma->vm_end = PAGE_SIZE;
|
||||
vma->vm_page_prot = __pgprot(pgprot_val(PAGE_READONLY) | _PAGE_MA_NAT);
|
||||
vma->vm_flags = VM_READ | VM_MAYREAD | VM_IO |
|
||||
VM_DONTEXPAND | VM_DONTDUMP;
|
||||
vm_flags_init(vma, VM_READ | VM_MAYREAD | VM_IO |
|
||||
VM_DONTEXPAND | VM_DONTDUMP);
|
||||
mmap_write_lock(current->mm);
|
||||
if (insert_vm_struct(current->mm, vma)) {
|
||||
mmap_write_unlock(current->mm);
|
||||
@ -272,7 +272,7 @@ static int __init gate_vma_init(void)
|
||||
vma_init(&gate_vma, NULL);
|
||||
gate_vma.vm_start = FIXADDR_USER_START;
|
||||
gate_vma.vm_end = FIXADDR_USER_END;
|
||||
gate_vma.vm_flags = VM_READ | VM_MAYREAD | VM_EXEC | VM_MAYEXEC;
|
||||
vm_flags_init(&gate_vma, VM_READ | VM_MAYREAD | VM_EXEC | VM_MAYEXEC);
|
||||
gate_vma.vm_page_prot = __pgprot(__ACCESS_BITS | _PAGE_PL_3 | _PAGE_AR_RX);
|
||||
|
||||
return 0;
|
||||
|
@ -82,19 +82,6 @@ typedef struct { unsigned long pgprot; } pgprot_t;
|
||||
|
||||
#define pfn_to_kaddr(pfn) __va((pfn) << PAGE_SHIFT)
|
||||
|
||||
#ifdef CONFIG_FLATMEM
|
||||
|
||||
static inline int pfn_valid(unsigned long pfn)
|
||||
{
|
||||
/* avoid <linux/mm.h> include hell */
|
||||
extern unsigned long max_mapnr;
|
||||
unsigned long pfn_offset = ARCH_PFN_OFFSET;
|
||||
|
||||
return pfn >= pfn_offset && pfn < max_mapnr;
|
||||
}
|
||||
|
||||
#endif
|
||||
|
||||
#define virt_to_pfn(kaddr) PFN_DOWN(PHYSADDR(kaddr))
|
||||
#define virt_to_page(kaddr) pfn_to_page(virt_to_pfn(kaddr))
|
||||
|
||||
|
@ -20,6 +20,7 @@
|
||||
#define _PAGE_SPECIAL_SHIFT 11
|
||||
#define _PAGE_HGLOBAL_SHIFT 12 /* HGlobal is a PMD bit */
|
||||
#define _PAGE_PFN_SHIFT 12
|
||||
#define _PAGE_SWP_EXCLUSIVE_SHIFT 23
|
||||
#define _PAGE_PFN_END_SHIFT 48
|
||||
#define _PAGE_NO_READ_SHIFT 61
|
||||
#define _PAGE_NO_EXEC_SHIFT 62
|
||||
@ -33,6 +34,9 @@
|
||||
#define _PAGE_PROTNONE (_ULCAST_(1) << _PAGE_PROTNONE_SHIFT)
|
||||
#define _PAGE_SPECIAL (_ULCAST_(1) << _PAGE_SPECIAL_SHIFT)
|
||||
|
||||
/* We borrow bit 23 to store the exclusive marker in swap PTEs. */
|
||||
#define _PAGE_SWP_EXCLUSIVE (_ULCAST_(1) << _PAGE_SWP_EXCLUSIVE_SHIFT)
|
||||
|
||||
/* Used by TLB hardware (placed in EntryLo*) */
|
||||
#define _PAGE_VALID (_ULCAST_(1) << _PAGE_VALID_SHIFT)
|
||||
#define _PAGE_DIRTY (_ULCAST_(1) << _PAGE_DIRTY_SHIFT)
|
||||
|
@ -249,13 +249,26 @@ extern void pud_init(void *addr);
|
||||
extern void pmd_init(void *addr);
|
||||
|
||||
/*
|
||||
* Non-present pages: high 40 bits are offset, next 8 bits type,
|
||||
* low 16 bits zero.
|
||||
* Encode/decode swap entries and swap PTEs. Swap PTEs are all PTEs that
|
||||
* are !pte_none() && !pte_present().
|
||||
*
|
||||
* Format of swap PTEs:
|
||||
*
|
||||
* 6 6 6 6 5 5 5 5 5 5 5 5 5 5 4 4 4 4 4 4 4 4 4 4 3 3 3 3 3 3 3 3
|
||||
* 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2
|
||||
* <--------------------------- offset ---------------------------
|
||||
*
|
||||
* 3 3 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1
|
||||
* 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0
|
||||
* --------------> E <--- type ---> <---------- zeroes ---------->
|
||||
*
|
||||
* E is the exclusive marker that is not stored in swap entries.
|
||||
* The zero'ed bits include _PAGE_PRESENT and _PAGE_PROTNONE.
|
||||
*/
|
||||
static inline pte_t mk_swap_pte(unsigned long type, unsigned long offset)
|
||||
{ pte_t pte; pte_val(pte) = (type << 16) | (offset << 24); return pte; }
|
||||
{ pte_t pte; pte_val(pte) = ((type & 0x7f) << 16) | (offset << 24); return pte; }
|
||||
|
||||
#define __swp_type(x) (((x).val >> 16) & 0xff)
|
||||
#define __swp_type(x) (((x).val >> 16) & 0x7f)
|
||||
#define __swp_offset(x) ((x).val >> 24)
|
||||
#define __swp_entry(type, offset) ((swp_entry_t) { pte_val(mk_swap_pte((type), (offset))) })
|
||||
#define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val(pte) })
|
||||
@ -263,6 +276,23 @@ static inline pte_t mk_swap_pte(unsigned long type, unsigned long offset)
|
||||
#define __pmd_to_swp_entry(pmd) ((swp_entry_t) { pmd_val(pmd) })
|
||||
#define __swp_entry_to_pmd(x) ((pmd_t) { (x).val | _PAGE_HUGE })
|
||||
|
||||
static inline int pte_swp_exclusive(pte_t pte)
|
||||
{
|
||||
return pte_val(pte) & _PAGE_SWP_EXCLUSIVE;
|
||||
}
|
||||
|
||||
static inline pte_t pte_swp_mkexclusive(pte_t pte)
|
||||
{
|
||||
pte_val(pte) |= _PAGE_SWP_EXCLUSIVE;
|
||||
return pte;
|
||||
}
|
||||
|
||||
static inline pte_t pte_swp_clear_exclusive(pte_t pte)
|
||||
{
|
||||
pte_val(pte) &= ~_PAGE_SWP_EXCLUSIVE;
|
||||
return pte;
|
||||
}
|
||||
|
||||
extern void paging_init(void);
|
||||
|
||||
#define pte_none(pte) (!(pte_val(pte) & ~_PAGE_GLOBAL))
|
||||
|
@ -149,7 +149,7 @@ static inline void tlb_flush(struct mmu_gather *tlb)
|
||||
struct vm_area_struct vma;
|
||||
|
||||
vma.vm_mm = tlb->mm;
|
||||
vma.vm_flags = 0;
|
||||
vm_flags_init(&vma, 0);
|
||||
if (tlb->fullmm) {
|
||||
flush_tlb_mm(tlb->mm);
|
||||
return;
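/*
 * Editorial aside (not part of this diff): vm_flags_init() above is one of
 * the helpers added by the "introduce vm_flags modifier functions" series in
 * this merge. The open-coded vm_flags updates replaced throughout the rest of
 * the diff map onto the helpers roughly as sketched below; demo_adjust_flags()
 * is a made-up name used purely for illustration.
 */
static void demo_adjust_flags(struct vm_area_struct *vma)
{
	vm_flags_set(vma, VM_IO | VM_PFNMAP);		/* was: vma->vm_flags |= ...        */
	vm_flags_clear(vma, VM_MAYWRITE);		/* was: vma->vm_flags &= ~...       */
	vm_flags_mod(vma, VM_NOHUGEPAGE, VM_HUGEPAGE);	/* set one flag, clear another      */
	vm_flags_reset(vma, VM_NONE);			/* rewrite flags of an existing VMA */
}
/*
 * Callers that used to pass &vma->vm_flags directly (e.g. to ksm_madvise())
 * now work on a local copy and publish the result with vm_flags_reset(), as
 * the kvmppc and s390 hunks later in this diff show.
 */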
@ -46,6 +46,9 @@
|
||||
#define _CACHEMASK040 (~0x060)
|
||||
#define _PAGE_GLOBAL040 0x400 /* 68040 global bit, used for kva descs */
|
||||
|
||||
/* We borrow bit 24 to store the exclusive marker in swap PTEs. */
|
||||
#define _PAGE_SWP_EXCLUSIVE CF_PAGE_NOCACHE
|
||||
|
||||
/*
|
||||
* Externally used page protection values.
|
||||
*/
|
||||
@ -254,15 +257,41 @@ static inline pte_t pte_mkcache(pte_t pte)
|
||||
extern pgd_t kernel_pg_dir[PTRS_PER_PGD];
|
||||
|
||||
/*
|
||||
* Encode and de-code a swap entry (must be !pte_none(e) && !pte_present(e))
|
||||
* Encode/decode swap entries and swap PTEs. Swap PTEs are all PTEs that
|
||||
* are !pte_none() && !pte_present().
|
||||
*
|
||||
* Format of swap PTEs:
|
||||
*
|
||||
* 3 3 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1
|
||||
* 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0
|
||||
* <------------------ offset -------------> 0 0 0 E <-- type --->
|
||||
*
|
||||
* E is the exclusive marker that is not stored in swap entries.
|
||||
*/
|
||||
#define __swp_type(x) ((x).val & 0xFF)
|
||||
#define __swp_type(x) ((x).val & 0x7f)
|
||||
#define __swp_offset(x) ((x).val >> 11)
|
||||
#define __swp_entry(typ, off) ((swp_entry_t) { (typ) | \
|
||||
#define __swp_entry(typ, off) ((swp_entry_t) { ((typ) & 0x7f) | \
|
||||
(off << 11) })
|
||||
#define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val(pte) })
|
||||
#define __swp_entry_to_pte(x) (__pte((x).val))
|
||||
|
||||
static inline int pte_swp_exclusive(pte_t pte)
|
||||
{
|
||||
return pte_val(pte) & _PAGE_SWP_EXCLUSIVE;
|
||||
}
|
||||
|
||||
static inline pte_t pte_swp_mkexclusive(pte_t pte)
|
||||
{
|
||||
pte_val(pte) |= _PAGE_SWP_EXCLUSIVE;
|
||||
return pte;
|
||||
}
|
||||
|
||||
static inline pte_t pte_swp_clear_exclusive(pte_t pte)
|
||||
{
|
||||
pte_val(pte) &= ~_PAGE_SWP_EXCLUSIVE;
|
||||
return pte;
|
||||
}
|
||||
|
||||
#define pmd_pfn(pmd) (pmd_val(pmd) >> PAGE_SHIFT)
|
||||
#define pmd_page(pmd) (pfn_to_page(pmd_val(pmd) >> PAGE_SHIFT))
|
||||
|
||||
|
@ -41,6 +41,9 @@
|
||||
|
||||
#define _PAGE_PROTNONE 0x004
|
||||
|
||||
/* We borrow bit 11 to store the exclusive marker in swap PTEs. */
|
||||
#define _PAGE_SWP_EXCLUSIVE 0x800
|
||||
|
||||
#ifndef __ASSEMBLY__
|
||||
|
||||
/* This is the cache mode to be used for pages containing page descriptors for
|
||||
@ -124,7 +127,7 @@ static inline void pud_set(pud_t *pudp, pmd_t *pmdp)
|
||||
* expects pmd_page() to exist, only to then DCE it all. Provide a dummy to
|
||||
* make the compiler happy.
|
||||
*/
|
||||
#define pmd_page(pmd) NULL
|
||||
#define pmd_page(pmd) ((struct page *)NULL)
|
||||
|
||||
|
||||
#define pud_none(pud) (!pud_val(pud))
|
||||
@ -169,12 +172,40 @@ static inline pte_t pte_mkcache(pte_t pte)
|
||||
#define swapper_pg_dir kernel_pg_dir
|
||||
extern pgd_t kernel_pg_dir[128];
|
||||
|
||||
/* Encode and de-code a swap entry (must be !pte_none(e) && !pte_present(e)) */
|
||||
#define __swp_type(x) (((x).val >> 4) & 0xff)
|
||||
/*
|
||||
* Encode/decode swap entries and swap PTEs. Swap PTEs are all PTEs that
|
||||
* are !pte_none() && !pte_present().
|
||||
*
|
||||
* Format of swap PTEs:
|
||||
*
|
||||
* 3 3 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1
|
||||
* 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0
|
||||
* <----------------- offset ------------> E <-- type ---> 0 0 0 0
|
||||
*
|
||||
* E is the exclusive marker that is not stored in swap entries.
|
||||
*/
|
||||
#define __swp_type(x) (((x).val >> 4) & 0x7f)
|
||||
#define __swp_offset(x) ((x).val >> 12)
|
||||
#define __swp_entry(type, offset) ((swp_entry_t) { ((type) << 4) | ((offset) << 12) })
|
||||
#define __swp_entry(type, offset) ((swp_entry_t) { (((type) & 0x7f) << 4) | ((offset) << 12) })
|
||||
#define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val(pte) })
|
||||
#define __swp_entry_to_pte(x) ((pte_t) { (x).val })
|
||||
|
||||
static inline int pte_swp_exclusive(pte_t pte)
|
||||
{
|
||||
return pte_val(pte) & _PAGE_SWP_EXCLUSIVE;
|
||||
}
|
||||
|
||||
static inline pte_t pte_swp_mkexclusive(pte_t pte)
|
||||
{
|
||||
pte_val(pte) |= _PAGE_SWP_EXCLUSIVE;
|
||||
return pte;
|
||||
}
|
||||
|
||||
static inline pte_t pte_swp_clear_exclusive(pte_t pte)
|
||||
{
|
||||
pte_val(pte) &= ~_PAGE_SWP_EXCLUSIVE;
|
||||
return pte;
|
||||
}
|
||||
|
||||
#endif /* !__ASSEMBLY__ */
|
||||
#endif /* _MOTOROLA_PGTABLE_H */
|
||||
|
@ -62,11 +62,7 @@ extern unsigned long _ramend;
|
||||
#include <asm/page_no.h>
|
||||
#endif
|
||||
|
||||
#ifndef CONFIG_MMU
|
||||
#define __phys_to_pfn(paddr) ((unsigned long)((paddr) >> PAGE_SHIFT))
|
||||
#define __pfn_to_phys(pfn) PFN_PHYS(pfn)
|
||||
#endif
|
||||
|
||||
#include <asm-generic/getorder.h>
|
||||
#include <asm-generic/memory_model.h>
|
||||
|
||||
#endif /* _M68K_PAGE_H */
|
||||
|
@ -134,7 +134,6 @@ extern int m68k_virt_to_node_shift;
|
||||
})
|
||||
|
||||
#define ARCH_PFN_OFFSET (m68k_memory[0].addr >> PAGE_SHIFT)
|
||||
#include <asm-generic/memory_model.h>
|
||||
|
||||
#define virt_addr_valid(kaddr) ((unsigned long)(kaddr) >= PAGE_OFFSET && (unsigned long)(kaddr) < (unsigned long)high_memory)
|
||||
#define pfn_valid(pfn) virt_addr_valid(pfn_to_virt(pfn))
|
||||
|
@ -13,9 +13,8 @@ extern unsigned long memory_end;
|
||||
#define clear_user_page(page, vaddr, pg) clear_page(page)
|
||||
#define copy_user_page(to, from, vaddr, pg) copy_page(to, from)
|
||||
|
||||
#define alloc_zeroed_user_highpage_movable(vma, vaddr) \
|
||||
alloc_page_vma(GFP_HIGHUSER_MOVABLE | __GFP_ZERO, vma, vaddr)
|
||||
#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE_MOVABLE
|
||||
#define vma_alloc_zeroed_movable_folio(vma, vaddr) \
|
||||
vma_alloc_folio(GFP_HIGHUSER_MOVABLE | __GFP_ZERO, 0, vma, vaddr, false)
|
||||
|
||||
#define __pa(vaddr) ((unsigned long)(vaddr))
|
||||
#define __va(paddr) ((void *)((unsigned long)(paddr)))
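/*
 * Editorial aside (not part of this diff): the macro switch above tracks the
 * generic change from alloc_zeroed_user_highpage_movable() (returns a page)
 * to vma_alloc_zeroed_movable_folio() (returns a folio). A hedged sketch of
 * the caller-side pattern; demo_anon_fault() is a made-up name.
 */
static vm_fault_t demo_anon_fault(struct vm_area_struct *vma, unsigned long addr)
{
	struct folio *folio = vma_alloc_zeroed_movable_folio(vma, addr);

	if (!folio)
		return VM_FAULT_OOM;

	/* ... map folio_page(folio, 0) at addr, folio_put() on failure ... */
	return 0;
}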
@ -26,13 +25,11 @@ extern unsigned long memory_end;
|
||||
#define virt_to_page(addr) (mem_map + (((unsigned long)(addr)-PAGE_OFFSET) >> PAGE_SHIFT))
|
||||
#define page_to_virt(page) __va(((((page) - mem_map) << PAGE_SHIFT) + PAGE_OFFSET))
|
||||
|
||||
#define pfn_to_page(pfn) virt_to_page(pfn_to_virt(pfn))
|
||||
#define page_to_pfn(page) virt_to_pfn(page_to_virt(page))
|
||||
#define pfn_valid(pfn) ((pfn) < max_mapnr)
|
||||
|
||||
#define virt_addr_valid(kaddr) (((unsigned long)(kaddr) >= PAGE_OFFSET) && \
|
||||
((unsigned long)(kaddr) < memory_end))
|
||||
|
||||
#define ARCH_PFN_OFFSET PHYS_PFN(PAGE_OFFSET_RAW)
|
||||
|
||||
#endif /* __ASSEMBLY__ */
|
||||
|
||||
#endif /* _M68K_PAGE_NO_H */
|
||||
|
@ -31,12 +31,6 @@
|
||||
extern void paging_init(void);
|
||||
#define swapper_pg_dir ((pgd_t *) 0)
|
||||
|
||||
#define __swp_type(x) (0)
|
||||
#define __swp_offset(x) (0)
|
||||
#define __swp_entry(typ,off) ((swp_entry_t) { ((typ) | ((off) << 7)) })
|
||||
#define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val(pte) })
|
||||
#define __swp_entry_to_pte(x) ((pte_t) { (x).val })
|
||||
|
||||
/*
|
||||
* ZERO_PAGE is a global shared page that is always zero: used
|
||||
* for zero-mapped memory areas etc..
|
||||
|
@ -71,6 +71,9 @@
|
||||
#define SUN3_PMD_MASK (0x0000003F)
|
||||
#define SUN3_PMD_MAGIC (0x0000002B)
|
||||
|
||||
/* We borrow bit 6 to store the exclusive marker in swap PTEs. */
|
||||
#define _PAGE_SWP_EXCLUSIVE 0x040
|
||||
|
||||
#ifndef __ASSEMBLY__
|
||||
|
||||
/*
|
||||
@ -152,12 +155,41 @@ static inline pte_t pte_mkcache(pte_t pte) { return pte; }
|
||||
extern pgd_t swapper_pg_dir[PTRS_PER_PGD];
|
||||
extern pgd_t kernel_pg_dir[PTRS_PER_PGD];
|
||||
|
||||
/* Macros to (de)construct the fake PTEs representing swap pages. */
|
||||
#define __swp_type(x) ((x).val & 0x7F)
|
||||
/*
|
||||
* Encode/decode swap entries and swap PTEs. Swap PTEs are all PTEs that
|
||||
* are !pte_none() && !pte_present().
|
||||
*
|
||||
* Format of swap PTEs:
|
||||
*
|
||||
* 3 3 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1
|
||||
* 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0
|
||||
* 0 <--------------------- offset ----------------> E <- type -->
|
||||
*
|
||||
* E is the exclusive marker that is not stored in swap entries.
|
||||
*/
|
||||
#define __swp_type(x) ((x).val & 0x3f)
|
||||
#define __swp_offset(x) (((x).val) >> 7)
|
||||
#define __swp_entry(type,offset) ((swp_entry_t) { ((type) | ((offset) << 7)) })
|
||||
#define __swp_entry(type, offset) ((swp_entry_t) { (((type) & 0x3f) | \
|
||||
(((offset) << 7) & ~SUN3_PAGE_VALID)) })
|
||||
#define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val(pte) })
|
||||
#define __swp_entry_to_pte(x) ((pte_t) { (x).val })
|
||||
|
||||
static inline int pte_swp_exclusive(pte_t pte)
|
||||
{
|
||||
return pte_val(pte) & _PAGE_SWP_EXCLUSIVE;
|
||||
}
|
||||
|
||||
static inline pte_t pte_swp_mkexclusive(pte_t pte)
|
||||
{
|
||||
pte_val(pte) |= _PAGE_SWP_EXCLUSIVE;
|
||||
return pte;
|
||||
}
|
||||
|
||||
static inline pte_t pte_swp_clear_exclusive(pte_t pte)
|
||||
{
|
||||
pte_val(pte) &= ~_PAGE_SWP_EXCLUSIVE;
|
||||
return pte;
|
||||
}
|
||||
|
||||
#endif /* !__ASSEMBLY__ */
|
||||
#endif /* !_SUN3_PGTABLE_H */
|
||||
|
@ -112,7 +112,6 @@ extern int page_is_ram(unsigned long pfn);
|
||||
# define page_to_phys(page) (page_to_pfn(page) << PAGE_SHIFT)
|
||||
|
||||
# define ARCH_PFN_OFFSET (memory_start >> PAGE_SHIFT)
|
||||
# define pfn_valid(pfn) ((pfn) >= ARCH_PFN_OFFSET && (pfn) < (max_mapnr + ARCH_PFN_OFFSET))
|
||||
# endif /* __ASSEMBLY__ */
|
||||
|
||||
#define virt_addr_valid(vaddr) (pfn_valid(virt_to_pfn(vaddr)))
|
||||
|
@ -131,10 +131,10 @@ extern pte_t *va_to_pte(unsigned long address);
|
||||
* of the 16 available. Bit 24-26 of the TLB are cleared in the TLB
|
||||
* miss handler. Bit 27 is PAGE_USER, thus selecting the correct
|
||||
* zone.
|
||||
* - PRESENT *must* be in the bottom two bits because swap cache
|
||||
* entries use the top 30 bits. Because 4xx doesn't support SMP
|
||||
* anyway, M is irrelevant so we borrow it for PAGE_PRESENT. Bit 30
|
||||
* is cleared in the TLB miss handler before the TLB entry is loaded.
|
||||
* - PRESENT *must* be in the bottom two bits because swap PTEs use the top
|
||||
* 30 bits. Because 4xx doesn't support SMP anyway, M is irrelevant so we
|
||||
* borrow it for PAGE_PRESENT. Bit 30 is cleared in the TLB miss handler
|
||||
* before the TLB entry is loaded.
|
||||
* - All other bits of the PTE are loaded into TLBLO without
|
||||
* modification, leaving us only the bits 20, 21, 24, 25, 26, 30 for
|
||||
* software PTE bits. We actually use bits 21, 24, 25, and
|
||||
@ -155,6 +155,9 @@ extern pte_t *va_to_pte(unsigned long address);
|
||||
#define _PAGE_ACCESSED 0x400 /* software: R: page referenced */
|
||||
#define _PMD_PRESENT PAGE_MASK
|
||||
|
||||
/* We borrow bit 24 to store the exclusive marker in swap PTEs. */
|
||||
#define _PAGE_SWP_EXCLUSIVE _PAGE_DIRTY
|
||||
|
||||
/*
|
||||
* Some bits are unused...
|
||||
*/
|
||||
@ -393,18 +396,39 @@ static inline unsigned long pmd_page_vaddr(pmd_t pmd)
|
||||
extern pgd_t swapper_pg_dir[PTRS_PER_PGD];
|
||||
|
||||
/*
|
||||
* Encode and decode a swap entry.
|
||||
* Note that the bits we use in a PTE for representing a swap entry
|
||||
* must not include the _PAGE_PRESENT bit, or the _PAGE_HASHPTE bit
|
||||
* (if used). -- paulus
|
||||
* Encode/decode swap entries and swap PTEs. Swap PTEs are all PTEs that
|
||||
* are !pte_none() && !pte_present().
|
||||
*
|
||||
* 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3
|
||||
* 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
|
||||
* <------------------ offset -------------------> E < type -> 0 0
|
||||
*
|
||||
* E is the exclusive marker that is not stored in swap entries.
|
||||
*/
|
||||
#define __swp_type(entry) ((entry).val & 0x3f)
|
||||
#define __swp_type(entry) ((entry).val & 0x1f)
|
||||
#define __swp_offset(entry) ((entry).val >> 6)
|
||||
#define __swp_entry(type, offset) \
|
||||
((swp_entry_t) { (type) | ((offset) << 6) })
|
||||
((swp_entry_t) { ((type) & 0x1f) | ((offset) << 6) })
|
||||
#define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val(pte) >> 2 })
|
||||
#define __swp_entry_to_pte(x) ((pte_t) { (x).val << 2 })
|
||||
|
||||
static inline int pte_swp_exclusive(pte_t pte)
|
||||
{
|
||||
return pte_val(pte) & _PAGE_SWP_EXCLUSIVE;
|
||||
}
|
||||
|
||||
static inline pte_t pte_swp_mkexclusive(pte_t pte)
|
||||
{
|
||||
pte_val(pte) |= _PAGE_SWP_EXCLUSIVE;
|
||||
return pte;
|
||||
}
|
||||
|
||||
static inline pte_t pte_swp_clear_exclusive(pte_t pte)
|
||||
{
|
||||
pte_val(pte) &= ~_PAGE_SWP_EXCLUSIVE;
|
||||
return pte;
|
||||
}
|
||||
|
||||
extern unsigned long iopa(unsigned long addr);
|
||||
|
||||
/* Values for nocacheflag and cmode */
|
||||
|
@ -224,34 +224,6 @@ extern phys_addr_t __phys_addr_symbol(unsigned long x);
|
||||
|
||||
#define pfn_to_kaddr(pfn) __va((pfn) << PAGE_SHIFT)
|
||||
|
||||
#ifdef CONFIG_FLATMEM
|
||||
|
||||
static inline int pfn_valid(unsigned long pfn)
|
||||
{
|
||||
/* avoid <linux/mm.h> include hell */
|
||||
extern unsigned long max_mapnr;
|
||||
unsigned long pfn_offset = ARCH_PFN_OFFSET;
|
||||
|
||||
return pfn >= pfn_offset && pfn < max_mapnr;
|
||||
}
|
||||
|
||||
#elif defined(CONFIG_SPARSEMEM)
|
||||
|
||||
/* pfn_valid is defined in linux/mmzone.h */
|
||||
|
||||
#elif defined(CONFIG_NUMA)
|
||||
|
||||
#define pfn_valid(pfn) \
|
||||
({ \
|
||||
unsigned long __pfn = (pfn); \
|
||||
int __n = pfn_to_nid(__pfn); \
|
||||
((__n >= 0) ? (__pfn < NODE_DATA(__n)->node_start_pfn + \
|
||||
NODE_DATA(__n)->node_spanned_pages) \
|
||||
: 0); \
|
||||
})
|
||||
|
||||
#endif
|
||||
|
||||
#define virt_to_pfn(kaddr) PFN_DOWN(virt_to_phys((void *)(kaddr)))
|
||||
#define virt_to_page(kaddr) pfn_to_page(virt_to_pfn(kaddr))
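/*
 * Editorial aside (not part of this diff): the open-coded pfn_valid() removed
 * above is superseded by the generic FLATMEM implementation added by the
 * "mm, arch: add generic implementation of pfn_valid() for FLATMEM" series.
 * Paraphrased from that series (the exact upstream version may differ
 * slightly), the generic helper looks roughly like:
 */
static inline int pfn_valid(unsigned long pfn)
{
	/* avoid <linux/mm.h> include hell */
	extern unsigned long max_mapnr;
	unsigned long pfn_offset = ARCH_PFN_OFFSET;

	return pfn >= pfn_offset && (pfn - pfn_offset) < max_mapnr;
}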
@ -191,49 +191,113 @@ static inline pte_t pfn_pte(unsigned long pfn, pgprot_t prot)
|
||||
|
||||
#define pte_page(x) pfn_to_page(pte_pfn(x))
|
||||
|
||||
/*
|
||||
* Encode/decode swap entries and swap PTEs. Swap PTEs are all PTEs that
|
||||
* are !pte_none() && !pte_present().
|
||||
*/
|
||||
#if defined(CONFIG_CPU_R3K_TLB)
|
||||
|
||||
/* Swap entries must have VALID bit cleared. */
|
||||
/*
|
||||
* Format of swap PTEs:
|
||||
*
|
||||
* 3 3 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1
|
||||
* 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0
|
||||
* <----------- offset ------------> < type -> V G E 0 0 0 0 0 0 P
|
||||
*
|
||||
* E is the exclusive marker that is not stored in swap entries.
|
||||
* _PAGE_PRESENT (P), _PAGE_VALID (V) and _PAGE_GLOBAL (G) have to remain
|
||||
* unused.
|
||||
*/
|
||||
#define __swp_type(x) (((x).val >> 10) & 0x1f)
|
||||
#define __swp_offset(x) ((x).val >> 15)
|
||||
#define __swp_entry(type,offset) ((swp_entry_t) { ((type) << 10) | ((offset) << 15) })
|
||||
#define __swp_entry(type, offset) ((swp_entry_t) { (((type) & 0x1f) << 10) | ((offset) << 15) })
|
||||
#define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val(pte) })
|
||||
#define __swp_entry_to_pte(x) ((pte_t) { (x).val })
|
||||
|
||||
/* We borrow bit 7 to store the exclusive marker in swap PTEs. */
|
||||
#define _PAGE_SWP_EXCLUSIVE (1 << 7)
|
||||
|
||||
#else
|
||||
|
||||
#if defined(CONFIG_XPA)
|
||||
|
||||
/* Swap entries must have VALID and GLOBAL bits cleared. */
|
||||
/*
|
||||
* Format of swap PTEs:
|
||||
*
|
||||
* 6 6 6 6 5 5 5 5 5 5 5 5 5 5 4 4 4 4 4 4 4 4 4 4 3 3 3 3 3 3 3 3
|
||||
* 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2
|
||||
* 0 0 0 0 0 0 E P <------------------ zeroes ------------------->
|
||||
*
|
||||
* 3 3 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1
|
||||
* 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0
|
||||
* <----------------- offset ------------------> < type -> V G 0 0
|
||||
*
|
||||
* E is the exclusive marker that is not stored in swap entries.
|
||||
* _PAGE_PRESENT (P), _PAGE_VALID (V) and _PAGE_GLOBAL (G) have to remain
|
||||
* unused.
|
||||
*/
|
||||
#define __swp_type(x) (((x).val >> 4) & 0x1f)
|
||||
#define __swp_offset(x) ((x).val >> 9)
|
||||
#define __swp_entry(type,offset) ((swp_entry_t) { ((type) << 4) | ((offset) << 9) })
|
||||
#define __swp_entry(type, offset) ((swp_entry_t) { (((type) & 0x1f) << 4) | ((offset) << 9) })
|
||||
#define __pte_to_swp_entry(pte) ((swp_entry_t) { (pte).pte_high })
|
||||
#define __swp_entry_to_pte(x) ((pte_t) { 0, (x).val })
|
||||
|
||||
/*
|
||||
* We borrow bit 57 (bit 25 in the low PTE) to store the exclusive marker in
|
||||
* swap PTEs.
|
||||
*/
|
||||
#define _PAGE_SWP_EXCLUSIVE (1 << 25)
|
||||
|
||||
#elif defined(CONFIG_PHYS_ADDR_T_64BIT) && defined(CONFIG_CPU_MIPS32)
|
||||
|
||||
/* Swap entries must have VALID and GLOBAL bits cleared. */
|
||||
/*
|
||||
* Format of swap PTEs:
|
||||
*
|
||||
* 6 6 6 6 5 5 5 5 5 5 5 5 5 5 4 4 4 4 4 4 4 4 4 4 3 3 3 3 3 3 3 3
|
||||
* 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2
|
||||
* <------------------ zeroes -------------------> E P 0 0 0 0 0 0
|
||||
*
|
||||
* 3 3 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1
|
||||
* 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0
|
||||
* <------------------- offset --------------------> < type -> V G
|
||||
*
|
||||
* E is the exclusive marker that is not stored in swap entries.
|
||||
* _PAGE_PRESENT (P), _PAGE_VALID (V) and _PAGE_GLOBAL (G) have to remain
|
||||
* unused.
|
||||
*/
|
||||
#define __swp_type(x) (((x).val >> 2) & 0x1f)
|
||||
#define __swp_offset(x) ((x).val >> 7)
|
||||
#define __swp_entry(type, offset) ((swp_entry_t) { ((type) << 2) | ((offset) << 7) })
|
||||
#define __swp_entry(type, offset) ((swp_entry_t) { (((type) & 0x1f) << 2) | ((offset) << 7) })
|
||||
#define __pte_to_swp_entry(pte) ((swp_entry_t) { (pte).pte_high })
|
||||
#define __swp_entry_to_pte(x) ((pte_t) { 0, (x).val })
|
||||
|
||||
/*
|
||||
* We borrow bit 39 (bit 7 in the low PTE) to store the exclusive marker in swap
|
||||
* PTEs.
|
||||
*/
|
||||
#define _PAGE_SWP_EXCLUSIVE (1 << 7)
|
||||
|
||||
#else
|
||||
/*
|
||||
* Constraints:
|
||||
* _PAGE_PRESENT at bit 0
|
||||
* _PAGE_MODIFIED at bit 4
|
||||
* _PAGE_GLOBAL at bit 6
|
||||
* _PAGE_VALID at bit 7
|
||||
* Format of swap PTEs:
|
||||
*
|
||||
* 3 3 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1
|
||||
* 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0
|
||||
* <------------- offset --------------> < type -> 0 0 0 0 0 0 E P
|
||||
*
|
||||
* E is the exclusive marker that is not stored in swap entries.
|
||||
* _PAGE_PRESENT (P), _PAGE_VALID (V) and _PAGE_GLOBAL (G) have to remain
|
||||
* unused. The location of V and G varies.
|
||||
*/
|
||||
#define __swp_type(x) (((x).val >> 8) & 0x1f)
|
||||
#define __swp_offset(x) ((x).val >> 13)
|
||||
#define __swp_entry(type,offset) ((swp_entry_t) { ((type) << 8) | ((offset) << 13) })
|
||||
#define __swp_entry(type, offset) ((swp_entry_t) { ((type) << 8) | ((offset) << 13) })
|
||||
#define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val(pte) })
|
||||
#define __swp_entry_to_pte(x) ((pte_t) { (x).val })
|
||||
|
||||
/* We borrow bit 1 to store the exclusive marker in swap PTEs. */
|
||||
#define _PAGE_SWP_EXCLUSIVE (1 << 1)
|
||||
|
||||
#endif /* defined(CONFIG_PHYS_ADDR_T_64BIT) && defined(CONFIG_CPU_MIPS32) */
|
||||
|
||||
#endif /* defined(CONFIG_CPU_R3K_TLB) */
|
||||
|
@ -320,16 +320,31 @@ extern void pud_init(void *addr);
|
||||
extern void pmd_init(void *addr);
|
||||
|
||||
/*
|
||||
* Non-present pages: high 40 bits are offset, next 8 bits type,
|
||||
* low 16 bits zero.
|
||||
* Encode/decode swap entries and swap PTEs. Swap PTEs are all PTEs that
|
||||
* are !pte_none() && !pte_present().
|
||||
*
|
||||
* Format of swap PTEs:
|
||||
*
|
||||
* 6 6 6 6 5 5 5 5 5 5 5 5 5 5 4 4 4 4 4 4 4 4 4 4 3 3 3 3 3 3 3 3
|
||||
* 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2
|
||||
* <--------------------------- offset ---------------------------
|
||||
*
|
||||
* 3 3 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1
|
||||
* 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0
|
||||
* --------------> E <-- type ---> <---------- zeroes ----------->
|
||||
*
|
||||
* E is the exclusive marker that is not stored in swap entries.
|
||||
*/
|
||||
static inline pte_t mk_swap_pte(unsigned long type, unsigned long offset)
|
||||
{ pte_t pte; pte_val(pte) = (type << 16) | (offset << 24); return pte; }
|
||||
{ pte_t pte; pte_val(pte) = ((type & 0x7f) << 16) | (offset << 24); return pte; }
|
||||
|
||||
#define __swp_type(x) (((x).val >> 16) & 0xff)
|
||||
#define __swp_type(x) (((x).val >> 16) & 0x7f)
|
||||
#define __swp_offset(x) ((x).val >> 24)
|
||||
#define __swp_entry(type, offset) ((swp_entry_t) { pte_val(mk_swap_pte((type), (offset))) })
|
||||
#define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val(pte) })
|
||||
#define __swp_entry_to_pte(x) ((pte_t) { (x).val })
|
||||
|
||||
/* We borrow bit 23 to store the exclusive marker in swap PTEs. */
|
||||
#define _PAGE_SWP_EXCLUSIVE (1 << 23)
|
||||
|
||||
#endif /* _ASM_PGTABLE_64_H */
|
||||
|
@ -528,6 +528,41 @@ static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
|
||||
}
|
||||
#endif
|
||||
|
||||
#if defined(CONFIG_PHYS_ADDR_T_64BIT) && defined(CONFIG_CPU_MIPS32)
|
||||
static inline int pte_swp_exclusive(pte_t pte)
|
||||
{
|
||||
return pte.pte_low & _PAGE_SWP_EXCLUSIVE;
|
||||
}
|
||||
|
||||
static inline pte_t pte_swp_mkexclusive(pte_t pte)
|
||||
{
|
||||
pte.pte_low |= _PAGE_SWP_EXCLUSIVE;
|
||||
return pte;
|
||||
}
|
||||
|
||||
static inline pte_t pte_swp_clear_exclusive(pte_t pte)
|
||||
{
|
||||
pte.pte_low &= ~_PAGE_SWP_EXCLUSIVE;
|
||||
return pte;
|
||||
}
|
||||
#else
|
||||
static inline int pte_swp_exclusive(pte_t pte)
|
||||
{
|
||||
return pte_val(pte) & _PAGE_SWP_EXCLUSIVE;
|
||||
}
|
||||
|
||||
static inline pte_t pte_swp_mkexclusive(pte_t pte)
|
||||
{
|
||||
pte_val(pte) |= _PAGE_SWP_EXCLUSIVE;
|
||||
return pte;
|
||||
}
|
||||
|
||||
static inline pte_t pte_swp_clear_exclusive(pte_t pte)
|
||||
{
|
||||
pte_val(pte) &= ~_PAGE_SWP_EXCLUSIVE;
|
||||
return pte;
|
||||
}
|
||||
#endif
|
||||
|
||||
extern void __update_tlb(struct vm_area_struct *vma, unsigned long address,
|
||||
pte_t pte);
|
||||
|
@ -86,15 +86,6 @@ extern struct page *mem_map;
|
||||
|
||||
# define pfn_to_kaddr(pfn) __va((pfn) << PAGE_SHIFT)
|
||||
|
||||
static inline bool pfn_valid(unsigned long pfn)
|
||||
{
|
||||
/* avoid <linux/mm.h> include hell */
|
||||
extern unsigned long max_mapnr;
|
||||
unsigned long pfn_offset = ARCH_PFN_OFFSET;
|
||||
|
||||
return pfn >= pfn_offset && pfn < max_mapnr;
|
||||
}
|
||||
|
||||
# define virt_to_page(vaddr) pfn_to_page(PFN_DOWN(virt_to_phys(vaddr)))
|
||||
# define virt_addr_valid(vaddr) pfn_valid(PFN_DOWN(virt_to_phys(vaddr)))
|
||||
|
||||
|
@ -31,4 +31,7 @@
|
||||
#define _PAGE_ACCESSED (1<<26) /* page referenced */
|
||||
#define _PAGE_DIRTY (1<<27) /* dirty page */
|
||||
|
||||
/* We borrow bit 31 to store the exclusive marker in swap PTEs. */
|
||||
#define _PAGE_SWP_EXCLUSIVE (1<<31)
|
||||
|
||||
#endif /* _ASM_NIOS2_PGTABLE_BITS_H */
|
||||
|
@ -232,23 +232,44 @@ static inline unsigned long pmd_page_vaddr(pmd_t pmd)
|
||||
__FILE__, __LINE__, pgd_val(e))
|
||||
|
||||
/*
|
||||
* Encode and decode a swap entry (must be !pte_none(pte) && !pte_present(pte):
|
||||
* Encode/decode swap entries and swap PTEs. Swap PTEs are all PTEs that
|
||||
* are !pte_none() && !pte_present().
|
||||
*
|
||||
* 31 30 29 28 27 26 25 24 23 22 21 20 19 18 ... 1 0
|
||||
* 0 0 0 0 type. 0 0 0 0 0 0 offset.........
|
||||
* Format of swap PTEs:
|
||||
*
|
||||
* This gives us up to 2**2 = 4 swap files and 2**20 * 4K = 4G per swap file.
|
||||
* 3 3 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1
|
||||
* 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0
|
||||
* E < type -> 0 0 0 0 0 0 <-------------- offset --------------->
|
||||
*
|
||||
* Note that the offset field is always non-zero, thus !pte_none(pte) is always
|
||||
* true.
|
||||
* E is the exclusive marker that is not stored in swap entries.
|
||||
*
|
||||
* Note that the offset field is always non-zero if the swap type is 0, thus
|
||||
* !pte_none() is always true.
|
||||
*/
|
||||
#define __swp_type(swp) (((swp).val >> 26) & 0x3)
|
||||
#define __swp_type(swp) (((swp).val >> 26) & 0x1f)
|
||||
#define __swp_offset(swp) ((swp).val & 0xfffff)
|
||||
#define __swp_entry(type, off) ((swp_entry_t) { (((type) & 0x3) << 26) \
|
||||
#define __swp_entry(type, off) ((swp_entry_t) { (((type) & 0x1f) << 26) \
|
||||
| ((off) & 0xfffff) })
|
||||
#define __swp_entry_to_pte(swp) ((pte_t) { (swp).val })
|
||||
#define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val(pte) })
|
||||
|
||||
static inline int pte_swp_exclusive(pte_t pte)
|
||||
{
|
||||
return pte_val(pte) & _PAGE_SWP_EXCLUSIVE;
|
||||
}
|
||||
|
||||
static inline pte_t pte_swp_mkexclusive(pte_t pte)
|
||||
{
|
||||
pte_val(pte) |= _PAGE_SWP_EXCLUSIVE;
|
||||
return pte;
|
||||
}
|
||||
|
||||
static inline pte_t pte_swp_clear_exclusive(pte_t pte)
|
||||
{
|
||||
pte_val(pte) &= ~_PAGE_SWP_EXCLUSIVE;
|
||||
return pte;
|
||||
}
|
||||
|
||||
extern void __init paging_init(void);
|
||||
extern void __init mmu_init(void);
|
||||
|
||||
|
@ -80,8 +80,6 @@ typedef struct page *pgtable_t;
|
||||
|
||||
#define page_to_phys(page) ((dma_addr_t)page_to_pfn(page) << PAGE_SHIFT)
|
||||
|
||||
#define pfn_valid(pfn) ((pfn) < max_mapnr)
|
||||
|
||||
#define virt_addr_valid(kaddr) (pfn_valid(virt_to_pfn(kaddr)))
|
||||
|
||||
#endif /* __ASSEMBLY__ */
|
||||
|
@ -154,6 +154,9 @@ extern void paging_init(void);
|
||||
#define _KERNPG_TABLE \
|
||||
(_PAGE_BASE | _PAGE_SRE | _PAGE_SWE | _PAGE_ACCESSED | _PAGE_DIRTY)
|
||||
|
||||
/* We borrow bit 11 to store the exclusive marker in swap PTEs. */
|
||||
#define _PAGE_SWP_EXCLUSIVE _PAGE_U_SHARED
|
||||
|
||||
#define PAGE_NONE __pgprot(_PAGE_ALL)
|
||||
#define PAGE_READONLY __pgprot(_PAGE_ALL | _PAGE_URE | _PAGE_SRE)
|
||||
#define PAGE_READONLY_X __pgprot(_PAGE_ALL | _PAGE_URE | _PAGE_SRE | _PAGE_EXEC)
|
||||
@ -385,16 +388,43 @@ static inline void update_mmu_cache(struct vm_area_struct *vma,
|
||||
|
||||
/* __PHX__ FIXME, SWAP, this probably doesn't work */
|
||||
|
||||
/* Encode and de-code a swap entry (must be !pte_none(e) && !pte_present(e)) */
|
||||
/* Since the PAGE_PRESENT bit is bit 4, we can use the bits above */
|
||||
|
||||
#define __swp_type(x) (((x).val >> 5) & 0x7f)
|
||||
/*
|
||||
* Encode/decode swap entries and swap PTEs. Swap PTEs are all PTEs that
|
||||
* are !pte_none() && !pte_present().
|
||||
*
|
||||
* Format of swap PTEs:
|
||||
*
|
||||
* 3 3 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1
|
||||
* 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0
|
||||
* <-------------- offset ---------------> E <- type --> 0 0 0 0 0
|
||||
*
|
||||
* E is the exclusive marker that is not stored in swap entries.
|
||||
* The zero'ed bits include _PAGE_PRESENT.
|
||||
*/
|
||||
#define __swp_type(x) (((x).val >> 5) & 0x3f)
|
||||
#define __swp_offset(x) ((x).val >> 12)
|
||||
#define __swp_entry(type, offset) \
|
||||
((swp_entry_t) { ((type) << 5) | ((offset) << 12) })
|
||||
((swp_entry_t) { (((type) & 0x3f) << 5) | ((offset) << 12) })
|
||||
#define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val(pte) })
|
||||
#define __swp_entry_to_pte(x) ((pte_t) { (x).val })
|
||||
|
||||
static inline int pte_swp_exclusive(pte_t pte)
|
||||
{
|
||||
return pte_val(pte) & _PAGE_SWP_EXCLUSIVE;
|
||||
}
|
||||
|
||||
static inline pte_t pte_swp_mkexclusive(pte_t pte)
|
||||
{
|
||||
pte_val(pte) |= _PAGE_SWP_EXCLUSIVE;
|
||||
return pte;
|
||||
}
|
||||
|
||||
static inline pte_t pte_swp_clear_exclusive(pte_t pte)
|
||||
{
|
||||
pte_val(pte) &= ~_PAGE_SWP_EXCLUSIVE;
|
||||
return pte;
|
||||
}
|
||||
|
||||
typedef pte_t *pte_addr_t;
|
||||
|
||||
#endif /* __ASSEMBLY__ */
|
||||
|
@ -155,10 +155,6 @@ extern int npmem_ranges;
|
||||
#define __pa(x) ((unsigned long)(x)-PAGE_OFFSET)
|
||||
#define __va(x) ((void *)((unsigned long)(x)+PAGE_OFFSET))
|
||||
|
||||
#ifndef CONFIG_SPARSEMEM
|
||||
#define pfn_valid(pfn) ((pfn) < max_mapnr)
|
||||
#endif
|
||||
|
||||
#ifdef CONFIG_HUGETLB_PAGE
|
||||
#define HPAGE_SHIFT PMD_SHIFT /* fixed for transparent huge pages */
|
||||
#define HPAGE_SIZE ((1UL) << HPAGE_SHIFT)
|
||||
|
@ -218,6 +218,9 @@ extern void __update_cache(pte_t pte);
|
||||
#define _PAGE_KERNEL_RWX (_PAGE_KERNEL_EXEC | _PAGE_WRITE)
|
||||
#define _PAGE_KERNEL (_PAGE_KERNEL_RO | _PAGE_WRITE)
|
||||
|
||||
/* We borrow bit 23 to store the exclusive marker in swap PTEs. */
|
||||
#define _PAGE_SWP_EXCLUSIVE _PAGE_ACCESSED
|
||||
|
||||
/* The pgd/pmd contains a ptr (in phys addr space); since all pgds/pmds
|
||||
* are page-aligned, we don't care about the PAGE_OFFSET bits, except
|
||||
* for a few meta-information bits, so we shift the address to be
|
||||
@ -394,17 +397,48 @@ extern void paging_init (void);
|
||||
|
||||
#define update_mmu_cache(vms,addr,ptep) __update_cache(*ptep)
|
||||
|
||||
/* Encode and de-code a swap entry */
|
||||
|
||||
/*
|
||||
* Encode/decode swap entries and swap PTEs. Swap PTEs are all PTEs that
|
||||
* are !pte_none() && !pte_present().
|
||||
*
|
||||
* Format of swap PTEs (32bit):
|
||||
*
|
||||
* 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3
|
||||
* 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
|
||||
* <---------------- offset -----------------> P E <ofs> < type ->
|
||||
*
|
||||
* E is the exclusive marker that is not stored in swap entries.
|
||||
* _PAGE_PRESENT (P) must be 0.
|
||||
*
|
||||
* For the 64bit version, the offset is extended by 32bit.
|
||||
*/
|
||||
#define __swp_type(x) ((x).val & 0x1f)
|
||||
#define __swp_offset(x) ( (((x).val >> 6) & 0x7) | \
|
||||
(((x).val >> 8) & ~0x7) )
|
||||
#define __swp_entry(type, offset) ((swp_entry_t) { (type) | \
|
||||
#define __swp_entry(type, offset) ((swp_entry_t) { \
|
||||
((type) & 0x1f) | \
|
||||
((offset & 0x7) << 6) | \
|
||||
((offset & ~0x7) << 8) })
|
||||
#define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val(pte) })
|
||||
#define __swp_entry_to_pte(x) ((pte_t) { (x).val })
|
||||
|
||||
static inline int pte_swp_exclusive(pte_t pte)
|
||||
{
|
||||
return pte_val(pte) & _PAGE_SWP_EXCLUSIVE;
|
||||
}
|
||||
|
||||
static inline pte_t pte_swp_mkexclusive(pte_t pte)
|
||||
{
|
||||
pte_val(pte) |= _PAGE_SWP_EXCLUSIVE;
|
||||
return pte;
|
||||
}
|
||||
|
||||
static inline pte_t pte_swp_clear_exclusive(pte_t pte)
|
||||
{
|
||||
pte_val(pte) &= ~_PAGE_SWP_EXCLUSIVE;
|
||||
return pte;
|
||||
}
|
||||
|
||||
static inline int ptep_test_and_clear_young(struct vm_area_struct *vma, unsigned long addr, pte_t *ptep)
|
||||
{
|
||||
pte_t pte;
|
||||
|
@ -42,6 +42,9 @@
|
||||
#define _PMD_PRESENT_MASK (PAGE_MASK)
|
||||
#define _PMD_BAD (~PAGE_MASK)
|
||||
|
||||
/* We borrow the _PAGE_USER bit to store the exclusive marker in swap PTEs. */
|
||||
#define _PAGE_SWP_EXCLUSIVE _PAGE_USER
|
||||
|
||||
/* And here we include common definitions */
|
||||
|
||||
#define _PAGE_KERNEL_RO 0
|
||||
@ -363,17 +366,41 @@ static inline void __ptep_set_access_flags(struct vm_area_struct *vma,
|
||||
#define pmd_page(pmd) pfn_to_page(pmd_pfn(pmd))
|
||||
|
||||
/*
|
||||
* Encode and decode a swap entry.
|
||||
* Note that the bits we use in a PTE for representing a swap entry
|
||||
* must not include the _PAGE_PRESENT bit or the _PAGE_HASHPTE bit (if used).
|
||||
* -- paulus
|
||||
* Encode/decode swap entries and swap PTEs. Swap PTEs are all PTEs that
|
||||
* are !pte_none() && !pte_present().
|
||||
*
|
||||
* Format of swap PTEs (32bit PTEs):
|
||||
*
|
||||
* 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3
|
||||
* 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
|
||||
* <----------------- offset --------------------> < type -> E H P
|
||||
*
|
||||
* E is the exclusive marker that is not stored in swap entries.
|
||||
* _PAGE_PRESENT (P) and __PAGE_HASHPTE (H) must be 0.
|
||||
*
|
||||
* For 64bit PTEs, the offset is extended by 32bit.
|
||||
*/
|
||||
#define __swp_type(entry) ((entry).val & 0x1f)
|
||||
#define __swp_offset(entry) ((entry).val >> 5)
|
||||
#define __swp_entry(type, offset) ((swp_entry_t) { (type) | ((offset) << 5) })
|
||||
#define __swp_entry(type, offset) ((swp_entry_t) { ((type) & 0x1f) | ((offset) << 5) })
|
||||
#define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val(pte) >> 3 })
|
||||
#define __swp_entry_to_pte(x) ((pte_t) { (x).val << 3 })
|
||||
|
||||
static inline int pte_swp_exclusive(pte_t pte)
|
||||
{
|
||||
return pte_val(pte) & _PAGE_SWP_EXCLUSIVE;
|
||||
}
|
||||
|
||||
static inline pte_t pte_swp_mkexclusive(pte_t pte)
|
||||
{
|
||||
return __pte(pte_val(pte) | _PAGE_SWP_EXCLUSIVE);
|
||||
}
|
||||
|
||||
static inline pte_t pte_swp_clear_exclusive(pte_t pte)
|
||||
{
|
||||
return __pte(pte_val(pte) & ~_PAGE_SWP_EXCLUSIVE);
|
||||
}
|
||||
|
||||
/* Generic accessors to PTE bits */
|
||||
static inline int pte_write(pte_t pte) { return !!(pte_val(pte) & _PAGE_RW);}
|
||||
static inline int pte_read(pte_t pte) { return 1; }
|
||||
|
@ -717,7 +717,6 @@ static inline pte_t pte_swp_clear_soft_dirty(pte_t pte)
|
||||
}
|
||||
#endif /* CONFIG_HAVE_ARCH_SOFT_DIRTY */
|
||||
|
||||
#define __HAVE_ARCH_PTE_SWP_EXCLUSIVE
|
||||
static inline pte_t pte_swp_mkexclusive(pte_t pte)
|
||||
{
|
||||
return __pte_raw(pte_raw(pte) | cpu_to_be64(_PAGE_SWP_EXCLUSIVE));
|
||||
|
@ -360,18 +360,30 @@ static inline int pte_young(pte_t pte)
|
||||
#endif
|
||||
|
||||
#define pmd_page(pmd) pfn_to_page(pmd_pfn(pmd))
|
||||
|
||||
/*
|
||||
* Encode and decode a swap entry.
|
||||
* Note that the bits we use in a PTE for representing a swap entry
|
||||
* must not include the _PAGE_PRESENT bit.
|
||||
* -- paulus
|
||||
* Encode/decode swap entries and swap PTEs. Swap PTEs are all PTEs that
|
||||
* are !pte_none() && !pte_present().
|
||||
*
|
||||
* Format of swap PTEs (32bit PTEs):
|
||||
*
|
||||
* 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3
|
||||
* 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
|
||||
* <------------------ offset -------------------> < type -> E 0 0
|
||||
*
|
||||
* E is the exclusive marker that is not stored in swap entries.
|
||||
*
|
||||
* For 64bit PTEs, the offset is extended by 32bit.
|
||||
*/
|
||||
#define __swp_type(entry) ((entry).val & 0x1f)
|
||||
#define __swp_offset(entry) ((entry).val >> 5)
|
||||
#define __swp_entry(type, offset) ((swp_entry_t) { (type) | ((offset) << 5) })
|
||||
#define __swp_entry(type, offset) ((swp_entry_t) { ((type) & 0x1f) | ((offset) << 5) })
|
||||
#define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val(pte) >> 3 })
|
||||
#define __swp_entry_to_pte(x) ((pte_t) { (x).val << 3 })
|
||||
|
||||
/* We borrow LSB 2 to store the exclusive marker in swap PTEs. */
|
||||
#define _PAGE_SWP_EXCLUSIVE 0x000004
|
||||
|
||||
#endif /* !__ASSEMBLY__ */
|
||||
|
||||
#endif /* __ASM_POWERPC_NOHASH_32_PGTABLE_H */
|
||||
|
@ -27,9 +27,9 @@
|
||||
* of the 16 available. Bit 24-26 of the TLB are cleared in the TLB
|
||||
* miss handler. Bit 27 is PAGE_USER, thus selecting the correct
|
||||
* zone.
|
||||
* - PRESENT *must* be in the bottom two bits because swap cache
|
||||
* entries use the top 30 bits. Because 40x doesn't support SMP
|
||||
* anyway, M is irrelevant so we borrow it for PAGE_PRESENT. Bit 30
|
||||
* - PRESENT *must* be in the bottom two bits because swap PTEs
|
||||
* use the top 30 bits. Because 40x doesn't support SMP anyway, M is
|
||||
* irrelevant so we borrow it for PAGE_PRESENT. Bit 30
|
||||
* is cleared in the TLB miss handler before the TLB entry is loaded.
|
||||
* - All other bits of the PTE are loaded into TLBLO without
|
||||
* modification, leaving us only the bits 20, 21, 24, 25, 26, 30 for
|
||||
|
@ -56,20 +56,10 @@
|
||||
* above bits. Note that the bit values are CPU specific, not architecture
|
||||
* specific.
|
||||
*
|
||||
* The kernel PTE entry holds an arch-dependent swp_entry structure under
|
||||
* certain situations. In other words, in such situations some portion of
|
||||
* the PTE bits are used as a swp_entry. In the PPC implementation, the
|
||||
* 3-24th LSB are shared with swp_entry, however the 0-2nd three LSB still
|
||||
* hold protection values. That means the three protection bits are
|
||||
* reserved for both PTE and SWAP entry at the most significant three
|
||||
* LSBs.
|
||||
*
|
||||
* There are three protection bits available for SWAP entry:
|
||||
* _PAGE_PRESENT
|
||||
* _PAGE_HASHPTE (if HW has)
|
||||
*
|
||||
* So those three bits have to be inside of 0-2nd LSB of PTE.
|
||||
*
|
||||
* The kernel PTE entry can be an ordinary PTE mapping a page or a special swap
|
||||
* PTE. In case of a swap PTE, LSB 2-24 are used to store information regarding
|
||||
* the swap entry. However LSB 0-1 still hold protection values, for example,
|
||||
* to distinguish swap PTEs from ordinary PTEs, and must be used with care.
|
||||
*/
|
||||
|
||||
#define _PAGE_PRESENT 0x00000001 /* S: PTE valid */
|
||||
|
@ -11,8 +11,8 @@
|
||||
32 33 34 35 36 ... 50 51 52 53 54 55 56 57 58 59 60 61 62 63
|
||||
RPN...................... 0 0 U0 U1 U2 U3 UX SX UW SW UR SR
|
||||
|
||||
- PRESENT *must* be in the bottom three bits because swap cache
|
||||
entries use the top 29 bits.
|
||||
- PRESENT *must* be in the bottom two bits because swap PTEs use
|
||||
the top 30 bits.
|
||||
|
||||
*/
|
||||
|
||||
|
@ -276,22 +276,40 @@ static inline void __ptep_set_access_flags(struct vm_area_struct *vma,
|
||||
#define pgd_ERROR(e) \
|
||||
pr_err("%s:%d: bad pgd %08lx.\n", __FILE__, __LINE__, pgd_val(e))
|
||||
|
||||
/* Encode and de-code a swap entry */
|
||||
/*
|
||||
* Encode/decode swap entries and swap PTEs. Swap PTEs are all PTEs that
|
||||
* are !pte_none() && !pte_present().
|
||||
*
|
||||
* Format of swap PTEs:
|
||||
*
|
||||
* 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3
|
||||
* 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
|
||||
* <-------------------------- offset ----------------------------
|
||||
*
|
||||
* 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 5 5 5 5 5 5 5 5 5 5 6 6 6 6
|
||||
* 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3
|
||||
* --------------> <----------- zero ------------> E < type -> 0 0
|
||||
*
|
||||
* E is the exclusive marker that is not stored in swap entries.
|
||||
*/
|
||||
#define MAX_SWAPFILES_CHECK() do { \
|
||||
BUILD_BUG_ON(MAX_SWAPFILES_SHIFT > SWP_TYPE_BITS); \
|
||||
} while (0)
|
||||
|
||||
#define SWP_TYPE_BITS 5
|
||||
#define __swp_type(x) (((x).val >> _PAGE_BIT_SWAP_TYPE) \
|
||||
#define __swp_type(x) (((x).val >> 2) \
|
||||
& ((1UL << SWP_TYPE_BITS) - 1))
|
||||
#define __swp_offset(x) ((x).val >> PTE_RPN_SHIFT)
|
||||
#define __swp_entry(type, offset) ((swp_entry_t) { \
|
||||
((type) << _PAGE_BIT_SWAP_TYPE) \
|
||||
(((type) & 0x1f) << 2) \
|
||||
| ((offset) << PTE_RPN_SHIFT) })
|
||||
|
||||
#define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val((pte)) })
|
||||
#define __swp_entry_to_pte(x) __pte((x).val)
|
||||
|
||||
/* We borrow MSB 56 (LSB 7) to store the exclusive marker in swap PTEs. */
|
||||
#define _PAGE_SWP_EXCLUSIVE 0x80
|
||||
|
||||
int map_kernel_page(unsigned long ea, unsigned long pa, pgprot_t prot);
|
||||
void unmap_kernel_page(unsigned long va);
|
||||
extern int __meminit vmemmap_create_mapping(unsigned long start,
|
||||
|
@ -151,6 +151,21 @@ static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
|
||||
return __pte((pte_val(pte) & _PAGE_CHG_MASK) | pgprot_val(newprot));
|
||||
}
|
||||
|
||||
static inline int pte_swp_exclusive(pte_t pte)
|
||||
{
|
||||
return pte_val(pte) & _PAGE_SWP_EXCLUSIVE;
|
||||
}
|
||||
|
||||
static inline pte_t pte_swp_mkexclusive(pte_t pte)
|
||||
{
|
||||
return __pte(pte_val(pte) | _PAGE_SWP_EXCLUSIVE);
|
||||
}
|
||||
|
||||
static inline pte_t pte_swp_clear_exclusive(pte_t pte)
|
||||
{
|
||||
return __pte(pte_val(pte) & ~_PAGE_SWP_EXCLUSIVE);
|
||||
}
|
||||
|
||||
/* Insert a PTE, top-level function is out of line. It uses an inline
|
||||
* low level function in the respective pgtable-* files
|
||||
*/
|
||||
|
@ -12,7 +12,6 @@
|
||||
/* Architected bits */
|
||||
#define _PAGE_PRESENT 0x000001 /* software: pte contains a translation */
|
||||
#define _PAGE_SW1 0x000002
|
||||
#define _PAGE_BIT_SWAP_TYPE 2
|
||||
#define _PAGE_BAP_SR 0x000004
|
||||
#define _PAGE_BAP_UR 0x000008
|
||||
#define _PAGE_BAP_SW 0x000010
|
||||
|
@ -117,15 +117,6 @@ extern long long virt_phys_offset;
|
||||
|
||||
#ifdef CONFIG_FLATMEM
|
||||
#define ARCH_PFN_OFFSET ((unsigned long)(MEMORY_START >> PAGE_SHIFT))
|
||||
#ifndef __ASSEMBLY__
|
||||
extern unsigned long max_mapnr;
|
||||
static inline bool pfn_valid(unsigned long pfn)
|
||||
{
|
||||
unsigned long min_pfn = ARCH_PFN_OFFSET;
|
||||
|
||||
return pfn >= min_pfn && pfn < max_mapnr;
|
||||
}
|
||||
#endif
|
||||
#endif
|
||||
|
||||
#define virt_to_pfn(kaddr) (__pa(kaddr) >> PAGE_SHIFT)
|
||||
|
@ -132,7 +132,7 @@ void __iomem *ioremap_phb(phys_addr_t paddr, unsigned long size)
|
||||
* address decoding but I'd rather not deal with those outside of the
|
||||
* reserved 64K legacy region.
|
||||
*/
|
||||
area = __get_vm_area_caller(size, 0, PHB_IO_BASE, PHB_IO_END,
|
||||
area = __get_vm_area_caller(size, VM_IOREMAP, PHB_IO_BASE, PHB_IO_END,
|
||||
__builtin_return_address(0));
|
||||
if (!area)
|
||||
return NULL;
|
||||
|
@ -120,10 +120,8 @@ int vdso_join_timens(struct task_struct *task, struct time_namespace *ns)
|
||||
|
||||
mmap_read_lock(mm);
|
||||
for_each_vma(vmi, vma) {
|
||||
unsigned long size = vma->vm_end - vma->vm_start;
|
||||
|
||||
if (vma_is_special_mapping(vma, &vvar_spec))
|
||||
zap_page_range(vma, vma->vm_start, size);
|
||||
zap_vma_pages(vma);
|
||||
}
|
||||
mmap_read_unlock(mm);
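/*
 * Editorial aside (not part of this diff): zap_vma_pages() is the helper these
 * call sites switch to. Paraphrased from the same series (the exact upstream
 * code may differ), it simply zaps the whole VMA, which is why the local size
 * computation removed above is no longer needed:
 */
static inline void zap_vma_pages(struct vm_area_struct *vma)
{
	zap_page_range_single(vma, vma->vm_start,
			      vma->vm_end - vma->vm_start, NULL);
}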
@ -393,6 +393,7 @@ static int kvmppc_memslot_page_merge(struct kvm *kvm,
|
||||
{
|
||||
unsigned long gfn = memslot->base_gfn;
|
||||
unsigned long end, start = gfn_to_hva(kvm, gfn);
|
||||
unsigned long vm_flags;
|
||||
int ret = 0;
|
||||
struct vm_area_struct *vma;
|
||||
int merge_flag = (merge) ? MADV_MERGEABLE : MADV_UNMERGEABLE;
|
||||
@ -409,12 +410,15 @@ static int kvmppc_memslot_page_merge(struct kvm *kvm,
|
||||
ret = H_STATE;
|
||||
break;
|
||||
}
|
||||
/* Copy vm_flags to avoid partial modifications in ksm_madvise */
|
||||
vm_flags = vma->vm_flags;
|
||||
ret = ksm_madvise(vma, vma->vm_start, vma->vm_end,
|
||||
merge_flag, &vma->vm_flags);
|
||||
merge_flag, &vm_flags);
|
||||
if (ret) {
|
||||
ret = H_STATE;
|
||||
break;
|
||||
}
|
||||
vm_flags_reset(vma, vm_flags);
|
||||
start = vma->vm_end;
|
||||
} while (end > vma->vm_end);
|
||||
|
||||
|
@ -324,7 +324,7 @@ static int kvmppc_xive_native_mmap(struct kvm_device *dev,
|
||||
return -EINVAL;
|
||||
}
|
||||
|
||||
vma->vm_flags |= VM_IO | VM_PFNMAP;
|
||||
vm_flags_set(vma, VM_IO | VM_PFNMAP);
|
||||
vma->vm_page_prot = pgprot_noncached_wc(vma->vm_page_prot);
|
||||
|
||||
/*
|
||||
|
@ -156,7 +156,7 @@ static void subpage_mark_vma_nohuge(struct mm_struct *mm, unsigned long addr,
|
||||
* VM_NOHUGEPAGE and split them.
|
||||
*/
|
||||
for_each_vma_range(vmi, vma, addr + len) {
|
||||
vma->vm_flags |= VM_NOHUGEPAGE;
|
||||
vm_flags_set(vma, VM_NOHUGEPAGE);
|
||||
walk_page_vma(vma, &subpage_walk_ops, NULL);
|
||||
}
|
||||
}
|
||||
|
@ -414,7 +414,7 @@ static vm_fault_t vas_mmap_fault(struct vm_fault *vmf)
|
||||
/*
|
||||
* When the LPAR lost credits due to core removal or during
|
||||
* migration, invalidate the existing mapping for the current
|
||||
* paste addresses and set windows in-active (zap_page_range in
|
||||
* paste addresses and set windows in-active (zap_vma_pages in
|
||||
* reconfig_close_windows()).
|
||||
* New mapping will be done later after migration or new credits
|
||||
* available. So continue to receive faults if the user space
|
||||
@ -525,7 +525,7 @@ static int coproc_mmap(struct file *fp, struct vm_area_struct *vma)
|
||||
pfn = paste_addr >> PAGE_SHIFT;
|
||||
|
||||
/* flags, page_prot from cxl_mmap(), except we want cachable */
|
||||
vma->vm_flags |= VM_IO | VM_PFNMAP;
|
||||
vm_flags_set(vma, VM_IO | VM_PFNMAP);
|
||||
vma->vm_page_prot = pgprot_cached(vma->vm_page_prot);
|
||||
|
||||
prot = __pgprot(pgprot_val(vma->vm_page_prot) | _PAGE_DIRTY);
|
||||
|
@ -291,7 +291,7 @@ static int spufs_mem_mmap(struct file *file, struct vm_area_struct *vma)
|
||||
if (!(vma->vm_flags & VM_SHARED))
|
||||
return -EINVAL;
|
||||
|
||||
vma->vm_flags |= VM_IO | VM_PFNMAP;
|
||||
vm_flags_set(vma, VM_IO | VM_PFNMAP);
|
||||
vma->vm_page_prot = pgprot_noncached_wc(vma->vm_page_prot);
|
||||
|
||||
vma->vm_ops = &spufs_mem_mmap_vmops;
|
||||
@ -381,7 +381,7 @@ static int spufs_cntl_mmap(struct file *file, struct vm_area_struct *vma)
|
||||
if (!(vma->vm_flags & VM_SHARED))
|
||||
return -EINVAL;
|
||||
|
||||
vma->vm_flags |= VM_IO | VM_PFNMAP;
|
||||
vm_flags_set(vma, VM_IO | VM_PFNMAP);
|
||||
vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
|
||||
|
||||
vma->vm_ops = &spufs_cntl_mmap_vmops;
|
||||
@ -1043,7 +1043,7 @@ static int spufs_signal1_mmap(struct file *file, struct vm_area_struct *vma)
|
||||
if (!(vma->vm_flags & VM_SHARED))
|
||||
return -EINVAL;
|
||||
|
||||
vma->vm_flags |= VM_IO | VM_PFNMAP;
|
||||
vm_flags_set(vma, VM_IO | VM_PFNMAP);
|
||||
vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
|
||||
|
||||
vma->vm_ops = &spufs_signal1_mmap_vmops;
|
||||
@ -1179,7 +1179,7 @@ static int spufs_signal2_mmap(struct file *file, struct vm_area_struct *vma)
|
||||
if (!(vma->vm_flags & VM_SHARED))
|
||||
return -EINVAL;
|
||||
|
||||
vma->vm_flags |= VM_IO | VM_PFNMAP;
|
||||
vm_flags_set(vma, VM_IO | VM_PFNMAP);
|
||||
vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
|
||||
|
||||
vma->vm_ops = &spufs_signal2_mmap_vmops;
|
||||
@ -1302,7 +1302,7 @@ static int spufs_mss_mmap(struct file *file, struct vm_area_struct *vma)
|
||||
if (!(vma->vm_flags & VM_SHARED))
|
||||
return -EINVAL;
|
||||
|
||||
vma->vm_flags |= VM_IO | VM_PFNMAP;
|
||||
vm_flags_set(vma, VM_IO | VM_PFNMAP);
|
||||
vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
|
||||
|
||||
vma->vm_ops = &spufs_mss_mmap_vmops;
|
||||
@ -1364,7 +1364,7 @@ static int spufs_psmap_mmap(struct file *file, struct vm_area_struct *vma)
|
||||
if (!(vma->vm_flags & VM_SHARED))
|
||||
return -EINVAL;
|
||||
|
||||
vma->vm_flags |= VM_IO | VM_PFNMAP;
|
||||
vm_flags_set(vma, VM_IO | VM_PFNMAP);
|
||||
vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
|
||||
|
||||
vma->vm_ops = &spufs_psmap_mmap_vmops;
|
||||
@ -1424,7 +1424,7 @@ static int spufs_mfc_mmap(struct file *file, struct vm_area_struct *vma)
|
||||
if (!(vma->vm_flags & VM_SHARED))
|
||||
return -EINVAL;
|
||||
|
||||
vma->vm_flags |= VM_IO | VM_PFNMAP;
|
||||
vm_flags_set(vma, VM_IO | VM_PFNMAP);
|
||||
vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
|
||||
|
||||
vma->vm_ops = &spufs_mfc_mmap_vmops;
|
||||
|
@ -760,8 +760,7 @@ static int reconfig_close_windows(struct vas_caps *vcap, int excess_creds,
|
||||
* is done before the original mmap() and after the ioctl.
|
||||
*/
|
||||
if (vma)
|
||||
zap_page_range(vma, vma->vm_start,
|
||||
vma->vm_end - vma->vm_start);
|
||||
zap_vma_pages(vma);
|
||||
|
||||
mmap_write_unlock(task_ref->mm);
|
||||
mutex_unlock(&task_ref->mmap_mutex);
|
||||
|
@ -171,11 +171,6 @@ extern phys_addr_t __phys_addr_symbol(unsigned long x);
|
||||
|
||||
#define sym_to_pfn(x) __phys_to_pfn(__pa_symbol(x))
|
||||
|
||||
#ifdef CONFIG_FLATMEM
|
||||
#define pfn_valid(pfn) \
|
||||
(((pfn) >= ARCH_PFN_OFFSET) && (((pfn) - ARCH_PFN_OFFSET) < max_mapnr))
|
||||
#endif
|
||||
|
||||
#endif /* __ASSEMBLY__ */
|
||||
|
||||
#define virt_addr_valid(vaddr) ({ \
|
||||
|
@ -27,6 +27,9 @@
|
||||
*/
|
||||
#define _PAGE_PROT_NONE _PAGE_GLOBAL
|
||||
|
||||
/* Used for swap PTEs only. */
|
||||
#define _PAGE_SWP_EXCLUSIVE _PAGE_ACCESSED
|
||||
|
||||
#define _PAGE_PFN_SHIFT 10
|
||||
|
||||
/*
|
||||
|
@ -728,16 +728,18 @@ extern pmd_t pmdp_collapse_flush(struct vm_area_struct *vma,
|
||||
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
|
||||
|
||||
/*
|
||||
* Encode and decode a swap entry
|
||||
* Encode/decode swap entries and swap PTEs. Swap PTEs are all PTEs that
|
||||
* are !pte_none() && !pte_present().
|
||||
*
|
||||
* Format of swap PTE:
|
||||
* bit 0: _PAGE_PRESENT (zero)
|
||||
* bit 1 to 3: _PAGE_LEAF (zero)
|
||||
* bit 5: _PAGE_PROT_NONE (zero)
|
||||
* bits 6 to 10: swap type
|
||||
* bits 10 to XLEN-1: swap offset
|
||||
* bit 6: exclusive marker
|
||||
* bits 7 to 11: swap type
|
||||
* bits 11 to XLEN-1: swap offset
|
||||
*/
|
||||
#define __SWP_TYPE_SHIFT 6
|
||||
#define __SWP_TYPE_SHIFT 7
|
||||
#define __SWP_TYPE_BITS 5
|
||||
#define __SWP_TYPE_MASK ((1UL << __SWP_TYPE_BITS) - 1)
|
||||
#define __SWP_OFFSET_SHIFT (__SWP_TYPE_BITS + __SWP_TYPE_SHIFT)
|
||||
@ -748,11 +750,27 @@ extern pmd_t pmdp_collapse_flush(struct vm_area_struct *vma,
|
||||
#define __swp_type(x) (((x).val >> __SWP_TYPE_SHIFT) & __SWP_TYPE_MASK)
|
||||
#define __swp_offset(x) ((x).val >> __SWP_OFFSET_SHIFT)
|
||||
#define __swp_entry(type, offset) ((swp_entry_t) \
|
||||
{ ((type) << __SWP_TYPE_SHIFT) | ((offset) << __SWP_OFFSET_SHIFT) })
|
||||
{ (((type) & __SWP_TYPE_MASK) << __SWP_TYPE_SHIFT) | \
|
||||
((offset) << __SWP_OFFSET_SHIFT) })
|
||||
|
||||
#define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val(pte) })
|
||||
#define __swp_entry_to_pte(x) ((pte_t) { (x).val })
|
||||
|
||||
static inline int pte_swp_exclusive(pte_t pte)
|
||||
{
|
||||
return pte_val(pte) & _PAGE_SWP_EXCLUSIVE;
|
||||
}
|
||||
|
||||
static inline pte_t pte_swp_mkexclusive(pte_t pte)
|
||||
{
|
||||
return __pte(pte_val(pte) | _PAGE_SWP_EXCLUSIVE);
|
||||
}
|
||||
|
||||
static inline pte_t pte_swp_clear_exclusive(pte_t pte)
|
||||
{
|
||||
return __pte(pte_val(pte) & ~_PAGE_SWP_EXCLUSIVE);
|
||||
}
|
||||
|
||||
#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
|
||||
#define __pmd_to_swp_entry(pmd) ((swp_entry_t) { pmd_val(pmd) })
|
||||
#define __swp_entry_to_pmd(swp) __pmd((swp).val)
|
||||
|
@ -124,13 +124,11 @@ int vdso_join_timens(struct task_struct *task, struct time_namespace *ns)
|
||||
mmap_read_lock(mm);
|
||||
|
||||
for_each_vma(vmi, vma) {
|
||||
unsigned long size = vma->vm_end - vma->vm_start;
|
||||
|
||||
if (vma_is_special_mapping(vma, vdso_info.dm))
|
||||
zap_page_range(vma, vma->vm_start, size);
|
||||
zap_vma_pages(vma);
|
||||
#ifdef CONFIG_COMPAT
|
||||
if (vma_is_special_mapping(vma, compat_vdso_info.dm))
|
||||
zap_page_range(vma, vma->vm_start, size);
|
||||
zap_vma_pages(vma);
|
||||
#endif
|
||||
}
|
||||
|
||||
|
@ -73,9 +73,8 @@ static inline void copy_page(void *to, void *from)
|
||||
#define clear_user_page(page, vaddr, pg) clear_page(page)
|
||||
#define copy_user_page(to, from, vaddr, pg) copy_page(to, from)
|
||||
|
||||
#define alloc_zeroed_user_highpage_movable(vma, vaddr) \
|
||||
alloc_page_vma(GFP_HIGHUSER_MOVABLE | __GFP_ZERO, vma, vaddr)
|
||||
#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE_MOVABLE
|
||||
#define vma_alloc_zeroed_movable_folio(vma, vaddr) \
|
||||
vma_alloc_folio(GFP_HIGHUSER_MOVABLE | __GFP_ZERO, 0, vma, vaddr, false)
|
||||
|
||||
/*
|
||||
* These are used to make use of C type-checking..
|
||||
|
@ -827,7 +827,6 @@ static inline int pmd_protnone(pmd_t pmd)
|
||||
}
|
||||
#endif
|
||||
|
||||
#define __HAVE_ARCH_PTE_SWP_EXCLUSIVE
|
||||
static inline int pte_swp_exclusive(pte_t pte)
|
||||
{
|
||||
return pte_val(pte) & _PAGE_SWP_EXCLUSIVE;
|
||||
|
@ -59,11 +59,9 @@ int vdso_join_timens(struct task_struct *task, struct time_namespace *ns)
|
||||
|
||||
mmap_read_lock(mm);
|
||||
for_each_vma(vmi, vma) {
|
||||
unsigned long size = vma->vm_end - vma->vm_start;
|
||||
|
||||
if (!vma_is_special_mapping(vma, &vvar_mapping))
|
||||
continue;
|
||||
zap_page_range(vma, vma->vm_start, size);
|
||||
zap_vma_pages(vma);
|
||||
break;
|
||||
}
|
||||
mmap_read_unlock(mm);
|
||||
|
@ -722,7 +722,7 @@ void gmap_discard(struct gmap *gmap, unsigned long from, unsigned long to)
|
||||
if (is_vm_hugetlb_page(vma))
|
||||
continue;
|
||||
size = min(to - gaddr, PMD_SIZE - (gaddr & ~PMD_MASK));
|
||||
zap_page_range(vma, vmaddr, size);
|
||||
zap_page_range_single(vma, vmaddr, size, NULL);
|
||||
}
|
||||
mmap_read_unlock(gmap->mm);
|
||||
}
|
||||
@ -2522,8 +2522,7 @@ static inline void thp_split_mm(struct mm_struct *mm)
|
||||
VMA_ITERATOR(vmi, mm, 0);
|
||||
|
||||
for_each_vma(vmi, vma) {
|
||||
vma->vm_flags &= ~VM_HUGEPAGE;
|
||||
vma->vm_flags |= VM_NOHUGEPAGE;
|
||||
vm_flags_mod(vma, VM_NOHUGEPAGE, VM_HUGEPAGE);
|
||||
walk_page_vma(vma, &thp_split_walk_ops, NULL);
|
||||
}
|
||||
mm->def_flags |= VM_NOHUGEPAGE;
|
||||
@ -2588,14 +2587,18 @@ int gmap_mark_unmergeable(void)
|
||||
{
|
||||
struct mm_struct *mm = current->mm;
|
||||
struct vm_area_struct *vma;
|
||||
unsigned long vm_flags;
|
||||
int ret;
|
||||
VMA_ITERATOR(vmi, mm, 0);
|
||||
|
||||
for_each_vma(vmi, vma) {
|
||||
/* Copy vm_flags to avoid partial modifications in ksm_madvise */
|
||||
vm_flags = vma->vm_flags;
|
||||
ret = ksm_madvise(vma, vma->vm_start, vma->vm_end,
|
||||
MADV_UNMERGEABLE, &vma->vm_flags);
|
||||
MADV_UNMERGEABLE, &vm_flags);
|
||||
if (ret)
|
||||
return ret;
|
||||
vm_flags_reset(vma, vm_flags);
|
||||
}
|
||||
mm->def_flags &= ~VM_MERGEABLE;
|
||||
return 0;
|
||||
|
@ -169,9 +169,6 @@ typedef struct page *pgtable_t;
|
||||
#define PFN_START (__MEMORY_START >> PAGE_SHIFT)
|
||||
#define ARCH_PFN_OFFSET (PFN_START)
|
||||
#define virt_to_page(kaddr) pfn_to_page(__pa(kaddr) >> PAGE_SHIFT)
|
||||
#ifdef CONFIG_FLATMEM
|
||||
#define pfn_valid(pfn) ((pfn) >= min_low_pfn && (pfn) < max_low_pfn)
|
||||
#endif
|
||||
#define virt_addr_valid(kaddr) pfn_valid(__pa(kaddr) >> PAGE_SHIFT)
|
||||
|
||||
#include <asm-generic/memory_model.h>
|
||||
|
@ -423,40 +423,69 @@ static inline unsigned long pmd_page_vaddr(pmd_t pmd)
|
||||
#endif
|
||||
|
||||
/*
|
||||
* Encode and de-code a swap entry
|
||||
* Encode/decode swap entries and swap PTEs. Swap PTEs are all PTEs that
|
||||
* are !pte_none() && !pte_present().
|
||||
*
|
||||
* Constraints:
|
||||
* _PAGE_PRESENT at bit 8
|
||||
* _PAGE_PROTNONE at bit 9
|
||||
*
|
||||
* For the normal case, we encode the swap type into bits 0:7 and the
|
||||
* swap offset into bits 10:30. For the 64-bit PTE case, we keep the
|
||||
* preserved bits in the low 32-bits and use the upper 32 as the swap
|
||||
* offset (along with a 5-bit type), following the same approach as x86
|
||||
* PAE. This keeps the logic quite simple.
|
||||
* For the normal case, we encode the swap type and offset into the swap PTE
|
||||
* such that bits 8 and 9 stay zero. For the 64-bit PTE case, we use the
|
||||
* upper 32 for the swap offset and swap type, following the same approach as
|
||||
* x86 PAE. This keeps the logic quite simple.
|
||||
*
|
||||
* As is evident by the Alpha code, if we ever get a 64-bit unsigned
|
||||
* long (swp_entry_t) to match up with the 64-bit PTEs, this all becomes
|
||||
* much cleaner..
|
||||
*
|
||||
* NOTE: We should set ZEROs at the position of _PAGE_PRESENT
|
||||
* and _PAGE_PROTNONE bits
|
||||
*/
|
||||
|
||||
#ifdef CONFIG_X2TLB
|
||||
/*
|
||||
* Format of swap PTEs:
|
||||
*
|
||||
* 6 6 6 6 5 5 5 5 5 5 5 5 5 5 4 4 4 4 4 4 4 4 4 4 3 3 3 3 3 3 3 3
|
||||
* 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2
|
||||
* <--------------------- offset ----------------------> < type ->
|
||||
*
|
||||
* 3 3 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1
|
||||
* 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0
|
||||
* <------------------- zeroes --------------------> E 0 0 0 0 0 0
|
||||
*/
|
||||
#define __swp_type(x) ((x).val & 0x1f)
|
||||
#define __swp_offset(x) ((x).val >> 5)
|
||||
#define __swp_entry(type, offset) ((swp_entry_t){ (type) | (offset) << 5})
|
||||
#define __swp_entry(type, offset) ((swp_entry_t){ ((type) & 0x1f) | (offset) << 5})
|
||||
#define __pte_to_swp_entry(pte) ((swp_entry_t){ (pte).pte_high })
|
||||
#define __swp_entry_to_pte(x) ((pte_t){ 0, (x).val })
|
||||
|
||||
#else
|
||||
#define __swp_type(x) ((x).val & 0xff)
|
||||
/*
|
||||
* Format of swap PTEs:
|
||||
*
|
||||
* 3 3 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1
|
||||
* 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0
|
||||
* <--------------- offset ----------------> 0 0 0 0 E < type -> 0
|
||||
*
|
||||
* E is the exclusive marker that is not stored in swap entries.
|
||||
*/
|
||||
#define __swp_type(x) ((x).val & 0x1f)
|
||||
#define __swp_offset(x) ((x).val >> 10)
|
||||
#define __swp_entry(type, offset) ((swp_entry_t){(type) | (offset) <<10})
|
||||
#define __swp_entry(type, offset) ((swp_entry_t){((type) & 0x1f) | (offset) << 10})
|
||||
|
||||
#define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val(pte) >> 1 })
|
||||
#define __swp_entry_to_pte(x) ((pte_t) { (x).val << 1 })
|
||||
#endif
|
||||
|
||||
/* In both cases, we borrow bit 6 to store the exclusive marker in swap PTEs. */
|
||||
#define _PAGE_SWP_EXCLUSIVE _PAGE_USER
|
||||
|
||||
static inline int pte_swp_exclusive(pte_t pte)
|
||||
{
|
||||
return pte.pte_low & _PAGE_SWP_EXCLUSIVE;
|
||||
}
|
||||
|
||||
PTE_BIT_FUNC(low, swp_mkexclusive, |= _PAGE_SWP_EXCLUSIVE);
|
||||
PTE_BIT_FUNC(low, swp_clear_exclusive, &= ~_PAGE_SWP_EXCLUSIVE);
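/*
 * Editorial aside (not part of this diff): SH generates these accessors with
 * its PTE_BIT_FUNC() helper instead of spelling them out. Paraphrased (the
 * exact upstream macro may differ), PTE_BIT_FUNC(low, swp_mkexclusive, ...)
 * above expands to roughly:
 *
 *	static inline pte_t pte_swp_mkexclusive(pte_t pte)
 *	{ pte.pte_low |= _PAGE_SWP_EXCLUSIVE; return pte; }
 *
 * i.e. the same set/clear pair that the other architectures in this series
 * add open-coded.
 */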
|
||||
#endif /* __ASM_SH_PGTABLE_32_H */
|
||||
|