Merge tag 'mm-stable-2023-02-20-13-37' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:

 - Daniel Verkamp has contributed a memfd series ("mm/memfd: add
   F_SEAL_EXEC") which permits the setting of the memfd execute bit at
   memfd creation time, with the option of sealing the state of the X
   bit.

 - Peter Xu adds a patch series ("mm/hugetlb: Make huge_pte_offset()
   thread-safe for pmd unshare") which addresses a rare race condition
   related to PMD unsharing.

 - Several folioification patch series from Matthew Wilcox, Vishal
   Moola, Sidhartha Kumar and Lorenzo Stoakes.

 - Johannes Weiner has a series ("mm: push down lock_page_memcg()")
   which performs some memcg maintenance and cleanup work.

 - SeongJae Park has added DAMOS filtering to DAMON, with the series
   "mm/damon/core: implement damos filter".

   These filters provide users with finer-grained control over DAMOS's
   actions. SeongJae has also done some DAMON cleanup work.

 - Kairui Song adds a series ("Clean up and fixes for swap").

 - Vernon Yang contributed the series "Clean up and refinement for maple
   tree".

 - Yu Zhao has contributed the "mm: multi-gen LRU: memcg LRU" series. It
   adds to MGLRU an LRU of memcgs, to improve the scalability of global
   reclaim.

 - David Hildenbrand has added some userfaultfd cleanup work in the
   series "mm: uffd-wp + change_protection() cleanups".

 - Christoph Hellwig has removed the generic_writepages() library
   function in the series "remove generic_writepages".

 - Baolin Wang has performed some maintenance on the compaction code in
   his series "Some small improvements for compaction".

 - Sidhartha Kumar is doing some maintenance work on struct page in his
   series "Get rid of tail page fields".

 - David Hildenbrand contributed some cleanup, bugfixing and
   generalization of pte management and of pte debugging in his series
   "mm: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE on all architectures with
   swap PTEs".

 - Mel Gorman and Neil Brown have removed the __GFP_ATOMIC allocation
   flag in the series "Discard __GFP_ATOMIC".

 - Sergey Senozhatsky has improved zsmalloc's memory utilization with
   his series "zsmalloc: make zspage chain size configurable".

 - Joey Gouly has added prctl() support for prohibiting the creation of
   writeable+executable mappings.

   The previous BPF-based approach had shortcomings. See "mm: In-kernel
   support for memory-deny-write-execute (MDWE)".

 - Waiman Long did some kmemleak cleanup and bugfixing in the series
   "mm/kmemleak: Simplify kmemleak_cond_resched() & fix UAF".

 - T.J. Alumbaugh has contributed some MGLRU cleanup work in his series
   "mm: multi-gen LRU: improve".

 - Jiaqi Yan has provided some enhancements to our memory error
   statistics reporting, mainly by presenting the statistics on a
   per-node basis. See the series "Introduce per NUMA node memory error
   statistics".

 - Mel Gorman has a second and hopefully final shot at fixing a CPU-hog
   regression in compaction via his series "Fix excessive CPU usage
   during compaction".

 - Christoph Hellwig does some vmalloc maintenance work in the series
   "cleanup vfree and vunmap".

 - Christoph Hellwig has removed block_device_operations.rw_page() in
   the series "remove ->rw_page".

 - We get some maple_tree improvements and cleanups in Liam Howlett's
   series "VMA tree type safety and remove __vma_adjust()".

 - Suren Baghdasaryan has done some work on the maintainability of our
   vm_flags handling in the series "introduce vm_flags modifier
   functions".

 - Some pagemap cleanup and generalization work in Mike Rapoport's
   series "mm, arch: add generic implementation of pfn_valid() for
   FLATMEM" and "fixups for generic implementation of pfn_valid()".

 - Baoquan He has done some work to make /proc/vmallocinfo and
   /proc/kcore better represent the real state of things in his series
   "mm/vmalloc.c: allow vread() to read out vm_map_ram areas".

 - Jason Gunthorpe rationalized the GUP system's interface to the rest
   of the kernel in the series "Simplify the external interface for
   GUP".

 - SeongJae Park wishes to migrate people from DAMON's debugfs interface
   over to its sysfs interface. To support this, we'll temporarily be
   printing warnings when people use the debugfs interface. See the
   series "mm/damon: deprecate DAMON debugfs interface".

 - Andrey Konovalov provided the accurately named "lib/stackdepot: fixes
   and clean-ups" series.

 - Huang Ying has provided a dramatic reduction in migration's TLB flush
   IPI rates with the series "migrate_pages(): batch TLB flushing".

 - Arnd Bergmann has some objtool fixups in "objtool warning fixes".

* tag 'mm-stable-2023-02-20-13-37' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (505 commits)
  include/linux/migrate.h: remove unneeded externs
  mm/memory_hotplug: cleanup return value handing in do_migrate_range()
  mm/uffd: fix comment in handling pte markers
  mm: change to return bool for isolate_movable_page()
  mm: hugetlb: change to return bool for isolate_hugetlb()
  mm: change to return bool for isolate_lru_page()
  mm: change to return bool for folio_isolate_lru()
  objtool: add UACCESS exceptions for __tsan_volatile_read/write
  kmsan: disable ftrace in kmsan core code
  kasan: mark addr_has_metadata __always_inline
  mm: memcontrol: rename memcg_kmem_enabled()
  sh: initialize max_mapnr
  m68k/nommu: add missing definition of ARCH_PFN_OFFSET
  mm: percpu: fix incorrect size in pcpu_obj_full_size()
  maple_tree: reduce stack usage with gcc-9 and earlier
  mm: page_alloc: call panic() when memoryless node allocation fails
  mm: multi-gen LRU: avoid futile retries
  migrate_pages: move THP/hugetlb migration support check to simplify code
  migrate_pages: batch flushing TLB
  migrate_pages: share more code between _unmap and _move
  ...
Linus Torvalds 2023-02-23 17:09:35 -08:00
commit 3822a7c409
496 changed files with 11611 additions and 6985 deletions

@ -182,3 +182,42 @@ Date: November 2021
Contact: Jarkko Sakkinen <jarkko@kernel.org>
Description:
The total amount of SGX physical memory in bytes.
What: /sys/devices/system/node/nodeX/memory_failure/total
Date: January 2023
Contact: Jiaqi Yan <jiaqiyan@google.com>
Description:
The total number of raw poisoned pages (pages containing
corrupted data due to memory errors) on a NUMA node.
What: /sys/devices/system/node/nodeX/memory_failure/ignored
Date: January 2023
Contact: Jiaqi Yan <jiaqiyan@google.com>
Description:
Of the raw poisoned pages on a NUMA node, how many pages are
ignored by memory error recovery attempt, usually because
support for this type of pages is unavailable, and kernel
gives up the recovery.
What: /sys/devices/system/node/nodeX/memory_failure/failed
Date: January 2023
Contact: Jiaqi Yan <jiaqiyan@google.com>
Description:
Of the raw poisoned pages on a NUMA node, how many pages are
failed by memory error recovery attempt. This usually means
a key recovery operation failed.
What: /sys/devices/system/node/nodeX/memory_failure/delayed
Date: January 2023
Contact: Jiaqi Yan <jiaqiyan@google.com>
Description:
Of the raw poisoned pages on a NUMA node, how many pages are
delayed by memory error recovery attempt. Delayed poisoned
pages usually will be retried by kernel.
What: /sys/devices/system/node/nodeX/memory_failure/recovered
Date: January 2023
Contact: Jiaqi Yan <jiaqiyan@google.com>
Description:
Of the raw poisoned pages on a NUMA node, how many pages are
recovered by memory error recovery attempt.
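
The per-node counters described above can be collected with ordinary file reads. The following is a minimal sketch, assuming each file holds a single decimal integer and sits directly under the node's ``memory_failure/`` directory as listed above. The helper name ``read_memory_failure_stats`` and its ``node_dir`` parameter are hypothetical and are not part of any kernel tooling.

```python
from pathlib import Path

# Counter file names taken from the ABI entries above.
COUNTERS = ("total", "ignored", "failed", "delayed", "recovered")


def read_memory_failure_stats(node_dir):
    """Return {counter: value} for one NUMA node directory.

    node_dir is a path such as /sys/devices/system/node/node0.
    Counters whose file is missing (for example on a kernel without
    this interface) are skipped instead of raising an error.
    """
    stats = {}
    base = Path(node_dir) / "memory_failure"
    for name in COUNTERS:
        path = base / name
        if path.is_file():
            # Assumption: the file contains one decimal integer.
            stats[name] = int(path.read_text().strip())
    return stats
```

On a system that exposes the interface, calling ``read_memory_failure_stats("/sys/devices/system/node/node0")`` would return the counters for node 0; on other systems it returns an empty dictionary.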

@ -258,6 +258,35 @@ Contact: SeongJae Park <sj@kernel.org>
Description: Writing to and reading from this file sets and gets the low
watermark of the scheme in permil.
What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/filters/nr_filters
Date: Dec 2022
Contact: SeongJae Park <sj@kernel.org>
Description: Writing a number 'N' to this file creates 'N' directories,
named '0' to 'N-1', under the filters/ directory for setting
the filters of the scheme.
What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/filters/<F>/type
Date: Dec 2022
Contact: SeongJae Park <sj@kernel.org>
Description: Writing to and reading from this file sets and gets the type of
the memory of the interest. 'anon' for anonymous pages, or
'memcg' for specific memory cgroup can be written and read.
What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/filters/<F>/memcg_path
Date: Dec 2022
Contact: SeongJae Park <sj@kernel.org>
Description: If 'memcg' is written to the 'type' file, writing to and
reading from this file sets and gets the path to the memory
cgroup of the interest.
What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/filters/<F>/matching
Date: Dec 2022
Contact: SeongJae Park <sj@kernel.org>
Description: Writing 'Y' or 'N' to this file sets whether to filter out
pages that do or do not match the 'type' and 'memcg_path',
respectively. The action of the scheme is not applied to
pages that are filtered out.
What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/stats/nr_tried
Date: Mar 2022
Contact: SeongJae Park <sj@kernel.org>

@ -87,6 +87,8 @@ Brief summary of control files.
memory.swappiness               set/show swappiness parameter of vmscan
                                (See sysctl's vm.swappiness)
memory.move_charge_at_immigrate set/show controls of moving charges
                                This knob is deprecated and shouldn't be
                                used.
memory.oom_control              set/show oom controls.
memory.numa_stat                show the number of memory usage per numa
                                node
@ -727,8 +729,15 @@ If we want to change this to 1G, we can at any time use::
 .. _cgroup-v1-memory-move-charges:
-8. Move charges at task migration
-=================================
+8. Move charges at task migration (DEPRECATED!)
+===============================================
+THIS IS DEPRECATED!
+It's expensive and unreliable! It's better practice to launch workload
+tasks directly from inside their target cgroup. Use dedicated workload
+cgroups to allow fine-grained policy adjustments without having to
+move physical pages between control domains.
 Users can move charges associated with a task along with task migration, that
 is, uncharge task's pages from the old cgroup and charge them to the new cgroup.

@ -205,6 +205,15 @@ The end physical address of memory region that DAMON_RECLAIM will do work
against. That is, DAMON_RECLAIM will find cold memory regions in this region
and reclaims. By default, biggest System RAM is used as the region.
skip_anon
---------
Skip anonymous pages reclamation.
If this parameter is set as ``Y``, DAMON_RECLAIM does not reclaim anonymous
pages. By default, ``N``.
kdamond_pid
-----------

@ -25,10 +25,12 @@ DAMON provides below interfaces for different users.
 interface provides only simple :ref:`statistics <damos_stats>` for the
 monitoring results. For detailed monitoring results, DAMON provides a
 :ref:`tracepoint <tracepoint>`.
-- *debugfs interface.*
+- *debugfs interface. (DEPRECATED!)*
 :ref:`This <debugfs_interface>` is almost identical to :ref:`sysfs interface
-<sysfs_interface>`. This will be removed after next LTS kernel is released,
-so users should move to the :ref:`sysfs interface <sysfs_interface>`.
+<sysfs_interface>`. This is deprecated, so users should move to the
+:ref:`sysfs interface <sysfs_interface>`. If you depend on this and cannot
+move, please report your usecase to damon@lists.linux.dev and
+linux-mm@kvack.org.
 - *Kernel Space Programming Interface.*
 :doc:`This </mm/damon/api>` is for kernel space programmers. Using this,
 users can utilize every feature of DAMON most flexibly and efficiently by
@ -87,6 +89,8 @@ comma (","). ::
│ │ │ │ │ │ │ quotas/ms,bytes,reset_interval_ms
│ │ │ │ │ │ │ │ weights/sz_permil,nr_accesses_permil,age_permil
│ │ │ │ │ │ │ watermarks/metric,interval_us,high,mid,low
│ │ │ │ │ │ │ filters/nr_filters
│ │ │ │ │ │ │ │ 0/type,matching,memcg_path
│ │ │ │ │ │ │ stats/nr_tried,sz_tried,nr_applied,sz_applied,qt_exceeds
│ │ │ │ │ │ │ tried_regions/
│ │ │ │ │ │ │ │ 0/start,end,nr_accesses,age
@ -151,6 +155,8 @@ number (``N``) to the file creates the number of child directories named as
moment, only one context per kdamond is supported, so only ``0`` or ``1`` can
be written to the file.
.. _sysfs_contexts:
contexts/<N>/
-------------
@ -268,21 +274,32 @@ schemes/<N>/
 ------------
 In each scheme directory, five directories (``access_pattern``, ``quotas``,
-``watermarks``, ``stats``, and ``tried_regions``) and one file (``action``)
-exist.
+``watermarks``, ``filters``, ``stats``, and ``tried_regions``) and one file
+(``action``) exist.
 The ``action`` file is for setting and getting what action you want to apply to
 memory regions having specific access pattern of the interest. The keywords
 that can be written to and read from the file and their meaning are as below.
-- ``willneed``: Call ``madvise()`` for the region with ``MADV_WILLNEED``
-- ``cold``: Call ``madvise()`` for the region with ``MADV_COLD``
-- ``pageout``: Call ``madvise()`` for the region with ``MADV_PAGEOUT``
-- ``hugepage``: Call ``madvise()`` for the region with ``MADV_HUGEPAGE``
-- ``nohugepage``: Call ``madvise()`` for the region with ``MADV_NOHUGEPAGE``
+Note that support of each action depends on the running DAMON operations set
+`implementation <sysfs_contexts>`.
+- ``willneed``: Call ``madvise()`` for the region with ``MADV_WILLNEED``.
+  Supported by ``vaddr`` and ``fvaddr`` operations set.
+- ``cold``: Call ``madvise()`` for the region with ``MADV_COLD``.
+  Supported by ``vaddr`` and ``fvaddr`` operations set.
+- ``pageout``: Call ``madvise()`` for the region with ``MADV_PAGEOUT``.
+  Supported by ``vaddr``, ``fvaddr`` and ``paddr`` operations set.
+- ``hugepage``: Call ``madvise()`` for the region with ``MADV_HUGEPAGE``.
+  Supported by ``vaddr`` and ``fvaddr`` operations set.
+- ``nohugepage``: Call ``madvise()`` for the region with ``MADV_NOHUGEPAGE``.
+  Supported by ``vaddr`` and ``fvaddr`` operations set.
 - ``lru_prio``: Prioritize the region on its LRU lists.
+  Supported by ``paddr`` operations set.
 - ``lru_deprio``: Deprioritize the region on its LRU lists.
-- ``stat``: Do nothing but count the statistics
+  Supported by ``paddr`` operations set.
+- ``stat``: Do nothing but count the statistics.
+  Supported by all operations sets.
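
The support notes in the action list can be read as a small lookup table. The sketch below transcribes them for illustration only. It assumes that "all operations sets" means the three sets named in this text (``vaddr``, ``fvaddr`` and ``paddr``); the names ``ACTION_SUPPORT`` and ``action_supported`` are hypothetical and do not exist in the kernel or in any DAMON tool.

```python
# Action -> operations sets that support it, transcribed from the list above.
# Assumption: "all operations sets" means vaddr, fvaddr and paddr.
ACTION_SUPPORT = {
    "willneed": {"vaddr", "fvaddr"},
    "cold": {"vaddr", "fvaddr"},
    "pageout": {"vaddr", "fvaddr", "paddr"},
    "hugepage": {"vaddr", "fvaddr"},
    "nohugepage": {"vaddr", "fvaddr"},
    "lru_prio": {"paddr"},
    "lru_deprio": {"paddr"},
    "stat": {"vaddr", "fvaddr", "paddr"},
}


def action_supported(action, operations):
    """Return True if the action keyword is documented as supported
    by the given operations set name."""
    if action not in ACTION_SUPPORT:
        raise ValueError(f"unknown DAMOS action keyword: {action!r}")
    return operations in ACTION_SUPPORT[action]
```

A script that writes to the ``action`` file could run such a check first, for example to reject ``cold`` when the context uses ``paddr``.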
schemes/<N>/access_pattern/ schemes/<N>/access_pattern/
--------------------------- ---------------------------
@ -347,6 +364,46 @@ as below.
The ``interval`` should be written in microseconds unit.
schemes/<N>/filters/
--------------------
Users may know more than the kernel about specific types of memory. In that
case, users may manage the memory themselves and hence not want DAMOS to
interfere with it. Users could limit DAMOS by setting the access pattern of
the scheme and/or the monitoring regions for the purpose, but that can be
inefficient in some cases. In such cases, users can set non-access-pattern
driven filters using files in this directory.
In the beginning, this directory has only one file, ``nr_filters``. Writing a
number (``N``) to the file creates ``N`` child directories named ``0`` to
``N-1``. Each directory represents one filter. The filters are evaluated in
numeric order.
Each filter directory contains three files, namely ``type``, ``matching``, and
``memcg_path``. You can write one of two special keywords to the ``type``
file: ``anon`` for anonymous pages, or ``memcg`` for specific memory cgroup
filtering. In case of the memory cgroup filtering, you can specify the memory
cgroup of interest by writing the path of the memory cgroup from the cgroups
mount point to the ``memcg_path`` file. You can write ``Y`` or ``N`` to the
``matching`` file to filter out pages that do or do not match the type,
respectively. Then, the scheme's action will not be applied to the pages that
are specified to be filtered out.
For example, below restricts a DAMOS action to be applied to only non-anonymous
pages of all memory cgroups except ``/having_care_already``.::
# echo 2 > nr_filters
# # filter out anonymous pages
# echo anon > 0/type
# echo Y > 0/matching
# # further filter out all cgroups except one at '/having_care_already'
# echo memcg > 1/type
# echo /having_care_already > 1/memcg_path
# echo N > 1/matching
Note that filters are currently supported only when ``paddr``
`implementation <sysfs_contexts>` is being used.
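
The ``matching`` rule described in this section can be modelled in a few lines. The following is an illustrative sketch, not the kernel implementation. It assumes that a page is filtered out as soon as any one filter filters it out (the text only states that filters are evaluated in numeric order), and that a ``memcg`` filter matches by exact path comparison. The ``Page`` and ``Filter`` classes, the function names, and the cgroup path ``/workload_a`` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Page:
    is_anon: bool      # True for an anonymous page
    memcg_path: str    # cgroup the page is charged to (simplified model)


@dataclass
class Filter:
    type: str                       # "anon" or "memcg", as in the type file
    matching: bool                  # True for 'Y', False for 'N'
    memcg_path: Optional[str] = None


def page_matches(page, flt):
    """Does the page match the filter's type (and memcg_path, if any)?"""
    if flt.type == "anon":
        return page.is_anon
    if flt.type == "memcg":
        # Assumption: exact path comparison stands in for the real check.
        return page.memcg_path == flt.memcg_path
    raise ValueError(f"unknown filter type: {flt.type!r}")


def filtered_out(page, filters):
    """True if the scheme's action should NOT be applied to the page.

    'Y' filters out pages that match; 'N' filters out pages that do not.
    Assumption: one filter deciding to filter out is enough.
    """
    return any(page_matches(page, f) == f.matching for f in filters)
```

In this model, ``[Filter("anon", True)]`` leaves only non-anonymous pages eligible for the action, while ``[Filter("memcg", False, "/workload_a")]`` leaves only pages charged to ``/workload_a`` eligible.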
.. _sysfs_schemes_stats:
schemes/<N>/stats/
@ -432,13 +489,17 @@ the files as above. Above is only for an example.
 .. _debugfs_interface:
-debugfs Interface
-=================
+debugfs Interface (DEPRECATED!)
+===============================
 .. note::
-DAMON debugfs interface will be removed after next LTS kernel is released, so
-users should move to the :ref:`sysfs interface <sysfs_interface>`.
+THIS IS DEPRECATED!
+DAMON debugfs interface is deprecated, so users should move to the
+:ref:`sysfs interface <sysfs_interface>`. If you depend on this and cannot
+move, please report your usecase to damon@lists.linux.dev and
+linux-mm@kvack.org.
 DAMON exports eight files, ``attrs``, ``target_ids``, ``init_regions``,
 ``schemes``, ``monitor_on``, ``kdamond_pid``, ``mk_contexts`` and
@ -574,11 +635,15 @@ The ``<action>`` is a predefined integer for memory management actions, which
 DAMON will apply to the regions having the target access pattern. The
 supported numbers and their meanings are as below.
-- 0: Call ``madvise()`` for the region with ``MADV_WILLNEED``
-- 1: Call ``madvise()`` for the region with ``MADV_COLD``
-- 2: Call ``madvise()`` for the region with ``MADV_PAGEOUT``
-- 3: Call ``madvise()`` for the region with ``MADV_HUGEPAGE``
-- 4: Call ``madvise()`` for the region with ``MADV_NOHUGEPAGE``
+- 0: Call ``madvise()`` for the region with ``MADV_WILLNEED``. Ignored if
+  ``target`` is ``paddr``.
+- 1: Call ``madvise()`` for the region with ``MADV_COLD``. Ignored if
+  ``target`` is ``paddr``.
+- 2: Call ``madvise()`` for the region with ``MADV_PAGEOUT``.
+- 3: Call ``madvise()`` for the region with ``MADV_HUGEPAGE``. Ignored if
+  ``target`` is ``paddr``.
+- 4: Call ``madvise()`` for the region with ``MADV_NOHUGEPAGE``. Ignored if
+  ``target`` is ``paddr``.
 - 5: Do nothing but count the statistics
 Quota

@ -459,13 +459,13 @@ Examples
 .. _map_hugetlb:
 ``map_hugetlb``
-see tools/testing/selftests/vm/map_hugetlb.c
+see tools/testing/selftests/mm/map_hugetlb.c
 ``hugepage-shm``
-see tools/testing/selftests/vm/hugepage-shm.c
+see tools/testing/selftests/mm/hugepage-shm.c
 ``hugepage-mmap``
-see tools/testing/selftests/vm/hugepage-mmap.c
+see tools/testing/selftests/mm/hugepage-mmap.c
 The `libhugetlbfs`_ library provides a wide range of userspace tools
 to help with huge page usability, environment setup, and control.

@ -63,7 +63,7 @@ workload one should:
 are not reclaimable, he or she can filter them out using
 ``/proc/kpageflags``.
-The page-types tool in the tools/vm directory can be used to assist in this.
+The page-types tool in the tools/mm directory can be used to assist in this.
 If the tool is run initially with the appropriate option, it will mark all the
 queried pages as idle. Subsequent runs of the tool can then show which pages have
 their idle flag cleared in the interim.

@ -1,4 +1,7 @@
-=============
+=======================
+NUMA Memory Performance
+=======================
 NUMA Locality
 =============
@ -59,7 +62,6 @@ that are CPUs and hence suitable for generic task scheduling, and
 IO initiators such as GPUs and NICs. Unlike access class 0, only
 nodes containing CPUs are considered.
-================
 NUMA Performance
 ================
@ -94,7 +96,6 @@ for the platform.
 Access class 1 takes the same form but only includes values for CPU to
 memory activity.
-==========
 NUMA Cache
 ==========
@ -168,7 +169,6 @@ The "size" is the number of bytes provided by this cache level.
 The "size" is the number of bytes provided by this cache level.
 The "write_policy" will be 0 for write-back, and non-zero for
 write-through caching.
-========
 See Also
 ========

@ -44,7 +44,7 @@ There are four components to pagemap:
 * ``/proc/kpagecount``. This file contains a 64-bit count of the number of
 times each page is mapped, indexed by PFN.
-The page-types tool in the tools/vm directory can be used to query the
+The page-types tool in the tools/mm directory can be used to query the
 number of times a page is mapped.
 * ``/proc/kpageflags``. This file contains a 64-bit set of flags for each
@ -170,7 +170,7 @@ LRU related page flags
 14 - SWAPBACKED
 The page is backed by swap/RAM.
-The page-types tool in the tools/vm directory can be used to query the
+The page-types tool in the tools/mm directory can be used to query the
 above flags.
 Using pagemap to do something useful

@ -55,18 +55,17 @@ flags the caller provides. The caller is required to pass in a non-null struct
 pages* array, and the function then pins pages by incrementing each by a special
 value: GUP_PIN_COUNTING_BIAS.
-For compound pages, the GUP_PIN_COUNTING_BIAS scheme is not used. Instead,
-an exact form of pin counting is achieved, by using the 2nd struct page
-in the compound page. A new struct page field, compound_pincount, has
-been added in order to support this.
+For large folios, the GUP_PIN_COUNTING_BIAS scheme is not used. Instead,
+the extra space available in the struct folio is used to store the
+pincount directly.
-This approach for compound pages avoids the counting upper limit problems that
-are discussed below. Those limitations would have been aggravated severely by
-huge pages, because each tail page adds a refcount to the head page. And in
-fact, testing revealed that, without a separate compound_pincount field,
-page overflows were seen in some huge page stress tests.
+This approach for large folios avoids the counting upper limit problems
+that are discussed below. Those limitations would have been aggravated
+severely by huge pages, because each tail page adds a refcount to the
+head page. And in fact, testing revealed that, without a separate pincount
+field, refcount overflows were seen in some huge page stress tests.
-This also means that huge pages and compound pages do not suffer
+This also means that huge pages and large folios do not suffer
 from the false positives problem that is mentioned below.::
 Function
@ -221,7 +220,7 @@ Unit testing
 ============
 This file::
-tools/testing/selftests/vm/gup_test.c
+tools/testing/selftests/mm/gup_test.c
 has the following new calls to exercise the new pin*() wrapper functions:
@ -264,9 +263,9 @@ place.)
 Other diagnostics
 =================
-dump_page() has been enhanced slightly, to handle these new counting
-fields, and to better report on compound pages in general. Specifically,
-for compound pages, the exact (compound_pincount) pincount is reported.
+dump_page() has been enhanced slightly to handle these new counting
+fields, and to better report on large folios in general. Specifically,
+for large folios, the exact pincount is reported.
 References
 ==========


@ -140,6 +140,23 @@ disabling KASAN altogether or controlling its features:
- ``kasan.vmalloc=off`` or ``=on`` disables or enables tagging of vmalloc - ``kasan.vmalloc=off`` or ``=on`` disables or enables tagging of vmalloc
allocations (default: ``on``). allocations (default: ``on``).
- ``kasan.page_alloc.sample=<sampling interval>`` makes KASAN tag only every
Nth page_alloc allocation with an order equal to or greater than
``kasan.page_alloc.sample.order``, where N is the value of the ``sample``
parameter (default: ``1``, i.e. tag every such allocation).
This parameter is intended to mitigate the performance overhead introduced
by KASAN.
Note that enabling this parameter makes Hardware Tag-Based KASAN skip checks
of allocations chosen by sampling and thus miss bad accesses to these
allocations. Use the default value for accurate bug detection.
- ``kasan.page_alloc.sample.order=<minimum page order>`` specifies the minimum
order of allocations that are affected by sampling (default: ``3``).
Only applies when ``kasan.page_alloc.sample`` is set to a value greater
than ``1``.
This parameter is intended to allow sampling only large page_alloc
allocations, which are the biggest source of the performance overhead.
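The interaction of the two parameters can be sketched in user-space C. This is
an illustrative model only, not the kernel's implementation: ``struct sampler``
and ``should_tag()`` are hypothetical names, and a plain counter stands in for
however the kernel actually picks the Nth allocation.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the two boot parameters described above. */
struct sampler {
	unsigned long interval;   /* kasan.page_alloc.sample (default: 1) */
	unsigned int min_order;   /* kasan.page_alloc.sample.order (default: 3) */
	unsigned long count;      /* allocations of order >= min_order seen so far */
};

/* Return true if this page_alloc allocation would be tagged (and thus checked). */
static bool should_tag(struct sampler *s, unsigned int order)
{
	assert(s->interval >= 1);

	/* Sampling disabled, or allocation below the minimum order: always tag. */
	if (s->interval == 1 || order < s->min_order)
		return true;

	/* Tag only every Nth allocation of a sufficiently large order. */
	return (s->count++ % s->interval) == 0;
}
```

In this model, allocations that are not tagged are exactly the ones for which
bad accesses would go undetected, which is why the default interval of ``1`` is
recommended for accurate bug detection.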
Error reports Error reports
~~~~~~~~~~~~~ ~~~~~~~~~~~~~


@ -4,7 +4,7 @@ Memory Balancing
Started Jan 2000 by Kanoj Sarcar <kanoj@sgi.com> Started Jan 2000 by Kanoj Sarcar <kanoj@sgi.com>
Memory balancing is needed for !__GFP_ATOMIC and !__GFP_KSWAPD_RECLAIM as Memory balancing is needed for !__GFP_HIGH and !__GFP_KSWAPD_RECLAIM as
well as for non __GFP_IO allocations. well as for non __GFP_IO allocations.
The first reason why a caller may avoid reclaim is that the caller can not The first reason why a caller may avoid reclaim is that the caller can not


@ -4,8 +4,9 @@
DAMON: Data Access MONitor DAMON: Data Access MONitor
========================== ==========================
DAMON is a data access monitoring framework subsystem for the Linux kernel. DAMON is a Linux kernel subsystem that provides a framework for data access
The core mechanisms of DAMON (refer to :doc:`design` for the detail) make it monitoring and the monitoring results based system operations. The core
monitoring mechanisms of DAMON (refer to :doc:`design` for the detail) make it
- *accurate* (the monitoring output is useful enough for DRAM level memory - *accurate* (the monitoring output is useful enough for DRAM level memory
management; It might not be appropriate for CPU Cache levels, though), management; It might not be appropriate for CPU Cache levels, though),
@ -14,12 +15,16 @@ The core mechanisms of DAMON (refer to :doc:`design` for the detail) make it
- *scalable* (the upper-bound of the overhead is in constant range regardless - *scalable* (the upper-bound of the overhead is in constant range regardless
of the size of target workloads). of the size of target workloads).
Using this framework, therefore, the kernel's memory management mechanisms can Using this framework, therefore, the kernel can operate system in an
make advanced decisions. Experimental memory management optimization works access-aware fashion. Because the features are also exposed to the user space,
that incurring high data accesses monitoring overhead could implemented again. users who have special information about their workloads can write personalized
In user space, meanwhile, users who have some special workloads can write applications for better understanding and optimizations of their workloads and
personalized applications for better understanding and optimizations of their systems.
workloads and systems.
For easier development of such systems, DAMON provides a feature called DAMOS
(DAMon-based Operation Schemes) in addition to the monitoring. Using the
feature, DAMON users in both kernel and user spaces can do access-aware system
operations with no code but simple configurations.
.. toctree:: .. toctree::
:maxdepth: 2 :maxdepth: 2
@ -27,3 +32,4 @@ workloads and systems.
faq faq
design design
api api
maintainer-profile


@ -0,0 +1,62 @@
.. SPDX-License-Identifier: GPL-2.0
DAMON Maintainer Entry Profile
==============================
The DAMON subsystem covers the files that are listed in the 'DATA ACCESS
MONITOR' section of the 'MAINTAINERS' file.
The mailing lists for the subsystem are damon@lists.linux.dev and
linux-mm@kvack.org. Patches should be made against the mm-unstable tree [1]_
whenever possible and posted to the mailing lists.
SCM Trees
---------
There are multiple Linux trees for DAMON development. Patches under
development or testing are queued in damon/next [2]_ by the DAMON maintainer.
Sufficiently reviewed patches will be queued in mm-unstable [1]_ by the memory
management subsystem maintainer. After further testing, the patches will
be queued in mm-stable [3]_, and finally pull-requested to the mainline by the
memory management subsystem maintainer.
Note again that patches for review should be made against the mm-unstable
tree [1]_ whenever possible. damon/next is only for previewing others' work in
progress.
Submit checklist addendum
-------------------------
When making DAMON changes, you should do the following.
- Build changes related outputs including kernel and documents.
- Ensure the builds introduce no new errors or warnings.
- Run DAMON selftests [4]_ and kunit tests [5]_ and ensure there are no new failures.
Doing the following and reporting the results is also helpful.
- Run damon-tests/corr [6]_ for normal changes.
- Run damon-tests/perf [7]_ for performance changes.
Key cycle dates
---------------
Patches can be sent anytime. Key cycle dates of the mm-unstable [1]_ and
mm-stable [3]_ trees depend on the memory management subsystem maintainer.
Review cadence
--------------
The DAMON maintainer works during usual work hours (09:00 to 17:00,
Mon-Fri) in PST. Responses to patches will occasionally be slow. Do not
hesitate to send a ping if you have not heard back within a week of sending a
patch.
.. [1] https://git.kernel.org/akpm/mm/h/mm-unstable
.. [2] https://git.kernel.org/sj/h/damon/next
.. [3] https://git.kernel.org/akpm/mm/h/mm-stable
.. [4] https://github.com/awslabs/damon-tests/blob/master/corr/run.sh#L49
.. [5] https://github.com/awslabs/damon-tests/blob/master/corr/tests/kunit.sh
.. [6] https://github.com/awslabs/damon-tests/tree/master/corr
.. [7] https://github.com/awslabs/damon-tests/tree/master/perf


@ -55,7 +55,8 @@ list shows them in order of preference of use.
It can be invoked from any context (including interrupts) but the mappings It can be invoked from any context (including interrupts) but the mappings
can only be used in the context which acquired them. can only be used in the context which acquired them.
This function should be preferred, where feasible, over all the others. This function should always be used, whereas kmap_atomic() and kmap() have
been deprecated.
These mappings are thread-local and CPU-local, meaning that the mapping These mappings are thread-local and CPU-local, meaning that the mapping
can only be accessed from within this thread and the thread is bound to the can only be accessed from within this thread and the thread is bound to the
@ -80,7 +81,7 @@ list shows them in order of preference of use.
for pages which are known to not come from ZONE_HIGHMEM. However, it is for pages which are known to not come from ZONE_HIGHMEM. However, it is
always safe to use kmap_local_page() / kunmap_local(). always safe to use kmap_local_page() / kunmap_local().
While it is significantly faster than kmap(), for the higmem case it While it is significantly faster than kmap(), for the highmem case it
comes with restrictions about the pointers validity. Contrary to kmap() comes with restrictions about the pointers validity. Contrary to kmap()
mappings, the local mappings are only valid in the context of the caller mappings, the local mappings are only valid in the context of the caller
and cannot be handed to other contexts. This implies that users must and cannot be handed to other contexts. This implies that users must
@ -98,10 +99,21 @@ list shows them in order of preference of use.
(included in the "Functions" section) for details on how to manage nested (included in the "Functions" section) for details on how to manage nested
mappings. mappings.
* kmap_atomic(). This permits a very short duration mapping of a single * kmap_atomic(). This function has been deprecated; use kmap_local_page().
page. Since the mapping is restricted to the CPU that issued it, it
performs well, but the issuing task is therefore required to stay on that NOTE: Conversions to kmap_local_page() must take care to follow the mapping
CPU until it has finished, lest some other task displace its mappings. restrictions imposed on kmap_local_page(). Furthermore, the code between
calls to kmap_atomic() and kunmap_atomic() may implicitly depend on the side
effects of atomic mappings, i.e. disabling page faults or preemption, or both.
In that case, explicit calls to pagefault_disable() or preempt_disable() or
both must be made in conjunction with the use of kmap_local_page().
[Legacy documentation]
This permits a very short duration mapping of a single page. Since the
mapping is restricted to the CPU that issued it, it performs well, but
the issuing task is therefore required to stay on that CPU until it has
finished, lest some other task displace its mappings.
kmap_atomic() may also be used by interrupt contexts, since it does not kmap_atomic() may also be used by interrupt contexts, since it does not
sleep and the callers too may not sleep until after kunmap_atomic() is sleep and the callers too may not sleep until after kunmap_atomic() is
@ -113,11 +125,20 @@ list shows them in order of preference of use.
It is assumed that k[un]map_atomic() won't fail. It is assumed that k[un]map_atomic() won't fail.
* kmap(). This should be used to make short duration mapping of a single * kmap(). This function has been deprecated; use kmap_local_page().
page with no restrictions on preemption or migration. It comes with an
overhead as mapping space is restricted and protected by a global lock NOTE: Conversions to kmap_local_page() must take care to follow the mapping
for synchronization. When mapping is no longer needed, the address that restrictions imposed on kmap_local_page(). In particular, it is necessary to
the page was mapped to must be released with kunmap(). make sure that the kernel virtual memory pointer is only valid in the thread
that obtained it.
[Legacy documentation]
This should be used to make short duration mapping of a single page with no
restrictions on preemption or migration. It comes with an overhead as mapping
space is restricted and protected by a global lock for synchronization. When
mapping is no longer needed, the address that the page was mapped to must be
released with kunmap().
Mapping changes must be propagated across all the CPUs. kmap() also Mapping changes must be propagated across all the CPUs. kmap() also
requires global TLB invalidation when the kmap's pool wraps and it might requires global TLB invalidation when the kmap's pool wraps and it might


@ -179,14 +179,14 @@ Consuming Reservations/Allocating a Huge Page
Reservations are consumed when huge pages associated with the reservations Reservations are consumed when huge pages associated with the reservations
are allocated and instantiated in the corresponding mapping. The allocation are allocated and instantiated in the corresponding mapping. The allocation
is performed within the routine alloc_huge_page():: is performed within the routine alloc_hugetlb_folio()::
struct page *alloc_huge_page(struct vm_area_struct *vma, struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
unsigned long addr, int avoid_reserve) unsigned long addr, int avoid_reserve)
alloc_huge_page is passed a VMA pointer and a virtual address, so it can alloc_hugetlb_folio is passed a VMA pointer and a virtual address, so it can
consult the reservation map to determine if a reservation exists. In addition, consult the reservation map to determine if a reservation exists. In addition,
alloc_huge_page takes the argument avoid_reserve which indicates reserves alloc_hugetlb_folio takes the argument avoid_reserve which indicates reserves
should not be used even if it appears they have been set aside for the should not be used even if it appears they have been set aside for the
specified address. The avoid_reserve argument is most often used in the case specified address. The avoid_reserve argument is most often used in the case
of Copy on Write and Page Migration where additional copies of an existing of Copy on Write and Page Migration where additional copies of an existing
@ -206,7 +206,8 @@ a reservation for the allocation. After determining whether a reservation
exists and can be used for the allocation, the routine dequeue_huge_page_vma() exists and can be used for the allocation, the routine dequeue_huge_page_vma()
is called. This routine takes two arguments related to reservations: is called. This routine takes two arguments related to reservations:
- avoid_reserve, this is the same value/argument passed to alloc_huge_page() - avoid_reserve, this is the same value/argument passed to
alloc_hugetlb_folio().
- chg, even though this argument is of type long only the values 0 or 1 are - chg, even though this argument is of type long only the values 0 or 1 are
passed to dequeue_huge_page_vma. If the value is 0, it indicates a passed to dequeue_huge_page_vma. If the value is 0, it indicates a
reservation exists (see the section "Memory Policy and Reservations" for reservation exists (see the section "Memory Policy and Reservations" for
@ -231,9 +232,9 @@ the scope reservations. Even if a surplus page is allocated, the same
reservation based adjustments as above will be made: SetPagePrivate(page) and reservation based adjustments as above will be made: SetPagePrivate(page) and
resv_huge_pages--. resv_huge_pages--.
After obtaining a new huge page, (page)->private is set to the value of After obtaining a new hugetlb folio, (folio)->_hugetlb_subpool is set to the
the subpool associated with the page if it exists. This will be used for value of the subpool associated with the page if it exists. This will be used
subpool accounting when the page is freed. for subpool accounting when the folio is freed.
The routine vma_commit_reservation() is then called to adjust the reserve The routine vma_commit_reservation() is then called to adjust the reserve
map based on the consumption of the reservation. In general, this involves map based on the consumption of the reservation. In general, this involves
@ -244,8 +245,8 @@ was no reservation in a shared mapping or this was a private mapping a new
entry must be created. entry must be created.
It is possible that the reserve map could have been changed between the call It is possible that the reserve map could have been changed between the call
to vma_needs_reservation() at the beginning of alloc_huge_page() and the to vma_needs_reservation() at the beginning of alloc_hugetlb_folio() and the
call to vma_commit_reservation() after the page was allocated. This would call to vma_commit_reservation() after the folio was allocated. This would
be possible if hugetlb_reserve_pages was called for the same page in a shared be possible if hugetlb_reserve_pages was called for the same page in a shared
mapping. In such cases, the reservation count and subpool free page count mapping. In such cases, the reservation count and subpool free page count
will be off by one. This rare condition can be identified by comparing the will be off by one. This rare condition can be identified by comparing the


@ -89,15 +89,15 @@ variables are monotonically increasing.
Generation numbers are truncated into ``order_base_2(MAX_NR_GENS+1)`` Generation numbers are truncated into ``order_base_2(MAX_NR_GENS+1)``
bits in order to fit into the gen counter in ``folio->flags``. Each bits in order to fit into the gen counter in ``folio->flags``. Each
truncated generation number is an index to ``lrugen->lists[]``. The truncated generation number is an index to ``lrugen->folios[]``. The
sliding window technique is used to track at least ``MIN_NR_GENS`` and sliding window technique is used to track at least ``MIN_NR_GENS`` and
at most ``MAX_NR_GENS`` generations. The gen counter stores a value at most ``MAX_NR_GENS`` generations. The gen counter stores a value
within ``[1, MAX_NR_GENS]`` while a page is on one of within ``[1, MAX_NR_GENS]`` while a page is on one of
``lrugen->lists[]``; otherwise it stores zero. ``lrugen->folios[]``; otherwise it stores zero.
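The truncation described above can be sketched in user-space C. This is a
minimal model, not kernel code: the values of ``MAX_NR_GENS`` and the helper
names ending in ``_sketch`` or starting with ``gen_`` are assumptions made for
illustration, and ``order_base_2_sketch()`` only mirrors the rounded-up log2
that ``order_base_2()`` computes.

```c
#include <assert.h>

/* Assumed value for illustration; the text above does not state it. */
#define MAX_NR_GENS 4U

/* Rounded-up log2 of n, mirroring what order_base_2() computes. */
static unsigned int order_base_2_sketch(unsigned long n)
{
	unsigned int bits = 0;

	while ((1UL << bits) < n)
		bits++;
	return bits;
}

/* Number of bits needed for the gen counter in folio->flags. */
static unsigned int gen_counter_width(void)
{
	return order_base_2_sketch(MAX_NR_GENS + 1);
}

/* Truncated generation number: the index into the per-generation lists. */
static unsigned int gen_from_seq(unsigned long seq)
{
	return seq % MAX_NR_GENS;
}

/* Value kept in the gen counter while on a list; zero means "on no list". */
static unsigned int gen_counter_from_seq(unsigned long seq)
{
	unsigned int gen = gen_from_seq(seq) + 1;

	assert(gen >= 1 && gen <= MAX_NR_GENS);
	return gen;
}
```

Storing the index plus one is what leaves zero free to mean "not on any list",
which is why the counter needs room for ``MAX_NR_GENS+1`` distinct values.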
Each generation is divided into multiple tiers. A page accessed ``N`` Each generation is divided into multiple tiers. A page accessed ``N``
times through file descriptors is in tier ``order_base_2(N)``. Unlike times through file descriptors is in tier ``order_base_2(N)``. Unlike
generations, tiers do not have dedicated ``lrugen->lists[]``. In generations, tiers do not have dedicated ``lrugen->folios[]``. In
contrast to moving across generations, which requires the LRU lock, contrast to moving across generations, which requires the LRU lock,
moving across tiers only involves atomic operations on moving across tiers only involves atomic operations on
``folio->flags`` and therefore has a negligible cost. A feedback loop ``folio->flags`` and therefore has a negligible cost. A feedback loop
@ -127,7 +127,7 @@ page mapped by this PTE to ``(max_seq%MAX_NR_GENS)+1``.
Eviction Eviction
-------- --------
The eviction consumes old generations. Given an ``lruvec``, it The eviction consumes old generations. Given an ``lruvec``, it
increments ``min_seq`` when ``lrugen->lists[]`` indexed by increments ``min_seq`` when ``lrugen->folios[]`` indexed by
``min_seq%MAX_NR_GENS`` becomes empty. To select a type and a tier to ``min_seq%MAX_NR_GENS`` becomes empty. To select a type and a tier to
evict from, it first compares ``min_seq[]`` to select the older type. evict from, it first compares ``min_seq[]`` to select the older type.
If both types are equally old, it selects the one whose first tier has If both types are equally old, it selects the one whose first tier has
@ -141,9 +141,85 @@ loop has detected outlying refaults from the tier this page is in. To
this end, the feedback loop uses the first tier as the baseline, for this end, the feedback loop uses the first tier as the baseline, for
the reason stated earlier. the reason stated earlier.
Working set protection
----------------------
Each generation is timestamped at birth. If ``lru_gen_min_ttl`` is
set, an ``lruvec`` is protected from the eviction when its oldest
generation was born within ``lru_gen_min_ttl`` milliseconds. In other
words, it prevents the working set of ``lru_gen_min_ttl`` milliseconds
from getting evicted. The OOM killer is triggered if this working set
cannot be kept in memory.
This time-based approach has the following advantages:
1. It is easier to configure because it is agnostic to applications
and memory sizes.
2. It is more reliable because it is directly wired to the OOM killer.
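The protection rule can be sketched as a single comparison. The following is a
hedged user-space model: ``struct lruvec_sketch`` and
``protected_from_eviction()`` are hypothetical names, timestamps are plain
milliseconds, and a TTL of zero is assumed to mean that the feature is not set.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-lruvec state the text refers to. */
struct lruvec_sketch {
	unsigned long oldest_birth_ms;  /* birth timestamp of the oldest generation */
};

/*
 * Return true if the lruvec is protected from eviction, i.e. its oldest
 * generation was born within the last min_ttl_ms milliseconds.
 */
static bool protected_from_eviction(const struct lruvec_sketch *lruvec,
				    unsigned long now_ms,
				    unsigned long min_ttl_ms)
{
	assert(now_ms >= lruvec->oldest_birth_ms);

	/* Assumption: a zero TTL means the protection is not configured. */
	if (!min_ttl_ms)
		return false;

	return now_ms - lruvec->oldest_birth_ms < min_ttl_ms;
}
```

If every candidate is protected in this sense, nothing can be evicted, which
corresponds to the case in which the OOM killer is triggered.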
Rmap/PT walk feedback
---------------------
Searching the rmap for PTEs mapping each page on an LRU list (to test
and clear the accessed bit) can be expensive because pages from
different VMAs (PA space) are not cache friendly to the rmap (VA
space). For workloads mostly using mapped pages, searching the rmap
can incur the highest CPU cost in the reclaim path.
``lru_gen_look_around()`` exploits spatial locality to reduce the
trips into the rmap. It scans the adjacent PTEs of a young PTE and
promotes hot pages. If the scan was done cacheline efficiently, it
adds the PMD entry pointing to the PTE table to the Bloom filter. This
forms a feedback loop between the eviction and the aging.
Bloom Filters
-------------
Bloom filters are a space-efficient data structure for set membership
tests, i.e., for testing whether an element is definitely not in the set
or may be in the set.
In the eviction path, specifically, in ``lru_gen_look_around()``, if a
PMD has a sufficient number of hot pages, its address is placed in the
filter. In the aging path, set membership means that the PTE range
will be scanned for young pages.
Note that Bloom filters are probabilistic on set membership. If a test
is false positive, the cost is an additional scan of a range of PTEs,
which may yield hot pages anyway. Parameters of the filter itself can
control the false positive rate in the limit.
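A generic Bloom filter with two keys per element can be sketched as follows.
This is a textbook illustration rather than the filter used by the multi-gen
LRU: the filter size, the multiplicative hash and all names here are
assumptions chosen for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define FILTER_BITS 1024U   /* hypothetical size; must be a power of two */

struct bloom {
	uint8_t bits[FILTER_BITS / 8];
};

/* Derive two bit positions from one 64-bit multiplicative mix of the item. */
static void bloom_keys(uintptr_t item, uint32_t key[2])
{
	uint64_t h = (uint64_t)item * 0x9E3779B97F4A7C15ULL;

	key[0] = (uint32_t)(h >> 32) & (FILTER_BITS - 1);
	key[1] = (uint32_t)(h >> 16) & (FILTER_BITS - 1);
}

static void bloom_init(struct bloom *b)
{
	assert((FILTER_BITS & (FILTER_BITS - 1)) == 0);
	memset(b, 0, sizeof(*b));
}

/* Add an item (e.g. the address of a PMD entry) by setting both of its bits. */
static void bloom_add(struct bloom *b, uintptr_t item)
{
	uint32_t key[2];

	bloom_keys(item, key);
	for (int i = 0; i < 2; i++)
		b->bits[key[i] / 8] |= (uint8_t)(1U << (key[i] % 8));
}

/* false: definitely not in the set; true: may be in the set. */
static bool bloom_test(const struct bloom *b, uintptr_t item)
{
	uint32_t key[2];

	bloom_keys(item, key);
	for (int i = 0; i < 2; i++)
		if (!(b->bits[key[i] / 8] & (1U << (key[i] % 8))))
			return false;
	return true;
}
```

An added item always tests true, so a hot PTE range is never skipped because of
the filter; an item that was never added may still test true when other items
happen to have set both of its bits, which is the false positive case described
above.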
Memcg LRU
---------
An memcg LRU is a per-node LRU of memcgs. It is also an LRU of LRUs,
since each node and memcg combination has an LRU of folios (see
``mem_cgroup_lruvec()``). Its goal is to improve the scalability of
global reclaim, which is critical to system-wide memory overcommit in
data centers. Note that memcg LRU only applies to global reclaim.
The basic structure of an memcg LRU can be understood by an analogy to
the active/inactive LRU (of folios):
1. It has the young and the old (generations), i.e., the counterparts
to the active and the inactive;
2. The increment of ``max_seq`` triggers promotion, i.e., the
counterpart to activation;
3. Other events trigger similar operations, e.g., offlining an memcg
triggers demotion, i.e., the counterpart to deactivation.
In terms of global reclaim, it has two distinct features:
1. Sharding, which allows each thread to start at a random memcg (in
the old generation) and improves parallelism;
2. Eventual fairness, which allows direct reclaim to bail out at will
and reduces latency without affecting fairness over some time.
In terms of traversing memcgs during global reclaim, it improves the
best-case complexity from O(n) to O(1) and does not affect the
worst-case complexity O(n). Therefore, on average, it has a sublinear
complexity.
Summary Summary
------- -------
The multi-gen LRU can be disassembled into the following parts: The multi-gen LRU (of folios) can be disassembled into the following
parts:
* Generations * Generations
* Rmap walks * Rmap walks


@ -59,7 +59,7 @@ Usage
1) Build user-space helper:: 1) Build user-space helper::
cd tools/vm cd tools/mm
make page_owner_sort make page_owner_sort
2) Enable page owner: add "page_owner=on" to boot cmdline. 2) Enable page owner: add "page_owner=on" to boot cmdline.


@ -19,7 +19,7 @@ slabs that have data in them. See "slabinfo -h" for more options when
running the command. ``slabinfo`` can be compiled with running the command. ``slabinfo`` can be compiled with
:: ::
gcc -o slabinfo tools/vm/slabinfo.c gcc -o slabinfo tools/mm/slabinfo.c
Some of the modes of operation of ``slabinfo`` require that slub debugging Some of the modes of operation of ``slabinfo`` require that slub debugging
be enabled on the command line. F.e. no tracking information will be be enabled on the command line. F.e. no tracking information will be


@ -110,20 +110,20 @@ Refcounts and transparent huge pages
Refcounting on THP is mostly consistent with refcounting on other compound Refcounting on THP is mostly consistent with refcounting on other compound
pages: pages:
- get_page()/put_page() and GUP operate on head page's ->_refcount. - get_page()/put_page() and GUP operate on the folio->_refcount.
- ->_refcount in tail pages is always zero: get_page_unless_zero() never - ->_refcount in tail pages is always zero: get_page_unless_zero() never
succeeds on tail pages. succeeds on tail pages.
- map/unmap of PMD entry for the whole compound page increment/decrement - map/unmap of a PMD entry for the whole THP increment/decrement
->compound_mapcount, stored in the first tail page of the compound page; folio->_entire_mapcount and also increment/decrement
and also increment/decrement ->subpages_mapcount (also in the first tail) folio->_nr_pages_mapped by COMPOUND_MAPPED when _entire_mapcount
by COMPOUND_MAPPED when compound_mapcount goes from -1 to 0 or 0 to -1. goes from -1 to 0 or 0 to -1.
- map/unmap of sub-pages with PTE entry increment/decrement ->_mapcount - map/unmap of individual pages with PTE entry increment/decrement
on relevant sub-page of the compound page, and also increment/decrement page->_mapcount and also increment/decrement folio->_nr_pages_mapped
->subpages_mapcount, stored in first tail page of the compound page, when when page->_mapcount goes from -1 to 0 or 0 to -1 as this counts
_mapcount goes from -1 to 0 or 0 to -1: counting sub-pages mapped by PTE. the number of pages mapped by PTE.
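The counter transitions in the list above (new wording) can be modelled in
user-space C. This is a simplified sketch under stated assumptions: plain
integers replace the kernel's atomic counters, ``NR_PAGES`` and the value of
``COMPOUND_MAPPED`` are chosen for illustration, and ``struct folio_sketch``
with its helpers is hypothetical.

```c
#include <assert.h>

#define NR_PAGES 512              /* assumed: pages covered by one PMD mapping */
#define COMPOUND_MAPPED 0x800000  /* assumed value, larger than NR_PAGES */

struct folio_sketch {
	int entire_mapcount;      /* models folio->_entire_mapcount */
	int nr_pages_mapped;      /* models folio->_nr_pages_mapped */
	int mapcount[NR_PAGES];   /* models page->_mapcount of each page */
};

/* All mapcounts start at -1, meaning "not mapped". */
static void folio_sketch_init(struct folio_sketch *f)
{
	f->entire_mapcount = -1;
	f->nr_pages_mapped = 0;
	for (int i = 0; i < NR_PAGES; i++)
		f->mapcount[i] = -1;
}

/* Map the whole THP with a PMD entry. */
static void map_pmd(struct folio_sketch *f)
{
	if (++f->entire_mapcount == 0)
		f->nr_pages_mapped += COMPOUND_MAPPED;
}

/* Unmap a PMD entry for the whole THP. */
static void unmap_pmd(struct folio_sketch *f)
{
	assert(f->entire_mapcount >= 0);
	if (--f->entire_mapcount == -1)
		f->nr_pages_mapped -= COMPOUND_MAPPED;
}

/* Map one individual page with a PTE entry. */
static void map_pte(struct folio_sketch *f, int i)
{
	assert(i >= 0 && i < NR_PAGES);
	if (++f->mapcount[i] == 0)
		f->nr_pages_mapped++;
}

/* Unmap one individual page mapped by a PTE entry. */
static void unmap_pte(struct folio_sketch *f, int i)
{
	assert(i >= 0 && i < NR_PAGES);
	assert(f->mapcount[i] >= 0);
	if (--f->mapcount[i] == -1)
		f->nr_pages_mapped--;
}
```

Because ``COMPOUND_MAPPED`` is larger than the number of pages, the low part of
``nr_pages_mapped`` in this model counts pages mapped by PTE while the high bit
records whether any PMD mapping exists.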
split_huge_page internally has to distribute the refcounts in the head split_huge_page internally has to distribute the refcounts in the head
page to the tail pages before clearing all PG_head/tail bits from the page page to the tail pages before clearing all PG_head/tail bits from the page
@ -151,8 +151,8 @@ clear where references should go after split: it will stay on the head page.
Note that split_huge_pmd() doesn't have any limitations on refcounting: Note that split_huge_pmd() doesn't have any limitations on refcounting:
pmd can be split at any point and never fails. pmd can be split at any point and never fails.
Partial unmap and deferred_split_huge_page() Partial unmap and deferred_split_folio()
============================================ ========================================
Unmapping part of THP (with munmap() or other way) is not going to free Unmapping part of THP (with munmap() or other way) is not going to free
memory immediately. Instead, we detect that a subpage of THP is not in use memory immediately. Instead, we detect that a subpage of THP is not in use
@ -164,6 +164,6 @@ the place where we can detect partial unmap. It also might be
counterproductive since in many cases partial unmap happens during exit(2) if counterproductive since in many cases partial unmap happens during exit(2) if
a THP crosses a VMA boundary. a THP crosses a VMA boundary.
The function deferred_split_huge_page() is used to queue a page for splitting. The function deferred_split_folio() is used to queue a folio for splitting.
The splitting itself will happen when we get memory pressure via shrinker The splitting itself will happen when we get memory pressure via shrinker
interface. interface.


@ -10,7 +10,7 @@ Introduction
This document describes the Linux memory manager's "Unevictable LRU" This document describes the Linux memory manager's "Unevictable LRU"
infrastructure and the use of this to manage several types of "unevictable" infrastructure and the use of this to manage several types of "unevictable"
pages. folios.
The document attempts to provide the overall rationale behind this mechanism The document attempts to provide the overall rationale behind this mechanism
and the rationale for some of the design decisions that drove the and the rationale for some of the design decisions that drove the
@ -25,8 +25,8 @@ The Unevictable LRU
=================== ===================
The Unevictable LRU facility adds an additional LRU list to track unevictable The Unevictable LRU facility adds an additional LRU list to track unevictable
pages and to hide these pages from vmscan. This mechanism is based on a patch folios and to hide these folios from vmscan. This mechanism is based on a patch
by Larry Woodman of Red Hat to address several scalability problems with page by Larry Woodman of Red Hat to address several scalability problems with folio
reclaim in Linux. The problems have been observed at customer sites on large reclaim in Linux. The problems have been observed at customer sites on large
memory x86_64 systems. memory x86_64 systems.
@ -50,40 +50,41 @@ The infrastructure may also be able to handle other conditions that make pages
unevictable, either by definition or by circumstance, in the future. unevictable, either by definition or by circumstance, in the future.
The Unevictable LRU Page List The Unevictable LRU Folio List
----------------------------- ------------------------------
The Unevictable LRU page list is a lie. It was never an LRU-ordered list, but a The Unevictable LRU folio list is a lie. It was never an LRU-ordered
companion to the LRU-ordered anonymous and file, active and inactive page lists; list, but a companion to the LRU-ordered anonymous and file, active and
and now it is not even a page list. But following familiar convention, here in inactive folio lists; and now it is not even a folio list. But following
this document and in the source, we often imagine it as a fifth LRU page list. familiar convention, here in this document and in the source, we often
imagine it as a fifth LRU folio list.
The Unevictable LRU infrastructure consists of an additional, per-node, LRU list The Unevictable LRU infrastructure consists of an additional, per-node, LRU list
called the "unevictable" list and an associated page flag, PG_unevictable, to called the "unevictable" list and an associated folio flag, PG_unevictable, to
indicate that the page is being managed on the unevictable list. indicate that the folio is being managed on the unevictable list.
The PG_unevictable flag is analogous to, and mutually exclusive with, the The PG_unevictable flag is analogous to, and mutually exclusive with, the
PG_active flag in that it indicates on which LRU list a page resides when PG_active flag in that it indicates on which LRU list a folio resides when
PG_lru is set. PG_lru is set.
The Unevictable LRU infrastructure maintains unevictable pages as if they were The Unevictable LRU infrastructure maintains unevictable folios as if they were
on an additional LRU list for a few reasons: on an additional LRU list for a few reasons:
(1) We get to "treat unevictable pages just like we treat other pages in the (1) We get to "treat unevictable folios just like we treat other folios in the
system - which means we get to use the same code to manipulate them, the system - which means we get to use the same code to manipulate them, the
same code to isolate them (for migrate, etc.), the same code to keep track same code to isolate them (for migrate, etc.), the same code to keep track
of the statistics, etc..." [Rik van Riel] of the statistics, etc..." [Rik van Riel]
(2) We want to be able to migrate unevictable pages between nodes for memory (2) We want to be able to migrate unevictable folios between nodes for memory
defragmentation, workload management and memory hotplug. The Linux kernel defragmentation, workload management and memory hotplug. The Linux kernel
can only migrate pages that it can successfully isolate from the LRU can only migrate folios that it can successfully isolate from the LRU
lists (or "Movable" pages: outside of consideration here). If we were to lists (or "Movable" pages: outside of consideration here). If we were to
maintain pages elsewhere than on an LRU-like list, where they can be maintain folios elsewhere than on an LRU-like list, where they can be
detected by isolate_lru_page(), we would prevent their migration. detected by folio_isolate_lru(), we would prevent their migration.
The unevictable list does not differentiate between file-backed and anonymous, The unevictable list does not differentiate between file-backed and
swap-backed pages. This differentiation is only important while the pages are, anonymous, swap-backed folios. This differentiation is only important
in fact, evictable. while the folios are, in fact, evictable.
The unevictable list benefits from the "arrayification" of the per-node LRU The unevictable list benefits from the "arrayification" of the per-node LRU
lists and statistics originally proposed and posted by Christoph Lameter. lists and statistics originally proposed and posted by Christoph Lameter.
@ -156,7 +157,7 @@ These are currently used in three places in the kernel:
Detecting Unevictable Pages Detecting Unevictable Pages
--------------------------- ---------------------------
The function page_evictable() in mm/internal.h determines whether a page is The function folio_evictable() in mm/internal.h determines whether a folio is
evictable or not using the query function outlined above [see section evictable or not using the query function outlined above [see section
:ref:`Marking address spaces unevictable <mark_addr_space_unevict>`] :ref:`Marking address spaces unevictable <mark_addr_space_unevict>`]
to check the AS_UNEVICTABLE flag. to check the AS_UNEVICTABLE flag.
@ -165,7 +166,7 @@ For address spaces that are so marked after being populated (as SHM regions
might be), the lock action (e.g. SHM_LOCK) can be lazy, and need not populate might be), the lock action (e.g. SHM_LOCK) can be lazy, and need not populate
the page tables for the region as does, for example, mlock(), nor need it make the page tables for the region as does, for example, mlock(), nor need it make
any special effort to push any pages in the SHM_LOCK'd area to the unevictable any special effort to push any pages in the SHM_LOCK'd area to the unevictable
list. Instead, vmscan will do this if and when it encounters the pages during list. Instead, vmscan will do this if and when it encounters the folios during
a reclamation scan. a reclamation scan.
On an unlock action (such as SHM_UNLOCK), the unlocker (e.g. shmctl()) must scan On an unlock action (such as SHM_UNLOCK), the unlocker (e.g. shmctl()) must scan
@ -174,41 +175,43 @@ condition is keeping them unevictable. If an unevictable region is destroyed,
the pages are also "rescued" from the unevictable list in the process of the pages are also "rescued" from the unevictable list in the process of
freeing them. freeing them.
page_evictable() also checks for mlocked pages by testing an additional page folio_evictable() also checks for mlocked folios by calling
flag, PG_mlocked (as wrapped by PageMlocked()), which is set when a page is folio_test_mlocked(), which is set when a folio is faulted into a
faulted into a VM_LOCKED VMA, or found in a VMA being VM_LOCKED. VM_LOCKED VMA, or found in a VMA being VM_LOCKED.
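The evictability test described above can be sketched as a small userspace model. This is only an illustration under stated assumptions: `struct folio_model`, its two boolean fields and `folio_model_evictable()` are hypothetical stand-ins for the AS_UNEVICTABLE address-space flag and the folio's mlocked flag, and are not kernel interfaces.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the two inputs named in the text. */
struct folio_model {
	bool mapping_unevictable; /* models AS_UNEVICTABLE on the address space */
	bool mlocked;             /* models the folio's mlocked flag */
};

/* A folio is evictable only if neither condition marks it unevictable. */
static bool folio_model_evictable(const struct folio_model *folio)
{
	assert(folio != NULL);
	return !folio->mapping_unevictable && !folio->mlocked;
}
```

In this model a folio in an unevictable address space, an mlocked folio, or both all report as not evictable; only a folio with neither flag set is a reclaim candidate.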
Vmscan's Handling of Unevictable Pages Vmscan's Handling of Unevictable Folios
-------------------------------------- ---------------------------------------
If unevictable pages are culled in the fault path, or moved to the unevictable If unevictable folios are culled in the fault path, or moved to the unevictable
list at mlock() or mmap() time, vmscan will not encounter the pages until they list at mlock() or mmap() time, vmscan will not encounter the folios until they
have become evictable again (via munlock() for example) and have been "rescued" have become evictable again (via munlock() for example) and have been "rescued"
from the unevictable list. However, there may be situations where we decide, from the unevictable list. However, there may be situations where we decide,
for the sake of expediency, to leave an unevictable page on one of the regular for the sake of expediency, to leave an unevictable folio on one of the regular
active/inactive LRU lists for vmscan to deal with. vmscan checks for such active/inactive LRU lists for vmscan to deal with. vmscan checks for such
pages in all of the shrink_{active|inactive|page}_list() functions and will folios in all of the shrink_{active|inactive|page}_list() functions and will
"cull" such pages that it encounters: that is, it diverts those pages to the "cull" such folios that it encounters: that is, it diverts those folios to the
unevictable list for the memory cgroup and node being scanned. unevictable list for the memory cgroup and node being scanned.
There may be situations where a page is mapped into a VM_LOCKED VMA, but the There may be situations where a folio is mapped into a VM_LOCKED VMA,
page is not marked as PG_mlocked. Such pages will make it all the way to but the folio does not have the mlocked flag set. Such folios will make
shrink_active_list() or shrink_page_list() where they will be detected when it all the way to shrink_active_list() or shrink_page_list() where they
vmscan walks the reverse map in folio_referenced() or try_to_unmap(). The page will be detected when vmscan walks the reverse map in folio_referenced()
is culled to the unevictable list when it is released by the shrinker. or try_to_unmap(). The folio is culled to the unevictable list when it
is released by the shrinker.
To "cull" an unevictable page, vmscan simply puts the page back on the LRU list To "cull" an unevictable folio, vmscan simply puts the folio back on
using putback_lru_page() - the inverse operation to isolate_lru_page() - after the LRU list using folio_putback_lru() - the inverse operation to
dropping the page lock. Because the condition which makes the page unevictable folio_isolate_lru() - after dropping the folio lock. Because the
may change once the page is unlocked, __pagevec_lru_add_fn() will recheck the condition which makes the folio unevictable may change once the folio
unevictable state of a page before placing it on the unevictable list. is unlocked, __pagevec_lru_add_fn() will recheck the unevictable state
of a folio before placing it on the unevictable list.
MLOCKED Pages MLOCKED Pages
============= =============
The unevictable page list is also useful for mlock(), in addition to ramfs and The unevictable folio list is also useful for mlock(), in addition to ramfs and
SYSV SHM. Note that mlock() is only available in CONFIG_MMU=y situations; in SYSV SHM. Note that mlock() is only available in CONFIG_MMU=y situations; in
NOMMU situations, all mappings are effectively mlocked. NOMMU situations, all mappings are effectively mlocked.
@ -293,7 +296,7 @@ treated as a no-op and mlock_fixup() simply returns.
If the VMA passes some filtering as described in "Filtering Special VMAs" If the VMA passes some filtering as described in "Filtering Special VMAs"
below, mlock_fixup() will attempt to merge the VMA with its neighbors or split below, mlock_fixup() will attempt to merge the VMA with its neighbors or split
off a subset of the VMA if the range does not cover the entire VMA. Any pages off a subset of the VMA if the range does not cover the entire VMA. Any pages
already present in the VMA are then marked as mlocked by mlock_page() via already present in the VMA are then marked as mlocked by mlock_folio() via
mlock_pte_range() via walk_page_range() via mlock_vma_pages_range(). mlock_pte_range() via walk_page_range() via mlock_vma_pages_range().
Before returning from the system call, do_mlock() or mlockall() will call Before returning from the system call, do_mlock() or mlockall() will call
@ -306,22 +309,22 @@ do end up getting faulted into this VM_LOCKED VMA, they will be handled in the
fault path - which is also how mlock2()'s MLOCK_ONFAULT areas are handled. fault path - which is also how mlock2()'s MLOCK_ONFAULT areas are handled.
For each PTE (or PMD) being faulted into a VMA, the page add rmap function For each PTE (or PMD) being faulted into a VMA, the page add rmap function
calls mlock_vma_page(), which calls mlock_page() when the VMA is VM_LOCKED calls mlock_vma_folio(), which calls mlock_folio() when the VMA is VM_LOCKED
(unless it is a PTE mapping of a part of a transparent huge page). Or when (unless it is a PTE mapping of a part of a transparent huge page). Or when
it is a newly allocated anonymous page, lru_cache_add_inactive_or_unevictable() it is a newly allocated anonymous page, folio_add_lru_vma() calls
calls mlock_new_page() instead: similar to mlock_page(), but can make better mlock_new_folio() instead: similar to mlock_folio(), but can make better
judgments, since this page is held exclusively and known not to be on LRU yet. judgments, since this page is held exclusively and known not to be on LRU yet.
mlock_page() sets PageMlocked immediately, then places the page on the CPU's mlock_folio() sets PG_mlocked immediately, then places the page on the CPU's
mlock pagevec, to batch up the rest of the work to be done under lru_lock by mlock folio batch, to batch up the rest of the work to be done under lru_lock by
__mlock_page(). __mlock_page() sets PageUnevictable, initializes mlock_count __mlock_folio(). __mlock_folio() sets PG_unevictable, initializes mlock_count
and moves the page to unevictable state ("the unevictable LRU", but with and moves the page to unevictable state ("the unevictable LRU", but with
mlock_count in place of LRU threading). Or if the page was already PageLRU mlock_count in place of LRU threading). Or if the page was already PG_lru
and PageUnevictable and PageMlocked, it simply increments the mlock_count. and PG_unevictable and PG_mlocked, it simply increments the mlock_count.
But in practice that may not work ideally: the page may not yet be on an LRU, or But in practice that may not work ideally: the page may not yet be on an LRU, or
it may have been temporarily isolated from LRU. In such cases the mlock_count it may have been temporarily isolated from LRU. In such cases the mlock_count
field cannot be touched, but will be set to 0 later when __pagevec_lru_add_fn() field cannot be touched, but will be set to 0 later when __munlock_folio()
returns the page to "LRU". Races prohibit mlock_count from being set to 1 then: returns the page to "LRU". Races prohibit mlock_count from being set to 1 then:
rather than risk stranding a page indefinitely as unevictable, always err with rather than risk stranding a page indefinitely as unevictable, always err with
mlock_count on the low side, so that when munlocked the page will be rescued to mlock_count on the low side, so that when munlocked the page will be rescued to
@ -368,20 +371,21 @@ Because of the VMA filtering discussed above, VM_LOCKED will not be set in
any "special" VMAs. So, those VMAs will be ignored for munlock. any "special" VMAs. So, those VMAs will be ignored for munlock.
If the VMA is VM_LOCKED, mlock_fixup() again attempts to merge or split off the If the VMA is VM_LOCKED, mlock_fixup() again attempts to merge or split off the
specified range. All pages in the VMA are then munlocked by munlock_page() via specified range. All pages in the VMA are then munlocked by munlock_folio() via
mlock_pte_range() via walk_page_range() via mlock_vma_pages_range() - the same mlock_pte_range() via walk_page_range() via mlock_vma_pages_range() - the same
function used when mlocking a VMA range, with new flags for the VMA indicating function used when mlocking a VMA range, with new flags for the VMA indicating
that it is munlock() being performed. that it is munlock() being performed.
munlock_page() uses the mlock pagevec to batch up work to be done under munlock_folio() uses the mlock pagevec to batch up work to be done
lru_lock by __munlock_page(). __munlock_page() decrements the page's under lru_lock by __munlock_folio(). __munlock_folio() decrements the
mlock_count, and when that reaches 0 it clears PageMlocked and clears folio's mlock_count, and when that reaches 0 it clears the mlocked flag
PageUnevictable, moving the page from unevictable state to inactive LRU. and clears the unevictable flag, moving the folio from unevictable state
to the inactive LRU.
But in practice that may not work ideally: the page may not yet have reached But in practice that may not work ideally: the folio may not yet have reached
"the unevictable LRU", or it may have been temporarily isolated from it. In "the unevictable LRU", or it may have been temporarily isolated from it. In
those cases its mlock_count field is unusable and must be assumed to be 0: so those cases its mlock_count field is unusable and must be assumed to be 0: so
that the page will be rescued to an evictable LRU, then perhaps be mlocked that the folio will be rescued to an evictable LRU, then perhaps be mlocked
again later if vmscan finds it in a VM_LOCKED VMA. again later if vmscan finds it in a VM_LOCKED VMA.
@ -408,7 +412,7 @@ However, since mlock_vma_pages_range() starts by setting VM_LOCKED on a VMA,
before mlocking any pages already present, if one of those pages were migrated before mlocking any pages already present, if one of those pages were migrated
before mlock_pte_range() reached it, it would get counted twice in mlock_count. before mlock_pte_range() reached it, it would get counted twice in mlock_count.
To prevent that, mlock_vma_pages_range() temporarily marks the VMA as VM_IO, To prevent that, mlock_vma_pages_range() temporarily marks the VMA as VM_IO,
so that mlock_vma_page() will skip it. so that mlock_vma_folio() will skip it.
To complete page migration, we place the old and new pages back onto the LRU To complete page migration, we place the old and new pages back onto the LRU
afterwards. The "unneeded" page - old page on success, new page on failure - afterwards. The "unneeded" page - old page on success, new page on failure -
@ -481,18 +485,19 @@ Before the unevictable/mlock changes, mlocking did not mark the pages in any
way, so unmapping them required no processing. way, so unmapping them required no processing.
For each PTE (or PMD) being unmapped from a VMA, page_remove_rmap() calls For each PTE (or PMD) being unmapped from a VMA, page_remove_rmap() calls
munlock_vma_page(), which calls munlock_page() when the VMA is VM_LOCKED munlock_vma_folio(), which calls munlock_folio() when the VMA is VM_LOCKED
(unless it was a PTE mapping of a part of a transparent huge page). (unless it was a PTE mapping of a part of a transparent huge page).
munlock_page() uses the mlock pagevec to batch up work to be done under munlock_folio() uses the mlock pagevec to batch up work to be done
lru_lock by __munlock_page(). __munlock_page() decrements the page's under lru_lock by __munlock_folio(). __munlock_folio() decrements the
mlock_count, and when that reaches 0 it clears PageMlocked and clears folio's mlock_count, and when that reaches 0 it clears the mlocked flag
PageUnevictable, moving the page from unevictable state to inactive LRU. and clears the unevictable flag, moving the folio from unevictable state
to the inactive LRU.
But in practice that may not work ideally: the page may not yet have reached But in practice that may not work ideally: the folio may not yet have reached
"the unevictable LRU", or it may have been temporarily isolated from it. In "the unevictable LRU", or it may have been temporarily isolated from it. In
those cases its mlock_count field is unusable and must be assumed to be 0: so those cases its mlock_count field is unusable and must be assumed to be 0: so
that the page will be rescued to an evictable LRU, then perhaps be mlocked that the folio will be rescued to an evictable LRU, then perhaps be mlocked
again later if vmscan finds it in a VM_LOCKED VMA. again later if vmscan finds it in a VM_LOCKED VMA.
@ -505,7 +510,7 @@ which had been Copied-On-Write from the file pages now being truncated.
Mlocked pages can be munlocked and deleted in this way: like with munmap(), Mlocked pages can be munlocked and deleted in this way: like with munmap(),
for each PTE (or PMD) being unmapped from a VMA, page_remove_rmap() calls for each PTE (or PMD) being unmapped from a VMA, page_remove_rmap() calls
munlock_vma_page(), which calls munlock_page() when the VMA is VM_LOCKED munlock_vma_folio(), which calls munlock_folio() when the VMA is VM_LOCKED
(unless it was a PTE mapping of a part of a transparent huge page). (unless it was a PTE mapping of a part of a transparent huge page).
However, if there is a racing munlock(), since mlock_vma_pages_range() starts However, if there is a racing munlock(), since mlock_vma_pages_range() starts
@ -513,7 +518,7 @@ munlocking by clearing VM_LOCKED from a VMA, before munlocking all the pages
present, if one of those pages were unmapped by truncation or hole punch before present, if one of those pages were unmapped by truncation or hole punch before
mlock_pte_range() reached it, it would not be recognized as mlocked by this VMA, mlock_pte_range() reached it, it would not be recognized as mlocked by this VMA,
and would not be counted out of mlock_count. In this rare case, a page may and would not be counted out of mlock_count. In this rare case, a page may
still appear as PageMlocked after it has been fully unmapped: and it is left to still appear as PG_mlocked after it has been fully unmapped: and it is left to
release_pages() (or __page_cache_release()) to clear it and update statistics release_pages() (or __page_cache_release()) to clear it and update statistics
before freeing (this event is counted in /proc/vmstat unevictable_pgs_cleared, before freeing (this event is counted in /proc/vmstat unevictable_pgs_cleared,
which is usually 0). which is usually 0).
@ -525,7 +530,7 @@ Page Reclaim in shrink_*_list()
vmscan's shrink_active_list() culls any obviously unevictable pages - vmscan's shrink_active_list() culls any obviously unevictable pages -
i.e. !page_evictable(page) pages - diverting those to the unevictable list. i.e. !page_evictable(page) pages - diverting those to the unevictable list.
However, shrink_active_list() only sees unevictable pages that made it onto the However, shrink_active_list() only sees unevictable pages that made it onto the
active/inactive LRU lists. Note that these pages do not have PageUnevictable active/inactive LRU lists. Note that these pages do not have PG_unevictable
set - otherwise they would be on the unevictable list and shrink_active_list() set - otherwise they would be on the unevictable list and shrink_active_list()
would never see them. would never see them.
@ -547,6 +552,6 @@ and node unevictable list.
rmap's folio_referenced_one(), called via vmscan's shrink_active_list() or rmap's folio_referenced_one(), called via vmscan's shrink_active_list() or
shrink_page_list(), and rmap's try_to_unmap_one() called via shrink_page_list(), shrink_page_list(), and rmap's try_to_unmap_one() called via shrink_page_list(),
check for (3) pages still mapped into VM_LOCKED VMAs, and call mlock_vma_page() check for (3) pages still mapped into VM_LOCKED VMAs, and call mlock_vma_folio()
to correct them. Such pages are culled to the unevictable list when released to correct them. Such pages are culled to the unevictable list when released
by the shrinker. by the shrinker.


@ -78,3 +78,171 @@ Similarly, we assign zspage to:
* ZS_ALMOST_FULL when n > N / f * ZS_ALMOST_FULL when n > N / f
* ZS_EMPTY when n == 0 * ZS_EMPTY when n == 0
* ZS_FULL when n == N * ZS_FULL when n == N
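The fullness rules listed above can be sketched as a small classifier. This is an illustrative model, not the zsmalloc source: `classify_zspage()` is a hypothetical name, `f` is passed in as a parameter rather than fixed, integer division is assumed for `N / f`, and the remaining case (0 < n <= N / f) is assumed to map to ZS_ALMOST_EMPTY because the start of the list lies outside this hunk.

```c
#include <assert.h>

/* Fullness groups named in the text. */
enum fullness_group { ZS_EMPTY, ZS_ALMOST_EMPTY, ZS_ALMOST_FULL, ZS_FULL };

/*
 * n: number of allocated objects
 * N: total number of objects the zspage can store
 * f: threshold divisor (assumed to be supplied by the caller)
 */
static enum fullness_group classify_zspage(unsigned int n, unsigned int N,
					   unsigned int f)
{
	assert(N > 0 && f > 0 && n <= N);

	if (n == 0)
		return ZS_EMPTY;
	if (n == N)
		return ZS_FULL;
	if (n > N / f)
		return ZS_ALMOST_FULL;
	return ZS_ALMOST_EMPTY; /* assumed: 0 < n <= N / f */
}
```

With N = 8 and f = 4, for example, the threshold is 2 objects: a zspage holding 3 objects is classified as almost full, one holding 2 as almost empty.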
Internals
=========
zsmalloc has 255 size classes, each of which can hold a number of zspages.
Each zspage can contain up to ZSMALLOC_CHAIN_SIZE physical (0-order) pages.
The optimal zspage chain size for each size class is calculated during the
creation of the zsmalloc pool (see calculate_zspage_chain_size()).
As an optimization, zsmalloc merges size classes that have similar
characteristics in terms of the number of pages per zspage and the number
of objects that each zspage can store.
For instance, consider the following size classes:::
  class  size  almost_full  almost_empty  obj_allocated  obj_used  pages_used  pages_per_zspage  freeable
  ...
  94     1536  0            0             0              0         0           3                 0
  100    1632  0            0             0              0         0           2                 0
  ...
Size classes #95-99 are merged with size class #100. This means that when we
need to store an object of size, say, 1568 bytes, we end up using size class
#100 instead of size class #96. Size class #100 is meant for objects of size
1632 bytes, so each object of size 1568 bytes wastes 1632-1568=64 bytes.
Size class #100 consists of zspages with 2 physical pages each, which can
hold a total of 5 objects. If we need to store 13 objects of size 1568, we
end up allocating three zspages, or 6 physical pages.
However, if we take a closer look at size class #96 (which is meant for
objects of size 1568 bytes) and trace `calculate_zspage_chain_size()`, we
find that the optimal zspage configuration for this class is a chain
of 5 physical pages:::
  pages per zspage   wasted bytes   used%
  1                  960            76
  2                  352            95
  3                  1312           89
  4                  704            95
  5                  96             99
This means that a class #96 configuration with 5 physical pages can store 13
objects of size 1568 in a single zspage, using a total of 5 physical pages.
This is more efficient than the class #100 configuration, which would use 6
physical pages to store the same number of objects.
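The figures in this comparison can be reproduced with a short userspace sketch. It is a model written for illustration, not the kernel's `calculate_zspage_chain_size()`: the helper names are hypothetical, a 4096-byte page is assumed, and the chain length is chosen as the shortest one with the least leftover space.

```c
#include <assert.h>
#include <limits.h>

#define MODEL_PAGE_SIZE 4096u /* assumption: 4 KiB physical pages */

/* Bytes left unused at the end of a zspage made of `pages` pages. */
static unsigned int zspage_wasted_bytes(unsigned int obj_size, unsigned int pages)
{
	assert(obj_size > 0 && pages > 0);
	return (pages * MODEL_PAGE_SIZE) % obj_size;
}

/* Share of the zspage occupied by objects, in percent, rounded down. */
static unsigned int zspage_used_percent(unsigned int obj_size, unsigned int pages)
{
	unsigned int total = pages * MODEL_PAGE_SIZE;

	return (total - zspage_wasted_bytes(obj_size, pages)) * 100 / total;
}

/* Shortest chain length in 1..max_pages with the least wasted bytes. */
static unsigned int best_chain_size(unsigned int obj_size, unsigned int max_pages)
{
	unsigned int best = 1, min_waste = UINT_MAX;

	for (unsigned int i = 1; i <= max_pages; i++) {
		unsigned int waste = zspage_wasted_bytes(obj_size, i);

		if (waste < min_waste) {
			min_waste = waste;
			best = i;
		}
	}
	return best;
}

/* Physical pages needed to store `count` objects of size `obj_size`. */
static unsigned int pages_for_objects(unsigned int count, unsigned int obj_size,
				      unsigned int pages_per_zspage)
{
	unsigned int per_zspage = pages_per_zspage * MODEL_PAGE_SIZE / obj_size;

	assert(per_zspage > 0);
	return (count + per_zspage - 1) / per_zspage * pages_per_zspage;
}
```

For 1568-byte objects, `zspage_wasted_bytes()` and `zspage_used_percent()` give the rows of the table above, and `best_chain_size(1568, 5)` selects 5 pages. `pages_for_objects(13, 1632, 2)` yields the 6 pages of the class #100 layout, while `pages_for_objects(13, 1568, 5)` yields 5.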
As the zspage chain size for class #96 increases, its key characteristics
such as pages per-zspage and objects per-zspage also change. This leads to
fewer class mergers, resulting in a more compact grouping of classes, which
reduces memory wastage.
Let's take a closer look at the bottom of `/sys/kernel/debug/zsmalloc/zramX/classes`:::
  class  size  almost_full  almost_empty  obj_allocated  obj_used  pages_used  pages_per_zspage  freeable
  ...
  202    3264  0            0             0              0         0           4                 0
  254    4096  0            0             0              0         0           1                 0
  ...
Size class #202 stores objects of size 3264 bytes and has a maximum of 4 pages
per zspage. Any object larger than 3264 bytes is considered huge and belongs
to size class #254, which stores each object in its own physical page (objects
in huge classes do not share pages).
Increasing the size of the chain of zspages also results in a higher watermark
for the huge size class and fewer huge classes overall. This allows for more
efficient storage of large objects.
For a zspage chain size of 8, the huge class watermark becomes 3632 bytes:::
  class  size  almost_full  almost_empty  obj_allocated  obj_used  pages_used  pages_per_zspage  freeable
  ...
  202    3264  0            0             0              0         0           4                 0
  211    3408  0            0             0              0         0           5                 0
  217    3504  0            0             0              0         0           6                 0
  222    3584  0            0             0              0         0           7                 0
  225    3632  0            0             0              0         0           8                 0
  254    4096  0            0             0              0         0           1                 0
  ...
For a zspage chain size of 16, the huge class watermark becomes 3840 bytes:::
  class  size  almost_full  almost_empty  obj_allocated  obj_used  pages_used  pages_per_zspage  freeable
  ...
  202    3264  0            0             0              0         0           4                 0
  206    3328  0            0             0              0         0           13                0
  207    3344  0            0             0              0         0           9                 0
  208    3360  0            0             0              0         0           14                0
  211    3408  0            0             0              0         0           5                 0
  212    3424  0            0             0              0         0           16                0
  214    3456  0            0             0              0         0           11                0
  217    3504  0            0             0              0         0           6                 0
  219    3536  0            0             0              0         0           13                0
  222    3584  0            0             0              0         0           7                 0
  223    3600  0            0             0              0         0           15                0
  225    3632  0            0             0              0         0           8                 0
  228    3680  0            0             0              0         0           9                 0
  230    3712  0            0             0              0         0           10                0
  232    3744  0            0             0              0         0           11                0
  234    3776  0            0             0              0         0           12                0
  235    3792  0            0             0              0         0           13                0
  236    3808  0            0             0              0         0           14                0
  238    3840  0            0             0              0         0           15                0
  254    4096  0            0             0              0         0           1                 0
  ...
Overall, the combined effect of the zspage chain size on the zsmalloc pool configuration:::
  pages per zspage   number of size classes (clusters)   huge size class watermark
  4                  69                                  3264
  5                  86                                  3408
  6                  93                                  3504
  7                  112                                 3584
  8                  123                                 3632
  9                  140                                 3680
  10                 143                                 3712
  11                 159                                 3744
  12                 164                                 3776
  13                 180                                 3792
  14                 183                                 3808
  15                 188                                 3840
  16                 191                                 3840
A synthetic test
----------------
zram used as storage for build artifacts (Linux kernel compilation).
* `CONFIG_ZSMALLOC_CHAIN_SIZE=4`
zsmalloc classes stats:::
  class  size  almost_full  almost_empty  obj_allocated  obj_used  pages_used  pages_per_zspage  freeable
  ...
  Total        13           51            413836         412973    159955                        3
zram mm_stat:::
1691783168 628083717 655175680 0 655175680 60 0 34048 34049
* `CONFIG_ZSMALLOC_CHAIN_SIZE=8`
zsmalloc classes stats:::
  class  size  almost_full  almost_empty  obj_allocated  obj_used  pages_used  pages_per_zspage  freeable
  ...
  Total        18           87            414852         412978    156666                        0
zram mm_stat:::
1691803648 627793930 641703936 0 641703936 60 0 33591 33591
Using larger zspage chains may result in fewer physical pages being used. In
this example the number of physical pages used decreased from 159955 to
156666, while the maximum zsmalloc pool memory usage went down from 655175680
to 641703936 bytes.
However, this advantage may be offset by the potential for increased system
memory pressure (as some zspages have larger chain sizes) in cases where there
is heavy internal fragmentation and zspool compaction is unable to relocate
objects and release zspages. In these cases, it is recommended to decrease
the limit on the size of the zspage chains (as specified by the
CONFIG_ZSMALLOC_CHAIN_SIZE option).
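The page and byte figures quoted for the synthetic test can be cross-checked with a few lines of arithmetic. This is a sketch only: it assumes 4096-byte pages, which the text does not state, and the helper names are hypothetical.

```c
#include <assert.h>

#define MODEL_PAGE_SIZE 4096ULL /* assumption: 4 KiB physical pages */

/* Convert a count of physical pages into bytes. */
static unsigned long long pages_to_bytes(unsigned long long pages)
{
	return pages * MODEL_PAGE_SIZE;
}

/* Difference between a "before" and an "after" measurement. */
static unsigned long long saved(unsigned long long before,
				unsigned long long after)
{
	assert(before >= after);
	return before - after;
}
```

Under that assumption, `saved(159955, 156666)` is 3289 pages, and `pages_to_bytes(3289)` equals `saved(655175680, 641703936)`, that is 13471744 bytes, so the pages_used totals and the mm_stat memory figures describe the same saving.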


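@ -142,14 +142,14 @@ The HPAGE_RESV_OWNER flag is set to indicate that this VMA owns the reservation.
Consuming Reservations/Allocating a Huge Page Consuming Reservations/Allocating a Huge Page
============================================= =============================================
Reservations are consumed when huge pages associated with the reservations are allocated and instantiated in the corresponding mapping. The allocation is performed in the function alloc_huge_page():: Reservations are consumed when huge pages associated with the reservations are allocated and instantiated in the corresponding mapping. The allocation is performed in the function alloc_hugetlb_folio()::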
struct page *alloc_huge_page(struct vm_area_struct *vma, struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
unsigned long addr, int avoid_reserve) unsigned long addr, int avoid_reserve)
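alloc_huge_page is passed a VMA pointer and a virtual address, so it can consult the reservation map to determine whether a reservation exists. alloc_hugetlb_folio is passed a VMA pointer and a virtual address, so it can consult the reservation map to determine whether a reservation exists.
In addition, alloc_huge_page takes the argument avoid_reserve, which indicates that reserves should not be used even if In addition, alloc_hugetlb_folio takes the argument avoid_reserve, which indicates that reserves should not be used even if
they appear to have been set aside for the specified address. The avoid_reserve argument is most often used for Copy on Write and page migration, where additional they appear to have been set aside for the specified address. The avoid_reserve argument is most often used for Copy on Write and page migration, where additional
copies of an existing page are being allocated. copies of an existing page are being allocated.
@ -162,7 +162,7 @@ The value returned by vma_needs_reservation() is generally 0 or 1. If a reservation exists for the address
After determining whether a reservation exists and can be used for the allocation, the function dequeue_huge_page_vma() is called. This function takes two arguments related to After determining whether a reservation exists and can be used for the allocation, the function dequeue_huge_page_vma() is called. This function takes two arguments related to
reservations: reservations:
- avoid_reserve, the same value/argument passed to alloc_huge_page(). - avoid_reserve, the same value/argument passed to alloc_hugetlb_folio().
- chg, even though this argument is of type long, only the values 0 or 1 are passed to dequeue_huge_page_vma. If the value is 0, - chg, even though this argument is of type long, only the values 0 or 1 are passed to dequeue_huge_page_vma. If the value is 0,
it indicates that a reservation exists (see the section "Reservations and Memory Policy" for possible issues). If the value it indicates that a reservation exists (see the section "Reservations and Memory Policy" for possible issues). If the value
is 1, it indicates that no reservation exists and the page must be taken from the global free pool if possible. is 1, it indicates that no reservation exists and the page must be taken from the global free pool if possible.
@ -179,7 +179,7 @@ The value of free_huge_pages is decremented. If there was a reservation associated with the page
issue of surplus huge pages and overcommit. Even if a surplus page is allocated, the same reservation-based adjustments as above are made: issue of surplus huge pages and overcommit. Even if a surplus page is allocated, the same reservation-based adjustments as above are made:
SetPagePrivate(page) and resv_huge_pages--. SetPagePrivate(page) and resv_huge_pages--.
After obtaining a new huge page, (page)->private is set to the value of the subpool associated with the page, if it exists. This will be After obtaining a new huge page, (folio)->_hugetlb_subpool is set to the value of the subpool associated with the page, if it exists. This will be
used for subpool accounting when the page is freed. used for subpool accounting when the page is freed.
The function vma_commit_reservation() is then called to adjust the reservation map based on the consumption of the reservation. In general, this involves The function vma_commit_reservation() is then called to adjust the reservation map based on the consumption of the reservation. In general, this involves
@ -199,7 +199,7 @@ SetPagePrivate(page) and resv_huge_pages-.
already exists, so no change is made. However, if there was no reservation in a shared mapping, or this was a private mapping, a new entry must be already exists, so no change is made. However, if there was no reservation in a shared mapping, or this was a private mapping, a new entry must be
created. created.
Between the call to vma_needs_reservation() at the beginning of alloc_huge_page() and the call, after the page is allocated, to Between the call to vma_needs_reservation() at the beginning of alloc_hugetlb_folio() and the call, after the page is allocated, to
vma_commit_reservation(), the reservation map may have been changed. This is possible if hugetlb_reserve_pages was called vma_commit_reservation(), the reservation map may have been changed. This is possible if hugetlb_reserve_pages was called
for the same page in a shared mapping. In such cases, the reservation count and the subpool free page count will be off by one. for the same page in a shared mapping. In such cases, the reservation count and the subpool free page count will be off by one.
By comparing the return values of vma_needs_reservation and vma_commit_reservation, this rare condition can be By comparing the return values of vma_needs_reservation and vma_commit_reservation, this rare condition can be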


@ -51,7 +51,7 @@ page owner is disabled by default. So, if you want to use it,
1) Build the user-space helper:: 1) Build the user-space helper::
cd tools/vm cd tools/mm
make page_owner_sort make page_owner_sort
2) Enable page owner: add "page_owner=on" to the boot cmdline. 2) Enable page owner: add "page_owner=on" to the boot cmdline.


@ -5657,6 +5657,11 @@ M: SeongJae Park <sj@kernel.org>
L: damon@lists.linux.dev L: damon@lists.linux.dev
L: linux-mm@kvack.org L: linux-mm@kvack.org
S: Maintained S: Maintained
W: https://damonitor.github.io
P: Documentation/mm/damon/maintainer-profile.rst
T: git git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
T: quilt git://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new
T: git git://git.kernel.org/pub/scm/linux/kernel/git/sj/linux.git damon/next
F: Documentation/ABI/testing/sysfs-kernel-mm-damon F: Documentation/ABI/testing/sysfs-kernel-mm-damon
F: Documentation/admin-guide/mm/damon/ F: Documentation/admin-guide/mm/damon/
F: Documentation/mm/damon/ F: Documentation/mm/damon/
@ -9340,7 +9345,7 @@ F: Documentation/mm/hmm.rst
F: include/linux/hmm* F: include/linux/hmm*
F: lib/test_hmm* F: lib/test_hmm*
F: mm/hmm* F: mm/hmm*
F: tools/testing/selftests/vm/*hmm* F: tools/testing/selftests/mm/*hmm*
HOST AP DRIVER HOST AP DRIVER
M: Jouni Malinen <j@w1.fi> M: Jouni Malinen <j@w1.fi>
@ -13378,7 +13383,7 @@ M: Andrew Morton <akpm@linux-foundation.org>
L: linux-mm@kvack.org L: linux-mm@kvack.org
S: Maintained S: Maintained
W: http://www.linux-mm.org W: http://www.linux-mm.org
T: git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm T: git git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
T: quilt git://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new T: quilt git://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new
F: include/linux/gfp.h F: include/linux/gfp.h
F: include/linux/gfp_types.h F: include/linux/gfp_types.h
@ -13387,7 +13392,8 @@ F: include/linux/mm.h
F: include/linux/mmzone.h F: include/linux/mmzone.h
F: include/linux/pagewalk.h F: include/linux/pagewalk.h
F: mm/ F: mm/
F: tools/testing/selftests/vm/ F: tools/mm/
F: tools/testing/selftests/mm/
VMALLOC VMALLOC
M: Andrew Morton <akpm@linux-foundation.org> M: Andrew Morton <akpm@linux-foundation.org>
@ -13396,7 +13402,7 @@ R: Christoph Hellwig <hch@infradead.org>
L: linux-mm@kvack.org L: linux-mm@kvack.org
S: Maintained S: Maintained
W: http://www.linux-mm.org W: http://www.linux-mm.org
T: git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm T: git git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
F: include/linux/vmalloc.h F: include/linux/vmalloc.h
F: mm/vmalloc.c F: mm/vmalloc.c


@ -17,9 +17,8 @@
extern void clear_page(void *page); extern void clear_page(void *page);
#define clear_user_page(page, vaddr, pg) clear_page(page) #define clear_user_page(page, vaddr, pg) clear_page(page)
#define alloc_zeroed_user_highpage_movable(vma, vaddr) \ #define vma_alloc_zeroed_movable_folio(vma, vaddr) \
alloc_page_vma(GFP_HIGHUSER_MOVABLE | __GFP_ZERO, vma, vaddr) vma_alloc_folio(GFP_HIGHUSER_MOVABLE | __GFP_ZERO, 0, vma, vaddr, false)
#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE_MOVABLE
extern void copy_page(void * _to, void * _from); extern void copy_page(void * _to, void * _from);
#define copy_user_page(to, from, vaddr, pg) copy_page(to, from) #define copy_user_page(to, from, vaddr, pg) copy_page(to, from)
@ -87,10 +86,6 @@ typedef struct page *pgtable_t;
#define virt_to_page(kaddr) pfn_to_page(__pa(kaddr) >> PAGE_SHIFT) #define virt_to_page(kaddr) pfn_to_page(__pa(kaddr) >> PAGE_SHIFT)
#define virt_addr_valid(kaddr) pfn_valid((__pa(kaddr) >> PAGE_SHIFT)) #define virt_addr_valid(kaddr) pfn_valid((__pa(kaddr) >> PAGE_SHIFT))
#ifdef CONFIG_FLATMEM
#define pfn_valid(pfn) ((pfn) < max_mapnr)
#endif /* CONFIG_FLATMEM */
#include <asm-generic/memory_model.h> #include <asm-generic/memory_model.h>
#include <asm-generic/getorder.h> #include <asm-generic/getorder.h>


@ -74,6 +74,9 @@ struct vm_area_struct;
#define _PAGE_DIRTY 0x20000 #define _PAGE_DIRTY 0x20000
#define _PAGE_ACCESSED 0x40000 #define _PAGE_ACCESSED 0x40000
/* We borrow bit 39 to store the exclusive marker in swap PTEs. */
#define _PAGE_SWP_EXCLUSIVE 0x8000000000UL
/* /*
* NOTE! The "accessed" bit isn't necessarily exact: it can be kept exactly * NOTE! The "accessed" bit isn't necessarily exact: it can be kept exactly
* by software (use the KRE/URE/KWE/UWE bits appropriately), but I'll fake it. * by software (use the KRE/URE/KWE/UWE bits appropriately), but I'll fake it.
@ -301,18 +304,47 @@ extern inline void update_mmu_cache(struct vm_area_struct * vma,
} }
/* /*
* Non-present pages: high 24 bits are offset, next 8 bits type, * Encode/decode swap entries and swap PTEs. Swap PTEs are all PTEs that
* low 32 bits zero. * are !pte_none() && !pte_present().
*
* Format of swap PTEs:
*
* 6 6 6 6 5 5 5 5 5 5 5 5 5 5 4 4 4 4 4 4 4 4 4 4 3 3 3 3 3 3 3 3
* 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2
* <------------------- offset ------------------> E <--- type -->
*
* 3 3 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1
* 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0
* <--------------------------- zeroes -------------------------->
*
* E is the exclusive marker that is not stored in swap entries.
*/ */
extern inline pte_t mk_swap_pte(unsigned long type, unsigned long offset) extern inline pte_t mk_swap_pte(unsigned long type, unsigned long offset)
{ pte_t pte; pte_val(pte) = (type << 32) | (offset << 40); return pte; } { pte_t pte; pte_val(pte) = ((type & 0x7f) << 32) | (offset << 40); return pte; }
#define __swp_type(x) (((x).val >> 32) & 0xff) #define __swp_type(x) (((x).val >> 32) & 0x7f)
#define __swp_offset(x) ((x).val >> 40) #define __swp_offset(x) ((x).val >> 40)
#define __swp_entry(type, off) ((swp_entry_t) { pte_val(mk_swap_pte((type), (off))) }) #define __swp_entry(type, off) ((swp_entry_t) { pte_val(mk_swap_pte((type), (off))) })
#define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val(pte) }) #define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val(pte) })
#define __swp_entry_to_pte(x) ((pte_t) { (x).val }) #define __swp_entry_to_pte(x) ((pte_t) { (x).val })
static inline int pte_swp_exclusive(pte_t pte)
{
return pte_val(pte) & _PAGE_SWP_EXCLUSIVE;
}
static inline pte_t pte_swp_mkexclusive(pte_t pte)
{
pte_val(pte) |= _PAGE_SWP_EXCLUSIVE;
return pte;
}
static inline pte_t pte_swp_clear_exclusive(pte_t pte)
{
pte_val(pte) &= ~_PAGE_SWP_EXCLUSIVE;
return pte;
}
#define pte_ERROR(e) \ #define pte_ERROR(e) \
printk("%s:%d: bad pte %016lx.\n", __FILE__, __LINE__, pte_val(e)) printk("%s:%d: bad pte %016lx.\n", __FILE__, __LINE__, pte_val(e))
#define pmd_ERROR(e) \ #define pmd_ERROR(e) \
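
Reviewer note: the Alpha hunk above moves the swap type to a 7-bit field (bits 32-38) so that bit 39 can carry the exclusive marker. The following is a minimal userspace sketch of that layout, not kernel code: every `alpha_demo_` / `ALPHA_DEMO_` name is hypothetical and `uint64_t` stands in for `pte_t`. It shows why the type mask had to shrink from 0xff to 0x7f.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace model only; these names are illustrative, not kernel API. */
#define ALPHA_DEMO_SWP_EXCLUSIVE 0x8000000000ULL /* bit 39, as in the hunk */

/* Type goes to bits 32-38, offset to bits 40-63; bit 39 stays clear. */
static inline uint64_t alpha_demo_mk_swap_pte(uint64_t type, uint64_t offset)
{
	return ((type & 0x7f) << 32) | (offset << 40);
}

static inline uint64_t alpha_demo_swp_type(uint64_t pte)
{
	return (pte >> 32) & 0x7f;
}

static inline uint64_t alpha_demo_swp_offset(uint64_t pte)
{
	return pte >> 40;
}

/* Setting the marker must not disturb the decoded type or offset. */
static void alpha_demo_selfcheck(void)
{
	uint64_t pte = alpha_demo_mk_swap_pte(5, 0x1234);
	uint64_t excl = pte | ALPHA_DEMO_SWP_EXCLUSIVE;

	assert(alpha_demo_swp_type(excl) == 5);
	assert(alpha_demo_swp_offset(excl) == 0x1234);
	assert((excl & ~ALPHA_DEMO_SWP_EXCLUSIVE) == pte);
	/* The old 8-bit mask would have read the marker as part of the type. */
	assert(((excl >> 32) & 0xff) == 0x85);
}
```

Calling `alpha_demo_selfcheck()` from a test program exercises the round trip; the last assertion illustrates the failure mode the narrower mask avoids.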


@ -109,7 +109,6 @@ extern int pfn_valid(unsigned long pfn);
#else /* CONFIG_HIGHMEM */ #else /* CONFIG_HIGHMEM */
#define ARCH_PFN_OFFSET virt_to_pfn(CONFIG_LINUX_RAM_BASE) #define ARCH_PFN_OFFSET virt_to_pfn(CONFIG_LINUX_RAM_BASE)
#define pfn_valid(pfn) (((pfn) - ARCH_PFN_OFFSET) < max_mapnr)
#endif /* CONFIG_HIGHMEM */ #endif /* CONFIG_HIGHMEM */


@ -26,6 +26,9 @@
#define _PAGE_GLOBAL (1 << 8) /* ASID agnostic (H) */ #define _PAGE_GLOBAL (1 << 8) /* ASID agnostic (H) */
#define _PAGE_PRESENT (1 << 9) /* PTE/TLB Valid (H) */ #define _PAGE_PRESENT (1 << 9) /* PTE/TLB Valid (H) */
/* We borrow bit 5 to store the exclusive marker in swap PTEs. */
#define _PAGE_SWP_EXCLUSIVE _PAGE_DIRTY
#ifdef CONFIG_ARC_MMU_V4 #ifdef CONFIG_ARC_MMU_V4
#define _PAGE_HW_SZ (1 << 10) /* Normal/super (H) */ #define _PAGE_HW_SZ (1 << 10) /* Normal/super (H) */
#else #else
@ -106,9 +109,18 @@ static inline void set_pte_at(struct mm_struct *mm, unsigned long addr,
void update_mmu_cache(struct vm_area_struct *vma, unsigned long address, void update_mmu_cache(struct vm_area_struct *vma, unsigned long address,
pte_t *ptep); pte_t *ptep);
/* Encode swap {type,off} tuple into PTE /*
* We reserve 13 bits for 5-bit @type, keeping bits 12-5 zero, ensuring that * Encode/decode swap entries and swap PTEs. Swap PTEs are all PTEs that
* PAGE_PRESENT is zero in a PTE holding swap "identifier" * are !pte_none() && !pte_present().
*
* Format of swap PTEs:
*
* 3 3 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1
* 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0
* <-------------- offset -------------> <--- zero --> E < type ->
*
* E is the exclusive marker that is not stored in swap entries.
* The zero'ed bits include _PAGE_PRESENT.
*/ */
#define __swp_entry(type, off) ((swp_entry_t) \ #define __swp_entry(type, off) ((swp_entry_t) \
{ ((type) & 0x1f) | ((off) << 13) }) { ((type) & 0x1f) | ((off) << 13) })
@ -120,6 +132,14 @@ void update_mmu_cache(struct vm_area_struct *vma, unsigned long address,
#define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val(pte) }) #define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val(pte) })
#define __swp_entry_to_pte(x) ((pte_t) { (x).val }) #define __swp_entry_to_pte(x) ((pte_t) { (x).val })
static inline int pte_swp_exclusive(pte_t pte)
{
return pte_val(pte) & _PAGE_SWP_EXCLUSIVE;
}
PTE_BIT_FUNC(swp_mkexclusive, |= (_PAGE_SWP_EXCLUSIVE));
PTE_BIT_FUNC(swp_clear_exclusive, &= ~(_PAGE_SWP_EXCLUSIVE));
#ifdef CONFIG_TRANSPARENT_HUGEPAGE #ifdef CONFIG_TRANSPARENT_HUGEPAGE
#include <asm/hugepage.h> #include <asm/hugepage.h>
#endif #endif


@ -386,6 +386,4 @@ static inline unsigned long __virt_to_idmap(unsigned long x)
#endif #endif
#include <asm-generic/memory_model.h>
#endif #endif


@ -158,6 +158,7 @@ typedef struct page *pgtable_t;
#ifdef CONFIG_HAVE_ARCH_PFN_VALID #ifdef CONFIG_HAVE_ARCH_PFN_VALID
extern int pfn_valid(unsigned long); extern int pfn_valid(unsigned long);
#define pfn_valid pfn_valid
#endif #endif
#include <asm/memory.h> #include <asm/memory.h>
@ -167,5 +168,6 @@ extern int pfn_valid(unsigned long);
#define VM_DATA_DEFAULT_FLAGS VM_DATA_FLAGS_TSK_EXEC #define VM_DATA_DEFAULT_FLAGS VM_DATA_FLAGS_TSK_EXEC
#include <asm-generic/getorder.h> #include <asm-generic/getorder.h>
#include <asm-generic/memory_model.h>
#endif #endif


@ -126,6 +126,9 @@
#define L_PTE_SHARED (_AT(pteval_t, 1) << 10) /* shared(v6), coherent(xsc3) */ #define L_PTE_SHARED (_AT(pteval_t, 1) << 10) /* shared(v6), coherent(xsc3) */
#define L_PTE_NONE (_AT(pteval_t, 1) << 11) #define L_PTE_NONE (_AT(pteval_t, 1) << 11)
/* We borrow bit 7 to store the exclusive marker in swap PTEs. */
#define L_PTE_SWP_EXCLUSIVE L_PTE_RDONLY
/* /*
* These are the memory types, defined to be compatible with * These are the memory types, defined to be compatible with
* pre-ARMv6 CPUs cacheable and bufferable bits: n/a,n/a,C,B * pre-ARMv6 CPUs cacheable and bufferable bits: n/a,n/a,C,B


@ -76,6 +76,9 @@
#define L_PTE_NONE (_AT(pteval_t, 1) << 57) /* PROT_NONE */ #define L_PTE_NONE (_AT(pteval_t, 1) << 57) /* PROT_NONE */
#define L_PTE_RDONLY (_AT(pteval_t, 1) << 58) /* READ ONLY */ #define L_PTE_RDONLY (_AT(pteval_t, 1) << 58) /* READ ONLY */
/* We borrow bit 7 to store the exclusive marker in swap PTEs. */
#define L_PTE_SWP_EXCLUSIVE (_AT(pteval_t, 1) << 7)
#define L_PMD_SECT_VALID (_AT(pmdval_t, 1) << 0) #define L_PMD_SECT_VALID (_AT(pmdval_t, 1) << 0)
#define L_PMD_SECT_DIRTY (_AT(pmdval_t, 1) << 55) #define L_PMD_SECT_DIRTY (_AT(pmdval_t, 1) << 55)
#define L_PMD_SECT_NONE (_AT(pmdval_t, 1) << 57) #define L_PMD_SECT_NONE (_AT(pmdval_t, 1) << 57)


@ -271,27 +271,47 @@ static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
} }
/* /*
* Encode and decode a swap entry. Swap entries are stored in the Linux * Encode/decode swap entries and swap PTEs. Swap PTEs are all PTEs that
* page tables as follows: * are !pte_none() && !pte_present().
*
* Format of swap PTEs:
* *
* 3 3 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 * 3 3 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1
* 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 * 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0
* <--------------- offset ------------------------> < type -> 0 0 * <------------------- offset ------------------> E < type -> 0 0
* *
* This gives us up to 31 swap files and 128GB per swap file. Note that * E is the exclusive marker that is not stored in swap entries.
*
* This gives us up to 31 swap files and 64GB per swap file. Note that
* the offset field is always non-zero. * the offset field is always non-zero.
*/ */
#define __SWP_TYPE_SHIFT 2 #define __SWP_TYPE_SHIFT 2
#define __SWP_TYPE_BITS 5 #define __SWP_TYPE_BITS 5
#define __SWP_TYPE_MASK ((1 << __SWP_TYPE_BITS) - 1) #define __SWP_TYPE_MASK ((1 << __SWP_TYPE_BITS) - 1)
#define __SWP_OFFSET_SHIFT (__SWP_TYPE_BITS + __SWP_TYPE_SHIFT) #define __SWP_OFFSET_SHIFT (__SWP_TYPE_BITS + __SWP_TYPE_SHIFT + 1)
#define __swp_type(x) (((x).val >> __SWP_TYPE_SHIFT) & __SWP_TYPE_MASK) #define __swp_type(x) (((x).val >> __SWP_TYPE_SHIFT) & __SWP_TYPE_MASK)
#define __swp_offset(x) ((x).val >> __SWP_OFFSET_SHIFT) #define __swp_offset(x) ((x).val >> __SWP_OFFSET_SHIFT)
#define __swp_entry(type,offset) ((swp_entry_t) { ((type) << __SWP_TYPE_SHIFT) | ((offset) << __SWP_OFFSET_SHIFT) }) #define __swp_entry(type, offset) ((swp_entry_t) { (((type) & __SWP_TYPE_MASK) << __SWP_TYPE_SHIFT) | \
((offset) << __SWP_OFFSET_SHIFT) })
#define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val(pte) }) #define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val(pte) })
#define __swp_entry_to_pte(swp) __pte((swp).val | PTE_TYPE_FAULT) #define __swp_entry_to_pte(swp) __pte((swp).val)
static inline int pte_swp_exclusive(pte_t pte)
{
return pte_isset(pte, L_PTE_SWP_EXCLUSIVE);
}
static inline pte_t pte_swp_mkexclusive(pte_t pte)
{
return set_pte_bit(pte, __pgprot(L_PTE_SWP_EXCLUSIVE));
}
static inline pte_t pte_swp_clear_exclusive(pte_t pte)
{
return clear_pte_bit(pte, __pgprot(L_PTE_SWP_EXCLUSIVE));
}
/* /*
* It is an error for the kernel to have more swap files than we can * It is an error for the kernel to have more swap files than we can
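
Reviewer note: the comment change from 128GB to 64GB per swap file in the hunk above follows from `__SWP_OFFSET_SHIFT` growing from 7 to 8, which leaves one fewer offset bit in the 32-bit PTE. A small worked calculation is sketched below. It assumes 4 KiB pages (a page shift of 12), which the hunk does not state; the `arm_demo_` / `ARM_DEMO_` names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumptions for this sketch: 32-bit PTEs and 4 KiB pages. */
#define ARM_DEMO_PTE_BITS   32
#define ARM_DEMO_PAGE_SHIFT 12

/* Bytes addressable by the swap offset field that starts at offset_shift. */
static inline uint64_t arm_demo_swap_capacity_bytes(unsigned int offset_shift)
{
	unsigned int offset_bits = ARM_DEMO_PTE_BITS - offset_shift;

	return (UINT64_C(1) << offset_bits) << ARM_DEMO_PAGE_SHIFT;
}

static void arm_demo_selfcheck(void)
{
	/* Old layout: shift 7, 25 offset bits, 2^37 bytes = 128 GiB. */
	assert(arm_demo_swap_capacity_bytes(7) == (UINT64_C(128) << 30));
	/* New layout: shift 8, 24 offset bits, 2^36 bytes = 64 GiB. */
	assert(arm_demo_swap_capacity_bytes(8) == (UINT64_C(64) << 30));
}
```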


@ -315,7 +315,7 @@ static int __init gate_vma_init(void)
gate_vma.vm_page_prot = PAGE_READONLY_EXEC; gate_vma.vm_page_prot = PAGE_READONLY_EXEC;
gate_vma.vm_start = 0xffff0000; gate_vma.vm_start = 0xffff0000;
gate_vma.vm_end = 0xffff0000 + PAGE_SIZE; gate_vma.vm_end = 0xffff0000 + PAGE_SIZE;
gate_vma.vm_flags = VM_READ | VM_EXEC | VM_MAYREAD | VM_MAYEXEC; vm_flags_init(&gate_vma, VM_READ | VM_EXEC | VM_MAYREAD | VM_MAYEXEC);
return 0; return 0;
} }
arch_initcall(gate_vma_init); arch_initcall(gate_vma_init);


@ -29,9 +29,9 @@ void copy_user_highpage(struct page *to, struct page *from,
void copy_highpage(struct page *to, struct page *from); void copy_highpage(struct page *to, struct page *from);
#define __HAVE_ARCH_COPY_HIGHPAGE #define __HAVE_ARCH_COPY_HIGHPAGE
struct page *alloc_zeroed_user_highpage_movable(struct vm_area_struct *vma, struct folio *vma_alloc_zeroed_movable_folio(struct vm_area_struct *vma,
unsigned long vaddr); unsigned long vaddr);
#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE_MOVABLE #define vma_alloc_zeroed_movable_folio vma_alloc_zeroed_movable_folio
void tag_clear_highpage(struct page *to); void tag_clear_highpage(struct page *to);
#define __HAVE_ARCH_TAG_CLEAR_HIGHPAGE #define __HAVE_ARCH_TAG_CLEAR_HIGHPAGE


@ -421,7 +421,6 @@ static inline pgprot_t mk_pmd_sect_prot(pgprot_t prot)
return __pgprot((pgprot_val(prot) & ~PMD_TABLE_BIT) | PMD_TYPE_SECT); return __pgprot((pgprot_val(prot) & ~PMD_TABLE_BIT) | PMD_TYPE_SECT);
} }
#define __HAVE_ARCH_PTE_SWP_EXCLUSIVE
static inline pte_t pte_swp_mkexclusive(pte_t pte) static inline pte_t pte_swp_mkexclusive(pte_t pte)
{ {
return set_pte_bit(pte, __pgprot(PTE_SWP_EXCLUSIVE)); return set_pte_bit(pte, __pgprot(PTE_SWP_EXCLUSIVE));


@ -138,13 +138,11 @@ int vdso_join_timens(struct task_struct *task, struct time_namespace *ns)
mmap_read_lock(mm); mmap_read_lock(mm);
for_each_vma(vmi, vma) { for_each_vma(vmi, vma) {
unsigned long size = vma->vm_end - vma->vm_start;
if (vma_is_special_mapping(vma, vdso_info[VDSO_ABI_AA64].dm)) if (vma_is_special_mapping(vma, vdso_info[VDSO_ABI_AA64].dm))
zap_page_range(vma, vma->vm_start, size); zap_vma_pages(vma);
#ifdef CONFIG_COMPAT_VDSO #ifdef CONFIG_COMPAT_VDSO
if (vma_is_special_mapping(vma, vdso_info[VDSO_ABI_AA32].dm)) if (vma_is_special_mapping(vma, vdso_info[VDSO_ABI_AA32].dm))
zap_page_range(vma, vma->vm_start, size); zap_vma_pages(vma);
#endif #endif
} }


@ -925,7 +925,7 @@ NOKPROBE_SYMBOL(do_debug_exception);
/* /*
* Used during anonymous page fault handling. * Used during anonymous page fault handling.
*/ */
struct page *alloc_zeroed_user_highpage_movable(struct vm_area_struct *vma, struct folio *vma_alloc_zeroed_movable_folio(struct vm_area_struct *vma,
unsigned long vaddr) unsigned long vaddr)
{ {
gfp_t flags = GFP_HIGHUSER_MOVABLE | __GFP_ZERO; gfp_t flags = GFP_HIGHUSER_MOVABLE | __GFP_ZERO;
@ -938,7 +938,7 @@ struct page *alloc_zeroed_user_highpage_movable(struct vm_area_struct *vma,
if (vma->vm_flags & VM_MTE) if (vma->vm_flags & VM_MTE)
flags |= __GFP_ZEROTAGS; flags |= __GFP_ZEROTAGS;
return alloc_page_vma(flags, vma, vaddr); return vma_alloc_folio(flags, 0, vma, vaddr, false);
} }
void tag_clear_highpage(struct page *page) void tag_clear_highpage(struct page *page)


@ -10,6 +10,9 @@
#define _PAGE_ACCESSED (1<<3) #define _PAGE_ACCESSED (1<<3)
#define _PAGE_MODIFIED (1<<4) #define _PAGE_MODIFIED (1<<4)
/* We borrow bit 9 to store the exclusive marker in swap PTEs. */
#define _PAGE_SWP_EXCLUSIVE (1<<9)
/* implemented in hardware */ /* implemented in hardware */
#define _PAGE_GLOBAL (1<<6) #define _PAGE_GLOBAL (1<<6)
#define _PAGE_VALID (1<<7) #define _PAGE_VALID (1<<7)
@ -26,7 +29,8 @@
#define _PAGE_PROT_NONE _PAGE_READ #define _PAGE_PROT_NONE _PAGE_READ
/* /*
* Encode and decode a swap entry * Encode/decode swap entries and swap PTEs. Swap PTEs are all PTEs that
* are !pte_none() && !pte_present().
* *
* Format of swap PTE: * Format of swap PTE:
* bit 0: _PAGE_PRESENT (zero) * bit 0: _PAGE_PRESENT (zero)
@ -35,15 +39,16 @@
* bit 6: _PAGE_GLOBAL (zero) * bit 6: _PAGE_GLOBAL (zero)
* bit 7: _PAGE_VALID (zero) * bit 7: _PAGE_VALID (zero)
* bit 8: swap type[4] * bit 8: swap type[4]
* bit 9 - 31: swap offset * bit 9: exclusive marker
* bit 10 - 31: swap offset
*/ */
#define __swp_type(x) ((((x).val >> 2) & 0xf) | \ #define __swp_type(x) ((((x).val >> 2) & 0xf) | \
(((x).val >> 4) & 0x10)) (((x).val >> 4) & 0x10))
#define __swp_offset(x) ((x).val >> 9) #define __swp_offset(x) ((x).val >> 10)
#define __swp_entry(type, offset) ((swp_entry_t) { \ #define __swp_entry(type, offset) ((swp_entry_t) { \
((type & 0xf) << 2) | \ ((type & 0xf) << 2) | \
((type & 0x10) << 4) | \ ((type & 0x10) << 4) | \
((offset) << 9)}) ((offset) << 10)})
#define HAVE_ARCH_UNMAPPED_AREA #define HAVE_ARCH_UNMAPPED_AREA


@ -10,6 +10,9 @@
#define _PAGE_PRESENT (1<<10) #define _PAGE_PRESENT (1<<10)
#define _PAGE_MODIFIED (1<<11) #define _PAGE_MODIFIED (1<<11)
/* We borrow bit 7 to store the exclusive marker in swap PTEs. */
#define _PAGE_SWP_EXCLUSIVE (1<<7)
/* implemented in hardware */ /* implemented in hardware */
#define _PAGE_GLOBAL (1<<0) #define _PAGE_GLOBAL (1<<0)
#define _PAGE_VALID (1<<1) #define _PAGE_VALID (1<<1)
@ -26,23 +29,25 @@
#define _PAGE_PROT_NONE _PAGE_WRITE #define _PAGE_PROT_NONE _PAGE_WRITE
/* /*
* Encode and decode a swap entry * Encode/decode swap entries and swap PTEs. Swap PTEs are all PTEs that
* are !pte_none() && !pte_present().
* *
* Format of swap PTE: * Format of swap PTE:
* bit 0: _PAGE_GLOBAL (zero) * bit 0: _PAGE_GLOBAL (zero)
* bit 1: _PAGE_VALID (zero) * bit 1: _PAGE_VALID (zero)
* bit 2 - 6: swap type * bit 2 - 6: swap type
* bit 7 - 8: swap offset[0 - 1] * bit 7: exclusive marker
* bit 8: swap offset[0]
* bit 9: _PAGE_WRITE (zero) * bit 9: _PAGE_WRITE (zero)
* bit 10: _PAGE_PRESENT (zero) * bit 10: _PAGE_PRESENT (zero)
* bit 11 - 31: swap offset[2 - 22] * bit 11 - 31: swap offset[1 - 21]
*/ */
#define __swp_type(x) (((x).val >> 2) & 0x1f) #define __swp_type(x) (((x).val >> 2) & 0x1f)
#define __swp_offset(x) ((((x).val >> 7) & 0x3) | \ #define __swp_offset(x) ((((x).val >> 8) & 0x1) | \
(((x).val >> 9) & 0x7ffffc)) (((x).val >> 10) & 0x3ffffe))
#define __swp_entry(type, offset) ((swp_entry_t) { \ #define __swp_entry(type, offset) ((swp_entry_t) { \
((type & 0x1f) << 2) | \ ((type & 0x1f) << 2) | \
((offset & 0x3) << 7) | \ ((offset & 0x1) << 8) | \
((offset & 0x7ffffc) << 9)}) ((offset & 0x3ffffe) << 10)})
#endif /* __ASM_CSKY_PGTABLE_BITS_H */ #endif /* __ASM_CSKY_PGTABLE_BITS_H */
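
Reviewer note: in the hunk above the swap offset is split around `_PAGE_WRITE` (bit 9) and `_PAGE_PRESENT` (bit 10), and bit 7 is given up for the exclusive marker. The sketch below models the new encoding in userspace to check that a full-width type and offset round-trip while every reserved bit stays zero. The `csky_demo_` / `CSKY_DEMO_` names and the reserved-bit mask are illustrative, derived from the format comment rather than taken from the kernel.

```c
#include <assert.h>
#include <stdint.h>

/* Bits that must stay zero in a freshly encoded swap entry:
 * 0 (_PAGE_GLOBAL), 1 (_PAGE_VALID), 7 (exclusive marker),
 * 9 (_PAGE_WRITE), 10 (_PAGE_PRESENT). */
#define CSKY_DEMO_RESERVED 0x683u

static inline uint32_t csky_demo_swp_entry(uint32_t type, uint32_t offset)
{
	return ((type & 0x1f) << 2) |          /* bits 2-6: type */
	       ((offset & 0x1) << 8) |         /* bit 8: offset[0] */
	       ((offset & 0x3ffffe) << 10);    /* bits 11-31: offset[1-21] */
}

static inline uint32_t csky_demo_swp_type(uint32_t val)
{
	return (val >> 2) & 0x1f;
}

static inline uint32_t csky_demo_swp_offset(uint32_t val)
{
	return ((val >> 8) & 0x1) | ((val >> 10) & 0x3ffffe);
}

static void csky_demo_selfcheck(void)
{
	uint32_t val = csky_demo_swp_entry(0x1f, 0x3fffff);

	assert((val & CSKY_DEMO_RESERVED) == 0);
	assert(csky_demo_swp_type(val) == 0x1f);
	assert(csky_demo_swp_offset(val) == 0x3fffff);
}
```

The offset is now 22 bits wide (down from 23), which is the price of freeing bit 7.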


@ -39,7 +39,6 @@
#define virt_addr_valid(kaddr) ((void *)(kaddr) >= (void *)PAGE_OFFSET && \ #define virt_addr_valid(kaddr) ((void *)(kaddr) >= (void *)PAGE_OFFSET && \
(void *)(kaddr) < high_memory) (void *)(kaddr) < high_memory)
#define pfn_valid(pfn) ((pfn) >= ARCH_PFN_OFFSET && ((pfn) - ARCH_PFN_OFFSET) < max_mapnr)
extern void *memset(void *dest, int c, size_t l); extern void *memset(void *dest, int c, size_t l);
extern void *memcpy(void *to, const void *from, size_t l); extern void *memcpy(void *to, const void *from, size_t l);


@ -200,6 +200,23 @@ static inline pte_t pte_mkyoung(pte_t pte)
return pte; return pte;
} }
static inline int pte_swp_exclusive(pte_t pte)
{
return pte_val(pte) & _PAGE_SWP_EXCLUSIVE;
}
static inline pte_t pte_swp_mkexclusive(pte_t pte)
{
pte_val(pte) |= _PAGE_SWP_EXCLUSIVE;
return pte;
}
static inline pte_t pte_swp_clear_exclusive(pte_t pte)
{
pte_val(pte) &= ~_PAGE_SWP_EXCLUSIVE;
return pte;
}
#define __HAVE_PHYS_MEM_ACCESS_PROT #define __HAVE_PHYS_MEM_ACCESS_PROT
struct file; struct file;
extern pgprot_t phys_mem_access_prot(struct file *file, unsigned long pfn, extern pgprot_t phys_mem_access_prot(struct file *file, unsigned long pfn,


@ -95,7 +95,6 @@ struct page;
/* Default vm area behavior is non-executable. */ /* Default vm area behavior is non-executable. */
#define VM_DATA_DEFAULT_FLAGS VM_DATA_FLAGS_NON_EXEC #define VM_DATA_DEFAULT_FLAGS VM_DATA_FLAGS_NON_EXEC
#define pfn_valid(pfn) ((pfn) < max_mapnr)
#define virt_addr_valid(kaddr) pfn_valid(__pa(kaddr) >> PAGE_SHIFT) #define virt_addr_valid(kaddr) pfn_valid(__pa(kaddr) >> PAGE_SHIFT)
/* Need to not use a define for linesize; may move this to another file. */ /* Need to not use a define for linesize; may move this to another file. */


@ -61,6 +61,9 @@ extern unsigned long empty_zero_page;
* So we'll put up with a bit of inefficiency for now... * So we'll put up with a bit of inefficiency for now...
*/ */
/* We borrow bit 6 to store the exclusive marker in swap PTEs. */
#define _PAGE_SWP_EXCLUSIVE (1<<6)
/* /*
* Top "FOURTH" level (pgd), which for the Hexagon VM is really * Top "FOURTH" level (pgd), which for the Hexagon VM is really
* only the second from the bottom, pgd and pud both being collapsed. * only the second from the bottom, pgd and pud both being collapsed.
@ -359,9 +362,12 @@ static inline unsigned long pmd_page_vaddr(pmd_t pmd)
#define ZERO_PAGE(vaddr) (virt_to_page(&empty_zero_page)) #define ZERO_PAGE(vaddr) (virt_to_page(&empty_zero_page))
/* /*
* Encode/decode swap entries and swap PTEs. Swap PTEs are all PTEs that
* are !pte_none() && !pte_present().
*
* Swap/file PTE definitions. If _PAGE_PRESENT is zero, the rest of the PTE is * Swap/file PTE definitions. If _PAGE_PRESENT is zero, the rest of the PTE is
* interpreted as swap information. The remaining free bits are interpreted as * interpreted as swap information. The remaining free bits are interpreted as
* swap type/offset tuple. Rather than have the TLB fill handler test * listed below. Rather than have the TLB fill handler test
* _PAGE_PRESENT, we're going to reserve the permissions bits and set them to * _PAGE_PRESENT, we're going to reserve the permissions bits and set them to
* all zeros for swap entries, which speeds up the miss handler at the cost of * all zeros for swap entries, which speeds up the miss handler at the cost of
* 3 bits of offset. That trade-off can be revisited if necessary, but Hexagon * 3 bits of offset. That trade-off can be revisited if necessary, but Hexagon
@ -371,9 +377,10 @@ static inline unsigned long pmd_page_vaddr(pmd_t pmd)
* Format of swap PTE: * Format of swap PTE:
* bit 0: Present (zero) * bit 0: Present (zero)
* bits 1-5: swap type (arch independent layer uses 5 bits max) * bits 1-5: swap type (arch independent layer uses 5 bits max)
* bits 6-9: bits 3:0 of offset * bit 6: exclusive marker
* bits 7-9: bits 2:0 of offset
* bits 10-12: effectively _PAGE_PROTNONE (all zero) * bits 10-12: effectively _PAGE_PROTNONE (all zero)
* bits 13-31: bits 22:4 of swap offset * bits 13-31: bits 21:3 of swap offset
* *
* The split offset makes some of the following macros a little gnarly, * The split offset makes some of the following macros a little gnarly,
* but there's plenty of precedent for this sort of thing. * but there's plenty of precedent for this sort of thing.
@ -383,11 +390,28 @@ static inline unsigned long pmd_page_vaddr(pmd_t pmd)
#define __swp_type(swp_pte) (((swp_pte).val >> 1) & 0x1f) #define __swp_type(swp_pte) (((swp_pte).val >> 1) & 0x1f)
#define __swp_offset(swp_pte) \ #define __swp_offset(swp_pte) \
((((swp_pte).val >> 6) & 0xf) | (((swp_pte).val >> 9) & 0x7ffff0)) ((((swp_pte).val >> 7) & 0x7) | (((swp_pte).val >> 10) & 0x3ffff8))
#define __swp_entry(type, offset) \ #define __swp_entry(type, offset) \
((swp_entry_t) { \ ((swp_entry_t) { \
((type << 1) | \ (((type & 0x1f) << 1) | \
((offset & 0x7ffff0) << 9) | ((offset & 0xf) << 6)) }) ((offset & 0x3ffff8) << 10) | ((offset & 0x7) << 7)) })
static inline int pte_swp_exclusive(pte_t pte)
{
return pte_val(pte) & _PAGE_SWP_EXCLUSIVE;
}
static inline pte_t pte_swp_mkexclusive(pte_t pte)
{
pte_val(pte) |= _PAGE_SWP_EXCLUSIVE;
return pte;
}
static inline pte_t pte_swp_clear_exclusive(pte_t pte)
{
pte_val(pte) &= ~_PAGE_SWP_EXCLUSIVE;
return pte;
}
#endif #endif


@ -82,25 +82,19 @@ do { \
} while (0) } while (0)
#define alloc_zeroed_user_highpage_movable(vma, vaddr) \ #define vma_alloc_zeroed_movable_folio(vma, vaddr) \
({ \ ({ \
struct page *page = alloc_page_vma( \ struct folio *folio = vma_alloc_folio( \
GFP_HIGHUSER_MOVABLE | __GFP_ZERO, vma, vaddr); \ GFP_HIGHUSER_MOVABLE | __GFP_ZERO, 0, vma, vaddr, false); \
if (page) \ if (folio) \
flush_dcache_page(page); \ flush_dcache_folio(folio); \
page; \ folio; \
}) })
#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE_MOVABLE
#define virt_addr_valid(kaddr) pfn_valid(__pa(kaddr) >> PAGE_SHIFT) #define virt_addr_valid(kaddr) pfn_valid(__pa(kaddr) >> PAGE_SHIFT)
#include <asm-generic/memory_model.h> #include <asm-generic/memory_model.h>
#ifdef CONFIG_FLATMEM
# define pfn_valid(pfn) ((pfn) < max_mapnr)
#endif
#define page_to_phys(page) (page_to_pfn(page) << PAGE_SHIFT) #define page_to_phys(page) (page_to_pfn(page) << PAGE_SHIFT)
#define virt_to_page(kaddr) pfn_to_page(__pa(kaddr) >> PAGE_SHIFT) #define virt_to_page(kaddr) pfn_to_page(__pa(kaddr) >> PAGE_SHIFT)
#define pfn_to_kaddr(pfn) __va((pfn) << PAGE_SHIFT) #define pfn_to_kaddr(pfn) __va((pfn) << PAGE_SHIFT)


@ -58,6 +58,9 @@
#define _PAGE_ED (__IA64_UL(1) << 52) /* exception deferral */ #define _PAGE_ED (__IA64_UL(1) << 52) /* exception deferral */
#define _PAGE_PROTNONE (__IA64_UL(1) << 63) #define _PAGE_PROTNONE (__IA64_UL(1) << 63)
/* We borrow bit 7 to store the exclusive marker in swap PTEs. */
#define _PAGE_SWP_EXCLUSIVE (1 << 7)
#define _PFN_MASK _PAGE_PPN_MASK #define _PFN_MASK _PAGE_PPN_MASK
/* Mask of bits which may be changed by pte_modify(); the odd bits are there for _PAGE_PROTNONE */ /* Mask of bits which may be changed by pte_modify(); the odd bits are there for _PAGE_PROTNONE */
#define _PAGE_CHG_MASK (_PAGE_P | _PAGE_PROTNONE | _PAGE_PL_MASK | _PAGE_AR_MASK | _PAGE_ED) #define _PAGE_CHG_MASK (_PAGE_P | _PAGE_PROTNONE | _PAGE_PL_MASK | _PAGE_AR_MASK | _PAGE_ED)
@ -399,6 +402,9 @@ extern pgd_t swapper_pg_dir[PTRS_PER_PGD];
extern void paging_init (void); extern void paging_init (void);
/* /*
* Encode/decode swap entries and swap PTEs. Swap PTEs are all PTEs that
* are !pte_none() && !pte_present().
*
* Note: The macros below rely on the fact that MAX_SWAPFILES_SHIFT <= number of * Note: The macros below rely on the fact that MAX_SWAPFILES_SHIFT <= number of
* bits in the swap-type field of the swap pte. It would be nice to * bits in the swap-type field of the swap pte. It would be nice to
* enforce that, but we can't easily include <linux/swap.h> here. * enforce that, but we can't easily include <linux/swap.h> here.
@ -406,16 +412,35 @@ extern void paging_init (void);
* *
* Format of swap pte: * Format of swap pte:
* bit 0 : present bit (must be zero) * bit 0 : present bit (must be zero)
* bits 1- 7: swap-type * bits 1- 6: swap type
* bit 7 : exclusive marker
* bits 8-62: swap offset * bits 8-62: swap offset
* bit 63 : _PAGE_PROTNONE bit * bit 63 : _PAGE_PROTNONE bit
*/ */
#define __swp_type(entry) (((entry).val >> 1) & 0x7f) #define __swp_type(entry) (((entry).val >> 1) & 0x3f)
#define __swp_offset(entry) (((entry).val << 1) >> 9) #define __swp_offset(entry) (((entry).val << 1) >> 9)
#define __swp_entry(type,offset) ((swp_entry_t) { ((type) << 1) | ((long) (offset) << 8) }) #define __swp_entry(type, offset) ((swp_entry_t) { ((type & 0x3f) << 1) | \
((long) (offset) << 8) })
#define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val(pte) }) #define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val(pte) })
#define __swp_entry_to_pte(x) ((pte_t) { (x).val }) #define __swp_entry_to_pte(x) ((pte_t) { (x).val })
static inline int pte_swp_exclusive(pte_t pte)
{
return pte_val(pte) & _PAGE_SWP_EXCLUSIVE;
}
static inline pte_t pte_swp_mkexclusive(pte_t pte)
{
pte_val(pte) |= _PAGE_SWP_EXCLUSIVE;
return pte;
}
static inline pte_t pte_swp_clear_exclusive(pte_t pte)
{
pte_val(pte) &= ~_PAGE_SWP_EXCLUSIVE;
return pte;
}
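
Reviewer note: the ia64 layout above keeps `_PAGE_PROTNONE` in bit 63, so `__swp_offset()` shifts left by one before shifting right by nine to discard that bit. The userspace sketch below models this with `uint64_t` in place of `pte_t`; the `ia64_demo_` / `IA64_DEMO_` names are hypothetical. It checks that neither bit 63 nor the new exclusive marker in bit 7 leaks into the decoded fields.

```c
#include <assert.h>
#include <stdint.h>

#define IA64_DEMO_PROTNONE      (UINT64_C(1) << 63)
#define IA64_DEMO_SWP_EXCLUSIVE (UINT64_C(1) << 7)

/* bits 1-6: type, bit 7: exclusive marker (left clear), bits 8-62: offset */
static inline uint64_t ia64_demo_swp_entry(uint64_t type, uint64_t offset)
{
	return ((type & 0x3f) << 1) | (offset << 8);
}

static inline uint64_t ia64_demo_swp_type(uint64_t val)
{
	return (val >> 1) & 0x3f;
}

/* "<< 1" drops bit 63; ">> 9" then removes the low 8 bits as well. */
static inline uint64_t ia64_demo_swp_offset(uint64_t val)
{
	return (val << 1) >> 9;
}

static void ia64_demo_selfcheck(void)
{
	uint64_t val = ia64_demo_swp_entry(0x3f, 0x123456);

	val |= IA64_DEMO_PROTNONE | IA64_DEMO_SWP_EXCLUSIVE;
	assert(ia64_demo_swp_type(val) == 0x3f);
	assert(ia64_demo_swp_offset(val) == 0x123456);
	/* With the old 0x7f mask the marker would show up as type bit 6. */
	assert(((val >> 1) & 0x7f) == 0x7f);
}
```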
/* /*
* ZERO_PAGE is a global shared page that is always zero: used * ZERO_PAGE is a global shared page that is always zero: used
* for zero-mapped memory areas etc.. * for zero-mapped memory areas etc..


@ -109,7 +109,7 @@ ia64_init_addr_space (void)
vma_set_anonymous(vma); vma_set_anonymous(vma);
vma->vm_start = current->thread.rbs_bot & PAGE_MASK; vma->vm_start = current->thread.rbs_bot & PAGE_MASK;
vma->vm_end = vma->vm_start + PAGE_SIZE; vma->vm_end = vma->vm_start + PAGE_SIZE;
vma->vm_flags = VM_DATA_DEFAULT_FLAGS|VM_GROWSUP|VM_ACCOUNT; vm_flags_init(vma, VM_DATA_DEFAULT_FLAGS|VM_GROWSUP|VM_ACCOUNT);
vma->vm_page_prot = vm_get_page_prot(vma->vm_flags); vma->vm_page_prot = vm_get_page_prot(vma->vm_flags);
mmap_write_lock(current->mm); mmap_write_lock(current->mm);
if (insert_vm_struct(current->mm, vma)) { if (insert_vm_struct(current->mm, vma)) {
@ -127,8 +127,8 @@ ia64_init_addr_space (void)
vma_set_anonymous(vma); vma_set_anonymous(vma);
vma->vm_end = PAGE_SIZE; vma->vm_end = PAGE_SIZE;
vma->vm_page_prot = __pgprot(pgprot_val(PAGE_READONLY) | _PAGE_MA_NAT); vma->vm_page_prot = __pgprot(pgprot_val(PAGE_READONLY) | _PAGE_MA_NAT);
vma->vm_flags = VM_READ | VM_MAYREAD | VM_IO | vm_flags_init(vma, VM_READ | VM_MAYREAD | VM_IO |
VM_DONTEXPAND | VM_DONTDUMP; VM_DONTEXPAND | VM_DONTDUMP);
mmap_write_lock(current->mm); mmap_write_lock(current->mm);
if (insert_vm_struct(current->mm, vma)) { if (insert_vm_struct(current->mm, vma)) {
mmap_write_unlock(current->mm); mmap_write_unlock(current->mm);
@ -272,7 +272,7 @@ static int __init gate_vma_init(void)
vma_init(&gate_vma, NULL); vma_init(&gate_vma, NULL);
gate_vma.vm_start = FIXADDR_USER_START; gate_vma.vm_start = FIXADDR_USER_START;
gate_vma.vm_end = FIXADDR_USER_END; gate_vma.vm_end = FIXADDR_USER_END;
gate_vma.vm_flags = VM_READ | VM_MAYREAD | VM_EXEC | VM_MAYEXEC; vm_flags_init(&gate_vma, VM_READ | VM_MAYREAD | VM_EXEC | VM_MAYEXEC);
gate_vma.vm_page_prot = __pgprot(__ACCESS_BITS | _PAGE_PL_3 | _PAGE_AR_RX); gate_vma.vm_page_prot = __pgprot(__ACCESS_BITS | _PAGE_PL_3 | _PAGE_AR_RX);
return 0; return 0;


@ -82,19 +82,6 @@ typedef struct { unsigned long pgprot; } pgprot_t;
#define pfn_to_kaddr(pfn) __va((pfn) << PAGE_SHIFT) #define pfn_to_kaddr(pfn) __va((pfn) << PAGE_SHIFT)
#ifdef CONFIG_FLATMEM
static inline int pfn_valid(unsigned long pfn)
{
/* avoid <linux/mm.h> include hell */
extern unsigned long max_mapnr;
unsigned long pfn_offset = ARCH_PFN_OFFSET;
return pfn >= pfn_offset && pfn < max_mapnr;
}
#endif
#define virt_to_pfn(kaddr) PFN_DOWN(PHYSADDR(kaddr)) #define virt_to_pfn(kaddr) PFN_DOWN(PHYSADDR(kaddr))
#define virt_to_page(kaddr) pfn_to_page(virt_to_pfn(kaddr)) #define virt_to_page(kaddr) pfn_to_page(virt_to_pfn(kaddr))


@ -20,6 +20,7 @@
#define _PAGE_SPECIAL_SHIFT 11 #define _PAGE_SPECIAL_SHIFT 11
#define _PAGE_HGLOBAL_SHIFT 12 /* HGlobal is a PMD bit */ #define _PAGE_HGLOBAL_SHIFT 12 /* HGlobal is a PMD bit */
#define _PAGE_PFN_SHIFT 12 #define _PAGE_PFN_SHIFT 12
#define _PAGE_SWP_EXCLUSIVE_SHIFT 23
#define _PAGE_PFN_END_SHIFT 48 #define _PAGE_PFN_END_SHIFT 48
#define _PAGE_NO_READ_SHIFT 61 #define _PAGE_NO_READ_SHIFT 61
#define _PAGE_NO_EXEC_SHIFT 62 #define _PAGE_NO_EXEC_SHIFT 62
@ -33,6 +34,9 @@
#define _PAGE_PROTNONE (_ULCAST_(1) << _PAGE_PROTNONE_SHIFT) #define _PAGE_PROTNONE (_ULCAST_(1) << _PAGE_PROTNONE_SHIFT)
#define _PAGE_SPECIAL (_ULCAST_(1) << _PAGE_SPECIAL_SHIFT) #define _PAGE_SPECIAL (_ULCAST_(1) << _PAGE_SPECIAL_SHIFT)
/* We borrow bit 23 to store the exclusive marker in swap PTEs. */
#define _PAGE_SWP_EXCLUSIVE (_ULCAST_(1) << _PAGE_SWP_EXCLUSIVE_SHIFT)
/* Used by TLB hardware (placed in EntryLo*) */ /* Used by TLB hardware (placed in EntryLo*) */
#define _PAGE_VALID (_ULCAST_(1) << _PAGE_VALID_SHIFT) #define _PAGE_VALID (_ULCAST_(1) << _PAGE_VALID_SHIFT)
#define _PAGE_DIRTY (_ULCAST_(1) << _PAGE_DIRTY_SHIFT) #define _PAGE_DIRTY (_ULCAST_(1) << _PAGE_DIRTY_SHIFT)


@ -249,13 +249,26 @@ extern void pud_init(void *addr);
extern void pmd_init(void *addr); extern void pmd_init(void *addr);
/* /*
* Non-present pages: high 40 bits are offset, next 8 bits type, * Encode/decode swap entries and swap PTEs. Swap PTEs are all PTEs that
* low 16 bits zero. * are !pte_none() && !pte_present().
*
* Format of swap PTEs:
*
* 6 6 6 6 5 5 5 5 5 5 5 5 5 5 4 4 4 4 4 4 4 4 4 4 3 3 3 3 3 3 3 3
* 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2
* <--------------------------- offset ---------------------------
*
* 3 3 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1
* 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0
* --------------> E <--- type ---> <---------- zeroes ---------->
*
* E is the exclusive marker that is not stored in swap entries.
* The zero'ed bits include _PAGE_PRESENT and _PAGE_PROTNONE.
*/ */
static inline pte_t mk_swap_pte(unsigned long type, unsigned long offset) static inline pte_t mk_swap_pte(unsigned long type, unsigned long offset)
{ pte_t pte; pte_val(pte) = (type << 16) | (offset << 24); return pte; } { pte_t pte; pte_val(pte) = ((type & 0x7f) << 16) | (offset << 24); return pte; }
#define __swp_type(x) (((x).val >> 16) & 0xff) #define __swp_type(x) (((x).val >> 16) & 0x7f)
#define __swp_offset(x) ((x).val >> 24) #define __swp_offset(x) ((x).val >> 24)
#define __swp_entry(type, offset) ((swp_entry_t) { pte_val(mk_swap_pte((type), (offset))) }) #define __swp_entry(type, offset) ((swp_entry_t) { pte_val(mk_swap_pte((type), (offset))) })
#define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val(pte) }) #define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val(pte) })
@ -263,6 +276,23 @@ static inline pte_t mk_swap_pte(unsigned long type, unsigned long offset)
#define __pmd_to_swp_entry(pmd) ((swp_entry_t) { pmd_val(pmd) }) #define __pmd_to_swp_entry(pmd) ((swp_entry_t) { pmd_val(pmd) })
#define __swp_entry_to_pmd(x) ((pmd_t) { (x).val | _PAGE_HUGE }) #define __swp_entry_to_pmd(x) ((pmd_t) { (x).val | _PAGE_HUGE })
static inline int pte_swp_exclusive(pte_t pte)
{
return pte_val(pte) & _PAGE_SWP_EXCLUSIVE;
}
static inline pte_t pte_swp_mkexclusive(pte_t pte)
{
pte_val(pte) |= _PAGE_SWP_EXCLUSIVE;
return pte;
}
static inline pte_t pte_swp_clear_exclusive(pte_t pte)
{
pte_val(pte) &= ~_PAGE_SWP_EXCLUSIVE;
return pte;
}
extern void paging_init(void); extern void paging_init(void);
#define pte_none(pte) (!(pte_val(pte) & ~_PAGE_GLOBAL)) #define pte_none(pte) (!(pte_val(pte) & ~_PAGE_GLOBAL))


@ -149,7 +149,7 @@ static inline void tlb_flush(struct mmu_gather *tlb)
struct vm_area_struct vma; struct vm_area_struct vma;
vma.vm_mm = tlb->mm; vma.vm_mm = tlb->mm;
vma.vm_flags = 0; vm_flags_init(&vma, 0);
if (tlb->fullmm) { if (tlb->fullmm) {
flush_tlb_mm(tlb->mm); flush_tlb_mm(tlb->mm);
return; return;


@ -46,6 +46,9 @@
#define _CACHEMASK040 (~0x060) #define _CACHEMASK040 (~0x060)
#define _PAGE_GLOBAL040 0x400 /* 68040 global bit, used for kva descs */ #define _PAGE_GLOBAL040 0x400 /* 68040 global bit, used for kva descs */
/* We borrow bit 24 to store the exclusive marker in swap PTEs. */
#define _PAGE_SWP_EXCLUSIVE CF_PAGE_NOCACHE
/* /*
* Externally used page protection values. * Externally used page protection values.
*/ */
@ -254,15 +257,41 @@ static inline pte_t pte_mkcache(pte_t pte)
extern pgd_t kernel_pg_dir[PTRS_PER_PGD]; extern pgd_t kernel_pg_dir[PTRS_PER_PGD];
/* /*
* Encode and de-code a swap entry (must be !pte_none(e) && !pte_present(e)) * Encode/decode swap entries and swap PTEs. Swap PTEs are all PTEs that
* are !pte_none() && !pte_present().
*
* Format of swap PTEs:
*
* 3 3 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1
* 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0
* <------------------ offset -------------> 0 0 0 E <-- type --->
*
* E is the exclusive marker that is not stored in swap entries.
*/ */
#define __swp_type(x) ((x).val & 0xFF) #define __swp_type(x) ((x).val & 0x7f)
#define __swp_offset(x) ((x).val >> 11) #define __swp_offset(x) ((x).val >> 11)
#define __swp_entry(typ, off) ((swp_entry_t) { (typ) | \ #define __swp_entry(typ, off) ((swp_entry_t) { ((typ) & 0x7f) | \
(off << 11) }) (off << 11) })
#define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val(pte) }) #define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val(pte) })
#define __swp_entry_to_pte(x) (__pte((x).val)) #define __swp_entry_to_pte(x) (__pte((x).val))
static inline int pte_swp_exclusive(pte_t pte)
{
return pte_val(pte) & _PAGE_SWP_EXCLUSIVE;
}
static inline pte_t pte_swp_mkexclusive(pte_t pte)
{
pte_val(pte) |= _PAGE_SWP_EXCLUSIVE;
return pte;
}
static inline pte_t pte_swp_clear_exclusive(pte_t pte)
{
pte_val(pte) &= ~_PAGE_SWP_EXCLUSIVE;
return pte;
}
#define pmd_pfn(pmd) (pmd_val(pmd) >> PAGE_SHIFT) #define pmd_pfn(pmd) (pmd_val(pmd) >> PAGE_SHIFT)
#define pmd_page(pmd) (pfn_to_page(pmd_val(pmd) >> PAGE_SHIFT)) #define pmd_page(pmd) (pfn_to_page(pmd_val(pmd) >> PAGE_SHIFT))

@@ -41,6 +41,9 @@
#define _PAGE_PROTNONE 0x004 #define _PAGE_PROTNONE 0x004
/* We borrow bit 11 to store the exclusive marker in swap PTEs. */
#define _PAGE_SWP_EXCLUSIVE 0x800
#ifndef __ASSEMBLY__ #ifndef __ASSEMBLY__
/* This is the cache mode to be used for pages containing page descriptors for /* This is the cache mode to be used for pages containing page descriptors for
@@ -124,7 +127,7 @@ static inline void pud_set(pud_t *pudp, pmd_t *pmdp)
* expects pmd_page() to exists, only to then DCE it all. Provide a dummy to * expects pmd_page() to exists, only to then DCE it all. Provide a dummy to
* make the compiler happy. * make the compiler happy.
*/ */
#define pmd_page(pmd) NULL #define pmd_page(pmd) ((struct page *)NULL)
#define pud_none(pud) (!pud_val(pud)) #define pud_none(pud) (!pud_val(pud))
@@ -169,12 +172,40 @@ static inline pte_t pte_mkcache(pte_t pte)
#define swapper_pg_dir kernel_pg_dir #define swapper_pg_dir kernel_pg_dir
extern pgd_t kernel_pg_dir[128]; extern pgd_t kernel_pg_dir[128];
/* Encode and de-code a swap entry (must be !pte_none(e) && !pte_present(e)) */ /*
#define __swp_type(x) (((x).val >> 4) & 0xff) * Encode/decode swap entries and swap PTEs. Swap PTEs are all PTEs that
* are !pte_none() && !pte_present().
*
* Format of swap PTEs:
*
* 3 3 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1
* 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0
* <----------------- offset ------------> E <-- type ---> 0 0 0 0
*
* E is the exclusive marker that is not stored in swap entries.
*/
#define __swp_type(x) (((x).val >> 4) & 0x7f)
#define __swp_offset(x) ((x).val >> 12) #define __swp_offset(x) ((x).val >> 12)
#define __swp_entry(type, offset) ((swp_entry_t) { ((type) << 4) | ((offset) << 12) }) #define __swp_entry(type, offset) ((swp_entry_t) { (((type) & 0x7f) << 4) | ((offset) << 12) })
#define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val(pte) }) #define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val(pte) })
#define __swp_entry_to_pte(x) ((pte_t) { (x).val }) #define __swp_entry_to_pte(x) ((pte_t) { (x).val })
static inline int pte_swp_exclusive(pte_t pte)
{
return pte_val(pte) & _PAGE_SWP_EXCLUSIVE;
}
static inline pte_t pte_swp_mkexclusive(pte_t pte)
{
pte_val(pte) |= _PAGE_SWP_EXCLUSIVE;
return pte;
}
static inline pte_t pte_swp_clear_exclusive(pte_t pte)
{
pte_val(pte) &= ~_PAGE_SWP_EXCLUSIVE;
return pte;
}
#endif /* !__ASSEMBLY__ */ #endif /* !__ASSEMBLY__ */
#endif /* _MOTOROLA_PGTABLE_H */ #endif /* _MOTOROLA_PGTABLE_H */
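
The layout documented in the hunk above (type in bits 4 to 10, exclusive marker in bit 11, offset from bit 12 upwards) can be tried out in user space. The following is only a minimal sketch under stated assumptions, not kernel code: the swap entry is modelled as a bare `uint32_t`, and every `demo_`/`DEMO_` name is hypothetical.

```c
#include <stdint.h>

/* Bit 11, mirroring the _PAGE_SWP_EXCLUSIVE value (0x800) in the hunk. */
#define DEMO_SWP_EXCLUSIVE UINT32_C(0x800)

/* Build a swap entry: 7-bit type at bit 4, offset starting at bit 12. */
static inline uint32_t demo_swp_entry(uint32_t type, uint32_t offset)
{
	return ((type & 0x7fu) << 4) | (offset << 12);
}

/* Decode the two fields again; bit 11 is ignored by both decoders. */
static inline uint32_t demo_swp_type(uint32_t val)
{
	return (val >> 4) & 0x7fu;
}

static inline uint32_t demo_swp_offset(uint32_t val)
{
	return val >> 12;
}

/* Set or clear the exclusive marker on a swap PTE value. */
static inline uint32_t demo_mkexclusive(uint32_t pte)
{
	return pte | DEMO_SWP_EXCLUSIVE;
}

static inline uint32_t demo_clear_exclusive(uint32_t pte)
{
	return pte & ~DEMO_SWP_EXCLUSIVE;
}
```

Because the type mask shrinks from 8 to 7 bits, bit 11 is free, so setting the marker leaves both the decoded type and the decoded offset unchanged.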

@@ -62,11 +62,7 @@ extern unsigned long _ramend;
#include <asm/page_no.h> #include <asm/page_no.h>
#endif #endif
#ifndef CONFIG_MMU
#define __phys_to_pfn(paddr) ((unsigned long)((paddr) >> PAGE_SHIFT))
#define __pfn_to_phys(pfn) PFN_PHYS(pfn)
#endif
#include <asm-generic/getorder.h> #include <asm-generic/getorder.h>
#include <asm-generic/memory_model.h>
#endif /* _M68K_PAGE_H */ #endif /* _M68K_PAGE_H */

@@ -134,7 +134,6 @@ extern int m68k_virt_to_node_shift;
}) })
#define ARCH_PFN_OFFSET (m68k_memory[0].addr >> PAGE_SHIFT) #define ARCH_PFN_OFFSET (m68k_memory[0].addr >> PAGE_SHIFT)
#include <asm-generic/memory_model.h>
#define virt_addr_valid(kaddr) ((unsigned long)(kaddr) >= PAGE_OFFSET && (unsigned long)(kaddr) < (unsigned long)high_memory) #define virt_addr_valid(kaddr) ((unsigned long)(kaddr) >= PAGE_OFFSET && (unsigned long)(kaddr) < (unsigned long)high_memory)
#define pfn_valid(pfn) virt_addr_valid(pfn_to_virt(pfn)) #define pfn_valid(pfn) virt_addr_valid(pfn_to_virt(pfn))

@@ -13,9 +13,8 @@ extern unsigned long memory_end;
#define clear_user_page(page, vaddr, pg) clear_page(page) #define clear_user_page(page, vaddr, pg) clear_page(page)
#define copy_user_page(to, from, vaddr, pg) copy_page(to, from) #define copy_user_page(to, from, vaddr, pg) copy_page(to, from)
#define alloc_zeroed_user_highpage_movable(vma, vaddr) \ #define vma_alloc_zeroed_movable_folio(vma, vaddr) \
alloc_page_vma(GFP_HIGHUSER_MOVABLE | __GFP_ZERO, vma, vaddr) vma_alloc_folio(GFP_HIGHUSER_MOVABLE | __GFP_ZERO, 0, vma, vaddr, false)
#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE_MOVABLE
#define __pa(vaddr) ((unsigned long)(vaddr)) #define __pa(vaddr) ((unsigned long)(vaddr))
#define __va(paddr) ((void *)((unsigned long)(paddr))) #define __va(paddr) ((void *)((unsigned long)(paddr)))
@@ -26,13 +25,11 @@ extern unsigned long memory_end;
#define virt_to_page(addr) (mem_map + (((unsigned long)(addr)-PAGE_OFFSET) >> PAGE_SHIFT)) #define virt_to_page(addr) (mem_map + (((unsigned long)(addr)-PAGE_OFFSET) >> PAGE_SHIFT))
#define page_to_virt(page) __va(((((page) - mem_map) << PAGE_SHIFT) + PAGE_OFFSET)) #define page_to_virt(page) __va(((((page) - mem_map) << PAGE_SHIFT) + PAGE_OFFSET))
#define pfn_to_page(pfn) virt_to_page(pfn_to_virt(pfn))
#define page_to_pfn(page) virt_to_pfn(page_to_virt(page))
#define pfn_valid(pfn) ((pfn) < max_mapnr)
#define virt_addr_valid(kaddr) (((unsigned long)(kaddr) >= PAGE_OFFSET) && \ #define virt_addr_valid(kaddr) (((unsigned long)(kaddr) >= PAGE_OFFSET) && \
((unsigned long)(kaddr) < memory_end)) ((unsigned long)(kaddr) < memory_end))
#define ARCH_PFN_OFFSET PHYS_PFN(PAGE_OFFSET_RAW)
#endif /* __ASSEMBLY__ */ #endif /* __ASSEMBLY__ */
#endif /* _M68K_PAGE_NO_H */ #endif /* _M68K_PAGE_NO_H */

@@ -31,12 +31,6 @@
extern void paging_init(void); extern void paging_init(void);
#define swapper_pg_dir ((pgd_t *) 0) #define swapper_pg_dir ((pgd_t *) 0)
#define __swp_type(x) (0)
#define __swp_offset(x) (0)
#define __swp_entry(typ,off) ((swp_entry_t) { ((typ) | ((off) << 7)) })
#define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val(pte) })
#define __swp_entry_to_pte(x) ((pte_t) { (x).val })
/* /*
* ZERO_PAGE is a global shared page that is always zero: used * ZERO_PAGE is a global shared page that is always zero: used
* for zero-mapped memory areas etc.. * for zero-mapped memory areas etc..

@@ -71,6 +71,9 @@
#define SUN3_PMD_MASK (0x0000003F) #define SUN3_PMD_MASK (0x0000003F)
#define SUN3_PMD_MAGIC (0x0000002B) #define SUN3_PMD_MAGIC (0x0000002B)
/* We borrow bit 6 to store the exclusive marker in swap PTEs. */
#define _PAGE_SWP_EXCLUSIVE 0x040
#ifndef __ASSEMBLY__ #ifndef __ASSEMBLY__
/* /*
@@ -152,12 +155,41 @@ static inline pte_t pte_mkcache(pte_t pte) { return pte; }
extern pgd_t swapper_pg_dir[PTRS_PER_PGD]; extern pgd_t swapper_pg_dir[PTRS_PER_PGD];
extern pgd_t kernel_pg_dir[PTRS_PER_PGD]; extern pgd_t kernel_pg_dir[PTRS_PER_PGD];
/* Macros to (de)construct the fake PTEs representing swap pages. */ /*
#define __swp_type(x) ((x).val & 0x7F) * Encode/decode swap entries and swap PTEs. Swap PTEs are all PTEs that
* are !pte_none() && !pte_present().
*
* Format of swap PTEs:
*
* 3 3 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1
* 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0
* 0 <--------------------- offset ----------------> E <- type -->
*
* E is the exclusive marker that is not stored in swap entries.
*/
#define __swp_type(x) ((x).val & 0x3f)
#define __swp_offset(x) (((x).val) >> 7) #define __swp_offset(x) (((x).val) >> 7)
#define __swp_entry(type,offset) ((swp_entry_t) { ((type) | ((offset) << 7)) }) #define __swp_entry(type, offset) ((swp_entry_t) { (((type) & 0x3f) | \
(((offset) << 7) & ~SUN3_PAGE_VALID)) })
#define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val(pte) }) #define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val(pte) })
#define __swp_entry_to_pte(x) ((pte_t) { (x).val }) #define __swp_entry_to_pte(x) ((pte_t) { (x).val })
static inline int pte_swp_exclusive(pte_t pte)
{
return pte_val(pte) & _PAGE_SWP_EXCLUSIVE;
}
static inline pte_t pte_swp_mkexclusive(pte_t pte)
{
pte_val(pte) |= _PAGE_SWP_EXCLUSIVE;
return pte;
}
static inline pte_t pte_swp_clear_exclusive(pte_t pte)
{
pte_val(pte) &= ~_PAGE_SWP_EXCLUSIVE;
return pte;
}
#endif /* !__ASSEMBLY__ */ #endif /* !__ASSEMBLY__ */
#endif /* !_SUN3_PGTABLE_H */ #endif /* !_SUN3_PGTABLE_H */

@@ -112,7 +112,6 @@ extern int page_is_ram(unsigned long pfn);
# define page_to_phys(page) (page_to_pfn(page) << PAGE_SHIFT) # define page_to_phys(page) (page_to_pfn(page) << PAGE_SHIFT)
# define ARCH_PFN_OFFSET (memory_start >> PAGE_SHIFT) # define ARCH_PFN_OFFSET (memory_start >> PAGE_SHIFT)
# define pfn_valid(pfn) ((pfn) >= ARCH_PFN_OFFSET && (pfn) < (max_mapnr + ARCH_PFN_OFFSET))
# endif /* __ASSEMBLY__ */ # endif /* __ASSEMBLY__ */
#define virt_addr_valid(vaddr) (pfn_valid(virt_to_pfn(vaddr))) #define virt_addr_valid(vaddr) (pfn_valid(virt_to_pfn(vaddr)))

@@ -131,10 +131,10 @@ extern pte_t *va_to_pte(unsigned long address);
* of the 16 available. Bit 24-26 of the TLB are cleared in the TLB * of the 16 available. Bit 24-26 of the TLB are cleared in the TLB
* miss handler. Bit 27 is PAGE_USER, thus selecting the correct * miss handler. Bit 27 is PAGE_USER, thus selecting the correct
* zone. * zone.
* - PRESENT *must* be in the bottom two bits because swap cache * - PRESENT *must* be in the bottom two bits because swap PTEs use the top
* entries use the top 30 bits. Because 4xx doesn't support SMP * 30 bits. Because 4xx doesn't support SMP anyway, M is irrelevant so we
* anyway, M is irrelevant so we borrow it for PAGE_PRESENT. Bit 30 * borrow it for PAGE_PRESENT. Bit 30 is cleared in the TLB miss handler
* is cleared in the TLB miss handler before the TLB entry is loaded. * before the TLB entry is loaded.
* - All other bits of the PTE are loaded into TLBLO without * - All other bits of the PTE are loaded into TLBLO without
* * modification, leaving us only the bits 20, 21, 24, 25, 26, 30 for * * modification, leaving us only the bits 20, 21, 24, 25, 26, 30 for
* software PTE bits. We actually use bits 21, 24, 25, and * software PTE bits. We actually use bits 21, 24, 25, and
@@ -155,6 +155,9 @@ extern pte_t *va_to_pte(unsigned long address);
#define _PAGE_ACCESSED 0x400 /* software: R: page referenced */ #define _PAGE_ACCESSED 0x400 /* software: R: page referenced */
#define _PMD_PRESENT PAGE_MASK #define _PMD_PRESENT PAGE_MASK
/* We borrow bit 24 to store the exclusive marker in swap PTEs. */
#define _PAGE_SWP_EXCLUSIVE _PAGE_DIRTY
/* /*
* Some bits are unused... * Some bits are unused...
*/ */
@@ -393,18 +396,39 @@ static inline unsigned long pmd_page_vaddr(pmd_t pmd)
extern pgd_t swapper_pg_dir[PTRS_PER_PGD]; extern pgd_t swapper_pg_dir[PTRS_PER_PGD];
/* /*
* Encode and decode a swap entry. * Encode/decode swap entries and swap PTEs. Swap PTEs are all PTEs that
* Note that the bits we use in a PTE for representing a swap entry * are !pte_none() && !pte_present().
* must not include the _PAGE_PRESENT bit, or the _PAGE_HASHPTE bit *
* (if used). -- paulus * 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3
* 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
* <------------------ offset -------------------> E < type -> 0 0
*
* E is the exclusive marker that is not stored in swap entries.
*/ */
#define __swp_type(entry) ((entry).val & 0x3f) #define __swp_type(entry) ((entry).val & 0x1f)
#define __swp_offset(entry) ((entry).val >> 6) #define __swp_offset(entry) ((entry).val >> 6)
#define __swp_entry(type, offset) \ #define __swp_entry(type, offset) \
((swp_entry_t) { (type) | ((offset) << 6) }) ((swp_entry_t) { ((type) & 0x1f) | ((offset) << 6) })
#define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val(pte) >> 2 }) #define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val(pte) >> 2 })
#define __swp_entry_to_pte(x) ((pte_t) { (x).val << 2 }) #define __swp_entry_to_pte(x) ((pte_t) { (x).val << 2 })
static inline int pte_swp_exclusive(pte_t pte)
{
return pte_val(pte) & _PAGE_SWP_EXCLUSIVE;
}
static inline pte_t pte_swp_mkexclusive(pte_t pte)
{
pte_val(pte) |= _PAGE_SWP_EXCLUSIVE;
return pte;
}
static inline pte_t pte_swp_clear_exclusive(pte_t pte)
{
pte_val(pte) &= ~_PAGE_SWP_EXCLUSIVE;
return pte;
}
extern unsigned long iopa(unsigned long addr); extern unsigned long iopa(unsigned long addr);
/* Values for nocacheflag and cmode */ /* Values for nocacheflag and cmode */
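
The swap PTE diagram in the hunk above numbers bits MSB-first (bit 0 is the most significant bit), and the swap PTE differs from the swap entry by a shift of 2. The sketch below illustrates both points in user space. It assumes that "bit 24" in that numbering corresponds to the mask `0x80`; this is derived from the diagram, not from the `_PAGE_DIRTY` definition, which the diff does not show. All `demo_` names are hypothetical.

```c
#include <stdint.h>

/* Convert an MSB-0 bit number (bit 0 = most significant) of a 32-bit word to a mask. */
static inline uint32_t demo_msb0_mask(unsigned int bit)
{
	return UINT32_C(1) << (31u - bit);
}

/* Swap entry: 5-bit type in the low bits, offset starting at bit 6. */
static inline uint32_t demo_swp_entry(uint32_t type, uint32_t offset)
{
	return (type & 0x1fu) | (offset << 6);
}

static inline uint32_t demo_swp_type(uint32_t entry)
{
	return entry & 0x1fu;
}

static inline uint32_t demo_swp_offset(uint32_t entry)
{
	return entry >> 6;
}

/* Swap PTE <-> swap entry: the two low PTE bits stay zero. */
static inline uint32_t demo_entry_to_pte(uint32_t entry)
{
	return entry << 2;
}

static inline uint32_t demo_pte_to_entry(uint32_t pte)
{
	return pte >> 2;
}
```

In this model the exclusive marker lands on bit 5 of the shifted-down value, which neither `demo_swp_type()` nor `demo_swp_offset()` reads, so a marked PTE still decodes to the same type and offset.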

@@ -224,34 +224,6 @@ extern phys_addr_t __phys_addr_symbol(unsigned long x);
#define pfn_to_kaddr(pfn) __va((pfn) << PAGE_SHIFT) #define pfn_to_kaddr(pfn) __va((pfn) << PAGE_SHIFT)
#ifdef CONFIG_FLATMEM
static inline int pfn_valid(unsigned long pfn)
{
/* avoid <linux/mm.h> include hell */
extern unsigned long max_mapnr;
unsigned long pfn_offset = ARCH_PFN_OFFSET;
return pfn >= pfn_offset && pfn < max_mapnr;
}
#elif defined(CONFIG_SPARSEMEM)
/* pfn_valid is defined in linux/mmzone.h */
#elif defined(CONFIG_NUMA)
#define pfn_valid(pfn) \
({ \
unsigned long __pfn = (pfn); \
int __n = pfn_to_nid(__pfn); \
((__n >= 0) ? (__pfn < NODE_DATA(__n)->node_start_pfn + \
NODE_DATA(__n)->node_spanned_pages) \
: 0); \
})
#endif
#define virt_to_pfn(kaddr) PFN_DOWN(virt_to_phys((void *)(kaddr))) #define virt_to_pfn(kaddr) PFN_DOWN(virt_to_phys((void *)(kaddr)))
#define virt_to_page(kaddr) pfn_to_page(virt_to_pfn(kaddr)) #define virt_to_page(kaddr) pfn_to_page(virt_to_pfn(kaddr))

@@ -191,49 +191,113 @@ static inline pte_t pfn_pte(unsigned long pfn, pgprot_t prot)
#define pte_page(x) pfn_to_page(pte_pfn(x)) #define pte_page(x) pfn_to_page(pte_pfn(x))
/*
* Encode/decode swap entries and swap PTEs. Swap PTEs are all PTEs that
* are !pte_none() && !pte_present().
*/
#if defined(CONFIG_CPU_R3K_TLB) #if defined(CONFIG_CPU_R3K_TLB)
/* Swap entries must have VALID bit cleared. */ /*
* Format of swap PTEs:
*
* 3 3 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1
* 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0
* <----------- offset ------------> < type -> V G E 0 0 0 0 0 0 P
*
* E is the exclusive marker that is not stored in swap entries.
* _PAGE_PRESENT (P), _PAGE_VALID (V) and_PAGE_GLOBAL (G) have to remain
* unused.
*/
#define __swp_type(x) (((x).val >> 10) & 0x1f) #define __swp_type(x) (((x).val >> 10) & 0x1f)
#define __swp_offset(x) ((x).val >> 15) #define __swp_offset(x) ((x).val >> 15)
#define __swp_entry(type,offset) ((swp_entry_t) { ((type) << 10) | ((offset) << 15) }) #define __swp_entry(type, offset) ((swp_entry_t) { (((type) & 0x1f) << 10) | ((offset) << 15) })
#define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val(pte) }) #define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val(pte) })
#define __swp_entry_to_pte(x) ((pte_t) { (x).val }) #define __swp_entry_to_pte(x) ((pte_t) { (x).val })
/* We borrow bit 7 to store the exclusive marker in swap PTEs. */
#define _PAGE_SWP_EXCLUSIVE (1 << 7)
#else #else
#if defined(CONFIG_XPA) #if defined(CONFIG_XPA)
/* Swap entries must have VALID and GLOBAL bits cleared. */ /*
* Format of swap PTEs:
*
* 6 6 6 6 5 5 5 5 5 5 5 5 5 5 4 4 4 4 4 4 4 4 4 4 3 3 3 3 3 3 3 3
* 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2
* 0 0 0 0 0 0 E P <------------------ zeroes ------------------->
*
* 3 3 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1
* 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0
* <----------------- offset ------------------> < type -> V G 0 0
*
* E is the exclusive marker that is not stored in swap entries.
* _PAGE_PRESENT (P), _PAGE_VALID (V) and_PAGE_GLOBAL (G) have to remain
* unused.
*/
#define __swp_type(x) (((x).val >> 4) & 0x1f) #define __swp_type(x) (((x).val >> 4) & 0x1f)
#define __swp_offset(x) ((x).val >> 9) #define __swp_offset(x) ((x).val >> 9)
#define __swp_entry(type,offset) ((swp_entry_t) { ((type) << 4) | ((offset) << 9) }) #define __swp_entry(type, offset) ((swp_entry_t) { (((type) & 0x1f) << 4) | ((offset) << 9) })
#define __pte_to_swp_entry(pte) ((swp_entry_t) { (pte).pte_high }) #define __pte_to_swp_entry(pte) ((swp_entry_t) { (pte).pte_high })
#define __swp_entry_to_pte(x) ((pte_t) { 0, (x).val }) #define __swp_entry_to_pte(x) ((pte_t) { 0, (x).val })
/*
* We borrow bit 57 (bit 25 in the low PTE) to store the exclusive marker in
* swap PTEs.
*/
#define _PAGE_SWP_EXCLUSIVE (1 << 25)
#elif defined(CONFIG_PHYS_ADDR_T_64BIT) && defined(CONFIG_CPU_MIPS32) #elif defined(CONFIG_PHYS_ADDR_T_64BIT) && defined(CONFIG_CPU_MIPS32)
/* Swap entries must have VALID and GLOBAL bits cleared. */ /*
* Format of swap PTEs:
*
* 6 6 6 6 5 5 5 5 5 5 5 5 5 5 4 4 4 4 4 4 4 4 4 4 3 3 3 3 3 3 3 3
* 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2
* <------------------ zeroes -------------------> E P 0 0 0 0 0 0
*
* 3 3 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1
* 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0
* <------------------- offset --------------------> < type -> V G
*
* E is the exclusive marker that is not stored in swap entries.
* _PAGE_PRESENT (P), _PAGE_VALID (V) and_PAGE_GLOBAL (G) have to remain
* unused.
*/
#define __swp_type(x) (((x).val >> 2) & 0x1f) #define __swp_type(x) (((x).val >> 2) & 0x1f)
#define __swp_offset(x) ((x).val >> 7) #define __swp_offset(x) ((x).val >> 7)
#define __swp_entry(type, offset) ((swp_entry_t) { ((type) << 2) | ((offset) << 7) }) #define __swp_entry(type, offset) ((swp_entry_t) { (((type) & 0x1f) << 2) | ((offset) << 7) })
#define __pte_to_swp_entry(pte) ((swp_entry_t) { (pte).pte_high }) #define __pte_to_swp_entry(pte) ((swp_entry_t) { (pte).pte_high })
#define __swp_entry_to_pte(x) ((pte_t) { 0, (x).val }) #define __swp_entry_to_pte(x) ((pte_t) { 0, (x).val })
/*
* We borrow bit 39 (bit 7 in the low PTE) to store the exclusive marker in swap
* PTEs.
*/
#define _PAGE_SWP_EXCLUSIVE (1 << 7)
#else #else
/* /*
* Constraints: * Format of swap PTEs:
* _PAGE_PRESENT at bit 0 *
* _PAGE_MODIFIED at bit 4 * 3 3 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1
* _PAGE_GLOBAL at bit 6 * 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0
* _PAGE_VALID at bit 7 * <------------- offset --------------> < type -> 0 0 0 0 0 0 E P
*
* E is the exclusive marker that is not stored in swap entries.
* _PAGE_PRESENT (P), _PAGE_VALID (V) and_PAGE_GLOBAL (G) have to remain
* unused. The location of V and G varies.
*/ */
#define __swp_type(x) (((x).val >> 8) & 0x1f) #define __swp_type(x) (((x).val >> 8) & 0x1f)
#define __swp_offset(x) ((x).val >> 13) #define __swp_offset(x) ((x).val >> 13)
#define __swp_entry(type,offset) ((swp_entry_t) { ((type) << 8) | ((offset) << 13) }) #define __swp_entry(type, offset) ((swp_entry_t) { ((type) << 8) | ((offset) << 13) })
#define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val(pte) }) #define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val(pte) })
#define __swp_entry_to_pte(x) ((pte_t) { (x).val }) #define __swp_entry_to_pte(x) ((pte_t) { (x).val })
/* We borrow bit 1 to store the exclusive marker in swap PTEs. */
#define _PAGE_SWP_EXCLUSIVE (1 << 1)
#endif /* defined(CONFIG_PHYS_ADDR_T_64BIT) && defined(CONFIG_CPU_MIPS32) */ #endif /* defined(CONFIG_PHYS_ADDR_T_64BIT) && defined(CONFIG_CPU_MIPS32) */
#endif /* defined(CONFIG_CPU_R3K_TLB) */ #endif /* defined(CONFIG_CPU_R3K_TLB) */

@@ -320,16 +320,31 @@ extern void pud_init(void *addr);
extern void pmd_init(void *addr); extern void pmd_init(void *addr);
/* /*
* Non-present pages: high 40 bits are offset, next 8 bits type, * Encode/decode swap entries and swap PTEs. Swap PTEs are all PTEs that
* low 16 bits zero. * are !pte_none() && !pte_present().
*
* Format of swap PTEs:
*
* 6 6 6 6 5 5 5 5 5 5 5 5 5 5 4 4 4 4 4 4 4 4 4 4 3 3 3 3 3 3 3 3
* 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2
* <--------------------------- offset ---------------------------
*
* 3 3 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1
* 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0
* --------------> E <-- type ---> <---------- zeroes ----------->
*
* E is the exclusive marker that is not stored in swap entries.
*/ */
static inline pte_t mk_swap_pte(unsigned long type, unsigned long offset) static inline pte_t mk_swap_pte(unsigned long type, unsigned long offset)
{ pte_t pte; pte_val(pte) = (type << 16) | (offset << 24); return pte; } { pte_t pte; pte_val(pte) = ((type & 0x7f) << 16) | (offset << 24); return pte; }
#define __swp_type(x) (((x).val >> 16) & 0xff) #define __swp_type(x) (((x).val >> 16) & 0x7f)
#define __swp_offset(x) ((x).val >> 24) #define __swp_offset(x) ((x).val >> 24)
#define __swp_entry(type, offset) ((swp_entry_t) { pte_val(mk_swap_pte((type), (offset))) }) #define __swp_entry(type, offset) ((swp_entry_t) { pte_val(mk_swap_pte((type), (offset))) })
#define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val(pte) }) #define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val(pte) })
#define __swp_entry_to_pte(x) ((pte_t) { (x).val }) #define __swp_entry_to_pte(x) ((pte_t) { (x).val })
/* We borrow bit 23 to store the exclusive marker in swap PTEs. */
#define _PAGE_SWP_EXCLUSIVE (1 << 23)
#endif /* _ASM_PGTABLE_64_H */ #endif /* _ASM_PGTABLE_64_H */
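
The hunk above narrows the type field from 8 to 7 bits so that bit 23 can carry the exclusive marker. The user-space sketch below contrasts the old and new encodings for a hypothetical type value of `0x80`. It is an illustration under that assumption, not kernel code, and the `demo_` names do not exist in the kernel.

```c
#include <stdint.h>

/* Bit 23, mirroring _PAGE_SWP_EXCLUSIVE in the hunk. */
#define DEMO_SWP_EXCLUSIVE (UINT64_C(1) << 23)

/* Old encoding: the type is not masked, so its eighth bit lands on bit 23. */
static inline uint64_t demo_mk_swap_pte_old(uint64_t type, uint64_t offset)
{
	return (type << 16) | (offset << 24);
}

/* New encoding: the type is limited to 7 bits (bits 16..22). */
static inline uint64_t demo_mk_swap_pte_new(uint64_t type, uint64_t offset)
{
	return ((type & 0x7f) << 16) | (offset << 24);
}

static inline uint64_t demo_swp_type(uint64_t val)
{
	return (val >> 16) & 0x7f;
}

static inline uint64_t demo_swp_offset(uint64_t val)
{
	return val >> 24;
}
```

With the old encoding, a type value with its top bit set would be indistinguishable from a set exclusive marker; masking with `0x7f` removes that overlap.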

@@ -528,6 +528,41 @@ static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
} }
#endif #endif
#if defined(CONFIG_PHYS_ADDR_T_64BIT) && defined(CONFIG_CPU_MIPS32)
static inline int pte_swp_exclusive(pte_t pte)
{
return pte.pte_low & _PAGE_SWP_EXCLUSIVE;
}
static inline pte_t pte_swp_mkexclusive(pte_t pte)
{
pte.pte_low |= _PAGE_SWP_EXCLUSIVE;
return pte;
}
static inline pte_t pte_swp_clear_exclusive(pte_t pte)
{
pte.pte_low &= ~_PAGE_SWP_EXCLUSIVE;
return pte;
}
#else
static inline int pte_swp_exclusive(pte_t pte)
{
return pte_val(pte) & _PAGE_SWP_EXCLUSIVE;
}
static inline pte_t pte_swp_mkexclusive(pte_t pte)
{
pte_val(pte) |= _PAGE_SWP_EXCLUSIVE;
return pte;
}
static inline pte_t pte_swp_clear_exclusive(pte_t pte)
{
pte_val(pte) &= ~_PAGE_SWP_EXCLUSIVE;
return pte;
}
#endif
extern void __update_tlb(struct vm_area_struct *vma, unsigned long address, extern void __update_tlb(struct vm_area_struct *vma, unsigned long address,
pte_t pte); pte_t pte);
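
For the two-word PTE case handled above, the swap entry travels in `pte_high` while the exclusive marker is kept in `pte_low`. The sketch below models that split in user space. The field order of the struct is an assumption inferred from the `{ 0, (x).val }` initializer and the `pte_high` access in the swap macros, the marker value `1 << 7` is the one given for the non-XPA 64-bit-physical-address variant, and all `demo_` names are hypothetical.

```c
#include <stdint.h>

/* Assumed field order: low word first, then high word. */
typedef struct {
	uint32_t pte_low;
	uint32_t pte_high;
} demo_pte_t;

/* Bit 7 of the low word, as in the non-XPA variant of the patch. */
#define DEMO_SWP_EXCLUSIVE (UINT32_C(1) << 7)

/* The swap entry value is stored entirely in the high word. */
static inline demo_pte_t demo_swp_entry_to_pte(uint32_t entry)
{
	demo_pte_t pte = { 0, entry };
	return pte;
}

static inline uint32_t demo_pte_to_swp_entry(demo_pte_t pte)
{
	return pte.pte_high;
}

/* The exclusive marker only ever touches the low word. */
static inline int demo_swp_exclusive(demo_pte_t pte)
{
	return (pte.pte_low & DEMO_SWP_EXCLUSIVE) != 0;
}

static inline demo_pte_t demo_swp_mkexclusive(demo_pte_t pte)
{
	pte.pte_low |= DEMO_SWP_EXCLUSIVE;
	return pte;
}

static inline demo_pte_t demo_swp_clear_exclusive(demo_pte_t pte)
{
	pte.pte_low &= ~DEMO_SWP_EXCLUSIVE;
	return pte;
}
```

Keeping the marker in a different word from the entry means that setting or clearing it cannot disturb the encoded type or offset.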

@@ -86,15 +86,6 @@ extern struct page *mem_map;
# define pfn_to_kaddr(pfn) __va((pfn) << PAGE_SHIFT) # define pfn_to_kaddr(pfn) __va((pfn) << PAGE_SHIFT)
static inline bool pfn_valid(unsigned long pfn)
{
/* avoid <linux/mm.h> include hell */
extern unsigned long max_mapnr;
unsigned long pfn_offset = ARCH_PFN_OFFSET;
return pfn >= pfn_offset && pfn < max_mapnr;
}
# define virt_to_page(vaddr) pfn_to_page(PFN_DOWN(virt_to_phys(vaddr))) # define virt_to_page(vaddr) pfn_to_page(PFN_DOWN(virt_to_phys(vaddr)))
# define virt_addr_valid(vaddr) pfn_valid(PFN_DOWN(virt_to_phys(vaddr))) # define virt_addr_valid(vaddr) pfn_valid(PFN_DOWN(virt_to_phys(vaddr)))

@@ -31,4 +31,7 @@
#define _PAGE_ACCESSED (1<<26) /* page referenced */ #define _PAGE_ACCESSED (1<<26) /* page referenced */
#define _PAGE_DIRTY (1<<27) /* dirty page */ #define _PAGE_DIRTY (1<<27) /* dirty page */
/* We borrow bit 31 to store the exclusive marker in swap PTEs. */
#define _PAGE_SWP_EXCLUSIVE (1<<31)
#endif /* _ASM_NIOS2_PGTABLE_BITS_H */ #endif /* _ASM_NIOS2_PGTABLE_BITS_H */

@@ -232,23 +232,44 @@ static inline unsigned long pmd_page_vaddr(pmd_t pmd)
__FILE__, __LINE__, pgd_val(e)) __FILE__, __LINE__, pgd_val(e))
/* /*
* Encode and decode a swap entry (must be !pte_none(pte) && !pte_present(pte): * Encode/decode swap entries and swap PTEs. Swap PTEs are all PTEs that
* are !pte_none() && !pte_present().
* *
* 31 30 29 28 27 26 25 24 23 22 21 20 19 18 ... 1 0 * Format of swap PTEs:
* 0 0 0 0 type. 0 0 0 0 0 0 offset.........
* *
* This gives us up to 2**2 = 4 swap files and 2**20 * 4K = 4G per swap file. * 3 3 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1
* 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0
* E < type -> 0 0 0 0 0 0 <-------------- offset --------------->
* *
* Note that the offset field is always non-zero, thus !pte_none(pte) is always * E is the exclusive marker that is not stored in swap entries.
* true. *
* Note that the offset field is always non-zero if the swap type is 0, thus
* !pte_none() is always true.
*/ */
#define __swp_type(swp) (((swp).val >> 26) & 0x3) #define __swp_type(swp) (((swp).val >> 26) & 0x1f)
#define __swp_offset(swp) ((swp).val & 0xfffff) #define __swp_offset(swp) ((swp).val & 0xfffff)
#define __swp_entry(type, off) ((swp_entry_t) { (((type) & 0x3) << 26) \ #define __swp_entry(type, off) ((swp_entry_t) { (((type) & 0x1f) << 26) \
| ((off) & 0xfffff) }) | ((off) & 0xfffff) })
#define __swp_entry_to_pte(swp) ((pte_t) { (swp).val }) #define __swp_entry_to_pte(swp) ((pte_t) { (swp).val })
#define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val(pte) }) #define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val(pte) })
static inline int pte_swp_exclusive(pte_t pte)
{
return pte_val(pte) & _PAGE_SWP_EXCLUSIVE;
}
static inline pte_t pte_swp_mkexclusive(pte_t pte)
{
pte_val(pte) |= _PAGE_SWP_EXCLUSIVE;
return pte;
}
static inline pte_t pte_swp_clear_exclusive(pte_t pte)
{
pte_val(pte) &= ~_PAGE_SWP_EXCLUSIVE;
return pte;
}
extern void __init paging_init(void); extern void __init paging_init(void);
extern void __init mmu_init(void); extern void __init mmu_init(void);
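
In the layout above the offset occupies only the low 20 bits, the type sits in bits 26 to 30, and bit 31 holds the exclusive marker. A user-space sketch of that encoding follows. It shows that offsets wider than 20 bits are silently truncated by the `0xfffff` mask; the `demo_` names are hypothetical and the swap entry is modelled as a bare `uint32_t`.

```c
#include <stdint.h>

/* Bit 31, mirroring _PAGE_SWP_EXCLUSIVE (1<<31) from the bits header above. */
#define DEMO_SWP_EXCLUSIVE (UINT32_C(1) << 31)

/* 5-bit type at bit 26, 20-bit offset in the low bits. */
static inline uint32_t demo_swp_entry(uint32_t type, uint32_t off)
{
	return ((type & 0x1fu) << 26) | (off & 0xfffffu);
}

static inline uint32_t demo_swp_type(uint32_t val)
{
	return (val >> 26) & 0x1fu;
}

static inline uint32_t demo_swp_offset(uint32_t val)
{
	return val & 0xfffffu;
}
```

Widening the type mask from `0x3` to `0x1f` uses bits 28 to 30, which the old layout left as zero, while bit 31 stays outside both fields and is therefore available for the marker.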

@@ -80,8 +80,6 @@ typedef struct page *pgtable_t;
#define page_to_phys(page) ((dma_addr_t)page_to_pfn(page) << PAGE_SHIFT) #define page_to_phys(page) ((dma_addr_t)page_to_pfn(page) << PAGE_SHIFT)
#define pfn_valid(pfn) ((pfn) < max_mapnr)
#define virt_addr_valid(kaddr) (pfn_valid(virt_to_pfn(kaddr))) #define virt_addr_valid(kaddr) (pfn_valid(virt_to_pfn(kaddr)))
#endif /* __ASSEMBLY__ */ #endif /* __ASSEMBLY__ */

@@ -154,6 +154,9 @@ extern void paging_init(void);
#define _KERNPG_TABLE \ #define _KERNPG_TABLE \
(_PAGE_BASE | _PAGE_SRE | _PAGE_SWE | _PAGE_ACCESSED | _PAGE_DIRTY) (_PAGE_BASE | _PAGE_SRE | _PAGE_SWE | _PAGE_ACCESSED | _PAGE_DIRTY)
/* We borrow bit 11 to store the exclusive marker in swap PTEs. */
#define _PAGE_SWP_EXCLUSIVE _PAGE_U_SHARED
#define PAGE_NONE __pgprot(_PAGE_ALL) #define PAGE_NONE __pgprot(_PAGE_ALL)
#define PAGE_READONLY __pgprot(_PAGE_ALL | _PAGE_URE | _PAGE_SRE) #define PAGE_READONLY __pgprot(_PAGE_ALL | _PAGE_URE | _PAGE_SRE)
#define PAGE_READONLY_X __pgprot(_PAGE_ALL | _PAGE_URE | _PAGE_SRE | _PAGE_EXEC) #define PAGE_READONLY_X __pgprot(_PAGE_ALL | _PAGE_URE | _PAGE_SRE | _PAGE_EXEC)
@@ -385,16 +388,43 @@ static inline void update_mmu_cache(struct vm_area_struct *vma,
/* __PHX__ FIXME, SWAP, this probably doesn't work */ /* __PHX__ FIXME, SWAP, this probably doesn't work */
/* Encode and de-code a swap entry (must be !pte_none(e) && !pte_present(e)) */ /*
/* Since the PAGE_PRESENT bit is bit 4, we can use the bits above */ * Encode/decode swap entries and swap PTEs. Swap PTEs are all PTEs that
* are !pte_none() && !pte_present().
#define __swp_type(x) (((x).val >> 5) & 0x7f) *
* Format of swap PTEs:
*
* 3 3 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1
* 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0
* <-------------- offset ---------------> E <- type --> 0 0 0 0 0
*
* E is the exclusive marker that is not stored in swap entries.
* The zero'ed bits include _PAGE_PRESENT.
*/
#define __swp_type(x) (((x).val >> 5) & 0x3f)
#define __swp_offset(x) ((x).val >> 12) #define __swp_offset(x) ((x).val >> 12)
#define __swp_entry(type, offset) \ #define __swp_entry(type, offset) \
((swp_entry_t) { ((type) << 5) | ((offset) << 12) }) ((swp_entry_t) { (((type) & 0x3f) << 5) | ((offset) << 12) })
#define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val(pte) }) #define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val(pte) })
#define __swp_entry_to_pte(x) ((pte_t) { (x).val }) #define __swp_entry_to_pte(x) ((pte_t) { (x).val })
static inline int pte_swp_exclusive(pte_t pte)
{
return pte_val(pte) & _PAGE_SWP_EXCLUSIVE;
}
static inline pte_t pte_swp_mkexclusive(pte_t pte)
{
pte_val(pte) |= _PAGE_SWP_EXCLUSIVE;
return pte;
}
static inline pte_t pte_swp_clear_exclusive(pte_t pte)
{
pte_val(pte) &= ~_PAGE_SWP_EXCLUSIVE;
return pte;
}
typedef pte_t *pte_addr_t; typedef pte_t *pte_addr_t;
#endif /* __ASSEMBLY__ */ #endif /* __ASSEMBLY__ */

@@ -155,10 +155,6 @@ extern int npmem_ranges;
#define __pa(x) ((unsigned long)(x)-PAGE_OFFSET) #define __pa(x) ((unsigned long)(x)-PAGE_OFFSET)
#define __va(x) ((void *)((unsigned long)(x)+PAGE_OFFSET)) #define __va(x) ((void *)((unsigned long)(x)+PAGE_OFFSET))
#ifndef CONFIG_SPARSEMEM
#define pfn_valid(pfn) ((pfn) < max_mapnr)
#endif
#ifdef CONFIG_HUGETLB_PAGE #ifdef CONFIG_HUGETLB_PAGE
#define HPAGE_SHIFT PMD_SHIFT /* fixed for transparent huge pages */ #define HPAGE_SHIFT PMD_SHIFT /* fixed for transparent huge pages */
#define HPAGE_SIZE ((1UL) << HPAGE_SHIFT) #define HPAGE_SIZE ((1UL) << HPAGE_SHIFT)

@@ -218,6 +218,9 @@ extern void __update_cache(pte_t pte);
#define _PAGE_KERNEL_RWX (_PAGE_KERNEL_EXEC | _PAGE_WRITE) #define _PAGE_KERNEL_RWX (_PAGE_KERNEL_EXEC | _PAGE_WRITE)
#define _PAGE_KERNEL (_PAGE_KERNEL_RO | _PAGE_WRITE) #define _PAGE_KERNEL (_PAGE_KERNEL_RO | _PAGE_WRITE)
/* We borrow bit 23 to store the exclusive marker in swap PTEs. */
#define _PAGE_SWP_EXCLUSIVE _PAGE_ACCESSED
/* The pgd/pmd contains a ptr (in phys addr space); since all pgds/pmds /* The pgd/pmd contains a ptr (in phys addr space); since all pgds/pmds
* are page-aligned, we don't care about the PAGE_OFFSET bits, except * are page-aligned, we don't care about the PAGE_OFFSET bits, except
* for a few meta-information bits, so we shift the address to be * for a few meta-information bits, so we shift the address to be
@@ -394,17 +397,48 @@ extern void paging_init (void);
#define update_mmu_cache(vms,addr,ptep) __update_cache(*ptep) #define update_mmu_cache(vms,addr,ptep) __update_cache(*ptep)
/* Encode and de-code a swap entry */ /*
* Encode/decode swap entries and swap PTEs. Swap PTEs are all PTEs that
* are !pte_none() && !pte_present().
*
* Format of swap PTEs (32bit):
*
* 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3
* 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
* <---------------- offset -----------------> P E <ofs> < type ->
*
* E is the exclusive marker that is not stored in swap entries.
* _PAGE_PRESENT (P) must be 0.
*
* For the 64bit version, the offset is extended by 32bit.
*/
#define __swp_type(x) ((x).val & 0x1f) #define __swp_type(x) ((x).val & 0x1f)
#define __swp_offset(x) ( (((x).val >> 6) & 0x7) | \ #define __swp_offset(x) ( (((x).val >> 6) & 0x7) | \
(((x).val >> 8) & ~0x7) ) (((x).val >> 8) & ~0x7) )
#define __swp_entry(type, offset) ((swp_entry_t) { (type) | \ #define __swp_entry(type, offset) ((swp_entry_t) { \
((type) & 0x1f) | \
((offset & 0x7) << 6) | \ ((offset & 0x7) << 6) | \
((offset & ~0x7) << 8) }) ((offset & ~0x7) << 8) })
#define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val(pte) }) #define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val(pte) })
#define __swp_entry_to_pte(x) ((pte_t) { (x).val }) #define __swp_entry_to_pte(x) ((pte_t) { (x).val })
static inline int pte_swp_exclusive(pte_t pte)
{
return pte_val(pte) & _PAGE_SWP_EXCLUSIVE;
}
static inline pte_t pte_swp_mkexclusive(pte_t pte)
{
pte_val(pte) |= _PAGE_SWP_EXCLUSIVE;
return pte;
}
static inline pte_t pte_swp_clear_exclusive(pte_t pte)
{
pte_val(pte) &= ~_PAGE_SWP_EXCLUSIVE;
return pte;
}
static inline int ptep_test_and_clear_young(struct vm_area_struct *vma, unsigned long addr, pte_t *ptep) static inline int ptep_test_and_clear_young(struct vm_area_struct *vma, unsigned long addr, pte_t *ptep)
{ {
pte_t pte; pte_t pte;

@@ -42,6 +42,9 @@
#define _PMD_PRESENT_MASK (PAGE_MASK) #define _PMD_PRESENT_MASK (PAGE_MASK)
#define _PMD_BAD (~PAGE_MASK) #define _PMD_BAD (~PAGE_MASK)
/* We borrow the _PAGE_USER bit to store the exclusive marker in swap PTEs. */
#define _PAGE_SWP_EXCLUSIVE _PAGE_USER
/* And here we include common definitions */ /* And here we include common definitions */
#define _PAGE_KERNEL_RO 0 #define _PAGE_KERNEL_RO 0
@ -363,17 +366,41 @@ static inline void __ptep_set_access_flags(struct vm_area_struct *vma,
#define pmd_page(pmd) pfn_to_page(pmd_pfn(pmd)) #define pmd_page(pmd) pfn_to_page(pmd_pfn(pmd))
/* /*
* Encode and decode a swap entry. * Encode/decode swap entries and swap PTEs. Swap PTEs are all PTEs that
* Note that the bits we use in a PTE for representing a swap entry * are !pte_none() && !pte_present().
* must not include the _PAGE_PRESENT bit or the _PAGE_HASHPTE bit (if used). *
* -- paulus * Format of swap PTEs (32bit PTEs):
*
* 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3
* 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
* <----------------- offset --------------------> < type -> E H P
*
* E is the exclusive marker that is not stored in swap entries.
* _PAGE_PRESENT (P) and __PAGE_HASHPTE (H) must be 0.
*
* For 64bit PTEs, the offset is extended by 32bit.
*/ */
#define __swp_type(entry) ((entry).val & 0x1f) #define __swp_type(entry) ((entry).val & 0x1f)
#define __swp_offset(entry) ((entry).val >> 5) #define __swp_offset(entry) ((entry).val >> 5)
#define __swp_entry(type, offset) ((swp_entry_t) { (type) | ((offset) << 5) }) #define __swp_entry(type, offset) ((swp_entry_t) { ((type) & 0x1f) | ((offset) << 5) })
#define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val(pte) >> 3 }) #define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val(pte) >> 3 })
#define __swp_entry_to_pte(x) ((pte_t) { (x).val << 3 }) #define __swp_entry_to_pte(x) ((pte_t) { (x).val << 3 })
static inline int pte_swp_exclusive(pte_t pte)
{
return pte_val(pte) & _PAGE_SWP_EXCLUSIVE;
}
static inline pte_t pte_swp_mkexclusive(pte_t pte)
{
return __pte(pte_val(pte) | _PAGE_SWP_EXCLUSIVE);
}
static inline pte_t pte_swp_clear_exclusive(pte_t pte)
{
return __pte(pte_val(pte) & ~_PAGE_SWP_EXCLUSIVE);
}
/* Generic accessors to PTE bits */ /* Generic accessors to PTE bits */
static inline int pte_write(pte_t pte) { return !!(pte_val(pte) & _PAGE_RW);} static inline int pte_write(pte_t pte) { return !!(pte_val(pte) & _PAGE_RW);}
static inline int pte_read(pte_t pte) { return 1; } static inline int pte_read(pte_t pte) { return 1; }
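
As a rough illustration of the "E is not stored in swap entries" remark in the hunk above, the following user-space sketch models the 32-bit encoding. The `toy_` names are hypothetical, and the value 0x4 for the exclusive bit is an assumption taken from the "E H P" diagram (E in LSB 2), not from the real `_PAGE_USER` definition.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint32_t val; } toy_swp_entry_t;
typedef struct { uint32_t pte; } toy_pte_t;

/* Assumption: low three PTE bits are P, H, E as in the diagram. */
#define TOY_PAGE_PRESENT       0x1u
#define TOY_PAGE_HASHPTE       0x2u
#define TOY_PAGE_SWP_EXCLUSIVE 0x4u

/* Mirrors __swp_entry(): 5-bit type, offset above it. */
static inline toy_swp_entry_t toy_swp_entry(uint32_t type, uint32_t offset)
{
    toy_swp_entry_t e = { (type & 0x1f) | (offset << 5) };
    return e;
}

/* Mirrors __swp_entry_to_pte(): shift past the three low PTE bits. */
static inline toy_pte_t toy_swp_entry_to_pte(toy_swp_entry_t e)
{
    toy_pte_t p = { e.val << 3 };
    return p;
}

/* Mirrors __pte_to_swp_entry(): the shift drops P, H and E. */
static inline toy_swp_entry_t toy_pte_to_swp_entry(toy_pte_t p)
{
    toy_swp_entry_t e = { p.pte >> 3 };
    return e;
}

static inline toy_pte_t toy_pte_swp_mkexclusive(toy_pte_t p)
{
    p.pte |= TOY_PAGE_SWP_EXCLUSIVE;
    return p;
}

static inline int toy_pte_swp_exclusive(toy_pte_t p)
{
    return (p.pte & TOY_PAGE_SWP_EXCLUSIVE) != 0;
}
```

Because the PTE-to-entry conversion shifts right by three, setting the exclusive marker on a swap PTE leaves the decoded swap entry unchanged in this model.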


@ -717,7 +717,6 @@ static inline pte_t pte_swp_clear_soft_dirty(pte_t pte)
} }
#endif /* CONFIG_HAVE_ARCH_SOFT_DIRTY */ #endif /* CONFIG_HAVE_ARCH_SOFT_DIRTY */
#define __HAVE_ARCH_PTE_SWP_EXCLUSIVE
static inline pte_t pte_swp_mkexclusive(pte_t pte) static inline pte_t pte_swp_mkexclusive(pte_t pte)
{ {
return __pte_raw(pte_raw(pte) | cpu_to_be64(_PAGE_SWP_EXCLUSIVE)); return __pte_raw(pte_raw(pte) | cpu_to_be64(_PAGE_SWP_EXCLUSIVE));


@ -360,18 +360,30 @@ static inline int pte_young(pte_t pte)
#endif #endif
#define pmd_page(pmd) pfn_to_page(pmd_pfn(pmd)) #define pmd_page(pmd) pfn_to_page(pmd_pfn(pmd))
/* /*
* Encode and decode a swap entry. * Encode/decode swap entries and swap PTEs. Swap PTEs are all PTEs that
* Note that the bits we use in a PTE for representing a swap entry * are !pte_none() && !pte_present().
* must not include the _PAGE_PRESENT bit. *
* -- paulus * Format of swap PTEs (32bit PTEs):
*
* 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3
* 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
* <------------------ offset -------------------> < type -> E 0 0
*
* E is the exclusive marker that is not stored in swap entries.
*
* For 64bit PTEs, the offset is extended by 32bit.
*/ */
#define __swp_type(entry) ((entry).val & 0x1f) #define __swp_type(entry) ((entry).val & 0x1f)
#define __swp_offset(entry) ((entry).val >> 5) #define __swp_offset(entry) ((entry).val >> 5)
#define __swp_entry(type, offset) ((swp_entry_t) { (type) | ((offset) << 5) }) #define __swp_entry(type, offset) ((swp_entry_t) { ((type) & 0x1f) | ((offset) << 5) })
#define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val(pte) >> 3 }) #define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val(pte) >> 3 })
#define __swp_entry_to_pte(x) ((pte_t) { (x).val << 3 }) #define __swp_entry_to_pte(x) ((pte_t) { (x).val << 3 })
/* We borrow LSB 2 to store the exclusive marker in swap PTEs. */
#define _PAGE_SWP_EXCLUSIVE 0x000004
#endif /* !__ASSEMBLY__ */ #endif /* !__ASSEMBLY__ */
#endif /* __ASM_POWERPC_NOHASH_32_PGTABLE_H */ #endif /* __ASM_POWERPC_NOHASH_32_PGTABLE_H */


@ -27,9 +27,9 @@
* of the 16 available. Bit 24-26 of the TLB are cleared in the TLB * of the 16 available. Bit 24-26 of the TLB are cleared in the TLB
* miss handler. Bit 27 is PAGE_USER, thus selecting the correct * miss handler. Bit 27 is PAGE_USER, thus selecting the correct
* zone. * zone.
* - PRESENT *must* be in the bottom two bits because swap cache * - PRESENT *must* be in the bottom two bits because swap PTEs
* entries use the top 30 bits. Because 40x doesn't support SMP * use the top 30 bits. Because 40x doesn't support SMP anyway, M is
* anyway, M is irrelevant so we borrow it for PAGE_PRESENT. Bit 30 * irrelevant so we borrow it for PAGE_PRESENT. Bit 30
* is cleared in the TLB miss handler before the TLB entry is loaded. * is cleared in the TLB miss handler before the TLB entry is loaded.
* - All other bits of the PTE are loaded into TLBLO without * - All other bits of the PTE are loaded into TLBLO without
* modification, leaving us only the bits 20, 21, 24, 25, 26, 30 for * modification, leaving us only the bits 20, 21, 24, 25, 26, 30 for


@ -56,20 +56,10 @@
* above bits. Note that the bit values are CPU specific, not architecture * above bits. Note that the bit values are CPU specific, not architecture
* specific. * specific.
* *
* The kernel PTE entry holds an arch-dependent swp_entry structure under * The kernel PTE entry can be an ordinary PTE mapping a page or a special swap
* certain situations. In other words, in such situations some portion of * PTE. In case of a swap PTE, LSB 2-24 are used to store information regarding
* the PTE bits are used as a swp_entry. In the PPC implementation, the * the swap entry. However LSB 0-1 still hold protection values, for example,
* 3-24th LSB are shared with swp_entry, however the 0-2nd three LSB still * to distinguish swap PTEs from ordinary PTEs, and must be used with care.
* hold protection values. That means the three protection bits are
* reserved for both PTE and SWAP entry at the most significant three
* LSBs.
*
* There are three protection bits available for SWAP entry:
* _PAGE_PRESENT
* _PAGE_HASHPTE (if HW has)
*
* So those three bits have to be inside of 0-2nd LSB of PTE.
*
*/ */
#define _PAGE_PRESENT 0x00000001 /* S: PTE valid */ #define _PAGE_PRESENT 0x00000001 /* S: PTE valid */


@ -11,8 +11,8 @@
32 33 34 35 36 ... 50 51 52 53 54 55 56 57 58 59 60 61 62 63 32 33 34 35 36 ... 50 51 52 53 54 55 56 57 58 59 60 61 62 63
RPN...................... 0 0 U0 U1 U2 U3 UX SX UW SW UR SR RPN...................... 0 0 U0 U1 U2 U3 UX SX UW SW UR SR
- PRESENT *must* be in the bottom three bits because swap cache - PRESENT *must* be in the bottom two bits because swap PTEs use
entries use the top 29 bits. the top 30 bits.
*/ */


@ -276,22 +276,40 @@ static inline void __ptep_set_access_flags(struct vm_area_struct *vma,
#define pgd_ERROR(e) \ #define pgd_ERROR(e) \
pr_err("%s:%d: bad pgd %08lx.\n", __FILE__, __LINE__, pgd_val(e)) pr_err("%s:%d: bad pgd %08lx.\n", __FILE__, __LINE__, pgd_val(e))
/* Encode and de-code a swap entry */ /*
* Encode/decode swap entries and swap PTEs. Swap PTEs are all PTEs that
* are !pte_none() && !pte_present().
*
* Format of swap PTEs:
*
* 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3
* 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
* <-------------------------- offset ----------------------------
*
* 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 5 5 5 5 5 5 5 5 5 5 6 6 6 6
* 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3
* --------------> <----------- zero ------------> E < type -> 0 0
*
* E is the exclusive marker that is not stored in swap entries.
*/
#define MAX_SWAPFILES_CHECK() do { \ #define MAX_SWAPFILES_CHECK() do { \
BUILD_BUG_ON(MAX_SWAPFILES_SHIFT > SWP_TYPE_BITS); \ BUILD_BUG_ON(MAX_SWAPFILES_SHIFT > SWP_TYPE_BITS); \
} while (0) } while (0)
#define SWP_TYPE_BITS 5 #define SWP_TYPE_BITS 5
#define __swp_type(x) (((x).val >> _PAGE_BIT_SWAP_TYPE) \ #define __swp_type(x) (((x).val >> 2) \
& ((1UL << SWP_TYPE_BITS) - 1)) & ((1UL << SWP_TYPE_BITS) - 1))
#define __swp_offset(x) ((x).val >> PTE_RPN_SHIFT) #define __swp_offset(x) ((x).val >> PTE_RPN_SHIFT)
#define __swp_entry(type, offset) ((swp_entry_t) { \ #define __swp_entry(type, offset) ((swp_entry_t) { \
((type) << _PAGE_BIT_SWAP_TYPE) \ (((type) & 0x1f) << 2) \
| ((offset) << PTE_RPN_SHIFT) }) | ((offset) << PTE_RPN_SHIFT) })
#define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val((pte)) }) #define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val((pte)) })
#define __swp_entry_to_pte(x) __pte((x).val) #define __swp_entry_to_pte(x) __pte((x).val)
/* We borrow MSB 56 (LSB 7) to store the exclusive marker in swap PTEs. */
#define _PAGE_SWP_EXCLUSIVE 0x80
int map_kernel_page(unsigned long ea, unsigned long pa, pgprot_t prot); int map_kernel_page(unsigned long ea, unsigned long pa, pgprot_t prot);
void unmap_kernel_page(unsigned long va); void unmap_kernel_page(unsigned long va);
extern int __meminit vmemmap_create_mapping(unsigned long start, extern int __meminit vmemmap_create_mapping(unsigned long start,


@ -151,6 +151,21 @@ static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
return __pte((pte_val(pte) & _PAGE_CHG_MASK) | pgprot_val(newprot)); return __pte((pte_val(pte) & _PAGE_CHG_MASK) | pgprot_val(newprot));
} }
static inline int pte_swp_exclusive(pte_t pte)
{
return pte_val(pte) & _PAGE_SWP_EXCLUSIVE;
}
static inline pte_t pte_swp_mkexclusive(pte_t pte)
{
return __pte(pte_val(pte) | _PAGE_SWP_EXCLUSIVE);
}
static inline pte_t pte_swp_clear_exclusive(pte_t pte)
{
return __pte(pte_val(pte) & ~_PAGE_SWP_EXCLUSIVE);
}
/* Insert a PTE, top-level function is out of line. It uses an inline /* Insert a PTE, top-level function is out of line. It uses an inline
* low level function in the respective pgtable-* files * low level function in the respective pgtable-* files
*/ */


@ -12,7 +12,6 @@
/* Architected bits */ /* Architected bits */
#define _PAGE_PRESENT 0x000001 /* software: pte contains a translation */ #define _PAGE_PRESENT 0x000001 /* software: pte contains a translation */
#define _PAGE_SW1 0x000002 #define _PAGE_SW1 0x000002
#define _PAGE_BIT_SWAP_TYPE 2
#define _PAGE_BAP_SR 0x000004 #define _PAGE_BAP_SR 0x000004
#define _PAGE_BAP_UR 0x000008 #define _PAGE_BAP_UR 0x000008
#define _PAGE_BAP_SW 0x000010 #define _PAGE_BAP_SW 0x000010


@ -117,15 +117,6 @@ extern long long virt_phys_offset;
#ifdef CONFIG_FLATMEM #ifdef CONFIG_FLATMEM
#define ARCH_PFN_OFFSET ((unsigned long)(MEMORY_START >> PAGE_SHIFT)) #define ARCH_PFN_OFFSET ((unsigned long)(MEMORY_START >> PAGE_SHIFT))
#ifndef __ASSEMBLY__
extern unsigned long max_mapnr;
static inline bool pfn_valid(unsigned long pfn)
{
unsigned long min_pfn = ARCH_PFN_OFFSET;
return pfn >= min_pfn && pfn < max_mapnr;
}
#endif
#endif #endif
#define virt_to_pfn(kaddr) (__pa(kaddr) >> PAGE_SHIFT) #define virt_to_pfn(kaddr) (__pa(kaddr) >> PAGE_SHIFT)


@ -132,7 +132,7 @@ void __iomem *ioremap_phb(phys_addr_t paddr, unsigned long size)
* address decoding but I'd rather not deal with those outside of the * address decoding but I'd rather not deal with those outside of the
* reserved 64K legacy region. * reserved 64K legacy region.
*/ */
area = __get_vm_area_caller(size, 0, PHB_IO_BASE, PHB_IO_END, area = __get_vm_area_caller(size, VM_IOREMAP, PHB_IO_BASE, PHB_IO_END,
__builtin_return_address(0)); __builtin_return_address(0));
if (!area) if (!area)
return NULL; return NULL;


@ -120,10 +120,8 @@ int vdso_join_timens(struct task_struct *task, struct time_namespace *ns)
mmap_read_lock(mm); mmap_read_lock(mm);
for_each_vma(vmi, vma) { for_each_vma(vmi, vma) {
unsigned long size = vma->vm_end - vma->vm_start;
if (vma_is_special_mapping(vma, &vvar_spec)) if (vma_is_special_mapping(vma, &vvar_spec))
zap_page_range(vma, vma->vm_start, size); zap_vma_pages(vma);
} }
mmap_read_unlock(mm); mmap_read_unlock(mm);


@ -393,6 +393,7 @@ static int kvmppc_memslot_page_merge(struct kvm *kvm,
{ {
unsigned long gfn = memslot->base_gfn; unsigned long gfn = memslot->base_gfn;
unsigned long end, start = gfn_to_hva(kvm, gfn); unsigned long end, start = gfn_to_hva(kvm, gfn);
unsigned long vm_flags;
int ret = 0; int ret = 0;
struct vm_area_struct *vma; struct vm_area_struct *vma;
int merge_flag = (merge) ? MADV_MERGEABLE : MADV_UNMERGEABLE; int merge_flag = (merge) ? MADV_MERGEABLE : MADV_UNMERGEABLE;
@ -409,12 +410,15 @@ static int kvmppc_memslot_page_merge(struct kvm *kvm,
ret = H_STATE; ret = H_STATE;
break; break;
} }
/* Copy vm_flags to avoid partial modifications in ksm_madvise */
vm_flags = vma->vm_flags;
ret = ksm_madvise(vma, vma->vm_start, vma->vm_end, ret = ksm_madvise(vma, vma->vm_start, vma->vm_end,
merge_flag, &vma->vm_flags); merge_flag, &vm_flags);
if (ret) { if (ret) {
ret = H_STATE; ret = H_STATE;
break; break;
} }
vm_flags_reset(vma, vm_flags);
start = vma->vm_end; start = vma->vm_end;
} while (end > vma->vm_end); } while (end > vma->vm_end);
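
The copy-then-reset pattern in the hunk above can be illustrated with a toy model. Everything here is hypothetical: `toy_madvise()` merely stands in for a callee that may update the flags word before failing, and the flag value is made up. The final assignment plays the role of `vm_flags_reset(vma, vm_flags)`.

```c
#include <assert.h>

#define TOY_VM_MERGEABLE 0x1UL /* made-up value, for illustration only */

struct toy_vma { unsigned long vm_flags; };

/* Hypothetical callee: modifies *flags first, then may report failure. */
static inline int toy_madvise(int fail, unsigned long *flags)
{
    *flags |= TOY_VM_MERGEABLE;
    return fail ? -1 : 0;
}

/* Work on a local copy so a failing call leaves the VMA flags untouched. */
static inline int toy_merge(struct toy_vma *vma, int fail)
{
    unsigned long vm_flags = vma->vm_flags;
    int ret = toy_madvise(fail, &vm_flags);

    if (ret)
        return ret;           /* vma->vm_flags was never written */
    vma->vm_flags = vm_flags; /* models vm_flags_reset(vma, vm_flags) */
    return 0;
}
```

Passing `&vma->vm_flags` directly, as the old side of the hunk did, would let the callee's partial update reach the VMA even on the error path.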


@ -324,7 +324,7 @@ static int kvmppc_xive_native_mmap(struct kvm_device *dev,
return -EINVAL; return -EINVAL;
} }
vma->vm_flags |= VM_IO | VM_PFNMAP; vm_flags_set(vma, VM_IO | VM_PFNMAP);
vma->vm_page_prot = pgprot_noncached_wc(vma->vm_page_prot); vma->vm_page_prot = pgprot_noncached_wc(vma->vm_page_prot);
/* /*


@ -156,7 +156,7 @@ static void subpage_mark_vma_nohuge(struct mm_struct *mm, unsigned long addr,
* VM_NOHUGEPAGE and split them. * VM_NOHUGEPAGE and split them.
*/ */
for_each_vma_range(vmi, vma, addr + len) { for_each_vma_range(vmi, vma, addr + len) {
vma->vm_flags |= VM_NOHUGEPAGE; vm_flags_set(vma, VM_NOHUGEPAGE);
walk_page_vma(vma, &subpage_walk_ops, NULL); walk_page_vma(vma, &subpage_walk_ops, NULL);
} }
} }


@ -414,7 +414,7 @@ static vm_fault_t vas_mmap_fault(struct vm_fault *vmf)
/* /*
* When the LPAR lost credits due to core removal or during * When the LPAR lost credits due to core removal or during
* migration, invalidate the existing mapping for the current * migration, invalidate the existing mapping for the current
* paste addresses and set windows in-active (zap_page_range in * paste addresses and set windows in-active (zap_vma_pages in
* reconfig_close_windows()). * reconfig_close_windows()).
* New mapping will be done later after migration or new credits * New mapping will be done later after migration or new credits
* available. So continue to receive faults if the user space * available. So continue to receive faults if the user space
@ -525,7 +525,7 @@ static int coproc_mmap(struct file *fp, struct vm_area_struct *vma)
pfn = paste_addr >> PAGE_SHIFT; pfn = paste_addr >> PAGE_SHIFT;
/* flags, page_prot from cxl_mmap(), except we want cachable */ /* flags, page_prot from cxl_mmap(), except we want cachable */
vma->vm_flags |= VM_IO | VM_PFNMAP; vm_flags_set(vma, VM_IO | VM_PFNMAP);
vma->vm_page_prot = pgprot_cached(vma->vm_page_prot); vma->vm_page_prot = pgprot_cached(vma->vm_page_prot);
prot = __pgprot(pgprot_val(vma->vm_page_prot) | _PAGE_DIRTY); prot = __pgprot(pgprot_val(vma->vm_page_prot) | _PAGE_DIRTY);


@ -291,7 +291,7 @@ static int spufs_mem_mmap(struct file *file, struct vm_area_struct *vma)
if (!(vma->vm_flags & VM_SHARED)) if (!(vma->vm_flags & VM_SHARED))
return -EINVAL; return -EINVAL;
vma->vm_flags |= VM_IO | VM_PFNMAP; vm_flags_set(vma, VM_IO | VM_PFNMAP);
vma->vm_page_prot = pgprot_noncached_wc(vma->vm_page_prot); vma->vm_page_prot = pgprot_noncached_wc(vma->vm_page_prot);
vma->vm_ops = &spufs_mem_mmap_vmops; vma->vm_ops = &spufs_mem_mmap_vmops;
@ -381,7 +381,7 @@ static int spufs_cntl_mmap(struct file *file, struct vm_area_struct *vma)
if (!(vma->vm_flags & VM_SHARED)) if (!(vma->vm_flags & VM_SHARED))
return -EINVAL; return -EINVAL;
vma->vm_flags |= VM_IO | VM_PFNMAP; vm_flags_set(vma, VM_IO | VM_PFNMAP);
vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot); vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
vma->vm_ops = &spufs_cntl_mmap_vmops; vma->vm_ops = &spufs_cntl_mmap_vmops;
@ -1043,7 +1043,7 @@ static int spufs_signal1_mmap(struct file *file, struct vm_area_struct *vma)
if (!(vma->vm_flags & VM_SHARED)) if (!(vma->vm_flags & VM_SHARED))
return -EINVAL; return -EINVAL;
vma->vm_flags |= VM_IO | VM_PFNMAP; vm_flags_set(vma, VM_IO | VM_PFNMAP);
vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot); vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
vma->vm_ops = &spufs_signal1_mmap_vmops; vma->vm_ops = &spufs_signal1_mmap_vmops;
@ -1179,7 +1179,7 @@ static int spufs_signal2_mmap(struct file *file, struct vm_area_struct *vma)
if (!(vma->vm_flags & VM_SHARED)) if (!(vma->vm_flags & VM_SHARED))
return -EINVAL; return -EINVAL;
vma->vm_flags |= VM_IO | VM_PFNMAP; vm_flags_set(vma, VM_IO | VM_PFNMAP);
vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot); vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
vma->vm_ops = &spufs_signal2_mmap_vmops; vma->vm_ops = &spufs_signal2_mmap_vmops;
@ -1302,7 +1302,7 @@ static int spufs_mss_mmap(struct file *file, struct vm_area_struct *vma)
if (!(vma->vm_flags & VM_SHARED)) if (!(vma->vm_flags & VM_SHARED))
return -EINVAL; return -EINVAL;
vma->vm_flags |= VM_IO | VM_PFNMAP; vm_flags_set(vma, VM_IO | VM_PFNMAP);
vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot); vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
vma->vm_ops = &spufs_mss_mmap_vmops; vma->vm_ops = &spufs_mss_mmap_vmops;
@ -1364,7 +1364,7 @@ static int spufs_psmap_mmap(struct file *file, struct vm_area_struct *vma)
if (!(vma->vm_flags & VM_SHARED)) if (!(vma->vm_flags & VM_SHARED))
return -EINVAL; return -EINVAL;
vma->vm_flags |= VM_IO | VM_PFNMAP; vm_flags_set(vma, VM_IO | VM_PFNMAP);
vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot); vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
vma->vm_ops = &spufs_psmap_mmap_vmops; vma->vm_ops = &spufs_psmap_mmap_vmops;
@ -1424,7 +1424,7 @@ static int spufs_mfc_mmap(struct file *file, struct vm_area_struct *vma)
if (!(vma->vm_flags & VM_SHARED)) if (!(vma->vm_flags & VM_SHARED))
return -EINVAL; return -EINVAL;
vma->vm_flags |= VM_IO | VM_PFNMAP; vm_flags_set(vma, VM_IO | VM_PFNMAP);
vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot); vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
vma->vm_ops = &spufs_mfc_mmap_vmops; vma->vm_ops = &spufs_mfc_mmap_vmops;


@ -760,8 +760,7 @@ static int reconfig_close_windows(struct vas_caps *vcap, int excess_creds,
* is done before the original mmap() and after the ioctl. * is done before the original mmap() and after the ioctl.
*/ */
if (vma) if (vma)
zap_page_range(vma, vma->vm_start, zap_vma_pages(vma);
vma->vm_end - vma->vm_start);
mmap_write_unlock(task_ref->mm); mmap_write_unlock(task_ref->mm);
mutex_unlock(&task_ref->mmap_mutex); mutex_unlock(&task_ref->mmap_mutex);


@ -171,11 +171,6 @@ extern phys_addr_t __phys_addr_symbol(unsigned long x);
#define sym_to_pfn(x) __phys_to_pfn(__pa_symbol(x)) #define sym_to_pfn(x) __phys_to_pfn(__pa_symbol(x))
#ifdef CONFIG_FLATMEM
#define pfn_valid(pfn) \
(((pfn) >= ARCH_PFN_OFFSET) && (((pfn) - ARCH_PFN_OFFSET) < max_mapnr))
#endif
#endif /* __ASSEMBLY__ */ #endif /* __ASSEMBLY__ */
#define virt_addr_valid(vaddr) ({ \ #define virt_addr_valid(vaddr) ({ \


@ -27,6 +27,9 @@
*/ */
#define _PAGE_PROT_NONE _PAGE_GLOBAL #define _PAGE_PROT_NONE _PAGE_GLOBAL
/* Used for swap PTEs only. */
#define _PAGE_SWP_EXCLUSIVE _PAGE_ACCESSED
#define _PAGE_PFN_SHIFT 10 #define _PAGE_PFN_SHIFT 10
/* /*


@ -728,16 +728,18 @@ extern pmd_t pmdp_collapse_flush(struct vm_area_struct *vma,
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */ #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
/* /*
* Encode and decode a swap entry * Encode/decode swap entries and swap PTEs. Swap PTEs are all PTEs that
* are !pte_none() && !pte_present().
* *
* Format of swap PTE: * Format of swap PTE:
* bit 0: _PAGE_PRESENT (zero) * bit 0: _PAGE_PRESENT (zero)
* bit 1 to 3: _PAGE_LEAF (zero) * bit 1 to 3: _PAGE_LEAF (zero)
* bit 5: _PAGE_PROT_NONE (zero) * bit 5: _PAGE_PROT_NONE (zero)
* bits 6 to 10: swap type * bit 6: exclusive marker
* bits 10 to XLEN-1: swap offset * bits 7 to 11: swap type
* bits 11 to XLEN-1: swap offset
*/ */
#define __SWP_TYPE_SHIFT 6 #define __SWP_TYPE_SHIFT 7
#define __SWP_TYPE_BITS 5 #define __SWP_TYPE_BITS 5
#define __SWP_TYPE_MASK ((1UL << __SWP_TYPE_BITS) - 1) #define __SWP_TYPE_MASK ((1UL << __SWP_TYPE_BITS) - 1)
#define __SWP_OFFSET_SHIFT (__SWP_TYPE_BITS + __SWP_TYPE_SHIFT) #define __SWP_OFFSET_SHIFT (__SWP_TYPE_BITS + __SWP_TYPE_SHIFT)
@ -748,11 +750,27 @@ extern pmd_t pmdp_collapse_flush(struct vm_area_struct *vma,
#define __swp_type(x) (((x).val >> __SWP_TYPE_SHIFT) & __SWP_TYPE_MASK) #define __swp_type(x) (((x).val >> __SWP_TYPE_SHIFT) & __SWP_TYPE_MASK)
#define __swp_offset(x) ((x).val >> __SWP_OFFSET_SHIFT) #define __swp_offset(x) ((x).val >> __SWP_OFFSET_SHIFT)
#define __swp_entry(type, offset) ((swp_entry_t) \ #define __swp_entry(type, offset) ((swp_entry_t) \
{ ((type) << __SWP_TYPE_SHIFT) | ((offset) << __SWP_OFFSET_SHIFT) }) { (((type) & __SWP_TYPE_MASK) << __SWP_TYPE_SHIFT) | \
((offset) << __SWP_OFFSET_SHIFT) })
#define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val(pte) }) #define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val(pte) })
#define __swp_entry_to_pte(x) ((pte_t) { (x).val }) #define __swp_entry_to_pte(x) ((pte_t) { (x).val })
static inline int pte_swp_exclusive(pte_t pte)
{
return pte_val(pte) & _PAGE_SWP_EXCLUSIVE;
}
static inline pte_t pte_swp_mkexclusive(pte_t pte)
{
return __pte(pte_val(pte) | _PAGE_SWP_EXCLUSIVE);
}
static inline pte_t pte_swp_clear_exclusive(pte_t pte)
{
return __pte(pte_val(pte) & ~_PAGE_SWP_EXCLUSIVE);
}
#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
#define __pmd_to_swp_entry(pmd) ((swp_entry_t) { pmd_val(pmd) }) #define __pmd_to_swp_entry(pmd) ((swp_entry_t) { pmd_val(pmd) })
#define __swp_entry_to_pmd(swp) __pmd((swp).val) #define __swp_entry_to_pmd(swp) __pmd((swp).val)
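
A quick user-space model of the new shift constants in the hunk above may help when reading the layout comment. It is a sketch under stated assumptions: the `TOY_` names are hypothetical, a 64-bit value models XLEN = 64, and the exclusive marker is placed at bit 6 because the comment says so, not because the real `_PAGE_ACCESSED` definition was consulted. Note that with a type shift of 7 and 5 type bits, the macros place the offset at bit 12 and above.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_SWP_TYPE_SHIFT   7
#define TOY_SWP_TYPE_BITS    5
#define TOY_SWP_TYPE_MASK    ((UINT64_C(1) << TOY_SWP_TYPE_BITS) - 1)
#define TOY_SWP_OFFSET_SHIFT (TOY_SWP_TYPE_BITS + TOY_SWP_TYPE_SHIFT)

/* Assumption: exclusive marker at bit 6, per the layout comment. */
#define TOY_SWP_EXCLUSIVE    (UINT64_C(1) << 6)

typedef struct { uint64_t val; } toy_swp_entry_t;

/* Mirrors __swp_entry(): masked type at bits 7-11, offset above it. */
static inline toy_swp_entry_t toy_swp_entry(uint64_t type, uint64_t offset)
{
    toy_swp_entry_t e = {
        ((type & TOY_SWP_TYPE_MASK) << TOY_SWP_TYPE_SHIFT) |
        (offset << TOY_SWP_OFFSET_SHIFT)
    };
    return e;
}

/* Mirrors __swp_type(). */
static inline uint64_t toy_swp_type(toy_swp_entry_t e)
{
    return (e.val >> TOY_SWP_TYPE_SHIFT) & TOY_SWP_TYPE_MASK;
}

/* Mirrors __swp_offset(). */
static inline uint64_t toy_swp_offset(toy_swp_entry_t e)
{
    return e.val >> TOY_SWP_OFFSET_SHIFT;
}
```

In this model the low seven bits of a freshly encoded entry, including the exclusive position, are always zero, so setting the marker afterwards cannot disturb the type or offset.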


@ -124,13 +124,11 @@ int vdso_join_timens(struct task_struct *task, struct time_namespace *ns)
mmap_read_lock(mm); mmap_read_lock(mm);
for_each_vma(vmi, vma) { for_each_vma(vmi, vma) {
unsigned long size = vma->vm_end - vma->vm_start;
if (vma_is_special_mapping(vma, vdso_info.dm)) if (vma_is_special_mapping(vma, vdso_info.dm))
zap_page_range(vma, vma->vm_start, size); zap_vma_pages(vma);
#ifdef CONFIG_COMPAT #ifdef CONFIG_COMPAT
if (vma_is_special_mapping(vma, compat_vdso_info.dm)) if (vma_is_special_mapping(vma, compat_vdso_info.dm))
zap_page_range(vma, vma->vm_start, size); zap_vma_pages(vma);
#endif #endif
} }


@ -73,9 +73,8 @@ static inline void copy_page(void *to, void *from)
#define clear_user_page(page, vaddr, pg) clear_page(page) #define clear_user_page(page, vaddr, pg) clear_page(page)
#define copy_user_page(to, from, vaddr, pg) copy_page(to, from) #define copy_user_page(to, from, vaddr, pg) copy_page(to, from)
#define alloc_zeroed_user_highpage_movable(vma, vaddr) \ #define vma_alloc_zeroed_movable_folio(vma, vaddr) \
alloc_page_vma(GFP_HIGHUSER_MOVABLE | __GFP_ZERO, vma, vaddr) vma_alloc_folio(GFP_HIGHUSER_MOVABLE | __GFP_ZERO, 0, vma, vaddr, false)
#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE_MOVABLE
/* /*
* These are used to make use of C type-checking.. * These are used to make use of C type-checking..


@ -827,7 +827,6 @@ static inline int pmd_protnone(pmd_t pmd)
} }
#endif #endif
#define __HAVE_ARCH_PTE_SWP_EXCLUSIVE
static inline int pte_swp_exclusive(pte_t pte) static inline int pte_swp_exclusive(pte_t pte)
{ {
return pte_val(pte) & _PAGE_SWP_EXCLUSIVE; return pte_val(pte) & _PAGE_SWP_EXCLUSIVE;


@ -59,11 +59,9 @@ int vdso_join_timens(struct task_struct *task, struct time_namespace *ns)
mmap_read_lock(mm); mmap_read_lock(mm);
for_each_vma(vmi, vma) { for_each_vma(vmi, vma) {
unsigned long size = vma->vm_end - vma->vm_start;
if (!vma_is_special_mapping(vma, &vvar_mapping)) if (!vma_is_special_mapping(vma, &vvar_mapping))
continue; continue;
zap_page_range(vma, vma->vm_start, size); zap_vma_pages(vma);
break; break;
} }
mmap_read_unlock(mm); mmap_read_unlock(mm);


@ -722,7 +722,7 @@ void gmap_discard(struct gmap *gmap, unsigned long from, unsigned long to)
if (is_vm_hugetlb_page(vma)) if (is_vm_hugetlb_page(vma))
continue; continue;
size = min(to - gaddr, PMD_SIZE - (gaddr & ~PMD_MASK)); size = min(to - gaddr, PMD_SIZE - (gaddr & ~PMD_MASK));
zap_page_range(vma, vmaddr, size); zap_page_range_single(vma, vmaddr, size, NULL);
} }
mmap_read_unlock(gmap->mm); mmap_read_unlock(gmap->mm);
} }
@ -2522,8 +2522,7 @@ static inline void thp_split_mm(struct mm_struct *mm)
VMA_ITERATOR(vmi, mm, 0); VMA_ITERATOR(vmi, mm, 0);
for_each_vma(vmi, vma) { for_each_vma(vmi, vma) {
vma->vm_flags &= ~VM_HUGEPAGE; vm_flags_mod(vma, VM_NOHUGEPAGE, VM_HUGEPAGE);
vma->vm_flags |= VM_NOHUGEPAGE;
walk_page_vma(vma, &thp_split_walk_ops, NULL); walk_page_vma(vma, &thp_split_walk_ops, NULL);
} }
mm->def_flags |= VM_NOHUGEPAGE; mm->def_flags |= VM_NOHUGEPAGE;
@ -2588,14 +2587,18 @@ int gmap_mark_unmergeable(void)
{ {
struct mm_struct *mm = current->mm; struct mm_struct *mm = current->mm;
struct vm_area_struct *vma; struct vm_area_struct *vma;
unsigned long vm_flags;
int ret; int ret;
VMA_ITERATOR(vmi, mm, 0); VMA_ITERATOR(vmi, mm, 0);
for_each_vma(vmi, vma) { for_each_vma(vmi, vma) {
/* Copy vm_flags to avoid partial modifications in ksm_madvise */
vm_flags = vma->vm_flags;
ret = ksm_madvise(vma, vma->vm_start, vma->vm_end, ret = ksm_madvise(vma, vma->vm_start, vma->vm_end,
MADV_UNMERGEABLE, &vma->vm_flags); MADV_UNMERGEABLE, &vm_flags);
if (ret) if (ret)
return ret; return ret;
vm_flags_reset(vma, vm_flags);
} }
mm->def_flags &= ~VM_MERGEABLE; mm->def_flags &= ~VM_MERGEABLE;
return 0; return 0;
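
The `thp_split_mm()` hunk above folds a clear and a set into one `vm_flags_mod(vma, VM_NOHUGEPAGE, VM_HUGEPAGE)` call. The toy model below assumes the helper means "set the first mask, clear the second"; that reading is inferred from the diff, and the flag values and `toy_` names are made up for illustration.

```c
#include <assert.h>

#define TOY_VM_HUGEPAGE   0x1UL /* made-up values */
#define TOY_VM_NOHUGEPAGE 0x2UL

struct toy_vma { unsigned long vm_flags; };

/* Assumed semantics of vm_flags_mod(vma, set, clear). */
static inline void toy_vm_flags_mod(struct toy_vma *vma,
                                    unsigned long set, unsigned long clear)
{
    vma->vm_flags = (vma->vm_flags | set) & ~clear;
}

/* The open-coded sequence from the old side of the hunk. */
static inline void toy_open_coded(struct toy_vma *vma)
{
    vma->vm_flags &= ~TOY_VM_HUGEPAGE;
    vma->vm_flags |= TOY_VM_NOHUGEPAGE;
}
```

As long as the set and clear masks do not overlap, both forms produce the same flags word; the helper simply performs the update as a single store.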


@ -169,9 +169,6 @@ typedef struct page *pgtable_t;
#define PFN_START (__MEMORY_START >> PAGE_SHIFT) #define PFN_START (__MEMORY_START >> PAGE_SHIFT)
#define ARCH_PFN_OFFSET (PFN_START) #define ARCH_PFN_OFFSET (PFN_START)
#define virt_to_page(kaddr) pfn_to_page(__pa(kaddr) >> PAGE_SHIFT) #define virt_to_page(kaddr) pfn_to_page(__pa(kaddr) >> PAGE_SHIFT)
#ifdef CONFIG_FLATMEM
#define pfn_valid(pfn) ((pfn) >= min_low_pfn && (pfn) < max_low_pfn)
#endif
#define virt_addr_valid(kaddr) pfn_valid(__pa(kaddr) >> PAGE_SHIFT) #define virt_addr_valid(kaddr) pfn_valid(__pa(kaddr) >> PAGE_SHIFT)
#include <asm-generic/memory_model.h> #include <asm-generic/memory_model.h>


@ -423,40 +423,69 @@ static inline unsigned long pmd_page_vaddr(pmd_t pmd)
#endif #endif
/* /*
* Encode and de-code a swap entry * Encode/decode swap entries and swap PTEs. Swap PTEs are all PTEs that
* are !pte_none() && !pte_present().
* *
* Constraints: * Constraints:
* _PAGE_PRESENT at bit 8 * _PAGE_PRESENT at bit 8
* _PAGE_PROTNONE at bit 9 * _PAGE_PROTNONE at bit 9
* *
* For the normal case, we encode the swap type into bits 0:7 and the * For the normal case, we encode the swap type and offset into the swap PTE
* swap offset into bits 10:30. For the 64-bit PTE case, we keep the * such that bits 8 and 9 stay zero. For the 64-bit PTE case, we use the
* preserved bits in the low 32-bits and use the upper 32 as the swap * upper 32 for the swap offset and swap type, following the same approach as
* offset (along with a 5-bit type), following the same approach as x86 * x86 PAE. This keeps the logic quite simple.
* PAE. This keeps the logic quite simple.
* *
* As is evident by the Alpha code, if we ever get a 64-bit unsigned * As is evident by the Alpha code, if we ever get a 64-bit unsigned
* long (swp_entry_t) to match up with the 64-bit PTEs, this all becomes * long (swp_entry_t) to match up with the 64-bit PTEs, this all becomes
* much cleaner.. * much cleaner..
*
* NOTE: We should set ZEROs at the position of _PAGE_PRESENT
* and _PAGE_PROTNONE bits
*/ */
#ifdef CONFIG_X2TLB #ifdef CONFIG_X2TLB
/*
* Format of swap PTEs:
*
* 6 6 6 6 5 5 5 5 5 5 5 5 5 5 4 4 4 4 4 4 4 4 4 4 3 3 3 3 3 3 3 3
* 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2
* <--------------------- offset ----------------------> < type ->
*
* 3 3 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1
* 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0
* <------------------- zeroes --------------------> E 0 0 0 0 0 0
*/
#define __swp_type(x) ((x).val & 0x1f) #define __swp_type(x) ((x).val & 0x1f)
#define __swp_offset(x) ((x).val >> 5) #define __swp_offset(x) ((x).val >> 5)
#define __swp_entry(type, offset) ((swp_entry_t){ (type) | (offset) << 5}) #define __swp_entry(type, offset) ((swp_entry_t){ ((type) & 0x1f) | (offset) << 5})
#define __pte_to_swp_entry(pte) ((swp_entry_t){ (pte).pte_high }) #define __pte_to_swp_entry(pte) ((swp_entry_t){ (pte).pte_high })
#define __swp_entry_to_pte(x) ((pte_t){ 0, (x).val }) #define __swp_entry_to_pte(x) ((pte_t){ 0, (x).val })
#else #else
#define __swp_type(x) ((x).val & 0xff) /*
* Format of swap PTEs:
*
* 3 3 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1
* 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0
* <--------------- offset ----------------> 0 0 0 0 E < type -> 0
*
* E is the exclusive marker that is not stored in swap entries.
*/
#define __swp_type(x) ((x).val & 0x1f)
#define __swp_offset(x) ((x).val >> 10) #define __swp_offset(x) ((x).val >> 10)
#define __swp_entry(type, offset) ((swp_entry_t){(type) | (offset) <<10}) #define __swp_entry(type, offset) ((swp_entry_t){((type) & 0x1f) | (offset) << 10})
#define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val(pte) >> 1 }) #define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val(pte) >> 1 })
#define __swp_entry_to_pte(x) ((pte_t) { (x).val << 1 }) #define __swp_entry_to_pte(x) ((pte_t) { (x).val << 1 })
#endif #endif
/* In both cases, we borrow bit 6 to store the exclusive marker in swap PTEs. */
#define _PAGE_SWP_EXCLUSIVE _PAGE_USER
static inline int pte_swp_exclusive(pte_t pte)
{
return pte.pte_low & _PAGE_SWP_EXCLUSIVE;
}
PTE_BIT_FUNC(low, swp_mkexclusive, |= _PAGE_SWP_EXCLUSIVE);
PTE_BIT_FUNC(low, swp_clear_exclusive, &= ~_PAGE_SWP_EXCLUSIVE);
#endif /* __ASSEMBLY__ */ #endif /* __ASSEMBLY__ */
#endif /* __ASM_SH_PGTABLE_32_H */ #endif /* __ASM_SH_PGTABLE_32_H */
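
For the non-X2TLB case above, a user-space sketch can confirm that the encoding keeps bits 8 and 9 clear and that the exclusive marker does not affect the decoded type or offset. The `toy_` names are hypothetical; the bit positions (present at bit 8, protnone at bit 9, exclusive at bit 6) are taken from the comments in the hunk, not from the real headers.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint32_t val; } toy_swp_entry_t;
typedef struct { uint32_t pte_low; } toy_pte_t;

/* Assumptions taken from the comments: P at bit 8, PROTNONE at bit 9, E at bit 6. */
#define TOY_PAGE_PRESENT       (1u << 8)
#define TOY_PAGE_PROTNONE      (1u << 9)
#define TOY_PAGE_SWP_EXCLUSIVE (1u << 6)

/* Mirrors __swp_entry(): 5-bit type, offset starting at bit 10. */
static inline toy_swp_entry_t toy_swp_entry(uint32_t type, uint32_t offset)
{
    toy_swp_entry_t e = { (type & 0x1f) | (offset << 10) };
    return e;
}

static inline uint32_t toy_swp_type(toy_swp_entry_t e)   { return e.val & 0x1f; }
static inline uint32_t toy_swp_offset(toy_swp_entry_t e) { return e.val >> 10; }

/* Mirrors __swp_entry_to_pte() and __pte_to_swp_entry(): shift by one bit. */
static inline toy_pte_t toy_swp_entry_to_pte(toy_swp_entry_t e)
{
    toy_pte_t p = { e.val << 1 };
    return p;
}

static inline toy_swp_entry_t toy_pte_to_swp_entry(toy_pte_t p)
{
    toy_swp_entry_t e = { p.pte_low >> 1 };
    return e;
}
```

One detail this model makes visible: with E set, the one-bit shift carries the marker into bit 5 of the raw entry value, but the 0x1f type mask and the shift by 10 both ignore that bit, so type and offset still decode correctly.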

Some files were not shown because too many files have changed in this diff.