Merge tag 'mm-stable-2024-01-08-15-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:
 "Many singleton patches against the MM code. The patch series which are
  included in this merge do the following:

   - Peng Zhang has done some mapletree maintenance work in the series

	'maple_tree: add mt_free_one() and mt_attr() helpers'
	'Some cleanups of maple tree'

   - In the series 'mm: use memmap_on_memory semantics for dax/kmem'
     Vishal Verma has altered the interworking between memory-hotplug
     and dax/kmem so that newly added 'device memory' can more easily
     have its memmap placed within that newly added memory.

   - Matthew Wilcox continues folio-related work (including a few fixes)
     in the patch series

	'Add folio_zero_tail() and folio_fill_tail()'
	'Make folio_start_writeback return void'
	'Fix fault handler's handling of poisoned tail pages'
	'Convert aops->error_remove_page to ->error_remove_folio'
	'Finish two folio conversions'
	'More swap folio conversions'

   - Kefeng Wang has also contributed folio-related work in the series

	'mm: cleanup and use more folio in page fault'

   - Jim Cromie has improved the kmemleak reporting output in the series
     'tweak kmemleak report format'.

   - In the series 'stackdepot: allow evicting stack traces' Andrey
     Konovalov permits clients (in this case KASAN) to cause eviction
     of no longer needed stack traces.

   - Charan Teja Kalla has fixed some accounting issues in the page
     allocator's atomic reserve calculations in the series 'mm:
     page_alloc: fixes for high atomic reserve calculations'.

   - Dmitry Rokosov has added to the samples/ directory some sample code
     for a userspace memcg event listener application. See the series
     'samples: introduce cgroup events listeners'.

   - Some mapletree maintenance work from Liam Howlett in the series
     'maple_tree: iterator state changes'.

   - Nhat Pham has improved zswap's approach to writeback in the series
     'workload-specific and memory pressure-driven zswap writeback'.

   - DAMON/DAMOS feature and maintenance work from SeongJae Park in the
     series

	'mm/damon: let users feed and tame/auto-tune DAMOS'
	'selftests/damon: add Python-written DAMON functionality tests'
	'mm/damon: misc updates for 6.8'

   - Yosry Ahmed has improved memcg's stats flushing in the series 'mm:
     memcg: subtree stats flushing and thresholds'.

   - In the series 'Multi-size THP for anonymous memory' Ryan Roberts
     has added a runtime opt-in feature to transparent hugepages which
     improves performance by allocating larger chunks of memory during
     anonymous page faults.

   - Matthew Wilcox has also contributed some cleanup and maintenance
     work against the buffer_head code in the series 'More buffer_head
     cleanups'.

   - Suren Baghdasaryan has done work on Andrea Arcangeli's series
     'userfaultfd move option'. UFFDIO_MOVE permits userspace heap
     compaction algorithms to move userspace's pages around rather than
     UFFDIO_COPY's alloc/copy/free.

   - Stefan Roesch has developed a 'KSM Advisor', in the series 'mm/ksm:
     Add ksm advisor'. This is a governor which tunes KSM's scanning
     aggressiveness in response to userspace's current needs.

   - Chengming Zhou has optimized zswap's temporary working memory use
     in the series 'mm/zswap: dstmem reuse optimizations and cleanups'.

   - Matthew Wilcox has performed some maintenance work on the writeback
     code, both core and within filesystems. The series is 'Clean up the
     writeback paths'.

   - Andrey Konovalov has optimized KASAN's handling of alloc and free
     stack traces for secondary-level allocators, in the series 'kasan:
     save mempool stack traces'.

   - Andrey also performed some KASAN maintenance work in the series
     'kasan: assorted clean-ups'.

   - David Hildenbrand has gone to town on the rmap code. Cleanups, more
     pte batching, folio conversions and more. See the series 'mm/rmap:
     interface overhaul'.

   - Kinsey Ho has contributed some maintenance work on the MGLRU code
     in the series 'mm/mglru: Kconfig cleanup'.

   - Matthew Wilcox has contributed lruvec page accounting code cleanups
     in the series 'Remove some lruvec page accounting functions'"

* tag 'mm-stable-2024-01-08-15-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (361 commits)
  mm, treewide: rename MAX_ORDER to MAX_PAGE_ORDER
  mm, treewide: introduce NR_PAGE_ORDERS
  selftests/mm: add separate UFFDIO_MOVE test for PMD splitting
  selftests/mm: skip test if application doesn't has root privileges
  selftests/mm: conform test to TAP format output
  selftests: mm: hugepage-mmap: conform to TAP format output
  selftests/mm: gup_test: conform test to TAP format output
  mm/selftests: hugepage-mremap: conform test to TAP format output
  mm/vmstat: move pgdemote_* out of CONFIG_NUMA_BALANCING
  mm: zsmalloc: return -ENOSPC rather than -EINVAL in zs_malloc while size is too large
  mm/memcontrol: remove __mod_lruvec_page_state()
  mm/khugepaged: use a folio more in collapse_file()
  slub: use a folio in __kmalloc_large_node
  slub: use folio APIs in free_large_kmalloc()
  slub: use alloc_pages_node() in alloc_slab_page()
  mm: remove inc/dec lruvec page state functions
  mm: ratelimit stat flush from workingset shrinker
  kasan: stop leaking stack trace handles
  mm/mglru: remove CONFIG_TRANSPARENT_HUGEPAGE
  mm/mglru: add dummy pmd_dirty()
  ...
Linus Torvalds 2024-01-09 11:18:47 -08:00
commit fb46e22a9e
303 changed files with 11387 additions and 5088 deletions


@@ -25,12 +25,14 @@ Description: Writing 'on' or 'off' to this file makes the kdamond starts or
 stops, respectively. Reading the file returns the keywords
 based on the current status. Writing 'commit' to this file
 makes the kdamond reads the user inputs in the sysfs files
-except 'state' again. Writing 'update_schemes_stats' to the
-file updates contents of schemes stats files of the kdamond.
-Writing 'update_schemes_tried_regions' to the file updates
-contents of 'tried_regions' directory of every scheme directory
-of this kdamond. Writing 'update_schemes_tried_bytes' to the
-file updates only '.../tried_regions/total_bytes' files of this
+except 'state' again. Writing 'commit_schemes_quota_goals' to
+this file makes the kdamond reads the quota goal files again.
+Writing 'update_schemes_stats' to the file updates contents of
+schemes stats files of the kdamond. Writing
+'update_schemes_tried_regions' to the file updates contents of
+'tried_regions' directory of every scheme directory of this
+kdamond. Writing 'update_schemes_tried_bytes' to the file
+updates only '.../tried_regions/total_bytes' files of this
 kdamond. Writing 'clear_schemes_tried_regions' to the file
 removes contents of the 'tried_regions' directory.
@@ -212,6 +214,25 @@ Contact: SeongJae Park <sj@kernel.org>
 Description: Writing to and reading from this file sets and gets the quotas
 charge reset interval of the scheme in milliseconds.
+What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/quotas/goals/nr_goals
+Date: Nov 2023
+Contact: SeongJae Park <sj@kernel.org>
+Description: Writing a number 'N' to this file creates the number of
+directories for setting automatic tuning of the scheme's
+aggressiveness named '0' to 'N-1' under the goals/ directory.
+What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/quotas/goals/<G>/target_value
+Date: Nov 2023
+Contact: SeongJae Park <sj@kernel.org>
+Description: Writing to and reading from this file sets and gets the target
+value of the goal metric.
+What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/quotas/goals/<G>/current_value
+Date: Nov 2023
+Contact: SeongJae Park <sj@kernel.org>
+Description: Writing to and reading from this file sets and gets the current
+value of the goal metric.
 What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/quotas/weights/sz_permil
 Date: Mar 2022
 Contact: SeongJae Park <sj@kernel.org>


@@ -328,7 +328,7 @@ as idle::
 From now on, any pages on zram are idle pages. The idle mark
 will be removed until someone requests access of the block.
 IOW, unless there is access request, those pages are still idle pages.
-Additionally, when CONFIG_ZRAM_MEMORY_TRACKING is enabled pages can be
+Additionally, when CONFIG_ZRAM_TRACK_ENTRY_ACTIME is enabled pages can be
 marked as idle based on how long (in seconds) it's been since they were
 last accessed::


@@ -1693,6 +1693,21 @@ PAGE_SIZE multiple when read back.
 limit, it will refuse to take any more stores before existing
 entries fault back in or are written out to disk.
+memory.zswap.writeback
+A read-write single value file. The default value is "1". The
+initial value of the root cgroup is 1, and when a new cgroup is
+created, it inherits the current value of its parent.
+When this is set to 0, all swapping attempts to swapping devices
+are disabled. This included both zswap writebacks, and swapping due
+to zswap store failures. If the zswap store failures are recurring
+(for e.g if the pages are incompressible), users can observe
+reclaim inefficiency after disabling writeback (because the same
+pages might be rejected again and again).
+Note that this is subtly different from setting memory.swap.max to
+0, as it still allows for pages to be written to the zswap pool.
 memory.pressure
 A read-only nested-keyed file.
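
To illustrate the memory.zswap.writeback knob documented in this hunk, here is a minimal sketch that sets and reads it for one cgroup. The cgroup v2 mount point and both helper names are assumptions made for this example and are not part of the patch.

```python
from pathlib import Path

# Assumed cgroup v2 mount point; adjust for the system at hand.
CGROUP_ROOT = Path("/sys/fs/cgroup")


def set_zswap_writeback(cgroup_dir, enabled):
    """Write 1 (allow) or 0 (disable) to the cgroup's memory.zswap.writeback."""
    knob = Path(cgroup_dir) / "memory.zswap.writeback"
    knob.write_text("1" if enabled else "0")


def get_zswap_writeback(cgroup_dir):
    """Return True when the file currently holds 1."""
    knob = Path(cgroup_dir) / "memory.zswap.writeback"
    return knob.read_text().strip() == "1"
```

A caller with sufficient privileges might run set_zswap_writeback(CGROUP_ROOT / "mygroup", False), where "mygroup" is a hypothetical cgroup name. Per the text above, pages can still be stored in the zswap pool after that; only writes to the swap device stop.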


@@ -172,7 +172,7 @@ variables.
 Offset of the free_list's member. This value is used to compute the number
 of free pages.
-Each zone has a free_area structure array called free_area[MAX_ORDER + 1].
+Each zone has a free_area structure array called free_area[NR_PAGE_ORDERS].
 The free_list represents a linked list of free page blocks.
 (list_head, next|prev)
@@ -189,11 +189,11 @@ Offsets of the vmap_area's members. They carry vmalloc-specific
 information. Makedumpfile gets the start address of the vmalloc region
 from this.
-(zone.free_area, MAX_ORDER + 1)
--------------------------------
+(zone.free_area, NR_PAGE_ORDERS)
+--------------------------------
 Free areas descriptor. User-space tools use this value to iterate the
-free_area ranges. MAX_ORDER is used by the zone buddy allocator.
+free_area ranges. NR_PAGE_ORDERS is used by the zone buddy allocator.
 prb
 ---


@@ -970,17 +970,17 @@
 buddy allocator. Bigger value increase the probability
 of catching random memory corruption, but reduce the
 amount of memory for normal system use. The maximum
-possible value is MAX_ORDER/2. Setting this parameter
-to 1 or 2 should be enough to identify most random
-memory corruption problems caused by bugs in kernel or
-driver code when a CPU writes to (or reads from) a
-random memory location. Note that there exists a class
-of memory corruptions problems caused by buggy H/W or
-F/W or by drivers badly programming DMA (basically when
-memory is written at bus level and the CPU MMU is
-bypassed) which are not detectable by
-CONFIG_DEBUG_PAGEALLOC, hence this option will not help
-tracking down these problems.
+possible value is MAX_PAGE_ORDER/2. Setting this
+parameter to 1 or 2 should be enough to identify most
+random memory corruption problems caused by bugs in
+kernel or driver code when a CPU writes to (or reads
+from) a random memory location. Note that there exists
+a class of memory corruptions problems caused by buggy
+H/W or F/W or by drivers badly programming DMA
+(basically when memory is written at bus level and the
+CPU MMU is bypassed) which are not detectable by
+CONFIG_DEBUG_PAGEALLOC, hence this option will not
+help tracking down these problems.
 debug_pagealloc=
 [KNL] When CONFIG_DEBUG_PAGEALLOC is set, this parameter
@@ -4136,7 +4136,7 @@
 [KNL] Minimal page reporting order
 Format: <integer>
 Adjust the minimal page reporting order. The page
-reporting is disabled when it exceeds MAX_ORDER.
+reporting is disabled when it exceeds MAX_PAGE_ORDER.
 panic= [KNL] Kernel behaviour on panic: delay <timeout>
 timeout > 0: seconds before rebooting


@@ -59,41 +59,47 @@ Files Hierarchy
 The files hierarchy of DAMON sysfs interface is shown below. In the below
 figure, parents-children relations are represented with indentations, each
 directory is having ``/`` suffix, and files in each directory are separated by
-comma (","). ::
+comma (",").
-/sys/kernel/mm/damon/admin
-│ kdamonds/nr_kdamonds
-│ │ 0/state,pid
-│ │ │ contexts/nr_contexts
-│ │ │ │ 0/avail_operations,operations
-│ │ │ │ │ monitoring_attrs/
+.. parsed-literal::
+:ref:`/sys/kernel/mm/damon <sysfs_root>`/admin
+│ :ref:`kdamonds <sysfs_kdamonds>`/nr_kdamonds
+│ │ :ref:`0 <sysfs_kdamond>`/state,pid
+│ │ │ :ref:`contexts <sysfs_contexts>`/nr_contexts
+│ │ │ │ :ref:`0 <sysfs_context>`/avail_operations,operations
+│ │ │ │ │ :ref:`monitoring_attrs <sysfs_monitoring_attrs>`/
 │ │ │ │ │ │ intervals/sample_us,aggr_us,update_us
 │ │ │ │ │ │ nr_regions/min,max
-│ │ │ │ │ targets/nr_targets
-│ │ │ │ │ │ 0/pid_target
-│ │ │ │ │ │ │ regions/nr_regions
-│ │ │ │ │ │ │ │ 0/start,end
+│ │ │ │ │ :ref:`targets <sysfs_targets>`/nr_targets
+│ │ │ │ │ │ :ref:`0 <sysfs_target>`/pid_target
+│ │ │ │ │ │ │ :ref:`regions <sysfs_regions>`/nr_regions
+│ │ │ │ │ │ │ │ :ref:`0 <sysfs_region>`/start,end
 │ │ │ │ │ │ │ │ ...
 │ │ │ │ │ │ ...
-│ │ │ │ │ schemes/nr_schemes
-│ │ │ │ │ │ 0/action,apply_interval_us
-│ │ │ │ │ │ │ access_pattern/
+│ │ │ │ │ :ref:`schemes <sysfs_schemes>`/nr_schemes
+│ │ │ │ │ │ :ref:`0 <sysfs_scheme>`/action,apply_interval_us
+│ │ │ │ │ │ │ :ref:`access_pattern <sysfs_access_pattern>`/
 │ │ │ │ │ │ │ │ sz/min,max
 │ │ │ │ │ │ │ │ nr_accesses/min,max
 │ │ │ │ │ │ │ │ age/min,max
-│ │ │ │ │ │ │ quotas/ms,bytes,reset_interval_ms
+│ │ │ │ │ │ │ :ref:`quotas <sysfs_quotas>`/ms,bytes,reset_interval_ms
 │ │ │ │ │ │ │ │ weights/sz_permil,nr_accesses_permil,age_permil
-│ │ │ │ │ │ │ watermarks/metric,interval_us,high,mid,low
-│ │ │ │ │ │ │ filters/nr_filters
+│ │ │ │ │ │ │ │ :ref:`goals <sysfs_schemes_quota_goals>`/nr_goals
+│ │ │ │ │ │ │ │ │ 0/target_value,current_value
+│ │ │ │ │ │ │ :ref:`watermarks <sysfs_watermarks>`/metric,interval_us,high,mid,low
+│ │ │ │ │ │ │ :ref:`filters <sysfs_filters>`/nr_filters
 │ │ │ │ │ │ │ │ 0/type,matching,memcg_id
-│ │ │ │ │ │ │ stats/nr_tried,sz_tried,nr_applied,sz_applied,qt_exceeds
-│ │ │ │ │ │ │ tried_regions/total_bytes
+│ │ │ │ │ │ │ :ref:`stats <sysfs_schemes_stats>`/nr_tried,sz_tried,nr_applied,sz_applied,qt_exceeds
+│ │ │ │ │ │ │ :ref:`tried_regions <sysfs_schemes_tried_regions>`/total_bytes
 │ │ │ │ │ │ │ │ 0/start,end,nr_accesses,age
 │ │ │ │ │ │ │ │ ...
 │ │ │ │ │ │ ...
 │ │ │ │ ...
 │ │ ...
+.. _sysfs_root:
 Root
 ----
@@ -102,6 +108,8 @@ has one directory named ``admin``. The directory contains the files for
 privileged user space programs' control of DAMON. User space tools or daemons
 having the root permission could use this directory.
+.. _sysfs_kdamonds:
 kdamonds/
 ---------
@@ -113,6 +121,8 @@ details) exists. In the beginning, this directory has only one file,
 child directories named ``0`` to ``N-1``. Each directory represents each
 kdamond.
+.. _sysfs_kdamond:
 kdamonds/<N>/
 -------------
@@ -120,29 +130,37 @@ In each kdamond directory, two files (``state`` and ``pid``) and one directory
 (``contexts``) exist.
 Reading ``state`` returns ``on`` if the kdamond is currently running, or
-``off`` if it is not running. Writing ``on`` or ``off`` makes the kdamond be
-in the state. Writing ``commit`` to the ``state`` file makes kdamond reads the
-user inputs in the sysfs files except ``state`` file again. Writing
-``update_schemes_stats`` to ``state`` file updates the contents of stats files
-for each DAMON-based operation scheme of the kdamond. For details of the
-stats, please refer to :ref:`stats section <sysfs_schemes_stats>`.
+``off`` if it is not running.
-Writing ``update_schemes_tried_regions`` to ``state`` file updates the
-DAMON-based operation scheme action tried regions directory for each
-DAMON-based operation scheme of the kdamond. Writing
-``update_schemes_tried_bytes`` to ``state`` file updates only
-``.../tried_regions/total_bytes`` files. Writing
-``clear_schemes_tried_regions`` to ``state`` file clears the DAMON-based
-operating scheme action tried regions directory for each DAMON-based operation
-scheme of the kdamond. For details of the DAMON-based operation scheme action
-tried regions directory, please refer to :ref:`tried_regions section
-<sysfs_schemes_tried_regions>`.
+Users can write below commands for the kdamond to the ``state`` file.
+- ``on``: Start running.
+- ``off``: Stop running.
+- ``commit``: Read the user inputs in the sysfs files except ``state`` file
+again.
+- ``commit_schemes_quota_goals``: Read the DAMON-based operation schemes'
+:ref:`quota goals <sysfs_schemes_quota_goals>`.
+- ``update_schemes_stats``: Update the contents of stats files for each
+DAMON-based operation scheme of the kdamond. For details of the stats,
+please refer to :ref:`stats section <sysfs_schemes_stats>`.
+- ``update_schemes_tried_regions``: Update the DAMON-based operation scheme
+action tried regions directory for each DAMON-based operation scheme of the
+kdamond. For details of the DAMON-based operation scheme action tried
+regions directory, please refer to
+:ref:`tried_regions section <sysfs_schemes_tried_regions>`.
+- ``update_schemes_tried_bytes``: Update only ``.../tried_regions/total_bytes``
+files.
+- ``clear_schemes_tried_regions``: Clear the DAMON-based operating scheme
+action tried regions directory for each DAMON-based operation scheme of the
+kdamond.
 If the state is ``on``, reading ``pid`` shows the pid of the kdamond thread.
 ``contexts`` directory contains files for controlling the monitoring contexts
 that this kdamond will execute.
+.. _sysfs_contexts:
 kdamonds/<N>/contexts/
 ----------------------
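
The command list in this hunk maps naturally onto a small helper that refuses anything the documentation does not name. This is a sketch only: the helper name and the idea of validating in user space are assumptions for illustration, while the command strings are copied from the hunk.

```python
from pathlib import Path

# Commands the documentation lists as writable to a kdamond's state file.
KDAMOND_STATE_COMMANDS = (
    "on",
    "off",
    "commit",
    "commit_schemes_quota_goals",
    "update_schemes_stats",
    "update_schemes_tried_regions",
    "update_schemes_tried_bytes",
    "clear_schemes_tried_regions",
)


def write_kdamond_state(kdamond_dir, command):
    """Write one documented command to <kdamond_dir>/state."""
    if command not in KDAMOND_STATE_COMMANDS:
        raise ValueError(f"unknown kdamond state command: {command!r}")
    (Path(kdamond_dir) / "state").write_text(command)
```

On a DAMON-enabled kernel the directory would be something like /sys/kernel/mm/damon/admin/kdamonds/0, and the write needs root permission.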
@@ -153,7 +171,7 @@ number (``N``) to the file creates the number of child directories named as
 details). At the moment, only one context per kdamond is supported, so only
 ``0`` or ``1`` can be written to the file.
-.. _sysfs_contexts:
+.. _sysfs_context:
 contexts/<N>/
 -------------
@@ -203,6 +221,8 @@ writing to and rading from the files.
 For more details about the intervals and monitoring regions range, please refer
 to the Design document (:doc:`/mm/damon/design`).
+.. _sysfs_targets:
 contexts/<N>/targets/
 ---------------------
@@ -210,6 +230,8 @@ In the beginning, this directory has only one file, ``nr_targets``. Writing a
 number (``N``) to the file creates the number of child directories named ``0``
 to ``N-1``. Each directory represents each monitoring target.
+.. _sysfs_target:
 targets/<N>/
 ------------
@@ -244,6 +266,8 @@ In the beginning, this directory has only one file, ``nr_regions``. Writing a
 number (``N``) to the file creates the number of child directories named ``0``
 to ``N-1``. Each directory represents each initial monitoring target region.
+.. _sysfs_region:
 regions/<N>/
 ------------
@@ -254,6 +278,8 @@ region by writing to and reading from the files, respectively.
 Each region should not overlap with others. ``end`` of directory ``N`` should
 be equal or smaller than ``start`` of directory ``N+1``.
+.. _sysfs_schemes:
 contexts/<N>/schemes/
 ---------------------
@@ -265,6 +291,8 @@ In the beginning, this directory has only one file, ``nr_schemes``. Writing a
 number (``N``) to the file creates the number of child directories named ``0``
 to ``N-1``. Each directory represents each DAMON-based operation scheme.
+.. _sysfs_scheme:
 schemes/<N>/
 ------------
@@ -277,7 +305,7 @@ The ``action`` file is for setting and getting the scheme's :ref:`action
 from the file and their meaning are as below.
 Note that support of each action depends on the running DAMON operations set
-:ref:`implementation <sysfs_contexts>`.
+:ref:`implementation <sysfs_context>`.
 - ``willneed``: Call ``madvise()`` for the region with ``MADV_WILLNEED``.
 Supported by ``vaddr`` and ``fvaddr`` operations set.
@@ -299,6 +327,8 @@ Note that support of each action depends on the running DAMON operations set
 The ``apply_interval_us`` file is for setting and getting the scheme's
 :ref:`apply_interval <damon_design_damos>` in microseconds.
+.. _sysfs_access_pattern:
 schemes/<N>/access_pattern/
 ---------------------------
@@ -312,6 +342,8 @@ to and reading from the ``min`` and ``max`` files under ``sz``,
 ``nr_accesses``, and ``age`` directories, respectively. Note that the ``min``
 and the ``max`` form a closed interval.
+.. _sysfs_quotas:
 schemes/<N>/quotas/
 -------------------
@@ -319,8 +351,7 @@ The directory for the :ref:`quotas <damon_design_damos_quotas>` of the given
 DAMON-based operation scheme.
 Under ``quotas`` directory, three files (``ms``, ``bytes``,
-``reset_interval_ms``) and one directory (``weights``) having three files
-(``sz_permil``, ``nr_accesses_permil``, and ``age_permil``) in it exist.
+``reset_interval_ms``) and two directores (``weights`` and ``goals``) exist.
 You can set the ``time quota`` in milliseconds, ``size quota`` in bytes, and
 ``reset interval`` in milliseconds by writing the values to the three files,
@ -330,11 +361,37 @@ apply the action to only up to ``bytes`` bytes of memory regions within the
``reset_interval_ms``. Setting both ``ms`` and ``bytes`` zero disables the ``reset_interval_ms``. Setting both ``ms`` and ``bytes`` zero disables the
quota limits. quota limits.
You can also set the :ref:`prioritization weights Under ``weights`` directory, three files (``sz_permil``,
``nr_accesses_permil``, and ``age_permil``) exist.
You can set the :ref:`prioritization weights
<damon_design_damos_quotas_prioritization>` for size, access frequency, and age <damon_design_damos_quotas_prioritization>` for size, access frequency, and age
in per-thousand unit by writing the values to the three files under the in per-thousand unit by writing the values to the three files under the
``weights`` directory. ``weights`` directory.
.. _sysfs_schemes_quota_goals:
schemes/<N>/quotas/goals/
-------------------------
The directory for the :ref:`automatic quota tuning goals
<damon_design_damos_quotas_auto_tuning>` of the given DAMON-based operation
scheme.
In the beginning, this directory has only one file, ``nr_goals``. Writing a
number (``N``) to the file creates that number of child directories named ``0``
to ``N-1``. Each directory represents one goal and its current achievement.
Among multiple feedback values, the best one is used.
Each goal directory contains two files, namely ``target_value`` and
``current_value``. Users can write any number to, and read it back from, those
files to set the feedback. Examples of metrics for the values include the
latency or throughput of the main user space workload, and system metrics such
as free memory ratio or memory pressure stall time (PSI). Note that users
should write ``commit_schemes_quota_goals`` to the ``state`` file of the
:ref:`kdamond directory <sysfs_kdamond>` to pass the feedback to DAMON.
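The sequence described above can be sketched as a small shell function. This is only an illustration: the name ``damos_feed_goal`` and its argument order are hypothetical, the caller supplies both the kdamond directory and the scheme's ``quotas/goals`` directory, and ``nr_goals`` is assumed to have been written beforehand so that the goal directory already exists.

```shell
# Hypothetical helper, not part of DAMON: update one quota goal and commit it.
#   $1: kdamond directory (the one holding the "state" file)
#   $2: schemes/<N>/quotas/goals directory of the scheme
#   $3: goal index (assumes nr_goals was already written, so $2/$3 exists)
#   $4: target value
#   $5: current value
damos_feed_goal() {
    kdamond_dir=$1
    goals_dir=$2
    goal=$3

    echo "$4" > "$goals_dir/$goal/target_value" || return 1
    echo "$5" > "$goals_dir/$goal/current_value" || return 1
    # Pass the feedback to DAMON.
    echo commit_schemes_quota_goals > "$kdamond_dir/state"
}
```

A monitoring loop could call such a function periodically, passing a freshly measured metric as the current value each time.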
.. _sysfs_watermarks:
schemes/<N>/watermarks/ schemes/<N>/watermarks/
----------------------- -----------------------
@ -354,6 +411,8 @@ as below.
The ``interval`` should be written in microseconds unit. The ``interval`` should be written in microseconds unit.
.. _sysfs_filters:
schemes/<N>/filters/ schemes/<N>/filters/
-------------------- --------------------
@ -394,7 +453,7 @@ pages of all memory cgroups except ``/having_care_already``.::
echo N > 1/matching echo N > 1/matching
Note that ``anon`` and ``memcg`` filters are currently supported only when Note that ``anon`` and ``memcg`` filters are currently supported only when
``paddr`` :ref:`implementation <sysfs_contexts>` is being used. ``paddr`` :ref:`implementation <sysfs_context>` is being used.
Also, memory regions that are filtered out by ``addr`` or ``target`` filters Also, memory regions that are filtered out by ``addr`` or ``target`` filters
are not counted as the scheme has tried to those, while regions that filtered are not counted as the scheme has tried to those, while regions that filtered
@ -449,6 +508,8 @@ and query-like efficient data access monitoring results retrievals. For the
latter use case, in particular, users can set the ``action`` as ``stat`` and latter use case, in particular, users can set the ``action`` as ``stat`` and
set the ``access pattern`` as their interested pattern that they want to query. set the ``access pattern`` as their interested pattern that they want to query.
.. _sysfs_schemes_tried_region:
tried_regions/<N>/ tried_regions/<N>/
------------------ ------------------


@ -80,6 +80,9 @@ pages_to_scan
how many pages to scan before ksmd goes to sleep how many pages to scan before ksmd goes to sleep
e.g. ``echo 100 > /sys/kernel/mm/ksm/pages_to_scan``. e.g. ``echo 100 > /sys/kernel/mm/ksm/pages_to_scan``.
The pages_to_scan value cannot be changed if ``advisor_mode`` has
been set to scan-time.
Default: 100 (chosen for demonstration purposes) Default: 100 (chosen for demonstration purposes)
sleep_millisecs sleep_millisecs
@ -164,6 +167,29 @@ smart_scan
optimization is enabled. The ``pages_skipped`` metric shows how optimization is enabled. The ``pages_skipped`` metric shows how
effective the setting is. effective the setting is.
advisor_mode
The ``advisor_mode`` selects the current advisor. Two modes are
supported: none and scan-time. The default is none. By setting
``advisor_mode`` to scan-time, the scan time advisor is enabled.
The section about ``advisor`` explains in detail how the scan time
advisor works.
advisor_max_cpu
specifies the upper limit of the cpu percent usage of the ksmd
background thread. The default is 70.
advisor_target_scan_time
specifies the target scan time in seconds to scan all the candidate
pages. The default value is 200 seconds.
advisor_min_pages_to_scan
specifies the lower limit of the ``pages_to_scan`` parameter of the
scan time advisor. The default is 500.
advisor_max_pages_to_scan
specifies the upper limit of the ``pages_to_scan`` parameter of the
scan time advisor. The default is 30000.
The effectiveness of KSM and MADV_MERGEABLE is shown in ``/sys/kernel/mm/ksm/``: The effectiveness of KSM and MADV_MERGEABLE is shown in ``/sys/kernel/mm/ksm/``:
general_profit general_profit
@ -263,6 +289,35 @@ ksm_swpin_copy
note that KSM page might be copied when swapping in because do_swap_page() note that KSM page might be copied when swapping in because do_swap_page()
cannot do all the locking needed to reconstitute a cross-anon_vma KSM page. cannot do all the locking needed to reconstitute a cross-anon_vma KSM page.
Advisor
=======
The number of candidate pages for KSM is dynamic. It can often be observed
that during the startup of an application more candidate pages need to be
processed. Without an advisor the ``pages_to_scan`` parameter needs to be
sized for the maximum number of candidate pages. The scan time advisor can
change the ``pages_to_scan`` parameter based on demand.
The advisor can be enabled, so KSM can automatically adapt to changes in the
number of candidate pages to scan. Two advisors are implemented: none and
scan-time. With none, no advisor is enabled. The default is none.
The scan time advisor changes the ``pages_to_scan`` parameter based on the
observed scan times. The possible values for the ``pages_to_scan`` parameter are
limited by the ``advisor_max_cpu`` parameter. In addition there is also the
``advisor_target_scan_time`` parameter. This parameter sets the target time to
scan all the KSM candidate pages. The parameter ``advisor_target_scan_time``
determines how aggressively the scan time advisor scans candidate pages. Lower
values make the scan time advisor scan more aggressively. This is the most
important parameter for the configuration of the scan time advisor.
The initial value and the maximum value can be changed with
``advisor_min_pages_to_scan`` and ``advisor_max_pages_to_scan``. The default
values are sufficient for most workloads and use cases.
The ``pages_to_scan`` parameter is re-calculated after a scan has been completed.
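A minimal sketch of switching on the scan time advisor follows. The function name ``ksm_enable_scan_time_advisor`` is hypothetical, and the KSM sysfs directory is taken as an optional argument that falls back to ``/sys/kernel/mm/ksm``; only the files described above are written.

```shell
# Hypothetical helper: enable the scan time advisor with a given target.
#   $1: target scan time in seconds
#   $2: KSM sysfs directory (optional, defaults to /sys/kernel/mm/ksm)
ksm_enable_scan_time_advisor() {
    ksm_dir=${2:-/sys/kernel/mm/ksm}

    # Lower targets make the advisor scan more aggressively.
    echo "$1" > "$ksm_dir/advisor_target_scan_time" || return 1
    echo scan-time > "$ksm_dir/advisor_mode"
}
```

For instance, ``ksm_enable_scan_time_advisor 100`` would request a target of half the default 200 seconds. Once the mode is scan-time, ``pages_to_scan`` can no longer be changed by hand.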
-- --
Izik Eidus, Izik Eidus,
Hugh Dickins, 17 Nov 2009 Hugh Dickins, 17 Nov 2009


@ -253,6 +253,7 @@ Following flags about pages are currently supported:
- ``PAGE_IS_SWAPPED`` - Page is in swapped - ``PAGE_IS_SWAPPED`` - Page is in swapped
- ``PAGE_IS_PFNZERO`` - Page has zero PFN - ``PAGE_IS_PFNZERO`` - Page has zero PFN
- ``PAGE_IS_HUGE`` - Page is THP or Hugetlb backed - ``PAGE_IS_HUGE`` - Page is THP or Hugetlb backed
- ``PAGE_IS_SOFT_DIRTY`` - Page is soft-dirty
The ``struct pm_scan_arg`` is used as the argument of the IOCTL. The ``struct pm_scan_arg`` is used as the argument of the IOCTL.


@ -45,10 +45,25 @@ components:
the two is using hugepages just because of the fact the TLB miss is the two is using hugepages just because of the fact the TLB miss is
going to run faster. going to run faster.
Modern kernels support "multi-size THP" (mTHP), which introduces the
ability to allocate memory in blocks that are bigger than a base page
but smaller than traditional PMD-size (as described above), in
increments of a power-of-2 number of pages. mTHP can back anonymous
memory (for example 16K, 32K, 64K, etc). These THPs continue to be
PTE-mapped, but in many cases can still provide similar benefits to
those outlined above: Page faults are significantly reduced (by a
factor of e.g. 4, 8, 16, etc), but latency spikes are much less
prominent because the size of each page isn't as huge as the PMD-sized
variant and there is less memory to clear in each page fault. Some
architectures also employ TLB compression mechanisms to squeeze more
entries in when a set of PTEs are virtually and physically contiguous
and appropriately aligned. In this case, TLB misses will occur less
often.
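The size and fault-reduction figures quoted above follow from simple power-of-2 arithmetic. The sketch below assumes the caller passes the base page size in bytes and the order (the power of two number of base pages); both function names are made up for illustration.

```shell
# Size in kB of a THP made of 2^order base pages.
#   $1: base page size in bytes (e.g. 4096)
#   $2: order
mthp_size_kb() {
    echo $(( ($1 << $2) / 1024 ))
}

# Number of base pages covered by one THP of the given order, i.e. the
# factor by which page faults can be reduced at best.
#   $1: order
mthp_pages_per_fault() {
    echo $(( 1 << $1 ))
}
```

With 4K base pages, ``mthp_size_kb 4096 2`` gives 16 and ``mthp_size_kb 4096 4`` gives 64, while ``mthp_pages_per_fault 4`` gives 16.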
THP can be enabled system wide or restricted to certain tasks or even THP can be enabled system wide or restricted to certain tasks or even
memory ranges inside task's address space. Unless THP is completely memory ranges inside task's address space. Unless THP is completely
disabled, there is ``khugepaged`` daemon that scans memory and disabled, there is ``khugepaged`` daemon that scans memory and
collapses sequences of basic pages into huge pages. collapses sequences of basic pages into PMD-sized huge pages.
The THP behaviour is controlled via :ref:`sysfs <thp_sysfs>` The THP behaviour is controlled via :ref:`sysfs <thp_sysfs>`
interface and using madvise(2) and prctl(2) system calls. interface and using madvise(2) and prctl(2) system calls.
@ -95,12 +110,40 @@ Global THP controls
Transparent Hugepage Support for anonymous memory can be entirely disabled Transparent Hugepage Support for anonymous memory can be entirely disabled
(mostly for debugging purposes) or only enabled inside MADV_HUGEPAGE (mostly for debugging purposes) or only enabled inside MADV_HUGEPAGE
regions (to avoid the risk of consuming more memory resources) or enabled regions (to avoid the risk of consuming more memory resources) or enabled
system wide. This can be achieved with one of:: system wide. This can be achieved per-supported-THP-size with one of::
echo always >/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/enabled
echo madvise >/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/enabled
echo never >/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/enabled
where <size> is the hugepage size being addressed, the available sizes
for which vary by system.
For example::
echo always >/sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
Alternatively it is possible to specify that a given hugepage size
will inherit the top-level "enabled" value::
echo inherit >/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/enabled
For example::
echo inherit >/sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
The top-level setting (for use with "inherit") can be set by issuing
one of the following commands::
echo always >/sys/kernel/mm/transparent_hugepage/enabled echo always >/sys/kernel/mm/transparent_hugepage/enabled
echo madvise >/sys/kernel/mm/transparent_hugepage/enabled echo madvise >/sys/kernel/mm/transparent_hugepage/enabled
echo never >/sys/kernel/mm/transparent_hugepage/enabled echo never >/sys/kernel/mm/transparent_hugepage/enabled
By default, PMD-sized hugepages have enabled="inherit" and all other
hugepage sizes have enabled="never". If enabling multiple hugepage
sizes, the kernel will select the most appropriate enabled size for a
given allocation.
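Since the available sizes vary by system, it can help to list the per-size settings before changing them. The following is a sketch only: ``thp_show_enabled`` is a hypothetical name, and the THP sysfs directory is an optional argument that defaults to the path used in the commands above.

```shell
# Hypothetical helper: print the "enabled" setting of every THP size.
#   $1: THP sysfs directory (optional, defaults to
#       /sys/kernel/mm/transparent_hugepage)
thp_show_enabled() {
    thp_dir=${1:-/sys/kernel/mm/transparent_hugepage}

    for f in "$thp_dir"/hugepages-*kB/enabled; do
        # Skip the unexpanded pattern when no per-size directory exists.
        [ -r "$f" ] || continue
        size_dir=${f%/enabled}
        echo "${size_dir##*/}: $(cat "$f")"
    done
}
```

Each output line pairs a ``hugepages-<size>kB`` directory name with the unmodified contents of its ``enabled`` file.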
It's also possible to limit defrag efforts in the VM to generate It's also possible to limit defrag efforts in the VM to generate
anonymous hugepages in case they're not immediately free to madvise anonymous hugepages in case they're not immediately free to madvise
regions or to never try to defrag memory and simply fallback to regular regions or to never try to defrag memory and simply fallback to regular
@ -146,25 +189,34 @@ madvise
never never
should be self-explanatory. should be self-explanatory.
By default kernel tries to use huge zero page on read page fault to By default kernel tries to use huge, PMD-mappable zero page on read
anonymous mapping. It's possible to disable huge zero page by writing 0 page fault to anonymous mapping. It's possible to disable huge zero
or enable it back by writing 1:: page by writing 0 or enable it back by writing 1::
echo 0 >/sys/kernel/mm/transparent_hugepage/use_zero_page echo 0 >/sys/kernel/mm/transparent_hugepage/use_zero_page
echo 1 >/sys/kernel/mm/transparent_hugepage/use_zero_page echo 1 >/sys/kernel/mm/transparent_hugepage/use_zero_page
Some userspace (such as a test program, or an optimized memory allocation Some userspace (such as a test program, or an optimized memory
library) may want to know the size (in bytes) of a transparent hugepage:: allocation library) may want to know the size (in bytes) of a
PMD-mappable transparent hugepage::
cat /sys/kernel/mm/transparent_hugepage/hpage_pmd_size cat /sys/kernel/mm/transparent_hugepage/hpage_pmd_size
khugepaged will be automatically started when khugepaged will be automatically started when one or more hugepage
transparent_hugepage/enabled is set to "always" or "madvise, and it'll sizes are enabled (either by directly setting "always" or "madvise",
be automatically shutdown if it's set to "never". or by setting "inherit" while the top-level enabled is set to "always"
or "madvise"), and it'll be automatically shutdown when the last
hugepage size is disabled (either by directly setting "never", or by
setting "inherit" while the top-level enabled is set to "never").
Khugepaged controls Khugepaged controls
------------------- -------------------
.. note::
khugepaged currently only searches for opportunities to collapse to
PMD-sized THP and no attempt is made to collapse to other THP
sizes.
khugepaged runs usually at low frequency so while one may not want to khugepaged runs usually at low frequency so while one may not want to
invoke defrag algorithms synchronously during the page faults, it invoke defrag algorithms synchronously during the page faults, it
should be worth invoking defrag at least in khugepaged. However it's should be worth invoking defrag at least in khugepaged. However it's
@ -282,19 +334,26 @@ force
Need of application restart Need of application restart
=========================== ===========================
The transparent_hugepage/enabled values and tmpfs mount option only affect The transparent_hugepage/enabled and
future behavior. So to make them effective you need to restart any transparent_hugepage/hugepages-<size>kB/enabled values and tmpfs mount
application that could have been using hugepages. This also applies to the option only affect future behavior. So to make them effective you need
regions registered in khugepaged. to restart any application that could have been using hugepages. This
also applies to the regions registered in khugepaged.
Monitoring usage Monitoring usage
================ ================
The number of anonymous transparent huge pages currently used by the .. note::
Currently the below counters only record events relating to
PMD-sized THP. Events relating to other THP sizes are not included.
The number of PMD-sized anonymous transparent huge pages currently used by the
system is available by reading the AnonHugePages field in ``/proc/meminfo``. system is available by reading the AnonHugePages field in ``/proc/meminfo``.
To identify what applications are using anonymous transparent huge pages, To identify what applications are using PMD-sized anonymous transparent huge
it is necessary to read ``/proc/PID/smaps`` and count the AnonHugePages fields pages, it is necessary to read ``/proc/PID/smaps`` and count the AnonHugePages
for each mapping. fields for each mapping. (Note that AnonHugePages only applies to traditional
PMD-sized THP for historical reasons and should have been called
AnonHugePmdMapped).
The number of file transparent huge pages mapped to userspace is available The number of file transparent huge pages mapped to userspace is available
by reading ShmemPmdMapped and ShmemHugePages fields in ``/proc/meminfo``. by reading ShmemPmdMapped and ShmemHugePages fields in ``/proc/meminfo``.
@ -413,7 +472,7 @@ for huge pages.
Optimizing the applications Optimizing the applications
=========================== ===========================
To be guaranteed that the kernel will map a 2M page immediately in any To be guaranteed that the kernel will map a THP immediately in any
memory region, the mmap region has to be hugepage naturally memory region, the mmap region has to be hugepage naturally
aligned. posix_memalign() can provide that guarantee. aligned. posix_memalign() can provide that guarantee.


@ -113,6 +113,9 @@ events, except page fault notifications, may be generated:
areas. ``UFFD_FEATURE_MINOR_SHMEM`` is the analogous feature indicating areas. ``UFFD_FEATURE_MINOR_SHMEM`` is the analogous feature indicating
support for shmem virtual memory areas. support for shmem virtual memory areas.
- ``UFFD_FEATURE_MOVE`` indicates that the kernel supports moving
existing page contents from userspace.
The userland application should set the feature flags it intends to use The userland application should set the feature flags it intends to use
when invoking the ``UFFDIO_API`` ioctl, to request that those features be when invoking the ``UFFDIO_API`` ioctl, to request that those features be
enabled if supported. enabled if supported.


@ -153,6 +153,26 @@ attribute, e. g.::
Setting this parameter to 100 will disable the hysteresis. Setting this parameter to 100 will disable the hysteresis.
Some users cannot tolerate the swapping that comes with zswap store failures
and zswap writebacks. Swapping can be disabled entirely (without disabling
zswap itself) on a cgroup-basis as follows:
echo 0 > /sys/fs/cgroup/<cgroup-name>/memory.zswap.writeback
Note that if the store failures are recurring (e.g. if the pages are
incompressible), users can observe reclaim inefficiency after disabling
writeback (because the same pages might be rejected again and again).
When there is a sizable amount of cold memory residing in the zswap pool, it
can be advantageous to proactively write these cold pages to swap and reclaim
the memory for other use cases. By default, the zswap shrinker is disabled.
Users can enable it as follows:
echo Y > /sys/module/zswap/parameters/shrinker_enabled
This can be enabled at boot time if ``CONFIG_ZSWAP_SHRINKER_DEFAULT_ON`` is
selected.
A debugfs interface is provided for various statistic about pool size, number A debugfs interface is provided for various statistic about pool size, number
of pages stored, same-value filled pages and various counters for the reasons of pages stored, same-value filled pages and various counters for the reasons
pages are rejected. pages are rejected.


@ -81,6 +81,9 @@ section.
Sometimes it is necessary to ensure the next call to store to a maple tree does Sometimes it is necessary to ensure the next call to store to a maple tree does
not allocate memory, please see :ref:`maple-tree-advanced-api` for this use case. not allocate memory, please see :ref:`maple-tree-advanced-api` for this use case.
You can use mtree_dup() to duplicate an entire maple tree. This is more
efficient than inserting all elements one by one into a new tree.
Finally, you can remove all entries from a maple tree by calling Finally, you can remove all entries from a maple tree by calling
mtree_destroy(). If the maple tree entries are pointers, you may wish to free mtree_destroy(). If the maple tree entries are pointers, you may wish to free
the entries first. the entries first.
@ -112,6 +115,7 @@ Takes ma_lock internally:
* mtree_insert() * mtree_insert()
* mtree_insert_range() * mtree_insert_range()
* mtree_erase() * mtree_erase()
* mtree_dup()
* mtree_destroy() * mtree_destroy()
* mt_set_in_rcu() * mt_set_in_rcu()
* mt_clear_in_rcu() * mt_clear_in_rcu()


@ -261,7 +261,7 @@ prototypes::
struct folio *src, enum migrate_mode); struct folio *src, enum migrate_mode);
int (*launder_folio)(struct folio *); int (*launder_folio)(struct folio *);
bool (*is_partially_uptodate)(struct folio *, size_t from, size_t count); bool (*is_partially_uptodate)(struct folio *, size_t from, size_t count);
int (*error_remove_page)(struct address_space *, struct page *); int (*error_remove_folio)(struct address_space *, struct folio *);
int (*swap_activate)(struct swap_info_struct *sis, struct file *f, sector_t *span) int (*swap_activate)(struct swap_info_struct *sis, struct file *f, sector_t *span)
int (*swap_deactivate)(struct file *); int (*swap_deactivate)(struct file *);
int (*swap_rw)(struct kiocb *iocb, struct iov_iter *iter); int (*swap_rw)(struct kiocb *iocb, struct iov_iter *iter);
@ -287,7 +287,7 @@ direct_IO:
migrate_folio: yes (both) migrate_folio: yes (both)
launder_folio: yes launder_folio: yes
is_partially_uptodate: yes is_partially_uptodate: yes
error_remove_page: yes error_remove_folio: yes
swap_activate: no swap_activate: no
swap_deactivate: no swap_deactivate: no
swap_rw: yes, unlocks swap_rw: yes, unlocks


@ -528,9 +528,9 @@ replaced by copy-on-write) part of the underlying shmem object out on swap.
does not take into account swapped out page of underlying shmem objects. does not take into account swapped out page of underlying shmem objects.
"Locked" indicates whether the mapping is locked in memory or not. "Locked" indicates whether the mapping is locked in memory or not.
"THPeligible" indicates whether the mapping is eligible for allocating THP "THPeligible" indicates whether the mapping is eligible for allocating
pages as well as the THP is PMD mappable or not - 1 if true, 0 otherwise. naturally aligned THP pages of any currently enabled size. 1 if true, 0
It just shows the current status. otherwise.
"VmFlags" field deserves a separate description. This member represents the "VmFlags" field deserves a separate description. This member represents the
kernel flags associated with the particular virtual memory area in two letter kernel flags associated with the particular virtual memory area in two letter


@ -823,7 +823,7 @@ cache in your filesystem. The following members are defined:
bool (*is_partially_uptodate) (struct folio *, size_t from, bool (*is_partially_uptodate) (struct folio *, size_t from,
size_t count); size_t count);
void (*is_dirty_writeback)(struct folio *, bool *, bool *); void (*is_dirty_writeback)(struct folio *, bool *, bool *);
int (*error_remove_page) (struct mapping *mapping, struct page *page); int (*error_remove_folio)(struct mapping *mapping, struct folio *);
int (*swap_activate)(struct swap_info_struct *sis, struct file *f, sector_t *span) int (*swap_activate)(struct swap_info_struct *sis, struct file *f, sector_t *span)
int (*swap_deactivate)(struct file *); int (*swap_deactivate)(struct file *);
int (*swap_rw)(struct kiocb *iocb, struct iov_iter *iter); int (*swap_rw)(struct kiocb *iocb, struct iov_iter *iter);
@ -1034,8 +1034,8 @@ cache in your filesystem. The following members are defined:
VM if a folio should be treated as dirty or writeback for the VM if a folio should be treated as dirty or writeback for the
purposes of stalling. purposes of stalling.
``error_remove_page`` ``error_remove_folio``
normally set to generic_error_remove_page if truncation is ok normally set to generic_error_remove_folio if truncation is ok
for this address space. Used for memory failure handling. for this address space. Used for memory failure handling.
Setting this implies you deal with pages going away under you, Setting this implies you deal with pages going away under you,
unless you have them locked or reference counts increased. unless you have them locked or reference counts increased.


@ -18,8 +18,6 @@ PTE Page Table Helpers
+---------------------------+--------------------------------------------------+ +---------------------------+--------------------------------------------------+
| pte_same | Tests whether both PTE entries are the same | | pte_same | Tests whether both PTE entries are the same |
+---------------------------+--------------------------------------------------+ +---------------------------+--------------------------------------------------+
| pte_bad | Tests a non-table mapped PTE |
+---------------------------+--------------------------------------------------+
| pte_present | Tests a valid mapped PTE | | pte_present | Tests a valid mapped PTE |
+---------------------------+--------------------------------------------------+ +---------------------------+--------------------------------------------------+
| pte_young | Tests a young PTE | | pte_young | Tests a young PTE |


@ -5,6 +5,18 @@ Design
====== ======
.. _damon_design_execution_model_and_data_structures:
Execution Model and Data Structures
===================================
The monitoring-related information including the monitoring request
specification and DAMON-based operation schemes are stored in a data structure
called DAMON ``context``. DAMON executes each context with a kernel thread
called ``kdamond``. Multiple kdamonds could run in parallel, for different
types of monitoring.
Overall Architecture Overall Architecture
==================== ====================
@ -346,6 +358,19 @@ the weight will be respected are up to the underlying prioritization mechanism
implementation. implementation.
.. _damon_design_damos_quotas_auto_tuning:
Aim-oriented Feedback-driven Auto-tuning
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Automatic feedback-driven quota tuning. Instead of setting the absolute quota
value, users can repeatedly provide numbers representing how much of their goal
for the scheme is achieved as feedback. DAMOS then automatically tunes the
aggressiveness (the quota) of the corresponding scheme. For example, if DAMOS
is under-achieving the goal, DAMOS automatically increases the quota. If DAMOS
is over-achieving the goal, it decreases the quota.
.. _damon_design_damos_watermarks: .. _damon_design_damos_watermarks:
Watermarks Watermarks
@ -477,15 +502,3 @@ modules for proactive reclamation and LRU lists manipulation are provided. For
more detail, please read the usage documents for those more detail, please read the usage documents for those
(:doc:`/admin-guide/mm/damon/reclaim` and (:doc:`/admin-guide/mm/damon/reclaim` and
:doc:`/admin-guide/mm/damon/lru_sort`). :doc:`/admin-guide/mm/damon/lru_sort`).
.. _damon_design_execution_model_and_data_structures:
Execution Model and Data Structures
===================================
The monitoring-related information including the monitoring request
specification and DAMON-based operation schemes are stored in a data structure
called DAMON ``context``. DAMON executes each context with a kernel thread
called ``kdamond``. Multiple kdamonds could run in parallel, for different
types of monitoring.


@ -117,7 +117,7 @@ pages:
- map/unmap of a PMD entry for the whole THP increment/decrement - map/unmap of a PMD entry for the whole THP increment/decrement
folio->_entire_mapcount and also increment/decrement folio->_entire_mapcount and also increment/decrement
folio->_nr_pages_mapped by COMPOUND_MAPPED when _entire_mapcount folio->_nr_pages_mapped by ENTIRELY_MAPPED when _entire_mapcount
goes from -1 to 0 or 0 to -1. goes from -1 to 0 or 0 to -1.
- map/unmap of individual pages with PTE entry increment/decrement - map/unmap of individual pages with PTE entry increment/decrement
@ -156,7 +156,7 @@ Partial unmap and deferred_split_folio()
Unmapping part of THP (with munmap() or other way) is not going to free Unmapping part of THP (with munmap() or other way) is not going to free
memory immediately. Instead, we detect that a subpage of THP is not in use memory immediately. Instead, we detect that a subpage of THP is not in use
in page_remove_rmap() and queue the THP for splitting if memory pressure in folio_remove_rmap_*() and queue the THP for splitting if memory pressure
comes. Splitting will free up unused subpages. comes. Splitting will free up unused subpages.
Splitting the page right away is not an option due to locking context in Splitting the page right away is not an option due to locking context in


@ -486,7 +486,7 @@ munlock the pages if we're removing the last VM_LOCKED VMA that maps the pages.
Before the unevictable/mlock changes, mlocking did not mark the pages in any Before the unevictable/mlock changes, mlocking did not mark the pages in any
way, so unmapping them required no processing. way, so unmapping them required no processing.
For each PTE (or PMD) being unmapped from a VMA, page_remove_rmap() calls For each PTE (or PMD) being unmapped from a VMA, folio_remove_rmap_*() calls
munlock_vma_folio(), which calls munlock_folio() when the VMA is VM_LOCKED munlock_vma_folio(), which calls munlock_folio() when the VMA is VM_LOCKED
(unless it was a PTE mapping of a part of a transparent huge page). (unless it was a PTE mapping of a part of a transparent huge page).
@ -511,7 +511,7 @@ userspace; truncation even unmaps and deletes any private anonymous pages
which had been Copied-On-Write from the file pages now being truncated. which had been Copied-On-Write from the file pages now being truncated.
Mlocked pages can be munlocked and deleted in this way: like with munmap(), Mlocked pages can be munlocked and deleted in this way: like with munmap(),
for each PTE (or PMD) being unmapped from a VMA, page_remove_rmap() calls for each PTE (or PMD) being unmapped from a VMA, folio_remove_rmap_*() calls
munlock_vma_folio(), which calls munlock_folio() when the VMA is VM_LOCKED munlock_vma_folio(), which calls munlock_folio() when the VMA is VM_LOCKED
(unless it was a PTE mapping of a part of a transparent huge page). (unless it was a PTE mapping of a part of a transparent huge page).


@ -263,20 +263,20 @@ the name indicates, this function allocates pages of memory, and the second
argument is "order" or a power of two number of pages, that is argument is "order" or a power of two number of pages, that is
(for PAGE_SIZE == 4096) order=0 ==> 4096 bytes, order=1 ==> 8192 bytes, (for PAGE_SIZE == 4096) order=0 ==> 4096 bytes, order=1 ==> 8192 bytes,
order=2 ==> 16384 bytes, etc. The maximum size of a order=2 ==> 16384 bytes, etc. The maximum size of a
region allocated by __get_free_pages is determined by the MAX_ORDER macro. More region allocated by __get_free_pages is determined by the MAX_PAGE_ORDER macro.
precisely the limit can be calculated as:: More precisely the limit can be calculated as::
PAGE_SIZE << MAX_ORDER PAGE_SIZE << MAX_PAGE_ORDER
In a i386 architecture PAGE_SIZE is 4096 bytes In a i386 architecture PAGE_SIZE is 4096 bytes
In a 2.4/i386 kernel MAX_ORDER is 10 In a 2.4/i386 kernel MAX_PAGE_ORDER is 10
In a 2.6/i386 kernel MAX_ORDER is 11 In a 2.6/i386 kernel MAX_PAGE_ORDER is 11
So get_free_pages can allocate as much as 4MB or 8MB in a 2.4/2.6 kernel So get_free_pages can allocate as much as 4MB or 8MB in a 2.4/2.6 kernel
respectively, with an i386 architecture. respectively, with an i386 architecture.
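The limit formula can be checked with shell arithmetic. This is a sketch with a made-up function name, ``max_contig_bytes``; the page size and order are passed in by the caller rather than read from kernel headers.

```shell
# Upper bound in bytes of one physically contiguous allocation,
# computed as PAGE_SIZE << MAX_PAGE_ORDER.
#   $1: PAGE_SIZE in bytes
#   $2: MAX_PAGE_ORDER
max_contig_bytes() {
    echo $(( $1 << $2 ))
}
```

``max_contig_bytes 4096 10`` prints 4194304 (4MB) and ``max_contig_bytes 4096 11`` prints 8388608 (8MB), matching the 2.4 and 2.6 figures above.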
User space programs can include /usr/include/sys/user.h and User space programs can include /usr/include/sys/user.h and
/usr/include/linux/mmzone.h to get PAGE_SIZE MAX_ORDER declarations. /usr/include/linux/mmzone.h to get PAGE_SIZE MAX_PAGE_ORDER declarations.
The pagesize can also be determined dynamically with the getpagesize (2) The pagesize can also be determined dynamically with the getpagesize (2)
system call. system call.
@ -324,7 +324,7 @@ Definitions:
(see /proc/slabinfo) (see /proc/slabinfo)
<pointer size> depends on the architecture -- ``sizeof(void *)`` <pointer size> depends on the architecture -- ``sizeof(void *)``
<page size> depends on the architecture -- PAGE_SIZE or getpagesize (2) <page size> depends on the architecture -- PAGE_SIZE or getpagesize (2)
<max-order> is the value defined with MAX_ORDER <max-order> is the value defined with MAX_PAGE_ORDER
<frame size> it's an upper bound of frame's capture size (more on this later) <frame size> it's an upper bound of frame's capture size (more on this later)
============== ================================================================ ============== ================================================================


@ -5339,6 +5339,7 @@ L: linux-mm@kvack.org
S: Maintained S: Maintained
F: mm/memcontrol.c F: mm/memcontrol.c
F: mm/swap_cgroup.c F: mm/swap_cgroup.c
F: samples/cgroup/*
F: tools/testing/selftests/cgroup/memcg_protection.m F: tools/testing/selftests/cgroup/memcg_protection.m
F: tools/testing/selftests/cgroup/test_hugetlb_memcg.c F: tools/testing/selftests/cgroup/test_hugetlb_memcg.c
F: tools/testing/selftests/cgroup/test_kmem.c F: tools/testing/selftests/cgroup/test_kmem.c


@ -1470,6 +1470,14 @@ config DYNAMIC_SIGFRAME
config HAVE_ARCH_NODE_DEV_GROUP config HAVE_ARCH_NODE_DEV_GROUP
bool bool
config ARCH_HAS_HW_PTE_YOUNG
bool
help
Architectures that select this option are capable of setting the
accessed bit in PTE entries when using them as part of linear address
translations. Architectures that require a runtime check should select
this option and override arch_has_hw_pte_young().
config ARCH_HAS_NONLEAF_PMD_YOUNG config ARCH_HAS_NONLEAF_PMD_YOUNG
bool bool
help help


@ -1362,7 +1362,7 @@ config ARCH_FORCE_MAX_ORDER
default "10" default "10"
help help
The kernel page allocator limits the size of maximal physically The kernel page allocator limits the size of maximal physically
contiguous allocations. The limit is called MAX_ORDER and it contiguous allocations. The limit is called MAX_PAGE_ORDER and it
defines the maximal power of two of number of pages that can be defines the maximal power of two of number of pages that can be
allocated as a single contiguous block. This option allows allocated as a single contiguous block. This option allows
overriding the default setting when ability to allocate very overriding the default setting when ability to allocate very

View File

@ -36,6 +36,7 @@ config ARM64
select ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE select ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE
select ARCH_HAS_PTE_DEVMAP select ARCH_HAS_PTE_DEVMAP
select ARCH_HAS_PTE_SPECIAL select ARCH_HAS_PTE_SPECIAL
select ARCH_HAS_HW_PTE_YOUNG
select ARCH_HAS_SETUP_DMA_OPS select ARCH_HAS_SETUP_DMA_OPS
select ARCH_HAS_SET_DIRECT_MAP select ARCH_HAS_SET_DIRECT_MAP
select ARCH_HAS_SET_MEMORY select ARCH_HAS_SET_MEMORY
@ -1519,15 +1520,15 @@ config XEN
# include/linux/mmzone.h requires the following to be true: # include/linux/mmzone.h requires the following to be true:
# #
# MAX_ORDER + PAGE_SHIFT <= SECTION_SIZE_BITS # MAX_PAGE_ORDER + PAGE_SHIFT <= SECTION_SIZE_BITS
# #
# so the maximum value of MAX_ORDER is SECTION_SIZE_BITS - PAGE_SHIFT: # so the maximum value of MAX_PAGE_ORDER is SECTION_SIZE_BITS - PAGE_SHIFT:
# #
# | SECTION_SIZE_BITS | PAGE_SHIFT | max MAX_ORDER | default MAX_ORDER | # | SECTION_SIZE_BITS | PAGE_SHIFT | max MAX_PAGE_ORDER | default MAX_PAGE_ORDER |
# ----+-------------------+--------------+-----------------+--------------------+ # ----+-------------------+--------------+----------------------+-------------------------+
# 4K | 27 | 12 | 15 | 10 | # 4K | 27 | 12 | 15 | 10 |
# 16K | 27 | 14 | 13 | 11 | # 16K | 27 | 14 | 13 | 11 |
# 64K | 29 | 16 | 13 | 13 | # 64K | 29 | 16 | 13 | 13 |
config ARCH_FORCE_MAX_ORDER config ARCH_FORCE_MAX_ORDER
int int
default "13" if ARM64_64K_PAGES default "13" if ARM64_64K_PAGES
@ -1535,16 +1536,16 @@ config ARCH_FORCE_MAX_ORDER
default "10" default "10"
help help
The kernel page allocator limits the size of maximal physically The kernel page allocator limits the size of maximal physically
contiguous allocations. The limit is called MAX_ORDER and it contiguous allocations. The limit is called MAX_PAGE_ORDER and it
defines the maximal power of two of number of pages that can be defines the maximal power of two of number of pages that can be
allocated as a single contiguous block. This option allows allocated as a single contiguous block. This option allows
overriding the default setting when ability to allocate very overriding the default setting when ability to allocate very
large blocks of physically contiguous memory is required. large blocks of physically contiguous memory is required.
The maximal size of allocation cannot exceed the size of the The maximal size of allocation cannot exceed the size of the
section, so the value of MAX_ORDER should satisfy section, so the value of MAX_PAGE_ORDER should satisfy
MAX_ORDER + PAGE_SHIFT <= SECTION_SIZE_BITS MAX_PAGE_ORDER + PAGE_SHIFT <= SECTION_SIZE_BITS
Don't change if unsure. Don't change if unsure.
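
The constraint in the help text above is plain arithmetic and can be checked outside the kernel. The following is a minimal standalone sketch, not kernel code: the helper names `max_page_order_limit` and `max_page_order_is_valid` are hypothetical, and the values used to exercise them come from the table in the arm64 Kconfig comment.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper: the largest MAX_PAGE_ORDER permitted by
 *   MAX_PAGE_ORDER + PAGE_SHIFT <= SECTION_SIZE_BITS
 */
static inline int max_page_order_limit(int section_size_bits, int page_shift)
{
	return section_size_bits - page_shift;
}

/* Hypothetical helper: does a chosen order satisfy the constraint? */
static inline bool max_page_order_is_valid(int order, int section_size_bits,
					   int page_shift)
{
	return order + page_shift <= section_size_bits;
}
```

With the table's 4K row (SECTION_SIZE_BITS 27, PAGE_SHIFT 12) the limit is 15, so the default of 10 is valid; with the 64K row (29, 16) the limit and the default are both 13.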

View File

@ -15,29 +15,9 @@
#if defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS) #if defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS)
void kasan_init(void);
/*
* KASAN_SHADOW_START: beginning of the kernel virtual addresses.
* KASAN_SHADOW_END: KASAN_SHADOW_START + 1/N of kernel virtual addresses,
* where N = (1 << KASAN_SHADOW_SCALE_SHIFT).
*
* KASAN_SHADOW_OFFSET:
* This value is used to map an address to the corresponding shadow
* address by the following formula:
* shadow_addr = (address >> KASAN_SHADOW_SCALE_SHIFT) + KASAN_SHADOW_OFFSET
*
* (1 << (64 - KASAN_SHADOW_SCALE_SHIFT)) shadow addresses that lie in range
* [KASAN_SHADOW_OFFSET, KASAN_SHADOW_END) cover all 64-bits of virtual
* addresses. So KASAN_SHADOW_OFFSET should satisfy the following equation:
* KASAN_SHADOW_OFFSET = KASAN_SHADOW_END -
* (1ULL << (64 - KASAN_SHADOW_SCALE_SHIFT))
*/
#define _KASAN_SHADOW_START(va) (KASAN_SHADOW_END - (1UL << ((va) - KASAN_SHADOW_SCALE_SHIFT)))
#define KASAN_SHADOW_START _KASAN_SHADOW_START(vabits_actual)
void kasan_copy_shadow(pgd_t *pgdir);
asmlinkage void kasan_early_init(void); asmlinkage void kasan_early_init(void);
void kasan_init(void);
void kasan_copy_shadow(pgd_t *pgdir);
#else #else
static inline void kasan_init(void) { } static inline void kasan_init(void) { }

View File

@ -65,15 +65,41 @@
#define KERNEL_END _end #define KERNEL_END _end
/* /*
* Generic and tag-based KASAN require 1/8th and 1/16th of the kernel virtual * Generic and Software Tag-Based KASAN modes require 1/8th and 1/16th of the
* address space for the shadow region respectively. They can bloat the stack * kernel virtual address space for storing the shadow memory respectively.
* significantly, so double the (minimum) stack size when they are in use. *
* The mapping between a virtual memory address and its corresponding shadow
* memory address is defined based on the formula:
*
* shadow_addr = (addr >> KASAN_SHADOW_SCALE_SHIFT) + KASAN_SHADOW_OFFSET
*
* where KASAN_SHADOW_SCALE_SHIFT is the order of the number of bits that map
* to a single shadow byte and KASAN_SHADOW_OFFSET is a constant that offsets
* the mapping. Note that KASAN_SHADOW_OFFSET does not point to the start of
* the shadow memory region.
*
* Based on this mapping, we define two constants:
*
* KASAN_SHADOW_START: the start of the shadow memory region;
* KASAN_SHADOW_END: the end of the shadow memory region.
*
* KASAN_SHADOW_END is defined first as the shadow address that corresponds to
* the upper bound of possible virtual kernel memory addresses UL(1) << 64
* according to the mapping formula.
*
* KASAN_SHADOW_START is defined second based on KASAN_SHADOW_END. The shadow
* memory start must map to the lowest possible kernel virtual memory address
* and thus it depends on the actual bitness of the address space.
*
* As KASAN inserts redzones between stack variables, this increases the stack
* memory usage significantly. Thus, we double the (minimum) stack size.
*/ */
#if defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS) #if defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS)
#define KASAN_SHADOW_OFFSET _AC(CONFIG_KASAN_SHADOW_OFFSET, UL) #define KASAN_SHADOW_OFFSET _AC(CONFIG_KASAN_SHADOW_OFFSET, UL)
#define KASAN_SHADOW_END ((UL(1) << (64 - KASAN_SHADOW_SCALE_SHIFT)) \ #define KASAN_SHADOW_END ((UL(1) << (64 - KASAN_SHADOW_SCALE_SHIFT)) + KASAN_SHADOW_OFFSET)
+ KASAN_SHADOW_OFFSET) #define _KASAN_SHADOW_START(va) (KASAN_SHADOW_END - (UL(1) << ((va) - KASAN_SHADOW_SCALE_SHIFT)))
#define PAGE_END (KASAN_SHADOW_END - (1UL << (vabits_actual - KASAN_SHADOW_SCALE_SHIFT))) #define KASAN_SHADOW_START _KASAN_SHADOW_START(vabits_actual)
#define PAGE_END KASAN_SHADOW_START
#define KASAN_THREAD_SHIFT 1 #define KASAN_THREAD_SHIFT 1
#else #else
#define KASAN_THREAD_SHIFT 0 #define KASAN_THREAD_SHIFT 0
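
The mapping described in the comment above can be tried out in user space. This is a sketch under stated assumptions rather than the kernel's code: the function names are hypothetical, the scale shift of 3 corresponds to Generic KASAN (8 bytes of memory per shadow byte), and the offset used in the usage note is only an example value chosen for a 48-bit address space.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed for illustration: Generic KASAN, 8 bytes of memory per shadow byte. */
#define SKETCH_SHADOW_SCALE_SHIFT 3

/* shadow_addr = (addr >> KASAN_SHADOW_SCALE_SHIFT) + KASAN_SHADOW_OFFSET */
static inline uint64_t sketch_shadow_addr(uint64_t addr, uint64_t offset)
{
	return (addr >> SKETCH_SHADOW_SCALE_SHIFT) + offset;
}

/* Shadow address corresponding to the upper bound 1 << 64 of the address space. */
static inline uint64_t sketch_shadow_end(uint64_t offset)
{
	return (UINT64_C(1) << (64 - SKETCH_SHADOW_SCALE_SHIFT)) + offset;
}

/* Shadow start for an address space that is va_bits wide. */
static inline uint64_t sketch_shadow_start(unsigned int va_bits, uint64_t offset)
{
	return sketch_shadow_end(offset) -
	       (UINT64_C(1) << (va_bits - SKETCH_SHADOW_SCALE_SHIFT));
}
```

Assuming an offset of 0xdfff800000000000 and 48 VA bits, the shadow end works out to 0xffff800000000000 and the shadow start to 0xffff600000000000, which is exactly where the lowest kernel address 0xffff000000000000 maps. This illustrates why the start depends on the actual VA width while the end does not.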

View File

@ -10,7 +10,7 @@
/* /*
* Section size must be at least 512MB for 64K base * Section size must be at least 512MB for 64K base
* page size config. Otherwise it will be less than * page size config. Otherwise it will be less than
* MAX_ORDER and the build process will fail. * MAX_PAGE_ORDER and the build process will fail.
*/ */
#ifdef CONFIG_ARM64_64K_PAGES #ifdef CONFIG_ARM64_64K_PAGES
#define SECTION_SIZE_BITS 29 #define SECTION_SIZE_BITS 29

View File

@ -16,7 +16,7 @@ struct hyp_pool {
* API at EL2. * API at EL2.
*/ */
hyp_spinlock_t lock; hyp_spinlock_t lock;
struct list_head free_area[MAX_ORDER + 1]; struct list_head free_area[NR_PAGE_ORDERS];
phys_addr_t range_start; phys_addr_t range_start;
phys_addr_t range_end; phys_addr_t range_end;
unsigned short max_order; unsigned short max_order;

View File

@ -228,7 +228,8 @@ int hyp_pool_init(struct hyp_pool *pool, u64 pfn, unsigned int nr_pages,
int i; int i;
hyp_spin_lock_init(&pool->lock); hyp_spin_lock_init(&pool->lock);
pool->max_order = min(MAX_ORDER, get_order(nr_pages << PAGE_SHIFT)); pool->max_order = min(MAX_PAGE_ORDER,
get_order(nr_pages << PAGE_SHIFT));
for (i = 0; i <= pool->max_order; i++) for (i = 0; i <= pool->max_order; i++)
INIT_LIST_HEAD(&pool->free_area[i]); INIT_LIST_HEAD(&pool->free_area[i]);
pool->range_start = phys; pool->range_start = phys;
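
The two hunks above replace `MAX_ORDER + 1` with `NR_PAGE_ORDERS` for the array size while keeping the inclusive loop bound. A small sketch of why the two fit together, assuming `NR_PAGE_ORDERS` equals `MAX_PAGE_ORDER + 1` as the replacement implies; the struct, the constants and `sketch_pool_init` are hypothetical stand-ins, not the hypervisor's types.

```c
#include <assert.h>

/* Assumed values for illustration only. */
#define SKETCH_MAX_PAGE_ORDER 10
#define SKETCH_NR_PAGE_ORDERS (SKETCH_MAX_PAGE_ORDER + 1)

struct sketch_pool {
	int free_area[SKETCH_NR_PAGE_ORDERS];	/* one slot per order 0..MAX */
	unsigned short max_order;
};

/*
 * Clamp the requested order to the maximum and initialise one slot per
 * order, inclusive of max_order. Returns the number of slots touched.
 */
static int sketch_pool_init(struct sketch_pool *pool, unsigned int wanted_order)
{
	unsigned int i;
	int touched = 0;

	pool->max_order = wanted_order < SKETCH_MAX_PAGE_ORDER ?
			  wanted_order : SKETCH_MAX_PAGE_ORDER;
	for (i = 0; i <= pool->max_order; i++) {
		pool->free_area[i] = 0;
		touched++;
	}
	return touched;
}
```

Because the loop runs up to and including `max_order`, the array needs `MAX_PAGE_ORDER + 1` entries; naming that count avoids repeating the `+ 1` at each use.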

View File

@ -51,7 +51,7 @@ void __init arm64_hugetlb_cma_reserve(void)
* page allocator. Just warn if there is any change * page allocator. Just warn if there is any change
* breaking this assumption. * breaking this assumption.
*/ */
WARN_ON(order <= MAX_ORDER); WARN_ON(order <= MAX_PAGE_ORDER);
hugetlb_cma_reserve(order); hugetlb_cma_reserve(order);
} }
#endif /* CONFIG_CMA */ #endif /* CONFIG_CMA */

View File

@ -170,6 +170,11 @@ asmlinkage void __init kasan_early_init(void)
{ {
BUILD_BUG_ON(KASAN_SHADOW_OFFSET != BUILD_BUG_ON(KASAN_SHADOW_OFFSET !=
KASAN_SHADOW_END - (1UL << (64 - KASAN_SHADOW_SCALE_SHIFT))); KASAN_SHADOW_END - (1UL << (64 - KASAN_SHADOW_SCALE_SHIFT)));
/*
* We cannot check the actual value of KASAN_SHADOW_START during build,
* as it depends on vabits_actual. As a best-effort approach, check
* potential values calculated based on VA_BITS and VA_BITS_MIN.
*/
BUILD_BUG_ON(!IS_ALIGNED(_KASAN_SHADOW_START(VA_BITS), PGDIR_SIZE)); BUILD_BUG_ON(!IS_ALIGNED(_KASAN_SHADOW_START(VA_BITS), PGDIR_SIZE));
BUILD_BUG_ON(!IS_ALIGNED(_KASAN_SHADOW_START(VA_BITS_MIN), PGDIR_SIZE)); BUILD_BUG_ON(!IS_ALIGNED(_KASAN_SHADOW_START(VA_BITS_MIN), PGDIR_SIZE));
BUILD_BUG_ON(!IS_ALIGNED(KASAN_SHADOW_END, PGDIR_SIZE)); BUILD_BUG_ON(!IS_ALIGNED(KASAN_SHADOW_END, PGDIR_SIZE));

View File

@ -523,6 +523,7 @@ static inline pmd_t pmd_wrprotect(pmd_t pmd)
return pmd; return pmd;
} }
#define pmd_dirty pmd_dirty
static inline int pmd_dirty(pmd_t pmd) static inline int pmd_dirty(pmd_t pmd)
{ {
return !!(pmd_val(pmd) & (_PAGE_DIRTY | _PAGE_MODIFIED)); return !!(pmd_val(pmd) & (_PAGE_DIRTY | _PAGE_MODIFIED));
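
The `#define pmd_dirty pmd_dirty` line added here (and in the other architectures below) makes the function name visible to the preprocessor, presumably so that generic code can supply a fallback under `#ifndef pmd_dirty`. The sketch below shows the idiom in isolation; the type, the bit value and the fallback body are assumptions for illustration and are not taken from the kernel headers.

```c
#include <assert.h>

/* Hypothetical stand-ins for the real page table types and bits. */
typedef struct { unsigned long val; } sketch_pmd_t;
#define SKETCH_DIRTY_BIT 0x40UL

/* "Architecture" side: define the helper and announce it to the preprocessor. */
#define sketch_pmd_dirty sketch_pmd_dirty
static inline int sketch_pmd_dirty(sketch_pmd_t pmd)
{
	return !!(pmd.val & SKETCH_DIRTY_BIT);
}

/* "Generic" side: only compiled when no architecture version exists. */
#ifndef sketch_pmd_dirty
static inline int sketch_pmd_dirty(sketch_pmd_t pmd)
{
	(void)pmd;
	return 0;
}
#endif
```

A macro that expands to its own name is not expanded again, so the function definition and its callers are unaffected; only `#ifdef`/`#ifndef` tests see a difference.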

View File

@ -226,32 +226,6 @@ static void __init node_mem_init(unsigned int node)
#ifdef CONFIG_ACPI_NUMA #ifdef CONFIG_ACPI_NUMA
/*
* Sanity check to catch more bad NUMA configurations (they are amazingly
* common). Make sure the nodes cover all memory.
*/
static bool __init numa_meminfo_cover_memory(const struct numa_meminfo *mi)
{
int i;
u64 numaram, biosram;
numaram = 0;
for (i = 0; i < mi->nr_blks; i++) {
u64 s = mi->blk[i].start >> PAGE_SHIFT;
u64 e = mi->blk[i].end >> PAGE_SHIFT;
numaram += e - s;
numaram -= __absent_pages_in_range(mi->blk[i].nid, s, e);
if ((s64)numaram < 0)
numaram = 0;
}
max_pfn = max_low_pfn;
biosram = max_pfn - absent_pages_in_range(0, max_pfn);
BUG_ON((s64)(biosram - numaram) >= (1 << (20 - PAGE_SHIFT)));
return true;
}
static void __init add_node_intersection(u32 node, u64 start, u64 size, u32 type) static void __init add_node_intersection(u32 node, u64 start, u64 size, u32 type)
{ {
static unsigned long num_physpages; static unsigned long num_physpages;
@ -396,7 +370,7 @@ int __init init_numa_memory(void)
return -EINVAL; return -EINVAL;
init_node_memblock(); init_node_memblock();
if (numa_meminfo_cover_memory(&numa_meminfo) == false) if (!memblock_validate_numa_coverage(SZ_1M))
return -EINVAL; return -EINVAL;
for_each_node_mask(node, node_possible_map) { for_each_node_mask(node, node_possible_map) {

View File

@ -402,7 +402,7 @@ config ARCH_FORCE_MAX_ORDER
default "10" default "10"
help help
The kernel page allocator limits the size of maximal physically The kernel page allocator limits the size of maximal physically
contiguous allocations. The limit is called MAX_ORDER and it contiguous allocations. The limit is called MAX_PAGE_ORDER and it
defines the maximal power of two of number of pages that can be defines the maximal power of two of number of pages that can be
allocated as a single contiguous block. This option allows allocated as a single contiguous block. This option allows
overriding the default setting when ability to allocate very overriding the default setting when ability to allocate very

View File

@ -655,6 +655,7 @@ static inline pmd_t pmd_mkwrite_novma(pmd_t pmd)
return pmd; return pmd;
} }
#define pmd_dirty pmd_dirty
static inline int pmd_dirty(pmd_t pmd) static inline int pmd_dirty(pmd_t pmd)
{ {
return !!(pmd_val(pmd) & _PAGE_MODIFIED); return !!(pmd_val(pmd) & _PAGE_MODIFIED);

View File

@ -50,7 +50,7 @@ config ARCH_FORCE_MAX_ORDER
default "10" default "10"
help help
The kernel page allocator limits the size of maximal physically The kernel page allocator limits the size of maximal physically
contiguous allocations. The limit is called MAX_ORDER and it contiguous allocations. The limit is called MAX_PAGE_ORDER and it
defines the maximal power of two of number of pages that can be defines the maximal power of two of number of pages that can be
allocated as a single contiguous block. This option allows allocated as a single contiguous block. This option allows
overriding the default setting when ability to allocate very overriding the default setting when ability to allocate very

View File

@ -916,7 +916,7 @@ config ARCH_FORCE_MAX_ORDER
default "10" default "10"
help help
The kernel page allocator limits the size of maximal physically The kernel page allocator limits the size of maximal physically
contiguous allocations. The limit is called MAX_ORDER and it contiguous allocations. The limit is called MAX_PAGE_ORDER and it
defines the maximal power of two of number of pages that can be defines the maximal power of two of number of pages that can be
allocated as a single contiguous block. This option allows allocated as a single contiguous block. This option allows
overriding the default setting when ability to allocate very overriding the default setting when ability to allocate very

View File

@ -97,7 +97,7 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua,
} }
mmap_read_lock(mm); mmap_read_lock(mm);
chunk = (1UL << (PAGE_SHIFT + MAX_ORDER)) / chunk = (1UL << (PAGE_SHIFT + MAX_PAGE_ORDER)) /
sizeof(struct vm_area_struct *); sizeof(struct vm_area_struct *);
chunk = min(chunk, entries); chunk = min(chunk, entries);
for (entry = 0; entry < entries; entry += chunk) { for (entry = 0; entry < entries; entry += chunk) {

View File

@ -615,7 +615,7 @@ void __init gigantic_hugetlb_cma_reserve(void)
order = mmu_psize_to_shift(MMU_PAGE_16G) - PAGE_SHIFT; order = mmu_psize_to_shift(MMU_PAGE_16G) - PAGE_SHIFT;
if (order) { if (order) {
VM_WARN_ON(order <= MAX_ORDER); VM_WARN_ON(order <= MAX_PAGE_ORDER);
hugetlb_cma_reserve(order); hugetlb_cma_reserve(order);
} }
} }

View File

@ -1389,7 +1389,7 @@ static long pnv_pci_ioda2_setup_default_config(struct pnv_ioda_pe *pe)
* DMA window can be larger than available memory, which will * DMA window can be larger than available memory, which will
* cause errors later. * cause errors later.
*/ */
const u64 maxblock = 1UL << (PAGE_SHIFT + MAX_ORDER); const u64 maxblock = 1UL << (PAGE_SHIFT + MAX_PAGE_ORDER);
/* /*
* We create the default window as big as we can. The constraint is * We create the default window as big as we can. The constraint is

View File

@ -673,6 +673,7 @@ static inline int pmd_write(pmd_t pmd)
return pte_write(pmd_pte(pmd)); return pte_write(pmd_pte(pmd));
} }
#define pmd_dirty pmd_dirty
static inline int pmd_dirty(pmd_t pmd) static inline int pmd_dirty(pmd_t pmd)
{ {
return pte_dirty(pmd_pte(pmd)); return pte_dirty(pmd_pte(pmd));

View File

@ -770,6 +770,7 @@ static inline int pud_write(pud_t pud)
return (pud_val(pud) & _REGION3_ENTRY_WRITE) != 0; return (pud_val(pud) & _REGION3_ENTRY_WRITE) != 0;
} }
#define pmd_dirty pmd_dirty
static inline int pmd_dirty(pmd_t pmd) static inline int pmd_dirty(pmd_t pmd)
{ {
return (pmd_val(pmd) & _SEGMENT_ENTRY_DIRTY) != 0; return (pmd_val(pmd) & _SEGMENT_ENTRY_DIRTY) != 0;

View File

@ -26,7 +26,7 @@ config ARCH_FORCE_MAX_ORDER
default "10" default "10"
help help
The kernel page allocator limits the size of maximal physically The kernel page allocator limits the size of maximal physically
contiguous allocations. The limit is called MAX_ORDER and it contiguous allocations. The limit is called MAX_PAGE_ORDER and it
defines the maximal power of two of number of pages that can be defines the maximal power of two of number of pages that can be
allocated as a single contiguous block. This option allows allocated as a single contiguous block. This option allows
overriding the default setting when ability to allocate very overriding the default setting when ability to allocate very

View File

@ -277,7 +277,7 @@ config ARCH_FORCE_MAX_ORDER
default "12" default "12"
help help
The kernel page allocator limits the size of maximal physically The kernel page allocator limits the size of maximal physically
contiguous allocations. The limit is called MAX_ORDER and it contiguous allocations. The limit is called MAX_PAGE_ORDER and it
defines the maximal power of two of number of pages that can be defines the maximal power of two of number of pages that can be
allocated as a single contiguous block. This option allows allocated as a single contiguous block. This option allows
overriding the default setting when ability to allocate very overriding the default setting when ability to allocate very

View File

@ -706,6 +706,7 @@ static inline unsigned long pmd_write(pmd_t pmd)
#define pud_write(pud) pte_write(__pte(pud_val(pud))) #define pud_write(pud) pte_write(__pte(pud_val(pud)))
#ifdef CONFIG_TRANSPARENT_HUGEPAGE #ifdef CONFIG_TRANSPARENT_HUGEPAGE
#define pmd_dirty pmd_dirty
static inline unsigned long pmd_dirty(pmd_t pmd) static inline unsigned long pmd_dirty(pmd_t pmd)
{ {
pte_t pte = __pte(pmd_val(pmd)); pte_t pte = __pte(pmd_val(pmd));

View File

@ -194,7 +194,7 @@ static void *dma_4v_alloc_coherent(struct device *dev, size_t size,
size = IO_PAGE_ALIGN(size); size = IO_PAGE_ALIGN(size);
order = get_order(size); order = get_order(size);
if (unlikely(order > MAX_ORDER)) if (unlikely(order > MAX_PAGE_ORDER))
return NULL; return NULL;
npages = size >> IO_PAGE_SHIFT; npages = size >> IO_PAGE_SHIFT;

View File

@ -897,7 +897,7 @@ void __init cheetah_ecache_flush_init(void)
/* Now allocate error trap reporting scoreboard. */ /* Now allocate error trap reporting scoreboard. */
sz = NR_CPUS * (2 * sizeof(struct cheetah_err_info)); sz = NR_CPUS * (2 * sizeof(struct cheetah_err_info));
for (order = 0; order <= MAX_ORDER; order++) { for (order = 0; order < NR_PAGE_ORDERS; order++) {
if ((PAGE_SIZE << order) >= sz) if ((PAGE_SIZE << order) >= sz)
break; break;
} }
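
The loop above searches for the smallest order whose block is large enough, and the new bound `order < NR_PAGE_ORDERS` visits the same orders as the old `order <= MAX_ORDER`. A standalone sketch of that search follows; the page size and maximum order are assumed values and `smallest_order_for` is a hypothetical name.

```c
#include <assert.h>

/* Assumed values for illustration only. */
#define SKETCH_PAGE_SIZE	4096UL
#define SKETCH_MAX_PAGE_ORDER	10
#define SKETCH_NR_PAGE_ORDERS	(SKETCH_MAX_PAGE_ORDER + 1)

/*
 * Return the smallest order such that (PAGE_SIZE << order) >= sz, or
 * SKETCH_NR_PAGE_ORDERS if even the largest block is too small.
 */
static unsigned int smallest_order_for(unsigned long sz)
{
	unsigned int order;

	for (order = 0; order < SKETCH_NR_PAGE_ORDERS; order++) {
		if ((SKETCH_PAGE_SIZE << order) >= sz)
			break;
	}
	return order;
}
```

A caller can tell that nothing fits by comparing the result against `SKETCH_NR_PAGE_ORDERS`, which is the value the loop variable holds when the loop runs to completion.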

View File

@ -402,8 +402,8 @@ void tsb_grow(struct mm_struct *mm, unsigned long tsb_index, unsigned long rss)
unsigned long new_rss_limit; unsigned long new_rss_limit;
gfp_t gfp_flags; gfp_t gfp_flags;
if (max_tsb_size > PAGE_SIZE << MAX_ORDER) if (max_tsb_size > PAGE_SIZE << MAX_PAGE_ORDER)
max_tsb_size = PAGE_SIZE << MAX_ORDER; max_tsb_size = PAGE_SIZE << MAX_PAGE_ORDER;
new_cache_index = 0; new_cache_index = 0;
for (new_size = 8192; new_size < max_tsb_size; new_size <<= 1UL) { for (new_size = 8192; new_size < max_tsb_size; new_size <<= 1UL) {

View File

@ -373,10 +373,10 @@ int __init linux_main(int argc, char **argv)
max_physmem = TASK_SIZE - uml_physmem - iomem_size - MIN_VMALLOC; max_physmem = TASK_SIZE - uml_physmem - iomem_size - MIN_VMALLOC;
/* /*
* Zones have to begin on a 1 << MAX_ORDER page boundary, * Zones have to begin on a 1 << MAX_PAGE_ORDER page boundary,
* so this makes sure that's true for highmem * so this makes sure that's true for highmem
*/ */
max_physmem &= ~((1 << (PAGE_SHIFT + MAX_ORDER)) - 1); max_physmem &= ~((1 << (PAGE_SHIFT + MAX_PAGE_ORDER)) - 1);
if (physmem_size + iomem_size > max_physmem) { if (physmem_size + iomem_size > max_physmem) {
highmem = physmem_size + iomem_size - max_physmem; highmem = physmem_size + iomem_size - max_physmem;
physmem_size -= highmem; physmem_size -= highmem;
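
The masking step in this hunk rounds `max_physmem` down to a multiple of the largest buddy block. A minimal sketch of the same arithmetic, assuming a 4K page (shift 12) and a maximum order of 10; the names are hypothetical and the sketch uses `1UL` so the shift is done in `unsigned long`.

```c
#include <assert.h>

/* Assumed values for illustration only. */
#define SKETCH_PAGE_SHIFT	12
#define SKETCH_MAX_PAGE_ORDER	10

/* Round addr down to a (1 << MAX_PAGE_ORDER)-page boundary. */
static unsigned long align_down_to_max_block(unsigned long addr)
{
	unsigned long block = 1UL << (SKETCH_PAGE_SHIFT + SKETCH_MAX_PAGE_ORDER);

	return addr & ~(block - 1);
}
```

With these values the block is 4 MiB, so for example 0x12345678 is rounded down to 0x12000000.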

View File

@ -88,6 +88,7 @@ config X86
select ARCH_HAS_PMEM_API if X86_64 select ARCH_HAS_PMEM_API if X86_64
select ARCH_HAS_PTE_DEVMAP if X86_64 select ARCH_HAS_PTE_DEVMAP if X86_64
select ARCH_HAS_PTE_SPECIAL select ARCH_HAS_PTE_SPECIAL
select ARCH_HAS_HW_PTE_YOUNG
select ARCH_HAS_NONLEAF_PMD_YOUNG if PGTABLE_LEVELS > 2 select ARCH_HAS_NONLEAF_PMD_YOUNG if PGTABLE_LEVELS > 2
select ARCH_HAS_UACCESS_FLUSHCACHE if X86_64 select ARCH_HAS_UACCESS_FLUSHCACHE if X86_64
select ARCH_HAS_COPY_MC if X86_64 select ARCH_HAS_COPY_MC if X86_64

View File

@ -141,6 +141,7 @@ static inline int pte_young(pte_t pte)
return pte_flags(pte) & _PAGE_ACCESSED; return pte_flags(pte) & _PAGE_ACCESSED;
} }
#define pmd_dirty pmd_dirty
static inline bool pmd_dirty(pmd_t pmd) static inline bool pmd_dirty(pmd_t pmd)
{ {
return pmd_flags(pmd) & _PAGE_DIRTY_BITS; return pmd_flags(pmd) & _PAGE_DIRTY_BITS;
@ -1679,12 +1680,6 @@ static inline bool arch_has_pfn_modify_check(void)
return boot_cpu_has_bug(X86_BUG_L1TF); return boot_cpu_has_bug(X86_BUG_L1TF);
} }
#define arch_has_hw_pte_young arch_has_hw_pte_young
static inline bool arch_has_hw_pte_young(void)
{
return true;
}
#define arch_check_zapped_pte arch_check_zapped_pte #define arch_check_zapped_pte arch_check_zapped_pte
void arch_check_zapped_pte(struct vm_area_struct *vma, pte_t pte); void arch_check_zapped_pte(struct vm_area_struct *vma, pte_t pte);

View File

@ -449,37 +449,6 @@ int __node_distance(int from, int to)
} }
EXPORT_SYMBOL(__node_distance); EXPORT_SYMBOL(__node_distance);
/*
* Sanity check to catch more bad NUMA configurations (they are amazingly
* common). Make sure the nodes cover all memory.
*/
static bool __init numa_meminfo_cover_memory(const struct numa_meminfo *mi)
{
u64 numaram, e820ram;
int i;
numaram = 0;
for (i = 0; i < mi->nr_blks; i++) {
u64 s = mi->blk[i].start >> PAGE_SHIFT;
u64 e = mi->blk[i].end >> PAGE_SHIFT;
numaram += e - s;
numaram -= __absent_pages_in_range(mi->blk[i].nid, s, e);
if ((s64)numaram < 0)
numaram = 0;
}
e820ram = max_pfn - absent_pages_in_range(0, max_pfn);
/* We seem to lose 3 pages somewhere. Allow 1M of slack. */
if ((s64)(e820ram - numaram) >= (1 << (20 - PAGE_SHIFT))) {
printk(KERN_ERR "NUMA: nodes only cover %LuMB of your %LuMB e820 RAM. Not used.\n",
(numaram << PAGE_SHIFT) >> 20,
(e820ram << PAGE_SHIFT) >> 20);
return false;
}
return true;
}
/* /*
* Mark all currently memblock-reserved physical memory (which covers the * Mark all currently memblock-reserved physical memory (which covers the
* kernel's own memory ranges) as hot-unswappable. * kernel's own memory ranges) as hot-unswappable.
@ -585,7 +554,8 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
return -EINVAL; return -EINVAL;
} }
} }
if (!numa_meminfo_cover_memory(mi))
if (!memblock_validate_numa_coverage(SZ_1M))
return -EINVAL; return -EINVAL;
/* Finally register nodes. */ /* Finally register nodes. */
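
Both the loongarch and x86 hunks replace a private `numa_meminfo_cover_memory()` with a call to `memblock_validate_numa_coverage(SZ_1M)`. The diff shown here does not include that function's body, so the following is only a simplified user-space model of the kind of check the removed code performed: add up memory that no node claims and fail once it reaches the slack threshold. The range struct, the constants and `sketch_numa_coverage_ok` are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed values for illustration only. */
#define SKETCH_PAGE_SHIFT	12
#define SKETCH_NO_NODE		(-1)

struct sketch_pfn_range {
	uint64_t start_pfn;
	uint64_t end_pfn;	/* exclusive */
	int nid;		/* SKETCH_NO_NODE if no node covers the range */
};

/* Return true if the memory not covered by any node stays below the threshold. */
static bool sketch_numa_coverage_ok(const struct sketch_pfn_range *ranges,
				    size_t nr, uint64_t threshold_bytes)
{
	uint64_t uncovered_pages = 0;
	size_t i;

	for (i = 0; i < nr; i++) {
		if (ranges[i].nid == SKETCH_NO_NODE)
			uncovered_pages += ranges[i].end_pfn - ranges[i].start_pfn;
	}
	return (uncovered_pages << SKETCH_PAGE_SHIFT) < threshold_bytes;
}
```

With a 1 MiB threshold, three stray 4K pages pass (the removed x86 comment mentions losing about three pages), while 256 uncovered pages, exactly 1 MiB, fail, matching the `>=` comparison in the removed code.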

View File

@ -793,7 +793,7 @@ config ARCH_FORCE_MAX_ORDER
default "10" default "10"
help help
The kernel page allocator limits the size of maximal physically The kernel page allocator limits the size of maximal physically
contiguous allocations. The limit is called MAX_ORDER and it contiguous allocations. The limit is called MAX_PAGE_ORDER and it
defines the maximal power of two of number of pages that can be defines the maximal power of two of number of pages that can be
allocated as a single contiguous block. This option allows allocated as a single contiguous block. This option allows
overriding the default setting when ability to allocate very overriding the default setting when ability to allocate very

View File

@ -18,6 +18,8 @@
#define KASAN_SHADOW_START (XCHAL_PAGE_TABLE_VADDR + XCHAL_PAGE_TABLE_SIZE) #define KASAN_SHADOW_START (XCHAL_PAGE_TABLE_VADDR + XCHAL_PAGE_TABLE_SIZE)
/* Size of the shadow map */ /* Size of the shadow map */
#define KASAN_SHADOW_SIZE (-KASAN_START_VADDR >> KASAN_SHADOW_SCALE_SHIFT) #define KASAN_SHADOW_SIZE (-KASAN_START_VADDR >> KASAN_SHADOW_SCALE_SHIFT)
/* End of the shadow map */
#define KASAN_SHADOW_END (KASAN_SHADOW_START + KASAN_SHADOW_SIZE)
/* Offset for mem to shadow address transformation */ /* Offset for mem to shadow address transformation */
#define KASAN_SHADOW_OFFSET __XTENSA_UL_CONST(CONFIG_KASAN_SHADOW_OFFSET) #define KASAN_SHADOW_OFFSET __XTENSA_UL_CONST(CONFIG_KASAN_SHADOW_OFFSET)

View File

@ -410,9 +410,24 @@ static int blkdev_get_block(struct inode *inode, sector_t iblock,
return 0; return 0;
} }
static int blkdev_writepage(struct page *page, struct writeback_control *wbc) /*
* We cannot call mpage_writepages() as it does not take the buffer lock.
* We must use block_write_full_folio() directly which holds the buffer
* lock. The buffer lock provides the synchronisation with writeback
* that filesystems rely on when they use the blockdev's mapping.
*/
static int blkdev_writepages(struct address_space *mapping,
struct writeback_control *wbc)
{ {
return block_write_full_page(page, blkdev_get_block, wbc); struct blk_plug plug;
int err;
blk_start_plug(&plug);
err = write_cache_pages(mapping, wbc, block_write_full_folio,
blkdev_get_block);
blk_finish_plug(&plug);
return err;
} }
static int blkdev_read_folio(struct file *file, struct folio *folio) static int blkdev_read_folio(struct file *file, struct folio *folio)
@ -449,7 +464,7 @@ const struct address_space_operations def_blk_aops = {
.invalidate_folio = block_invalidate_folio, .invalidate_folio = block_invalidate_folio,
.read_folio = blkdev_read_folio, .read_folio = blkdev_read_folio,
.readahead = blkdev_readahead, .readahead = blkdev_readahead,
.writepage = blkdev_writepage, .writepages = blkdev_writepages,
.write_begin = blkdev_write_begin, .write_begin = blkdev_write_begin,
.write_end = blkdev_write_end, .write_end = blkdev_write_end,
.migrate_folio = buffer_migrate_folio_norefs, .migrate_folio = buffer_migrate_folio_norefs,
@ -500,7 +515,7 @@ const struct address_space_operations def_blk_aops = {
.readahead = blkdev_readahead, .readahead = blkdev_readahead,
.writepages = blkdev_writepages, .writepages = blkdev_writepages,
.is_partially_uptodate = iomap_is_partially_uptodate, .is_partially_uptodate = iomap_is_partially_uptodate,
.error_remove_page = generic_error_remove_page, .error_remove_folio = generic_error_remove_folio,
.migrate_folio = filemap_migrate_folio, .migrate_folio = filemap_migrate_folio,
}; };
#endif /* CONFIG_BUFFER_HEAD */ #endif /* CONFIG_BUFFER_HEAD */

View File

@ -451,7 +451,7 @@ static int create_sgt(struct qaic_device *qdev, struct sg_table **sgt_out, u64 s
* later * later
*/ */
buf_extra = (PAGE_SIZE - size % PAGE_SIZE) % PAGE_SIZE; buf_extra = (PAGE_SIZE - size % PAGE_SIZE) % PAGE_SIZE;
max_order = min(MAX_ORDER - 1, get_order(size)); max_order = min(MAX_PAGE_ORDER - 1, get_order(size));
} else { } else {
/* allocate a single page for book keeping */ /* allocate a single page for book keeping */
nr_pages = 1; nr_pages = 1;
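
The `buf_extra` expression in this hunk computes how many bytes are needed to pad `size` up to the next page boundary, with the outer modulo turning a full page of padding into zero. A short sketch of that expression on its own, assuming a 4K page; `sketch_pad_to_page` is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed value for illustration only. */
#define SKETCH_PAGE_SIZE UINT64_C(4096)

/* Bytes needed to round size up to a multiple of the page size. */
static uint64_t sketch_pad_to_page(uint64_t size)
{
	return (SKETCH_PAGE_SIZE - size % SKETCH_PAGE_SIZE) % SKETCH_PAGE_SIZE;
}
```

For a size that is already page aligned the inner term is a full page, which the outer modulo reduces to 0; for 4097 bytes the result is 4095.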

View File

@ -234,7 +234,7 @@ static int binder_update_page_range(struct binder_alloc *alloc, int allocate,
if (page->page_ptr) { if (page->page_ptr) {
trace_binder_alloc_lru_start(alloc, index); trace_binder_alloc_lru_start(alloc, index);
on_lru = list_lru_del(&binder_alloc_lru, &page->lru); on_lru = list_lru_del_obj(&binder_alloc_lru, &page->lru);
WARN_ON(!on_lru); WARN_ON(!on_lru);
trace_binder_alloc_lru_end(alloc, index); trace_binder_alloc_lru_end(alloc, index);
@ -285,7 +285,7 @@ free_range:
trace_binder_free_lru_start(alloc, index); trace_binder_free_lru_start(alloc, index);
ret = list_lru_add(&binder_alloc_lru, &page->lru); ret = list_lru_add_obj(&binder_alloc_lru, &page->lru);
WARN_ON(!ret); WARN_ON(!ret);
trace_binder_free_lru_end(alloc, index); trace_binder_free_lru_end(alloc, index);
@ -848,7 +848,7 @@ void binder_alloc_deferred_release(struct binder_alloc *alloc)
if (!alloc->pages[i].page_ptr) if (!alloc->pages[i].page_ptr)
continue; continue;
on_lru = list_lru_del(&binder_alloc_lru, on_lru = list_lru_del_obj(&binder_alloc_lru,
&alloc->pages[i].lru); &alloc->pages[i].lru);
page_addr = alloc->buffer + i * PAGE_SIZE; page_addr = alloc->buffer + i * PAGE_SIZE;
binder_alloc_debug(BINDER_DEBUG_BUFFER_ALLOC, binder_alloc_debug(BINDER_DEBUG_BUFFER_ALLOC,
@ -1287,4 +1287,3 @@ int binder_alloc_copy_from_buffer(struct binder_alloc *alloc,
return binder_alloc_do_buffer_copy(alloc, false, buffer, buffer_offset, return binder_alloc_do_buffer_copy(alloc, false, buffer, buffer_offset,
dest, bytes); dest, bytes);
} }

View File

@ -226,8 +226,8 @@ static ssize_t regmap_read_debugfs(struct regmap *map, unsigned int from,
if (*ppos < 0 || !count) if (*ppos < 0 || !count)
return -EINVAL; return -EINVAL;
if (count > (PAGE_SIZE << MAX_ORDER)) if (count > (PAGE_SIZE << MAX_PAGE_ORDER))
count = PAGE_SIZE << MAX_ORDER; count = PAGE_SIZE << MAX_PAGE_ORDER;
buf = kmalloc(count, GFP_KERNEL); buf = kmalloc(count, GFP_KERNEL);
if (!buf) if (!buf)
@ -373,8 +373,8 @@ static ssize_t regmap_reg_ranges_read_file(struct file *file,
if (*ppos < 0 || !count) if (*ppos < 0 || !count)
return -EINVAL; return -EINVAL;
if (count > (PAGE_SIZE << MAX_ORDER)) if (count > (PAGE_SIZE << MAX_PAGE_ORDER))
count = PAGE_SIZE << MAX_ORDER; count = PAGE_SIZE << MAX_PAGE_ORDER;
buf = kmalloc(count, GFP_KERNEL); buf = kmalloc(count, GFP_KERNEL);
if (!buf) if (!buf)

View File

@ -3079,7 +3079,7 @@ static void raw_cmd_free(struct floppy_raw_cmd **ptr)
} }
} }
#define MAX_LEN (1UL << MAX_ORDER << PAGE_SHIFT) #define MAX_LEN (1UL << MAX_PAGE_ORDER << PAGE_SHIFT)
static int raw_cmd_copyin(int cmd, void __user *param, static int raw_cmd_copyin(int cmd, void __user *param,
struct floppy_raw_cmd **rcmd) struct floppy_raw_cmd **rcmd)

View File

@ -59,8 +59,8 @@ config ZRAM_WRITEBACK
bool "Write back incompressible or idle page to backing device" bool "Write back incompressible or idle page to backing device"
depends on ZRAM depends on ZRAM
help help
With incompressible page, there is no memory saving to keep it This lets zram entries (incompressible or idle pages) be written
in memory. Instead, write it out to backing device. back to a backing device, helping save memory.
For this feature, admin should set up backing device via For this feature, admin should set up backing device via
/sys/block/zramX/backing_dev. /sys/block/zramX/backing_dev.
@ -69,9 +69,18 @@ config ZRAM_WRITEBACK
See Documentation/admin-guide/blockdev/zram.rst for more information. See Documentation/admin-guide/blockdev/zram.rst for more information.
config ZRAM_TRACK_ENTRY_ACTIME
bool "Track access time of zram entries"
depends on ZRAM
help
With this feature zram tracks access time of every stored
entry (page), which can be used for a more fine grained IDLE
pages writeback.
config ZRAM_MEMORY_TRACKING config ZRAM_MEMORY_TRACKING
bool "Track zRam block status" bool "Track zRam block status"
depends on ZRAM && DEBUG_FS depends on ZRAM && DEBUG_FS
select ZRAM_TRACK_ENTRY_ACTIME
help help
With this feature, admin can track the state of allocated blocks With this feature, admin can track the state of allocated blocks
of zRAM. Admin could see the information via of zRAM. Admin could see the information via
@ -86,4 +95,4 @@ config ZRAM_MULTI_COMP
This will enable multi-compression streams, so that ZRAM can This will enable multi-compression streams, so that ZRAM can
re-compress pages using a potentially slower but more effective re-compress pages using a potentially slower but more effective
compression algorithm. Note, that IDLE page recompression compression algorithm. Note, that IDLE page recompression
requires ZRAM_MEMORY_TRACKING. requires ZRAM_TRACK_ENTRY_ACTIME.

View File

@ -174,6 +174,14 @@ static inline u32 zram_get_priority(struct zram *zram, u32 index)
return prio & ZRAM_COMP_PRIORITY_MASK; return prio & ZRAM_COMP_PRIORITY_MASK;
} }
static void zram_accessed(struct zram *zram, u32 index)
{
zram_clear_flag(zram, index, ZRAM_IDLE);
#ifdef CONFIG_ZRAM_TRACK_ENTRY_ACTIME
zram->table[index].ac_time = ktime_get_boottime();
#endif
}
static inline void update_used_max(struct zram *zram, static inline void update_used_max(struct zram *zram,
const unsigned long pages) const unsigned long pages)
{ {
@ -293,8 +301,9 @@ static void mark_idle(struct zram *zram, ktime_t cutoff)
zram_slot_lock(zram, index); zram_slot_lock(zram, index);
if (zram_allocated(zram, index) && if (zram_allocated(zram, index) &&
!zram_test_flag(zram, index, ZRAM_UNDER_WB)) { !zram_test_flag(zram, index, ZRAM_UNDER_WB)) {
#ifdef CONFIG_ZRAM_MEMORY_TRACKING #ifdef CONFIG_ZRAM_TRACK_ENTRY_ACTIME
is_idle = !cutoff || ktime_after(cutoff, zram->table[index].ac_time); is_idle = !cutoff || ktime_after(cutoff,
zram->table[index].ac_time);
#endif #endif
if (is_idle) if (is_idle)
zram_set_flag(zram, index, ZRAM_IDLE); zram_set_flag(zram, index, ZRAM_IDLE);
@ -317,7 +326,7 @@ static ssize_t idle_store(struct device *dev,
*/ */
u64 age_sec; u64 age_sec;
if (IS_ENABLED(CONFIG_ZRAM_MEMORY_TRACKING) && !kstrtoull(buf, 0, &age_sec)) if (IS_ENABLED(CONFIG_ZRAM_TRACK_ENTRY_ACTIME) && !kstrtoull(buf, 0, &age_sec))
cutoff_time = ktime_sub(ktime_get_boottime(), cutoff_time = ktime_sub(ktime_get_boottime(),
ns_to_ktime(age_sec * NSEC_PER_SEC)); ns_to_ktime(age_sec * NSEC_PER_SEC));
else else
@ -841,12 +850,6 @@ static void zram_debugfs_destroy(void)
debugfs_remove_recursive(zram_debugfs_root); debugfs_remove_recursive(zram_debugfs_root);
} }
static void zram_accessed(struct zram *zram, u32 index)
{
zram_clear_flag(zram, index, ZRAM_IDLE);
zram->table[index].ac_time = ktime_get_boottime();
}
static ssize_t read_block_state(struct file *file, char __user *buf, static ssize_t read_block_state(struct file *file, char __user *buf,
size_t count, loff_t *ppos) size_t count, loff_t *ppos)
{ {
@ -930,10 +933,6 @@ static void zram_debugfs_unregister(struct zram *zram)
#else #else
static void zram_debugfs_create(void) {}; static void zram_debugfs_create(void) {};
static void zram_debugfs_destroy(void) {}; static void zram_debugfs_destroy(void) {};
static void zram_accessed(struct zram *zram, u32 index)
{
zram_clear_flag(zram, index, ZRAM_IDLE);
};
static void zram_debugfs_register(struct zram *zram) {}; static void zram_debugfs_register(struct zram *zram) {};
static void zram_debugfs_unregister(struct zram *zram) {}; static void zram_debugfs_unregister(struct zram *zram) {};
#endif #endif
@ -1254,7 +1253,7 @@ static void zram_free_page(struct zram *zram, size_t index)
{ {
unsigned long handle; unsigned long handle;
#ifdef CONFIG_ZRAM_MEMORY_TRACKING #ifdef CONFIG_ZRAM_TRACK_ENTRY_ACTIME
zram->table[index].ac_time = 0; zram->table[index].ac_time = 0;
#endif #endif
if (zram_test_flag(zram, index, ZRAM_IDLE)) if (zram_test_flag(zram, index, ZRAM_IDLE))
@ -1322,9 +1321,9 @@ static int zram_read_from_zspool(struct zram *zram, struct page *page,
void *mem; void *mem;
value = handle ? zram_get_element(zram, index) : 0; value = handle ? zram_get_element(zram, index) : 0;
mem = kmap_atomic(page); mem = kmap_local_page(page);
zram_fill_page(mem, PAGE_SIZE, value); zram_fill_page(mem, PAGE_SIZE, value);
kunmap_atomic(mem); kunmap_local(mem);
return 0; return 0;
} }
@ -1337,14 +1336,14 @@ static int zram_read_from_zspool(struct zram *zram, struct page *page,
src = zs_map_object(zram->mem_pool, handle, ZS_MM_RO); src = zs_map_object(zram->mem_pool, handle, ZS_MM_RO);
if (size == PAGE_SIZE) { if (size == PAGE_SIZE) {
dst = kmap_atomic(page); dst = kmap_local_page(page);
memcpy(dst, src, PAGE_SIZE); memcpy(dst, src, PAGE_SIZE);
kunmap_atomic(dst); kunmap_local(dst);
ret = 0; ret = 0;
} else { } else {
dst = kmap_atomic(page); dst = kmap_local_page(page);
ret = zcomp_decompress(zstrm, src, size, dst); ret = zcomp_decompress(zstrm, src, size, dst);
kunmap_atomic(dst); kunmap_local(dst);
zcomp_stream_put(zram->comps[prio]); zcomp_stream_put(zram->comps[prio]);
} }
zs_unmap_object(zram->mem_pool, handle); zs_unmap_object(zram->mem_pool, handle);
@ -1417,21 +1416,21 @@ static int zram_write_page(struct zram *zram, struct page *page, u32 index)
unsigned long element = 0; unsigned long element = 0;
enum zram_pageflags flags = 0; enum zram_pageflags flags = 0;
mem = kmap_atomic(page); mem = kmap_local_page(page);
if (page_same_filled(mem, &element)) { if (page_same_filled(mem, &element)) {
kunmap_atomic(mem); kunmap_local(mem);
/* Free memory associated with this sector now. */ /* Free memory associated with this sector now. */
flags = ZRAM_SAME; flags = ZRAM_SAME;
atomic64_inc(&zram->stats.same_pages); atomic64_inc(&zram->stats.same_pages);
goto out; goto out;
} }
kunmap_atomic(mem); kunmap_local(mem);
compress_again: compress_again:
zstrm = zcomp_stream_get(zram->comps[ZRAM_PRIMARY_COMP]); zstrm = zcomp_stream_get(zram->comps[ZRAM_PRIMARY_COMP]);
src = kmap_atomic(page); src = kmap_local_page(page);
ret = zcomp_compress(zstrm, src, &comp_len); ret = zcomp_compress(zstrm, src, &comp_len);
kunmap_atomic(src); kunmap_local(src);
if (unlikely(ret)) { if (unlikely(ret)) {
zcomp_stream_put(zram->comps[ZRAM_PRIMARY_COMP]); zcomp_stream_put(zram->comps[ZRAM_PRIMARY_COMP]);
@ -1495,10 +1494,10 @@ compress_again:
src = zstrm->buffer; src = zstrm->buffer;
if (comp_len == PAGE_SIZE) if (comp_len == PAGE_SIZE)
src = kmap_atomic(page); src = kmap_local_page(page);
memcpy(dst, src, comp_len); memcpy(dst, src, comp_len);
if (comp_len == PAGE_SIZE) if (comp_len == PAGE_SIZE)
kunmap_atomic(src); kunmap_local(src);
zcomp_stream_put(zram->comps[ZRAM_PRIMARY_COMP]); zcomp_stream_put(zram->comps[ZRAM_PRIMARY_COMP]);
zs_unmap_object(zram->mem_pool, handle); zs_unmap_object(zram->mem_pool, handle);
@ -1615,9 +1614,9 @@ static int zram_recompress(struct zram *zram, u32 index, struct page *page,
num_recomps++; num_recomps++;
zstrm = zcomp_stream_get(zram->comps[prio]); zstrm = zcomp_stream_get(zram->comps[prio]);
src = kmap_atomic(page); src = kmap_local_page(page);
ret = zcomp_compress(zstrm, src, &comp_len_new); ret = zcomp_compress(zstrm, src, &comp_len_new);
kunmap_atomic(src); kunmap_local(src);
if (ret) { if (ret) {
zcomp_stream_put(zram->comps[prio]); zcomp_stream_put(zram->comps[prio]);


@ -69,7 +69,7 @@ struct zram_table_entry {
unsigned long element; unsigned long element;
}; };
unsigned long flags; unsigned long flags;
#ifdef CONFIG_ZRAM_MEMORY_TRACKING #ifdef CONFIG_ZRAM_TRACK_ENTRY_ACTIME
ktime_t ac_time; ktime_t ac_time;
#endif #endif
}; };
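Note on the idle-marking hunks above: the following is a minimal userspace sketch of how the `ac_time` field appears to be used, assuming plain 64-bit nanosecond counters in place of `ktime_t`. The names `idle_cutoff` and `entry_is_idle` are hypothetical and do not exist in the driver; they only model the `ktime_sub()` and `!cutoff || ktime_after(cutoff, ac_time)` expressions shown in the diff.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Nanoseconds since boot; a stand-in for the kernel's ktime_t. */
typedef int64_t ktime_ns;

#define NSEC_PER_SEC 1000000000LL

/* Hypothetical helper: cutoff timestamp for "not accessed for age_sec seconds". */
static ktime_ns idle_cutoff(ktime_ns now, uint64_t age_sec)
{
	/* Guard against overflow in the multiplication below. */
	assert(age_sec <= (uint64_t)(INT64_MAX / NSEC_PER_SEC));
	return now - (ktime_ns)age_sec * NSEC_PER_SEC;
}

/*
 * Hypothetical helper modelling the test in mark_idle():
 * a zero cutoff matches every entry, otherwise the entry is idle
 * only if its last access is strictly older than the cutoff.
 */
static bool entry_is_idle(ktime_ns cutoff, ktime_ns ac_time)
{
	return cutoff == 0 || cutoff > ac_time;
}
```

With the access time compiled out (no `CONFIG_ZRAM_TRACK_ENTRY_ACTIME`), only the zero-cutoff case of this model applies.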


@ -906,7 +906,7 @@ static int sev_ioctl_do_get_id2(struct sev_issue_cmd *argp)
/* /*
* The length of the ID shouldn't be assumed by software since * The length of the ID shouldn't be assumed by software since
* it may change in the future. The allocation size is limited * it may change in the future. The allocation size is limited
* to 1 << (PAGE_SHIFT + MAX_ORDER) by the page allocator. * to 1 << (PAGE_SHIFT + MAX_PAGE_ORDER) by the page allocator.
* If the allocation fails, simply return ENOMEM rather than * If the allocation fails, simply return ENOMEM rather than
* warning in the kernel log. * warning in the kernel log.
*/ */


@ -70,11 +70,11 @@ struct hisi_acc_sgl_pool *hisi_acc_create_sgl_pool(struct device *dev,
HISI_ACC_SGL_ALIGN_SIZE); HISI_ACC_SGL_ALIGN_SIZE);
/* /*
* the pool may allocate a block of memory of size PAGE_SIZE * 2^MAX_ORDER, * the pool may allocate a block of memory of size PAGE_SIZE * 2^MAX_PAGE_ORDER,
* block size may exceed 2^31 on ia64, so the max of block size is 2^31 * block size may exceed 2^31 on ia64, so the max of block size is 2^31
*/ */
block_size = 1 << (PAGE_SHIFT + MAX_ORDER < 32 ? block_size = 1 << (PAGE_SHIFT + MAX_PAGE_ORDER < 32 ?
PAGE_SHIFT + MAX_ORDER : 31); PAGE_SHIFT + MAX_PAGE_ORDER : 31);
sgl_num_per_block = block_size / sgl_size; sgl_num_per_block = block_size / sgl_size;
block_num = count / sgl_num_per_block; block_num = count / sgl_num_per_block;
remain_sgl = count % sgl_num_per_block; remain_sgl = count % sgl_num_per_block;
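Note on the block-size arithmetic above: a small userspace sketch of the clamp and the subsequent split, under the assumption that the page shift and maximum order are passed in as parameters (in the kernel they are configuration-dependent constants). `struct sgl_layout` and `sgl_pool_layout` are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the values computed in the hunk above. */
struct sgl_layout {
	uint32_t block_size;        /* bytes per block, capped at 2^31 */
	uint32_t sgl_num_per_block; /* SGLs that fit in one block */
	uint32_t block_num;         /* number of full blocks */
	uint32_t remain_sgl;        /* SGLs left for a final partial block */
};

static struct sgl_layout sgl_pool_layout(unsigned int page_shift,
					 unsigned int max_page_order,
					 uint32_t sgl_size, uint32_t count)
{
	struct sgl_layout l;
	unsigned int shift = page_shift + max_page_order;

	/* Same clamp as the diff: use 31 once the shift reaches 32 bits. */
	if (shift >= 32)
		shift = 31;
	l.block_size = (uint32_t)1 << shift;

	/* An SGL must fit in a block, otherwise the division below is by zero. */
	assert(sgl_size > 0 && sgl_size <= l.block_size);

	l.sgl_num_per_block = l.block_size / sgl_size;
	l.block_num = count / l.sgl_num_per_block;
	l.remain_sgl = count % l.sgl_num_per_block;
	return l;
}
```

For example, with 4 KiB pages (shift 12) and a maximum order of 10, a block is 4 MiB, so 2500 SGLs of 4096 bytes occupy two full blocks plus 452 remaining entries.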


@ -367,6 +367,7 @@ static ssize_t create_store(struct device *dev, struct device_attribute *attr,
.dax_region = dax_region, .dax_region = dax_region,
.size = 0, .size = 0,
.id = -1, .id = -1,
.memmap_on_memory = false,
}; };
struct dev_dax *dev_dax = devm_create_dev_dax(&data); struct dev_dax *dev_dax = devm_create_dev_dax(&data);
@ -1400,6 +1401,8 @@ struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data)
dev_dax->align = dax_region->align; dev_dax->align = dax_region->align;
ida_init(&dev_dax->ida); ida_init(&dev_dax->ida);
dev_dax->memmap_on_memory = data->memmap_on_memory;
inode = dax_inode(dax_dev); inode = dax_inode(dax_dev);
dev->devt = inode->i_rdev; dev->devt = inode->i_rdev;
dev->bus = &dax_bus_type; dev->bus = &dax_bus_type;


@ -23,6 +23,7 @@ struct dev_dax_data {
struct dev_pagemap *pgmap; struct dev_pagemap *pgmap;
resource_size_t size; resource_size_t size;
int id; int id;
bool memmap_on_memory;
}; };
struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data); struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data);


@ -26,6 +26,7 @@ static int cxl_dax_region_probe(struct device *dev)
.dax_region = dax_region, .dax_region = dax_region,
.id = -1, .id = -1,
.size = range_len(&cxlr_dax->hpa_range), .size = range_len(&cxlr_dax->hpa_range),
.memmap_on_memory = true,
}; };
return PTR_ERR_OR_ZERO(devm_create_dev_dax(&data)); return PTR_ERR_OR_ZERO(devm_create_dev_dax(&data));


@ -70,6 +70,7 @@ struct dev_dax {
struct ida ida; struct ida ida;
struct device dev; struct device dev;
struct dev_pagemap *pgmap; struct dev_pagemap *pgmap;
bool memmap_on_memory;
int nr_range; int nr_range;
struct dev_dax_range { struct dev_dax_range {
unsigned long pgoff; unsigned long pgoff;


@ -36,6 +36,7 @@ static int dax_hmem_probe(struct platform_device *pdev)
.dax_region = dax_region, .dax_region = dax_region,
.id = -1, .id = -1,
.size = region_idle ? 0 : range_len(&mri->range), .size = region_idle ? 0 : range_len(&mri->range),
.memmap_on_memory = false,
}; };
return PTR_ERR_OR_ZERO(devm_create_dev_dax(&data)); return PTR_ERR_OR_ZERO(devm_create_dev_dax(&data));


@ -12,6 +12,7 @@
#include <linux/mm.h> #include <linux/mm.h>
#include <linux/mman.h> #include <linux/mman.h>
#include <linux/memory-tiers.h> #include <linux/memory-tiers.h>
#include <linux/memory_hotplug.h>
#include "dax-private.h" #include "dax-private.h"
#include "bus.h" #include "bus.h"
@ -93,6 +94,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
struct dax_kmem_data *data; struct dax_kmem_data *data;
struct memory_dev_type *mtype; struct memory_dev_type *mtype;
int i, rc, mapped = 0; int i, rc, mapped = 0;
mhp_t mhp_flags;
int numa_node; int numa_node;
int adist = MEMTIER_DEFAULT_DAX_ADISTANCE; int adist = MEMTIER_DEFAULT_DAX_ADISTANCE;
@ -179,12 +181,16 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
*/ */
res->flags = IORESOURCE_SYSTEM_RAM; res->flags = IORESOURCE_SYSTEM_RAM;
mhp_flags = MHP_NID_IS_MGID;
if (dev_dax->memmap_on_memory)
mhp_flags |= MHP_MEMMAP_ON_MEMORY;
/* /*
* Ensure that future kexec'd kernels will not treat * Ensure that future kexec'd kernels will not treat
* this as RAM automatically. * this as RAM automatically.
*/ */
rc = add_memory_driver_managed(data->mgid, range.start, rc = add_memory_driver_managed(data->mgid, range.start,
range_len(&range), kmem_name, MHP_NID_IS_MGID); range_len(&range), kmem_name, mhp_flags);
if (rc) { if (rc) {
dev_warn(dev, "mapping%d: %#llx-%#llx memory add failed\n", dev_warn(dev, "mapping%d: %#llx-%#llx memory add failed\n",


@ -63,6 +63,7 @@ static struct dev_dax *__dax_pmem_probe(struct device *dev)
.id = id, .id = id,
.pgmap = &pgmap, .pgmap = &pgmap,
.size = range_len(&range), .size = range_len(&range),
.memmap_on_memory = false,
}; };
return devm_create_dev_dax(&data); return devm_create_dev_dax(&data);
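Note on the dax hunks above: each creator of a dev_dax sets `memmap_on_memory` (true for the CXL region, false for the others shown), and the kmem probe turns that into a hotplug flag. The sketch below models that flow in userspace. The flag bit values and the `unsigned int` typedef are placeholders chosen for illustration, not the kernel's definitions, and `struct dev_dax_model` and `kmem_mhp_flags` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

/* Placeholder type and bit values; the kernel defines its own. */
typedef unsigned int mhp_t;
#define MHP_MEMMAP_ON_MEMORY ((mhp_t)1u << 1)
#define MHP_NID_IS_MGID      ((mhp_t)1u << 2)

/* Hypothetical reduced model of struct dev_dax: only the new field. */
struct dev_dax_model {
	bool memmap_on_memory;
};

/* Models the flag selection added to dev_dax_kmem_probe(). */
static mhp_t kmem_mhp_flags(const struct dev_dax_model *dev_dax)
{
	mhp_t mhp_flags = MHP_NID_IS_MGID;

	assert(dev_dax);
	if (dev_dax->memmap_on_memory)
		mhp_flags |= MHP_MEMMAP_ON_MEMORY;
	return mhp_flags;
}
```

Keeping the decision in a per-device boolean lets each bus driver opt in without the kmem driver needing to know which bus created the device.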


@ -36,7 +36,7 @@ static int i915_gem_object_get_pages_internal(struct drm_i915_gem_object *obj)
struct sg_table *st; struct sg_table *st;
struct scatterlist *sg; struct scatterlist *sg;
unsigned int npages; /* restricted by sg_alloc_table */ unsigned int npages; /* restricted by sg_alloc_table */
int max_order = MAX_ORDER; int max_order = MAX_PAGE_ORDER;
unsigned int max_segment; unsigned int max_segment;
gfp_t gfp; gfp_t gfp;


@ -115,7 +115,7 @@ static int get_huge_pages(struct drm_i915_gem_object *obj)
do { do {
struct page *page; struct page *page;
GEM_BUG_ON(order > MAX_ORDER); GEM_BUG_ON(order > MAX_PAGE_ORDER);
page = alloc_pages(GFP | __GFP_ZERO, order); page = alloc_pages(GFP | __GFP_ZERO, order);
if (!page) if (!page)
goto err; goto err;


@ -175,7 +175,7 @@ static void ttm_device_init_pools(struct kunit *test)
if (params->pools_init_expected) { if (params->pools_init_expected) {
for (int i = 0; i < TTM_NUM_CACHING_TYPES; ++i) { for (int i = 0; i < TTM_NUM_CACHING_TYPES; ++i) {
for (int j = 0; j <= MAX_ORDER; ++j) { for (int j = 0; j < NR_PAGE_ORDERS; ++j) {
pt = pool->caching[i].orders[j]; pt = pool->caching[i].orders[j];
KUNIT_EXPECT_PTR_EQ(test, pt.pool, pool); KUNIT_EXPECT_PTR_EQ(test, pt.pool, pool);
KUNIT_EXPECT_EQ(test, pt.caching, i); KUNIT_EXPECT_EQ(test, pt.caching, i);


@ -109,7 +109,7 @@ static const struct ttm_pool_test_case ttm_pool_basic_cases[] = {
}, },
{ {
.description = "Above the allocation limit", .description = "Above the allocation limit",
.order = MAX_ORDER + 1, .order = MAX_PAGE_ORDER + 1,
}, },
{ {
.description = "One page, with coherent DMA mappings enabled", .description = "One page, with coherent DMA mappings enabled",
@ -118,7 +118,7 @@ static const struct ttm_pool_test_case ttm_pool_basic_cases[] = {
}, },
{ {
.description = "Above the allocation limit, with coherent DMA mappings enabled", .description = "Above the allocation limit, with coherent DMA mappings enabled",
.order = MAX_ORDER + 1, .order = MAX_PAGE_ORDER + 1,
.use_dma_alloc = true, .use_dma_alloc = true,
}, },
}; };
@ -165,7 +165,7 @@ static void ttm_pool_alloc_basic(struct kunit *test)
fst_page = tt->pages[0]; fst_page = tt->pages[0];
last_page = tt->pages[tt->num_pages - 1]; last_page = tt->pages[tt->num_pages - 1];
if (params->order <= MAX_ORDER) { if (params->order <= MAX_PAGE_ORDER) {
if (params->use_dma_alloc) { if (params->use_dma_alloc) {
KUNIT_ASSERT_NOT_NULL(test, (void *)fst_page->private); KUNIT_ASSERT_NOT_NULL(test, (void *)fst_page->private);
KUNIT_ASSERT_NOT_NULL(test, (void *)last_page->private); KUNIT_ASSERT_NOT_NULL(test, (void *)last_page->private);
@ -182,7 +182,7 @@ static void ttm_pool_alloc_basic(struct kunit *test)
* order 0 blocks * order 0 blocks
*/ */
KUNIT_ASSERT_EQ(test, fst_page->private, KUNIT_ASSERT_EQ(test, fst_page->private,
min_t(unsigned int, MAX_ORDER, min_t(unsigned int, MAX_PAGE_ORDER,
params->order)); params->order));
KUNIT_ASSERT_EQ(test, last_page->private, 0); KUNIT_ASSERT_EQ(test, last_page->private, 0);
} }


@ -65,11 +65,11 @@ module_param(page_pool_size, ulong, 0644);
static atomic_long_t allocated_pages; static atomic_long_t allocated_pages;
static struct ttm_pool_type global_write_combined[MAX_ORDER + 1]; static struct ttm_pool_type global_write_combined[NR_PAGE_ORDERS];
static struct ttm_pool_type global_uncached[MAX_ORDER + 1]; static struct ttm_pool_type global_uncached[NR_PAGE_ORDERS];
static struct ttm_pool_type global_dma32_write_combined[MAX_ORDER + 1]; static struct ttm_pool_type global_dma32_write_combined[NR_PAGE_ORDERS];
static struct ttm_pool_type global_dma32_uncached[MAX_ORDER + 1]; static struct ttm_pool_type global_dma32_uncached[NR_PAGE_ORDERS];
static spinlock_t shrinker_lock; static spinlock_t shrinker_lock;
static struct list_head shrinker_list; static struct list_head shrinker_list;
@ -447,7 +447,7 @@ int ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
else else
gfp_flags |= GFP_HIGHUSER; gfp_flags |= GFP_HIGHUSER;
for (order = min_t(unsigned int, MAX_ORDER, __fls(num_pages)); for (order = min_t(unsigned int, MAX_PAGE_ORDER, __fls(num_pages));
num_pages; num_pages;
order = min_t(unsigned int, order, __fls(num_pages))) { order = min_t(unsigned int, order, __fls(num_pages))) {
struct ttm_pool_type *pt; struct ttm_pool_type *pt;
@ -568,7 +568,7 @@ void ttm_pool_init(struct ttm_pool *pool, struct device *dev,
if (use_dma_alloc || nid != NUMA_NO_NODE) { if (use_dma_alloc || nid != NUMA_NO_NODE) {
for (i = 0; i < TTM_NUM_CACHING_TYPES; ++i) for (i = 0; i < TTM_NUM_CACHING_TYPES; ++i)
for (j = 0; j <= MAX_ORDER; ++j) for (j = 0; j < NR_PAGE_ORDERS; ++j)
ttm_pool_type_init(&pool->caching[i].orders[j], ttm_pool_type_init(&pool->caching[i].orders[j],
pool, i, j); pool, i, j);
} }
@ -601,7 +601,7 @@ void ttm_pool_fini(struct ttm_pool *pool)
if (pool->use_dma_alloc || pool->nid != NUMA_NO_NODE) { if (pool->use_dma_alloc || pool->nid != NUMA_NO_NODE) {
for (i = 0; i < TTM_NUM_CACHING_TYPES; ++i) for (i = 0; i < TTM_NUM_CACHING_TYPES; ++i)
for (j = 0; j <= MAX_ORDER; ++j) for (j = 0; j < NR_PAGE_ORDERS; ++j)
ttm_pool_type_fini(&pool->caching[i].orders[j]); ttm_pool_type_fini(&pool->caching[i].orders[j]);
} }
@ -656,7 +656,7 @@ static void ttm_pool_debugfs_header(struct seq_file *m)
unsigned int i; unsigned int i;
seq_puts(m, "\t "); seq_puts(m, "\t ");
for (i = 0; i <= MAX_ORDER; ++i) for (i = 0; i < NR_PAGE_ORDERS; ++i)
seq_printf(m, " ---%2u---", i); seq_printf(m, " ---%2u---", i);
seq_puts(m, "\n"); seq_puts(m, "\n");
} }
@ -667,7 +667,7 @@ static void ttm_pool_debugfs_orders(struct ttm_pool_type *pt,
{ {
unsigned int i; unsigned int i;
for (i = 0; i <= MAX_ORDER; ++i) for (i = 0; i < NR_PAGE_ORDERS; ++i)
seq_printf(m, " %8u", ttm_pool_type_count(&pt[i])); seq_printf(m, " %8u", ttm_pool_type_count(&pt[i]));
seq_puts(m, "\n"); seq_puts(m, "\n");
} }
@ -776,7 +776,7 @@ int ttm_pool_mgr_init(unsigned long num_pages)
spin_lock_init(&shrinker_lock); spin_lock_init(&shrinker_lock);
INIT_LIST_HEAD(&shrinker_list); INIT_LIST_HEAD(&shrinker_list);
for (i = 0; i <= MAX_ORDER; ++i) { for (i = 0; i < NR_PAGE_ORDERS; ++i) {
ttm_pool_type_init(&global_write_combined[i], NULL, ttm_pool_type_init(&global_write_combined[i], NULL,
ttm_write_combined, i); ttm_write_combined, i);
ttm_pool_type_init(&global_uncached[i], NULL, ttm_uncached, i); ttm_pool_type_init(&global_uncached[i], NULL, ttm_uncached, i);
@ -816,7 +816,7 @@ void ttm_pool_mgr_fini(void)
{ {
unsigned int i; unsigned int i;
for (i = 0; i <= MAX_ORDER; ++i) { for (i = 0; i < NR_PAGE_ORDERS; ++i) {
ttm_pool_type_fini(&global_write_combined[i]); ttm_pool_type_fini(&global_write_combined[i]);
ttm_pool_type_fini(&global_uncached[i]); ttm_pool_type_fini(&global_uncached[i]);
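Note on the loop conversions above: the diff replaces `[MAX_ORDER + 1]` array sizes with `[NR_PAGE_ORDERS]` and `<= MAX_ORDER` bounds with `< NR_PAGE_ORDERS`, which is consistent with `NR_PAGE_ORDERS` being `MAX_PAGE_ORDER + 1`. A minimal userspace sketch of that relationship follows; the value 10 is assumed for illustration (the real value is configuration-dependent), and `struct pool_type`, `pools_init` and `pool_for_order` are hypothetical names.

```c
#include <assert.h>

#define MAX_PAGE_ORDER 10                   /* assumed value for this sketch */
#define NR_PAGE_ORDERS (MAX_PAGE_ORDER + 1) /* orders 0..MAX_PAGE_ORDER inclusive */

/* Hypothetical per-order pool, loosely modelled on ttm_pool_type. */
struct pool_type {
	unsigned int order;
	int initialized;
};

static struct pool_type pools[NR_PAGE_ORDERS];

/* Initialise one pool per order; returns how many were visited. */
static unsigned int pools_init(void)
{
	unsigned int i;

	for (i = 0; i < NR_PAGE_ORDERS; ++i) {
		pools[i].order = i;
		pools[i].initialized = 1;
	}
	return i;
}

/* The highest valid index is MAX_PAGE_ORDER itself. */
static struct pool_type *pool_for_order(unsigned int order)
{
	assert(order <= MAX_PAGE_ORDER);
	return &pools[order];
}
```

Sizing the array and bounding the loop with the same count makes the inclusive range explicit and avoids scattering `+ 1` adjustments.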


@ -188,7 +188,7 @@
#ifdef CONFIG_CMA_ALIGNMENT #ifdef CONFIG_CMA_ALIGNMENT
#define Q_MAX_SZ_SHIFT (PAGE_SHIFT + CONFIG_CMA_ALIGNMENT) #define Q_MAX_SZ_SHIFT (PAGE_SHIFT + CONFIG_CMA_ALIGNMENT)
#else #else
#define Q_MAX_SZ_SHIFT (PAGE_SHIFT + MAX_ORDER) #define Q_MAX_SZ_SHIFT (PAGE_SHIFT + MAX_PAGE_ORDER)
#endif #endif
/* /*


@ -884,7 +884,7 @@ static struct page **__iommu_dma_alloc_pages(struct device *dev,
struct page **pages; struct page **pages;
unsigned int i = 0, nid = dev_to_node(dev); unsigned int i = 0, nid = dev_to_node(dev);
order_mask &= GENMASK(MAX_ORDER, 0); order_mask &= GENMASK(MAX_PAGE_ORDER, 0);
if (!order_mask) if (!order_mask)
return NULL; return NULL;


@ -2465,8 +2465,8 @@ static bool its_parse_indirect_baser(struct its_node *its,
* feature is not supported by hardware. * feature is not supported by hardware.
*/ */
new_order = max_t(u32, get_order(esz << ids), new_order); new_order = max_t(u32, get_order(esz << ids), new_order);
if (new_order > MAX_ORDER) { if (new_order > MAX_PAGE_ORDER) {
new_order = MAX_ORDER; new_order = MAX_PAGE_ORDER;
ids = ilog2(PAGE_ORDER_TO_SIZE(new_order) / (int)esz); ids = ilog2(PAGE_ORDER_TO_SIZE(new_order) / (int)esz);
pr_warn("ITS@%pa: %s Table too large, reduce ids %llu->%u\n", pr_warn("ITS@%pa: %s Table too large, reduce ids %llu->%u\n",
&its->phys_base, its_base_type_string[type], &its->phys_base, its_base_type_string[type],


@ -1170,7 +1170,7 @@ static void __cache_size_refresh(void)
* If the allocation may fail we use __get_free_pages. Memory fragmentation * If the allocation may fail we use __get_free_pages. Memory fragmentation
* won't have a fatal effect here, but it just causes flushes of some other * won't have a fatal effect here, but it just causes flushes of some other
* buffers and more I/O will be performed. Don't use __get_free_pages if it * buffers and more I/O will be performed. Don't use __get_free_pages if it
* always fails (i.e. order > MAX_ORDER). * always fails (i.e. order > MAX_PAGE_ORDER).
* *
* If the allocation shouldn't fail we use __vmalloc. This is only for the * If the allocation shouldn't fail we use __vmalloc. This is only for the
* initial reserve allocation, so there's no risk of wasting all vmalloc * initial reserve allocation, so there's no risk of wasting all vmalloc


@ -1673,7 +1673,7 @@ static struct bio *crypt_alloc_buffer(struct dm_crypt_io *io, unsigned int size)
unsigned int nr_iovecs = (size + PAGE_SIZE - 1) >> PAGE_SHIFT; unsigned int nr_iovecs = (size + PAGE_SIZE - 1) >> PAGE_SHIFT;
gfp_t gfp_mask = GFP_NOWAIT | __GFP_HIGHMEM; gfp_t gfp_mask = GFP_NOWAIT | __GFP_HIGHMEM;
unsigned int remaining_size; unsigned int remaining_size;
unsigned int order = MAX_ORDER; unsigned int order = MAX_PAGE_ORDER;
retry: retry:
if (unlikely(gfp_mask & __GFP_DIRECT_RECLAIM)) if (unlikely(gfp_mask & __GFP_DIRECT_RECLAIM))


@ -434,7 +434,7 @@ static struct bio *clone_bio(struct dm_target *ti, struct flakey_c *fc, struct b
remaining_size = size; remaining_size = size;
order = MAX_ORDER; order = MAX_PAGE_ORDER;
while (remaining_size) { while (remaining_size) {
struct page *pages; struct page *pages;
unsigned size_to_add, to_copy; unsigned size_to_add, to_copy;


@ -443,7 +443,7 @@ static int genwqe_mmap(struct file *filp, struct vm_area_struct *vma)
if (vsize == 0) if (vsize == 0)
return -EINVAL; return -EINVAL;
if (get_order(vsize) > MAX_ORDER) if (get_order(vsize) > MAX_PAGE_ORDER)
return -ENOMEM; return -ENOMEM;
dma_map = kzalloc(sizeof(struct dma_mapping), GFP_KERNEL); dma_map = kzalloc(sizeof(struct dma_mapping), GFP_KERNEL);


@ -210,7 +210,7 @@ u32 genwqe_crc32(u8 *buff, size_t len, u32 init)
void *__genwqe_alloc_consistent(struct genwqe_dev *cd, size_t size, void *__genwqe_alloc_consistent(struct genwqe_dev *cd, size_t size,
dma_addr_t *dma_handle) dma_addr_t *dma_handle)
{ {
if (get_order(size) > MAX_ORDER) if (get_order(size) > MAX_PAGE_ORDER)
return NULL; return NULL;
return dma_alloc_coherent(&cd->pci_dev->dev, size, dma_handle, return dma_alloc_coherent(&cd->pci_dev->dev, size, dma_handle,
@ -308,7 +308,7 @@ int genwqe_alloc_sync_sgl(struct genwqe_dev *cd, struct genwqe_sgl *sgl,
sgl->write = write; sgl->write = write;
sgl->sgl_size = genwqe_sgl_size(sgl->nr_pages); sgl->sgl_size = genwqe_sgl_size(sgl->nr_pages);
if (get_order(sgl->sgl_size) > MAX_ORDER) { if (get_order(sgl->sgl_size) > MAX_PAGE_ORDER) {
dev_err(&pci_dev->dev, dev_err(&pci_dev->dev,
"[%s] err: too much memory requested!\n", __func__); "[%s] err: too much memory requested!\n", __func__);
return ret; return ret;


@ -1041,7 +1041,7 @@ static void hns3_init_tx_spare_buffer(struct hns3_enet_ring *ring)
return; return;
order = get_order(alloc_size); order = get_order(alloc_size);
if (order > MAX_ORDER) { if (order > MAX_PAGE_ORDER) {
if (net_ratelimit()) if (net_ratelimit())
dev_warn(ring_to_dev(ring), "failed to allocate tx spare buffer, exceed to max order\n"); dev_warn(ring_to_dev(ring), "failed to allocate tx spare buffer, exceed to max order\n");
return; return;


@ -48,7 +48,7 @@
* of 4096 jumbo frames (MTU=9000) we will need about 9K*4K = 36MB plus * of 4096 jumbo frames (MTU=9000) we will need about 9K*4K = 36MB plus
* some padding. * some padding.
* *
* But the size of a single DMA region is limited by MAX_ORDER in the * But the size of a single DMA region is limited by MAX_PAGE_ORDER in the
* kernel (about 16MB currently). To support say 4K Jumbo frames, we * kernel (about 16MB currently). To support say 4K Jumbo frames, we
* use a set of LTBs (struct ltb_set) per pool. * use a set of LTBs (struct ltb_set) per pool.
* *
@ -75,7 +75,7 @@
* pool for the 4MB. Thus the 16 Rx and Tx queues require 32 * 5 = 160 * pool for the 4MB. Thus the 16 Rx and Tx queues require 32 * 5 = 160
* plus 16 for the TSO pools for a total of 176 LTB mappings per VNIC. * plus 16 for the TSO pools for a total of 176 LTB mappings per VNIC.
*/ */
#define IBMVNIC_ONE_LTB_MAX ((u32)((1 << MAX_ORDER) * PAGE_SIZE)) #define IBMVNIC_ONE_LTB_MAX ((u32)((1 << MAX_PAGE_ORDER) * PAGE_SIZE))
#define IBMVNIC_ONE_LTB_SIZE min((u32)(8 << 20), IBMVNIC_ONE_LTB_MAX) #define IBMVNIC_ONE_LTB_SIZE min((u32)(8 << 20), IBMVNIC_ONE_LTB_MAX)
#define IBMVNIC_LTB_SET_SIZE (38 << 20) #define IBMVNIC_LTB_SET_SIZE (38 << 20)


@ -927,8 +927,8 @@ static phys_addr_t hvfb_get_phymem(struct hv_device *hdev,
if (request_size == 0) if (request_size == 0)
return -1; return -1;
if (order <= MAX_ORDER) { if (order <= MAX_PAGE_ORDER) {
/* Call alloc_pages if the size is less than 2^MAX_ORDER */ /* Call alloc_pages if the size is less than 2^MAX_PAGE_ORDER */
page = alloc_pages(GFP_KERNEL | __GFP_ZERO, order); page = alloc_pages(GFP_KERNEL | __GFP_ZERO, order);
if (!page) if (!page)
return -1; return -1;
@ -958,7 +958,7 @@ static void hvfb_release_phymem(struct hv_device *hdev,
{ {
unsigned int order = get_order(size); unsigned int order = get_order(size);
if (order <= MAX_ORDER) if (order <= MAX_PAGE_ORDER)
__free_pages(pfn_to_page(paddr >> PAGE_SHIFT), order); __free_pages(pfn_to_page(paddr >> PAGE_SHIFT), order);
else else
dma_free_coherent(&hdev->device, dma_free_coherent(&hdev->device,


@ -197,7 +197,7 @@ static int vmlfb_alloc_vram(struct vml_info *vinfo,
va = &vinfo->vram[i]; va = &vinfo->vram[i];
order = 0; order = 0;
while (requested > (PAGE_SIZE << order) && order <= MAX_ORDER) while (requested > (PAGE_SIZE << order) && order <= MAX_PAGE_ORDER)
order++; order++;
err = vmlfb_alloc_vram_area(va, order, 0); err = vmlfb_alloc_vram_area(va, order, 0);


@ -33,7 +33,7 @@
#define VIRTIO_BALLOON_FREE_PAGE_ALLOC_FLAG (__GFP_NORETRY | __GFP_NOWARN | \ #define VIRTIO_BALLOON_FREE_PAGE_ALLOC_FLAG (__GFP_NORETRY | __GFP_NOWARN | \
__GFP_NOMEMALLOC) __GFP_NOMEMALLOC)
/* The order of free page blocks to report to host */ /* The order of free page blocks to report to host */
#define VIRTIO_BALLOON_HINT_BLOCK_ORDER MAX_ORDER #define VIRTIO_BALLOON_HINT_BLOCK_ORDER MAX_PAGE_ORDER
/* The size of a free page block in bytes */ /* The size of a free page block in bytes */
#define VIRTIO_BALLOON_HINT_BLOCK_BYTES \ #define VIRTIO_BALLOON_HINT_BLOCK_BYTES \
(1 << (VIRTIO_BALLOON_HINT_BLOCK_ORDER + PAGE_SHIFT)) (1 << (VIRTIO_BALLOON_HINT_BLOCK_ORDER + PAGE_SHIFT))


@ -1154,13 +1154,13 @@ static void virtio_mem_clear_fake_offline(unsigned long pfn,
*/ */
static void virtio_mem_fake_online(unsigned long pfn, unsigned long nr_pages) static void virtio_mem_fake_online(unsigned long pfn, unsigned long nr_pages)
{ {
unsigned long order = MAX_ORDER; unsigned long order = MAX_PAGE_ORDER;
unsigned long i; unsigned long i;
/* /*
* We might get called for ranges that don't cover properly aligned * We might get called for ranges that don't cover properly aligned
* MAX_ORDER pages; however, we can only online properly aligned * MAX_PAGE_ORDER pages; however, we can only online properly aligned
* pages with an order of MAX_ORDER at maximum. * pages with an order of MAX_PAGE_ORDER at maximum.
*/ */
while (!IS_ALIGNED(pfn | nr_pages, 1 << order)) while (!IS_ALIGNED(pfn | nr_pages, 1 << order))
order--; order--;
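Note on the alignment loop above: starting from the maximum order, the loop lowers `order` until both the start pfn and the page count are multiples of `1 << order`. A userspace sketch follows, assuming the maximum order is passed in as a parameter; `largest_aligned_order` is a hypothetical name, and the mask test stands in for the kernel's `IS_ALIGNED()`.

```c
#include <assert.h>
#include <limits.h>

/*
 * Hypothetical helper: largest order (at most max_order) such that
 * both pfn and nr_pages are aligned to 1 << order pages.
 */
static unsigned int largest_aligned_order(unsigned long pfn,
					  unsigned long nr_pages,
					  unsigned int max_order)
{
	unsigned int order = max_order;

	/* The shift below must stay within the width of unsigned long. */
	assert(max_order < sizeof(unsigned long) * CHAR_BIT);

	/*
	 * OR-ing the two values checks both alignments at once. The loop
	 * always terminates: at order 0 the mask is zero.
	 */
	while ((pfn | nr_pages) & ((1UL << order) - 1))
		order--;
	return order;
}
```

For instance, a range starting at pfn 1024 and spanning 1536 pages can only be onlined in order-9 chunks, because 1536 is a multiple of 512 but not of 1024.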
@ -1280,7 +1280,7 @@ static void virtio_mem_online_page(struct virtio_mem *vm,
bool do_online; bool do_online;
/* /*
* We can get called with any order up to MAX_ORDER. If our subblock * We can get called with any order up to MAX_PAGE_ORDER. If our subblock
* size is smaller than that and we have a mixture of plugged and * size is smaller than that and we have a mixture of plugged and
* unplugged subblocks within such a page, we have to process in * unplugged subblocks within such a page, we have to process in
* smaller granularity. In that case we'll adjust the order exactly once * smaller granularity. In that case we'll adjust the order exactly once


@ -258,7 +258,7 @@ config TMPFS_QUOTA
config ARCH_SUPPORTS_HUGETLBFS config ARCH_SUPPORTS_HUGETLBFS
def_bool n def_bool n
config HUGETLBFS menuconfig HUGETLBFS
bool "HugeTLB file system support" bool "HugeTLB file system support"
depends on X86 || SPARC64 || ARCH_SUPPORTS_HUGETLBFS || BROKEN depends on X86 || SPARC64 || ARCH_SUPPORTS_HUGETLBFS || BROKEN
depends on (SYSFS || SYSCTL) depends on (SYSFS || SYSCTL)
@ -270,6 +270,17 @@ config HUGETLBFS
If unsure, say N. If unsure, say N.
if HUGETLBFS
config HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON
bool "HugeTLB Vmemmap Optimization (HVO) defaults to on"
default n
depends on HUGETLB_PAGE_OPTIMIZE_VMEMMAP
help
The HugeTLB Vmemmap Optimization (HVO) defaults to off. Say Y here to
enable HVO by default. It can be disabled via hugetlb_free_vmemmap=off
(boot command line) or hugetlb_optimize_vmemmap (sysctl).
endif # HUGETLBFS
config HUGETLB_PAGE config HUGETLB_PAGE
def_bool HUGETLBFS def_bool HUGETLBFS
select XARRAY_MULTI select XARRAY_MULTI
@ -279,15 +290,6 @@ config HUGETLB_PAGE_OPTIMIZE_VMEMMAP
depends on ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP depends on ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP
depends on SPARSEMEM_VMEMMAP depends on SPARSEMEM_VMEMMAP
config HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON
bool "HugeTLB Vmemmap Optimization (HVO) defaults to on"
default n
depends on HUGETLB_PAGE_OPTIMIZE_VMEMMAP
help
The HugeTLB VmemmapvOptimization (HVO) defaults to off. Say Y here to
enable HVO by default. It can be disabled via hugetlb_free_vmemmap=off
(boot command line) or hugetlb_optimize_vmemmap (sysctl).
config ARCH_HAS_GIGANTIC_PAGE config ARCH_HAS_GIGANTIC_PAGE
bool bool


@ -5,6 +5,7 @@
* Copyright (C) 1997-1999 Russell King * Copyright (C) 1997-1999 Russell King
*/ */
#include <linux/buffer_head.h> #include <linux/buffer_head.h>
#include <linux/mpage.h>
#include <linux/writeback.h> #include <linux/writeback.h>
#include "adfs.h" #include "adfs.h"
@ -33,9 +34,10 @@ abort_toobig:
return 0; return 0;
} }
static int adfs_writepage(struct page *page, struct writeback_control *wbc) static int adfs_writepages(struct address_space *mapping,
struct writeback_control *wbc)
{ {
return block_write_full_page(page, adfs_get_block, wbc); return mpage_writepages(mapping, wbc, adfs_get_block);
} }
static int adfs_read_folio(struct file *file, struct folio *folio) static int adfs_read_folio(struct file *file, struct folio *folio)
@ -76,10 +78,11 @@ static const struct address_space_operations adfs_aops = {
.dirty_folio = block_dirty_folio, .dirty_folio = block_dirty_folio,
.invalidate_folio = block_invalidate_folio, .invalidate_folio = block_invalidate_folio,
.read_folio = adfs_read_folio, .read_folio = adfs_read_folio,
.writepage = adfs_writepage, .writepages = adfs_writepages,
.write_begin = adfs_write_begin, .write_begin = adfs_write_begin,
.write_end = generic_write_end, .write_end = generic_write_end,
.bmap = _adfs_bmap .migrate_folio = buffer_migrate_folio,
.bmap = _adfs_bmap,
}; };
/* /*


@ -242,7 +242,7 @@ static void afs_kill_pages(struct address_space *mapping,
folio_clear_uptodate(folio); folio_clear_uptodate(folio);
folio_end_writeback(folio); folio_end_writeback(folio);
folio_lock(folio); folio_lock(folio);
generic_error_remove_page(mapping, &folio->page); generic_error_remove_folio(mapping, folio);
folio_unlock(folio); folio_unlock(folio);
folio_put(folio); folio_put(folio);
@ -559,8 +559,7 @@ static void afs_extend_writeback(struct address_space *mapping,
if (!folio_clear_dirty_for_io(folio)) if (!folio_clear_dirty_for_io(folio))
BUG(); BUG();
if (folio_start_writeback(folio)) folio_start_writeback(folio);
BUG();
afs_folio_start_fscache(caching, folio); afs_folio_start_fscache(caching, folio);
*_count -= folio_nr_pages(folio); *_count -= folio_nr_pages(folio);
@ -595,8 +594,7 @@ static ssize_t afs_write_back_from_locked_folio(struct address_space *mapping,
_enter(",%lx,%llx-%llx", folio_index(folio), start, end); _enter(",%lx,%llx-%llx", folio_index(folio), start, end);
if (folio_start_writeback(folio)) folio_start_writeback(folio);
BUG();
afs_folio_start_fscache(caching, folio); afs_folio_start_fscache(caching, folio);
count -= folio_nr_pages(folio); count -= folio_nr_pages(folio);


@@ -1131,7 +1131,7 @@ static const struct address_space_operations bch_address_space_operations = {
#ifdef CONFIG_MIGRATION #ifdef CONFIG_MIGRATION
.migrate_folio = filemap_migrate_folio, .migrate_folio = filemap_migrate_folio,
#endif #endif
.error_remove_page = generic_error_remove_page, .error_remove_folio = generic_error_remove_folio,
}; };
struct bcachefs_fid { struct bcachefs_fid {


@@ -11,6 +11,7 @@
*/ */
#include <linux/fs.h> #include <linux/fs.h>
#include <linux/mpage.h>
#include <linux/buffer_head.h> #include <linux/buffer_head.h>
#include "bfs.h" #include "bfs.h"
@@ -150,9 +151,10 @@ out:
return err; return err;
} }
static int bfs_writepage(struct page *page, struct writeback_control *wbc) static int bfs_writepages(struct address_space *mapping,
struct writeback_control *wbc)
{ {
return block_write_full_page(page, bfs_get_block, wbc); return mpage_writepages(mapping, wbc, bfs_get_block);
} }
static int bfs_read_folio(struct file *file, struct folio *folio) static int bfs_read_folio(struct file *file, struct folio *folio)
@@ -190,9 +192,10 @@ const struct address_space_operations bfs_aops = {
.dirty_folio = block_dirty_folio, .dirty_folio = block_dirty_folio,
.invalidate_folio = block_invalidate_folio, .invalidate_folio = block_invalidate_folio,
.read_folio = bfs_read_folio, .read_folio = bfs_read_folio,
.writepage = bfs_writepage, .writepages = bfs_writepages,
.write_begin = bfs_write_begin, .write_begin = bfs_write_begin,
.write_end = generic_write_end, .write_end = generic_write_end,
.migrate_folio = buffer_migrate_folio,
.bmap = bfs_bmap, .bmap = bfs_bmap,
}; };


@@ -10930,7 +10930,7 @@ static const struct address_space_operations btrfs_aops = {
.release_folio = btrfs_release_folio, .release_folio = btrfs_release_folio,
.migrate_folio = btrfs_migrate_folio, .migrate_folio = btrfs_migrate_folio,
.dirty_folio = filemap_dirty_folio, .dirty_folio = filemap_dirty_folio,
.error_remove_page = generic_error_remove_page, .error_remove_folio = generic_error_remove_folio,
.swap_activate = btrfs_swap_activate, .swap_activate = btrfs_swap_activate,
.swap_deactivate = btrfs_swap_deactivate, .swap_deactivate = btrfs_swap_deactivate,
}; };


@@ -199,7 +199,7 @@ __find_get_block_slow(struct block_device *bdev, sector_t block)
int all_mapped = 1; int all_mapped = 1;
static DEFINE_RATELIMIT_STATE(last_warned, HZ, 1); static DEFINE_RATELIMIT_STATE(last_warned, HZ, 1);
index = block >> (PAGE_SHIFT - bd_inode->i_blkbits); index = ((loff_t)block << bd_inode->i_blkbits) / PAGE_SIZE;
folio = __filemap_get_folio(bd_mapping, index, FGP_ACCESSED, 0); folio = __filemap_get_folio(bd_mapping, index, FGP_ACCESSED, 0);
if (IS_ERR(folio)) if (IS_ERR(folio))
goto out; goto out;
@@ -372,10 +372,10 @@ static void end_buffer_async_read_io(struct buffer_head *bh, int uptodate)
} }
/* /*
* Completion handler for block_write_full_page() - pages which are unlocked * Completion handler for block_write_full_folio() - folios which are unlocked
* during I/O, and which have PageWriteback cleared upon I/O completion. * during I/O, and which have the writeback flag cleared upon I/O completion.
*/ */
void end_buffer_async_write(struct buffer_head *bh, int uptodate) static void end_buffer_async_write(struct buffer_head *bh, int uptodate)
{ {
unsigned long flags; unsigned long flags;
struct buffer_head *first; struct buffer_head *first;
@@ -415,7 +415,6 @@ still_busy:
spin_unlock_irqrestore(&first->b_uptodate_lock, flags); spin_unlock_irqrestore(&first->b_uptodate_lock, flags);
return; return;
} }
EXPORT_SYMBOL(end_buffer_async_write);
/* /*
* If a page's buffers are under async readin (end_buffer_async_read * If a page's buffers are under async readin (end_buffer_async_read
@@ -995,11 +994,12 @@ static sector_t blkdev_max_block(struct block_device *bdev, unsigned int size)
* Initialise the state of a blockdev folio's buffers. * Initialise the state of a blockdev folio's buffers.
*/ */
static sector_t folio_init_buffers(struct folio *folio, static sector_t folio_init_buffers(struct folio *folio,
struct block_device *bdev, sector_t block, int size) struct block_device *bdev, unsigned size)
{ {
struct buffer_head *head = folio_buffers(folio); struct buffer_head *head = folio_buffers(folio);
struct buffer_head *bh = head; struct buffer_head *bh = head;
bool uptodate = folio_test_uptodate(folio); bool uptodate = folio_test_uptodate(folio);
sector_t block = div_u64(folio_pos(folio), size);
sector_t end_block = blkdev_max_block(bdev, size); sector_t end_block = blkdev_max_block(bdev, size);
do { do {
@@ -1024,40 +1024,49 @@ static sector_t folio_init_buffers(struct folio *folio,
} }
/* /*
* Create the page-cache page that contains the requested block. * Create the page-cache folio that contains the requested block.
* *
* This is used purely for blockdev mappings. * This is used purely for blockdev mappings.
*
* Returns false if we have a failure which cannot be cured by retrying
* without sleeping. Returns true if we succeeded, or the caller should retry.
*/ */
static int static bool grow_dev_folio(struct block_device *bdev, sector_t block,
grow_dev_page(struct block_device *bdev, sector_t block, pgoff_t index, unsigned size, gfp_t gfp)
pgoff_t index, int size, int sizebits, gfp_t gfp)
{ {
struct inode *inode = bdev->bd_inode; struct inode *inode = bdev->bd_inode;
struct folio *folio; struct folio *folio;
struct buffer_head *bh; struct buffer_head *bh;
sector_t end_block; sector_t end_block = 0;
int ret = 0;
folio = __filemap_get_folio(inode->i_mapping, index, folio = __filemap_get_folio(inode->i_mapping, index,
FGP_LOCK | FGP_ACCESSED | FGP_CREAT, gfp); FGP_LOCK | FGP_ACCESSED | FGP_CREAT, gfp);
if (IS_ERR(folio)) if (IS_ERR(folio))
return PTR_ERR(folio); return false;
bh = folio_buffers(folio); bh = folio_buffers(folio);
if (bh) { if (bh) {
if (bh->b_size == size) { if (bh->b_size == size) {
end_block = folio_init_buffers(folio, bdev, end_block = folio_init_buffers(folio, bdev, size);
(sector_t)index << sizebits, size); goto unlock;
goto done; }
/*
* Retrying may succeed; for example the folio may finish
* writeback, or buffers may be cleaned. This should not
* happen very often; maybe we have old buffers attached to
* this blockdev's page cache and we're trying to change
* the block size?
*/
if (!try_to_free_buffers(folio)) {
end_block = ~0ULL;
goto unlock;
} }
if (!try_to_free_buffers(folio))
goto failed;
} }
ret = -ENOMEM;
bh = folio_alloc_buffers(folio, size, gfp | __GFP_ACCOUNT); bh = folio_alloc_buffers(folio, size, gfp | __GFP_ACCOUNT);
if (!bh) if (!bh)
goto failed; goto unlock;
/* /*
* Link the folio to the buffers and initialise them. Take the * Link the folio to the buffers and initialise them. Take the
@@ -1066,44 +1075,37 @@ grow_dev_page(struct block_device *bdev, sector_t block,
*/ */
spin_lock(&inode->i_mapping->i_private_lock); spin_lock(&inode->i_mapping->i_private_lock);
link_dev_buffers(folio, bh); link_dev_buffers(folio, bh);
end_block = folio_init_buffers(folio, bdev, end_block = folio_init_buffers(folio, bdev, size);
(sector_t)index << sizebits, size);
spin_unlock(&inode->i_mapping->i_private_lock); spin_unlock(&inode->i_mapping->i_private_lock);
done: unlock:
ret = (block < end_block) ? 1 : -ENXIO;
failed:
folio_unlock(folio); folio_unlock(folio);
folio_put(folio); folio_put(folio);
return ret; return block < end_block;
} }
/* /*
* Create buffers for the specified block device block's page. If * Create buffers for the specified block device block's folio. If
* that page was dirty, the buffers are set dirty also. * that folio was dirty, the buffers are set dirty also. Returns false
* if we've hit a permanent error.
*/ */
static int static bool grow_buffers(struct block_device *bdev, sector_t block,
grow_buffers(struct block_device *bdev, sector_t block, int size, gfp_t gfp) unsigned size, gfp_t gfp)
{ {
pgoff_t index; loff_t pos;
int sizebits;
sizebits = PAGE_SHIFT - __ffs(size);
index = block >> sizebits;
/* /*
* Check for a block which wants to lie outside our maximum possible * Check for a block which lies outside our maximum possible
* pagecache index. (this comparison is done using sector_t types). * pagecache index.
*/ */
if (unlikely(index != block >> sizebits)) { if (check_mul_overflow(block, (sector_t)size, &pos) || pos > MAX_LFS_FILESIZE) {
printk(KERN_ERR "%s: requested out-of-range block %llu for " printk(KERN_ERR "%s: requested out-of-range block %llu for device %pg\n",
"device %pg\n",
__func__, (unsigned long long)block, __func__, (unsigned long long)block,
bdev); bdev);
return -EIO; return false;
} }
/* Create a page with the proper size buffers.. */ /* Create a folio with the proper size buffers */
return grow_dev_page(bdev, block, index, size, sizebits, gfp); return grow_dev_folio(bdev, block, pos / PAGE_SIZE, size, gfp);
} }
static struct buffer_head * static struct buffer_head *
@@ -1124,14 +1126,12 @@ __getblk_slow(struct block_device *bdev, sector_t block,
for (;;) { for (;;) {
struct buffer_head *bh; struct buffer_head *bh;
int ret;
bh = __find_get_block(bdev, block, size); bh = __find_get_block(bdev, block, size);
if (bh) if (bh)
return bh; return bh;
ret = grow_buffers(bdev, block, size, gfp); if (!grow_buffers(bdev, block, size, gfp))
if (ret < 0)
return NULL; return NULL;
} }
} }
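
Note on the range check in grow_buffers() above: the new code multiplies the block number by the block size with check_mul_overflow() and rejects the request if the byte position overflows or exceeds MAX_LFS_FILESIZE, then derives the page-cache index from that position. The userspace sketch below mirrors that logic under stated assumptions: it uses the GCC/Clang builtin `__builtin_mul_overflow` in place of the kernel macro, a fixed 4096-byte page, and `INT64_MAX` as a stand-in for MAX_LFS_FILESIZE. `block_to_index()` and the `SKETCH_*` names are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumptions for this sketch only; the kernel uses PAGE_SIZE and MAX_LFS_FILESIZE. */
#define SKETCH_PAGE_SIZE    4096
#define SKETCH_MAX_FILESIZE INT64_MAX

/*
 * Convert a block number to a page-cache index via the byte position.
 * Returns false when block * size does not fit in a signed 64-bit
 * position or exceeds the file size limit, mirroring the "return false"
 * path in the new grow_buffers().
 */
static bool block_to_index(uint64_t block, unsigned size, uint64_t *index)
{
	int64_t pos;

	if (__builtin_mul_overflow(block, (uint64_t)size, &pos) ||
	    pos > SKETCH_MAX_FILESIZE)
		return false;

	*index = (uint64_t)pos / SKETCH_PAGE_SIZE;
	return true;
}

/* Small self-check showing one in-range and one out-of-range request. */
static void block_to_index_self_check(void)
{
	uint64_t idx = 0;

	/* Block 10 with 1 KiB blocks starts at byte 10240, inside page 2. */
	assert(block_to_index(10, 1024, &idx) && idx == 2);
	/* A huge block number overflows the byte position and is rejected. */
	assert(!block_to_index(UINT64_MAX, 512, &idx));
}
```

Compared with the removed `index != block >> sizebits` comparison, this form checks the byte position directly, so it does not depend on the block size being at most one page.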
@@ -1699,13 +1699,13 @@ void clean_bdev_aliases(struct block_device *bdev, sector_t block, sector_t len)
struct inode *bd_inode = bdev->bd_inode; struct inode *bd_inode = bdev->bd_inode;
struct address_space *bd_mapping = bd_inode->i_mapping; struct address_space *bd_mapping = bd_inode->i_mapping;
struct folio_batch fbatch; struct folio_batch fbatch;
pgoff_t index = block >> (PAGE_SHIFT - bd_inode->i_blkbits); pgoff_t index = ((loff_t)block << bd_inode->i_blkbits) / PAGE_SIZE;
pgoff_t end; pgoff_t end;
int i, count; int i, count;
struct buffer_head *bh; struct buffer_head *bh;
struct buffer_head *head; struct buffer_head *head;
end = (block + len - 1) >> (PAGE_SHIFT - bd_inode->i_blkbits); end = ((loff_t)(block + len - 1) << bd_inode->i_blkbits) / PAGE_SIZE;
folio_batch_init(&fbatch); folio_batch_init(&fbatch);
while (filemap_get_folios(bd_mapping, &index, end, &fbatch)) { while (filemap_get_folios(bd_mapping, &index, end, &fbatch)) {
count = folio_batch_count(&fbatch); count = folio_batch_count(&fbatch);
@@ -1748,19 +1748,6 @@ unlock_page:
} }
EXPORT_SYMBOL(clean_bdev_aliases); EXPORT_SYMBOL(clean_bdev_aliases);
/*
* Size is a power-of-two in the range 512..PAGE_SIZE,
* and the case we care about most is PAGE_SIZE.
*
* So this *could* possibly be written with those
* constraints in mind (relevant mostly if some
* architecture has a slow bit-scan instruction)
*/
static inline int block_size_bits(unsigned int blocksize)
{
return ilog2(blocksize);
}
static struct buffer_head *folio_create_buffers(struct folio *folio, static struct buffer_head *folio_create_buffers(struct folio *folio,
struct inode *inode, struct inode *inode,
unsigned int b_state) unsigned int b_state)
@@ -1790,30 +1777,29 @@ static struct buffer_head *folio_create_buffers(struct folio *folio,
*/ */
/* /*
* While block_write_full_page is writing back the dirty buffers under * While block_write_full_folio is writing back the dirty buffers under
* the page lock, whoever dirtied the buffers may decide to clean them * the page lock, whoever dirtied the buffers may decide to clean them
* again at any time. We handle that by only looking at the buffer * again at any time. We handle that by only looking at the buffer
* state inside lock_buffer(). * state inside lock_buffer().
* *
* If block_write_full_page() is called for regular writeback * If block_write_full_folio() is called for regular writeback
* (wbc->sync_mode == WB_SYNC_NONE) then it will redirty a page which has a * (wbc->sync_mode == WB_SYNC_NONE) then it will redirty a page which has a
* locked buffer. This only can happen if someone has written the buffer * locked buffer. This only can happen if someone has written the buffer
* directly, with submit_bh(). At the address_space level PageWriteback * directly, with submit_bh(). At the address_space level PageWriteback
* prevents this contention from occurring. * prevents this contention from occurring.
* *
* If block_write_full_page() is called with wbc->sync_mode == * If block_write_full_folio() is called with wbc->sync_mode ==
* WB_SYNC_ALL, the writes are posted using REQ_SYNC; this * WB_SYNC_ALL, the writes are posted using REQ_SYNC; this
* causes the writes to be flagged as synchronous writes. * causes the writes to be flagged as synchronous writes.
*/ */
int __block_write_full_folio(struct inode *inode, struct folio *folio, int __block_write_full_folio(struct inode *inode, struct folio *folio,
get_block_t *get_block, struct writeback_control *wbc, get_block_t *get_block, struct writeback_control *wbc)
bh_end_io_t *handler)
{ {
int err; int err;
sector_t block; sector_t block;
sector_t last_block; sector_t last_block;
struct buffer_head *bh, *head; struct buffer_head *bh, *head;
unsigned int blocksize, bbits; size_t blocksize;
int nr_underway = 0; int nr_underway = 0;
blk_opf_t write_flags = wbc_to_write_flags(wbc); blk_opf_t write_flags = wbc_to_write_flags(wbc);
@@ -1832,10 +1818,9 @@ int __block_write_full_folio(struct inode *inode, struct folio *folio,
bh = head; bh = head;
blocksize = bh->b_size; blocksize = bh->b_size;
bbits = block_size_bits(blocksize);
block = (sector_t)folio->index << (PAGE_SHIFT - bbits); block = div_u64(folio_pos(folio), blocksize);
last_block = (i_size_read(inode) - 1) >> bbits; last_block = div_u64(i_size_read(inode) - 1, blocksize);
/* /*
* Get all the dirty buffers mapped to disk addresses and * Get all the dirty buffers mapped to disk addresses and
@@ -1849,7 +1834,7 @@ int __block_write_full_folio(struct inode *inode, struct folio *folio,
* truncate in progress. * truncate in progress.
*/ */
/* /*
* The buffer was zeroed by block_write_full_page() * The buffer was zeroed by block_write_full_folio()
*/ */
clear_buffer_dirty(bh); clear_buffer_dirty(bh);
set_buffer_uptodate(bh); set_buffer_uptodate(bh);
@@ -1887,7 +1872,8 @@ int __block_write_full_folio(struct inode *inode, struct folio *folio,
continue; continue;
} }
if (test_clear_buffer_dirty(bh)) { if (test_clear_buffer_dirty(bh)) {
mark_buffer_async_write_endio(bh, handler); mark_buffer_async_write_endio(bh,
end_buffer_async_write);
} else { } else {
unlock_buffer(bh); unlock_buffer(bh);
} }
@@ -1940,7 +1926,8 @@ recover:
if (buffer_mapped(bh) && buffer_dirty(bh) && if (buffer_mapped(bh) && buffer_dirty(bh) &&
!buffer_delay(bh)) { !buffer_delay(bh)) {
lock_buffer(bh); lock_buffer(bh);
mark_buffer_async_write_endio(bh, handler); mark_buffer_async_write_endio(bh,
end_buffer_async_write);
} else { } else {
/* /*
* The buffer may have been set dirty during * The buffer may have been set dirty during
@@ -2014,7 +2001,7 @@ static int
iomap_to_bh(struct inode *inode, sector_t block, struct buffer_head *bh, iomap_to_bh(struct inode *inode, sector_t block, struct buffer_head *bh,
const struct iomap *iomap) const struct iomap *iomap)
{ {
loff_t offset = block << inode->i_blkbits; loff_t offset = (loff_t)block << inode->i_blkbits;
bh->b_bdev = iomap->bdev; bh->b_bdev = iomap->bdev;
@@ -2081,27 +2068,24 @@ iomap_to_bh(struct inode *inode, sector_t block, struct buffer_head *bh,
int __block_write_begin_int(struct folio *folio, loff_t pos, unsigned len, int __block_write_begin_int(struct folio *folio, loff_t pos, unsigned len,
get_block_t *get_block, const struct iomap *iomap) get_block_t *get_block, const struct iomap *iomap)
{ {
unsigned from = pos & (PAGE_SIZE - 1); size_t from = offset_in_folio(folio, pos);
unsigned to = from + len; size_t to = from + len;
struct inode *inode = folio->mapping->host; struct inode *inode = folio->mapping->host;
unsigned block_start, block_end; size_t block_start, block_end;
sector_t block; sector_t block;
int err = 0; int err = 0;
unsigned blocksize, bbits; size_t blocksize;
struct buffer_head *bh, *head, *wait[2], **wait_bh=wait; struct buffer_head *bh, *head, *wait[2], **wait_bh=wait;
BUG_ON(!folio_test_locked(folio)); BUG_ON(!folio_test_locked(folio));
BUG_ON(from > PAGE_SIZE); BUG_ON(to > folio_size(folio));
BUG_ON(to > PAGE_SIZE);
BUG_ON(from > to); BUG_ON(from > to);
head = folio_create_buffers(folio, inode, 0); head = folio_create_buffers(folio, inode, 0);
blocksize = head->b_size; blocksize = head->b_size;
bbits = block_size_bits(blocksize); block = div_u64(folio_pos(folio), blocksize);
block = (sector_t)folio->index << (PAGE_SHIFT - bbits); for (bh = head, block_start = 0; bh != head || !block_start;
for(bh = head, block_start = 0; bh != head || !block_start;
block++, block_start=block_end, bh = bh->b_this_page) { block++, block_start=block_end, bh = bh->b_this_page) {
block_end = block_start + blocksize; block_end = block_start + blocksize;
if (block_end <= from || block_start >= to) { if (block_end <= from || block_start >= to) {
@@ -2364,7 +2348,7 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
struct inode *inode = folio->mapping->host; struct inode *inode = folio->mapping->host;
sector_t iblock, lblock; sector_t iblock, lblock;
struct buffer_head *bh, *head, *arr[MAX_BUF_PER_PAGE]; struct buffer_head *bh, *head, *arr[MAX_BUF_PER_PAGE];
unsigned int blocksize, bbits; size_t blocksize;
int nr, i; int nr, i;
int fully_mapped = 1; int fully_mapped = 1;
bool page_error = false; bool page_error = false;
@@ -2378,10 +2362,9 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
head = folio_create_buffers(folio, inode, 0); head = folio_create_buffers(folio, inode, 0);
blocksize = head->b_size; blocksize = head->b_size;
bbits = block_size_bits(blocksize);
iblock = (sector_t)folio->index << (PAGE_SHIFT - bbits); iblock = div_u64(folio_pos(folio), blocksize);
lblock = (limit+blocksize-1) >> bbits; lblock = div_u64(limit + blocksize - 1, blocksize);
bh = head; bh = head;
nr = 0; nr = 0;
i = 0; i = 0;
@@ -2666,8 +2649,8 @@ int block_truncate_page(struct address_space *mapping,
return 0; return 0;
length = blocksize - length; length = blocksize - length;
iblock = (sector_t)index << (PAGE_SHIFT - inode->i_blkbits); iblock = ((loff_t)index * PAGE_SIZE) >> inode->i_blkbits;
folio = filemap_grab_folio(mapping, index); folio = filemap_grab_folio(mapping, index);
if (IS_ERR(folio)) if (IS_ERR(folio))
return PTR_ERR(folio); return PTR_ERR(folio);
@@ -2720,17 +2703,15 @@ EXPORT_SYMBOL(block_truncate_page);
/* /*
* The generic ->writepage function for buffer-backed address_spaces * The generic ->writepage function for buffer-backed address_spaces
*/ */
int block_write_full_page(struct page *page, get_block_t *get_block, int block_write_full_folio(struct folio *folio, struct writeback_control *wbc,
struct writeback_control *wbc) void *get_block)
{ {
struct folio *folio = page_folio(page);
struct inode * const inode = folio->mapping->host; struct inode * const inode = folio->mapping->host;
loff_t i_size = i_size_read(inode); loff_t i_size = i_size_read(inode);
/* Is the folio fully inside i_size? */ /* Is the folio fully inside i_size? */
if (folio_pos(folio) + folio_size(folio) <= i_size) if (folio_pos(folio) + folio_size(folio) <= i_size)
return __block_write_full_folio(inode, folio, get_block, wbc, return __block_write_full_folio(inode, folio, get_block, wbc);
end_buffer_async_write);
/* Is the folio fully outside i_size? (truncate in progress) */ /* Is the folio fully outside i_size? (truncate in progress) */
if (folio_pos(folio) >= i_size) { if (folio_pos(folio) >= i_size) {
@@ -2747,10 +2728,8 @@ int block_write_full_page(struct page *page, get_block_t *get_block,
*/ */
folio_zero_segment(folio, offset_in_folio(folio, i_size), folio_zero_segment(folio, offset_in_folio(folio, i_size),
folio_size(folio)); folio_size(folio));
return __block_write_full_folio(inode, folio, get_block, wbc, return __block_write_full_folio(inode, folio, get_block, wbc);
end_buffer_async_write);
} }
EXPORT_SYMBOL(block_write_full_page);
sector_t generic_block_bmap(struct address_space *mapping, sector_t block, sector_t generic_block_bmap(struct address_space *mapping, sector_t block,
get_block_t *get_block) get_block_t *get_block)
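
Note on the index arithmetic in the hunks above: several places replace `block >> (PAGE_SHIFT - blkbits)` with a conversion through the byte position, `((loff_t)block << blkbits) / PAGE_SIZE`. The diff does not state the motivation. One reading, which is an assumption here, is that the byte-position form stays well-defined when the block size is larger than the page size, where the old shift count would go negative. The userspace sketch below compares the two forms for a 4096-byte page. `SKETCH_PAGE_SHIFT`, `index_by_shift()` and `index_by_pos()` are hypothetical names, not kernel symbols.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed page geometry for this sketch; the kernel uses PAGE_SHIFT/PAGE_SIZE. */
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  (1ULL << SKETCH_PAGE_SHIFT)

/* Old form: only valid while blkbits <= SKETCH_PAGE_SHIFT. */
static uint64_t index_by_shift(uint64_t block, unsigned blkbits)
{
	return block >> (SKETCH_PAGE_SHIFT - blkbits);
}

/* New form: go through the byte position, then divide by the page size. */
static uint64_t index_by_pos(uint64_t block, unsigned blkbits)
{
	return (block << blkbits) / SKETCH_PAGE_SIZE;
}

/* Self-check: both forms agree for small blocks; only the new one handles large blocks. */
static void index_self_check(void)
{
	/* 1 KiB blocks: block 9 starts at byte 9216, which lies in page 2. */
	assert(index_by_shift(9, 10) == 2);
	assert(index_by_pos(9, 10) == 2);

	/*
	 * 16 KiB blocks: block 3 starts at byte 49152, which is page 12.
	 * index_by_shift() is not called here because its shift count
	 * would be negative.
	 */
	assert(index_by_pos(3, 14) == 12);
}
```

The sketch ignores overflow of `block << blkbits`; the kernel code casts to loff_t first, and grow_buffers() checks the multiplication explicitly.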


@@ -907,8 +907,8 @@ static void writepages_finish(struct ceph_osd_request *req)
doutc(cl, "unlocking %p\n", page); doutc(cl, "unlocking %p\n", page);
if (remove_page) if (remove_page)
generic_error_remove_page(inode->i_mapping, generic_error_remove_folio(inode->i_mapping,
page); page_folio(page));
unlock_page(page); unlock_page(page);
} }


@@ -428,7 +428,8 @@ static void d_lru_add(struct dentry *dentry)
this_cpu_inc(nr_dentry_unused); this_cpu_inc(nr_dentry_unused);
if (d_is_negative(dentry)) if (d_is_negative(dentry))
this_cpu_inc(nr_dentry_negative); this_cpu_inc(nr_dentry_negative);
WARN_ON_ONCE(!list_lru_add(&dentry->d_sb->s_dentry_lru, &dentry->d_lru)); WARN_ON_ONCE(!list_lru_add_obj(
&dentry->d_sb->s_dentry_lru, &dentry->d_lru));
} }
static void d_lru_del(struct dentry *dentry) static void d_lru_del(struct dentry *dentry)
@@ -438,7 +439,8 @@ static void d_lru_del(struct dentry *dentry)
this_cpu_dec(nr_dentry_unused); this_cpu_dec(nr_dentry_unused);
if (d_is_negative(dentry)) if (d_is_negative(dentry))
this_cpu_dec(nr_dentry_negative); this_cpu_dec(nr_dentry_negative);
WARN_ON_ONCE(!list_lru_del(&dentry->d_sb->s_dentry_lru, &dentry->d_lru)); WARN_ON_ONCE(!list_lru_del_obj(
&dentry->d_sb->s_dentry_lru, &dentry->d_lru));
} }
static void d_shrink_del(struct dentry *dentry) static void d_shrink_del(struct dentry *dentry)
@@ -1240,7 +1242,7 @@ static enum lru_status dentry_lru_isolate(struct list_head *item,
* *
* This is guaranteed by the fact that all LRU management * This is guaranteed by the fact that all LRU management
* functions are intermediated by the LRU API calls like * functions are intermediated by the LRU API calls like
* list_lru_add and list_lru_del. List movement in this file * list_lru_add_obj and list_lru_del_obj. List movement in this file
* only ever occur through this functions or through callbacks * only ever occur through this functions or through callbacks
* like this one, that are called from the LRU API. * like this one, that are called from the LRU API.
* *


@@ -969,7 +969,7 @@ const struct address_space_operations ext2_aops = {
.writepages = ext2_writepages, .writepages = ext2_writepages,
.migrate_folio = buffer_migrate_folio, .migrate_folio = buffer_migrate_folio,
.is_partially_uptodate = block_is_partially_uptodate, .is_partially_uptodate = block_is_partially_uptodate,
.error_remove_page = generic_error_remove_page, .error_remove_folio = generic_error_remove_folio,
}; };
static const struct address_space_operations ext2_dax_aops = { static const struct address_space_operations ext2_dax_aops = {

Some files were not shown because too many files have changed in this diff.