Merge tag 'mm-stable-2024-01-08-15-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:
 "Many singleton patches against the MM code. The patch series which
  are included in this merge do the following:

  - Peng Zhang has done some mapletree maintenance work in the series
    'maple_tree: add mt_free_one() and mt_attr() helpers' and
    'Some cleanups of maple tree'.

  - In the series 'mm: use memmap_on_memory semantics for dax/kmem',
    Vishal Verma has altered the interworking between memory-hotplug
    and dax/kmem so that newly added 'device memory' can more easily
    have its memmap placed within that newly added memory.

  - Matthew Wilcox continues folio-related work (including a few fixes)
    in the patch series 'Add folio_zero_tail() and folio_fill_tail()',
    'Make folio_start_writeback return void', 'Fix fault handler's
    handling of poisoned tail pages', 'Convert aops->error_remove_page
    to ->error_remove_folio', 'Finish two folio conversions' and
    'More swap folio conversions'.

  - Kefeng Wang has also contributed folio-related work in the series
    'mm: cleanup and use more folio in page fault'.

  - Jim Cromie has improved the kmemleak reporting output in the series
    'tweak kmemleak report format'.

  - In the series 'stackdepot: allow evicting stack traces', Andrey
    Konovalov permits clients (in this case KASAN) to cause eviction of
    no longer needed stack traces.

  - Charan Teja Kalla has fixed some accounting issues in the page
    allocator's atomic reserve calculations in the series 'mm:
    page_alloc: fixes for high atomic reserve calculations'.

  - Dmitry Rokosov has added to the samples/ directory some sample code
    for a userspace memcg event listener application. See the series
    'samples: introduce cgroup events listeners'.

  - Some mapletree maintenance work from Liam Howlett in the series
    'maple_tree: iterator state changes'.

  - Nhat Pham has improved zswap's approach to writeback in the series
    'workload-specific and memory pressure-driven zswap writeback'.

  - DAMON/DAMOS feature and maintenance work from SeongJae Park in the
    series 'mm/damon: let users feed and tame/auto-tune DAMOS',
    'selftests/damon: add Python-written DAMON functionality tests' and
    'mm/damon: misc updates for 6.8'.

  - Yosry Ahmed has improved memcg's stats flushing in the series
    'mm: memcg: subtree stats flushing and thresholds'.

  - In the series 'Multi-size THP for anonymous memory', Ryan Roberts
    has added a runtime opt-in feature to transparent hugepages which
    improves performance by allocating larger chunks of memory during
    anonymous page faults.

  - Matthew Wilcox has also contributed some cleanup and maintenance
    work against the buffer_head code in the series 'More buffer_head
    cleanups'.

  - Suren Baghdasaryan has done work on Andrea Arcangeli's series
    'userfaultfd move option'. UFFDIO_MOVE permits userspace heap
    compaction algorithms to move userspace's pages around rather than
    UFFDIO_COPY's alloc/copy/free.

  - Stefan Roesch has developed a 'KSM Advisor', in the series
    'mm/ksm: Add ksm advisor'. This is a governor which tunes KSM's
    scanning aggressiveness in response to userspace's current needs.

  - Chengming Zhou has optimized zswap's temporary working memory use
    in the series 'mm/zswap: dstmem reuse optimizations and cleanups'.

  - Matthew Wilcox has performed some maintenance work on the writeback
    code, both in the core and within filesystems, in the series
    'Clean up the writeback paths'.

  - Andrey Konovalov has optimized KASAN's handling of alloc and free
    stack traces for secondary-level allocators in the series
    'kasan: save mempool stack traces'.

  - Andrey also performed some KASAN maintenance work in the series
    'kasan: assorted clean-ups'.

  - David Hildenbrand has gone to town on the rmap code. Cleanups, more
    pte batching, folio conversions and more. See the series
    'mm/rmap: interface overhaul'.

  - Kinsey Ho has contributed some maintenance work on the MGLRU code
    in the series 'mm/mglru: Kconfig cleanup'.

  - Matthew Wilcox has contributed lruvec page accounting code cleanups
    in the series 'Remove some lruvec page accounting functions'"

* tag 'mm-stable-2024-01-08-15-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (361 commits)
  mm, treewide: rename MAX_ORDER to MAX_PAGE_ORDER
  mm, treewide: introduce NR_PAGE_ORDERS
  selftests/mm: add separate UFFDIO_MOVE test for PMD splitting
  selftests/mm: skip test if application doesn't has root privileges
  selftests/mm: conform test to TAP format output
  selftests: mm: hugepage-mmap: conform to TAP format output
  selftests/mm: gup_test: conform test to TAP format output
  mm/selftests: hugepage-mremap: conform test to TAP format output
  mm/vmstat: move pgdemote_* out of CONFIG_NUMA_BALANCING
  mm: zsmalloc: return -ENOSPC rather than -EINVAL in zs_malloc while size is too large
  mm/memcontrol: remove __mod_lruvec_page_state()
  mm/khugepaged: use a folio more in collapse_file()
  slub: use a folio in __kmalloc_large_node
  slub: use folio APIs in free_large_kmalloc()
  slub: use alloc_pages_node() in alloc_slab_page()
  mm: remove inc/dec lruvec page state functions
  mm: ratelimit stat flush from workingset shrinker
  kasan: stop leaking stack trace handles
  mm/mglru: remove CONFIG_TRANSPARENT_HUGEPAGE
  mm/mglru: add dummy pmd_dirty()
  ...
commit fb46e22a9e
@@ -25,12 +25,14 @@ Description: Writing 'on' or 'off' to this file makes the kdamond starts or
 stops, respectively. Reading the file returns the keywords
 based on the current status. Writing 'commit' to this file
 makes the kdamond reads the user inputs in the sysfs files
-except 'state' again. Writing 'update_schemes_stats' to the
-file updates contents of schemes stats files of the kdamond.
-Writing 'update_schemes_tried_regions' to the file updates
-contents of 'tried_regions' directory of every scheme directory
-of this kdamond. Writing 'update_schemes_tried_bytes' to the
-file updates only '.../tried_regions/total_bytes' files of this
+except 'state' again. Writing 'commit_schemes_quota_goals' to
+this file makes the kdamond reads the quota goal files again.
+Writing 'update_schemes_stats' to the file updates contents of
+schemes stats files of the kdamond. Writing
+'update_schemes_tried_regions' to the file updates contents of
+'tried_regions' directory of every scheme directory of this
+kdamond. Writing 'update_schemes_tried_bytes' to the file
+updates only '.../tried_regions/total_bytes' files of this
 kdamond. Writing 'clear_schemes_tried_regions' to the file
 removes contents of the 'tried_regions' directory.
 
@@ -212,6 +214,25 @@ Contact: SeongJae Park <sj@kernel.org>
 Description: Writing to and reading from this file sets and gets the quotas
 charge reset interval of the scheme in milliseconds.
 
+What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/quotas/goals/nr_goals
+Date: Nov 2023
+Contact: SeongJae Park <sj@kernel.org>
+Description: Writing a number 'N' to this file creates the number of
+directories for setting automatic tuning of the scheme's
+aggressiveness named '0' to 'N-1' under the goals/ directory.
+
+What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/quotas/goals/<G>/target_value
+Date: Nov 2023
+Contact: SeongJae Park <sj@kernel.org>
+Description: Writing to and reading from this file sets and gets the target
+value of the goal metric.
+
+What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/quotas/goals/<G>/current_value
+Date: Nov 2023
+Contact: SeongJae Park <sj@kernel.org>
+Description: Writing to and reading from this file sets and gets the current
+value of the goal metric.
+
 What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/quotas/weights/sz_permil
 Date: Mar 2022
 Contact: SeongJae Park <sj@kernel.org>
@@ -328,7 +328,7 @@ as idle::
 From now on, any pages on zram are idle pages. The idle mark
 will be removed until someone requests access of the block.
 IOW, unless there is access request, those pages are still idle pages.
-Additionally, when CONFIG_ZRAM_MEMORY_TRACKING is enabled pages can be
+Additionally, when CONFIG_ZRAM_TRACK_ENTRY_ACTIME is enabled pages can be
 marked as idle based on how long (in seconds) it's been since they were
 last accessed::
 
@@ -1693,6 +1693,21 @@ PAGE_SIZE multiple when read back.
 limit, it will refuse to take any more stores before existing
 entries fault back in or are written out to disk.
 
+memory.zswap.writeback
+A read-write single value file. The default value is "1". The
+initial value of the root cgroup is 1, and when a new cgroup is
+created, it inherits the current value of its parent.
+
+When this is set to 0, all swapping attempts to swapping devices
+are disabled. This included both zswap writebacks, and swapping due
+to zswap store failures. If the zswap store failures are recurring
+(for e.g if the pages are incompressible), users can observe
+reclaim inefficiency after disabling writeback (because the same
+pages might be rejected again and again).
+
+Note that this is subtly different from setting memory.swap.max to
+0, as it still allows for pages to be written to the zswap pool.
+
 memory.pressure
 A read-only nested-keyed file.
 
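As a quick illustration of the knob documented in the hunk above (not taken from the patch itself; the cgroup name ``workload`` is hypothetical), per-cgroup zswap writeback can be inspected and disabled from a shell::

    # read the current setting (default is 1, i.e. writeback allowed)
    cat /sys/fs/cgroup/workload/memory.zswap.writeback
    # disable zswap writeback and swapping for this cgroup only
    echo 0 > /sys/fs/cgroup/workload/memory.zswap.writeback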
@@ -172,7 +172,7 @@ variables.
 Offset of the free_list's member. This value is used to compute the number
 of free pages.
 
-Each zone has a free_area structure array called free_area[MAX_ORDER + 1].
+Each zone has a free_area structure array called free_area[NR_PAGE_ORDERS].
 The free_list represents a linked list of free page blocks.
 
 (list_head, next|prev)
@@ -189,11 +189,11 @@ Offsets of the vmap_area's members. They carry vmalloc-specific
 information. Makedumpfile gets the start address of the vmalloc region
 from this.
 
-(zone.free_area, MAX_ORDER + 1)
--------------------------------
+(zone.free_area, NR_PAGE_ORDERS)
+--------------------------------
 
 Free areas descriptor. User-space tools use this value to iterate the
-free_area ranges. MAX_ORDER is used by the zone buddy allocator.
+free_area ranges. NR_PAGE_ORDERS is used by the zone buddy allocator.
 
 prb
 ---
@@ -970,17 +970,17 @@
 buddy allocator. Bigger value increase the probability
 of catching random memory corruption, but reduce the
 amount of memory for normal system use. The maximum
-possible value is MAX_ORDER/2. Setting this parameter
-to 1 or 2 should be enough to identify most random
-memory corruption problems caused by bugs in kernel or
-driver code when a CPU writes to (or reads from) a
-random memory location. Note that there exists a class
-of memory corruptions problems caused by buggy H/W or
-F/W or by drivers badly programming DMA (basically when
-memory is written at bus level and the CPU MMU is
-bypassed) which are not detectable by
-CONFIG_DEBUG_PAGEALLOC, hence this option will not help
-tracking down these problems.
+possible value is MAX_PAGE_ORDER/2. Setting this
+parameter to 1 or 2 should be enough to identify most
+random memory corruption problems caused by bugs in
+kernel or driver code when a CPU writes to (or reads
+from) a random memory location. Note that there exists
+a class of memory corruptions problems caused by buggy
+H/W or F/W or by drivers badly programming DMA
+(basically when memory is written at bus level and the
+CPU MMU is bypassed) which are not detectable by
+CONFIG_DEBUG_PAGEALLOC, hence this option will not
+help tracking down these problems.
 
 debug_pagealloc=
 [KNL] When CONFIG_DEBUG_PAGEALLOC is set, this parameter
@@ -4136,7 +4136,7 @@
 [KNL] Minimal page reporting order
 Format: <integer>
 Adjust the minimal page reporting order. The page
-reporting is disabled when it exceeds MAX_ORDER.
+reporting is disabled when it exceeds MAX_PAGE_ORDER.
 
 panic= [KNL] Kernel behaviour on panic: delay <timeout>
 timeout > 0: seconds before rebooting
@@ -59,41 +59,47 @@ Files Hierarchy
 The files hierarchy of DAMON sysfs interface is shown below. In the below
 figure, parents-children relations are represented with indentations, each
 directory is having ``/`` suffix, and files in each directory are separated by
-comma (","). ::
+comma (",").
 
-/sys/kernel/mm/damon/admin
-│ kdamonds/nr_kdamonds
-│ │ 0/state,pid
-│ │ │ contexts/nr_contexts
-│ │ │ │ 0/avail_operations,operations
-│ │ │ │ │ monitoring_attrs/
+.. parsed-literal::
+
+:ref:`/sys/kernel/mm/damon <sysfs_root>`/admin
+│ :ref:`kdamonds <sysfs_kdamonds>`/nr_kdamonds
+│ │ :ref:`0 <sysfs_kdamond>`/state,pid
+│ │ │ :ref:`contexts <sysfs_contexts>`/nr_contexts
+│ │ │ │ :ref:`0 <sysfs_context>`/avail_operations,operations
+│ │ │ │ │ :ref:`monitoring_attrs <sysfs_monitoring_attrs>`/
 │ │ │ │ │ │ intervals/sample_us,aggr_us,update_us
 │ │ │ │ │ │ nr_regions/min,max
-│ │ │ │ │ targets/nr_targets
-│ │ │ │ │ │ 0/pid_target
-│ │ │ │ │ │ │ regions/nr_regions
-│ │ │ │ │ │ │ │ 0/start,end
+│ │ │ │ │ :ref:`targets <sysfs_targets>`/nr_targets
+│ │ │ │ │ │ :ref:`0 <sysfs_target>`/pid_target
+│ │ │ │ │ │ │ :ref:`regions <sysfs_regions>`/nr_regions
+│ │ │ │ │ │ │ │ :ref:`0 <sysfs_region>`/start,end
 │ │ │ │ │ │ │ │ ...
 │ │ │ │ │ │ ...
-│ │ │ │ │ schemes/nr_schemes
-│ │ │ │ │ │ 0/action,apply_interval_us
-│ │ │ │ │ │ │ access_pattern/
+│ │ │ │ │ :ref:`schemes <sysfs_schemes>`/nr_schemes
+│ │ │ │ │ │ :ref:`0 <sysfs_scheme>`/action,apply_interval_us
+│ │ │ │ │ │ │ :ref:`access_pattern <sysfs_access_pattern>`/
 │ │ │ │ │ │ │ │ sz/min,max
 │ │ │ │ │ │ │ │ nr_accesses/min,max
 │ │ │ │ │ │ │ │ age/min,max
-│ │ │ │ │ │ │ quotas/ms,bytes,reset_interval_ms
+│ │ │ │ │ │ │ :ref:`quotas <sysfs_quotas>`/ms,bytes,reset_interval_ms
 │ │ │ │ │ │ │ │ weights/sz_permil,nr_accesses_permil,age_permil
-│ │ │ │ │ │ │ watermarks/metric,interval_us,high,mid,low
-│ │ │ │ │ │ │ filters/nr_filters
+│ │ │ │ │ │ │ │ :ref:`goals <sysfs_schemes_quota_goals>`/nr_goals
+│ │ │ │ │ │ │ │ │ 0/target_value,current_value
+│ │ │ │ │ │ │ :ref:`watermarks <sysfs_watermarks>`/metric,interval_us,high,mid,low
+│ │ │ │ │ │ │ :ref:`filters <sysfs_filters>`/nr_filters
 │ │ │ │ │ │ │ │ 0/type,matching,memcg_id
-│ │ │ │ │ │ │ stats/nr_tried,sz_tried,nr_applied,sz_applied,qt_exceeds
-│ │ │ │ │ │ │ tried_regions/total_bytes
+│ │ │ │ │ │ │ :ref:`stats <sysfs_schemes_stats>`/nr_tried,sz_tried,nr_applied,sz_applied,qt_exceeds
+│ │ │ │ │ │ │ :ref:`tried_regions <sysfs_schemes_tried_regions>`/total_bytes
 │ │ │ │ │ │ │ │ 0/start,end,nr_accesses,age
 │ │ │ │ │ │ │ │ ...
 │ │ │ │ │ │ ...
 │ │ │ │ ...
 │ │ ...
 
+.. _sysfs_root:
+
 Root
 ----
 
@@ -102,6 +108,8 @@ has one directory named ``admin``. The directory contains the files for
 privileged user space programs' control of DAMON. User space tools or daemons
 having the root permission could use this directory.
 
+.. _sysfs_kdamonds:
+
 kdamonds/
 ---------
 
@@ -113,6 +121,8 @@ details) exists. In the beginning, this directory has only one file,
 child directories named ``0`` to ``N-1``. Each directory represents each
 kdamond.
 
+.. _sysfs_kdamond:
+
 kdamonds/<N>/
 -------------
 
@@ -120,29 +130,37 @@ In each kdamond directory, two files (``state`` and ``pid``) and one directory
 (``contexts``) exist.
 
 Reading ``state`` returns ``on`` if the kdamond is currently running, or
-``off`` if it is not running. Writing ``on`` or ``off`` makes the kdamond be
-in the state. Writing ``commit`` to the ``state`` file makes kdamond reads the
-user inputs in the sysfs files except ``state`` file again. Writing
-``update_schemes_stats`` to ``state`` file updates the contents of stats files
-for each DAMON-based operation scheme of the kdamond. For details of the
-stats, please refer to :ref:`stats section <sysfs_schemes_stats>`.
-
-Writing ``update_schemes_tried_regions`` to ``state`` file updates the
-DAMON-based operation scheme action tried regions directory for each
-DAMON-based operation scheme of the kdamond. Writing
-``update_schemes_tried_bytes`` to ``state`` file updates only
-``.../tried_regions/total_bytes`` files. Writing
-``clear_schemes_tried_regions`` to ``state`` file clears the DAMON-based
-operating scheme action tried regions directory for each DAMON-based operation
-scheme of the kdamond. For details of the DAMON-based operation scheme action
-tried regions directory, please refer to :ref:`tried_regions section
-<sysfs_schemes_tried_regions>`.
+``off`` if it is not running.
+
+Users can write below commands for the kdamond to the ``state`` file.
+
+- ``on``: Start running.
+- ``off``: Stop running.
+- ``commit``: Read the user inputs in the sysfs files except ``state`` file
+  again.
+- ``commit_schemes_quota_goals``: Read the DAMON-based operation schemes'
+  :ref:`quota goals <sysfs_schemes_quota_goals>`.
+- ``update_schemes_stats``: Update the contents of stats files for each
+  DAMON-based operation scheme of the kdamond. For details of the stats,
+  please refer to :ref:`stats section <sysfs_schemes_stats>`.
+- ``update_schemes_tried_regions``: Update the DAMON-based operation scheme
+  action tried regions directory for each DAMON-based operation scheme of the
+  kdamond. For details of the DAMON-based operation scheme action tried
+  regions directory, please refer to
+  :ref:`tried_regions section <sysfs_schemes_tried_regions>`.
+- ``update_schemes_tried_bytes``: Update only ``.../tried_regions/total_bytes``
+  files.
+- ``clear_schemes_tried_regions``: Clear the DAMON-based operating scheme
+  action tried regions directory for each DAMON-based operation scheme of the
+  kdamond.
 
 If the state is ``on``, reading ``pid`` shows the pid of the kdamond thread.
 
 ``contexts`` directory contains files for controlling the monitoring contexts
 that this kdamond will execute.
 
+.. _sysfs_contexts:
+
 kdamonds/<N>/contexts/
 ----------------------
 
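For illustration only (not part of the patch; the kdamond index ``0`` is assumed to already exist), the ``state`` file commands documented in the hunk above can be exercised from a shell roughly as follows::

    cd /sys/kernel/mm/damon/admin/kdamonds/0
    echo on > state                          # start the kdamond
    echo commit > state                      # re-read all sysfs inputs except 'state'
    echo commit_schemes_quota_goals > state  # re-read only the quota goal files
    echo update_schemes_stats > state        # refresh each scheme's stats files
    cat pid                                  # pid of the kdamond thread while it is 'on'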
@@ -153,7 +171,7 @@ number (``N``) to the file creates the number of child directories named as
 details). At the moment, only one context per kdamond is supported, so only
 ``0`` or ``1`` can be written to the file.
 
-.. _sysfs_contexts:
+.. _sysfs_context:
 
 contexts/<N>/
 -------------
@@ -203,6 +221,8 @@ writing to and rading from the files.
 For more details about the intervals and monitoring regions range, please refer
 to the Design document (:doc:`/mm/damon/design`).
 
+.. _sysfs_targets:
+
 contexts/<N>/targets/
 ---------------------
 
@@ -210,6 +230,8 @@ In the beginning, this directory has only one file, ``nr_targets``. Writing a
 number (``N``) to the file creates the number of child directories named ``0``
 to ``N-1``. Each directory represents each monitoring target.
 
+.. _sysfs_target:
+
 targets/<N>/
 ------------
 
@@ -244,6 +266,8 @@ In the beginning, this directory has only one file, ``nr_regions``. Writing a
 number (``N``) to the file creates the number of child directories named ``0``
 to ``N-1``. Each directory represents each initial monitoring target region.
 
+.. _sysfs_region:
+
 regions/<N>/
 ------------
 
@@ -254,6 +278,8 @@ region by writing to and reading from the files, respectively.
 Each region should not overlap with others. ``end`` of directory ``N`` should
 be equal or smaller than ``start`` of directory ``N+1``.
 
+.. _sysfs_schemes:
+
 contexts/<N>/schemes/
 ---------------------
 
@@ -265,6 +291,8 @@ In the beginning, this directory has only one file, ``nr_schemes``. Writing a
 number (``N``) to the file creates the number of child directories named ``0``
 to ``N-1``. Each directory represents each DAMON-based operation scheme.
 
+.. _sysfs_scheme:
+
 schemes/<N>/
 ------------
 
@@ -277,7 +305,7 @@ The ``action`` file is for setting and getting the scheme's :ref:`action
 from the file and their meaning are as below.
 
 Note that support of each action depends on the running DAMON operations set
-:ref:`implementation <sysfs_contexts>`.
+:ref:`implementation <sysfs_context>`.
 
 - ``willneed``: Call ``madvise()`` for the region with ``MADV_WILLNEED``.
   Supported by ``vaddr`` and ``fvaddr`` operations set.
@@ -299,6 +327,8 @@ Note that support of each action depends on the running DAMON operations set
 The ``apply_interval_us`` file is for setting and getting the scheme's
 :ref:`apply_interval <damon_design_damos>` in microseconds.
 
+.. _sysfs_access_pattern:
+
 schemes/<N>/access_pattern/
 ---------------------------
 
@@ -312,6 +342,8 @@ to and reading from the ``min`` and ``max`` files under ``sz``,
 ``nr_accesses``, and ``age`` directories, respectively. Note that the ``min``
 and the ``max`` form a closed interval.
 
+.. _sysfs_quotas:
+
 schemes/<N>/quotas/
 -------------------
 
@@ -319,8 +351,7 @@ The directory for the :ref:`quotas <damon_design_damos_quotas>` of the given
 DAMON-based operation scheme.
 
 Under ``quotas`` directory, three files (``ms``, ``bytes``,
-``reset_interval_ms``) and one directory (``weights``) having three files
-(``sz_permil``, ``nr_accesses_permil``, and ``age_permil``) in it exist.
+``reset_interval_ms``) and two directores (``weights`` and ``goals``) exist.
 
 You can set the ``time quota`` in milliseconds, ``size quota`` in bytes, and
 ``reset interval`` in milliseconds by writing the values to the three files,
@@ -330,11 +361,37 @@ apply the action to only up to ``bytes`` bytes of memory regions within the
 ``reset_interval_ms``. Setting both ``ms`` and ``bytes`` zero disables the
 quota limits.
 
-You can also set the :ref:`prioritization weights
+Under ``weights`` directory, three files (``sz_permil``,
+``nr_accesses_permil``, and ``age_permil``) exist.
+You can set the :ref:`prioritization weights
 <damon_design_damos_quotas_prioritization>` for size, access frequency, and age
 in per-thousand unit by writing the values to the three files under the
 ``weights`` directory.
 
+.. _sysfs_schemes_quota_goals:
+
+schemes/<N>/quotas/goals/
+-------------------------
+
+The directory for the :ref:`automatic quota tuning goals
+<damon_design_damos_quotas_auto_tuning>` of the given DAMON-based operation
+scheme.
+
+In the beginning, this directory has only one file, ``nr_goals``. Writing a
+number (``N``) to the file creates the number of child directories named ``0``
+to ``N-1``. Each directory represents each goal and current achievement.
+Among the multiple feedback, the best one is used.
+
+Each goal directory contains two files, namely ``target_value`` and
+``current_value``. Users can set and get any number to those files to set the
+feedback. User space main workload's latency or throughput, system metrics
+like free memory ratio or memory pressure stall time (PSI) could be example
+metrics for the values. Note that users should write
+``commit_schemes_quota_goals`` to the ``state`` file of the :ref:`kdamond
+directory <sysfs_kdamond>` to pass the feedback to DAMON.
+
+.. _sysfs_watermarks:
+
 schemes/<N>/watermarks/
 -----------------------
 
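A rough sketch of how the goals directory added above could be used (not from the patch; the kdamond/context/scheme indexes and the metric values are made up)::

    cd /sys/kernel/mm/damon/admin/kdamonds/0/contexts/0/schemes/0/quotas/goals
    echo 1 > nr_goals             # create goal directory '0'
    echo 10000 > 0/target_value   # desired value of the user-chosen metric
    echo 12000 > 0/current_value  # currently measured value of that metric
    # tell the kdamond to read the updated goal files
    echo commit_schemes_quota_goals > /sys/kernel/mm/damon/admin/kdamonds/0/state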
@@ -354,6 +411,8 @@ as below.
 
 The ``interval`` should written in microseconds unit.
 
+.. _sysfs_filters:
+
 schemes/<N>/filters/
 --------------------
 
@@ -394,7 +453,7 @@ pages of all memory cgroups except ``/having_care_already``.::
    echo N > 1/matching
 
 Note that ``anon`` and ``memcg`` filters are currently supported only when
-``paddr`` :ref:`implementation <sysfs_contexts>` is being used.
+``paddr`` :ref:`implementation <sysfs_context>` is being used.
 
 Also, memory regions that are filtered out by ``addr`` or ``target`` filters
 are not counted as the scheme has tried to those, while regions that filtered
@@ -449,6 +508,8 @@ and query-like efficient data access monitoring results retrievals. For the
 latter use case, in particular, users can set the ``action`` as ``stat`` and
 set the ``access pattern`` as their interested pattern that they want to query.
 
+.. _sysfs_schemes_tried_region:
+
 tried_regions/<N>/
 ------------------
 
@@ -80,6 +80,9 @@ pages_to_scan
 how many pages to scan before ksmd goes to sleep
 e.g. ``echo 100 > /sys/kernel/mm/ksm/pages_to_scan``.
 
+The pages_to_scan value cannot be changed if ``advisor_mode`` has
+been set to scan-time.
+
 Default: 100 (chosen for demonstration purposes)
 
 sleep_millisecs
@@ -164,6 +167,29 @@ smart_scan
 optimization is enabled. The ``pages_skipped`` metric shows how
 effective the setting is.
 
+advisor_mode
+The ``advisor_mode`` selects the current advisor. Two modes are
+supported: none and scan-time. The default is none. By setting
+``advisor_mode`` to scan-time, the scan time advisor is enabled.
+The section about ``advisor`` explains in detail how the scan time
+advisor works.
+
+adivsor_max_cpu
+specifies the upper limit of the cpu percent usage of the ksmd
+background thread. The default is 70.
+
+advisor_target_scan_time
+specifies the target scan time in seconds to scan all the candidate
+pages. The default value is 200 seconds.
+
+advisor_min_pages_to_scan
+specifies the lower limit of the ``pages_to_scan`` parameter of the
+scan time advisor. The default is 500.
+
+adivsor_max_pages_to_scan
+specifies the upper limit of the ``pages_to_scan`` parameter of the
+scan time advisor. The default is 30000.
+
 The effectiveness of KSM and MADV_MERGEABLE is shown in ``/sys/kernel/mm/ksm/``:
 
 general_profit
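As an illustrative sketch (not part of this patch, and assuming the sysfs file names match the documented knobs), the scan-time advisor described above could be enabled and tuned like so::

    echo scan-time > /sys/kernel/mm/ksm/advisor_mode        # enable the scan time advisor
    echo 100 > /sys/kernel/mm/ksm/advisor_target_scan_time  # aim to scan all candidates in ~100s
    cat /sys/kernel/mm/ksm/pages_to_scan                    # now managed by the advisor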
@@ -263,6 +289,35 @@ ksm_swpin_copy
 note that KSM page might be copied when swapping in because do_swap_page()
 cannot do all the locking needed to reconstitute a cross-anon_vma KSM page.
 
+Advisor
+=======
+
+The number of candidate pages for KSM is dynamic. It can be often observed
+that during the startup of an application more candidate pages need to be
+processed. Without an advisor the ``pages_to_scan`` parameter needs to be
+sized for the maximum number of candidate pages. The scan time advisor can
+changes the ``pages_to_scan`` parameter based on demand.
+
+The advisor can be enabled, so KSM can automatically adapt to changes in the
+number of candidate pages to scan. Two advisors are implemented: none and
+scan-time. With none, no advisor is enabled. The default is none.
+
+The scan time advisor changes the ``pages_to_scan`` parameter based on the
+observed scan times. The possible values for the ``pages_to_scan`` parameter is
+limited by the ``advisor_max_cpu`` parameter. In addition there is also the
+``advisor_target_scan_time`` parameter. This parameter sets the target time to
+scan all the KSM candidate pages. The parameter ``advisor_target_scan_time``
+decides how aggressive the scan time advisor scans candidate pages. Lower
+values make the scan time advisor to scan more aggresively. This is the most
+important parameter for the configuration of the scan time advisor.
+
+The initial value and the maximum value can be changed with
+``advisor_min_pages_to_scan`` and ``advisor_max_pages_to_scan``. The default
+values are sufficient for most workloads and use cases.
+
+The ``pages_to_scan`` parameter is re-calculated after a scan has been completed.
+
+
 --
 Izik Eidus,
 Hugh Dickins, 17 Nov 2009
@@ -253,6 +253,7 @@ Following flags about pages are currently supported:
 - ``PAGE_IS_SWAPPED`` - Page is in swapped
 - ``PAGE_IS_PFNZERO`` - Page has zero PFN
 - ``PAGE_IS_HUGE`` - Page is THP or Hugetlb backed
+- ``PAGE_IS_SOFT_DIRTY`` - Page is soft-dirty
 
 The ``struct pm_scan_arg`` is used as the argument of the IOCTL.
 
@@ -45,10 +45,25 @@ components:
 the two is using hugepages just because of the fact the TLB miss is
 going to run faster.
 
+Modern kernels support "multi-size THP" (mTHP), which introduces the
+ability to allocate memory in blocks that are bigger than a base page
+but smaller than traditional PMD-size (as described above), in
+increments of a power-of-2 number of pages. mTHP can back anonymous
+memory (for example 16K, 32K, 64K, etc). These THPs continue to be
+PTE-mapped, but in many cases can still provide similar benefits to
+those outlined above: Page faults are significantly reduced (by a
+factor of e.g. 4, 8, 16, etc), but latency spikes are much less
+prominent because the size of each page isn't as huge as the PMD-sized
+variant and there is less memory to clear in each page fault. Some
+architectures also employ TLB compression mechanisms to squeeze more
+entries in when a set of PTEs are virtually and physically contiguous
+and approporiately aligned. In this case, TLB misses will occur less
+often.
+
 THP can be enabled system wide or restricted to certain tasks or even
 memory ranges inside task's address space. Unless THP is completely
 disabled, there is ``khugepaged`` daemon that scans memory and
-collapses sequences of basic pages into huge pages.
+collapses sequences of basic pages into PMD-sized huge pages.
 
 The THP behaviour is controlled via :ref:`sysfs <thp_sysfs>`
 interface and using madvise(2) and prctl(2) system calls.
@@ -95,12 +110,40 @@ Global THP controls
 Transparent Hugepage Support for anonymous memory can be entirely disabled
 (mostly for debugging purposes) or only enabled inside MADV_HUGEPAGE
 regions (to avoid the risk of consuming more memory resources) or enabled
-system wide. This can be achieved with one of::
+system wide. This can be achieved per-supported-THP-size with one of::
 
+    echo always >/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/enabled
+    echo madvise >/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/enabled
+    echo never >/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/enabled
+
+where <size> is the hugepage size being addressed, the available sizes
+for which vary by system.
+
+For example::
+
+    echo always >/sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
+
+Alternatively it is possible to specify that a given hugepage size
+will inherit the top-level "enabled" value::
+
+    echo inherit >/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/enabled
+
+For example::
+
+    echo inherit >/sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
+
+The top-level setting (for use with "inherit") can be set by issuing
+one of the following commands::
+
     echo always >/sys/kernel/mm/transparent_hugepage/enabled
     echo madvise >/sys/kernel/mm/transparent_hugepage/enabled
     echo never >/sys/kernel/mm/transparent_hugepage/enabled
 
+By default, PMD-sized hugepages have enabled="inherit" and all other
+hugepage sizes have enabled="never". If enabling multiple hugepage
+sizes, the kernel will select the most appropriate enabled size for a
+given allocation.
+
 It's also possible to limit defrag efforts in the VM to generate
 anonymous hugepages in case they're not immediately free to madvise
 regions or to never try to defrag memory and simply fallback to regular
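For example (not from the patch; the set of ``hugepages-<size>kB`` directories differs between systems, and the 64kB size is only an assumption), the top-level and per-size controls can be combined like this::

    ls -d /sys/kernel/mm/transparent_hugepage/hugepages-*kB  # see which sizes this system offers
    echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
    # PMD-sized pages follow the top-level setting via "inherit" by default;
    # additionally enable one smaller size unconditionally, if it exists:
    echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled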
@@ -146,25 +189,34 @@ madvise
 never
 should be self-explanatory.
 
-By default kernel tries to use huge zero page on read page fault to
-anonymous mapping. It's possible to disable huge zero page by writing 0
-or enable it back by writing 1::
+By default kernel tries to use huge, PMD-mappable zero page on read
+page fault to anonymous mapping. It's possible to disable huge zero
+page by writing 0 or enable it back by writing 1::
 
     echo 0 >/sys/kernel/mm/transparent_hugepage/use_zero_page
     echo 1 >/sys/kernel/mm/transparent_hugepage/use_zero_page
 
-Some userspace (such as a test program, or an optimized memory allocation
-library) may want to know the size (in bytes) of a transparent hugepage::
+Some userspace (such as a test program, or an optimized memory
+allocation library) may want to know the size (in bytes) of a
+PMD-mappable transparent hugepage::
 
     cat /sys/kernel/mm/transparent_hugepage/hpage_pmd_size
 
-khugepaged will be automatically started when
-transparent_hugepage/enabled is set to "always" or "madvise, and it'll
-be automatically shutdown if it's set to "never".
+khugepaged will be automatically started when one or more hugepage
+sizes are enabled (either by directly setting "always" or "madvise",
+or by setting "inherit" while the top-level enabled is set to "always"
+or "madvise"), and it'll be automatically shutdown when the last
+hugepage size is disabled (either by directly setting "never", or by
+setting "inherit" while the top-level enabled is set to "never").
 
 Khugepaged controls
 -------------------
 
+.. note::
+   khugepaged currently only searches for opportunities to collapse to
+   PMD-sized THP and no attempt is made to collapse to other THP
+   sizes.
+
 khugepaged runs usually at low frequency so while one may not want to
 invoke defrag algorithms synchronously during the page faults, it
 should be worth invoking defrag at least in khugepaged. However it's
@@ -282,19 +334,26 @@ force
 Need of application restart
 ===========================
 
-The transparent_hugepage/enabled values and tmpfs mount option only affect
-future behavior. So to make them effective you need to restart any
-application that could have been using hugepages. This also applies to the
-regions registered in khugepaged.
+The transparent_hugepage/enabled and
+transparent_hugepage/hugepages-<size>kB/enabled values and tmpfs mount
+option only affect future behavior. So to make them effective you need
+to restart any application that could have been using hugepages. This
+also applies to the regions registered in khugepaged.
 
 Monitoring usage
 ================
 
-The number of anonymous transparent huge pages currently used by the
+.. note::
+   Currently the below counters only record events relating to
+   PMD-sized THP. Events relating to other THP sizes are not included.
+
+The number of PMD-sized anonymous transparent huge pages currently used by the
 system is available by reading the AnonHugePages field in ``/proc/meminfo``.
-To identify what applications are using anonymous transparent huge pages,
-it is necessary to read ``/proc/PID/smaps`` and count the AnonHugePages fields
-for each mapping.
+To identify what applications are using PMD-sized anonymous transparent huge
+pages, it is necessary to read ``/proc/PID/smaps`` and count the AnonHugePages
+fields for each mapping. (Note that AnonHugePages only applies to traditional
+PMD-sized THP for historical reasons and should have been called
+AnonHugePmdMapped).
 
 The number of file transparent huge pages mapped to userspace is available
 by reading ShmemPmdMapped and ShmemHugePages fields in ``/proc/meminfo``.
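A small sketch of the monitoring described above (not from the patch; ``$PID`` stands for whatever process is of interest)::

    grep AnonHugePages /proc/meminfo   # system-wide PMD-sized anonymous THP usage
    # sum the per-mapping AnonHugePages fields for one process (in kB)
    awk '/AnonHugePages/ {sum += $2} END {print sum " kB"}' /proc/$PID/smaps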
@@ -413,7 +472,7 @@ for huge pages.
 Optimizing the applications
 ===========================
 
-To be guaranteed that the kernel will map a 2M page immediately in any
+To be guaranteed that the kernel will map a THP immediately in any
 memory region, the mmap region has to be hugepage naturally
 aligned. posix_memalign() can provide that guarantee.
 
@@ -113,6 +113,9 @@ events, except page fault notifications, may be generated:
 areas. ``UFFD_FEATURE_MINOR_SHMEM`` is the analogous feature indicating
 support for shmem virtual memory areas.
 
+- ``UFFD_FEATURE_MOVE`` indicates that the kernel supports moving an
+  existing page contents from userspace.
+
 The userland application should set the feature flags it intends to use
 when invoking the ``UFFDIO_API`` ioctl, to request that those features be
 enabled if supported.
@@ -153,6 +153,26 @@ attribute, e. g.::
 
 Setting this parameter to 100 will disable the hysteresis.
 
+Some users cannot tolerate the swapping that comes with zswap store failures
+and zswap writebacks. Swapping can be disabled entirely (without disabling
+zswap itself) on a cgroup-basis as follows:
+
+    echo 0 > /sys/fs/cgroup/<cgroup-name>/memory.zswap.writeback
+
+Note that if the store failures are recurring (for e.g if the pages are
+incompressible), users can observe reclaim inefficiency after disabling
+writeback (because the same pages might be rejected again and again).
+
+When there is a sizable amount of cold memory residing in the zswap pool, it
+can be advantageous to proactively write these cold pages to swap and reclaim
+the memory for other use cases. By default, the zswap shrinker is disabled.
+User can enable it as follows:
+
+    echo Y > /sys/module/zswap/parameters/shrinker_enabled
+
+This can be enabled at the boot time if ``CONFIG_ZSWAP_SHRINKER_DEFAULT_ON`` is
+selected.
+
 A debugfs interface is provided for various statistic about pool size, number
 of pages stored, same-value filled pages and various counters for the reasons
 pages are rejected.
@@ -81,6 +81,9 @@ section.
 Sometimes it is necessary to ensure the next call to store to a maple tree does
 not allocate memory, please see :ref:`maple-tree-advanced-api` for this use case.
 
+You can use mtree_dup() to duplicate an entire maple tree. It is a more
+efficient way than inserting all elements one by one into a new tree.
+
 Finally, you can remove all entries from a maple tree by calling
 mtree_destroy(). If the maple tree entries are pointers, you may wish to free
 the entries first.
@@ -112,6 +115,7 @@ Takes ma_lock internally:
 * mtree_insert()
 * mtree_insert_range()
 * mtree_erase()
+* mtree_dup()
 * mtree_destroy()
 * mt_set_in_rcu()
 * mt_clear_in_rcu()
@@ -261,7 +261,7 @@ prototypes::
 struct folio *src, enum migrate_mode);
 int (*launder_folio)(struct folio *);
 bool (*is_partially_uptodate)(struct folio *, size_t from, size_t count);
-int (*error_remove_page)(struct address_space *, struct page *);
+int (*error_remove_folio)(struct address_space *, struct folio *);
 int (*swap_activate)(struct swap_info_struct *sis, struct file *f, sector_t *span)
 int (*swap_deactivate)(struct file *);
 int (*swap_rw)(struct kiocb *iocb, struct iov_iter *iter);
@ -287,7 +287,7 @@ direct_IO:
|
|||||||
migrate_folio: yes (both)
|
migrate_folio: yes (both)
|
||||||
launder_folio: yes
|
launder_folio: yes
|
||||||
is_partially_uptodate: yes
|
is_partially_uptodate: yes
|
||||||
error_remove_page: yes
|
error_remove_folio: yes
|
||||||
swap_activate: no
|
swap_activate: no
|
||||||
swap_deactivate: no
|
swap_deactivate: no
|
||||||
swap_rw: yes, unlocks
|
swap_rw: yes, unlocks
|
||||||
|
@ -528,9 +528,9 @@ replaced by copy-on-write) part of the underlying shmem object out on swap.
|
|||||||
does not take into account swapped out page of underlying shmem objects.
|
does not take into account swapped out page of underlying shmem objects.
|
||||||
"Locked" indicates whether the mapping is locked in memory or not.
|
"Locked" indicates whether the mapping is locked in memory or not.
|
||||||
|
|
||||||
"THPeligible" indicates whether the mapping is eligible for allocating THP
|
"THPeligible" indicates whether the mapping is eligible for allocating
|
||||||
pages as well as the THP is PMD mappable or not - 1 if true, 0 otherwise.
|
naturally aligned THP pages of any currently enabled size. 1 if true, 0
|
||||||
It just shows the current status.
|
otherwise.
|
||||||
|
|
||||||
"VmFlags" field deserves a separate description. This member represents the
|
"VmFlags" field deserves a separate description. This member represents the
|
||||||
kernel flags associated with the particular virtual memory area in two letter
|
kernel flags associated with the particular virtual memory area in two letter
|
||||||
|
@ -823,7 +823,7 @@ cache in your filesystem. The following members are defined:
|
|||||||
bool (*is_partially_uptodate) (struct folio *, size_t from,
|
bool (*is_partially_uptodate) (struct folio *, size_t from,
|
||||||
size_t count);
|
size_t count);
|
||||||
void (*is_dirty_writeback)(struct folio *, bool *, bool *);
|
void (*is_dirty_writeback)(struct folio *, bool *, bool *);
|
||||||
int (*error_remove_page) (struct mapping *mapping, struct page *page);
|
int (*error_remove_folio)(struct mapping *mapping, struct folio *);
|
||||||
int (*swap_activate)(struct swap_info_struct *sis, struct file *f, sector_t *span)
|
int (*swap_activate)(struct swap_info_struct *sis, struct file *f, sector_t *span)
|
||||||
int (*swap_deactivate)(struct file *);
|
int (*swap_deactivate)(struct file *);
|
||||||
int (*swap_rw)(struct kiocb *iocb, struct iov_iter *iter);
|
int (*swap_rw)(struct kiocb *iocb, struct iov_iter *iter);
|
||||||
@ -1034,8 +1034,8 @@ cache in your filesystem. The following members are defined:
|
|||||||
VM if a folio should be treated as dirty or writeback for the
|
VM if a folio should be treated as dirty or writeback for the
|
||||||
purposes of stalling.
|
purposes of stalling.
|
||||||
|
|
||||||
``error_remove_page``
|
``error_remove_folio``
|
||||||
normally set to generic_error_remove_page if truncation is ok
|
normally set to generic_error_remove_folio if truncation is ok
|
||||||
for this address space. Used for memory failure handling.
|
for this address space. Used for memory failure handling.
|
||||||
Setting this implies you deal with pages going away under you,
|
Setting this implies you deal with pages going away under you,
|
||||||
unless you have them locked or reference counts increased.
|
unless you have them locked or reference counts increased.
|
||||||
|
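Both renames above land in each filesystem's address_space_operations table; the block/fops.c hunk later in this series shows the real thing for def_blk_aops. As a rough sketch of a converted table (the example_read_folio callback is a hypothetical placeholder; generic_error_remove_folio and filemap_migrate_folio are the helpers named in these documents):

/* Sketch of an aops table after the error_remove_page -> error_remove_folio
 * rename.  example_read_folio() is hypothetical, declared only so the
 * initializer is complete.
 */
#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/migrate.h>

static int example_read_folio(struct file *file, struct folio *folio); /* hypothetical */

static const struct address_space_operations example_aops = {
	.read_folio		= example_read_folio,
	.error_remove_folio	= generic_error_remove_folio,
	.migrate_folio		= filemap_migrate_folio,
};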
@@ -18,8 +18,6 @@ PTE Page Table Helpers
 +---------------------------+--------------------------------------------------+
 | pte_same                  | Tests whether both PTE entries are the same      |
 +---------------------------+--------------------------------------------------+
-| pte_bad                   | Tests a non-table mapped PTE                     |
-+---------------------------+--------------------------------------------------+
 | pte_present               | Tests a valid mapped PTE                         |
 +---------------------------+--------------------------------------------------+
 | pte_young                 | Tests a young PTE                                |
@@ -5,6 +5,18 @@ Design
 ======
 
 
+.. _damon_design_execution_model_and_data_structures:
+
+Execution Model and Data Structures
+===================================
+
+The monitoring-related information including the monitoring request
+specification and DAMON-based operation schemes are stored in a data structure
+called DAMON ``context``. DAMON executes each context with a kernel thread
+called ``kdamond``. Multiple kdamonds could run in parallel, for different
+types of monitoring.
+
+
 Overall Architecture
 ====================
 
@@ -346,6 +358,19 @@ the weight will be respected are up to the underlying prioritization mechanism
 implementation.
 
 
+.. _damon_design_damos_quotas_auto_tuning:
+
+Aim-oriented Feedback-driven Auto-tuning
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Automatic feedback-driven quota tuning. Instead of setting the absolute quota
+value, users can repeatedly provide numbers representing how much of their goal
+for the scheme is achieved as feedback. DAMOS then automatically tunes the
+aggressiveness (the quota) of the corresponding scheme. For example, if DAMOS
+is under achieving the goal, DAMOS automatically increases the quota. If DAMOS
+is over achieving the goal, it decreases the quota.
+
+
 .. _damon_design_damos_watermarks:
 
 Watermarks
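The auto-tuning described above is essentially a proportional feedback loop. The standalone program below only illustrates the direction of the adjustment with made-up numbers; DAMOS' actual tuning logic lives under mm/damon/ and differs in detail:

/* Illustration only: proportional feedback in the spirit of the DAMOS quota
 * auto-tuning described above.  Feedback is "percent of goal achieved";
 * under-achieving raises the quota, over-achieving lowers it.
 */
#include <stdio.h>

int main(void)
{
	unsigned long quota = 1000;	/* arbitrary units of work per interval */
	unsigned long feedback[] = { 40, 60, 80, 100, 120, 140 };

	for (unsigned int i = 0; i < sizeof(feedback) / sizeof(feedback[0]); i++) {
		quota = quota * 100 / feedback[i];
		printf("feedback %3lu%% -> new quota %lu\n", feedback[i], quota);
	}
	return 0;
}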
@@ -477,15 +502,3 @@ modules for proactive reclamation and LRU lists manipulation are provided. For
 more detail, please read the usage documents for those
 (:doc:`/admin-guide/mm/damon/reclaim` and
 :doc:`/admin-guide/mm/damon/lru_sort`).
-
-
-.. _damon_design_execution_model_and_data_structures:
-
-Execution Model and Data Structures
-===================================
-
-The monitoring-related information including the monitoring request
-specification and DAMON-based operation schemes are stored in a data structure
-called DAMON ``context``. DAMON executes each context with a kernel thread
-called ``kdamond``. Multiple kdamonds could run in parallel, for different
-types of monitoring.
@@ -117,7 +117,7 @@ pages:
 
 - map/unmap of a PMD entry for the whole THP increment/decrement
   folio->_entire_mapcount and also increment/decrement
-  folio->_nr_pages_mapped by COMPOUND_MAPPED when _entire_mapcount
+  folio->_nr_pages_mapped by ENTIRELY_MAPPED when _entire_mapcount
   goes from -1 to 0 or 0 to -1.
 
 - map/unmap of individual pages with PTE entry increment/decrement
@@ -156,7 +156,7 @@ Partial unmap and deferred_split_folio()
 
 Unmapping part of THP (with munmap() or other way) is not going to free
 memory immediately. Instead, we detect that a subpage of THP is not in use
-in page_remove_rmap() and queue the THP for splitting if memory pressure
+in folio_remove_rmap_*() and queue the THP for splitting if memory pressure
 comes. Splitting will free up unused subpages.
 
 Splitting the page right away is not an option due to locking context in
@@ -486,7 +486,7 @@ munlock the pages if we're removing the last VM_LOCKED VMA that maps the pages.
 Before the unevictable/mlock changes, mlocking did not mark the pages in any
 way, so unmapping them required no processing.
 
-For each PTE (or PMD) being unmapped from a VMA, page_remove_rmap() calls
+For each PTE (or PMD) being unmapped from a VMA, folio_remove_rmap_*() calls
 munlock_vma_folio(), which calls munlock_folio() when the VMA is VM_LOCKED
 (unless it was a PTE mapping of a part of a transparent huge page).
 
@@ -511,7 +511,7 @@ userspace; truncation even unmaps and deletes any private anonymous pages
 which had been Copied-On-Write from the file pages now being truncated.
 
 Mlocked pages can be munlocked and deleted in this way: like with munmap(),
-for each PTE (or PMD) being unmapped from a VMA, page_remove_rmap() calls
+for each PTE (or PMD) being unmapped from a VMA, folio_remove_rmap_*() calls
 munlock_vma_folio(), which calls munlock_folio() when the VMA is VM_LOCKED
 (unless it was a PTE mapping of a part of a transparent huge page).
 
@@ -263,20 +263,20 @@ the name indicates, this function allocates pages of memory, and the second
 argument is "order" or a power of two number of pages, that is
 (for PAGE_SIZE == 4096) order=0 ==> 4096 bytes, order=1 ==> 8192 bytes,
 order=2 ==> 16384 bytes, etc. The maximum size of a
-region allocated by __get_free_pages is determined by the MAX_ORDER macro. More
-precisely the limit can be calculated as::
+region allocated by __get_free_pages is determined by the MAX_PAGE_ORDER macro.
+More precisely the limit can be calculated as::
 
-   PAGE_SIZE << MAX_ORDER
+   PAGE_SIZE << MAX_PAGE_ORDER
 
 In a i386 architecture PAGE_SIZE is 4096 bytes
-In a 2.4/i386 kernel MAX_ORDER is 10
-In a 2.6/i386 kernel MAX_ORDER is 11
+In a 2.4/i386 kernel MAX_PAGE_ORDER is 10
+In a 2.6/i386 kernel MAX_PAGE_ORDER is 11
 
 So get_free_pages can allocate as much as 4MB or 8MB in a 2.4/2.6 kernel
 respectively, with an i386 architecture.
 
 User space programs can include /usr/include/sys/user.h and
-/usr/include/linux/mmzone.h to get PAGE_SIZE MAX_ORDER declarations.
+/usr/include/linux/mmzone.h to get PAGE_SIZE MAX_PAGE_ORDER declarations.
 
 The pagesize can also be determined dynamically with the getpagesize (2)
 system call.
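As a worked example of the formula above, with the common default order of 10 hard-coded as an assumption (the real value comes from the kernel configuration):

/* Worked example of the PAGE_SIZE << MAX_PAGE_ORDER limit quoted above.
 * ASSUMED_MAX_PAGE_ORDER is a hard-coded assumption (10 is a common default).
 */
#include <stdio.h>
#include <unistd.h>

#define ASSUMED_MAX_PAGE_ORDER 10

int main(void)
{
	long page_size = sysconf(_SC_PAGESIZE);
	long max_block = page_size << ASSUMED_MAX_PAGE_ORDER;

	/* With 4096-byte pages this prints 4194304, i.e. the 4MB figure above. */
	printf("page size %ld, max __get_free_pages block %ld bytes\n",
	       page_size, max_block);
	return 0;
}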
@@ -324,7 +324,7 @@ Definitions:
                  (see /proc/slabinfo)
 <pointer size> depends on the architecture -- ``sizeof(void *)``
 <page size> depends on the architecture -- PAGE_SIZE or getpagesize (2)
-<max-order> is the value defined with MAX_ORDER
+<max-order> is the value defined with MAX_PAGE_ORDER
 <frame size> it's an upper bound of frame's capture size (more on this later)
 ============== ================================================================
 
@@ -5339,6 +5339,7 @@ L:	linux-mm@kvack.org
 S:	Maintained
 F:	mm/memcontrol.c
 F:	mm/swap_cgroup.c
+F:	samples/cgroup/*
 F:	tools/testing/selftests/cgroup/memcg_protection.m
 F:	tools/testing/selftests/cgroup/test_hugetlb_memcg.c
 F:	tools/testing/selftests/cgroup/test_kmem.c
@@ -1470,6 +1470,14 @@ config DYNAMIC_SIGFRAME
 config HAVE_ARCH_NODE_DEV_GROUP
 	bool
 
+config ARCH_HAS_HW_PTE_YOUNG
+	bool
+	help
+	  Architectures that select this option are capable of setting the
+	  accessed bit in PTE entries when using them as part of linear address
+	  translations. Architectures that require runtime check should select
+	  this option and override arch_has_hw_pte_young().
+
 config ARCH_HAS_NONLEAF_PMD_YOUNG
 	bool
 	help
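The help text above implies the usual pattern of a generic fallback keyed off the new option, overridable per architecture. The sketch below is an assumption about that wiring (the authoritative definition lives in include/linux/pgtable.h):

/* Assumed shape of the generic fallback behind ARCH_HAS_HW_PTE_YOUNG; check
 * include/linux/pgtable.h for the real definition.  Architectures whose MMU
 * always sets the accessed bit simply select the option (see the arm64 and
 * x86 Kconfig hunks further down); others override the helper.
 */
#include <linux/kconfig.h>
#include <linux/types.h>

#ifndef arch_has_hw_pte_young
static inline bool arch_has_hw_pte_young(void)
{
	return IS_ENABLED(CONFIG_ARCH_HAS_HW_PTE_YOUNG);
}
#endif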
@@ -1362,7 +1362,7 @@ config ARCH_FORCE_MAX_ORDER
 	default "10"
 	help
 	  The kernel page allocator limits the size of maximal physically
-	  contiguous allocations. The limit is called MAX_ORDER and it
+	  contiguous allocations. The limit is called MAX_PAGE_ORDER and it
 	  defines the maximal power of two of number of pages that can be
 	  allocated as a single contiguous block. This option allows
 	  overriding the default setting when ability to allocate very
@@ -36,6 +36,7 @@ config ARM64
 	select ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE
 	select ARCH_HAS_PTE_DEVMAP
 	select ARCH_HAS_PTE_SPECIAL
+	select ARCH_HAS_HW_PTE_YOUNG
 	select ARCH_HAS_SETUP_DMA_OPS
 	select ARCH_HAS_SET_DIRECT_MAP
 	select ARCH_HAS_SET_MEMORY
@@ -1519,15 +1520,15 @@ config XEN
 
 # include/linux/mmzone.h requires the following to be true:
 #
-#   MAX_ORDER + PAGE_SHIFT <= SECTION_SIZE_BITS
+#   MAX_PAGE_ORDER + PAGE_SHIFT <= SECTION_SIZE_BITS
 #
-# so the maximum value of MAX_ORDER is SECTION_SIZE_BITS - PAGE_SHIFT:
+# so the maximum value of MAX_PAGE_ORDER is SECTION_SIZE_BITS - PAGE_SHIFT:
 #
-#     | SECTION_SIZE_BITS | PAGE_SHIFT | max MAX_ORDER | default MAX_ORDER |
-# ----+-------------------+--------------+-----------------+--------------------+
+#     | SECTION_SIZE_BITS | PAGE_SHIFT | max MAX_PAGE_ORDER | default MAX_PAGE_ORDER |
+# ----+-------------------+--------------+----------------------+-------------------------+
 # 4K  | 27 | 12 | 15 | 10 |
 # 16K | 27 | 14 | 13 | 11 |
 # 64K | 29 | 16 | 13 | 13 |
 config ARCH_FORCE_MAX_ORDER
 	int
 	default "13" if ARM64_64K_PAGES
@@ -1535,16 +1536,16 @@ config ARCH_FORCE_MAX_ORDER
 	default "10"
 	help
 	  The kernel page allocator limits the size of maximal physically
-	  contiguous allocations. The limit is called MAX_ORDER and it
+	  contiguous allocations. The limit is called MAX_PAGE_ORDER and it
 	  defines the maximal power of two of number of pages that can be
 	  allocated as a single contiguous block. This option allows
 	  overriding the default setting when ability to allocate very
 	  large blocks of physically contiguous memory is required.
 
 	  The maximal size of allocation cannot exceed the size of the
-	  section, so the value of MAX_ORDER should satisfy
+	  section, so the value of MAX_PAGE_ORDER should satisfy
 
-	    MAX_ORDER + PAGE_SHIFT <= SECTION_SIZE_BITS
+	    MAX_PAGE_ORDER + PAGE_SHIFT <= SECTION_SIZE_BITS
 
 	  Don't change if unsure.
 
@@ -15,29 +15,9 @@
 
 #if defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS)
 
-void kasan_init(void);
-
-/*
- * KASAN_SHADOW_START: beginning of the kernel virtual addresses.
- * KASAN_SHADOW_END: KASAN_SHADOW_START + 1/N of kernel virtual addresses,
- * where N = (1 << KASAN_SHADOW_SCALE_SHIFT).
- *
- * KASAN_SHADOW_OFFSET:
- * This value is used to map an address to the corresponding shadow
- * address by the following formula:
- * shadow_addr = (address >> KASAN_SHADOW_SCALE_SHIFT) + KASAN_SHADOW_OFFSET
- *
- * (1 << (64 - KASAN_SHADOW_SCALE_SHIFT)) shadow addresses that lie in range
- * [KASAN_SHADOW_OFFSET, KASAN_SHADOW_END) cover all 64-bits of virtual
- * addresses. So KASAN_SHADOW_OFFSET should satisfy the following equation:
- * KASAN_SHADOW_OFFSET = KASAN_SHADOW_END -
- *			(1ULL << (64 - KASAN_SHADOW_SCALE_SHIFT))
- */
-#define _KASAN_SHADOW_START(va)	(KASAN_SHADOW_END - (1UL << ((va) - KASAN_SHADOW_SCALE_SHIFT)))
-#define KASAN_SHADOW_START	_KASAN_SHADOW_START(vabits_actual)
-
-void kasan_copy_shadow(pgd_t *pgdir);
 asmlinkage void kasan_early_init(void);
+void kasan_init(void);
+void kasan_copy_shadow(pgd_t *pgdir);
 
 #else
 static inline void kasan_init(void) { }
@@ -65,15 +65,41 @@
 #define KERNEL_END		_end
 
 /*
- * Generic and tag-based KASAN require 1/8th and 1/16th of the kernel virtual
- * address space for the shadow region respectively. They can bloat the stack
- * significantly, so double the (minimum) stack size when they are in use.
+ * Generic and Software Tag-Based KASAN modes require 1/8th and 1/16th of the
+ * kernel virtual address space for storing the shadow memory respectively.
+ *
+ * The mapping between a virtual memory address and its corresponding shadow
+ * memory address is defined based on the formula:
+ *
+ *     shadow_addr = (addr >> KASAN_SHADOW_SCALE_SHIFT) + KASAN_SHADOW_OFFSET
+ *
+ * where KASAN_SHADOW_SCALE_SHIFT is the order of the number of bits that map
+ * to a single shadow byte and KASAN_SHADOW_OFFSET is a constant that offsets
+ * the mapping. Note that KASAN_SHADOW_OFFSET does not point to the start of
+ * the shadow memory region.
+ *
+ * Based on this mapping, we define two constants:
+ *
+ * KASAN_SHADOW_START: the start of the shadow memory region;
+ * KASAN_SHADOW_END: the end of the shadow memory region.
+ *
+ * KASAN_SHADOW_END is defined first as the shadow address that corresponds to
+ * the upper bound of possible virtual kernel memory addresses UL(1) << 64
+ * according to the mapping formula.
+ *
+ * KASAN_SHADOW_START is defined second based on KASAN_SHADOW_END. The shadow
+ * memory start must map to the lowest possible kernel virtual memory address
+ * and thus it depends on the actual bitness of the address space.
+ *
+ * As KASAN inserts redzones between stack variables, this increases the stack
+ * memory usage significantly. Thus, we double the (minimum) stack size.
  */
 #if defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS)
 #define KASAN_SHADOW_OFFSET	_AC(CONFIG_KASAN_SHADOW_OFFSET, UL)
-#define KASAN_SHADOW_END	((UL(1) << (64 - KASAN_SHADOW_SCALE_SHIFT)) \
-					+ KASAN_SHADOW_OFFSET)
-#define PAGE_END		(KASAN_SHADOW_END - (1UL << (vabits_actual - KASAN_SHADOW_SCALE_SHIFT)))
+#define KASAN_SHADOW_END	((UL(1) << (64 - KASAN_SHADOW_SCALE_SHIFT)) + KASAN_SHADOW_OFFSET)
+#define _KASAN_SHADOW_START(va)	(KASAN_SHADOW_END - (UL(1) << ((va) - KASAN_SHADOW_SCALE_SHIFT)))
+#define KASAN_SHADOW_START	_KASAN_SHADOW_START(vabits_actual)
+#define PAGE_END		KASAN_SHADOW_START
 #define KASAN_THREAD_SHIFT	1
 #else
 #define KASAN_THREAD_SHIFT	0
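The mapping formula in the new comment is easy to sanity-check numerically. A standalone illustration; the scale shift of 3 matches generic KASAN (one shadow byte per 8 bytes), while the offset below is a made-up value, not arm64's real one:

/* Numeric illustration of
 *   shadow_addr = (addr >> KASAN_SHADOW_SCALE_SHIFT) + KASAN_SHADOW_OFFSET
 * from the comment above.  SCALE_SHIFT and SHADOW_OFFSET are stand-ins.
 */
#include <stdio.h>
#include <stdint.h>

#define SCALE_SHIFT	3
#define SHADOW_OFFSET	0xdffffc0000000000ULL	/* example value only */

int main(void)
{
	uint64_t addr = 0xffff800000000000ULL;	/* some kernel virtual address */
	uint64_t shadow = (addr >> SCALE_SHIFT) + SHADOW_OFFSET;
	/* KASAN_SHADOW_END corresponds to the upper bound of the address space. */
	uint64_t shadow_end = (1ULL << (64 - SCALE_SHIFT)) + SHADOW_OFFSET;

	printf("shadow(%#llx) = %#llx, shadow end = %#llx\n",
	       (unsigned long long)addr, (unsigned long long)shadow,
	       (unsigned long long)shadow_end);
	return 0;
}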
@@ -10,7 +10,7 @@
 /*
  * Section size must be at least 512MB for 64K base
  * page size config. Otherwise it will be less than
- * MAX_ORDER and the build process will fail.
+ * MAX_PAGE_ORDER and the build process will fail.
  */
 #ifdef CONFIG_ARM64_64K_PAGES
 #define SECTION_SIZE_BITS 29
@@ -16,7 +16,7 @@ struct hyp_pool {
 	 * API at EL2.
 	 */
 	hyp_spinlock_t lock;
-	struct list_head free_area[MAX_ORDER + 1];
+	struct list_head free_area[NR_PAGE_ORDERS];
 	phys_addr_t range_start;
 	phys_addr_t range_end;
 	unsigned short max_order;
@@ -228,7 +228,8 @@ int hyp_pool_init(struct hyp_pool *pool, u64 pfn, unsigned int nr_pages,
 	int i;
 
 	hyp_spin_lock_init(&pool->lock);
-	pool->max_order = min(MAX_ORDER, get_order(nr_pages << PAGE_SHIFT));
+	pool->max_order = min(MAX_PAGE_ORDER,
+			      get_order(nr_pages << PAGE_SHIFT));
 	for (i = 0; i <= pool->max_order; i++)
 		INIT_LIST_HEAD(&pool->free_area[i]);
 	pool->range_start = phys;
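NR_PAGE_ORDERS stands in for the old "MAX_ORDER + 1" array sizing, which is also why loop bounds elsewhere in this series change from "order <= MAX_ORDER" to "order < NR_PAGE_ORDERS". A tiny standalone illustration, with the common default order limit hard-coded as an assumption:

/* Illustration of the NR_PAGE_ORDERS sizing used above: one free list per
 * order 0..MAX_PAGE_ORDER, i.e. MAX_PAGE_ORDER + 1 lists in total.  The
 * value 10 is a hard-coded assumption matching the usual default.
 */
#include <stdio.h>

#define MAX_PAGE_ORDER	10
#define NR_PAGE_ORDERS	(MAX_PAGE_ORDER + 1)

int main(void)
{
	int free_count[NR_PAGE_ORDERS];

	/* "order < NR_PAGE_ORDERS" replaces the old "order <= MAX_ORDER". */
	for (int order = 0; order < NR_PAGE_ORDERS; order++)
		free_count[order] = 0;

	printf("orders 0..%d -> %zu free lists\n",
	       MAX_PAGE_ORDER, sizeof(free_count) / sizeof(free_count[0]));
	return 0;
}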
@@ -51,7 +51,7 @@ void __init arm64_hugetlb_cma_reserve(void)
 	 * page allocator. Just warn if there is any change
 	 * breaking this assumption.
 	 */
-	WARN_ON(order <= MAX_ORDER);
+	WARN_ON(order <= MAX_PAGE_ORDER);
 	hugetlb_cma_reserve(order);
 }
 #endif /* CONFIG_CMA */
@@ -170,6 +170,11 @@ asmlinkage void __init kasan_early_init(void)
 {
 	BUILD_BUG_ON(KASAN_SHADOW_OFFSET !=
 		KASAN_SHADOW_END - (1UL << (64 - KASAN_SHADOW_SCALE_SHIFT)));
+	/*
+	 * We cannot check the actual value of KASAN_SHADOW_START during build,
+	 * as it depends on vabits_actual. As a best-effort approach, check
+	 * potential values calculated based on VA_BITS and VA_BITS_MIN.
+	 */
 	BUILD_BUG_ON(!IS_ALIGNED(_KASAN_SHADOW_START(VA_BITS), PGDIR_SIZE));
 	BUILD_BUG_ON(!IS_ALIGNED(_KASAN_SHADOW_START(VA_BITS_MIN), PGDIR_SIZE));
 	BUILD_BUG_ON(!IS_ALIGNED(KASAN_SHADOW_END, PGDIR_SIZE));
@@ -523,6 +523,7 @@ static inline pmd_t pmd_wrprotect(pmd_t pmd)
 	return pmd;
 }
 
+#define pmd_dirty pmd_dirty
 static inline int pmd_dirty(pmd_t pmd)
 {
 	return !!(pmd_val(pmd) & (_PAGE_DIRTY | _PAGE_MODIFIED));
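The bare "#define pmd_dirty pmd_dirty" added here (and repeated for several other architectures below) is the usual self-definition idiom: the macro lets generic code detect the architecture override with #ifndef. A standalone illustration of the idiom, deliberately not the kernel's actual headers (there the helper takes a pmd_t and the fallback lives in include/linux/pgtable.h):

/* Standalone illustration of the "#define foo foo" override-detection idiom.
 * The self-referential macro does not recurse; it merely makes "#ifndef
 * pmd_dirty" false so the generic fallback below is skipped.
 */
#include <stdio.h>
#include <stdbool.h>

/* "Architecture" side: provide the helper and announce it. */
static inline bool pmd_dirty(void) { return true; }
#define pmd_dirty pmd_dirty

/* "Generic" side: only supply a fallback when no override was announced. */
#ifndef pmd_dirty
static inline bool pmd_dirty(void) { return false; }
#endif

int main(void)
{
	printf("pmd_dirty() -> %d\n", pmd_dirty());
	return 0;
}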
@@ -226,32 +226,6 @@ static void __init node_mem_init(unsigned int node)
 
 #ifdef CONFIG_ACPI_NUMA
 
-/*
- * Sanity check to catch more bad NUMA configurations (they are amazingly
- * common). Make sure the nodes cover all memory.
- */
-static bool __init numa_meminfo_cover_memory(const struct numa_meminfo *mi)
-{
-	int i;
-	u64 numaram, biosram;
-
-	numaram = 0;
-	for (i = 0; i < mi->nr_blks; i++) {
-		u64 s = mi->blk[i].start >> PAGE_SHIFT;
-		u64 e = mi->blk[i].end >> PAGE_SHIFT;
-
-		numaram += e - s;
-		numaram -= __absent_pages_in_range(mi->blk[i].nid, s, e);
-		if ((s64)numaram < 0)
-			numaram = 0;
-	}
-	max_pfn = max_low_pfn;
-	biosram = max_pfn - absent_pages_in_range(0, max_pfn);
-
-	BUG_ON((s64)(biosram - numaram) >= (1 << (20 - PAGE_SHIFT)));
-	return true;
-}
-
 static void __init add_node_intersection(u32 node, u64 start, u64 size, u32 type)
 {
 	static unsigned long num_physpages;
@@ -396,7 +370,7 @@ int __init init_numa_memory(void)
 		return -EINVAL;
 
 	init_node_memblock();
-	if (numa_meminfo_cover_memory(&numa_meminfo) == false)
+	if (!memblock_validate_numa_coverage(SZ_1M))
 		return -EINVAL;
 
 	for_each_node_mask(node, node_possible_map) {
@@ -402,7 +402,7 @@ config ARCH_FORCE_MAX_ORDER
 	default "10"
 	help
 	  The kernel page allocator limits the size of maximal physically
-	  contiguous allocations. The limit is called MAX_ORDER and it
+	  contiguous allocations. The limit is called MAX_PAGE_ORDER and it
 	  defines the maximal power of two of number of pages that can be
 	  allocated as a single contiguous block. This option allows
 	  overriding the default setting when ability to allocate very
@@ -655,6 +655,7 @@ static inline pmd_t pmd_mkwrite_novma(pmd_t pmd)
 	return pmd;
 }
 
+#define pmd_dirty pmd_dirty
 static inline int pmd_dirty(pmd_t pmd)
 {
 	return !!(pmd_val(pmd) & _PAGE_MODIFIED);
@@ -50,7 +50,7 @@ config ARCH_FORCE_MAX_ORDER
 	default "10"
 	help
 	  The kernel page allocator limits the size of maximal physically
-	  contiguous allocations. The limit is called MAX_ORDER and it
+	  contiguous allocations. The limit is called MAX_PAGE_ORDER and it
 	  defines the maximal power of two of number of pages that can be
 	  allocated as a single contiguous block. This option allows
 	  overriding the default setting when ability to allocate very
@@ -916,7 +916,7 @@ config ARCH_FORCE_MAX_ORDER
 	default "10"
 	help
 	  The kernel page allocator limits the size of maximal physically
-	  contiguous allocations. The limit is called MAX_ORDER and it
+	  contiguous allocations. The limit is called MAX_PAGE_ORDER and it
 	  defines the maximal power of two of number of pages that can be
 	  allocated as a single contiguous block. This option allows
 	  overriding the default setting when ability to allocate very
@@ -97,7 +97,7 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua,
 	}
 
 	mmap_read_lock(mm);
-	chunk = (1UL << (PAGE_SHIFT + MAX_ORDER)) /
+	chunk = (1UL << (PAGE_SHIFT + MAX_PAGE_ORDER)) /
 			sizeof(struct vm_area_struct *);
 	chunk = min(chunk, entries);
 	for (entry = 0; entry < entries; entry += chunk) {
@@ -615,7 +615,7 @@ void __init gigantic_hugetlb_cma_reserve(void)
 	order = mmu_psize_to_shift(MMU_PAGE_16G) - PAGE_SHIFT;
 
 	if (order) {
-		VM_WARN_ON(order <= MAX_ORDER);
+		VM_WARN_ON(order <= MAX_PAGE_ORDER);
 		hugetlb_cma_reserve(order);
 	}
 }
@@ -1389,7 +1389,7 @@ static long pnv_pci_ioda2_setup_default_config(struct pnv_ioda_pe *pe)
 	 * DMA window can be larger than available memory, which will
 	 * cause errors later.
 	 */
-	const u64 maxblock = 1UL << (PAGE_SHIFT + MAX_ORDER);
+	const u64 maxblock = 1UL << (PAGE_SHIFT + MAX_PAGE_ORDER);
 
 	/*
 	 * We create the default window as big as we can. The constraint is
@@ -673,6 +673,7 @@ static inline int pmd_write(pmd_t pmd)
 	return pte_write(pmd_pte(pmd));
 }
 
+#define pmd_dirty pmd_dirty
 static inline int pmd_dirty(pmd_t pmd)
 {
 	return pte_dirty(pmd_pte(pmd));
@@ -770,6 +770,7 @@ static inline int pud_write(pud_t pud)
 	return (pud_val(pud) & _REGION3_ENTRY_WRITE) != 0;
 }
 
+#define pmd_dirty pmd_dirty
 static inline int pmd_dirty(pmd_t pmd)
 {
 	return (pmd_val(pmd) & _SEGMENT_ENTRY_DIRTY) != 0;
@@ -26,7 +26,7 @@ config ARCH_FORCE_MAX_ORDER
 	default "10"
 	help
 	  The kernel page allocator limits the size of maximal physically
-	  contiguous allocations. The limit is called MAX_ORDER and it
+	  contiguous allocations. The limit is called MAX_PAGE_ORDER and it
 	  defines the maximal power of two of number of pages that can be
 	  allocated as a single contiguous block. This option allows
 	  overriding the default setting when ability to allocate very
@@ -277,7 +277,7 @@ config ARCH_FORCE_MAX_ORDER
 	default "12"
 	help
 	  The kernel page allocator limits the size of maximal physically
-	  contiguous allocations. The limit is called MAX_ORDER and it
+	  contiguous allocations. The limit is called MAX_PAGE_ORDER and it
 	  defines the maximal power of two of number of pages that can be
 	  allocated as a single contiguous block. This option allows
 	  overriding the default setting when ability to allocate very
@@ -706,6 +706,7 @@ static inline unsigned long pmd_write(pmd_t pmd)
 #define pud_write(pud)	pte_write(__pte(pud_val(pud)))
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#define pmd_dirty pmd_dirty
 static inline unsigned long pmd_dirty(pmd_t pmd)
 {
 	pte_t pte = __pte(pmd_val(pmd));
@@ -194,7 +194,7 @@ static void *dma_4v_alloc_coherent(struct device *dev, size_t size,
 
 	size = IO_PAGE_ALIGN(size);
 	order = get_order(size);
-	if (unlikely(order > MAX_ORDER))
+	if (unlikely(order > MAX_PAGE_ORDER))
 		return NULL;
 
 	npages = size >> IO_PAGE_SHIFT;
@@ -897,7 +897,7 @@ void __init cheetah_ecache_flush_init(void)
 
 	/* Now allocate error trap reporting scoreboard. */
 	sz = NR_CPUS * (2 * sizeof(struct cheetah_err_info));
-	for (order = 0; order <= MAX_ORDER; order++) {
+	for (order = 0; order < NR_PAGE_ORDERS; order++) {
 		if ((PAGE_SIZE << order) >= sz)
 			break;
 	}
@@ -402,8 +402,8 @@ void tsb_grow(struct mm_struct *mm, unsigned long tsb_index, unsigned long rss)
 	unsigned long new_rss_limit;
 	gfp_t gfp_flags;
 
-	if (max_tsb_size > PAGE_SIZE << MAX_ORDER)
-		max_tsb_size = PAGE_SIZE << MAX_ORDER;
+	if (max_tsb_size > PAGE_SIZE << MAX_PAGE_ORDER)
+		max_tsb_size = PAGE_SIZE << MAX_PAGE_ORDER;
 
 	new_cache_index = 0;
 	for (new_size = 8192; new_size < max_tsb_size; new_size <<= 1UL) {
@@ -373,10 +373,10 @@ int __init linux_main(int argc, char **argv)
 	max_physmem = TASK_SIZE - uml_physmem - iomem_size - MIN_VMALLOC;
 
 	/*
-	 * Zones have to begin on a 1 << MAX_ORDER page boundary,
+	 * Zones have to begin on a 1 << MAX_PAGE_ORDER page boundary,
 	 * so this makes sure that's true for highmem
 	 */
-	max_physmem &= ~((1 << (PAGE_SHIFT + MAX_ORDER)) - 1);
+	max_physmem &= ~((1 << (PAGE_SHIFT + MAX_PAGE_ORDER)) - 1);
 	if (physmem_size + iomem_size > max_physmem) {
 		highmem = physmem_size + iomem_size - max_physmem;
 		physmem_size -= highmem;
@@ -88,6 +88,7 @@ config X86
 	select ARCH_HAS_PMEM_API if X86_64
 	select ARCH_HAS_PTE_DEVMAP if X86_64
 	select ARCH_HAS_PTE_SPECIAL
+	select ARCH_HAS_HW_PTE_YOUNG
 	select ARCH_HAS_NONLEAF_PMD_YOUNG if PGTABLE_LEVELS > 2
 	select ARCH_HAS_UACCESS_FLUSHCACHE if X86_64
 	select ARCH_HAS_COPY_MC if X86_64
@@ -141,6 +141,7 @@ static inline int pte_young(pte_t pte)
 	return pte_flags(pte) & _PAGE_ACCESSED;
 }
 
+#define pmd_dirty pmd_dirty
 static inline bool pmd_dirty(pmd_t pmd)
 {
 	return pmd_flags(pmd) & _PAGE_DIRTY_BITS;
@@ -1679,12 +1680,6 @@ static inline bool arch_has_pfn_modify_check(void)
 	return boot_cpu_has_bug(X86_BUG_L1TF);
 }
 
-#define arch_has_hw_pte_young arch_has_hw_pte_young
-static inline bool arch_has_hw_pte_young(void)
-{
-	return true;
-}
-
 #define arch_check_zapped_pte arch_check_zapped_pte
 void arch_check_zapped_pte(struct vm_area_struct *vma, pte_t pte);
 
@@ -449,37 +449,6 @@ int __node_distance(int from, int to)
 }
 EXPORT_SYMBOL(__node_distance);
 
-/*
- * Sanity check to catch more bad NUMA configurations (they are amazingly
- * common). Make sure the nodes cover all memory.
- */
-static bool __init numa_meminfo_cover_memory(const struct numa_meminfo *mi)
-{
-	u64 numaram, e820ram;
-	int i;
-
-	numaram = 0;
-	for (i = 0; i < mi->nr_blks; i++) {
-		u64 s = mi->blk[i].start >> PAGE_SHIFT;
-		u64 e = mi->blk[i].end >> PAGE_SHIFT;
-		numaram += e - s;
-		numaram -= __absent_pages_in_range(mi->blk[i].nid, s, e);
-		if ((s64)numaram < 0)
-			numaram = 0;
-	}
-
-	e820ram = max_pfn - absent_pages_in_range(0, max_pfn);
-
-	/* We seem to lose 3 pages somewhere. Allow 1M of slack. */
-	if ((s64)(e820ram - numaram) >= (1 << (20 - PAGE_SHIFT))) {
-		printk(KERN_ERR "NUMA: nodes only cover %LuMB of your %LuMB e820 RAM. Not used.\n",
-		       (numaram << PAGE_SHIFT) >> 20,
-		       (e820ram << PAGE_SHIFT) >> 20);
-		return false;
-	}
-	return true;
-}
-
 /*
  * Mark all currently memblock-reserved physical memory (which covers the
  * kernel's own memory ranges) as hot-unswappable.
@@ -585,7 +554,8 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
 			return -EINVAL;
 		}
 	}
-	if (!numa_meminfo_cover_memory(mi))
+
+	if (!memblock_validate_numa_coverage(SZ_1M))
 		return -EINVAL;
 
 	/* Finally register nodes. */
@@ -793,7 +793,7 @@ config ARCH_FORCE_MAX_ORDER
 	default "10"
 	help
 	  The kernel page allocator limits the size of maximal physically
-	  contiguous allocations. The limit is called MAX_ORDER and it
+	  contiguous allocations. The limit is called MAX_PAGE_ORDER and it
 	  defines the maximal power of two of number of pages that can be
 	  allocated as a single contiguous block. This option allows
 	  overriding the default setting when ability to allocate very
@@ -18,6 +18,8 @@
 #define KASAN_SHADOW_START	(XCHAL_PAGE_TABLE_VADDR + XCHAL_PAGE_TABLE_SIZE)
 /* Size of the shadow map */
 #define KASAN_SHADOW_SIZE	(-KASAN_START_VADDR >> KASAN_SHADOW_SCALE_SHIFT)
+/* End of the shadow map */
+#define KASAN_SHADOW_END	(KASAN_SHADOW_START + KASAN_SHADOW_SIZE)
 /* Offset for mem to shadow address transformation */
 #define KASAN_SHADOW_OFFSET	__XTENSA_UL_CONST(CONFIG_KASAN_SHADOW_OFFSET)
 
block/fops.c
@@ -410,9 +410,24 @@ static int blkdev_get_block(struct inode *inode, sector_t iblock,
 	return 0;
 }
 
-static int blkdev_writepage(struct page *page, struct writeback_control *wbc)
+/*
+ * We cannot call mpage_writepages() as it does not take the buffer lock.
+ * We must use block_write_full_folio() directly which holds the buffer
+ * lock. The buffer lock provides the synchronisation with writeback
+ * that filesystems rely on when they use the blockdev's mapping.
+ */
+static int blkdev_writepages(struct address_space *mapping,
+		struct writeback_control *wbc)
 {
-	return block_write_full_page(page, blkdev_get_block, wbc);
+	struct blk_plug plug;
+	int err;
+
+	blk_start_plug(&plug);
+	err = write_cache_pages(mapping, wbc, block_write_full_folio,
+			blkdev_get_block);
+	blk_finish_plug(&plug);
+
+	return err;
 }
 
 static int blkdev_read_folio(struct file *file, struct folio *folio)
@@ -449,7 +464,7 @@ const struct address_space_operations def_blk_aops = {
 	.invalidate_folio = block_invalidate_folio,
 	.read_folio = blkdev_read_folio,
 	.readahead = blkdev_readahead,
-	.writepage = blkdev_writepage,
+	.writepages = blkdev_writepages,
 	.write_begin = blkdev_write_begin,
 	.write_end = blkdev_write_end,
 	.migrate_folio = buffer_migrate_folio_norefs,
@@ -500,7 +515,7 @@ const struct address_space_operations def_blk_aops = {
 	.readahead = blkdev_readahead,
 	.writepages = blkdev_writepages,
 	.is_partially_uptodate = iomap_is_partially_uptodate,
-	.error_remove_page = generic_error_remove_page,
+	.error_remove_folio = generic_error_remove_folio,
 	.migrate_folio = filemap_migrate_folio,
 };
 #endif /* CONFIG_BUFFER_HEAD */
@@ -451,7 +451,7 @@ static int create_sgt(struct qaic_device *qdev, struct sg_table **sgt_out, u64 s
 		 * later
 		 */
 		buf_extra = (PAGE_SIZE - size % PAGE_SIZE) % PAGE_SIZE;
-		max_order = min(MAX_ORDER - 1, get_order(size));
+		max_order = min(MAX_PAGE_ORDER - 1, get_order(size));
 	} else {
 		/* allocate a single page for book keeping */
 		nr_pages = 1;
@@ -234,7 +234,7 @@ static int binder_update_page_range(struct binder_alloc *alloc, int allocate,
 		if (page->page_ptr) {
 			trace_binder_alloc_lru_start(alloc, index);
 
-			on_lru = list_lru_del(&binder_alloc_lru, &page->lru);
+			on_lru = list_lru_del_obj(&binder_alloc_lru, &page->lru);
 			WARN_ON(!on_lru);
 
 			trace_binder_alloc_lru_end(alloc, index);
@@ -285,7 +285,7 @@ free_range:
 
 		trace_binder_free_lru_start(alloc, index);
 
-		ret = list_lru_add(&binder_alloc_lru, &page->lru);
+		ret = list_lru_add_obj(&binder_alloc_lru, &page->lru);
 		WARN_ON(!ret);
 
 		trace_binder_free_lru_end(alloc, index);
@@ -848,7 +848,7 @@ void binder_alloc_deferred_release(struct binder_alloc *alloc)
 		if (!alloc->pages[i].page_ptr)
 			continue;
 
-		on_lru = list_lru_del(&binder_alloc_lru,
+		on_lru = list_lru_del_obj(&binder_alloc_lru,
 				      &alloc->pages[i].lru);
 		page_addr = alloc->buffer + i * PAGE_SIZE;
 		binder_alloc_debug(BINDER_DEBUG_BUFFER_ALLOC,
@@ -1287,4 +1287,3 @@ int binder_alloc_copy_from_buffer(struct binder_alloc *alloc,
 	return binder_alloc_do_buffer_copy(alloc, false, buffer, buffer_offset,
 					   dest, bytes);
 }
-
@@ -226,8 +226,8 @@ static ssize_t regmap_read_debugfs(struct regmap *map, unsigned int from,
 	if (*ppos < 0 || !count)
 		return -EINVAL;
 
-	if (count > (PAGE_SIZE << MAX_ORDER))
-		count = PAGE_SIZE << MAX_ORDER;
+	if (count > (PAGE_SIZE << MAX_PAGE_ORDER))
+		count = PAGE_SIZE << MAX_PAGE_ORDER;
 
 	buf = kmalloc(count, GFP_KERNEL);
 	if (!buf)
@@ -373,8 +373,8 @@ static ssize_t regmap_reg_ranges_read_file(struct file *file,
 	if (*ppos < 0 || !count)
 		return -EINVAL;
 
-	if (count > (PAGE_SIZE << MAX_ORDER))
-		count = PAGE_SIZE << MAX_ORDER;
+	if (count > (PAGE_SIZE << MAX_PAGE_ORDER))
+		count = PAGE_SIZE << MAX_PAGE_ORDER;
 
 	buf = kmalloc(count, GFP_KERNEL);
 	if (!buf)
@@ -3079,7 +3079,7 @@ static void raw_cmd_free(struct floppy_raw_cmd **ptr)
 	}
 }
 
-#define MAX_LEN (1UL << MAX_ORDER << PAGE_SHIFT)
+#define MAX_LEN (1UL << MAX_PAGE_ORDER << PAGE_SHIFT)
 
 static int raw_cmd_copyin(int cmd, void __user *param,
 			  struct floppy_raw_cmd **rcmd)
@@ -59,8 +59,8 @@ config ZRAM_WRITEBACK
 	bool "Write back incompressible or idle page to backing device"
 	depends on ZRAM
 	help
-	  With incompressible page, there is no memory saving to keep it
-	  in memory. Instead, write it out to backing device.
+	  This lets zram entries (incompressible or idle pages) be written
+	  back to a backing device, helping save memory.
 	  For this feature, admin should set up backing device via
 	  /sys/block/zramX/backing_dev.
 
@@ -69,9 +69,18 @@ config ZRAM_WRITEBACK
 
 	  See Documentation/admin-guide/blockdev/zram.rst for more information.
 
+config ZRAM_TRACK_ENTRY_ACTIME
+	bool "Track access time of zram entries"
+	depends on ZRAM
+	help
+	  With this feature zram tracks access time of every stored
+	  entry (page), which can be used for a more fine grained IDLE
+	  pages writeback.
+
 config ZRAM_MEMORY_TRACKING
 	bool "Track zRam block status"
 	depends on ZRAM && DEBUG_FS
+	select ZRAM_TRACK_ENTRY_ACTIME
 	help
 	  With this feature, admin can track the state of allocated blocks
 	  of zRAM. Admin could see the information via
@@ -86,4 +95,4 @@ config ZRAM_MULTI_COMP
 	  This will enable multi-compression streams, so that ZRAM can
 	  re-compress pages using a potentially slower but more effective
 	  compression algorithm. Note, that IDLE page recompression
-	  requires ZRAM_MEMORY_TRACKING.
+	  requires ZRAM_TRACK_ENTRY_ACTIME.
@@ -174,6 +174,14 @@ static inline u32 zram_get_priority(struct zram *zram, u32 index)
 	return prio & ZRAM_COMP_PRIORITY_MASK;
 }
 
+static void zram_accessed(struct zram *zram, u32 index)
+{
+	zram_clear_flag(zram, index, ZRAM_IDLE);
+#ifdef CONFIG_ZRAM_TRACK_ENTRY_ACTIME
+	zram->table[index].ac_time = ktime_get_boottime();
+#endif
+}
+
 static inline void update_used_max(struct zram *zram,
 				   const unsigned long pages)
 {
@@ -293,8 +301,9 @@ static void mark_idle(struct zram *zram, ktime_t cutoff)
 		zram_slot_lock(zram, index);
 		if (zram_allocated(zram, index) &&
 		    !zram_test_flag(zram, index, ZRAM_UNDER_WB)) {
-#ifdef CONFIG_ZRAM_MEMORY_TRACKING
-			is_idle = !cutoff || ktime_after(cutoff, zram->table[index].ac_time);
+#ifdef CONFIG_ZRAM_TRACK_ENTRY_ACTIME
+			is_idle = !cutoff || ktime_after(cutoff,
+							 zram->table[index].ac_time);
 #endif
 			if (is_idle)
 				zram_set_flag(zram, index, ZRAM_IDLE);
@@ -317,7 +326,7 @@ static ssize_t idle_store(struct device *dev,
 	 */
 	u64 age_sec;
 
-	if (IS_ENABLED(CONFIG_ZRAM_MEMORY_TRACKING) && !kstrtoull(buf, 0, &age_sec))
+	if (IS_ENABLED(CONFIG_ZRAM_TRACK_ENTRY_ACTIME) && !kstrtoull(buf, 0, &age_sec))
 		cutoff_time = ktime_sub(ktime_get_boottime(),
 					ns_to_ktime(age_sec * NSEC_PER_SEC));
 	else
@@ -841,12 +850,6 @@ static void zram_debugfs_destroy(void)
 	debugfs_remove_recursive(zram_debugfs_root);
 }
 
-static void zram_accessed(struct zram *zram, u32 index)
-{
-	zram_clear_flag(zram, index, ZRAM_IDLE);
-	zram->table[index].ac_time = ktime_get_boottime();
-}
-
 static ssize_t read_block_state(struct file *file, char __user *buf,
 				size_t count, loff_t *ppos)
 {
@@ -930,10 +933,6 @@ static void zram_debugfs_unregister(struct zram *zram)
 #else
 static void zram_debugfs_create(void) {};
 static void zram_debugfs_destroy(void) {};
-static void zram_accessed(struct zram *zram, u32 index)
-{
-	zram_clear_flag(zram, index, ZRAM_IDLE);
-};
 static void zram_debugfs_register(struct zram *zram) {};
 static void zram_debugfs_unregister(struct zram *zram) {};
 #endif
@@ -1254,7 +1253,7 @@ static void zram_free_page(struct zram *zram, size_t index)
 {
 	unsigned long handle;
 
-#ifdef CONFIG_ZRAM_MEMORY_TRACKING
+#ifdef CONFIG_ZRAM_TRACK_ENTRY_ACTIME
 	zram->table[index].ac_time = 0;
 #endif
 	if (zram_test_flag(zram, index, ZRAM_IDLE))
@@ -1322,9 +1321,9 @@ static int zram_read_from_zspool(struct zram *zram, struct page *page,
 		void *mem;
 
 		value = handle ? zram_get_element(zram, index) : 0;
-		mem = kmap_atomic(page);
+		mem = kmap_local_page(page);
 		zram_fill_page(mem, PAGE_SIZE, value);
-		kunmap_atomic(mem);
+		kunmap_local(mem);
 		return 0;
 	}
 
@@ -1337,14 +1336,14 @@ static int zram_read_from_zspool(struct zram *zram, struct page *page,
 
 	src = zs_map_object(zram->mem_pool, handle, ZS_MM_RO);
 	if (size == PAGE_SIZE) {
-		dst = kmap_atomic(page);
+		dst = kmap_local_page(page);
 		memcpy(dst, src, PAGE_SIZE);
-		kunmap_atomic(dst);
+		kunmap_local(dst);
 		ret = 0;
 	} else {
-		dst = kmap_atomic(page);
+		dst = kmap_local_page(page);
 		ret = zcomp_decompress(zstrm, src, size, dst);
-		kunmap_atomic(dst);
+		kunmap_local(dst);
 		zcomp_stream_put(zram->comps[prio]);
 	}
 	zs_unmap_object(zram->mem_pool, handle);
@ -1417,21 +1416,21 @@ static int zram_write_page(struct zram *zram, struct page *page, u32 index)
|
|||||||
unsigned long element = 0;
|
unsigned long element = 0;
|
||||||
enum zram_pageflags flags = 0;
|
enum zram_pageflags flags = 0;
|
||||||
|
|
||||||
mem = kmap_atomic(page);
|
mem = kmap_local_page(page);
|
||||||
if (page_same_filled(mem, &element)) {
|
if (page_same_filled(mem, &element)) {
|
||||||
kunmap_atomic(mem);
|
kunmap_local(mem);
|
||||||
/* Free memory associated with this sector now. */
|
/* Free memory associated with this sector now. */
|
||||||
flags = ZRAM_SAME;
|
flags = ZRAM_SAME;
|
||||||
atomic64_inc(&zram->stats.same_pages);
|
atomic64_inc(&zram->stats.same_pages);
|
||||||
goto out;
|
goto out;
|
||||||
}
|
}
|
||||||
kunmap_atomic(mem);
|
kunmap_local(mem);
|
||||||
|
|
||||||
compress_again:
|
compress_again:
|
||||||
zstrm = zcomp_stream_get(zram->comps[ZRAM_PRIMARY_COMP]);
|
zstrm = zcomp_stream_get(zram->comps[ZRAM_PRIMARY_COMP]);
|
||||||
src = kmap_atomic(page);
|
src = kmap_local_page(page);
|
||||||
ret = zcomp_compress(zstrm, src, &comp_len);
|
ret = zcomp_compress(zstrm, src, &comp_len);
|
||||||
kunmap_atomic(src);
|
kunmap_local(src);
|
||||||
|
|
||||||
if (unlikely(ret)) {
|
if (unlikely(ret)) {
|
||||||
zcomp_stream_put(zram->comps[ZRAM_PRIMARY_COMP]);
|
zcomp_stream_put(zram->comps[ZRAM_PRIMARY_COMP]);
|
||||||
@ -1495,10 +1494,10 @@ compress_again:
|
|||||||
|
|
||||||
src = zstrm->buffer;
|
src = zstrm->buffer;
|
||||||
if (comp_len == PAGE_SIZE)
|
if (comp_len == PAGE_SIZE)
|
||||||
src = kmap_atomic(page);
|
src = kmap_local_page(page);
|
||||||
memcpy(dst, src, comp_len);
|
memcpy(dst, src, comp_len);
|
||||||
if (comp_len == PAGE_SIZE)
|
if (comp_len == PAGE_SIZE)
|
||||||
kunmap_atomic(src);
|
kunmap_local(src);
|
||||||
|
|
||||||
zcomp_stream_put(zram->comps[ZRAM_PRIMARY_COMP]);
|
zcomp_stream_put(zram->comps[ZRAM_PRIMARY_COMP]);
|
||||||
zs_unmap_object(zram->mem_pool, handle);
|
zs_unmap_object(zram->mem_pool, handle);
|
||||||
@ -1615,9 +1614,9 @@ static int zram_recompress(struct zram *zram, u32 index, struct page *page,
|
|||||||
|
|
||||||
num_recomps++;
|
num_recomps++;
|
||||||
zstrm = zcomp_stream_get(zram->comps[prio]);
|
zstrm = zcomp_stream_get(zram->comps[prio]);
|
||||||
src = kmap_atomic(page);
|
src = kmap_local_page(page);
|
||||||
ret = zcomp_compress(zstrm, src, &comp_len_new);
|
ret = zcomp_compress(zstrm, src, &comp_len_new);
|
||||||
kunmap_atomic(src);
|
kunmap_local(src);
|
||||||
|
|
||||||
if (ret) {
|
if (ret) {
|
||||||
zcomp_stream_put(zram->comps[prio]);
|
zcomp_stream_put(zram->comps[prio]);
|
||||||
|
@@ -69,7 +69,7 @@ struct zram_table_entry {
unsigned long element;
};
unsigned long flags;
-#ifdef CONFIG_ZRAM_MEMORY_TRACKING
+#ifdef CONFIG_ZRAM_TRACK_ENTRY_ACTIME
ktime_t ac_time;
#endif
};
@@ -906,7 +906,7 @@ static int sev_ioctl_do_get_id2(struct sev_issue_cmd *argp)
/*
* The length of the ID shouldn't be assumed by software since
* it may change in the future. The allocation size is limited
-* to 1 << (PAGE_SHIFT + MAX_ORDER) by the page allocator.
+* to 1 << (PAGE_SHIFT + MAX_PAGE_ORDER) by the page allocator.
* If the allocation fails, simply return ENOMEM rather than
* warning in the kernel log.
*/
@@ -70,11 +70,11 @@ struct hisi_acc_sgl_pool *hisi_acc_create_sgl_pool(struct device *dev,
HISI_ACC_SGL_ALIGN_SIZE);

/*
-* the pool may allocate a block of memory of size PAGE_SIZE * 2^MAX_ORDER,
+* the pool may allocate a block of memory of size PAGE_SIZE * 2^MAX_PAGE_ORDER,
* block size may exceed 2^31 on ia64, so the max of block size is 2^31
*/
-block_size = 1 << (PAGE_SHIFT + MAX_ORDER < 32 ?
-PAGE_SHIFT + MAX_ORDER : 31);
+block_size = 1 << (PAGE_SHIFT + MAX_PAGE_ORDER < 32 ?
+PAGE_SHIFT + MAX_PAGE_ORDER : 31);
sgl_num_per_block = block_size / sgl_size;
block_num = count / sgl_num_per_block;
remain_sgl = count % sgl_num_per_block;
@@ -367,6 +367,7 @@ static ssize_t create_store(struct device *dev, struct device_attribute *attr,
.dax_region = dax_region,
.size = 0,
.id = -1,
+.memmap_on_memory = false,
};
struct dev_dax *dev_dax = devm_create_dev_dax(&data);

@@ -1400,6 +1401,8 @@ struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data)
dev_dax->align = dax_region->align;
ida_init(&dev_dax->ida);

+dev_dax->memmap_on_memory = data->memmap_on_memory;
+
inode = dax_inode(dax_dev);
dev->devt = inode->i_rdev;
dev->bus = &dax_bus_type;
@@ -23,6 +23,7 @@ struct dev_dax_data {
struct dev_pagemap *pgmap;
resource_size_t size;
int id;
+bool memmap_on_memory;
};

struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data);
@@ -26,6 +26,7 @@ static int cxl_dax_region_probe(struct device *dev)
.dax_region = dax_region,
.id = -1,
.size = range_len(&cxlr_dax->hpa_range),
+.memmap_on_memory = true,
};

return PTR_ERR_OR_ZERO(devm_create_dev_dax(&data));
@@ -70,6 +70,7 @@ struct dev_dax {
struct ida ida;
struct device dev;
struct dev_pagemap *pgmap;
+bool memmap_on_memory;
int nr_range;
struct dev_dax_range {
unsigned long pgoff;
@@ -36,6 +36,7 @@ static int dax_hmem_probe(struct platform_device *pdev)
.dax_region = dax_region,
.id = -1,
.size = region_idle ? 0 : range_len(&mri->range),
+.memmap_on_memory = false,
};

return PTR_ERR_OR_ZERO(devm_create_dev_dax(&data));
@@ -12,6 +12,7 @@
#include <linux/mm.h>
#include <linux/mman.h>
#include <linux/memory-tiers.h>
+#include <linux/memory_hotplug.h>
#include "dax-private.h"
#include "bus.h"

@@ -93,6 +94,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
struct dax_kmem_data *data;
struct memory_dev_type *mtype;
int i, rc, mapped = 0;
+mhp_t mhp_flags;
int numa_node;
int adist = MEMTIER_DEFAULT_DAX_ADISTANCE;

@@ -179,12 +181,16 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
*/
res->flags = IORESOURCE_SYSTEM_RAM;

+mhp_flags = MHP_NID_IS_MGID;
+if (dev_dax->memmap_on_memory)
+mhp_flags |= MHP_MEMMAP_ON_MEMORY;
+
/*
* Ensure that future kexec'd kernels will not treat
* this as RAM automatically.
*/
rc = add_memory_driver_managed(data->mgid, range.start,
-range_len(&range), kmem_name, MHP_NID_IS_MGID);
+range_len(&range), kmem_name, mhp_flags);

if (rc) {
dev_warn(dev, "mapping%d: %#llx-%#llx memory add failed\n",
@@ -63,6 +63,7 @@ static struct dev_dax *__dax_pmem_probe(struct device *dev)
.id = id,
.pgmap = &pgmap,
.size = range_len(&range),
+.memmap_on_memory = false,
};

return devm_create_dev_dax(&data);
@@ -36,7 +36,7 @@ static int i915_gem_object_get_pages_internal(struct drm_i915_gem_object *obj)
struct sg_table *st;
struct scatterlist *sg;
unsigned int npages; /* restricted by sg_alloc_table */
-int max_order = MAX_ORDER;
+int max_order = MAX_PAGE_ORDER;
unsigned int max_segment;
gfp_t gfp;

@@ -115,7 +115,7 @@ static int get_huge_pages(struct drm_i915_gem_object *obj)
do {
struct page *page;

-GEM_BUG_ON(order > MAX_ORDER);
+GEM_BUG_ON(order > MAX_PAGE_ORDER);
page = alloc_pages(GFP | __GFP_ZERO, order);
if (!page)
goto err;
@@ -175,7 +175,7 @@ static void ttm_device_init_pools(struct kunit *test)

if (params->pools_init_expected) {
for (int i = 0; i < TTM_NUM_CACHING_TYPES; ++i) {
-for (int j = 0; j <= MAX_ORDER; ++j) {
+for (int j = 0; j < NR_PAGE_ORDERS; ++j) {
pt = pool->caching[i].orders[j];
KUNIT_EXPECT_PTR_EQ(test, pt.pool, pool);
KUNIT_EXPECT_EQ(test, pt.caching, i);
@@ -109,7 +109,7 @@ static const struct ttm_pool_test_case ttm_pool_basic_cases[] = {
},
{
.description = "Above the allocation limit",
-.order = MAX_ORDER + 1,
+.order = MAX_PAGE_ORDER + 1,
},
{
.description = "One page, with coherent DMA mappings enabled",
@@ -118,7 +118,7 @@ static const struct ttm_pool_test_case ttm_pool_basic_cases[] = {
},
{
.description = "Above the allocation limit, with coherent DMA mappings enabled",
-.order = MAX_ORDER + 1,
+.order = MAX_PAGE_ORDER + 1,
.use_dma_alloc = true,
},
};
@@ -165,7 +165,7 @@ static void ttm_pool_alloc_basic(struct kunit *test)
fst_page = tt->pages[0];
last_page = tt->pages[tt->num_pages - 1];

-if (params->order <= MAX_ORDER) {
+if (params->order <= MAX_PAGE_ORDER) {
if (params->use_dma_alloc) {
KUNIT_ASSERT_NOT_NULL(test, (void *)fst_page->private);
KUNIT_ASSERT_NOT_NULL(test, (void *)last_page->private);
@@ -182,7 +182,7 @@ static void ttm_pool_alloc_basic(struct kunit *test)
* order 0 blocks
*/
KUNIT_ASSERT_EQ(test, fst_page->private,
-min_t(unsigned int, MAX_ORDER,
+min_t(unsigned int, MAX_PAGE_ORDER,
params->order));
KUNIT_ASSERT_EQ(test, last_page->private, 0);
}
@@ -65,11 +65,11 @@ module_param(page_pool_size, ulong, 0644);

static atomic_long_t allocated_pages;

-static struct ttm_pool_type global_write_combined[MAX_ORDER + 1];
+static struct ttm_pool_type global_write_combined[NR_PAGE_ORDERS];
-static struct ttm_pool_type global_uncached[MAX_ORDER + 1];
+static struct ttm_pool_type global_uncached[NR_PAGE_ORDERS];

-static struct ttm_pool_type global_dma32_write_combined[MAX_ORDER + 1];
+static struct ttm_pool_type global_dma32_write_combined[NR_PAGE_ORDERS];
-static struct ttm_pool_type global_dma32_uncached[MAX_ORDER + 1];
+static struct ttm_pool_type global_dma32_uncached[NR_PAGE_ORDERS];

static spinlock_t shrinker_lock;
static struct list_head shrinker_list;
@@ -447,7 +447,7 @@ int ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
else
gfp_flags |= GFP_HIGHUSER;

-for (order = min_t(unsigned int, MAX_ORDER, __fls(num_pages));
+for (order = min_t(unsigned int, MAX_PAGE_ORDER, __fls(num_pages));
num_pages;
order = min_t(unsigned int, order, __fls(num_pages))) {
struct ttm_pool_type *pt;
@@ -568,7 +568,7 @@ void ttm_pool_init(struct ttm_pool *pool, struct device *dev,

if (use_dma_alloc || nid != NUMA_NO_NODE) {
for (i = 0; i < TTM_NUM_CACHING_TYPES; ++i)
-for (j = 0; j <= MAX_ORDER; ++j)
+for (j = 0; j < NR_PAGE_ORDERS; ++j)
ttm_pool_type_init(&pool->caching[i].orders[j],
pool, i, j);
}
@@ -601,7 +601,7 @@ void ttm_pool_fini(struct ttm_pool *pool)

if (pool->use_dma_alloc || pool->nid != NUMA_NO_NODE) {
for (i = 0; i < TTM_NUM_CACHING_TYPES; ++i)
-for (j = 0; j <= MAX_ORDER; ++j)
+for (j = 0; j < NR_PAGE_ORDERS; ++j)
ttm_pool_type_fini(&pool->caching[i].orders[j]);
}

@@ -656,7 +656,7 @@ static void ttm_pool_debugfs_header(struct seq_file *m)
unsigned int i;

seq_puts(m, "\t ");
-for (i = 0; i <= MAX_ORDER; ++i)
+for (i = 0; i < NR_PAGE_ORDERS; ++i)
seq_printf(m, " ---%2u---", i);
seq_puts(m, "\n");
}
@@ -667,7 +667,7 @@ static void ttm_pool_debugfs_orders(struct ttm_pool_type *pt,
{
unsigned int i;

-for (i = 0; i <= MAX_ORDER; ++i)
+for (i = 0; i < NR_PAGE_ORDERS; ++i)
seq_printf(m, " %8u", ttm_pool_type_count(&pt[i]));
seq_puts(m, "\n");
}
@@ -776,7 +776,7 @@ int ttm_pool_mgr_init(unsigned long num_pages)
spin_lock_init(&shrinker_lock);
INIT_LIST_HEAD(&shrinker_list);

-for (i = 0; i <= MAX_ORDER; ++i) {
+for (i = 0; i < NR_PAGE_ORDERS; ++i) {
ttm_pool_type_init(&global_write_combined[i], NULL,
ttm_write_combined, i);
ttm_pool_type_init(&global_uncached[i], NULL, ttm_uncached, i);
@@ -816,7 +816,7 @@ void ttm_pool_mgr_fini(void)
{
unsigned int i;

-for (i = 0; i <= MAX_ORDER; ++i) {
+for (i = 0; i < NR_PAGE_ORDERS; ++i) {
ttm_pool_type_fini(&global_write_combined[i]);
ttm_pool_type_fini(&global_uncached[i]);

@@ -188,7 +188,7 @@
#ifdef CONFIG_CMA_ALIGNMENT
#define Q_MAX_SZ_SHIFT (PAGE_SHIFT + CONFIG_CMA_ALIGNMENT)
#else
-#define Q_MAX_SZ_SHIFT (PAGE_SHIFT + MAX_ORDER)
+#define Q_MAX_SZ_SHIFT (PAGE_SHIFT + MAX_PAGE_ORDER)
#endif

/*
@@ -884,7 +884,7 @@ static struct page **__iommu_dma_alloc_pages(struct device *dev,
struct page **pages;
unsigned int i = 0, nid = dev_to_node(dev);

-order_mask &= GENMASK(MAX_ORDER, 0);
+order_mask &= GENMASK(MAX_PAGE_ORDER, 0);
if (!order_mask)
return NULL;

@@ -2465,8 +2465,8 @@ static bool its_parse_indirect_baser(struct its_node *its,
* feature is not supported by hardware.
*/
new_order = max_t(u32, get_order(esz << ids), new_order);
-if (new_order > MAX_ORDER) {
+if (new_order > MAX_PAGE_ORDER) {
-new_order = MAX_ORDER;
+new_order = MAX_PAGE_ORDER;
ids = ilog2(PAGE_ORDER_TO_SIZE(new_order) / (int)esz);
pr_warn("ITS@%pa: %s Table too large, reduce ids %llu->%u\n",
&its->phys_base, its_base_type_string[type],
@@ -1170,7 +1170,7 @@ static void __cache_size_refresh(void)
* If the allocation may fail we use __get_free_pages. Memory fragmentation
* won't have a fatal effect here, but it just causes flushes of some other
* buffers and more I/O will be performed. Don't use __get_free_pages if it
-* always fails (i.e. order > MAX_ORDER).
+* always fails (i.e. order > MAX_PAGE_ORDER).
*
* If the allocation shouldn't fail we use __vmalloc. This is only for the
* initial reserve allocation, so there's no risk of wasting all vmalloc
@@ -1673,7 +1673,7 @@ static struct bio *crypt_alloc_buffer(struct dm_crypt_io *io, unsigned int size)
unsigned int nr_iovecs = (size + PAGE_SIZE - 1) >> PAGE_SHIFT;
gfp_t gfp_mask = GFP_NOWAIT | __GFP_HIGHMEM;
unsigned int remaining_size;
-unsigned int order = MAX_ORDER;
+unsigned int order = MAX_PAGE_ORDER;

retry:
if (unlikely(gfp_mask & __GFP_DIRECT_RECLAIM))
@@ -434,7 +434,7 @@ static struct bio *clone_bio(struct dm_target *ti, struct flakey_c *fc, struct b

remaining_size = size;

-order = MAX_ORDER;
+order = MAX_PAGE_ORDER;
while (remaining_size) {
struct page *pages;
unsigned size_to_add, to_copy;
@@ -443,7 +443,7 @@ static int genwqe_mmap(struct file *filp, struct vm_area_struct *vma)
if (vsize == 0)
return -EINVAL;

-if (get_order(vsize) > MAX_ORDER)
+if (get_order(vsize) > MAX_PAGE_ORDER)
return -ENOMEM;

dma_map = kzalloc(sizeof(struct dma_mapping), GFP_KERNEL);
@@ -210,7 +210,7 @@ u32 genwqe_crc32(u8 *buff, size_t len, u32 init)
void *__genwqe_alloc_consistent(struct genwqe_dev *cd, size_t size,
dma_addr_t *dma_handle)
{
-if (get_order(size) > MAX_ORDER)
+if (get_order(size) > MAX_PAGE_ORDER)
return NULL;

return dma_alloc_coherent(&cd->pci_dev->dev, size, dma_handle,
@@ -308,7 +308,7 @@ int genwqe_alloc_sync_sgl(struct genwqe_dev *cd, struct genwqe_sgl *sgl,
sgl->write = write;
sgl->sgl_size = genwqe_sgl_size(sgl->nr_pages);

-if (get_order(sgl->sgl_size) > MAX_ORDER) {
+if (get_order(sgl->sgl_size) > MAX_PAGE_ORDER) {
dev_err(&pci_dev->dev,
"[%s] err: too much memory requested!\n", __func__);
return ret;
@@ -1041,7 +1041,7 @@ static void hns3_init_tx_spare_buffer(struct hns3_enet_ring *ring)
return;

order = get_order(alloc_size);
-if (order > MAX_ORDER) {
+if (order > MAX_PAGE_ORDER) {
if (net_ratelimit())
dev_warn(ring_to_dev(ring), "failed to allocate tx spare buffer, exceed to max order\n");
return;
@@ -48,7 +48,7 @@
* of 4096 jumbo frames (MTU=9000) we will need about 9K*4K = 36MB plus
* some padding.
*
-* But the size of a single DMA region is limited by MAX_ORDER in the
+* But the size of a single DMA region is limited by MAX_PAGE_ORDER in the
* kernel (about 16MB currently). To support say 4K Jumbo frames, we
* use a set of LTBs (struct ltb_set) per pool.
*
@@ -75,7 +75,7 @@
* pool for the 4MB. Thus the 16 Rx and Tx queues require 32 * 5 = 160
* plus 16 for the TSO pools for a total of 176 LTB mappings per VNIC.
*/
-#define IBMVNIC_ONE_LTB_MAX ((u32)((1 << MAX_ORDER) * PAGE_SIZE))
+#define IBMVNIC_ONE_LTB_MAX ((u32)((1 << MAX_PAGE_ORDER) * PAGE_SIZE))
#define IBMVNIC_ONE_LTB_SIZE min((u32)(8 << 20), IBMVNIC_ONE_LTB_MAX)
#define IBMVNIC_LTB_SET_SIZE (38 << 20)

@@ -927,8 +927,8 @@ static phys_addr_t hvfb_get_phymem(struct hv_device *hdev,
if (request_size == 0)
return -1;

-if (order <= MAX_ORDER) {
+if (order <= MAX_PAGE_ORDER) {
-/* Call alloc_pages if the size is less than 2^MAX_ORDER */
+/* Call alloc_pages if the size is less than 2^MAX_PAGE_ORDER */
page = alloc_pages(GFP_KERNEL | __GFP_ZERO, order);
if (!page)
return -1;
@@ -958,7 +958,7 @@ static void hvfb_release_phymem(struct hv_device *hdev,
{
unsigned int order = get_order(size);

-if (order <= MAX_ORDER)
+if (order <= MAX_PAGE_ORDER)
__free_pages(pfn_to_page(paddr >> PAGE_SHIFT), order);
else
dma_free_coherent(&hdev->device,
@@ -197,7 +197,7 @@ static int vmlfb_alloc_vram(struct vml_info *vinfo,
va = &vinfo->vram[i];
order = 0;

-while (requested > (PAGE_SIZE << order) && order <= MAX_ORDER)
+while (requested > (PAGE_SIZE << order) && order <= MAX_PAGE_ORDER)
order++;

err = vmlfb_alloc_vram_area(va, order, 0);
@@ -33,7 +33,7 @@
#define VIRTIO_BALLOON_FREE_PAGE_ALLOC_FLAG (__GFP_NORETRY | __GFP_NOWARN | \
__GFP_NOMEMALLOC)
/* The order of free page blocks to report to host */
-#define VIRTIO_BALLOON_HINT_BLOCK_ORDER MAX_ORDER
+#define VIRTIO_BALLOON_HINT_BLOCK_ORDER MAX_PAGE_ORDER
/* The size of a free page block in bytes */
#define VIRTIO_BALLOON_HINT_BLOCK_BYTES \
(1 << (VIRTIO_BALLOON_HINT_BLOCK_ORDER + PAGE_SHIFT))
@@ -1154,13 +1154,13 @@ static void virtio_mem_clear_fake_offline(unsigned long pfn,
*/
static void virtio_mem_fake_online(unsigned long pfn, unsigned long nr_pages)
{
-unsigned long order = MAX_ORDER;
+unsigned long order = MAX_PAGE_ORDER;
unsigned long i;

/*
* We might get called for ranges that don't cover properly aligned
-* MAX_ORDER pages; however, we can only online properly aligned
+* MAX_PAGE_ORDER pages; however, we can only online properly aligned
-* pages with an order of MAX_ORDER at maximum.
+* pages with an order of MAX_PAGE_ORDER at maximum.
*/
while (!IS_ALIGNED(pfn | nr_pages, 1 << order))
order--;
@@ -1280,7 +1280,7 @@ static void virtio_mem_online_page(struct virtio_mem *vm,
bool do_online;

/*
-* We can get called with any order up to MAX_ORDER. If our subblock
+* We can get called with any order up to MAX_PAGE_ORDER. If our subblock
* size is smaller than that and we have a mixture of plugged and
* unplugged subblocks within such a page, we have to process in
* smaller granularity. In that case we'll adjust the order exactly once

fs/Kconfig
@@ -258,7 +258,7 @@ config TMPFS_QUOTA
config ARCH_SUPPORTS_HUGETLBFS
def_bool n

-config HUGETLBFS
+menuconfig HUGETLBFS
bool "HugeTLB file system support"
depends on X86 || SPARC64 || ARCH_SUPPORTS_HUGETLBFS || BROKEN
depends on (SYSFS || SYSCTL)
@@ -270,6 +270,17 @@ config HUGETLBFS

If unsure, say N.

+if HUGETLBFS
+config HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON
+bool "HugeTLB Vmemmap Optimization (HVO) defaults to on"
+default n
+depends on HUGETLB_PAGE_OPTIMIZE_VMEMMAP
+help
+The HugeTLB Vmemmap Optimization (HVO) defaults to off. Say Y here to
+enable HVO by default. It can be disabled via hugetlb_free_vmemmap=off
+(boot command line) or hugetlb_optimize_vmemmap (sysctl).
+endif # HUGETLBFS
+
config HUGETLB_PAGE
def_bool HUGETLBFS
select XARRAY_MULTI
@@ -279,15 +290,6 @@ config HUGETLB_PAGE_OPTIMIZE_VMEMMAP
depends on ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP
depends on SPARSEMEM_VMEMMAP

-config HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON
-bool "HugeTLB Vmemmap Optimization (HVO) defaults to on"
-default n
-depends on HUGETLB_PAGE_OPTIMIZE_VMEMMAP
-help
-The HugeTLB VmemmapvOptimization (HVO) defaults to off. Say Y here to
-enable HVO by default. It can be disabled via hugetlb_free_vmemmap=off
-(boot command line) or hugetlb_optimize_vmemmap (sysctl).
-
config ARCH_HAS_GIGANTIC_PAGE
bool

@@ -5,6 +5,7 @@
* Copyright (C) 1997-1999 Russell King
*/
#include <linux/buffer_head.h>
+#include <linux/mpage.h>
#include <linux/writeback.h>
#include "adfs.h"

@@ -33,9 +34,10 @@ abort_toobig:
return 0;
}

-static int adfs_writepage(struct page *page, struct writeback_control *wbc)
+static int adfs_writepages(struct address_space *mapping,
+struct writeback_control *wbc)
{
-return block_write_full_page(page, adfs_get_block, wbc);
+return mpage_writepages(mapping, wbc, adfs_get_block);
}

static int adfs_read_folio(struct file *file, struct folio *folio)
@@ -76,10 +78,11 @@ static const struct address_space_operations adfs_aops = {
.dirty_folio = block_dirty_folio,
.invalidate_folio = block_invalidate_folio,
.read_folio = adfs_read_folio,
-.writepage = adfs_writepage,
+.writepages = adfs_writepages,
.write_begin = adfs_write_begin,
.write_end = generic_write_end,
-.bmap = _adfs_bmap
+.migrate_folio = buffer_migrate_folio,
+.bmap = _adfs_bmap,
};

/*
@@ -242,7 +242,7 @@ static void afs_kill_pages(struct address_space *mapping,
folio_clear_uptodate(folio);
folio_end_writeback(folio);
folio_lock(folio);
-generic_error_remove_page(mapping, &folio->page);
+generic_error_remove_folio(mapping, folio);
folio_unlock(folio);
folio_put(folio);

@@ -559,8 +559,7 @@ static void afs_extend_writeback(struct address_space *mapping,

if (!folio_clear_dirty_for_io(folio))
BUG();
-if (folio_start_writeback(folio))
-BUG();
+folio_start_writeback(folio);
afs_folio_start_fscache(caching, folio);

*_count -= folio_nr_pages(folio);
@@ -595,8 +594,7 @@ static ssize_t afs_write_back_from_locked_folio(struct address_space *mapping,

_enter(",%lx,%llx-%llx", folio_index(folio), start, end);

-if (folio_start_writeback(folio))
-BUG();
+folio_start_writeback(folio);
afs_folio_start_fscache(caching, folio);

count -= folio_nr_pages(folio);
@@ -1131,7 +1131,7 @@ static const struct address_space_operations bch_address_space_operations = {
#ifdef CONFIG_MIGRATION
.migrate_folio = filemap_migrate_folio,
#endif
-.error_remove_page = generic_error_remove_page,
+.error_remove_folio = generic_error_remove_folio,
};

struct bcachefs_fid {
@@ -11,6 +11,7 @@
*/

#include <linux/fs.h>
+#include <linux/mpage.h>
#include <linux/buffer_head.h>
#include "bfs.h"

@@ -150,9 +151,10 @@ out:
return err;
}

-static int bfs_writepage(struct page *page, struct writeback_control *wbc)
+static int bfs_writepages(struct address_space *mapping,
+struct writeback_control *wbc)
{
-return block_write_full_page(page, bfs_get_block, wbc);
+return mpage_writepages(mapping, wbc, bfs_get_block);
}

static int bfs_read_folio(struct file *file, struct folio *folio)
@@ -190,9 +192,10 @@ const struct address_space_operations bfs_aops = {
.dirty_folio = block_dirty_folio,
.invalidate_folio = block_invalidate_folio,
.read_folio = bfs_read_folio,
-.writepage = bfs_writepage,
+.writepages = bfs_writepages,
.write_begin = bfs_write_begin,
.write_end = generic_write_end,
+.migrate_folio = buffer_migrate_folio,
.bmap = bfs_bmap,
};

@@ -10930,7 +10930,7 @@ static const struct address_space_operations btrfs_aops = {
.release_folio = btrfs_release_folio,
.migrate_folio = btrfs_migrate_folio,
.dirty_folio = filemap_dirty_folio,
-.error_remove_page = generic_error_remove_page,
+.error_remove_folio = generic_error_remove_folio,
.swap_activate = btrfs_swap_activate,
.swap_deactivate = btrfs_swap_deactivate,
};

fs/buffer.c
@@ -199,7 +199,7 @@ __find_get_block_slow(struct block_device *bdev, sector_t block)
int all_mapped = 1;
static DEFINE_RATELIMIT_STATE(last_warned, HZ, 1);

-index = block >> (PAGE_SHIFT - bd_inode->i_blkbits);
+index = ((loff_t)block << bd_inode->i_blkbits) / PAGE_SIZE;
folio = __filemap_get_folio(bd_mapping, index, FGP_ACCESSED, 0);
if (IS_ERR(folio))
goto out;
@@ -372,10 +372,10 @@ static void end_buffer_async_read_io(struct buffer_head *bh, int uptodate)
}

/*
-* Completion handler for block_write_full_page() - pages which are unlocked
+* Completion handler for block_write_full_folio() - folios which are unlocked
-* during I/O, and which have PageWriteback cleared upon I/O completion.
+* during I/O, and which have the writeback flag cleared upon I/O completion.
*/
-void end_buffer_async_write(struct buffer_head *bh, int uptodate)
+static void end_buffer_async_write(struct buffer_head *bh, int uptodate)
{
unsigned long flags;
struct buffer_head *first;
@@ -415,7 +415,6 @@ still_busy:
spin_unlock_irqrestore(&first->b_uptodate_lock, flags);
return;
}
-EXPORT_SYMBOL(end_buffer_async_write);

/*
* If a page's buffers are under async readin (end_buffer_async_read
@@ -995,11 +994,12 @@ static sector_t blkdev_max_block(struct block_device *bdev, unsigned int size)
* Initialise the state of a blockdev folio's buffers.
*/
static sector_t folio_init_buffers(struct folio *folio,
-struct block_device *bdev, sector_t block, int size)
+struct block_device *bdev, unsigned size)
{
struct buffer_head *head = folio_buffers(folio);
struct buffer_head *bh = head;
bool uptodate = folio_test_uptodate(folio);
+sector_t block = div_u64(folio_pos(folio), size);
sector_t end_block = blkdev_max_block(bdev, size);

do {
@@ -1024,40 +1024,49 @@ static sector_t folio_init_buffers(struct folio *folio,
}

/*
-* Create the page-cache page that contains the requested block.
+* Create the page-cache folio that contains the requested block.
*
* This is used purely for blockdev mappings.
+*
+* Returns false if we have a failure which cannot be cured by retrying
+* without sleeping. Returns true if we succeeded, or the caller should retry.
*/
-static int
-grow_dev_page(struct block_device *bdev, sector_t block,
-pgoff_t index, int size, int sizebits, gfp_t gfp)
+static bool grow_dev_folio(struct block_device *bdev, sector_t block,
+pgoff_t index, unsigned size, gfp_t gfp)
{
struct inode *inode = bdev->bd_inode;
struct folio *folio;
struct buffer_head *bh;
-sector_t end_block;
-int ret = 0;
+sector_t end_block = 0;

folio = __filemap_get_folio(inode->i_mapping, index,
FGP_LOCK | FGP_ACCESSED | FGP_CREAT, gfp);
if (IS_ERR(folio))
-return PTR_ERR(folio);
+return false;

bh = folio_buffers(folio);
if (bh) {
if (bh->b_size == size) {
-end_block = folio_init_buffers(folio, bdev,
-(sector_t)index << sizebits, size);
-goto done;
+end_block = folio_init_buffers(folio, bdev, size);
+goto unlock;
+}
+
+/*
+* Retrying may succeed; for example the folio may finish
+* writeback, or buffers may be cleaned. This should not
+* happen very often; maybe we have old buffers attached to
+* this blockdev's page cache and we're trying to change
+* the block size?
+*/
+if (!try_to_free_buffers(folio)) {
+end_block = ~0ULL;
+goto unlock;
}
-if (!try_to_free_buffers(folio))
-goto failed;
}

-ret = -ENOMEM;
bh = folio_alloc_buffers(folio, size, gfp | __GFP_ACCOUNT);
if (!bh)
-goto failed;
+goto unlock;

/*
* Link the folio to the buffers and initialise them. Take the
@@ -1066,44 +1075,37 @@ grow_dev_page(struct block_device *bdev, sector_t block,
*/
spin_lock(&inode->i_mapping->i_private_lock);
link_dev_buffers(folio, bh);
-end_block = folio_init_buffers(folio, bdev,
-(sector_t)index << sizebits, size);
+end_block = folio_init_buffers(folio, bdev, size);
spin_unlock(&inode->i_mapping->i_private_lock);
-done:
-ret = (block < end_block) ? 1 : -ENXIO;
-failed:
+unlock:
folio_unlock(folio);
folio_put(folio);
-return ret;
+return block < end_block;
}

/*
-* Create buffers for the specified block device block's page. If
-* that page was dirty, the buffers are set dirty also.
+* Create buffers for the specified block device block's folio. If
+* that folio was dirty, the buffers are set dirty also. Returns false
+* if we've hit a permanent error.
*/
-static int
-grow_buffers(struct block_device *bdev, sector_t block, int size, gfp_t gfp)
+static bool grow_buffers(struct block_device *bdev, sector_t block,
+unsigned size, gfp_t gfp)
{
-pgoff_t index;
-int sizebits;
-
-sizebits = PAGE_SHIFT - __ffs(size);
-index = block >> sizebits;
+loff_t pos;

/*
-* Check for a block which wants to lie outside our maximum possible
-* pagecache index. (this comparison is done using sector_t types).
+* Check for a block which lies outside our maximum possible
+* pagecache index.
*/
-if (unlikely(index != block >> sizebits)) {
-printk(KERN_ERR "%s: requested out-of-range block %llu for "
-"device %pg\n",
+if (check_mul_overflow(block, (sector_t)size, &pos) || pos > MAX_LFS_FILESIZE) {
+printk(KERN_ERR "%s: requested out-of-range block %llu for device %pg\n",
__func__, (unsigned long long)block,
bdev);
-return -EIO;
+return false;
}

-/* Create a page with the proper size buffers.. */
-return grow_dev_page(bdev, block, index, size, sizebits, gfp);
+/* Create a folio with the proper size buffers */
+return grow_dev_folio(bdev, block, pos / PAGE_SIZE, size, gfp);
}

static struct buffer_head *
@@ -1124,14 +1126,12 @@ __getblk_slow(struct block_device *bdev, sector_t block,

for (;;) {
struct buffer_head *bh;
-int ret;

bh = __find_get_block(bdev, block, size);
if (bh)
return bh;

-ret = grow_buffers(bdev, block, size, gfp);
-if (ret < 0)
+if (!grow_buffers(bdev, block, size, gfp))
return NULL;
}
}
@@ -1699,13 +1699,13 @@ void clean_bdev_aliases(struct block_device *bdev, sector_t block, sector_t len)
struct inode *bd_inode = bdev->bd_inode;
struct address_space *bd_mapping = bd_inode->i_mapping;
struct folio_batch fbatch;
-pgoff_t index = block >> (PAGE_SHIFT - bd_inode->i_blkbits);
+pgoff_t index = ((loff_t)block << bd_inode->i_blkbits) / PAGE_SIZE;
pgoff_t end;
int i, count;
struct buffer_head *bh;
struct buffer_head *head;

-end = (block + len - 1) >> (PAGE_SHIFT - bd_inode->i_blkbits);
+end = ((loff_t)(block + len - 1) << bd_inode->i_blkbits) / PAGE_SIZE;
folio_batch_init(&fbatch);
while (filemap_get_folios(bd_mapping, &index, end, &fbatch)) {
count = folio_batch_count(&fbatch);
@@ -1748,19 +1748,6 @@ unlock_page:
}
EXPORT_SYMBOL(clean_bdev_aliases);

-/*
-* Size is a power-of-two in the range 512..PAGE_SIZE,
-* and the case we care about most is PAGE_SIZE.
-*
-* So this *could* possibly be written with those
-* constraints in mind (relevant mostly if some
-* architecture has a slow bit-scan instruction)
-*/
-static inline int block_size_bits(unsigned int blocksize)
-{
-return ilog2(blocksize);
-}
-
static struct buffer_head *folio_create_buffers(struct folio *folio,
struct inode *inode,
unsigned int b_state)
@@ -1790,30 +1777,29 @@ static struct buffer_head *folio_create_buffers(struct folio *folio,
*/

/*
-* While block_write_full_page is writing back the dirty buffers under
+* While block_write_full_folio is writing back the dirty buffers under
* the page lock, whoever dirtied the buffers may decide to clean them
* again at any time. We handle that by only looking at the buffer
* state inside lock_buffer().
*
-* If block_write_full_page() is called for regular writeback
+* If block_write_full_folio() is called for regular writeback
* (wbc->sync_mode == WB_SYNC_NONE) then it will redirty a page which has a
* locked buffer. This only can happen if someone has written the buffer
* directly, with submit_bh(). At the address_space level PageWriteback
* prevents this contention from occurring.
*
-* If block_write_full_page() is called with wbc->sync_mode ==
+* If block_write_full_folio() is called with wbc->sync_mode ==
* WB_SYNC_ALL, the writes are posted using REQ_SYNC; this
* causes the writes to be flagged as synchronous writes.
*/
int __block_write_full_folio(struct inode *inode, struct folio *folio,
-get_block_t *get_block, struct writeback_control *wbc,
-bh_end_io_t *handler)
+get_block_t *get_block, struct writeback_control *wbc)
{
int err;
sector_t block;
sector_t last_block;
struct buffer_head *bh, *head;
-unsigned int blocksize, bbits;
+size_t blocksize;
int nr_underway = 0;
blk_opf_t write_flags = wbc_to_write_flags(wbc);

@@ -1832,10 +1818,9 @@ int __block_write_full_folio(struct inode *inode, struct folio *folio,

bh = head;
blocksize = bh->b_size;
-bbits = block_size_bits(blocksize);

-block = (sector_t)folio->index << (PAGE_SHIFT - bbits);
-last_block = (i_size_read(inode) - 1) >> bbits;
+block = div_u64(folio_pos(folio), blocksize);
+last_block = div_u64(i_size_read(inode) - 1, blocksize);

/*
* Get all the dirty buffers mapped to disk addresses and
@@ -1849,7 +1834,7 @@ int __block_write_full_folio(struct inode *inode, struct folio *folio,
* truncate in progress.
*/
/*
-* The buffer was zeroed by block_write_full_page()
+* The buffer was zeroed by block_write_full_folio()
*/
clear_buffer_dirty(bh);
set_buffer_uptodate(bh);
@@ -1887,7 +1872,8 @@ int __block_write_full_folio(struct inode *inode, struct folio *folio,
continue;
}
if (test_clear_buffer_dirty(bh)) {
-mark_buffer_async_write_endio(bh, handler);
+mark_buffer_async_write_endio(bh,
+end_buffer_async_write);
} else {
unlock_buffer(bh);
}
@@ -1940,7 +1926,8 @@ recover:
if (buffer_mapped(bh) && buffer_dirty(bh) &&
!buffer_delay(bh)) {
lock_buffer(bh);
-mark_buffer_async_write_endio(bh, handler);
+mark_buffer_async_write_endio(bh,
+end_buffer_async_write);
} else {
/*
* The buffer may have been set dirty during
@@ -2014,7 +2001,7 @@ static int
iomap_to_bh(struct inode *inode, sector_t block, struct buffer_head *bh,
const struct iomap *iomap)
{
-loff_t offset = block << inode->i_blkbits;
+loff_t offset = (loff_t)block << inode->i_blkbits;

bh->b_bdev = iomap->bdev;

@ -2081,27 +2068,24 @@ iomap_to_bh(struct inode *inode, sector_t block, struct buffer_head *bh,
|
|||||||
int __block_write_begin_int(struct folio *folio, loff_t pos, unsigned len,
|
int __block_write_begin_int(struct folio *folio, loff_t pos, unsigned len,
|
||||||
get_block_t *get_block, const struct iomap *iomap)
|
get_block_t *get_block, const struct iomap *iomap)
|
||||||
{
|
{
|
||||||
unsigned from = pos & (PAGE_SIZE - 1);
|
size_t from = offset_in_folio(folio, pos);
|
||||||
unsigned to = from + len;
|
size_t to = from + len;
|
||||||
struct inode *inode = folio->mapping->host;
|
struct inode *inode = folio->mapping->host;
|
||||||
unsigned block_start, block_end;
|
size_t block_start, block_end;
|
||||||
sector_t block;
|
sector_t block;
|
||||||
int err = 0;
|
int err = 0;
|
||||||
unsigned blocksize, bbits;
|
size_t blocksize;
|
||||||
struct buffer_head *bh, *head, *wait[2], **wait_bh=wait;
|
struct buffer_head *bh, *head, *wait[2], **wait_bh=wait;
|
||||||
|
|
||||||
BUG_ON(!folio_test_locked(folio));
|
BUG_ON(!folio_test_locked(folio));
|
||||||
BUG_ON(from > PAGE_SIZE);
|
BUG_ON(to > folio_size(folio));
|
||||||
BUG_ON(to > PAGE_SIZE);
|
|
||||||
BUG_ON(from > to);
|
BUG_ON(from > to);
|
||||||
|
|
||||||
head = folio_create_buffers(folio, inode, 0);
|
head = folio_create_buffers(folio, inode, 0);
|
||||||
blocksize = head->b_size;
|
blocksize = head->b_size;
|
||||||
bbits = block_size_bits(blocksize);
|
block = div_u64(folio_pos(folio), blocksize);
|
||||||
|
|
||||||
block = (sector_t)folio->index << (PAGE_SHIFT - bbits);
|
for (bh = head, block_start = 0; bh != head || !block_start;
|
||||||
|
|
||||||
for(bh = head, block_start = 0; bh != head || !block_start;
|
|
||||||
block++, block_start=block_end, bh = bh->b_this_page) {
|
block++, block_start=block_end, bh = bh->b_this_page) {
|
||||||
block_end = block_start + blocksize;
|
block_end = block_start + blocksize;
|
||||||
if (block_end <= from || block_start >= to) {
|
if (block_end <= from || block_start >= to) {
|
||||||
@ -2364,7 +2348,7 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
|
|||||||
struct inode *inode = folio->mapping->host;
|
struct inode *inode = folio->mapping->host;
|
||||||
sector_t iblock, lblock;
|
sector_t iblock, lblock;
|
||||||
struct buffer_head *bh, *head, *arr[MAX_BUF_PER_PAGE];
|
struct buffer_head *bh, *head, *arr[MAX_BUF_PER_PAGE];
|
||||||
unsigned int blocksize, bbits;
|
size_t blocksize;
|
||||||
int nr, i;
|
int nr, i;
|
||||||
int fully_mapped = 1;
|
int fully_mapped = 1;
|
||||||
bool page_error = false;
|
bool page_error = false;
|
||||||
@@ -2378,10 +2362,9 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
 
 	head = folio_create_buffers(folio, inode, 0);
 	blocksize = head->b_size;
-	bbits = block_size_bits(blocksize);
 
-	iblock = (sector_t)folio->index << (PAGE_SHIFT - bbits);
-	lblock = (limit+blocksize-1) >> bbits;
+	iblock = div_u64(folio_pos(folio), blocksize);
+	lblock = div_u64(limit + blocksize - 1, blocksize);
 	bh = head;
 	nr = 0;
 	i = 0;
@@ -2666,8 +2649,8 @@ int block_truncate_page(struct address_space *mapping,
 		return 0;
 
 	length = blocksize - length;
-	iblock = (sector_t)index << (PAGE_SHIFT - inode->i_blkbits);
+	iblock = ((loff_t)index * PAGE_SIZE) >> inode->i_blkbits;
 
 	folio = filemap_grab_folio(mapping, index);
 	if (IS_ERR(folio))
 		return PTR_ERR(folio);
@@ -2720,17 +2703,15 @@ EXPORT_SYMBOL(block_truncate_page);
 /*
  * The generic ->writepage function for buffer-backed address_spaces
  */
-int block_write_full_page(struct page *page, get_block_t *get_block,
-		struct writeback_control *wbc)
+int block_write_full_folio(struct folio *folio, struct writeback_control *wbc,
+		void *get_block)
 {
-	struct folio *folio = page_folio(page);
 	struct inode * const inode = folio->mapping->host;
 	loff_t i_size = i_size_read(inode);
 
 	/* Is the folio fully inside i_size? */
 	if (folio_pos(folio) + folio_size(folio) <= i_size)
-		return __block_write_full_folio(inode, folio, get_block, wbc,
-					       end_buffer_async_write);
+		return __block_write_full_folio(inode, folio, get_block, wbc);
 
 	/* Is the folio fully outside i_size? (truncate in progress) */
 	if (folio_pos(folio) >= i_size) {
@@ -2747,10 +2728,8 @@ int block_write_full_page(struct page *page, get_block_t *get_block,
 	 */
 	folio_zero_segment(folio, offset_in_folio(folio, i_size),
 			folio_size(folio));
-	return __block_write_full_folio(inode, folio, get_block, wbc,
-				       end_buffer_async_write);
+	return __block_write_full_folio(inode, folio, get_block, wbc);
 }
-EXPORT_SYMBOL(block_write_full_page);
 
 sector_t generic_block_bmap(struct address_space *mapping, sector_t block,
 			    get_block_t *get_block)
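The two hunks above fold block_write_full_page() into block_write_full_folio(): the caller passes the folio and writeback_control directly, get_block travels as an opaque pointer, and end_buffer_async_write no longer has to be passed explicitly. A hedged sketch of what a caller shaped like a write_cache_pages() callback might look like (myfs_write_folio is hypothetical, not an in-tree function; the signature is assumed from the hunks above):

#include <linux/buffer_head.h>
#include <linux/writeback.h>

/*
 * Hypothetical caller, for illustration only: "data" is assumed to
 * carry the filesystem's get_block_t, since block_write_full_folio()
 * now accepts it as a void * cookie.
 */
static int myfs_write_folio(struct folio *folio,
			    struct writeback_control *wbc, void *data)
{
	return block_write_full_folio(folio, wbc, data);
}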
@@ -907,8 +907,8 @@ static void writepages_finish(struct ceph_osd_request *req)
 		doutc(cl, "unlocking %p\n", page);
 
 		if (remove_page)
-			generic_error_remove_page(inode->i_mapping,
-						  page);
+			generic_error_remove_folio(inode->i_mapping,
+						   page_folio(page));
 
 		unlock_page(page);
 	}
@@ -428,7 +428,8 @@ static void d_lru_add(struct dentry *dentry)
 	this_cpu_inc(nr_dentry_unused);
 	if (d_is_negative(dentry))
 		this_cpu_inc(nr_dentry_negative);
-	WARN_ON_ONCE(!list_lru_add(&dentry->d_sb->s_dentry_lru, &dentry->d_lru));
+	WARN_ON_ONCE(!list_lru_add_obj(
+			&dentry->d_sb->s_dentry_lru, &dentry->d_lru));
 }
 
 static void d_lru_del(struct dentry *dentry)
@@ -438,7 +439,8 @@ static void d_lru_del(struct dentry *dentry)
 	this_cpu_dec(nr_dentry_unused);
 	if (d_is_negative(dentry))
 		this_cpu_dec(nr_dentry_negative);
-	WARN_ON_ONCE(!list_lru_del(&dentry->d_sb->s_dentry_lru, &dentry->d_lru));
+	WARN_ON_ONCE(!list_lru_del_obj(
+			&dentry->d_sb->s_dentry_lru, &dentry->d_lru));
 }
 
 static void d_shrink_del(struct dentry *dentry)
@@ -1240,7 +1242,7 @@ static enum lru_status dentry_lru_isolate(struct list_head *item,
 	 *
 	 * This is guaranteed by the fact that all LRU management
 	 * functions are intermediated by the LRU API calls like
-	 * list_lru_add and list_lru_del. List movement in this file
+	 * list_lru_add_obj and list_lru_del_obj. List movement in this file
 	 * only ever occur through this functions or through callbacks
 	 * like this one, that are called from the LRU API.
 	 *
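The dcache hunks above come from the list_lru rework: the _obj() wrappers keep the old behaviour of deriving the NUMA node and memcg from the object itself, while the bare list_lru_add()/list_lru_del() now take them explicitly for callers that already know the target memcg. The prototypes below are assumptions based on that rework, shown for orientation only, not taken from this diff:

/* Assumed prototypes for that rework, for orientation only. */
bool list_lru_add(struct list_lru *lru, struct list_head *item,
		  int nid, struct mem_cgroup *memcg);
bool list_lru_del(struct list_lru *lru, struct list_head *item,
		  int nid, struct mem_cgroup *memcg);
bool list_lru_add_obj(struct list_lru *lru, struct list_head *item);
bool list_lru_del_obj(struct list_lru *lru, struct list_head *item);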
@@ -969,7 +969,7 @@ const struct address_space_operations ext2_aops = {
 	.writepages		= ext2_writepages,
 	.migrate_folio		= buffer_migrate_folio,
 	.is_partially_uptodate	= block_is_partially_uptodate,
-	.error_remove_page	= generic_error_remove_page,
+	.error_remove_folio	= generic_error_remove_folio,
 };
 
 static const struct address_space_operations ext2_dax_aops = {
Some files were not shown because too many files have changed in this diff.