linux-stable

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git synced 2024-12-29 17:25:38 +00:00

History

Vlastimil Babka 1fcb702a6e mm, thp: respect MPOL_PREFERRED policy with non-local node

commit `0867a57c4f` upstream.

Since commit `077fcf116c` ("mm/thp: allocate transparent hugepages on local node"), we handle THP allocations on page fault in a special way - for non-interleave memory policies, the allocation is only attempted on the node local to the current CPU, if the policy's nodemask allows the node.

This is motivated by the assumption that THP benefits cannot offset the cost of remote accesses, so it's better to fallback to base pages on the local node (which might still be available, while huge pages are not due to fragmentation) than to allocate huge pages on a remote node.

The nodemask check prevents us from violating e.g. MPOL_BIND policies where the local node is not among the allowed nodes. However, the current implementation can still give surprising results for the MPOL_PREFERRED policy when the preferred node is different than the current CPU's local node.

In such case we should honor the preferred node and not use the local node, which is what this patch does. If hugepage allocation on the preferred node fails, we fall back to base pages and don't try other nodes, with the same motivation as is done for the local node hugepage allocations. The patch also moves the MPOL_INTERLEAVE check around to simplify the hugepage specific test.

The difference can be demonstrated using in-tree transhuge-stress test on the following 2-node machine where half memory on one node was occupied to show the difference.

    > numactl --hardware
    available: 2 nodes (0-1)
    node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 24 25 26 27 28 29 30 31 32 33 34 35
    node 0 size: 7878 MB
    node 0 free: 3623 MB
    node 1 cpus: 12 13 14 15 16 17 18 19 20 21 22 23 36 37 38 39 40 41 42 43 44 45 46 47
    node 1 size: 8045 MB
    node 1 free: 7818 MB
    node distances:
    node   0   1
      0:  10  21
      1:  21  10

Before the patch:

    > numactl -p0 -C0 ./transhuge-stress
    transhuge-stress: 2.197 s/loop, 0.276 ms/page, 7249.168 MiB/s 7962 succeed, 0 failed, 1786 different pages

    > numactl -p0 -C12 ./transhuge-stress
    transhuge-stress: 2.962 s/loop, 0.372 ms/page, 5376.172 MiB/s 7962 succeed, 0 failed, 3873 different pages

Number of successful THP allocations corresponds to free memory on node 0 in the first case and node 1 in the second case, i.e. -p parameter is ignored and cpu binding "wins".

After the patch:

    > numactl -p0 -C0 ./transhuge-stress
    transhuge-stress: 2.183 s/loop, 0.274 ms/page, 7295.516 MiB/s 7962 succeed, 0 failed, 1760 different pages

    > numactl -p0 -C12 ./transhuge-stress
    transhuge-stress: 2.878 s/loop, 0.361 ms/page, 5533.638 MiB/s 7962 succeed, 0 failed, 1750 different pages

    > numactl -p1 -C0 ./transhuge-stress
    transhuge-stress: 4.628 s/loop, 0.581 ms/page, 3440.893 MiB/s 7962 succeed, 0 failed, 3918 different pages

The -p parameter is respected regardless of cpu binding.

    > numactl -C0 ./transhuge-stress
    transhuge-stress: 2.202 s/loop, 0.277 ms/page, 7230.003 MiB/s 7962 succeed, 0 failed, 1750 different pages

    > numactl -C12 ./transhuge-stress
    transhuge-stress: 3.020 s/loop, 0.379 ms/page, 5273.324 MiB/s 7962 succeed, 0 failed, 3916 different pages

Without -p parameter, hugepage restriction to CPU-local node works as before.

Fixes: `077fcf116c` ("mm/thp: allocate transparent hugepages on local node")
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

2015-07-21 10:10:13 -07:00
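The node-selection rule described in the commit message can be sketched as a small user-space model. This is not the code from mm/mempolicy.c: the `policy` struct, the `POLICY_*` enum, `NO_THP_NODE`, and `thp_target_node()` are hypothetical names introduced only for illustration, and the 64-bit nodemask (with 0 meaning "no restriction") is a simplifying assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for the kernel's MPOL_* modes. */
enum policy_mode {
    POLICY_DEFAULT,
    POLICY_PREFERRED,
    POLICY_BIND,
    POLICY_INTERLEAVE
};

/* Hypothetical model of a memory policy; not the kernel's struct mempolicy. */
struct policy {
    enum policy_mode mode;
    int preferred_node;   /* meaningful only for POLICY_PREFERRED */
    uint64_t nodemask;    /* bit n set: node n allowed; 0: no restriction */
};

#define MAX_NODES   64
#define NO_THP_NODE (-1)

/* True if the policy's nodemask permits allocating on the given node. */
static bool node_allowed(const struct policy *pol, int node)
{
    return pol->nodemask == 0 ||
           (pol->nodemask & (UINT64_C(1) << node)) != 0;
}

/*
 * Return the single node on which a huge page allocation should be
 * attempted, or NO_THP_NODE if the huge-page-specific path does not apply
 * (interleave policy, or the chosen node is excluded by the nodemask).
 * If the attempt on the returned node fails, the caller is expected to
 * fall back to base pages rather than try other nodes.
 */
int thp_target_node(const struct policy *pol, int local_node)
{
    int node = local_node;

    assert(local_node >= 0 && local_node < MAX_NODES);

    /* Interleave policies are handled separately, before this test. */
    if (pol->mode == POLICY_INTERLEAVE)
        return NO_THP_NODE;

    /* The behavior the patch adds: an explicit preferred node wins over
     * the node local to the current CPU. */
    if (pol->mode == POLICY_PREFERRED) {
        assert(pol->preferred_node >= 0 && pol->preferred_node < MAX_NODES);
        node = pol->preferred_node;
    }

    return node_allowed(pol, node) ? node : NO_THP_NODE;
}
```

In this model, a preferred-node policy for node 0 evaluated on a CPU of node 1 (the `numactl -p0 -C12` case above) yields node 0, while a bind policy that excludes the local node yields `NO_THP_NODE`.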
..
kasan	kasan, module, vmalloc: rework shadow allocation for modules	2015-03-12 18:46:08 -07:00
backing-dev.c	block: discard bdi_unregister() in favour of bdi_destroy()	2015-06-22 17:03:29 -07:00
balloon_compaction.c	mm/balloon_compaction: fix deflation when compaction is disabled	2014-10-29 16:33:15 -07:00
bootmem.c	mem-hotplug: reset node managed pages when hot-adding a new pgdat	2014-11-13 16:17:06 -08:00
cleancache.c	mm: fix cleancache debugfs directory path	2015-01-20 14:08:31 +01:00
cma.c	mm: cma: fix CMA aligned offset calculation	2015-03-12 18:46:07 -07:00
compaction.c	mm: page_alloc: add kasan hooks on alloc and free paths	2015-02-13 21:21:41 -08:00
debug-pagealloc.c	mm/debug-pagealloc: make debug-pagealloc boottime configurable	2014-12-13 12:42:48 -08:00
debug.c	mm: account pmd page tables to the process	2015-02-11 17:06:04 -08:00
dmapool.c	mm/dmapool.c: fixed a brace coding style issue	2014-10-09 22:26:00 -04:00
early_ioremap.c	mm: create generic early_ioremap() support	2014-04-07 16:36:15 -07:00
fadvise.c	vfs: remove get_xip_mem	2015-02-16 17:56:03 -08:00
failslab.c	switch debugfs to umode_t	2012-01-03 22:54:56 -05:00
filemap.c	dax,ext2: replace XIP read and write with DAX I/O	2015-02-16 17:56:03 -08:00
frontswap.c	mm/frontswap.c: fix the condition in BUG_ON	2014-12-10 17:41:08 -08:00
gup.c	Tighten rules for ACCESS_ONCE	2015-02-14 10:54:28 -08:00
highmem.c	mm/highmem: make kmap cache coloring aware	2014-08-06 18:01:22 -07:00
huge_memory.c	mm, thp: really limit transparent hugepage allocation to local node	2015-05-06 22:04:07 +02:00
hugetlb_cgroup.c	mm: page_counter: pull "-1" handling out of page_counter_memparse()	2015-02-11 17:06:02 -08:00
hugetlb.c	mm/hugetlb: use pmd_page() in follow_huge_pmd()	2015-05-06 22:03:38 +02:00
hwpoison-inject.c	mm/hwpoison-inject.c: remove unnecessary null test before debugfs_remove_recursive	2014-08-06 18:01:19 -07:00
init-mm.c	atomic: use <linux/atomic.h>	2011-07-26 16:49:47 -07:00
internal.h	mm/internal.h: don't split printk call in two	2015-02-12 18:54:10 -08:00
interval_tree.c	mm: replace vma->sharead.linear with vma->shared	2015-02-10 14:30:31 -08:00
Kconfig	Merge branch 'kconfig' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild	2015-02-19 10:36:45 -08:00
Kconfig.debug	mm/debug_pagealloc: remove obsolete Kconfig options	2015-01-08 15:10:52 -08:00
kmemcheck.c	mm/slab_common: move kmem_cache definition to internal header	2014-10-09 22:25:50 -04:00
kmemleak-test.c	mm/kmemleak-test.c: use pr_fmt for logging	2014-06-06 16:08:18 -07:00
kmemleak.c	mm: kmemleak_alloc_percpu() should follow the gfp from per_alloc()	2015-07-21 10:10:13 -07:00
ksm.c	mm: remove rest usage of VM_NONLINEAR and pte_file()	2015-02-10 14:30:31 -08:00
list_lru.c	memcg: reparent list_lrus and free kmemcg_id on css offline	2015-02-12 18:54:10 -08:00
maccess.c	mm: Map most files to use export.h instead of module.h	2011-10-31 09:20:12 -04:00
madvise.c	vfs: remove get_xip_mem	2015-02-16 17:56:03 -08:00
Makefile	move iov_iter.c from mm/ to lib/	2015-02-17 22:22:17 -05:00
memblock.c	mm/memblock.c: refactor functions to set/clear MEMBLOCK_HOTPLUG	2014-12-13 12:42:46 -08:00
memcontrol.c	memcg: disable hierarchy support if bound to the legacy cgroup hierarchy	2015-03-12 18:46:08 -07:00
memory_hotplug.c	mm/memory_hotplug.c: set zone->wait_table to null after freeing it	2015-06-22 17:03:35 -07:00
memory-failure.c	mm: soft-offline: fix num_poisoned_pages counting on concurrent events	2015-05-17 09:55:07 -07:00
memory.c	mm: numa: slow PTE scan rate if migration failures occur	2015-03-25 16:20:31 -07:00
mempolicy.c	mm, thp: respect MPOL_PREFERRED policy with non-local node	2015-07-21 10:10:13 -07:00
mempool.c	mm/mempool.c: update the kmemleak stack trace for mempool allocations	2014-06-06 16:08:17 -07:00
migrate.c	mm: convert p[te|md]_mknonnuma and remaining page table manipulations	2015-02-12 18:54:08 -08:00
mincore.c	mincore: apply page table walker on do_mincore()	2015-02-11 17:06:06 -08:00
mlock.c	mm: reorder can_do_mlock to fix audit denial	2015-03-12 18:46:08 -07:00
mm_init.c	mm/mm_init.c: mark mminit_loglevel __meminitdata	2015-02-12 18:54:11 -08:00
mmap.c	mm: fix anon_vma->degree underflow in anon_vma endless growing prevention	2015-03-25 16:20:30 -07:00
mmu_context.c	sched/mm: call finish_arch_post_lock_switch in idle_task_exit and use_mm	2014-02-21 08:50:17 +01:00
mmu_notifier.c	mmu_notifier: add the callback for mmu_notifier_invalidate_range()	2014-11-13 13:46:09 +11:00
mmzone.c	mm: microoptimize zonelist operations	2015-02-11 17:06:02 -08:00
mprotect.c	mm: numa: preserve PTE write permissions across a NUMA hinting fault	2015-03-25 16:20:31 -07:00
mremap.c	fix mremap() vs. ioctx_kill() race	2015-04-06 17:50:59 -04:00
msync.c	mm: remove rest usage of VM_NONLINEAR and pte_file()	2015-02-10 14:30:31 -08:00
nobootmem.c	mem-hotplug: reset node managed pages when hot-adding a new pgdat	2014-11-13 16:17:06 -08:00
nommu.c	mm/nommu.c: export symbol max_mapnr	2015-03-12 18:46:08 -07:00
oom_kill.c	mm: account pmd page tables to the process	2015-02-11 17:06:04 -08:00
page_alloc.c	mm, oom: do not fail __GFP_NOFAIL allocation if oom killer is disabled	2015-03-12 18:46:07 -07:00
page_counter.c	mm: page_counter: pull "-1" handling out of page_counter_memparse()	2015-02-11 17:06:02 -08:00
page_ext.c	mm/page_owner: keep track of page owners	2014-12-13 12:42:48 -08:00
page_io.c	new helper: iov_iter_bvec()	2015-01-29 00:13:11 -05:00
page_isolation.c	mm/page_alloc.c: call kernel_map_pages in unset_migrateype_isolate	2015-03-25 16:20:30 -07:00
page_owner.c	mm/page_owner.c: remove unnecessary stack_trace field	2015-02-11 17:06:07 -08:00
page-writeback.c	writeback: use |1 instead of +1 to protect against div by zero	2015-05-17 09:55:07 -07:00
pagewalk.c	mm/pagewalk.c: prevent positive return value of walk_page_test() from being passed to callers	2015-03-25 16:20:30 -07:00
percpu-km.c	percpu: implmeent pcpu_nr_empty_pop_pages and chunk->nr_populated	2014-09-02 14:46:05 -04:00
percpu-vm.c	percpu: move region iterations out of pcpu_[de]populate_chunk()	2014-09-02 14:46:02 -04:00
percpu.c	mm: kmemleak_alloc_percpu() should follow the gfp from per_alloc()	2015-07-21 10:10:13 -07:00
pgtable-generic.c	mm: convert p[te|md]_mknonnuma and remaining page table manipulations	2015-02-12 18:54:08 -08:00
process_vm_access.c	mm: gup: use get_user_pages_unlocked	2015-02-11 17:06:05 -08:00
quicklist.c	mm: delete various needless include <linux/module.h>	2011-10-31 09:20:11 -04:00
readahead.c	fs: export inode_to_bdi and use it in favor of mapping->backing_dev_info	2015-01-20 14:03:04 -07:00
rmap.c	mm: fix anon_vma->degree underflow in anon_vma endless growing prevention	2015-03-25 16:20:30 -07:00
shmem.c	mm: shmem: check for mapping owner before dereferencing	2015-02-23 10:00:11 -08:00
slab_common.c	mm: slub: add kernel address sanitizer support for slub allocator	2015-02-13 21:21:41 -08:00
slab.c	slub: make dead caches discard free slabs immediately	2015-02-12 18:54:10 -08:00
slab.h	slub: make dead caches discard free slabs immediately	2015-02-12 18:54:10 -08:00
slob.c	slub: make dead caches discard free slabs immediately	2015-02-12 18:54:10 -08:00
slub.c	mm/slub: fix lockups on PREEMPT && !SMP kernels	2015-03-25 16:20:30 -07:00
sparse-vmemmap.c	mm/sparse: use memblock apis for early memory allocations	2014-01-21 16:19:47 -08:00
sparse.c	mm: use macros from compiler.h instead of __attribute__((...))	2014-04-07 16:35:54 -07:00
swap_cgroup.c	mm: page_cgroup: rename file to mm/swap_cgroup.c	2014-12-10 17:41:09 -08:00
swap_state.c	fs: remove mapping->backing_dev_info	2015-01-20 14:03:05 -07:00
swap.c	Merge branch 'for-3.20/bdi' of git://git.kernel.dk/linux-block	2015-02-12 13:50:21 -08:00
swapfile.c	mm: page_cgroup: rename file to mm/swap_cgroup.c	2014-12-10 17:41:09 -08:00
truncate.c	fs: export inode_to_bdi and use it in favor of mapping->backing_dev_info	2015-01-20 14:03:04 -07:00
util.c	mm/util: add kstrdup_const	2015-02-13 21:21:35 -08:00
vmacache.c	mm,vmacache: count number of system-wide flushes	2014-12-13 12:42:48 -08:00
vmalloc.c	kasan, module, vmalloc: rework shadow allocation for modules	2015-03-12 18:46:08 -07:00
vmpressure.c	mm/vmpressure.c: fix race in vmpressure_work_fn()	2014-12-02 17:32:07 -08:00
vmscan.c	Merge branch 'akpm' (patches from Andrew)	2015-02-12 18:54:28 -08:00
vmstat.c	vmstat: Reduce time interval to stat update on idle cpu	2015-02-11 17:06:07 -08:00
workingset.c	list_lru: add helpers to isolate items	2015-02-12 18:54:10 -08:00
zbud.c	mm/zpool: add name argument to create zpool	2015-02-12 18:54:12 -08:00
zpool.c	mm/zpool: add name argument to create zpool	2015-02-12 18:54:12 -08:00
zsmalloc.c	mm/zsmalloc: add statistics support	2015-02-12 18:54:12 -08:00
zswap.c	mm/zpool: add name argument to create zpool	2015-02-12 18:54:12 -08:00