linux-stable/mm
Vlastimil Babka 1fcb702a6e mm, thp: respect MPOL_PREFERRED policy with non-local node
commit 0867a57c4f upstream.

Since commit 077fcf116c ("mm/thp: allocate transparent hugepages on
local node"), we handle THP allocations on page fault in a special way -
for non-interleave memory policies, the allocation is only attempted on
the node local to the current CPU, if the policy's nodemask allows the
node.

This is motivated by the assumption that THP benefits cannot offset the
cost of remote accesses, so it's better to fallback to base pages on the
local node (which might still be available, while huge pages are not due
to fragmentation) than to allocate huge pages on a remote node.

The nodemask check prevents us from violating e.g.  MPOL_BIND policies
where the local node is not among the allowed nodes.  However, the
current implementation can still give surprising results for the
MPOL_PREFERRED policy when the preferred node is different than the
current CPU's local node.

In such case we should honor the preferred node and not use the local
node, which is what this patch does.  If hugepage allocation on the
preferred node fails, we fall back to base pages and don't try other
nodes, with the same motivation as is done for the local node hugepage
allocations.  The patch also moves the MPOL_INTERLEAVE check around to
simplify the hugepage specific test.

The difference can be demonstrated using in-tree transhuge-stress test
on the following 2-node machine where half memory on one node was
occupied to show the difference.

> numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 24 25 26 27 28 29 30 31 32 33 34 35
node 0 size: 7878 MB
node 0 free: 3623 MB
node 1 cpus: 12 13 14 15 16 17 18 19 20 21 22 23 36 37 38 39 40 41 42 43 44 45 46 47
node 1 size: 8045 MB
node 1 free: 7818 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10

Before the patch:
> numactl -p0 -C0 ./transhuge-stress
transhuge-stress: 2.197 s/loop, 0.276 ms/page,   7249.168 MiB/s 7962 succeed,    0 failed, 1786 different pages

> numactl -p0 -C12 ./transhuge-stress
transhuge-stress: 2.962 s/loop, 0.372 ms/page,   5376.172 MiB/s 7962 succeed,    0 failed, 3873 different pages

Number of successful THP allocations corresponds to free memory on node 0 in
the first case and node 1 in the second case, i.e. -p parameter is ignored and
cpu binding "wins".

After the patch:
> numactl -p0 -C0 ./transhuge-stress
transhuge-stress: 2.183 s/loop, 0.274 ms/page,   7295.516 MiB/s 7962 succeed,    0 failed, 1760 different pages

> numactl -p0 -C12 ./transhuge-stress
transhuge-stress: 2.878 s/loop, 0.361 ms/page,   5533.638 MiB/s 7962 succeed,    0 failed, 1750 different pages

> numactl -p1 -C0 ./transhuge-stress
transhuge-stress: 4.628 s/loop, 0.581 ms/page,   3440.893 MiB/s 7962 succeed,    0 failed, 3918 different pages

The -p parameter is respected regardless of cpu binding.

> numactl -C0 ./transhuge-stress
transhuge-stress: 2.202 s/loop, 0.277 ms/page,   7230.003 MiB/s 7962 succeed,    0 failed, 1750 different pages

> numactl -C12 ./transhuge-stress
transhuge-stress: 3.020 s/loop, 0.379 ms/page,   5273.324 MiB/s 7962 succeed,    0 failed, 3916 different pages

Without -p parameter, hugepage restriction to CPU-local node works as before.

Fixes: 077fcf116c ("mm/thp: allocate transparent hugepages on local node")
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-07-21 10:10:13 -07:00
..
kasan kasan, module, vmalloc: rework shadow allocation for modules 2015-03-12 18:46:08 -07:00
backing-dev.c block: discard bdi_unregister() in favour of bdi_destroy() 2015-06-22 17:03:29 -07:00
balloon_compaction.c mm/balloon_compaction: fix deflation when compaction is disabled 2014-10-29 16:33:15 -07:00
bootmem.c mem-hotplug: reset node managed pages when hot-adding a new pgdat 2014-11-13 16:17:06 -08:00
cleancache.c mm: fix cleancache debugfs directory path 2015-01-20 14:08:31 +01:00
cma.c mm: cma: fix CMA aligned offset calculation 2015-03-12 18:46:07 -07:00
compaction.c mm: page_alloc: add kasan hooks on alloc and free paths 2015-02-13 21:21:41 -08:00
debug-pagealloc.c mm/debug-pagealloc: make debug-pagealloc boottime configurable 2014-12-13 12:42:48 -08:00
debug.c mm: account pmd page tables to the process 2015-02-11 17:06:04 -08:00
dmapool.c mm/dmapool.c: fixed a brace coding style issue 2014-10-09 22:26:00 -04:00
early_ioremap.c mm: create generic early_ioremap() support 2014-04-07 16:36:15 -07:00
fadvise.c vfs: remove get_xip_mem 2015-02-16 17:56:03 -08:00
failslab.c switch debugfs to umode_t 2012-01-03 22:54:56 -05:00
filemap.c dax,ext2: replace XIP read and write with DAX I/O 2015-02-16 17:56:03 -08:00
frontswap.c mm/frontswap.c: fix the condition in BUG_ON 2014-12-10 17:41:08 -08:00
gup.c Tighten rules for ACCESS_ONCE 2015-02-14 10:54:28 -08:00
highmem.c mm/highmem: make kmap cache coloring aware 2014-08-06 18:01:22 -07:00
huge_memory.c mm, thp: really limit transparent hugepage allocation to local node 2015-05-06 22:04:07 +02:00
hugetlb_cgroup.c mm: page_counter: pull "-1" handling out of page_counter_memparse() 2015-02-11 17:06:02 -08:00
hugetlb.c mm/hugetlb: use pmd_page() in follow_huge_pmd() 2015-05-06 22:03:38 +02:00
hwpoison-inject.c mm/hwpoison-inject.c: remove unnecessary null test before debugfs_remove_recursive 2014-08-06 18:01:19 -07:00
init-mm.c atomic: use <linux/atomic.h> 2011-07-26 16:49:47 -07:00
internal.h mm/internal.h: don't split printk call in two 2015-02-12 18:54:10 -08:00
interval_tree.c mm: replace vma->sharead.linear with vma->shared 2015-02-10 14:30:31 -08:00
Kconfig Merge branch 'kconfig' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild 2015-02-19 10:36:45 -08:00
Kconfig.debug mm/debug_pagealloc: remove obsolete Kconfig options 2015-01-08 15:10:52 -08:00
kmemcheck.c mm/slab_common: move kmem_cache definition to internal header 2014-10-09 22:25:50 -04:00
kmemleak-test.c mm/kmemleak-test.c: use pr_fmt for logging 2014-06-06 16:08:18 -07:00
kmemleak.c mm: kmemleak_alloc_percpu() should follow the gfp from per_alloc() 2015-07-21 10:10:13 -07:00
ksm.c mm: remove rest usage of VM_NONLINEAR and pte_file() 2015-02-10 14:30:31 -08:00
list_lru.c memcg: reparent list_lrus and free kmemcg_id on css offline 2015-02-12 18:54:10 -08:00
maccess.c mm: Map most files to use export.h instead of module.h 2011-10-31 09:20:12 -04:00
madvise.c vfs: remove get_xip_mem 2015-02-16 17:56:03 -08:00
Makefile move iov_iter.c from mm/ to lib/ 2015-02-17 22:22:17 -05:00
memblock.c mm/memblock.c: refactor functions to set/clear MEMBLOCK_HOTPLUG 2014-12-13 12:42:46 -08:00
memcontrol.c memcg: disable hierarchy support if bound to the legacy cgroup hierarchy 2015-03-12 18:46:08 -07:00
memory_hotplug.c mm/memory_hotplug.c: set zone->wait_table to null after freeing it 2015-06-22 17:03:35 -07:00
memory-failure.c mm: soft-offline: fix num_poisoned_pages counting on concurrent events 2015-05-17 09:55:07 -07:00
memory.c mm: numa: slow PTE scan rate if migration failures occur 2015-03-25 16:20:31 -07:00
mempolicy.c mm, thp: respect MPOL_PREFERRED policy with non-local node 2015-07-21 10:10:13 -07:00
mempool.c mm/mempool.c: update the kmemleak stack trace for mempool allocations 2014-06-06 16:08:17 -07:00
migrate.c mm: convert p[te|md]_mknonnuma and remaining page table manipulations 2015-02-12 18:54:08 -08:00
mincore.c mincore: apply page table walker on do_mincore() 2015-02-11 17:06:06 -08:00
mlock.c mm: reorder can_do_mlock to fix audit denial 2015-03-12 18:46:08 -07:00
mm_init.c mm/mm_init.c: mark mminit_loglevel __meminitdata 2015-02-12 18:54:11 -08:00
mmap.c mm: fix anon_vma->degree underflow in anon_vma endless growing prevention 2015-03-25 16:20:30 -07:00
mmu_context.c sched/mm: call finish_arch_post_lock_switch in idle_task_exit and use_mm 2014-02-21 08:50:17 +01:00
mmu_notifier.c mmu_notifier: add the callback for mmu_notifier_invalidate_range() 2014-11-13 13:46:09 +11:00
mmzone.c mm: microoptimize zonelist operations 2015-02-11 17:06:02 -08:00
mprotect.c mm: numa: preserve PTE write permissions across a NUMA hinting fault 2015-03-25 16:20:31 -07:00
mremap.c fix mremap() vs. ioctx_kill() race 2015-04-06 17:50:59 -04:00
msync.c mm: remove rest usage of VM_NONLINEAR and pte_file() 2015-02-10 14:30:31 -08:00
nobootmem.c mem-hotplug: reset node managed pages when hot-adding a new pgdat 2014-11-13 16:17:06 -08:00
nommu.c mm/nommu.c: export symbol max_mapnr 2015-03-12 18:46:08 -07:00
oom_kill.c mm: account pmd page tables to the process 2015-02-11 17:06:04 -08:00
page_alloc.c mm, oom: do not fail __GFP_NOFAIL allocation if oom killer is disabled 2015-03-12 18:46:07 -07:00
page_counter.c mm: page_counter: pull "-1" handling out of page_counter_memparse() 2015-02-11 17:06:02 -08:00
page_ext.c mm/page_owner: keep track of page owners 2014-12-13 12:42:48 -08:00
page_io.c new helper: iov_iter_bvec() 2015-01-29 00:13:11 -05:00
page_isolation.c mm/page_alloc.c: call kernel_map_pages in unset_migrateype_isolate 2015-03-25 16:20:30 -07:00
page_owner.c mm/page_owner.c: remove unnecessary stack_trace field 2015-02-11 17:06:07 -08:00
page-writeback.c writeback: use |1 instead of +1 to protect against div by zero 2015-05-17 09:55:07 -07:00
pagewalk.c mm/pagewalk.c: prevent positive return value of walk_page_test() from being passed to callers 2015-03-25 16:20:30 -07:00
percpu-km.c percpu: implmeent pcpu_nr_empty_pop_pages and chunk->nr_populated 2014-09-02 14:46:05 -04:00
percpu-vm.c percpu: move region iterations out of pcpu_[de]populate_chunk() 2014-09-02 14:46:02 -04:00
percpu.c mm: kmemleak_alloc_percpu() should follow the gfp from per_alloc() 2015-07-21 10:10:13 -07:00
pgtable-generic.c mm: convert p[te|md]_mknonnuma and remaining page table manipulations 2015-02-12 18:54:08 -08:00
process_vm_access.c mm: gup: use get_user_pages_unlocked 2015-02-11 17:06:05 -08:00
quicklist.c mm: delete various needless include <linux/module.h> 2011-10-31 09:20:11 -04:00
readahead.c fs: export inode_to_bdi and use it in favor of mapping->backing_dev_info 2015-01-20 14:03:04 -07:00
rmap.c mm: fix anon_vma->degree underflow in anon_vma endless growing prevention 2015-03-25 16:20:30 -07:00
shmem.c mm: shmem: check for mapping owner before dereferencing 2015-02-23 10:00:11 -08:00
slab_common.c mm: slub: add kernel address sanitizer support for slub allocator 2015-02-13 21:21:41 -08:00
slab.c slub: make dead caches discard free slabs immediately 2015-02-12 18:54:10 -08:00
slab.h slub: make dead caches discard free slabs immediately 2015-02-12 18:54:10 -08:00
slob.c slub: make dead caches discard free slabs immediately 2015-02-12 18:54:10 -08:00
slub.c mm/slub: fix lockups on PREEMPT && !SMP kernels 2015-03-25 16:20:30 -07:00
sparse-vmemmap.c mm/sparse: use memblock apis for early memory allocations 2014-01-21 16:19:47 -08:00
sparse.c mm: use macros from compiler.h instead of __attribute__((...)) 2014-04-07 16:35:54 -07:00
swap_cgroup.c mm: page_cgroup: rename file to mm/swap_cgroup.c 2014-12-10 17:41:09 -08:00
swap_state.c fs: remove mapping->backing_dev_info 2015-01-20 14:03:05 -07:00
swap.c Merge branch 'for-3.20/bdi' of git://git.kernel.dk/linux-block 2015-02-12 13:50:21 -08:00
swapfile.c mm: page_cgroup: rename file to mm/swap_cgroup.c 2014-12-10 17:41:09 -08:00
truncate.c fs: export inode_to_bdi and use it in favor of mapping->backing_dev_info 2015-01-20 14:03:04 -07:00
util.c mm/util: add kstrdup_const 2015-02-13 21:21:35 -08:00
vmacache.c mm,vmacache: count number of system-wide flushes 2014-12-13 12:42:48 -08:00
vmalloc.c kasan, module, vmalloc: rework shadow allocation for modules 2015-03-12 18:46:08 -07:00
vmpressure.c mm/vmpressure.c: fix race in vmpressure_work_fn() 2014-12-02 17:32:07 -08:00
vmscan.c Merge branch 'akpm' (patches from Andrew) 2015-02-12 18:54:28 -08:00
vmstat.c vmstat: Reduce time interval to stat update on idle cpu 2015-02-11 17:06:07 -08:00
workingset.c list_lru: add helpers to isolate items 2015-02-12 18:54:10 -08:00
zbud.c mm/zpool: add name argument to create zpool 2015-02-12 18:54:12 -08:00
zpool.c mm/zpool: add name argument to create zpool 2015-02-12 18:54:12 -08:00
zsmalloc.c mm/zsmalloc: add statistics support 2015-02-12 18:54:12 -08:00
zswap.c mm/zpool: add name argument to create zpool 2015-02-12 18:54:12 -08:00