linux-stable/mm
Naoya Horiguchi 014275c884 mm/hugetlb: take page table lock in follow_huge_pmd()
commit e66f17ff71 upstream.

We have a race condition between move_pages() and freeing hugepages, where
move_pages() calls follow_page(FOLL_GET) for hugepages internally and
tries to get its refcount without preventing concurrent freeing.  This
race crashes the kernel, so this patch fixes it by moving FOLL_GET code
for hugepages into follow_huge_pmd() with taking the page table lock.

This patch intentionally removes page==NULL check after pte_page.
This is justified because pte_page() never returns NULL for any
architectures or configurations.

This patch changes the behavior of follow_huge_pmd() for tail pages and
then tail pages can be pinned/returned.  So the caller must be changed to
properly handle the returned tail pages.

We could have a choice to add the similar locking to
follow_huge_(addr|pud) for consistency, but it's not necessary because
currently these functions don't support FOLL_GET flag, so let's leave it
for future development.

Here is the reproducer:

  $ cat movepages.c
  #include <stdio.h>
  #include <stdlib.h>
  #include <numaif.h>

  #define ADDR_INPUT      0x700000000000UL
  #define HPS             0x200000
  #define PS              0x1000

  int main(int argc, char *argv[]) {
          int i;
          int nr_hp = strtol(argv[1], NULL, 0);
          int nr_p  = nr_hp * HPS / PS;
          int ret;
          void **addrs;
          int *status;
          int *nodes;
          pid_t pid;

          pid = strtol(argv[2], NULL, 0);
          addrs  = malloc(sizeof(char *) * nr_p + 1);
          status = malloc(sizeof(char *) * nr_p + 1);
          nodes  = malloc(sizeof(char *) * nr_p + 1);

          while (1) {
                  for (i = 0; i < nr_p; i++) {
                          addrs[i] = (void *)ADDR_INPUT + i * PS;
                          nodes[i] = 1;
                          status[i] = 0;
                  }
                  ret = numa_move_pages(pid, nr_p, addrs, nodes, status,
                                        MPOL_MF_MOVE_ALL);
                  if (ret == -1)
                          err("move_pages");

                  for (i = 0; i < nr_p; i++) {
                          addrs[i] = (void *)ADDR_INPUT + i * PS;
                          nodes[i] = 0;
                          status[i] = 0;
                  }
                  ret = numa_move_pages(pid, nr_p, addrs, nodes, status,
                                        MPOL_MF_MOVE_ALL);
                  if (ret == -1)
                          err("move_pages");
          }
          return 0;
  }

  $ cat hugepage.c
  #include <stdio.h>
  #include <sys/mman.h>
  #include <string.h>

  #define ADDR_INPUT      0x700000000000UL
  #define HPS             0x200000

  int main(int argc, char *argv[]) {
          int nr_hp = strtol(argv[1], NULL, 0);
          char *p;

          while (1) {
                  p = mmap((void *)ADDR_INPUT, nr_hp * HPS, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
                  if (p != (void *)ADDR_INPUT) {
                          perror("mmap");
                          break;
                  }
                  memset(p, 0, nr_hp * HPS);
                  munmap(p, nr_hp * HPS);
          }
  }

  $ sysctl vm.nr_hugepages=40
  $ ./hugepage 10 &
  $ ./movepages 10 $(pgrep -f hugepage)


[n-horiguchi@ah.jp.nec.com: resolve conflict to apply to v3.19.1]
Fixes: e632a938d9 ("mm: migrate: add hugepage migration code to move_pages()")
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reported-by: Hugh Dickins <hughd@google.com>
Cc: James Hogan <james.hogan@imgtec.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Steve Capper <steve.capper@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-04-29 10:23:47 +02:00
..
backing-dev.c Merge branch 'for-3.18/core' of git://git.kernel.dk/linux-block 2014-10-18 11:53:51 -07:00
balloon_compaction.c mm/balloon_compaction: fix deflation when compaction is disabled 2014-10-29 16:33:15 -07:00
bootmem.c mem-hotplug: reset node managed pages when hot-adding a new pgdat 2014-11-13 16:17:06 -08:00
cleancache.c mm: dump page when hitting a VM_BUG_ON using VM_BUG_ON_PAGE 2014-01-23 16:36:50 -08:00
cma.c mm: cma: fix CMA aligned offset calculation 2015-03-26 13:59:45 +01:00
compaction.c mm: fix negative nr_isolated counts 2015-03-18 14:10:54 +01:00
debug-pagealloc.c mm/debug-pagealloc: make debug-pagealloc boottime configurable 2014-12-13 12:42:48 -08:00
debug.c mm: move page->mem_cgroup bad page handling into generic code 2014-12-10 17:41:09 -08:00
dmapool.c mm/dmapool.c: fixed a brace coding style issue 2014-10-09 22:26:00 -04:00
early_ioremap.c mm: create generic early_ioremap() support 2014-04-07 16:36:15 -07:00
fadvise.c mm: fadvise: document the fadvise(FADV_DONTNEED) behaviour for partial pages 2014-12-13 12:42:49 -08:00
failslab.c switch debugfs to umode_t 2012-01-03 22:54:56 -05:00
filemap_xip.c mm/xip: share the i_mmap_rwsem 2014-12-13 12:42:45 -08:00
filemap.c mm: get rid of radix tree gfp mask for pagecache_get_page 2014-12-29 12:45:45 -08:00
fremap.c Merge branch 'drm-next' of git://people.freedesktop.org/~airlied/linux 2014-12-15 15:52:01 -08:00
frontswap.c mm/frontswap.c: fix the condition in BUG_ON 2014-12-10 17:41:08 -08:00
gup.c mm/hugetlb: take page table lock in follow_huge_pmd() 2015-04-29 10:23:47 +02:00
highmem.c mm/highmem: make kmap cache coloring aware 2014-08-06 18:01:22 -07:00
huge_memory.c Merge branch 'drm-next' of git://people.freedesktop.org/~airlied/linux 2014-12-15 15:52:01 -08:00
hugetlb_cgroup.c mm: hugetlb_cgroup: convert to lockless page counters 2014-12-10 17:41:04 -08:00
hugetlb.c mm/hugetlb: take page table lock in follow_huge_pmd() 2015-04-29 10:23:47 +02:00
hwpoison-inject.c mm/hwpoison-inject.c: remove unnecessary null test before debugfs_remove_recursive 2014-08-06 18:01:19 -07:00
init-mm.c atomic: use <linux/atomic.h> 2011-07-26 16:49:47 -07:00
internal.h mm, compaction: always update cached scanner positions 2014-12-10 17:41:06 -08:00
interval_tree.c mm: convert a few VM_BUG_ON callers to VM_BUG_ON_VMA 2014-10-09 22:25:57 -04:00
iov_iter.c copy_from_iter_nocache() 2014-12-08 20:25:23 -05:00
Kconfig mm/balloon_compaction: add vmstat counters and kpageflags bit 2014-10-09 22:26:01 -04:00
Kconfig.debug mm/debug_pagealloc: remove obsolete Kconfig options 2015-01-08 15:10:52 -08:00
kmemcheck.c mm/slab_common: move kmem_cache definition to internal header 2014-10-09 22:25:50 -04:00
kmemleak-test.c mm/kmemleak-test.c: use pr_fmt for logging 2014-06-06 16:08:18 -07:00
kmemleak.c mm: introduce kmemleak_update_trace() 2014-06-06 16:08:17 -07:00
ksm.c vm: add VM_FAULT_SIGSEGV handling support 2015-01-29 10:51:32 -08:00
list_lru.c mm: keep page cache radix tree nodes in check 2014-04-03 16:21:01 -07:00
maccess.c mm: Map most files to use export.h instead of module.h 2011-10-31 09:20:12 -04:00
madvise.c VFS: Rename do_fallocate() to vfs_fallocate() 2014-11-07 16:17:44 -05:00
Makefile mm/page_owner: keep track of page owners 2014-12-13 12:42:48 -08:00
memblock.c mm/memblock.c: refactor functions to set/clear MEMBLOCK_HOTPLUG 2014-12-13 12:42:46 -08:00
memcontrol.c memcg, shmem: fix shmem migration to use lrucare 2015-02-05 13:35:29 -08:00
memory_hotplug.c mm/memory hotplug: postpone the reset of obsolete pgdat 2015-04-19 10:10:17 +02:00
memory-failure.c mm: hwpoison: drop lru_add_drain_all() in __soft_offline_page() 2015-03-18 14:10:54 +01:00
memory.c mm/memory.c: actually remap enough memory 2015-03-18 14:10:54 +01:00
mempolicy.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2014-12-19 18:19:19 -08:00
mempool.c mm/mempool.c: update the kmemleak stack trace for mempool allocations 2014-06-06 16:08:17 -07:00
migrate.c mm/hugetlb: take page table lock in follow_huge_pmd() 2015-04-29 10:23:47 +02:00
mincore.c mm: mincore: add hwpoison page handle 2014-12-13 12:42:46 -08:00
mlock.c Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2014-10-13 15:44:12 +02:00
mm_init.c mm: bring back /sys/kernel/mm 2014-01-27 21:02:39 -08:00
mmap.c mm: fix anon_vma->degree underflow in anon_vma endless growing prevention 2015-04-19 10:10:16 +02:00
mmu_context.c sched/mm: call finish_arch_post_lock_switch in idle_task_exit and use_mm 2014-02-21 08:50:17 +01:00
mmu_notifier.c mmu_notifier: add the callback for mmu_notifier_invalidate_range() 2014-11-13 13:46:09 +11:00
mmzone.c mm: numa: Change page last {nid,pid} into {cpu,pid} 2013-10-09 14:47:45 +02:00
mprotect.c mm: softdirty: enable write notifications on VMAs after VM_SOFTDIRTY cleared 2014-10-14 02:18:28 +02:00
mremap.c Merge git://git.kvack.org/~bcrl/aio-next 2014-12-14 13:36:57 -08:00
msync.c msync: fix incorrect fstart calculation 2014-07-03 09:21:53 -07:00
nobootmem.c mem-hotplug: reset node managed pages when hot-adding a new pgdat 2014-11-13 16:17:06 -08:00
nommu.c mm/nommu: fix memory leak 2015-03-18 14:10:55 +01:00
oom_kill.c oom: kill the insufficient and no longer needed PT_TRACE_EXIT check 2014-12-13 12:42:49 -08:00
page_alloc.c mm: page_alloc: revert inadvertent !__GFP_FS retry behavior change 2015-03-18 14:10:55 +01:00
page_counter.c mm: memcontrol: remove obsolete kmemcg pinning tricks 2014-12-10 17:41:05 -08:00
page_ext.c mm/page_owner: keep track of page owners 2014-12-13 12:42:48 -08:00
page_io.c fix __swap_writepage() compile failure on old gcc versions 2014-06-14 19:30:48 -05:00
page_isolation.c mm/page_alloc.c: call kernel_map_pages in unset_migrateype_isolate 2015-04-19 10:10:17 +02:00
page_owner.c mm/page_owner: correct owner information for early allocated pages 2014-12-13 12:42:48 -08:00
page-writeback.c writeback: fix possible underflow in write bandwidth calculation 2015-04-19 10:10:17 +02:00
pagewalk.c mm: pagewalk: call pte_hole() for VM_PFNMAP during walk_page_range 2015-02-05 13:35:29 -08:00
percpu-km.c percpu: implmeent pcpu_nr_empty_pop_pages and chunk->nr_populated 2014-09-02 14:46:05 -04:00
percpu-vm.c percpu: move region iterations out of pcpu_[de]populate_chunk() 2014-09-02 14:46:02 -04:00
percpu.c percpu: off by one in BUG_ON() 2014-10-29 10:34:34 -04:00
pgtable-generic.c mm: actually clear pmd_numa before invalidating 2014-08-29 16:28:15 -07:00
process_vm_access.c start adding the tag to iov_iter 2014-05-06 17:32:49 -04:00
quicklist.c mm: delete various needless include <linux/module.h> 2011-10-31 09:20:11 -04:00
readahead.c mm/readahead.c: remove unused file_ra_state from count_history_pages 2014-08-06 18:01:15 -07:00
rmap.c mm: fix anon_vma->degree underflow in anon_vma endless growing prevention 2015-04-19 10:10:16 +02:00
shmem.c memcg, shmem: fix shmem migration to use lrucare 2015-02-05 13:35:29 -08:00
slab_common.c memcg: use generic slab iterators for showing slabinfo 2014-12-10 17:41:07 -08:00
slab.c slab: fix cpuset check in fallback_alloc 2014-12-13 12:42:53 -08:00
slab.h memcg: use generic slab iterators for showing slabinfo 2014-12-10 17:41:07 -08:00
slob.c mm/sl[ao]b: always track caller in kmalloc_(node_)track_caller() 2014-10-09 22:25:50 -04:00
slub.c slub: fix cpuset check in get_any_partial 2014-12-13 12:42:53 -08:00
sparse-vmemmap.c mm/sparse: use memblock apis for early memory allocations 2014-01-21 16:19:47 -08:00
sparse.c mm: use macros from compiler.h instead of __attribute__((...)) 2014-04-07 16:35:54 -07:00
swap_cgroup.c mm: page_cgroup: rename file to mm/swap_cgroup.c 2014-12-10 17:41:09 -08:00
swap_state.c mm: page_cgroup: rename file to mm/swap_cgroup.c 2014-12-10 17:41:09 -08:00
swap.c mm: memcontrol: do not kill uncharge batching in free_pages_and_swap_cache 2014-10-09 22:25:59 -04:00
swapfile.c mm: page_cgroup: rename file to mm/swap_cgroup.c 2014-12-10 17:41:09 -08:00
truncate.c mm: Fix comment before truncate_setsize() 2014-11-07 08:29:25 +11:00
util.c proc/maps: make vm_is_stack() logic namespace-friendly 2014-10-09 22:25:50 -04:00
vmacache.c mm,vmacache: count number of system-wide flushes 2014-12-13 12:42:48 -08:00
vmalloc.c mm/vmalloc.c: fix memory ordering bug 2014-12-13 12:42:49 -08:00
vmpressure.c mm/vmpressure.c: fix race in vmpressure_work_fn() 2014-12-02 17:32:07 -08:00
vmscan.c mm/vmscan: fix highidx argument type 2015-01-26 13:37:18 -08:00
vmstat.c vmstat: do not use deferrable delayed work for vmstat_update 2015-03-18 14:11:12 +01:00
workingset.c mm: keep page cache radix tree nodes in check 2014-04-03 16:21:01 -07:00
zbud.c mm/zbud: init user ops only when it is needed 2014-12-13 12:42:51 -08:00
zpool.c mm/zpool: use prefixed module loading 2014-08-29 16:28:16 -07:00
zsmalloc.c mm/zsmalloc: adjust order of functions 2014-12-18 19:08:11 -08:00
zswap.c mm/zswap: delete unnecessary check before calling free_percpu() 2014-12-13 12:42:50 -08:00