linux-stable/mm
Mel Gorman 2cc7be2ed2 mm/fadvise.c: drain all pagevecs if POSIX_FADV_DONTNEED fails to discard all pages
commit 67d46b296a upstream.

Rob van der Heij reported the following (paraphrased) on private mail.

	The scenario is that I want to avoid backups to fill up the page
	cache and purge stuff that is more likely to be used again (this is
	with s390x Linux on z/VM, so I don't give it as much memory that
	we don't care anymore). So I have something with LD_PRELOAD that
	intercepts the close() call (from tar, in this case) and issues
	a posix_fadvise() just before closing the file.

	This mostly works, except for small files (less than 14 pages)
	that remains in page cache after the face.

Unfortunately Rob has not had a chance to test this exact patch but the
test program below should be reproducing the problem he described.

The issue is the per-cpu pagevecs for LRU additions.  If the pages are
added by one CPU but fadvise() is called on another then the pages
remain resident as the invalidate_mapping_pages() only drains the local
pagevecs via its call to pagevec_release().  The user-visible effect is
that a program that uses fadvise() properly is not obeyed.

A possible fix for this is to put the necessary smarts into
invalidate_mapping_pages() to globally drain the LRU pagevecs if a
pagevec page could not be discarded.  The downside with this is that an
inode cache shrink would send a global IPI and memory pressure
potentially causing global IPI storms is very undesirable.

Instead, this patch adds a check during fadvise(POSIX_FADV_DONTNEED) to
check if invalidate_mapping_pages() discarded all the requested pages.
If a subset of pages are discarded it drains the LRU pagevecs and tries
again.  If the second attempt fails, it assumes it is due to the pages
being mapped, locked or dirty and does not care.  With this patch, an
application using fadvise() correctly will be obeyed but there is a
downside that a malicious application can force the kernel to send
global IPIs and increase overhead.

If accepted, I would like this to be considered as a -stable candidate.
It's not an urgent issue but it's a system call that is not working as
advertised which is weak.

The following test program demonstrates the problem.  It should never
report that pages are still resident but will without this patch.  It
assumes that CPU 0 and 1 exist.

int main() {
	int fd;
	int pagesize = getpagesize();
	ssize_t written = 0, expected;
	char *buf;
	unsigned char *vec;
	int resident, i;
	cpu_set_t set;

	/* Prepare a buffer for writing */
	expected = FILESIZE_PAGES * pagesize;
	buf = malloc(expected + 1);
	if (buf == NULL) {
		printf("ENOMEM\n");
		exit(EXIT_FAILURE);
	}
	buf[expected] = 0;
	memset(buf, 'a', expected);

	/* Prepare the mincore vec */
	vec = malloc(FILESIZE_PAGES);
	if (vec == NULL) {
		printf("ENOMEM\n");
		exit(EXIT_FAILURE);
	}

	/* Bind ourselves to CPU 0 */
	CPU_ZERO(&set);
	CPU_SET(0, &set);
	if (sched_setaffinity(getpid(), sizeof(set), &set) == -1) {
		perror("sched_setaffinity");
		exit(EXIT_FAILURE);
	}

	/* open file, unlink and write buffer */
	fd = open("fadvise-test-file", O_CREAT|O_EXCL|O_RDWR);
	if (fd == -1) {
		perror("open");
		exit(EXIT_FAILURE);
	}
	unlink("fadvise-test-file");
	while (written < expected) {
		ssize_t this_write;
		this_write = write(fd, buf + written, expected - written);

		if (this_write == -1) {
			perror("write");
			exit(EXIT_FAILURE);
		}

		written += this_write;
	}
	free(buf);

	/*
	 * Force ourselves to another CPU. If fadvise only flushes the local
	 * CPUs pagevecs then the fadvise will fail to discard all file pages
	 */
	CPU_ZERO(&set);
	CPU_SET(1, &set);
	if (sched_setaffinity(getpid(), sizeof(set), &set) == -1) {
		perror("sched_setaffinity");
		exit(EXIT_FAILURE);
	}

	/* sync and fadvise to discard the page cache */
	fsync(fd);
	if (posix_fadvise(fd, 0, expected, POSIX_FADV_DONTNEED) == -1) {
		perror("posix_fadvise");
		exit(EXIT_FAILURE);
	}

	/* map the file and use mincore to see which parts of it are resident */
	buf = mmap(NULL, expected, PROT_READ, MAP_SHARED, fd, 0);
	if (buf == NULL) {
		perror("mmap");
		exit(EXIT_FAILURE);
	}
	if (mincore(buf, expected, vec) == -1) {
		perror("mincore");
		exit(EXIT_FAILURE);
	}

	/* Check residency */
	for (i = 0, resident = 0; i < FILESIZE_PAGES; i++) {
		if (vec[i])
			resident++;
	}
	if (resident != 0) {
		printf("Nr unexpected pages resident: %d\n", resident);
		exit(EXIT_FAILURE);
	}

	munmap(buf, expected);
	close(fd);
	free(vec);
	exit(EXIT_SUCCESS);
}

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reported-by: Rob van der Heij <rvdheij@gmail.com>
Tested-by: Rob van der Heij <rvdheij@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-02-27 09:21:09 -08:00
..
backing-dev.c backing-dev: use kstrto* in preference to simple_strtoul 2012-08-25 16:58:14 +08:00
bootmem.c mm: bootmem: fix free_all_bootmem_core() with odd bitmap alignment 2013-01-17 08:45:55 -08:00
bounce.c bounce: allow use of bounce pool via config option 2012-07-18 16:40:35 -04:00
cleancache.c ->encode_fh() API change 2012-05-29 23:28:33 -04:00
compaction.c mm: compaction: partially revert capture of suitable high-order page 2013-01-17 08:46:10 -08:00
debug-pagealloc.c mm, x86: Remove debug_pagealloc_enabled 2011-12-06 09:24:07 +01:00
dmapool.c mm: dmapool: use provided gfp flags for all dma_alloc_coherent() calls 2012-12-17 11:06:52 -08:00
fadvise.c mm/fadvise.c: drain all pagevecs if POSIX_FADV_DONTNEED fails to discard all pages 2013-02-27 09:21:09 -08:00
failslab.c switch debugfs to umode_t 2012-01-03 22:54:56 -05:00
filemap_xip.c mm: move all mmu notifier invocations to be done outside the PT lock 2012-10-09 16:22:58 +09:00
filemap.c readahead: fault retry breaks mmap file read random detection 2012-10-09 16:22:47 +09:00
fremap.c remap_file_pages: correctly handle the case of a NULL vm_ops pointer 2012-10-19 13:37:57 -07:00
frontswap.c frontswap: support exclusive gets if tmem backend is capable 2012-09-21 10:38:12 -04:00
highmem.c mm: highmem: export kmap_to_page for modules 2013-01-11 09:18:18 -08:00
huge_memory.c mm: huge_memory: Fix build error. 2012-10-15 07:59:15 -07:00
hugetlb_cgroup.c mm/hugetlb: create hugetlb cgroup file in hugetlb_init 2013-01-11 09:18:50 -08:00
hugetlb.c mm/hugetlb: set PTE as huge in hugetlb_change_protection and remove_migration_pte 2013-02-11 09:04:44 -08:00
hwpoison-inject.c memcg: rename config variables 2012-07-31 18:42:43 -07:00
init-mm.c atomic: use <linux/atomic.h> 2011-07-26 16:49:47 -07:00
internal.h mm: compaction: partially revert capture of suitable high-order page 2013-01-17 08:46:10 -08:00
interval_tree.c mm: add CONFIG_DEBUG_VM_RB build option 2012-10-09 16:22:42 +09:00
Kconfig mm: enable CONFIG_COMPACTION by default 2012-10-09 16:22:53 +09:00
Kconfig.debug mm: more intensive memory corruption debugging 2012-01-10 16:30:42 -08:00
kmemcheck.c kmemcheck: Fix build errors due to missing slab.h 2010-03-30 22:02:32 +09:00
kmemleak-test.c kmemleak: remove memset by using kzalloc 2011-01-27 18:31:51 +00:00
kmemleak.c kmemleak: use rbtree instead of prio tree 2012-10-09 16:22:39 +09:00
ksm.c mm: wrap calls to set_pte_at_notify with invalidate_range_start and invalidate_range_end 2012-10-09 16:22:58 +09:00
maccess.c mm: Map most files to use export.h instead of module.h 2011-10-31 09:20:12 -04:00
madvise.c mm: prepare VM_DONTDUMP for using in drivers 2012-10-09 16:22:18 +09:00
Makefile mm: replace vma prio_tree with an interval tree 2012-10-09 16:22:39 +09:00
memblock.c x86, mm: Trim memory in memblock to be page aligned 2012-10-24 11:52:21 -07:00
memcontrol.c memcg: fix hotplugged memory zone oops 2012-11-16 14:33:04 -08:00
memory_hotplug.c revert "mm: fix-up zone present pages" 2012-11-16 14:33:04 -08:00
memory-failure.c mm: soft offline: split thp at the beginning of soft_offline_page() 2012-11-30 08:51:18 -08:00
memory.c mm: limit mmu_gather batching to fix soft lockups on !CONFIG_PREEMPT 2013-01-11 09:19:05 -08:00
mempolicy.c tmpfs mempolicy: fix /proc/mounts corrupting memory 2013-01-11 09:18:19 -08:00
mempool.c mempool: add @gfp_mask to mempool_create_node() 2012-06-25 11:53:47 +02:00
migrate.c mm/hugetlb: set PTE as huge in hugetlb_change_protection and remove_migration_pte 2013-02-11 09:04:44 -08:00
mincore.c mm: thp: fix pmd_bad() triggering in code paths holding mmap_sem read mode 2012-03-21 17:54:54 -07:00
mlock.c mm: don't overwrite mm->def_flags in do_mlockall() 2013-02-17 10:53:12 -08:00
mm_init.c mm: Map most files to use export.h instead of module.h 2011-10-31 09:20:12 -04:00
mmap.c mm: add anon_vma_lock to validate_mm() 2012-11-16 14:33:03 -08:00
mmu_context.c mm, counters: remove task argument to sync_mm_rss() and __sync_task_rss_stat() 2012-03-21 17:54:59 -07:00
mmu_notifier.c mmu_notifier_unregister NULL Pointer deref and multiple ->release() callouts 2013-02-27 09:21:08 -08:00
mmzone.c memcg: fix hotplugged memory zone oops 2012-11-16 14:33:04 -08:00
mprotect.c Merge branch 'akpm' (Andrew's patch-bomb) 2012-03-22 09:04:48 -07:00
mremap.c mm: move all mmu notifier invocations to be done outside the PT lock 2012-10-09 16:22:58 +09:00
msync.c sanitize vfs_fsync calling conventions 2010-05-21 18:31:21 -04:00
nobootmem.c revert "mm: fix-up zone present pages" 2012-11-16 14:33:04 -08:00
nommu.c mm: replace vma prio_tree with an interval tree 2012-10-09 16:22:39 +09:00
oom_kill.c oom: remove deprecated oom_adj 2012-10-09 16:22:24 +09:00
page_alloc.c mm: fix pageblock bitmap allocation 2013-02-27 09:20:54 -08:00
page_cgroup.c memcg: rename config variables 2012-07-31 18:42:43 -07:00
page_io.c mm: add support for direct_IO to highmem pages 2012-07-31 18:42:47 -07:00
page_isolation.c mm/page_alloc: refactor out __alloc_contig_migrate_alloc() 2012-10-09 16:22:52 +09:00
page-writeback.c mm: fix calculation of dirtyable memory 2013-01-11 09:18:18 -08:00
pagewalk.c mm: fix kernel-doc warnings 2012-06-20 14:39:36 -07:00
percpu-km.c percpu: clear memory allocated with the km allocator 2010-10-02 10:28:42 +03:00
percpu-vm.c mm: fix kernel-doc warnings 2012-06-20 14:39:36 -07:00
percpu.c sections: fix section conflicts in mm/percpu.c 2012-10-06 03:04:44 +09:00
pgtable-generic.c thp: introduce pmdp_invalidate() 2012-10-09 16:22:29 +09:00
process_vm_access.c aio/vfs: cleanup of rw_copy_check_uvector() and compat_rw_copy_check_uvector() 2012-05-31 17:49:32 -07:00
quicklist.c mm: delete various needless include <linux/module.h> 2011-10-31 09:20:11 -04:00
readahead.c switch simple cases of fget_light to fdget 2012-09-26 22:20:08 -04:00
rmap.c mm: fix XFS oops due to dirty pages without buffers on s390 2012-10-25 14:37:52 -07:00
shmem.c tmpfs: fix use-after-free of mempolicy object 2013-02-27 09:21:09 -08:00
slab_common.c mm, slab: release slab_mutex earlier in kmem_cache_destroy() 2012-10-10 09:25:08 +03:00
slab.c Merge branch 'slab/for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux 2012-10-07 07:53:13 +09:00
slab.h Revert "mm/sl[aou]b: Move sysfs_slab_add to common" 2012-09-05 12:07:44 +03:00
slob.c Merge branch 'testing/driver-warnings' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc into fixes 2012-10-19 15:40:18 -07:00
slub.c slub: assign refcount for kmalloc_caches 2013-02-03 18:27:08 -06:00
sparse-vmemmap.c mm: delete various needless include <linux/module.h> 2011-10-31 09:20:11 -04:00
sparse.c mm/vmemmap: fix wrong use of virt_to_page 2012-11-30 08:51:17 -08:00
swap_state.c mm: add support for a filesystem to activate swap files and use direct_IO for writing swap pages 2012-07-31 18:42:47 -07:00
swap.c mm: remove vma arg from page_evictable 2012-10-09 16:22:55 +09:00
swapfile.c swapfile: fix name leak in swapoff 2012-11-16 14:33:04 -08:00
truncate.c mm: use clear_page_mlock() in page_remove_rmap() 2012-10-09 16:22:56 +09:00
util.c mm: Use __do_krealloc to do the krealloc job 2012-09-04 10:22:58 +03:00
vmalloc.c mm: use %pK for /proc/vmallocinfo 2012-10-09 16:23:03 +09:00
vmscan.c mm: vmscan: fix inappropriate zone congestion clearing 2012-12-08 08:41:18 -08:00
vmstat.c mm: remove unevictable_pgs_mlockfreed 2012-10-09 16:22:59 +09:00