5424 Commits

Author SHA1 Message Date
Pekka Enberg
d20bbfab01 Merge branch 'slab/urgent' into slab/next 2011-09-19 17:46:07 +03:00
Jiri Kosina
e060c38434 Merge branch 'master' into for-next
Fast-forward merge with Linus to be able to merge patches
based on more recent version of the tree.
2011-09-15 15:08:18 +02:00
Joe Perches
8c1fec1ba8 mm: Convert vmalloc/memset to vzalloc
Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Paul Menage <menage@google.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2011-09-15 13:56:56 +02:00
Shaohua Li
cc39c6a9bb mm: account skipped entries to avoid looping in find_get_pages
The found entries by find_get_pages() could be all swap entries.  In
this case we skip the entries, but make sure the skipped entries are
accounted, so we don't keep looping.

Using nr_found > nr_skip to simplify code as suggested by Eric.

Reported-and-tested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-09-14 18:17:56 -07:00
David Vrabel
461ae488ec mm: sync vmalloc address space page tables in alloc_vm_area()
Xen backend drivers (e.g., blkback and netback) would sometimes fail to
map grant pages into the vmalloc address space allocated with
alloc_vm_area().  The GNTTABOP_map_grant_ref would fail because Xen could
not find the page (in the L2 table) containing the PTEs it needed to
update.

(XEN) mm.c:3846:d0 Could not find L1 PTE for address fbb42000

netback and blkback were making the hypercall from a kernel thread where
task->active_mm != &init_mm and alloc_vm_area() was only updating the page
tables for init_mm.  The usual method of deferring the update to the page
tables of other processes (i.e., after taking a fault) doesn't work as a
fault cannot occur during the hypercall.

This would work on some systems depending on what else was using vmalloc.

Fix this by reverting ef691947d8a3 ("vmalloc: remove vmalloc_sync_all()
from alloc_vm_area()") and add a comment to explain why it's needed.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Cc: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Ian Campbell <Ian.Campbell@citrix.com>
Cc: Keir Fraser <keir.xen@gmail.com>
Cc: <stable@kernel.org>		[3.0.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-09-14 18:09:38 -07:00
Johannes Weiner
185efc0f9a memcg: Revert "memcg: add memory.vmscan_stat"
Revert the post-3.0 commit 82f9d486e59f5 ("memcg: add
memory.vmscan_stat").

The implementation of per-memcg reclaim statistics violates how memcg
hierarchies usually behave: hierarchically.

The reclaim statistics are accounted to child memcgs and the parent
hitting the limit, but not to hierarchy levels in between.  Usually,
hierarchical statistics are perfectly recursive, with each level
representing the sum of itself and all its children.

Since this exports statistics to userspace, this may lead to confusion
and problems with changing things after the release, so revert it now,
we can try again later.

Signed-off-by: Johannes Weiner <jweiner@redhat.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Ying Han <yinghan@google.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-09-14 18:09:38 -07:00
Johannes Weiner
a4d3e9e763 mm: vmscan: fix force-scanning small targets without swap
Without swap, anonymous pages are not scanned.  As such, they should not
count when considering force-scanning a small target if there is no swap.

Otherwise, targets are not force-scanned even when their effective scan
number is zero and the other conditions--kswapd/memcg--apply.

This fixes 246e87a93934 ("memcg: fix get_scan_count() for small
targets").

[akpm@linux-foundation.org: fix comment]
Signed-off-by: Johannes Weiner <jweiner@redhat.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: Ying Han <yinghan@google.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-09-14 18:09:37 -07:00
David Rientjes
0d6617c773 numa: fix NUMA compile error when sysfs and procfs are disabled
The vmstat_text array is only defined for CONFIG_SYSFS or CONFIG_PROC_FS,
yet it is referenced for per-node vmstat with CONFIG_NUMA:

	drivers/built-in.o: In function `node_read_vmstat':
	node.c:(.text+0x1106df): undefined reference to `vmstat_text'

Introduced in commit fa25c503dfa2 ("mm: per-node vmstat: show proper
vmstats").

Define the array for CONFIG_NUMA as well.

[akpm@linux-foundation.org: remove unneeded ifdefs]
Signed-off-by: David Rientjes <rientjes@google.com>
Reported-by: Cong Wang <amwang@redhat.com>
Acked-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-09-14 18:09:37 -07:00
KAMEZAWA Hiroyuki
2bbff6c761 mm/mempolicy.c: make copy_from_user() provably correct
When compiling mm/mempolicy.c with struct user copy checks the following
warning is shown:

  In file included from arch/x86/include/asm/uaccess.h:572,
                   from include/linux/uaccess.h:5,
                   from include/linux/highmem.h:7,
                   from include/linux/pagemap.h:10,
                   from include/linux/mempolicy.h:70,
                   from mm/mempolicy.c:68:
  In function `copy_from_user',
      inlined from `compat_sys_get_mempolicy' at mm/mempolicy.c:1415:
  arch/x86/include/asm/uaccess_64.h:64: warning: call to `copy_from_user_overflow' declared with attribute warning: copy_from_user() buffer size is not provably correct
    LD      mm/built-in.o

Fix this by passing correct buffer size value.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-09-14 18:09:36 -07:00
Caspar Zhang
8aacc9f550 mm/mempolicy.c: fix pgoff in mbind vma merge
commit 9d8cebd4bcd7 ("mm: fix mbind vma merge problem") didn't really
fix the mbind vma merge problem due to wrong pgoff value passing to
vma_merge(), which made vma_merge() always return NULL.

Before the patch applied, we are getting a result like:

  addr = 0x7fa58f00c000
  [snip]
  7fa58f00c000-7fa58f00d000 rw-p 00000000 00:00 0
  7fa58f00d000-7fa58f00e000 rw-p 00000000 00:00 0
  7fa58f00e000-7fa58f00f000 rw-p 00000000 00:00 0

here 7fa58f00c000->7fa58f00f000 we get 3 VMAs which are expected to be
merged described as described in commit 9d8cebd.

Re-testing the patched kernel with the reproducer provided in commit
9d8cebd, we get the correct result:

  addr = 0x7ffa5aaa2000
  [snip]
  7ffa5aaa2000-7ffa5aaa6000 rw-p 00000000 00:00 0
  7fffd556f000-7fffd5584000 rw-p 00000000 00:00 0                          [stack]

Signed-off-by: Caspar Zhang <caspar@casparzhang.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-09-14 18:09:36 -07:00
Alex,Shi
12d79634f8 slub: Code optimization in get_partial_node()
I find a way to reduce a variable in get_partial_node(). That is also helpful
for code understanding.

Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Alex Shi <alex.shi@intel.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-09-13 20:41:25 +03:00
Jan Kara
09f40f98bf mm: Add comment explaining task state setting in bdi_forker_thread()
CC: Wu Fengguang <fengguang.wu@intel.com>
CC: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-09-02 17:17:02 -06:00
Jan Kara
5a042aa4b8 mm: Cleanup clearing of BDI_pending bit in bdi_forker_thread()
bdi_forker_thread() clears BDI_pending bit at the end of the main loop.
However clearing of this bit must not be done in some cases which is
handled by calling 'continue' from switch statement. That's kind of
unusual construct and without a good reason so change the function into
more intuitive code flow.

CC: Wu Fengguang <fengguang.wu@intel.com>
CC: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-09-02 17:17:02 -06:00
Shaohua Li
136333d104 slub: explicitly document position of inserting slab to partial list
Adding slab to partial list head/tail is sensitive to performance.
So explicitly uses DEACTIVATE_TO_TAIL/DEACTIVATE_TO_HEAD to document
it to avoid we get it wrong.

Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Shaohua Li <shli@kernel.org>
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-08-27 11:59:00 +03:00
Shaohua Li
130655ef09 slub: add slab with one free object to partial list tail
The slab has just one free object, adding it to partial list head doesn't make
sense. And it can cause lock contentation. For example,
1. CPU takes the slab from partial list
2. fetch an object
3. switch to another slab
4. free an object, then the slab is added to partial list again
In this way n->list_lock will be heavily contended.
In fact, Alex had a hackbench regression. 3.1-rc1 performance drops about 70%
against 3.0. This patch fixes it.

Acked-by: Christoph Lameter <cl@linux.com>
Reported-by: Alex Shi <alex.shi@intel.com>
Signed-off-by: Shaohua Li <shli@kernel.org>
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-08-27 11:58:59 +03:00
Johannes Weiner
23751be009 memcg: fix hierarchical oom locking
Commit 79dfdaccd1d5 ("memcg: make oom_lock 0 and 1 based rather than
counter") tried to oom lock the hierarchy and roll back upon
encountering an already locked memcg.

The code is confused when it comes to detecting a locked memcg, though,
so it would fail and rollback after locking one memcg and encountering
an unlocked second one.

The result is that oom-locking hierarchies fails unconditionally and
that every oom killer invocation simply goes to sleep on the oom
waitqueue forever.  The tasks practically hang forever without anyone
intervening, possibly holding locks that trip up unrelated tasks, too.

Signed-off-by: Johannes Weiner <jweiner@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-08-25 16:25:34 -07:00
Shaohua Li
439423f689 vmscan: clear ZONE_CONGESTED for zone with good watermark
ZONE_CONGESTED is only cleared in kswapd, but pages can be freed in any
task.  It's possible ZONE_CONGESTED isn't cleared in some cases:

 1. the zone is already balanced just entering balance_pgdat() for
    order-0 because concurrent tasks free memory.  In this case, later
    check will skip the zone as it's balanced so the flag isn't cleared.

 2. high order balance fallbacks to order-0.  quote from Mel: At the
    end of balance_pgdat(), kswapd uses the following logic;

	If reclaiming at high order {
		for each zone {
			if all_unreclaimable
				skip
			if watermark is not met
				order = 0
				loop again

			/* watermark is met */
			clear congested
		}
	}

    i.e. it clears ZONE_CONGESTED if it the zone is balanced.  if not,
    it restarts balancing at order-0.  However, if the higher zones are
    balanced for order-0, kswapd will miss clearing ZONE_CONGESTED as
    that only happens after a zone is shrunk.  This can mean that
    wait_iff_congested() stalls unnecessarily.

This patch makes kswapd clear ZONE_CONGESTED during its initial
highmem->dma scan for zones that are already balanced.

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-08-25 16:25:34 -07:00
Shaohua Li
f51bdd2e97 mm: fix a vmscan warning
I get the below warning:

  BUG: using smp_processor_id() in preemptible [00000000] code: bash/746
  caller is native_sched_clock+0x37/0x6e
  Pid: 746, comm: bash Tainted: G        W   3.0.0+ #254
  Call Trace:
   [<ffffffff813435c6>] debug_smp_processor_id+0xc2/0xdc
   [<ffffffff8104158d>] native_sched_clock+0x37/0x6e
   [<ffffffff81116219>] try_to_free_mem_cgroup_pages+0x7d/0x270
   [<ffffffff8114f1f8>] mem_cgroup_force_empty+0x24b/0x27a
   [<ffffffff8114ff21>] ? sys_close+0x38/0x138
   [<ffffffff8114ff21>] ? sys_close+0x38/0x138
   [<ffffffff8114f257>] mem_cgroup_force_empty_write+0x17/0x19
   [<ffffffff810c72fb>] cgroup_file_write+0xa8/0xba
   [<ffffffff811522d2>] vfs_write+0xb3/0x138
   [<ffffffff8115241a>] sys_write+0x4a/0x71
   [<ffffffff8114ffd9>] ? sys_close+0xf0/0x138
   [<ffffffff8176deab>] system_call_fastpath+0x16/0x1b

sched_clock() can't be used with preempt enabled.  And we don't need
fast approach to get clock here, so let's use ktime API.

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Tested-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-08-25 16:25:34 -07:00
Johannes Weiner
5af12d0efd memcg: pin execution to current cpu while draining stock
Commit d1a05b6973c7 ("memcg do not try to drain per-cpu caches without
pages") added a drain_local_stock() call to a preemptible section.

The draining task looks up the cpu-local stock twice to set the
draining-flag, then to drain the stock and clear the flag again.  If the
task is migrated to a different CPU in between, noone will clear the
flag on the first stock and it will be forever undrainable.  Its charge
can not be recovered and the cgroup can not be deleted anymore.

Properly pin the task to the executing CPU while draining stocks.

Signed-off-by: Johannes Weiner <jweiner@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com
Acked-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-08-25 16:25:33 -07:00
Linus Torvalds
e33f2d238e Merge branch 'urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback
* 'urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback:
  squeeze max-pause area and drop pass-good area
2011-08-25 10:40:12 -07:00
Justin P. Mattock
81d66c70b5 mm/vmscan.c: fix a typo in a comment "relaimed" to "reclaimed"
Signed-off-by: Justin P. Mattock <justinmattock@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2011-08-24 16:45:10 +02:00
Christoph Lameter
49e2258586 slub: per cpu cache for partial pages
Allow filling out the rest of the kmem_cache_cpu cacheline with pointers to
partial pages. The partial page list is used in slab_free() to avoid
per node lock taking.

In __slab_alloc() we can then take multiple partial pages off the per
node partial list in one go reducing node lock pressure.

We can also use the per cpu partial list in slab_alloc() to avoid scanning
partial lists for pages with free objects.

The main effect of a per cpu partial list is that the per node list_lock
is taken for batches of partial pages instead of individual ones.

Potential future enhancements:

1. The pickup from the partial list could be perhaps be done without disabling
   interrupts with some work. The free path already puts the page into the
   per cpu partial list without disabling interrupts.

2. __slab_free() may have some code paths that could use optimization.

Performance:

				Before		After
./hackbench 100 process 200000
				Time: 1953.047	1564.614
./hackbench 100 process 20000
				Time: 207.176   156.940
./hackbench 100 process 20000
				Time: 204.468	156.940
./hackbench 100 process 20000
				Time: 204.879	158.772
./hackbench 10 process 20000
				Time: 20.153	15.853
./hackbench 10 process 20000
				Time: 20.153	15.986
./hackbench 10 process 20000
				Time: 19.363	16.111
./hackbench 1 process 20000
				Time: 2.518	2.307
./hackbench 1 process 20000
				Time: 2.258	2.339
./hackbench 1 process 20000
				Time: 2.864	2.163

Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-08-19 19:34:27 +03:00
Christoph Lameter
497b66f2ec slub: return object pointer from get_partial() / new_slab().
There is no need anymore to return the pointer to a slab page from get_partial()
since the page reference can be stored in the kmem_cache_cpu structures "page" field.

Return an object pointer instead.

That in turn allows a simplification of the spaghetti code in __slab_alloc().

Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-08-19 19:34:27 +03:00
Christoph Lameter
acd19fd1a7 slub: pass kmem_cache_cpu pointer to get_partial()
Pass the kmem_cache_cpu pointer to get_partial(). That way
we can avoid the this_cpu_write() statements.

Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-08-19 19:34:26 +03:00
Christoph Lameter
e6e82ea112 slub: Prepare inuse field in new_slab()
inuse will always be set to page->objects. There is no point in
initializing the field to zero in new_slab() and then overwriting
the value in __slab_alloc().

Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-08-19 19:34:26 +03:00
Christoph Lameter
7db0d70540 slub: Remove useless statements in __slab_alloc
Two statements in __slab_alloc() do not have any effect.

1. c->page is already set to NULL by deactivate_slab() called right before.

2. gfpflags are masked in new_slab() before being passed to the page
   allocator. There is no need to mask gfpflags in __slab_alloc in particular
   since most frequent processing in __slab_alloc does not require the use of a
   gfpmask.

Cc: torvalds@linux-foundation.org
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-08-19 19:34:25 +03:00
Christoph Lameter
69cb8e6b7c slub: free slabs without holding locks
There are two situations in which slub holds a lock while releasing
pages:

	A. During kmem_cache_shrink()
	B. During kmem_cache_close()

For A build a list while holding the lock and then release the pages
later. In case of B we are the last remaining user of the slab so
there is no need to take the listlock.

After this patch all calls to the page allocator to free pages are
done without holding any spinlocks. kmem_cache_destroy() will still
hold the slub_lock semaphore.

Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-08-19 19:34:25 +03:00
Wu Fengguang
bb0822954a squeeze max-pause area and drop pass-good area
Revert the pass-good area introduced in ffd1f609ab10 ("writeback:
introduce max-pause and pass-good dirty limits") and make the max-pause
area smaller and safe.

This fixes ~30% performance regression in the ext3 data=writeback
fio_mmap_randwrite_64k/fio_mmap_randrw_64k test cases, where there are
12 JBOD disks, on each disk runs 8 concurrent tasks doing reads+writes.

Using deadline scheduler also has a regression, but not that big as CFQ,
so this suggests we have some write starvation.

The test logs show that

- the disks are sometimes under utilized

- global dirty pages sometimes rush high to the pass-good area for
  several hundred seconds, while in the mean time some bdi dirty pages
  drop to very low value (bdi_dirty << bdi_thresh).  Then suddenly the
  global dirty pages dropped under global dirty threshold and bdi_dirty
  rush very high (for example, 2 times higher than bdi_thresh). During
  which time balance_dirty_pages() is not called at all.

So the problems are

1) The random writes progress so slow that they break the assumption of
   the max-pause logic that "8 pages per 200ms is typically more than
   enough to curb heavy dirtiers".

2) The max-pause logic ignored task_bdi_thresh and thus opens the possibility
   for some bdi's to over dirty pages, leading to (bdi_dirty >> bdi_thresh)
   and then (bdi_thresh >> bdi_dirty) for others.

3) The higher max-pause/pass-good thresholds somehow leads to the bad
   swing of dirty pages.

The fix is to allow the task to slightly dirty over task_bdi_thresh, but
no way to exceed bdi_dirty and/or global dirty_thresh.

Tests show that it fixed the JBOD regression completely (both behavior
and performance), while still being able to cut down large pause times
in balance_dirty_pages() for single-disk cases.

Reported-by: Li Shaohua <shaohua.li@intel.com>
Tested-by: Li Shaohua <shaohua.li@intel.com>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-08-19 22:42:07 +08:00
Ian Campbell
f991879473 mm: make HASHED_PAGE_VIRTUAL page_address' struct page argument const.
Followup to 33dd4e0ec911 "mm: make some struct page's const" which missed the
HASHED_PAGE_VIRTUAL case.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-08-17 13:00:20 -07:00
Clemens Ladisch
f982f91516 mm: fix wrong vmap address calculations with odd NR_CPUS values
Commit db64fe02258f ("mm: rewrite vmap layer") introduced code that does
address calculations under the assumption that VMAP_BLOCK_SIZE is a
power of two.  However, this might not be true if CONFIG_NR_CPUS is not
set to a power of two.

Wrong vmap_block index/offset values could lead to memory corruption.
However, this has never been observed in practice (or never been
diagnosed correctly); what caught this was the BUG_ON in vb_alloc() that
checks for inconsistent vmap_block indices.

To fix this, ensure that VMAP_BLOCK_SIZE always is a power of two.

BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=31572
Reported-by: Pavel Kysilka <goldenfish@linuxsoft.cz>
Reported-by: Matias A. Fonzo <selk@dragora.org>
Signed-off-by: Clemens Ladisch <clemens@ladisch.de>
Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Krzysztof Helt <krzysztof.h1@poczta.fm>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: 2.6.28+ <stable@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-08-14 12:32:52 -07:00
Michal Hocko
9f50fad65b Revert "memcg: get rid of percpu_charge_mutex lock"
This reverts commit 8521fc50d433507a7cdc96bec280f9e5888a54cc.

The patch incorrectly assumes that using atomic FLUSHING_CACHED_CHARGE
bit operations is sufficient but that is not true.  Johannes Weiner has
reported a crash during parallel memory cgroup removal:

  BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
  IP: [<ffffffff81083b70>] css_is_ancestor+0x20/0x70
  Oops: 0000 [#1] PREEMPT SMP
  Pid: 19677, comm: rmdir Tainted: G        W   3.0.0-mm1-00188-gf38d32b #35 ECS MCP61M-M3/MCP61M-M3
  RIP: 0010:[<ffffffff81083b70>]  css_is_ancestor+0x20/0x70
  RSP: 0018:ffff880077b09c88  EFLAGS: 00010202
  Process rmdir (pid: 19677, threadinfo ffff880077b08000, task ffff8800781bb310)
  Call Trace:
   [<ffffffff810feba3>] mem_cgroup_same_or_subtree+0x33/0x40
   [<ffffffff810feccf>] drain_all_stock+0x11f/0x170
   [<ffffffff81103211>] mem_cgroup_force_empty+0x231/0x6d0
   [<ffffffff811036c4>] mem_cgroup_pre_destroy+0x14/0x20
   [<ffffffff81080559>] cgroup_rmdir+0xb9/0x500
   [<ffffffff81114d26>] vfs_rmdir+0x86/0xe0
   [<ffffffff81114e7b>] do_rmdir+0xfb/0x110
   [<ffffffff81114ea6>] sys_rmdir+0x16/0x20
   [<ffffffff8154d76b>] system_call_fastpath+0x16/0x1b

We are crashing because we try to dereference cached memcg when we are
checking whether we should wait for draining on the cache.  The cache is
already cleaned up, though.

There is also a theoretical chance that the cached memcg gets freed
between we test for the FLUSHING_CACHED_CHARGE and dereference it in
mem_cgroup_same_or_subtree:

        CPU0                    CPU1                         CPU2
  mem=stock->cached
  stock->cached=NULL
                              clear_bit
                                                        test_and_set_bit
  test_bit()                    ...
  <preempted>             mem_cgroup_destroy
  use after free

The percpu_charge_mutex protected from this race because sync draining
is exclusive.

It is safer to revert now and come up with a more parallel
implementation later.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Reported-by: Johannes Weiner <jweiner@redhat.com>
Acked-by: Johannes Weiner <jweiner@redhat.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-08-09 17:04:43 -07:00
Christoph Lameter
81107188f1 slub: Fix partial count comparison confusion
deactivate_slab() has the comparison if more than the minimum number of
partial pages are in the partial list wrong. An effect of this may be that
empty pages are not freed from deactivate_slab(). The result could be an
OOM due to growth of the partial slabs per node. Frees mostly occur from
__slab_free which is okay so this would only affect use cases where a lot
of switching around of per cpu slabs occur.

Switching per cpu slabs occurs with high frequency if debugging options are
enabled.

Reported-and-tested-by: Xiaotian Feng <xtfeng@gmail.com>
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-08-09 21:12:31 +03:00
Akinobu Mita
ef62fb32b7 slub: fix check_bytes() for slub debugging
The check_bytes() function is used by slub debugging.  It returns a pointer
to the first unmatching byte for a character in the given memory area.

If the character for matching byte is greater than 0x80, check_bytes()
doesn't work.  Becuase 64-bit pattern is generated as below.

	value64 = value | value << 8 | value << 16 | value << 24;
	value64 = value64 | value64 << 32;

The integer promotions are performed and sign-extended as the type of value
is u8.  The upper 32 bits of value64 is 0xffffffff in the first line, and
the second line has no effect.

This fixes the 64-bit pattern generation.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Matt Mackall <mpm@selenic.com>
Reviewed-by: Marcin Slusarz <marcin.slusarz@gmail.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-08-09 16:37:48 +03:00
Christoph Lameter
6fbabb20fa slub: Fix full list corruption if debugging is on
When a slab is freed by __slab_free() and the slab can only contain a
single object ever then it was full (and therefore not on the partial
lists but on the full list in the debug case) before we reached
slab_empty.

This caused the following full list corruption when SLUB debugging was enabled:

  [ 5913.233035] ------------[ cut here ]------------
  [ 5913.233097] WARNING: at lib/list_debug.c:53 __list_del_entry+0x8d/0x98()
  [ 5913.233101] Hardware name: Adamo 13
  [ 5913.233105] list_del corruption. prev->next should be ffffea000434fd20, but was ffffea0004199520
  [ 5913.233108] Modules linked in: nfs fscache fuse ebtable_nat ebtables ppdev parport_pc lp parport ipt_MASQUERADE iptable_nat nf_nat nfsd lockd nfs_acl auth_rpcgss xt_CHECKSUM sunrpc iptable_mangle bridge stp llc cpufreq_ondemand acpi_cpufreq freq_table mperf ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables rfcomm bnep arc4 iwlagn snd_hda_codec_hdmi snd_hda_codec_idt snd_hda_intel btusb mac80211 snd_hda_codec bluetooth snd_hwdep snd_seq snd_seq_device snd_pcm usb_debug dell_wmi sparse_keymap cdc_ether usbnet cdc_acm uvcvideo cdc_wdm mii cfg80211 snd_timer dell_laptop videodev dcdbas snd microcode v4l2_compat_ioctl32 soundcore joydev tg3 pcspkr snd_page_alloc iTCO_wdt i2c_i801 rfkill iTCO_vendor_support wmi virtio_net kvm_intel kvm ipv6 xts gf128mul dm_crypt i915 drm_kms_helper drm i2c_algo_bit i2c_core video [last unloaded: scsi_wait_scan]
  [ 5913.233213] Pid: 0, comm: swapper Not tainted 3.0.0+ #127
  [ 5913.233213] Call Trace:
  [ 5913.233213]  <IRQ>  [<ffffffff8105df18>] warn_slowpath_common+0x83/0x9b
  [ 5913.233213]  [<ffffffff8105dfd3>] warn_slowpath_fmt+0x46/0x48
  [ 5913.233213]  [<ffffffff8127e7c1>] __list_del_entry+0x8d/0x98
  [ 5913.233213]  [<ffffffff8127e7da>] list_del+0xe/0x2d
  [ 5913.233213]  [<ffffffff814e0430>] __slab_free+0x1db/0x235
  [ 5913.233213]  [<ffffffff811706ab>] ? bvec_free_bs+0x35/0x37
  [ 5913.233213]  [<ffffffff811706ab>] ? bvec_free_bs+0x35/0x37
  [ 5913.233213]  [<ffffffff811706ab>] ? bvec_free_bs+0x35/0x37
  [ 5913.233213]  [<ffffffff81133085>] kmem_cache_free+0x88/0x102
  [ 5913.233213]  [<ffffffff811706ab>] bvec_free_bs+0x35/0x37
  [ 5913.233213]  [<ffffffff811706e1>] bio_free+0x34/0x64
  [ 5913.233213]  [<ffffffff813dc390>] dm_bio_destructor+0x12/0x14
  [ 5913.233213]  [<ffffffff8116fef6>] bio_put+0x2b/0x2d
  [ 5913.233213]  [<ffffffff813dccab>] clone_endio+0x9e/0xb4
  [ 5913.233213]  [<ffffffff8116f7dd>] bio_endio+0x2d/0x2f
  [ 5913.233213]  [<ffffffffa00148da>] crypt_dec_pending+0x5c/0x8b [dm_crypt]
  [ 5913.233213]  [<ffffffffa00150a9>] crypt_endio+0x78/0x81 [dm_crypt]

[ Full discussion here: https://lkml.org/lkml/2011/8/4/375 ]

Make sure that we remove such a slab also from the full lists.

Reported-and-tested-by: Dave Jones <davej@redhat.com>
Reported-and-tested-by: Xiaotian Feng <xtfeng@gmail.com>
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-08-09 16:36:02 +03:00
James Morris
5a2f3a02ae Merge branch 'next-evm' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/ima-2.6 into next
Conflicts:
	fs/attr.c

Resolve conflict manually.

Signed-off-by: James Morris <jmorris@namei.org>
2011-08-09 10:31:03 +10:00
Linus Torvalds
f03683b8fb Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  slab, lockdep: Annotate the locks before using them
  lockdep: Clear whole lockdep_map on initialization
  slab, lockdep: Annotate slab -> rcu -> debug_object -> slab
  lockdep: Fix up warning
  lockdep: Fix trace_hardirqs_on_caller()
  futex: Fix regression with read only mappings
2011-08-04 16:44:04 -10:00
Peter Zijlstra
30765b92ad slab, lockdep: Annotate the locks before using them
Fernando found we hit the regular OFF_SLAB 'recursion' before we
annotate the locks, cure this.

The relevant portion of the stack-trace:

> [    0.000000]  [<c085e24f>] rt_spin_lock+0x50/0x56
> [    0.000000]  [<c04fb406>] __cache_free+0x43/0xc3
> [    0.000000]  [<c04fb23f>] kmem_cache_free+0x6c/0xdc
> [    0.000000]  [<c04fb2fe>] slab_destroy+0x4f/0x53
> [    0.000000]  [<c04fb396>] free_block+0x94/0xc1
> [    0.000000]  [<c04fc551>] do_tune_cpucache+0x10b/0x2bb
> [    0.000000]  [<c04fc8dc>] enable_cpucache+0x7b/0xa7
> [    0.000000]  [<c0bd9d3c>] kmem_cache_init_late+0x1f/0x61
> [    0.000000]  [<c0bba687>] start_kernel+0x24c/0x363
> [    0.000000]  [<c0bba0ba>] i386_start_kernel+0xa9/0xaf

Reported-by: Fernando Lopez-Lezcano <nando@ccrma.Stanford.EDU>
Acked-by: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1311888176.2617.379.camel@laptop
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-08-04 10:18:00 +02:00
Peter Zijlstra
83835b3d9a slab, lockdep: Annotate slab -> rcu -> debug_object -> slab
Lockdep thinks there's lock recursion through:

	kmem_cache_free()
	  cache_flusharray()
	    spin_lock(&l3->list_lock)  <----------------.
	    free_block()                                |
	      slab_destroy()                            |
		call_rcu()                              |
		  debug_object_activate()               |
		    debug_object_init()                 |
		      __debug_object_init()             |
			kmem_cache_alloc()              |
			  cache_alloc_refill()          |
			    spin_lock(&l3->list_lock) --'

Now debug objects doesn't use SLAB_DESTROY_BY_RCU and hence there is no
actual possibility of recursing. Luckily debug objects marks it slab
with SLAB_DEBUG_OBJECTS so we can identify the thing.

Mark all SLAB_DEBUG_OBJECTS (all one!) slab caches with a special
lockdep key so that lockdep sees its a different cachep.

Also add a WARN on trying to create a SLAB_DESTROY_BY_RCU |
SLAB_DEBUG_OBJECTS cache, to avoid possible future trouble.

Reported-and-tested-by: Sebastian Siewior <sebastian@breakpoint.cc>
[ fixes to the initial patch ]
Reported-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1311341165.27400.58.camel@twins
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-08-04 10:17:54 +02:00
Linus Torvalds
c0c770e610 Merge branch 'apei-release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6
* 'apei-release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6:
  ACPI, APEI, EINJ Param support is disabled by default
  APEI GHES: 32-bit buildfix
  ACPI: APEI build fix
  ACPI, APEI, GHES: Add hardware memory error recovery support
  HWPoison: add memory_failure_queue()
  ACPI, APEI, GHES, Error records content based throttle
  ACPI, APEI, GHES, printk support for recoverable error via NMI
  lib, Make gen_pool memory allocator lockless
  lib, Add lock-less NULL terminated single list
  Add Kconfig option ARCH_HAVE_NMI_SAFE_CMPXCHG
  ACPI, APEI, Add WHEA _OSC support
  ACPI, APEI, Add APEI bit support in generic _OSC call
  ACPI, APEI, GHES, Support disable GHES at boot time
  ACPI, APEI, GHES, Prevent GHES to be built as module
  ACPI, APEI, Use apei_exec_run_optional in APEI EINJ and ERST
  ACPI, APEI, Add apei_exec_run_optional
  ACPI, APEI, GHES, Do not ratelimit fatal error printk before panic
  ACPI, APEI, ERST, Fix erst-dbg long record reading issue
  ACPI, APEI, ERST, Prevent erst_dbg from loading if ERST is disabled
2011-08-03 21:53:27 -10:00
Hugh Dickins
8079b1c859 mm: clarify the radix_tree exceptional cases
Make the radix_tree exceptional cases, mostly in filemap.c, clearer.

It's hard to devise a suitable snappy name that illuminates the use by
shmem/tmpfs for swap, while keeping filemap/pagecache/radix_tree
generality.  And akpm points out that /* radix_tree_deref_retry(page) */
comments look like calls that have been commented out for unknown
reason.

Skirt the naming difficulty by rearranging these blocks to handle the
transient radix_tree_deref_retry(page) case first; then just explain the
remaining shmem/tmpfs swap case in a comment.

Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-08-03 14:25:24 -10:00
Hugh Dickins
e504f3fdd6 tmpfs radix_tree: locate_item to speed up swapoff
We have already acknowledged that swapoff of a tmpfs file is slower than
it was before conversion to the generic radix_tree: a little slower
there will be acceptable, if the hotter paths are faster.

But it was a shock to find swapoff of a 500MB file 20 times slower on my
laptop, taking 10 minutes; and at that rate it significantly slows down
my testing.

Now, most of that turned out to be overhead from PROVE_LOCKING and
PROVE_RCU: without those it was only 4 times slower than before; and
more realistic tests on other machines don't fare as badly.

I've tried a number of things to improve it, including tagging the swap
entries, then doing lookup by tag: I'd expected that to halve the time,
but in practice it's erratic, and often counter-productive.

The only change I've so far found to make a consistent improvement, is
to short-circuit the way we go back and forth, gang lookup packing
entries into the array supplied, then shmem scanning that array for the
target entry.  Scanning in place doubles the speed, so it's now only
twice as slow as before (or three times slower when the PROVEs are on).

So, add radix_tree_locate_item() as an expedient, once-off,
single-caller hack to do the lookup directly in place.  #ifdef it on
CONFIG_SHMEM and CONFIG_SWAP, as much to document its limited
applicability as save space in other configurations.  And, sadly,
#include sched.h for cond_resched().

Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-08-03 14:25:24 -10:00
Hugh Dickins
31475dd611 mm: a few small updates for radix-swap
Remove PageSwapBacked (!page_is_file_cache) cases from
add_to_page_cache_locked() and add_to_page_cache_lru(): those pages now
go through shmem_add_to_page_cache().

Remove a comment on maximum tmpfs size from fsstack_copy_inode_size(),
and add a comment on swap entries to invalidate_mapping_pages().

And mincore_page() uses find_get_page() on what might be shmem or a
tmpfs file: allow for a radix_tree_exceptional_entry(), and proceed to
find_get_page() on swapper_space if so (oh, swapper_space needs #ifdef).

Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-08-03 14:25:24 -10:00
Hugh Dickins
69f07ec938 tmpfs: use kmemdup for short symlinks
But we've not yet removed the old swp_entry_t i_direct[16] from
shmem_inode_info.  That's because it was still being shared with the
inline symlink.  Remove it now (saving 64 or 128 bytes from shmem inode
size), and use kmemdup() for short symlinks, say, those up to 128 bytes.

I wonder why mpol_free_shared_policy() is done in shmem_destroy_inode()
rather than shmem_evict_inode(), where we usually do such freeing? I
guess it doesn't matter, and I'm not into NUMA mpol testing right now.

Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-08-03 14:25:24 -10:00
Hugh Dickins
6922c0c7ab tmpfs: convert shmem_writepage and enable swap
Convert shmem_writepage() to use shmem_delete_from_page_cache() to use
shmem_radix_tree_replace() to substitute swap entry for page pointer
atomically in the radix tree.

As with shmem_add_to_page_cache(), it's not entirely satisfactory to be
copying such code from delete_from_swap_cache, but again judged easier
to sell than making its other callers go through the extras.

Remove the toy implementation's shmem_put_swap() and shmem_get_swap(),
now unreferenced, and the hack to disable swap: it's now good to go.

The way things have worked out, info->lock no longer helps to guard the
shmem_swaplist: we increment swapped under shmem_swaplist_mutex only.
That global mutex exclusion between shmem_writepage() and shmem_unuse()
is not pretty, and we ought to find another way; but it's been forced on
us by recent race discoveries, not a consequence of this patchset.

And what has become of the WARN_ON_ONCE(1) free_swap_and_cache() if a
swap entry was found already present? That's no longer possible, the
(unknown) one inserting this page into filecache would hit the swap
entry occupying that slot.

Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-08-03 14:25:24 -10:00
Hugh Dickins
aa3b189551 tmpfs: convert mem_cgroup shmem to radix-swap
Remove mem_cgroup_shmem_charge_fallback(): it was only required when we
had to move swappage to filecache with GFP_NOWAIT.

Remove the GFP_NOWAIT special case from mem_cgroup_cache_charge(), by
moving its call out from shmem_add_to_page_cache() to two of thats three
callers.  But leave it doing mem_cgroup_uncharge_cache_page() on error:
although asymmetrical, it's easier for all 3 callers to handle.

These two changes would also be appropriate if anyone were to start
using shmem_read_mapping_page_gfp() with GFP_NOWAIT.

Remove mem_cgroup_get_shmem_target(): mc_handle_file_pte() can test
radix_tree_exceptional_entry() to get what it needs for itself.

Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-08-03 14:25:24 -10:00
Hugh Dickins
54af604218 tmpfs: convert shmem_getpage_gfp to radix-swap
Convert shmem_getpage_gfp(), the engine-room of shmem, to expect page or
swap entry returned from radix tree by find_lock_page().

Whereas the repetitive old method proceeded mainly under info->lock,
dropping and repeating whenever one of the conditions needed was not
met, now we can proceed without it, leaving shmem_add_to_page_cache() to
check for a race.

This way there is no need to preallocate a page, no need for an early
radix_tree_preload(), no need for mem_cgroup_shmem_charge_fallback().

Move the error unwinding down to the bottom instead of repeating it
throughout.  ENOSPC handling is a little different from before: there is
no longer any race between find_lock_page() and finding swap, but we can
arrive at ENOSPC before calling shmem_recalc_inode(), which might
occasionally discover freed space.

Be stricter to check i_size before returning.  info->lock is used for
little but alloced, swapped, i_blocks updates.  Move i_blocks updates
out from under the max_blocks check, so even an unlimited size=0 mount
can show accurate du.

Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-08-03 14:25:23 -10:00
Hugh Dickins
46f65ec15c tmpfs: convert shmem_unuse_inode to radix-swap
Convert shmem_unuse_inode() to use a lockless gang lookup of the radix
tree, searching for matching swap.

This is somewhat slower than the old method: because of repeated radix
tree descents, because of copying entries up, but probably most because
the old method noted and skipped once a vector page was cleared of swap.
Perhaps we can devise a use of radix tree tagging to achieve that later.

shmem_add_to_page_cache() uses shmem_radix_tree_replace() to compensate
for the lockless lookup by checking that the expected entry is in place,
under lock.  It is not very satisfactory to be copying this much from
add_to_page_cache_locked(), but I think easier to sell than insisting
that every caller of add_to_page_cache*() go through the extras.

Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-08-03 14:25:23 -10:00
Hugh Dickins
7a5d0fbb29 tmpfs: convert shmem_truncate_range to radix-swap
Disable the toy swapping implementation in shmem_writepage() - it's hard
to support two schemes at once - and convert shmem_truncate_range() to a
lockless gang lookup of swap entries along with pages, freeing both.

Since the second loop tightens its noose until all entries of either
kind have been squeezed out (and we shall make sure that there's not an
instant when neither is visible), there is no longer a need for yet
another pass below.

shmem_radix_tree_replace() compensates for the lockless lookup by
checking that the expected entry is in place, under lock, before
replacing it.  Here it just deletes, but will be used in later patches
to substitute swap entry for page or page for swap entry.

Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-08-03 14:25:23 -10:00
Hugh Dickins
bda97eab0c tmpfs: copy truncate_inode_pages_range
Bring truncate.c's code for truncate_inode_pages_range() inline into
shmem_truncate_range(), replacing its first call (there's a followup
call below, but leave that one, it will disappear next).

Don't play with it yet, apart from leaving out the cleancache flush, and
(importantly) the nrpages == 0 skip, and moving shmem_setattr()'s
partial page preparation into its partial page handling.

Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-08-03 14:25:23 -10:00
Hugh Dickins
41ffe5d5ce tmpfs: miscellaneous trivial cleanups
While it's at its least, make a number of boring nitpicky cleanups to
shmem.c, mostly for consistency of variable naming.  Things like "swap"
instead of "entry", "pgoff_t index" instead of "unsigned long idx".

And since everything else here is prefixed "shmem_", better change
init_tmpfs() to shmem_init().

Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-08-03 14:25:23 -10:00