linux-next

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git synced 2025-01-15 13:15:57 +00:00

Author	SHA1	Message	Date
Linus Torvalds	1a52bb0b68	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: ceph: ensure prealloc_blob is in place when removing xattr rbd: initialize snap_rwsem in rbd_add() ceph: enable/disable dentry complete flags via mount option vfs: export symbol d_find_any_alias() ceph: always initialize the dentry in open_root_dentry() libceph: remove useless return value for osd_client __send_request() ceph: avoid iput() while holding spinlock in ceph_dir_fsync ceph: avoid useless dget/dput in encode_fh ceph: dereference pointer after checking for NULL crush: fix force for non-root TAKE ceph: remove unnecessary d_fsdata conditional checks ceph: Use kmemdup rather than duplicating its implementation Fix up conflicts in fs/ceph/super.c (d_alloc_root() failure handling vs always initialize the dentry in open_root_dentry)	2012-01-13 10:29:21 -08:00
Christoph Hellwig	3d2b3129c2	xfs: remove the unused dm_attrs structure .. and the just as dead bhv_desc forward declaration while we're at it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-01-13 12:11:46 -06:00
Christoph Hellwig	bf322d983e	xfs: cleanup xfs_iomap_eof_align_last_fsb Replace the nasty if, else if, elseif condition with more natural C flow that expressed the logic we want here better. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-01-13 12:11:46 -06:00
Christoph Hellwig	673e8e597c	xfs: remove xfs_itruncate_data This wrapper isn't overly useful, not to say rather confusing. Around the call to xfs_itruncate_extents it does: - add tracing - add a few asserts in debug builds - conditionally update the inode size in two places - log the inode Both the tracing and the inode logging can be moved to xfs_itruncate_extents as they are useful for the attribute fork as well - in fact the attr code already does an equivalent xfs_trans_log_inode call just after calling xfs_itruncate_extents. The conditional size updates are a mess, and there was no reason to do them in two places anyway, as the first one was conditional on the inode having extents - but without extents we xfs_itruncate_extents would be a no-op and the placement wouldn't matter anyway. Instead move the size assignments and the asserts that make sense to the callers that want it. As a side effect of this clean up xfs_setattr_size by introducing variables for the old and new inode size, and moving the size updates into a common place. Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>	2012-01-13 12:11:45 -06:00
Ian Kent	8638094e95	autofs4 - fix deal with autofs4_write races I don't know how I missed this obvious mistake when I reviewed Als' patches, sorry. [ Quoting Al: Grr... Note to self: do git status and git stash show -p before git push. Nothing like "WTF? I'd fixed that braino" feeling ;-/ Al sent the same patch - it got broken in commit d668dc56631d: "autofs4: deal with autofs4_write/autofs4_write races". ] Reported-and-tested-by: Dave Airlie <airlied@redhat.com> Signed-off-by: Ian Kent <raven@themaw.net> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-01-13 08:30:49 -08:00
Artem Bityutskiy	515315a123	UBIFS: fix key printing Before commit 56e46742e846e4de167dde0e1e1071ace1c882a5 we have had locking around all printing macros and we could use static buffers for creating key strings and printing them. However, now we do not have that locking and we cannot use static buffers. This commit removes the old DBGKEY() macros and introduces few new helper macros for printing debugging messages plus a key at the end. Thankfully, all the messages are already structures in a way that the key is printed in the end. Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>	2012-01-13 12:50:42 +02:00
Artem Bityutskiy	beba006074	UBIFS: use snprintf instead of sprintf when printing keys Switch to 'snprintf()' which is more secure and reliable. This is also a preparation to the subsequent key printing fixes. Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>	2012-01-13 12:46:21 +02:00
Linus Torvalds	099469502f	Merge branch 'akpm' (aka "Andrew's patch-bomb, take two") Andrew explains: - various misc stuff - Most of the rest of MM: memcg, threaded hugepages, others. - cpumask - kexec - kdump - some direct-io performance tweaking - radix-tree optimisations - new selftests code A note on this: often people will develop a new userspace-visible feature and will develop userspace code to exercise/test that feature. Then they merge the patch and the selftest code dies. Sometimes we paste it into the changelog. Sometimes the code gets thrown into Documentation/(!). This saddens me. So this patch creates a bare-bones framework which will henceforth allow me to ask people to include their test apps in the kernel tree so we can keep them alive. Then when people enhance or fix the feature, I can ask them to update the test app too. The infrastruture is terribly trivial at present - let's see how it evolves. - checkpoint/restart feature work. A note on this: this is a project by various mad Russians to perform c/r mainly from userspace, with various oddball helper code added into the kernel where the need is demonstrated. So rather than some large central lump of code, what we have is little bits and pieces popping up in various places which either expose something new or which permit something which is normally kernel-private to be modified. The overall project is an ongoing thing. I've judged that the size and scope of the thing means that we're more likely to be successful with it if we integrate the support into mainline piecemeal rather than allowing it all to develop out-of-tree. However I'm less confident than the developers that it will all eventually work! So what I'm asking them to do is to wrap each piece of new code inside CONFIG_CHECKPOINT_RESTORE. So if it all eventually comes to tears and the project as a whole fails, it should be a simple matter to go through and delete all trace of it. This lot pretty much wraps up the -rc1 merge for me. * akpm: (96 commits) unlzo: fix input buffer free ramoops: update parameters only after successful init ramoops: fix use of rounddown_pow_of_two() c/r: prctl: add PR_SET_MM codes to set up mm_struct entries c/r: procfs: add start_data, end_data, start_brk members to /proc/$pid/stat v4 c/r: introduce CHECKPOINT_RESTORE symbol selftests: new x86 breakpoints selftest selftests: new very basic kernel selftests directory radix_tree: take radix_tree_path off stack radix_tree: remove radix_tree_indirect_to_ptr() dio: optimize cache misses in the submission path vfs: cache request_queue in struct block_device fs/direct-io.c: calculate fs_count correctly in get_more_blocks() drivers/parport/parport_pc.c: fix warnings panic: don't print redundant backtraces on oops sysctl: add the kernel.ns_last_pid control kdump: add udev events for memory online/offline include/linux/crash_dump.h needs elf.h kdump: fix crash_kexec()/smp_send_stop() race in panic() kdump: crashk_res init check for /sys/kernel/kexec_crash_size ...	2012-01-12 20:42:54 -08:00
Cyrill Gorcunov	b3f7f573a2	c/r: procfs: add start_data, end_data, start_brk members to /proc/$pid/stat v4 The mm->start_code/end_code, mm->start_data/end_data, mm->start_brk are involved into calculation of program text/data segment sizes (which might be seen in /proc/<pid>/statm) and into brk() call final address. For restore we need to know all these values. While mm->start_code/end_code already present in /proc/$pid/stat, the rest members are not, so this patch brings them in. The restore procedure of these members is addressed in another patch using prctl(). Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Andrew Vagin <avagin@openvz.org> Cc: Vasiliy Kulikov <segoon@openwall.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-01-12 20:13:13 -08:00
Andi Kleen	65dd2aa90a	dio: optimize cache misses in the submission path Some investigation of a transaction processing workload showed that a major consumer of cycles in __blockdev_direct_IO is the cache miss while accessing the block size. This is because it has to walk the chain from block_dev to gendisk to queue. The block size is needed early on to check alignment and sizes. It's only done if the check for the inode block size fails. But the costly block device state is unconditionally fetched. - Reorganize the code to only fetch block dev state when actually needed. Then do a prefetch on the block dev early on in the direct IO path. This is worth it, because there is substantial code run before we actually touch the block dev now. - I also added some unlikelies to make it clear the compiler that block device fetch code is not normally executed. This gave a small, but measurable improvement on a large database benchmark (about 0.3%) [akpm@linux-foundation.org: coding-style fixes] [sfr@canb.auug.org.au: using prefetch requires including prefetch.h] Signed-off-by: Andi Kleen <ak@linux.intel.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-01-12 20:13:12 -08:00
Andi Kleen	87192a2a49	vfs: cache request_queue in struct block_device This makes it possible to get from the inode to the request_queue with one less cache miss. Used in followon optimization. The livetime of the pointer is the same as the gendisk. This assumes that the queue will always stay the same in the gendisk while it's visible to block_devices. I think that's safe correct? Signed-off-by: Andi Kleen <ak@linux.intel.com> Acked-by: Jeff Moyer <jmoyer@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-01-12 20:13:12 -08:00
Tao Ma	ae55e1aaa7	fs/direct-io.c: calculate fs_count correctly in get_more_blocks() In get_more_blocks(), we use dio_count to calcuate fs_count and do some tricky things to increase fs_count if dio_count isn't aligned. But actually it still has some corner cases that can't be coverd. See the following example: dio_write foo -s 1024 -w 4096 (direct write 4096 bytes at offset 1024). The same goes if the offset isn't aligned to fs_blocksize. In this case, the old calculation counts fs_count to be 1, but actually we will write into 2 different blocks (if fs_blocksize=4096). The old code just works, since it will call get_block twice (and may have to allocate and create extents twice for filesystems like ext4). So we'd better call get_block just once with the proper fs_count. Signed-off-by: Tao Ma <boyu.mt@taobao.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Christoph Hellwig <hch@lst.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-01-12 20:13:12 -08:00
Mel Gorman	a6bc32b899	mm: compaction: introduce sync-light migration for use by compaction This patch adds a lightweight sync migrate operation MIGRATE_SYNC_LIGHT mode that avoids writing back pages to backing storage. Async compaction maps to MIGRATE_ASYNC while sync compaction maps to MIGRATE_SYNC_LIGHT. For other migrate_pages users such as memory hotplug, MIGRATE_SYNC is used. This avoids sync compaction stalling for an excessive length of time, particularly when copying files to a USB stick where there might be a large number of dirty pages backed by a filesystem that does not support ->writepages. [aarcange@redhat.com: This patch is heavily based on Andrea's work] [akpm@linux-foundation.org: fix fs/nfs/write.c build] [akpm@linux-foundation.org: fix fs/btrfs/disk-io.c build] Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Dave Jones <davej@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: Andy Isaacson <adi@hexapodia.org> Cc: Nai Xia <nai.xia@gmail.com> Cc: Johannes Weiner <jweiner@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-01-12 20:13:09 -08:00
Mel Gorman	b969c4ab9f	mm: compaction: determine if dirty pages can be migrated without blocking within ->migratepage Asynchronous compaction is used when allocating transparent hugepages to avoid blocking for long periods of time. Due to reports of stalling, there was a debate on disabling synchronous compaction but this severely impacted allocation success rates. Part of the reason was that many dirty pages are skipped in asynchronous compaction by the following check; if (PageDirty(page) && !sync && mapping->a_ops->migratepage != migrate_page) rc = -EBUSY; This skips over all mapping aops using buffer_migrate_page() even though it is possible to migrate some of these pages without blocking. This patch updates the ->migratepage callback with a "sync" parameter. It is the responsibility of the callback to fail gracefully if migration would block. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Dave Jones <davej@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: Andy Isaacson <adi@hexapodia.org> Cc: Nai Xia <nai.xia@gmail.com> Cc: Johannes Weiner <jweiner@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-01-12 20:13:09 -08:00
Jason Baron	28d82dc1c4	epoll: limit paths The current epoll code can be tickled to run basically indefinitely in both loop detection path check (on ep_insert()), and in the wakeup paths. The programs that tickle this behavior set up deeply linked networks of epoll file descriptors that cause the epoll algorithms to traverse them indefinitely. A couple of these sample programs have been previously posted in this thread: https://lkml.org/lkml/2011/2/25/297. To fix the loop detection path check algorithms, I simply keep track of the epoll nodes that have been already visited. Thus, the loop detection becomes proportional to the number of epoll file descriptor and links. This dramatically decreases the run-time of the loop check algorithm. In one diabolical case I tried it reduced the run-time from 15 mintues (all in kernel time) to .3 seconds. Fixing the wakeup paths could be done at wakeup time in a similar manner by keeping track of nodes that have already been visited, but the complexity is harder, since there can be multiple wakeups on different cpus...Thus, I've opted to limit the number of possible wakeup paths when the paths are created. This is accomplished, by noting that the end file descriptor points that are found during the loop detection pass (from the newly added link), are actually the sources for wakeup events. I keep a list of these file descriptors and limit the number and length of these paths that emanate from these 'source file descriptors'. In the current implemetation I allow 1000 paths of length 1, 500 of length 2, 100 of length 3, 50 of length 4 and 10 of length 5. Note that it is sufficient to check the 'source file descriptors' reachable from the newly added link, since no other 'source file descriptors' will have newly added links. This allows us to check only the wakeup paths that may have gotten too long, and not re-check all possible wakeup paths on the system. In terms of the path limit selection, I think its first worth noting that the most common case for epoll, is probably the model where you have 1 epoll file descriptor that is monitoring n number of 'source file descriptors'. In this case, each 'source file descriptor' has a 1 path of length 1. Thus, I believe that the limits I'm proposing are quite reasonable and in fact may be too generous. Thus, I'm hoping that the proposed limits will not prevent any workloads that currently work to fail. In terms of locking, I have extended the use of the 'epmutex' to all epoll_ctl add and remove operations. Currently its only used in a subset of the add paths. I need to hold the epmutex, so that we can correctly traverse a coherent graph, to check the number of paths. I believe that this additional locking is probably ok, since its in the setup/teardown paths, and doesn't affect the running paths, but it certainly is going to add some extra overhead. Also, worth noting is that the epmuex was recently added to the ep_ctl add operations in the initial path loop detection code using the argument that it was not on a critical path. Another thing to note here, is the length of epoll chains that is allowed. Currently, eventpoll.c defines: /* Maximum number of nesting allowed inside epoll sets */ #define EP_MAX_NESTS 4 This basically means that I am limited to a graph depth of 5 (EP_MAX_NESTS + 1). However, this limit is currently only enforced during the loop check detection code, and only when the epoll file descriptors are added in a certain order. Thus, this limit is currently easily bypassed. The newly added check for wakeup paths, stricly limits the wakeup paths to a length of 5, regardless of the order in which ep's are linked together. Thus, a side-effect of the new code is a more consistent enforcement of the graph depth. Thus far, I've tested this, using the sample programs previously mentioned, which now either return quickly or return -EINVAL. I've also testing using the piptest.c epoll tester, which showed no difference in performance. I've also created a number of different epoll networks and tested that they behave as expectded. I believe this solves the original diabolical test cases, while still preserving the sane epoll nesting. Signed-off-by: Jason Baron <jbaron@redhat.com> Cc: Nelson Elhage <nelhage@ksplice.com> Cc: Davide Libenzi <davidel@xmailserver.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-01-12 20:13:04 -08:00
Sasha Levin	2ccd4f4d47	pipe: fail cleanly when root tries F_SETPIPE_SZ with big size When a user with the CAP_SYS_RESOURCE cap tries to F_SETPIPE_SZ a pipe with size bigger than kmalloc() can alloc it spits out an ugly warning: ------------[ cut here ]------------ WARNING: at mm/page_alloc.c:2095 __alloc_pages_nodemask+0x5d3/0x7a0() Pid: 733, comm: a.out Not tainted 3.2.0-rc1+ #4 Call Trace: warn_slowpath_common+0x75/0xb0 warn_slowpath_null+0x15/0x20 __alloc_pages_nodemask+0x5d3/0x7a0 __get_free_pages+0x12/0x50 __kmalloc+0x12b/0x150 pipe_set_size+0x75/0x120 pipe_fcntl+0xf8/0x140 do_fcntl+0x2d4/0x410 sys_fcntl+0x66/0xa0 system_call_fastpath+0x16/0x1b ---[ end trace 432f702e6db7b5ee ]--- Instead, make kcalloc() handle the overflow case and fail quietly. [akpm@linux-foundation.org: switch to sizeof(*bufs) for 80-column niceness] Signed-off-by: Sasha Levin <levinsasha928@gmail.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Acked-by: Pekka Enberg <penberg@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-01-12 20:13:04 -08:00
Xiaotian Feng	a2ef990ab5	proc: fix null pointer deref in proc_pid_permission() get_proc_task() can fail to search the task and return NULL, put_task_struct() will then bomb the kernel with following oops: BUG: unable to handle kernel NULL pointer dereference at 0000000000000010 IP: [<ffffffff81217d34>] proc_pid_permission+0x64/0xe0 PGD 112075067 PUD 112814067 PMD 0 Oops: 0002 [#1] PREEMPT SMP This is a regression introduced by commit 0499680a ("procfs: add hidepid= and gid= mount options"). The kernel should return -ESRCH if get_proc_task() failed. Signed-off-by: Xiaotian Feng <dannyfeng@tencent.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Vasiliy Kulikov <segoon@openwall.com> Cc: Stephen Wilson <wilsons@start.ca> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-01-12 20:13:02 -08:00
Rusty Russell	90ab5ee941	module_param: make bool parameters really bool (drivers & misc) module_param(bool) used to counter-intuitively take an int. In fddd5201 (mid-2009) we allowed bool or int/unsigned int using a messy trick. It's time to remove the int/unsigned int option. For this version it'll simply give a warning, but it'll break next kernel version. Acked-by: Mauro Carvalho Chehab <mchehab@redhat.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2012-01-13 09:32:20 +10:30
Rusty Russell	69116f279a	module_param: avoid bool abuse, add bint for special cases. For historical reasons, we allow module_param(bool) to take an int (or an unsigned int). That's going away. A few drivers really want an int: they set it to -1 and a parameter will set it to 0 or 1. This sucks: reading them from sysfs will give 'Y' for both -1 and 1, but if we change it to an int, then the users might be broken (if they did "param" instead of "param=1"). Use a new 'bint' parser for them. (ntfs has a different problem: it needs an int for debug_msgs because it's also exposed via sysctl.) Cc: Steve Glendinning <steve.glendinning@smsc.com> Cc: Jean Delvare <khali@linux-fr.org> Cc: Guenter Roeck <guenter.roeck@ericsson.com> Cc: Hoang-Nam Nguyen <hnguyen@de.ibm.com> Cc: Christoph Raisch <raisch@de.ibm.com> Cc: Roland Dreier <roland@kernel.org> Cc: Sean Hefty <sean.hefty@intel.com> Cc: Hal Rosenstock <hal.rosenstock@gmail.com> Cc: linux390@de.ibm.com Cc: Anton Altaparmakov <anton@tuxera.com> Cc: Jaroslav Kysela <perex@perex.cz> Cc: Takashi Iwai <tiwai@suse.de> Cc: lm-sensors@lm-sensors.org Cc: linux-rdma@vger.kernel.org Cc: linux-s390@vger.kernel.org Cc: linux-ntfs-dev@lists.sourceforge.net Cc: alsa-devel@alsa-project.org Acked-by: Takashi Iwai <tiwai@suse.de> (For the sound part) Acked-by: Guenter Roeck <guenter.roeck@ericsson.com> (For the hwmon driver) Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2012-01-13 09:32:17 +10:30
Peng Tao	7c5465d6cc	pnfsblock: alloc short extent before submit bio As discussed earlier, it is better for block client to allocate memory for tracking extents state before submitting bio. So the patch does it by allocating a short_extent for every INVALID extent touched by write pagelist and for every zeroing page we created, saving them in layout header. Then in end_io we can just use them to create commit list items and avoid memory allocation there. Signed-off-by: Peng Tao <peng_tao@emc.com> Signed-off-by: Benny Halevy <bhalevy@tonian.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-01-12 16:52:10 -05:00
Peng Tao	c0411a94a8	pnfsblock: remove rpc_call_ops from struct parallel_io block layout can just make use of generic read/write_done. Signed-off-by: Peng Tao <peng_tao@emc.com> Signed-off-by: Benny Halevy <bhalevy@tonian.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-01-12 16:52:10 -05:00
Peng Tao	72c5088799	pnfsblock: move find lock page logic out of bl_write_pagelist Also avoid unnecessary lock_page if page is handled by others. Signed-off-by: Peng Tao <peng_tao@emc.com> Signed-off-by: Benny Halevy <bhalevy@tonian.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-01-12 16:52:10 -05:00
Peng Tao	60c52e3a72	pnfsblock: cleanup bl_mark_sectors_init It does not need to manipulate on partial initialized blocks. Writeback code takes care of it. Signed-off-by: Peng Tao <peng_tao@emc.com> Signed-off-by: Benny Halevy <bhalevy@tonian.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-01-12 16:52:09 -05:00
Peng Tao	74a6eeb44c	pnfsblock: limit bio page count One bio can have at most BIO_MAX_PAGES pages. We should limit it bec otherwise bio_alloc will fail when there are many pages in one read/write_pagelist. Cc: <stable@vger.kernel.org> #3.1+ Signed-off-by: Peng Tao <peng_tao@emc.com> Signed-off-by: Benny Halevy <bhalevy@tonian.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-01-12 16:39:05 -05:00
Peng Tao	93a3844ee0	pnfsblock: don't spinlock when freeing block_dev bl_free_block_dev() may sleep. We can not call it with spinlock held. Besides, there is no need to take bm_lock as we are last user freeing bm_devlist. Cc: <stable@vger.kernel.org> #3.1+ Signed-off-by: Peng Tao <peng_tao@emc.com> Signed-off-by: Benny Halevy <bhalevy@tonian.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-01-12 16:39:04 -05:00
Peng Tao	57582b372f	pnfsblock: clean up _add_entry It is wrong to kmalloc in _add_entry() as it is inside spinlock. memory should be already allocated _add_entry() is called. Signed-off-by: Peng Tao <peng_tao@emc.com> Signed-off-by: Benny Halevy <bhalevy@tonian.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-01-12 16:38:55 -05:00
Peng Tao	82b906d655	pnfsblock: set read/write tk_status to pnfs_error To pass the IO status to upper layer. Signed-off-by: Peng Tao <peng_tao@emc.com> Signed-off-by: Benny Halevy <bhalevy@tonian.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-01-12 16:38:51 -05:00
Peng Tao	39e567ae36	pnfsblock: acquire im_lock in _preload_range When calling _add_entry, we should take the im_lock to protect agains other modifiers. Cc: <stable@vger.kernel.org> #3.1+ Signed-off-by: Peng Tao <peng_tao@emc.com> Signed-off-by: Benny Halevy <bhalevy@tonian.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-01-12 16:38:49 -05:00
Peng Tao	de040beccd	NFS4: fix compile warnings in nfs4proc.c compile in nfs-for-3.3 branch shows following warnings. Fix it here. fs/nfs/nfs4proc.c: In function ‘__nfs4_get_acl_uncached’: fs/nfs/nfs4proc.c:3589: warning: format ‘%ld’ expects type ‘long int’, but argument 4 has type ‘size_t’ fs/nfs/nfs4proc.c:3589: warning: format ‘%ld’ expects type ‘long int’, but argument 6 has type ‘size_t’ Signed-off-by: Peng Tao <peng_tao@emc.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-01-12 16:31:51 -05:00
Dan Carpenter	363e0df057	nfs: check for integer overflow in decode_devicenotify_args() On 32 bit, if n is too large then "n * sizeof(*args->devs)" could overflow and args->devs would be smaller than expected. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-01-12 16:30:07 -05:00
Dan Carpenter	13fff2f35f	NFS: cleanup endian type in decode_ds_addr() port is supposed to be a __be16 here. The existing code should work fine, but this is a cleanup. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-01-12 16:30:03 -05:00
Dan Carpenter	0e0243dc35	NFS: add an endian notation This function returns a big endian value. The implementation in fs/nfs/callback_proc.c is declared with "__be32" but the .h file uses "unsigned" instead. It makes sparse complain: fs/nfs/callback_proc.c:232:8: error: symbol 'nfs4_callback_layoutrecall' redeclared with different type (originally declared at fs/nfs/callback.h:165) - different base types Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2012-01-12 16:29:51 -05:00
Linus Torvalds	6733e54b66	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: FUSE: Notifying the kernel of deletion. fuse: support ioctl on directories fuse: Use kcalloc instead of kzalloc to allocate array fuse: llseek optimize SEEK_CUR and SEEK_SET	2012-01-12 12:39:21 -08:00
Dan Carpenter	7250170c9e	cifs: integer overflow in parse_dacl() On 32 bit systems num_aces * sizeof(struct cifs_ace *) could overflow leading to a smaller ppace buffer than we expected. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Acked-by: Jeff Layton <jlayton@samba.org> Signed-off-by: Steve French <smfrench@gmail.com>	2012-01-12 13:17:36 -06:00
Alex Elder	83eb26af0d	ceph: ensure prealloc_blob is in place when removing xattr In __ceph_build_xattrs_blob(), if a ceph inode's extended attributes are marked dirty, all attributes recorded in its rb_tree index are formatted into a "blob" buffer. The target buffer is recorded in ceph_inode->i_xattrs.prealloc_blob, and it is expected to exist and be of sufficient size to hold the attributes. The extended attributes are marked dirty in two cases: when a new attribute is added to the inode; or when one is removed. In the former case work is done to ensure the prealloc_blob buffer is properly set up, but in the latter it is not. Change the logic in ceph_removexattr() so it matches what is done in ceph_setxattr(). Note that this is done in a way that keeps the two blocks of code nearly identical, in anticipation of a subsequent patch that encapsulates some of this logic into one or more helper routines. Signed-off-by: Alex Elder <elder@dreamhost.com> Signed-off-by: Sage Weil <sage@newdream.net>	2012-01-12 11:00:51 -08:00
Sage Weil	a40dc6cc2e	ceph: enable/disable dentry complete flags via mount option Enable/disable use of the dentry dir 'complete' flag via a mount option. This lets the admin control whether ceph uses the dcache to satisfy negative lookups or readdir when it has the entire directory contents in its cache. This is purely a performance optimization; correctness is guaranteed whether it is enabled or not. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sage Weil <sage@newdream.net>	2012-01-12 11:00:40 -08:00
Sage Weil	46f72b3492	vfs: export symbol d_find_any_alias() Ceph needs this. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sage Weil <sage@newdream.net>	2012-01-12 11:00:28 -08:00
Jan Kara	46fe44ce87	quota: Pass information that quota is stored in system file to userspace Quota tools need to know whether quota is stored in a system file or in classical aquota.{user\|group} files. So pass this information as a flag in GETINFO quotactl. Signed-off-by: Jan Kara <jack@suse.cz>	2012-01-12 13:09:09 +01:00
Namjae Jeon	0b4156eb27	fs: remove unneeded plug in mpage_readpages() The block plug in mpage_readpages() duplicates the one in read_pages(). Signed-off-by: Namjae Jeon <linkinjeon@gmail.com> Signed-off-by: Amit Sahrawat <amit.sahrawat83@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-01-12 09:19:54 +01:00
Alex Elder	d46cfba536	ceph: always initialize the dentry in open_root_dentry() When open_root_dentry() gets a dentry via d_obtain_alias() it does not get initialized. If the dentry obtained came from the cache, this is OK. But if not, the result is an improperly initialized dentry. To fix this, call ceph_init_dentry() regardless of which path produced the dentry. That function returns immediately for a dentry that is already initialized, it is safe to use either way. (Credit to Sage, who suggested this fix.) Signed-off-by: Alex Elder <aelder@sgi.com>	2012-01-11 16:28:25 -08:00
Artem Bityutskiy	d34315da91	UBIFS: fix debugging messages Patch 56e46742e846e4de167dde0e1e1071ace1c882a5 broke UBIFS debugging messages: before that commit when UBIFS debugging was enabled, users saw few useful debugging messages after mount. However, that patch turned 'dbg_msg()' into 'pr_debug()', so to enable the debugging messages users have to enable them first via /sys/kernel/debug/dynamic_debug/control, which is very impractical. This commit makes 'dbg_msg()' to use 'printk()' instead of 'pr_debug()', just as it was before the breakage. Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Cc: stable@kernel.org [3.0+]	2012-01-11 18:44:53 +02:00
Artem Bityutskiy	1f5d78dc48	UBIFS: make debugging messages light again We switch to dynamic debugging in commit 56e46742e846e4de167dde0e1e1071ace1c882a5 but did not take into account that now we do not control anymore whether a specific message is enabled or not. So now we lock the "dbg_lock" and release it in every debugging macro, which make them not so light-weight. This commit removes the "dbg_lock" protection from the debugging macros to fix the issue. The downside is that now our DBGKEY() stuff is broken, but this is not critical at all and will be fixed later. Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Cc: stable@kernel.org [3.0+]	2012-01-11 18:44:53 +02:00
Djalal Harouni	34b0784056	ext2: protect inode changes in the SETVERSION and SETFLAGS ioctls Unlock mutex after i_flags and i_ctime updates in the EXT2_IOC_SETFLAGS ioctl. Use i_mutex in the EXT2_IOC_SETVERSION ioctl to protect i_ctime and i_generation updates and make the ioctl consistent since i_mutex is also used in other places to protect timestamps and inode changes. Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Djalal Harouni <tixxdz@opendz.org> Signed-off-by: Jan Kara <jack@suse.cz>	2012-01-11 13:39:02 +01:00
Jan Kara	353b67d8ce	jbd: Issue cache flush after checkpointing When we reach cleanup_journal_tail(), there is no guarantee that checkpointed buffers are on a stable storage - especially if buffers were written out by log_do_checkpoint(), they are likely to be only in disk's caches. Thus when we update journal superblock, effectively removing old transaction from journal, this write of superblock can get to stable storage before those checkpointed buffers which can result in filesystem corruption after a crash. A similar problem can happen if we replay the journal and wipe it before flushing disk's caches. Thus we must unconditionally issue a cache flush before we update journal superblock in these cases. The fix is slightly complicated by the fact that we have to get log tail before we issue cache flush but we can store it in the journal superblock only after the cache flush. Otherwise we risk races where new tail is written before appropriate cache flush is finished. I managed to reproduce the corruption using somewhat tweaked Chris Mason's barrier-test scheduler. Also this should fix occasional reports of 'Bit already freed' filesystem errors which are totally unreproducible but inspection of several fs images I've gathered over time points to a problem like this. CC: stable@kernel.org Signed-off-by: Jan Kara <jack@suse.cz>	2012-01-11 13:36:57 +01:00
Steven Whitehouse	66ad863b41	GFS2: Fix nlink setting on inode creation Since the nlink count will be 0, we need to use set_nlink rather than inc_nlink in order to avoid triggering the inc_nlink warning which was added recently. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2012-01-11 12:35:05 +00:00
David Teigland	376d37788b	GFS2: fail mount if journal recovery fails If the first mounter fails to recover one of the journals during mount, the mount should fail. Signed-off-by: David Teigland <teigland@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2012-01-11 09:24:48 +00:00
David Teigland	e8ca5cc571	GFS2: let spectator mount do read only recovery Previously, a spectator mount would not even attempt to do journal recovery for a failed node. This meant that if all mounted nodes were spectators, everyone would be stuck after a node failed, all waiting for recovery to be performed. This is unnecessary since the failed node had a clean journal. Instead, allow a spectator mount to do a partial "read only" recovery, which means it will check if the failed journal is clean, and if so, report a successful recovery. If the failed journal is not clean, it reports that journal recovery failed. This makes it work the same as a read only mount on a read only block device. Signed-off-by: David Teigland <teigland@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2012-01-11 09:23:40 +00:00
Bob Peterson	49528b4e47	GFS2: Fix a use-after-free that coverity spotted In function gfs2_inplace_release it was trying to unlock a gfs2_holder structure associated with a reservation, after said reservation was freed. The problem is that the statements have the wrong order. This patch corrects the order so that the reservation is freed after the gfs2_holder is unlocked. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2012-01-11 09:23:26 +00:00
David Teigland	e0c2a9aa1e	GFS2: dlm based recovery coordination This new method of managing recovery is an alternative to the previous approach of using the userland gfs_controld. - use dlm slot numbers to assign journal id's - use dlm recovery callbacks to initiate journal recovery - use a dlm lock to determine the first node to mount fs - use a dlm lock to track journals that need recovery Signed-off-by: David Teigland <teigland@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2012-01-11 09:23:05 +00:00
Linus Torvalds	5cd9599bba	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: autofs4: deal with autofs4_write/autofs4_write races autofs4: catatonic_mode vs. notify_daemon race autofs4: autofs4_wait() vs. autofs4_catatonic_mode() race hfsplus: creation of hidden dir on mount can fail block_dev: Suppress bdev_cache_init() kmemleak warninig fix shrink_dcache_parent() livelock coda: switch coda_cnode_make() to sane API as well, clean coda_lookup() coda: deal correctly with allocation failure from coda_cnode_makectl() securityfs: fix object creation races	2012-01-10 21:46:36 -08:00

1 2 3 4 5 ...

25588 Commits