Nilfs maintains two super blocks, and selects the newer one at mount
time if they both have valid checksums and their timestamps differ.
However, this has potential for mis-selection since the system clock
may be rewound and the resolution of the timestamps is not high.
Usually this doesn't become an issue because both super blocks are
updated at the same time when the file system is unmounted. Even if
the file system wasn't unmounted cleanly, the roll-forward recovery
will find the proper log which stores the latest super root. Thus,
the issue can appear only if the update of one super block fails and the
clock happens to be rewound.
This fixes the issue by using checkpoint numbers instead of timestamps
to pick the super block storing the location of the latest log.
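
The selection rule described above can be sketched as follows. This is a
minimal illustration, not the nilfs2 code: struct demo_sb_copy, its fields
and demo_pick_super_block() are hypothetical names, and the sketch assumes
that checksum validation has already been done for each copy.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified view of one super block copy. */
struct demo_sb_copy {
	bool     valid;    /* checksum already verified */
	uint64_t wtime;    /* write time: coarse and clock-dependent */
	uint64_t last_cno; /* checkpoint number of the latest log */
};

/*
 * Return 0 or 1 for the copy to use, or -1 if neither copy is valid.
 * Only the checkpoint number is compared; wtime is deliberately ignored
 * because a rewound clock can make the older copy look newer.
 */
int demo_pick_super_block(const struct demo_sb_copy *sb0,
			  const struct demo_sb_copy *sb1)
{
	assert(sb0 && sb1);

	if (!sb0->valid && !sb1->valid)
		return -1;
	if (!sb1->valid)
		return 0;
	if (!sb0->valid)
		return 1;
	return sb1->last_cno > sb0->last_cno ? 1 : 0;
}
```

With a copy whose timestamp is later but whose checkpoint number is lower,
this sketch still picks the copy with the higher checkpoint number.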
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
This adds missing endian conversions in comparison of the magic
number of super blocks. It was coincidence that prior versions didn't
incur problems; the upper byte of the magic number happened to be
equal to the lower byte. But, semantically it's wrong to depend on
this.
This won't change anything else nor suffer any compatibility issues.
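
A small user-space sketch of why the missing conversion went unnoticed.
DEMO_MAGIC (0x3434) is an assumed illustrative value whose two bytes are
equal, and the demo_* helpers are hypothetical stand-ins for the kernel's
byte-order helpers, not the kernel API itself.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed example value: upper byte equals lower byte. */
#define DEMO_MAGIC 0x3434u

/* Swap the two bytes of a 16-bit value. */
uint16_t demo_swab16(uint16_t v)
{
	return (uint16_t)(((v & 0xffu) << 8) | (v >> 8));
}

/* Decode a little-endian 16-bit on-disk field; correct on any host. */
uint16_t demo_le16_to_cpu(const uint8_t raw[2])
{
	return (uint16_t)(raw[0] | (raw[1] << 8));
}

/* Compare the on-disk magic after converting it to host byte order. */
int demo_magic_matches(const uint8_t raw[2])
{
	return demo_le16_to_cpu(raw) == DEMO_MAGIC;
}
```

Because demo_swab16(DEMO_MAGIC) equals DEMO_MAGIC, a comparison without
conversion gives the same answer on both byte orders for this value, while
a value such as 0x1234 would not survive the swap.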
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
This kills the following sparse warnings:
fs/nilfs2/segment.c:567:28: warning: symbol 'nilfs_sc_file_ops' was not declared. Should it be static?
fs/nilfs2/segment.c:617:28: warning: symbol 'nilfs_sc_dat_ops' was not declared. Should it be static?
fs/nilfs2/segment.c:625:28: warning: symbol 'nilfs_sc_dsync_ops' was not declared. Should it be static?
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
The implementation of the persistent object allocator (alloc.c) is poorly
documented. This adds kernel-doc style comments to its functions.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
In nilfs_segctor_thread(), timer is a local variable allocated on the stack. Its
address should not be stored in sci->sc_timer and passed to several procedures.
It works now by chance, just because other procedures are called by
nilfs_segctor_thread() directly or indirectly and the stack hasn't been
deallocated yet.
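
One common remedy for this kind of lifetime problem is to embed the object
in its owner instead of pointing at a stack variable. The sketch below
assumes that approach for illustration only; struct demo_timer,
struct demo_sc_info and the demo_* functions are hypothetical user-space
stand-ins, not the nilfs2 structures.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a kernel timer. */
struct demo_timer {
	unsigned long expires;
};

/* The timer is embedded, so it lives exactly as long as its owner. */
struct demo_sc_info {
	struct demo_timer sc_timer;
};

/* Allocate a zeroed owner and set up its embedded timer. */
struct demo_sc_info *demo_sc_new(unsigned long expires)
{
	struct demo_sc_info *sci = calloc(1, sizeof(*sci));

	if (sci)
		sci->sc_timer.expires = expires;
	return sci;
}

/* Release the owner; the embedded timer goes away with it. */
void demo_sc_free(struct demo_sc_info *sci)
{
	free(sci);
}
```

Any function that receives the owner can then use &sci->sc_timer without
depending on which stack frame happens to be alive.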
Signed-off-by: Li Hong <lihong.hi@gmail.com>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
There are only two lines of code in nilfs_segctor_init(). From a logic
design view, the first line 'sci->sc_seq_done = sci->sc_seq_request;'
should be put in nilfs_segctor_new(). Even in nilfs_segctor_new(),
this initialization is needless because sci is kzalloc-ed. So
nilfs_segctor_init() is only a wrapper around
nilfs_segctor_start_thread().
Signed-off-by: Li Hong <lihong.hi@gmail.com>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
This adds a field to record the latest checkpoint number in the
nilfs_segment_summary structure. This will help to recover the latest
checkpoint number from logs on disk. This field is intended for
crucial cases in which super blocks have lost the pointer to the latest
log.
Even though this will change the disk format, both backward and
forward compatibility is preserved by a size field prepared in the
segment summary header.
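
How a size field can keep both directions compatible may be sketched as
below. The layout of struct demo_summary is hypothetical and stored in host
byte order for brevity; only the idea (read the new field only when the
recorded header size covers it) is taken from the description above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical summary header; 'cno' is the newly appended field. */
struct demo_summary {
	uint16_t bytes; /* header size recorded by the writer */
	uint16_t flags;
	uint32_t pad;
	uint64_t seq;
	uint64_t cno;   /* present only if 'bytes' covers it */
};

/* Header size written by a writer that predates the 'cno' field. */
#define DEMO_OLD_HEADER_SIZE offsetof(struct demo_summary, cno)

/*
 * Return 1 and store the checkpoint number if the recorded header size
 * includes the field, 0 if the header was written in the older format.
 */
int demo_read_cno(const void *raw, uint64_t *cno)
{
	struct demo_summary hdr;
	uint16_t bytes;

	assert(raw && cno);
	memcpy(&bytes, raw, sizeof(bytes));
	if (bytes < offsetof(struct demo_summary, cno) + sizeof(hdr.cno))
		return 0;
	memcpy(&hdr, raw, sizeof(hdr));
	*cno = hdr.cno;
	return 1;
}
```

A reader that does not know about the new field can still skip the header
by the recorded size, and a reader that does know about it falls back
cleanly when it meets a header written in the older format.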
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Printing a message after loading a file system is common practice. Add this to
provide a more user-friendly experience.
Signed-off-by: Li Hong <lihong.hi@gmail.com>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
This cleanup patch gives several improvements:
- Move all kmem_cache_{create,destroy} calls into one place, which removes
some small function calls, cleans up error check code and clarifies the logic.
- Mark all initialization code as __init.
- Remove some very obvious comments.
- Adjust some declarations.
- Fix some space-tab issues.
Signed-off-by: Li Hong <lihong.hi@gmail.com>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
This moves a pointer to buffer storing super root block to each log
buffer from nilfs_sc_info struct for simplicity.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Like ext3, nilfs has 'errors' mount option to allow specifying desired
behavior on severe errors.
Currently, the default action is 'errors=continue', which has the potential
to worsen filesystem corruption after severe errors.
This changes the default action to 'errors=remount-ro' to avoid the issue.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
nilfs_btree_release_path() and nilfs_btree_free_path() are tightly bound to
each other. Merge them into one procedure to clarify the logic and avoid
misuse.
Signed-off-by: Li Hong <lihong.hi@gmail.com>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
nilfs_btree_alloc_path() and nilfs_btree_init_path() are tightly bound to
each other. Merge them into one procedure to clarify the logic and avoid
misuse.
Signed-off-by: Li Hong <lihong.hi@gmail.com>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
* 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6:
NFS: Fix RCU issues in the NFSv4 delegation code
NFSv4: Fix the locking in nfs_inode_reclaim_delegation()
* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2:
ocfs2: Avoid a gcc warning in ocfs2_wipe_inode().
ocfs2: Avoid direct write if we fall back to buffered I/O
ocfs2_dlmfs: Fix math error when reading LVB.
ocfs2: Update VFS inode's id info after reflink.
ocfs2: potential ERR_PTR dereference on error paths
ocfs2: Add directory entry later in ocfs2_symlink() and ocfs2_mknod()
ocfs2: use OCFS2_INODE_SKIP_ORPHAN_DIR in ocfs2_mknod error path
ocfs2: use OCFS2_INODE_SKIP_ORPHAN_DIR in ocfs2_symlink error path
ocfs2: add OCFS2_INODE_SKIP_ORPHAN_DIR flag and honor it in the inode wipe code
ocfs2: Reset status if we want to restart file extension.
ocfs2: Compute metaecc for superblocks during online resize.
ocfs2: Check the owner of a lockres inside the spinlock
ocfs2: one more warning fix in ocfs2_file_aio_write(), v2
ocfs2_dlmfs: User DLM_* when decoding file open flags.
gcc warns that a variable is uninitialized. It's actually handled, but
an early return fools gcc. Let's just initialize the variable to a
garbage value that will crash if the usage is ever broken.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
ceph: remove bad auth_x kmem_cache
ceph: fix lockless caps check
ceph: clear dir complete, invalidate dentry on replayed rename
ceph: fix direct io truncate offset
ceph: discard incoming messages with bad seq #
ceph: fix seq counting for skipped messages
ceph: add missing #includes
ceph: fix leaked spinlock during mds reconnect
ceph: print more useful version info on module load
ceph: fix snap realm splits
ceph: clear dir complete on d_move
It's useless, since our allocations are already a power of 2. And it was
allocated per-instance (not globally), which caused a name collision when
we tried to mount a second file system with auth_x enabled.
Signed-off-by: Sage Weil <sage@newdream.net>
If a rename operation is resent to the MDS following an MDS restart, the
client does not get a full reply (containing the resulting metadata) back.
In that case, a ceph_rename() needs to compensate by doing anything useful
that fill_inode() would have, like d_move().
It also needs to invalidate the dentry (to work around the vfs_rename_dir()
bug) and clear the dir complete flag, just like fill_trace().
Signed-off-by: Sage Weil <sage@newdream.net>
We can get old message seq #'s after a tcp reconnect for stateful sessions
(i.e., the MDS). If we get a higher seq #, that is an error, and we
shouldn't see any bad seq #'s for stateless (mon, osd) connections.
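
The rule stated above can be sketched as a small classifier. The enum and
demo_check_seq() are hypothetical names; the sketch assumes in_seq is the
sequence number of the last message accepted on the connection and does not
model counter wraparound.

```c
#include <assert.h>
#include <stdint.h>

enum demo_seq_verdict {
	DEMO_SEQ_ACCEPT,      /* exactly the next expected message */
	DEMO_SEQ_DISCARD_OLD, /* already seen, e.g. resent after reconnect */
	DEMO_SEQ_ERROR        /* a gap: messages were skipped */
};

/* Classify an incoming message by its sequence number. */
enum demo_seq_verdict demo_check_seq(uint64_t in_seq, uint64_t msg_seq)
{
	if (msg_seq <= in_seq)
		return DEMO_SEQ_DISCARD_OLD;
	if (msg_seq > in_seq + 1)
		return DEMO_SEQ_ERROR;
	return DEMO_SEQ_ACCEPT;
}
```

For example, with in_seq at 10, message 11 is accepted, message 9 or 10 is
silently dropped, and message 12 is treated as an error.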
Signed-off-by: Sage Weil <sage@newdream.net>
The snap realm split was checking i_snap_realm, not the list_head, to
determine if an inode belonged in the new realm. The check always failed,
which meant we always moved the inode, corrupting the old realm's list and
causing various crashes.
Also wait to release old realm reference to avoid possibility of use after
free.
Signed-off-by: Sage Weil <sage@newdream.net>
d_move() reorders the d_subdirs list, breaking the readdir result caching.
Unless/until d_move preserves that ordering, clear CEPH_I_COMPLETE on
rename.
Signed-off-by: Sage Weil <sage@newdream.net>
As of 32a88aa1, __sync_filesystem() will return 0 if s_bdi is not set.
And nilfs does not set s_bdi anywhere. I noticed this problem by the
warning introduced by the recent commit 5129a469 ("Catch filesystem
lacking s_bdi").
WARNING: at fs/super.c:959 vfs_kern_mount+0xc5/0x14e()
Hardware name: PowerEdge 2850
Modules linked in: nilfs2 loop tpm_tis tpm tpm_bios video shpchp pci_hotplug output dcdbas
Pid: 3773, comm: mount.nilfs2 Not tainted 2.6.34-rc6-debug #38
Call Trace:
[<c1028422>] warn_slowpath_common+0x60/0x90
[<c102845f>] warn_slowpath_null+0xd/0x10
[<c1095936>] vfs_kern_mount+0xc5/0x14e
[<c1095a03>] do_kern_mount+0x32/0xbd
[<c10a811e>] do_mount+0x671/0x6d0
[<c1073794>] ? __get_free_pages+0x1f/0x21
[<c10a684f>] ? copy_mount_options+0x2b/0xe2
[<c107b634>] ? strndup_user+0x48/0x67
[<c10a81de>] sys_mount+0x61/0x8f
[<c100280c>] sysenter_do_call+0x12/0x32
This makes sure s_bdi is set for nilfs and fixes the silent sync failure.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Acked-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix a number of RCU issues in the NFSv4 delegation code.
(1) delegation->cred doesn't need to be RCU protected as it's essentially an
invariant refcounted structure.
By the time we get to nfs_free_delegation(), the delegation is being
released, so no one else should be attempting to use the saved
credentials, and they can be cleared.
However, since the list of delegations could still be under traversal at
this point by, for example, nfs_client_return_marked_delegations(), the cred
should be released in nfs_do_free_delegation() rather than in
nfs_free_delegation(). Simply using rcu_assign_pointer() to clear it is
insufficient as that doesn't stop the cred from being destroyed, and nor
does calling put_rpccred() after call_rcu(), given that the latter is
asynchronous.
(2) nfs_detach_delegation_locked() and nfs_inode_set_delegation() should use
rcu_dereference_protected() because they can only be called if
nfs_client::cl_lock is held, and that guards against anyone changing
nfsi->delegation under it. Furthermore, the barrier imposed by
rcu_dereference() is superfluous, given that the spin_lock() is also a
barrier.
(3) nfs_detach_delegation_locked() is now passed a pointer to the nfs_client
struct so that it can issue lockdep advice based on clp->cl_lock for (2).
(4) nfs_inode_return_delegation_noreclaim() and nfs_inode_return_delegation()
should use rcu_access_pointer() outside the spinlocked region as they
merely examine the pointer and don't follow it, thus rendering unnecessary
the need to impose a partial ordering over the one item of interest.
These result in an RCU warning like the following:
[ INFO: suspicious rcu_dereference_check() usage. ]
---------------------------------------------------
fs/nfs/delegation.c:332 invoked rcu_dereference_check() without protection!
other info that might help us debug this:
rcu_scheduler_active = 1, debug_locks = 0
2 locks held by mount.nfs4/2281:
#0: (&type->s_umount_key#34){+.+...}, at: [<ffffffff810b25b4>] deactivate_super+0x60/0x80
#1: (iprune_sem){+.+...}, at: [<ffffffff810c332a>] invalidate_inodes+0x39/0x13a
stack backtrace:
Pid: 2281, comm: mount.nfs4 Not tainted 2.6.34-rc1-cachefs #110
Call Trace:
[<ffffffff8105149f>] lockdep_rcu_dereference+0xaa/0xb2
[<ffffffffa00b4591>] nfs_inode_return_delegation_noreclaim+0x5b/0xa0 [nfs]
[<ffffffffa0095d63>] nfs4_clear_inode+0x11/0x1e [nfs]
[<ffffffff810c2d92>] clear_inode+0x9e/0xf8
[<ffffffff810c3028>] dispose_list+0x67/0x10e
[<ffffffff810c340d>] invalidate_inodes+0x11c/0x13a
[<ffffffff810b1dc1>] generic_shutdown_super+0x42/0xf4
[<ffffffff810b1ebe>] kill_anon_super+0x11/0x4f
[<ffffffffa009893c>] nfs4_kill_super+0x3f/0x72 [nfs]
[<ffffffff810b25bc>] deactivate_super+0x68/0x80
[<ffffffff810c6744>] mntput_no_expire+0xbb/0xf8
[<ffffffff810c681b>] release_mounts+0x9a/0xb0
[<ffffffff810c689b>] put_mnt_ns+0x6a/0x79
[<ffffffffa00983a1>] nfs_follow_remote_path+0x5a/0x146 [nfs]
[<ffffffffa0098334>] ? nfs_do_root_mount+0x82/0x95 [nfs]
[<ffffffffa00985a9>] nfs4_try_mount+0x75/0xaf [nfs]
[<ffffffffa0098874>] nfs4_get_sb+0x291/0x31a [nfs]
[<ffffffff810b2059>] vfs_kern_mount+0xb8/0x177
[<ffffffff810b2176>] do_kern_mount+0x48/0xe8
[<ffffffff810c810b>] do_mount+0x782/0x7f9
[<ffffffff810c8205>] sys_mount+0x83/0xbe
[<ffffffff81001eeb>] system_call_fastpath+0x16/0x1b
Also on:
fs/nfs/delegation.c:215 invoked rcu_dereference_check() without protection!
[<ffffffff8105149f>] lockdep_rcu_dereference+0xaa/0xb2
[<ffffffffa00b4223>] nfs_inode_set_delegation+0xfe/0x219 [nfs]
[<ffffffffa00a9c6f>] nfs4_opendata_to_nfs4_state+0x2c2/0x30d [nfs]
[<ffffffffa00aa15d>] nfs4_do_open+0x2a6/0x3a6 [nfs]
...
And:
fs/nfs/delegation.c:40 invoked rcu_dereference_check() without protection!
[<ffffffff8105149f>] lockdep_rcu_dereference+0xaa/0xb2
[<ffffffffa00b3bef>] nfs_free_delegation+0x3d/0x6e [nfs]
[<ffffffffa00b3e71>] nfs_do_return_delegation+0x26/0x30 [nfs]
[<ffffffffa00b406a>] __nfs_inode_return_delegation+0x1ef/0x1fe [nfs]
[<ffffffffa00b448a>] nfs_client_return_marked_delegations+0xc9/0x124 [nfs]
...
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Ensure that we correctly rcu-dereference the delegation itself, and that we
protect against removal while we're changing the contents.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
When we fall back to buffered write from direct write, we call
__generic_file_aio_write(), but that will end up doing a direct write
even though we are only prepared to do a buffered write, because the file
has the O_DIRECT flag set. This is a fix for
https://bugzilla.novell.com/show_bug.cgi?id=591039
revised with Joel's comments.
Signed-off-by: Li Dongyang <lidongyang@novell.com>
Acked-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
CONFIG_INOTIFY_USER defined but CONFIG_ANON_INODES undefined will result
in the following build failure:
LD vmlinux
fs/built-in.o: In function 'sys_inotify_init1':
(.text.sys_inotify_init1+0x22c): undefined reference to 'anon_inode_getfd'
fs/built-in.o: In function `sys_inotify_init1':
(.text.sys_inotify_init1+0x22c): relocation truncated to fit: R_MIPS_26 against 'anon_inode_getfd'
make[2]: *** [vmlinux] Error 1
make[1]: *** [sub-make] Error 2
make: *** [all] Error 2
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
exofs: Fix "add bdi backing to mount session" fall out
fs: fs/super.c needs to include backing-dev.h for !CONFIG_BLOCK
On low memory boxes or those with highmem, kernel can OOM before the
background reclaims inodes via xfssyncd. Add a shrinker to run inode
reclaim so that inode reclaim is expedited when memory is low.
This is more complex than it needs to be because the VM folk don't
want a context added to the shrinker infrastructure. Hence we need
to add a global list of XFS mount structures so the shrinker can
traverse them.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
The patch "add bdi backing to mount session"
(b3d0ab7e60d1865bb6f6a79a77aaba22f2543236)
has a bug in the placement of the bdi member at
struct exofs_sb_info. The layout member must be kept
last.
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
When CONFIG_BLOCK is set, it ends up getting backing-dev.h included.
But for !CONFIG_BLOCK, it isn't so lucky. The proper thing to do is
include <linux/backing-dev.h> directly from the file it's used from,
so do that.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6:
nfs: fix memory leak in nfs_get_sb with CONFIG_NFS_V4
nfs: fix some issues in nfs41_proc_reclaim_complete()
NFS: Ensure that nfs_wb_page() waits for Pg_writeback to clear
NFS: Fix an unstable write data integrity race
nfs: testing for null instead of ERR_PTR()
NFS: rsize and wsize settings ignored on v4 mounts
NFSv4: Don't attempt an atomic open if the file is a mountpoint
SUNRPC: Fix a bug in rpcauth_prune_expired
The pktcdvd driver uses proper locking and does not need the BKL in the
ioctl and llseek functions of the character device, so kill both.
Moving the compat_ioctl handling from common code into the driver itself
fixes build problems when CONFIG_BLOCK is disabled.
Acked-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit b3d0ab7e60d1865bb6f6a79a77aaba22f2543236 ("exofs: add bdi backing
to mount session") has a bug in the placement of the bdi member at
struct exofs_sb_info. The layout member must be kept last.
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Acked-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If a dentry found stale happens to be the root of a disconnected tree, we
can't d_drop() it; its d_hash is actually part of s_anon and d_drop()
would simply hide it from shrink_dcache_for_umount(), leading to
all sorts of fun, including busy inodes on umount and oopsen after
that.
Bug had been there since at least 2006 (commit c636eb already has it),
so it's definitely -stable fodder.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The original code passed an ERR_PTR() to rpc_put_task() and instead of
returning zero on success it returned -ENOMEM.
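
The two mistakes named above (releasing an error pointer, and returning an
error code on the success path) can be illustrated in user space. The
demo_err_ptr(), demo_is_err() and demo_ptr_err() helpers are hypothetical
imitations of the kernel's ERR_PTR() conventions, and demo_run_task() and
demo_put_task() are invented stand-ins, not the SUNRPC API.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define DEMO_MAX_ERRNO 4095

/* Encode a negative errno value in a pointer. */
static inline void *demo_err_ptr(long err)
{
	return (void *)(intptr_t)err;
}

/* Recover the errno value from an error pointer. */
static inline long demo_ptr_err(const void *p)
{
	return (long)(intptr_t)p;
}

/* True if the pointer lies in the range reserved for encoded errors. */
static inline int demo_is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-DEMO_MAX_ERRNO;
}

struct demo_task {
	int refs;
};

/* Returns a task, or an encoded error (never NULL). */
struct demo_task *demo_run_task(int fail)
{
	struct demo_task *t = fail ? NULL : calloc(1, sizeof(*t));

	if (!t)
		return demo_err_ptr(-ENOMEM);
	t->refs = 1;
	return t;
}

/* Drop a reference; must only be called with a real task. */
void demo_put_task(struct demo_task *t)
{
	if (--t->refs == 0)
		free(t);
}

int demo_reclaim_complete(int fail)
{
	struct demo_task *task = demo_run_task(fail);

	if (demo_is_err(task))
		return (int)demo_ptr_err(task); /* never put an error pointer */
	demo_put_task(task);
	return 0; /* success returns zero, not -ENOMEM */
}
```

In this sketch the error pointer is checked before any release call, and
the success path returns zero explicitly.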
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
coda: move backing-dev.h kernel include inside __KERNEL__
mtd: ensure that bdi entries are properly initialized and registered
Move mtd_bdi_*mappable to mtdcore.c
btrfs: convert to using bdi_setup_and_register()
Catch filesystems lacking s_bdi
drbd: Terminate a connection early if sending the protocol fails
drbd: fix memory leak
Fix JFFS2 sync silent failure
smbfs: add bdi backing to mount session
ncpfs: add bdi backing to mount session
exofs: add bdi backing to mount session
ecryptfs: add bdi backing to mount session
coda: add bdi backing to mount session
cifs: add bdi backing to mount session
afs: add bdi backing to mount session.
9p: add bdi backing to mount session
bdi: add helper function for doing init and register of a bdi for a file system
block: ensure jiffies wrap is handled correctly in blk_rq_timed_out_timer
Neil Brown reports that he is seeing the BUG_ON(ret == 0) trigger in
nfs_page_async_flush. According to the trace in
https://bugzilla.novell.com/show_bug.cgi?id=599628
the problem appears to be due to nfs_wb_page() not waiting for the
PG_writeback flag to clear.
The same problem exists in nfs_wb_page_cancel().
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>