2007-06-12 09:07:21 -04:00
|
|
|
/*
|
|
|
|
* Copyright (C) 2007 Oracle. All rights reserved.
|
|
|
|
*
|
|
|
|
* This program is free software; you can redistribute it and/or
|
|
|
|
* modify it under the terms of the GNU General Public
|
|
|
|
* License v2 as published by the Free Software Foundation.
|
|
|
|
*
|
|
|
|
* This program is distributed in the hope that it will be useful,
|
|
|
|
* but WITHOUT ANY WARRANTY; without even the implied warranty of
|
|
|
|
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
|
|
|
|
* General Public License for more details.
|
|
|
|
*
|
|
|
|
* You should have received a copy of the GNU General Public
|
|
|
|
* License along with this program; if not, write to the
|
|
|
|
* Free Software Foundation, Inc., 59 Temple Place - Suite 330,
|
|
|
|
* Boston, MA 02111-1307, USA.
|
|
|
|
*/
|
|
|
|
|
2008-01-08 15:46:30 -05:00
|
|
|
#ifndef __BTRFS_CTREE__
|
|
|
|
#define __BTRFS_CTREE__
|
2007-02-02 09:18:22 -05:00
|
|
|
|
2007-10-15 16:18:55 -04:00
|
|
|
#include <linux/mm.h>
|
|
|
|
#include <linux/highmem.h>
|
2007-03-22 12:13:20 -04:00
|
|
|
#include <linux/fs.h>
|
2011-03-08 14:14:00 +01:00
|
|
|
#include <linux/rwsem.h>
|
2007-08-29 15:47:34 -04:00
|
|
|
#include <linux/completion.h>
|
2008-03-26 10:28:07 -04:00
|
|
|
#include <linux/backing-dev.h>
|
2008-07-17 12:53:50 -04:00
|
|
|
#include <linux/wait.h>
|
include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.
The percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities to include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.
http://userweb.kernel.org/~tj/misc/slabh-sweep.py
The script does the following.
* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.
* When the script inserts a new include, it looks at the include
blocks and tries to put the new include such that its order conforms
to its surroundings. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.
* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.
The conversion was done in the following steps.
1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.
2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.
3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.
4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.
5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
widely available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.
6. percpu.h was updated not to include slab.h.
7. Build tests were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).
* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig
8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.
Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable on most builds of
the specific arch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-03-24 17:04:11 +09:00
|
|
|
#include <linux/slab.h>
|
2011-01-12 10:30:42 +01:00
|
|
|
#include <linux/kobject.h>
|
Btrfs: add initial tracepoint support for btrfs
Tracepoints can provide insight into why btrfs hits bugs and be greatly
helpful for debugging, e.g.
dd-7822 [000] 2121.641088: btrfs_inode_request: root = 5(FS_TREE), gen = 4, ino = 256, blocks = 8, disk_i_size = 0, last_trans = 8, logged_trans = 0
dd-7822 [000] 2121.641100: btrfs_inode_new: root = 5(FS_TREE), gen = 8, ino = 257, blocks = 0, disk_i_size = 0, last_trans = 0, logged_trans = 0
btrfs-transacti-7804 [001] 2146.935420: btrfs_cow_block: root = 2(EXTENT_TREE), refs = 2, orig_buf = 29368320 (orig_level = 0), cow_buf = 29388800 (cow_level = 0)
btrfs-transacti-7804 [001] 2146.935473: btrfs_cow_block: root = 1(ROOT_TREE), refs = 2, orig_buf = 29364224 (orig_level = 0), cow_buf = 29392896 (cow_level = 0)
btrfs-transacti-7804 [001] 2146.972221: btrfs_transaction_commit: root = 1(ROOT_TREE), gen = 8
flush-btrfs-2-7821 [001] 2155.824210: btrfs_chunk_alloc: root = 3(CHUNK_TREE), offset = 1103101952, size = 1073741824, num_stripes = 1, sub_stripes = 0, type = DATA
flush-btrfs-2-7821 [001] 2155.824241: btrfs_cow_block: root = 2(EXTENT_TREE), refs = 2, orig_buf = 29388800 (orig_level = 0), cow_buf = 29396992 (cow_level = 0)
flush-btrfs-2-7821 [001] 2155.824255: btrfs_cow_block: root = 4(DEV_TREE), refs = 2, orig_buf = 29372416 (orig_level = 0), cow_buf = 29401088 (cow_level = 0)
flush-btrfs-2-7821 [000] 2155.824329: btrfs_cow_block: root = 3(CHUNK_TREE), refs = 2, orig_buf = 20971520 (orig_level = 0), cow_buf = 20975616 (cow_level = 0)
btrfs-endio-wri-7800 [001] 2155.898019: btrfs_cow_block: root = 5(FS_TREE), refs = 2, orig_buf = 29384704 (orig_level = 0), cow_buf = 29405184 (cow_level = 0)
btrfs-endio-wri-7800 [001] 2155.898043: btrfs_cow_block: root = 7(CSUM_TREE), refs = 2, orig_buf = 29376512 (orig_level = 0), cow_buf = 29409280 (cow_level = 0)
Here is what I have added:
1) ordered_extent:
btrfs_ordered_extent_add
btrfs_ordered_extent_remove
btrfs_ordered_extent_start
btrfs_ordered_extent_put
These provide critical information to understand how ordered_extents are
updated.
2) extent_map:
btrfs_get_extent
extent_map is used in both read and write cases, and it is useful for tracking
how btrfs specific IO is running.
3) writepage:
__extent_writepage
btrfs_writepage_end_io_hook
Pages are critical resources and produce a lot of corner cases during writeback,
so it is valuable to know how a page is written to disk.
4) inode:
btrfs_inode_new
btrfs_inode_request
btrfs_inode_evict
These can show where and when an inode is created and when an inode is evicted.
5) sync:
btrfs_sync_file
btrfs_sync_fs
These show sync arguments.
6) transaction:
btrfs_transaction_commit
In a transaction-based filesystem, it is useful to know the generation and
who does the commit.
7) back reference and cow:
btrfs_delayed_tree_ref
btrfs_delayed_data_ref
btrfs_delayed_ref_head
btrfs_cow_block
Btrfs natively supports back references; these tracepoints are helpful for
understanding btrfs's COW mechanism.
8) chunk:
btrfs_chunk_alloc
btrfs_chunk_free
A chunk is a link between physical offset and logical offset, and stands for space
information in btrfs; these tracepoints are helpful for tracing space allocation.
9) reserved_extent:
btrfs_reserved_extent_alloc
btrfs_reserved_extent_free
These can show how btrfs uses its space.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-03-24 11:18:59 +00:00
|
|
|
#include <trace/events/btrfs.h>
|
2007-10-15 16:14:27 -04:00
|
|
|
#include <asm/kmap_types.h>
|
2011-09-21 15:05:58 -04:00
|
|
|
#include <linux/pagemap.h>
|
2008-01-24 16:13:08 -05:00
|
|
|
#include "extent_io.h"
|
2007-10-15 16:14:19 -04:00
|
|
|
#include "extent_map.h"
|
2008-06-11 16:50:36 -04:00
|
|
|
#include "async-thread.h"
|
2011-03-08 14:14:00 +01:00
|
|
|
#include "ioctl.h"
|
2007-03-22 12:13:20 -04:00
|
|
|
|
2007-03-16 16:20:31 -04:00
|
|
|
struct btrfs_trans_handle;
|
2007-03-22 15:59:16 -04:00
|
|
|
struct btrfs_transaction;
|
2010-05-16 10:48:46 -04:00
|
|
|
struct btrfs_pending_snapshot;
|
2007-05-02 15:53:43 -04:00
|
|
|
extern struct kmem_cache *btrfs_trans_handle_cachep;
|
|
|
|
extern struct kmem_cache *btrfs_transaction_cachep;
|
|
|
|
extern struct kmem_cache *btrfs_bit_radix_cachep;
|
2007-04-02 10:50:19 -04:00
|
|
|
extern struct kmem_cache *btrfs_path_cachep;
|
2011-01-28 17:05:48 -05:00
|
|
|
extern struct kmem_cache *btrfs_free_space_cachep;
|
2008-07-17 12:53:50 -04:00
|
|
|
struct btrfs_ordered_sum;
|
2007-03-16 16:20:31 -04:00
|
|
|
|
2008-12-02 09:58:02 -05:00
|
|
|
#define BTRFS_MAGIC "_BHRfS_M"
|
2007-02-02 09:18:22 -05:00
|
|
|
|
2009-02-12 14:09:45 -05:00
|
|
|
#define BTRFS_MAX_LEVEL 8
|
2008-03-24 15:01:56 -04:00
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allows us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds a per-subvolume red-black tree to keep track of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers are within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduces the overhead of checking back refs.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 10:45:14 -04:00
|
|
|
#define BTRFS_COMPAT_EXTENT_TREE_V0
|
|
|
|
|
2009-03-31 13:27:11 -04:00
|
|
|
/*
|
|
|
|
* files bigger than this get some pre-flushing when they are added
|
|
|
|
* to the ordered operations list. That way we limit the total
|
|
|
|
* work done by the commit
|
|
|
|
*/
|
|
|
|
#define BTRFS_ORDERED_OPERATIONS_FLUSH_LIMIT (8 * 1024 * 1024)
|
|
|
|
|
2008-03-24 15:01:56 -04:00
|
|
|
/* holds pointers to all of the tree roots */
|
2007-03-27 06:33:00 -04:00
|
|
|
#define BTRFS_ROOT_TREE_OBJECTID 1ULL
|
2008-03-24 15:01:56 -04:00
|
|
|
|
|
|
|
/* stores information about which extents are in use, and reference counts */
|
2007-06-09 09:22:25 -04:00
|
|
|
#define BTRFS_EXTENT_TREE_OBJECTID 2ULL
|
2008-03-24 15:01:56 -04:00
|
|
|
|
|
|
|
/*
|
|
|
|
* chunk tree stores translations from logical -> physical block numbering
|
|
|
|
* the super block points to the chunk tree
|
|
|
|
*/
|
2008-03-24 15:02:07 -04:00
|
|
|
#define BTRFS_CHUNK_TREE_OBJECTID 3ULL
|
2008-03-24 15:01:56 -04:00
|
|
|
|
|
|
|
/*
|
|
|
|
* stores information about which areas of a given device are in use.
|
|
|
|
* one per device. The tree of tree roots points to the device tree
|
|
|
|
*/
|
2008-03-24 15:02:07 -04:00
|
|
|
#define BTRFS_DEV_TREE_OBJECTID 4ULL
|
|
|
|
|
|
|
|
/* one per subvolume, storing files and directories */
|
|
|
|
#define BTRFS_FS_TREE_OBJECTID 5ULL
|
|
|
|
|
|
|
|
/* directory objectid inside the root tree */
|
|
|
|
#define BTRFS_ROOT_TREE_DIR_OBJECTID 6ULL
|
2008-03-24 15:01:56 -04:00
|
|
|
|
Btrfs: move data checksumming into a dedicated tree
Btrfs stores checksums for each data block. Until now, they have
been stored in the subvolume trees, indexed by the inode that is
referencing the data block. This means that when we read the inode,
we've probably read in at least some checksums as well.
But, this has a few problems:
* The checksums are indexed by logical offset in the file. When
compression is on, this means we have to do the expensive checksumming
on the uncompressed data. It would be faster if we could checksum
the compressed data instead.
* If we implement encryption, we'll be checksumming the plain text and
storing that on disk. This is significantly less secure.
* For either compression or encryption, we have to get the plain text
back before we can verify the checksum as correct. This makes the raid
layer balancing and extent moving much more expensive.
* It makes the front end caching code more complex, as we have to touch
the subvolume and inodes as we cache extents.
* There is potentially one copy of the checksum in each subvolume
referencing an extent.
The solution used here is to store the extent checksums in a dedicated
tree. This allows us to index the checksums by physical extent
start and length. It means:
* The checksum is against the data stored on disk, after any compression
or encryption is done.
* The checksum is stored in a central location, and can be verified without
following back references, or reading inodes.
This makes compression significantly faster by reducing the amount of
data that needs to be checksummed. It will also allow much faster
raid management code in general.
The checksums are indexed by a key with a fixed objectid (a magic value
in ctree.h) and offset set to the starting byte of the extent. This
allows us to copy the checksum items into the fsync log tree directly (or
any other tree), without having to invent a second format for them.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-12-08 16:58:54 -05:00
|
|
|
/* holds checksums of all the data extents */
|
|
|
|
#define BTRFS_CSUM_TREE_OBJECTID 7ULL
|
|
|
|
|
2008-07-24 12:17:14 -04:00
|
|
|
/* orphan objectid for tracking unlinked/truncated files */
|
|
|
|
#define BTRFS_ORPHAN_OBJECTID -5ULL
|
|
|
|
|
2008-09-05 16:13:11 -04:00
|
|
|
/* does write ahead logging to speed up fsyncs */
|
|
|
|
#define BTRFS_TREE_LOG_OBJECTID -6ULL
|
|
|
|
#define BTRFS_TREE_LOG_FIXUP_OBJECTID -7ULL
|
|
|
|
|
2008-09-26 10:04:53 -04:00
|
|
|
/* for space balancing */
|
|
|
|
#define BTRFS_TREE_RELOC_OBJECTID -8ULL
|
|
|
|
#define BTRFS_DATA_RELOC_TREE_OBJECTID -9ULL
|
|
|
|
|
2008-12-08 16:58:54 -05:00
|
|
|
/*
|
|
|
|
* extent checksums all have this objectid
|
|
|
|
* this allows them to share the logging tree
|
|
|
|
* for fsyncs
|
|
|
|
*/
|
|
|
|
#define BTRFS_EXTENT_CSUM_OBJECTID -10ULL
|
|
|
|
|
2010-06-21 14:48:16 -04:00
|
|
|
/* For storing free space cache */
|
|
|
|
#define BTRFS_FREE_SPACE_OBJECTID -11ULL
|
|
|
|
|
2011-04-20 10:33:24 +08:00
|
|
|
/*
|
|
|
|
* The inode number assigned to the special inode for storing
|
|
|
|
* free ino cache
|
|
|
|
*/
|
|
|
|
#define BTRFS_FREE_INO_OBJECTID -12ULL
|
|
|
|
|
2008-09-23 13:14:14 -04:00
|
|
|
/* dummy objectid represents multiple objectids */
|
|
|
|
#define BTRFS_MULTIPLE_OBJECTIDS -255ULL
|
|
|
|
|
2008-03-24 15:01:56 -04:00
|
|
|
/*
|
2008-09-05 16:43:53 -04:00
|
|
|
* All files have objectids in this range.
|
2008-03-24 15:01:56 -04:00
|
|
|
*/
|
2007-12-13 11:13:32 -05:00
|
|
|
#define BTRFS_FIRST_FREE_OBJECTID 256ULL
|
2008-09-05 16:43:53 -04:00
|
|
|
#define BTRFS_LAST_FREE_OBJECTID -256ULL
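The negative literals above (for example -256ULL) rely on unsigned wraparound: the unary minus is applied to an unsigned 64-bit constant, so -256ULL is 2^64 - 256. The following is a minimal userspace sketch of how the file objectid range could be tested. The helper name is_file_objectid is hypothetical and does not appear in the original header.

```c
#include <stdbool.h>
#include <stdint.h>

/* Values copied from the definitions above. */
#define BTRFS_FIRST_FREE_OBJECTID 256ULL
#define BTRFS_LAST_FREE_OBJECTID -256ULL	/* wraps to 2^64 - 256 */
#define BTRFS_ORPHAN_OBJECTID -5ULL		/* wraps to 2^64 - 5 */

/*
 * Hypothetical helper (not part of the original header): returns true
 * when objectid lies in the range that regular files may use.
 */
static bool is_file_objectid(uint64_t objectid)
{
	return objectid >= BTRFS_FIRST_FREE_OBJECTID &&
	       objectid <= BTRFS_LAST_FREE_OBJECTID;
}
```

Because the special objectids (-5ULL through -255ULL) all wrap to values above BTRFS_LAST_FREE_OBJECTID, a single range comparison is enough to separate them from file objectids.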
|
2008-04-15 15:41:47 -04:00
|
|
|
#define BTRFS_FIRST_CHUNK_TREE_OBJECTID 256ULL
|
2007-03-13 16:47:54 -04:00
|
|
|
|
2008-03-24 15:01:56 -04:00
|
|
|
|
|
|
|
/*
|
|
|
|
* the device items go into the chunk tree. The key is in the form
|
|
|
|
* [ 1 BTRFS_DEV_ITEM_KEY device_id ]
|
|
|
|
*/
|
|
|
|
#define BTRFS_DEV_ITEMS_OBJECTID 1ULL
|
|
|
|
|
2009-09-21 15:56:00 -04:00
|
|
|
#define BTRFS_BTREE_INODE_OBJECTID 1
|
|
|
|
|
|
|
|
#define BTRFS_EMPTY_SUBVOL_DIR_OBJECTID 2
|
|
|
|
|
2007-03-22 12:13:20 -04:00
|
|
|
/*
|
|
|
|
* we can actually store much bigger names, but let's not confuse the rest
|
|
|
|
* of linux
|
|
|
|
*/
|
|
|
|
#define BTRFS_NAME_LEN 255
|
|
|
|
|
2007-03-29 15:15:27 -04:00
|
|
|
/* 32 bytes in various csum fields */
|
|
|
|
#define BTRFS_CSUM_SIZE 32
|
2008-12-02 07:17:45 -05:00
|
|
|
|
|
|
|
/* csum types */
|
|
|
|
#define BTRFS_CSUM_TYPE_CRC32 0
|
|
|
|
|
|
|
|
static int btrfs_csum_sizes[] = { 4, 0 };
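The table above maps a checksum type to the number of bytes that type actually uses inside the 32-byte checksum fields. The following is a minimal sketch of a bounds-checked lookup, assuming a hypothetical helper named csum_size_for_type; the original header indexes the array directly elsewhere and may not validate the type this way.

```c
#include <stddef.h>

/* Values copied from the definitions above. */
#define BTRFS_CSUM_SIZE 32
#define BTRFS_CSUM_TYPE_CRC32 0

/* Local copy of the table, so the sketch stands alone. */
static const int csum_sizes[] = { 4, 0 };

/*
 * Hypothetical helper: returns the number of checksum bytes used by
 * the given type, or -1 when the type is outside the table.
 */
static int csum_size_for_type(unsigned int type)
{
	if (type >= sizeof(csum_sizes) / sizeof(csum_sizes[0]))
		return -1;
	return csum_sizes[type];
}
```

The unused tail of each 32-byte field leaves room for wider checksums without changing the on-disk layout.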
|
|
|
|
|
2007-05-10 12:36:17 -04:00
|
|
|
/* four bytes for CRC32 */
|
2007-12-12 14:38:19 -05:00
|
|
|
#define BTRFS_EMPTY_DIR_SIZE 0
|
2007-03-29 15:15:27 -04:00
|
|
|
|
2007-06-07 22:13:21 -04:00
|
|
|
#define BTRFS_FT_UNKNOWN 0
|
|
|
|
#define BTRFS_FT_REG_FILE 1
|
|
|
|
#define BTRFS_FT_DIR 2
|
|
|
|
#define BTRFS_FT_CHRDEV 3
|
|
|
|
#define BTRFS_FT_BLKDEV 4
|
|
|
|
#define BTRFS_FT_FIFO 5
|
|
|
|
#define BTRFS_FT_SOCK 6
|
|
|
|
#define BTRFS_FT_SYMLINK 7
|
2007-11-16 11:45:54 -05:00
|
|
|
#define BTRFS_FT_XATTR 8
|
|
|
|
#define BTRFS_FT_MAX 9
|
2007-06-07 22:13:21 -04:00
|
|
|
|
2007-02-26 10:40:21 -05:00
|
|
|
/*
|
2009-04-02 16:46:06 -04:00
|
|
|
* The key defines the order in the tree, and so it also defines (optimal)
|
|
|
|
* block layout.
|
|
|
|
*
|
|
|
|
* objectid corresponds to the inode number.
|
|
|
|
*
|
|
|
|
* type tells us things about the object, and is a kind of stream selector.
|
|
|
|
* so for a given inode, keys with type of 1 might refer to the inode data,
|
|
|
|
* type of 2 may point to file data in the btree and type == 3 may point to
|
|
|
|
* extents.
|
2007-02-26 10:40:21 -05:00
|
|
|
*
|
|
|
|
* offset is the starting byte offset for this key in the stream.
|
2007-03-12 16:22:34 -04:00
|
|
|
*
|
|
|
|
* btrfs_disk_key is in disk byte order. struct btrfs_key is always
|
|
|
|
* in cpu native order. Otherwise they are identical and their sizes
|
|
|
|
* should be the same (ie both packed)
|
2007-02-26 10:40:21 -05:00
|
|
|
*/
|
2007-03-12 16:22:34 -04:00
|
|
|
struct btrfs_disk_key {
|
|
|
|
__le64 objectid;
|
2007-10-15 16:14:19 -04:00
|
|
|
u8 type;
|
2007-04-17 15:39:32 -04:00
|
|
|
__le64 offset;
|
2007-03-12 16:22:34 -04:00
|
|
|
} __attribute__ ((__packed__));
|
|
|
|
|
|
|
|
struct btrfs_key {
|
2007-02-02 09:18:22 -05:00
|
|
|
u64 objectid;
|
2007-10-15 16:14:19 -04:00
|
|
|
u8 type;
|
2007-04-17 15:39:32 -04:00
|
|
|
u64 offset;
|
2007-02-02 09:18:22 -05:00
|
|
|
} __attribute__ ((__packed__));
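The comment above says the key defines the order in the tree. The following is a minimal userspace sketch of that ordering, comparing objectid first, then type, then offset. The struct key stand-in and the key_cmp function are hypothetical names used for illustration; they are not taken from the original header.

```c
#include <stdint.h>

/* Userspace stand-in for struct btrfs_key (CPU byte order). */
struct key {
	uint64_t objectid;
	uint8_t type;
	uint64_t offset;
} __attribute__ ((__packed__));

/* Packed layout: 8 + 1 + 8 bytes, with no padding. */
_Static_assert(sizeof(struct key) == 17, "key must be packed");

/*
 * Hypothetical comparator: returns a negative value, zero or a
 * positive value when a sorts before, equal to or after b.
 */
static int key_cmp(const struct key *a, const struct key *b)
{
	if (a->objectid != b->objectid)
		return a->objectid < b->objectid ? -1 : 1;
	if (a->type != b->type)
		return a->type < b->type ? -1 : 1;
	if (a->offset != b->offset)
		return a->offset < b->offset ? -1 : 1;
	return 0;
}
```

With this ordering, all items of one inode sort together, grouped by type and then by offset within each type.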
|
|
|
|
|
2008-03-24 15:01:56 -04:00
|
|
|
struct btrfs_mapping_tree {
|
|
|
|
struct extent_map_tree map_tree;
|
|
|
|
};
|
|
|
|
|
|
|
|
struct btrfs_dev_item {
|
|
|
|
/* the internal btrfs device id */
|
|
|
|
__le64 devid;
|
|
|
|
|
|
|
|
/* size of the device */
|
|
|
|
__le64 total_bytes;
|
|
|
|
|
|
|
|
/* bytes used */
|
|
|
|
__le64 bytes_used;
|
|
|
|
|
|
|
|
/* optimal io alignment for this device */
|
|
|
|
__le32 io_align;
|
|
|
|
|
|
|
|
/* optimal io width for this device */
|
|
|
|
__le32 io_width;
|
|
|
|
|
|
|
|
/* minimal io size for this device */
|
|
|
|
__le32 sector_size;
|
|
|
|
|
|
|
|
/* type and info about this device */
|
|
|
|
__le64 type;
|
|
|
|
|
2008-11-17 21:11:30 -05:00
|
|
|
/* expected generation for this device */
|
|
|
|
__le64 generation;
|
|
|
|
|
2008-12-08 16:40:21 -05:00
|
|
|
/*
|
|
|
|
* starting byte of this partition on the device,
|
2009-04-02 16:46:06 -04:00
|
|
|
* to allow for stripe alignment in the future
|
2008-12-08 16:40:21 -05:00
|
|
|
*/
|
|
|
|
__le64 start_offset;
|
|
|
|
|
2008-04-15 15:41:47 -04:00
|
|
|
/* grouping information for allocation decisions */
|
|
|
|
__le32 dev_group;
|
|
|
|
|
|
|
|
/* seek speed 0-100 where 100 is fastest */
|
|
|
|
u8 seek_speed;
|
|
|
|
|
|
|
|
/* bandwidth 0-100 where 100 is fastest */
|
|
|
|
u8 bandwidth;
|
|
|
|
|
2008-03-24 15:02:07 -04:00
|
|
|
/* btrfs generated uuid for this device */
|
2008-04-15 15:41:47 -04:00
|
|
|
u8 uuid[BTRFS_UUID_SIZE];
|
2008-11-17 21:11:30 -05:00
|
|
|
|
|
|
|
/* uuid of FS who owns this device */
|
|
|
|
u8 fsid[BTRFS_UUID_SIZE];
|
2008-03-24 15:01:56 -04:00
|
|
|
} __attribute__ ((__packed__));
|
|
|
|
|
|
|
|
struct btrfs_stripe {
|
|
|
|
__le64 devid;
|
|
|
|
__le64 offset;
|
2008-04-15 15:41:47 -04:00
|
|
|
u8 dev_uuid[BTRFS_UUID_SIZE];
|
2008-03-24 15:01:56 -04:00
|
|
|
} __attribute__ ((__packed__));
|
|
|
|
|
|
|
|
struct btrfs_chunk {
|
2008-04-15 15:41:47 -04:00
|
|
|
/* size of this chunk in bytes */
|
|
|
|
__le64 length;
|
|
|
|
|
|
|
|
/* objectid of the root referencing this chunk */
|
2008-03-24 15:01:56 -04:00
|
|
|
__le64 owner;
|
2008-04-15 15:41:47 -04:00
|
|
|
|
2008-03-24 15:01:56 -04:00
|
|
|
__le64 stripe_len;
|
|
|
|
__le64 type;
|
|
|
|
|
|
|
|
/* optimal io alignment for this chunk */
|
|
|
|
__le32 io_align;
|
|
|
|
|
|
|
|
/* optimal io width for this chunk */
|
|
|
|
__le32 io_width;
|
|
|
|
|
|
|
|
/* minimal io size for this chunk */
|
|
|
|
__le32 sector_size;
|
|
|
|
|
|
|
|
/* 2^16 stripes is quite a lot, a second limit is the size of a single
|
|
|
|
* item in the btree
|
|
|
|
*/
|
|
|
|
__le16 num_stripes;
|
2008-04-16 10:49:51 -04:00
|
|
|
|
|
|
|
/* sub stripes only matter for raid10 */
|
|
|
|
__le16 sub_stripes;
|
2008-03-24 15:01:56 -04:00
|
|
|
struct btrfs_stripe stripe;
|
|
|
|
/* additional stripes go here */
|
|
|
|
} __attribute__ ((__packed__));
|
|
|
|
|
2010-06-21 14:48:16 -04:00
|
|
|
#define BTRFS_FREE_SPACE_EXTENT 1
|
|
|
|
#define BTRFS_FREE_SPACE_BITMAP 2
|
|
|
|
|
|
|
|
struct btrfs_free_space_entry {
|
|
|
|
__le64 offset;
|
|
|
|
__le64 bytes;
|
|
|
|
u8 type;
|
|
|
|
} __attribute__ ((__packed__));
|
|
|
|
|
|
|
|
struct btrfs_free_space_header {
|
|
|
|
struct btrfs_disk_key location;
|
|
|
|
__le64 generation;
|
|
|
|
__le64 num_entries;
|
|
|
|
__le64 num_bitmaps;
|
|
|
|
} __attribute__ ((__packed__));
|
|
|
|
|
2008-03-24 15:01:56 -04:00
|
|
|
static inline unsigned long btrfs_chunk_item_size(int num_stripes)
|
|
|
|
{
|
|
|
|
BUG_ON(num_stripes == 0);
|
|
|
|
return sizeof(struct btrfs_chunk) +
|
|
|
|
sizeof(struct btrfs_stripe) * (num_stripes - 1);
|
|
|
|
}
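As a worked example of the size calculation above: struct btrfs_chunk already embeds one stripe, so each additional stripe adds sizeof(struct btrfs_stripe). The following userspace sketch mirrors the packed layouts with fixed-width types. It assumes BTRFS_UUID_SIZE is 16, which is not defined in this excerpt, and it replaces BUG_ON with assert; the struct and function names are stand-ins.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define UUID_SIZE 16	/* assumption: BTRFS_UUID_SIZE is not shown above */

struct stripe {
	uint64_t devid;
	uint64_t offset;
	uint8_t dev_uuid[UUID_SIZE];
} __attribute__ ((__packed__));

struct chunk {
	uint64_t length;
	uint64_t owner;
	uint64_t stripe_len;
	uint64_t type;
	uint32_t io_align;
	uint32_t io_width;
	uint32_t sector_size;
	uint16_t num_stripes;
	uint16_t sub_stripes;
	struct stripe stripe;	/* additional stripes follow on disk */
} __attribute__ ((__packed__));

/* Same arithmetic as btrfs_chunk_item_size(), with assert for BUG_ON. */
static size_t chunk_item_size(int num_stripes)
{
	assert(num_stripes > 0);
	return sizeof(struct chunk) +
		sizeof(struct stripe) * (size_t)(num_stripes - 1);
}
```

Under the 16-byte UUID assumption, a stripe is 32 bytes and a single-stripe chunk item is 80 bytes, so a chunk with three stripes takes 80 + 2 * 32 = 144 bytes.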
|
|
|
|
|
2009-06-10 10:45:14 -04:00
|
|
|
#define BTRFS_HEADER_FLAG_WRITTEN (1ULL << 0)
|
|
|
|
#define BTRFS_HEADER_FLAG_RELOC (1ULL << 1)
|
2011-01-06 19:30:25 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* File system states
|
|
|
|
*/
|
|
|
|
|
|
|
|
/* Errors detected */
|
|
|
|
#define BTRFS_SUPER_FLAG_ERROR (1ULL << 2)
|
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds a per-subvolume red-black tree to keep track of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers fall within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduces the overhead of checking back refs.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 10:45:14 -04:00
#define BTRFS_SUPER_FLAG_SEEDING	(1ULL << 32)
#define BTRFS_SUPER_FLAG_METADUMP	(1ULL << 33)

#define BTRFS_BACKREF_REV_MAX		256
#define BTRFS_BACKREF_REV_SHIFT		56
#define BTRFS_BACKREF_REV_MASK		(((u64)BTRFS_BACKREF_REV_MAX - 1) << \
					 BTRFS_BACKREF_REV_SHIFT)

#define BTRFS_OLD_BACKREF_REV		0
#define BTRFS_MIXED_BACKREF_REV		1
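
The shift and mask above place the back reference revision in the top byte of the 64-bit header flags word. A minimal user-space sketch of reading and writing that field follows. The helper names `example_backref_rev` and `example_set_backref_rev` are hypothetical and are not taken from this header, and `uint64_t` stands in for the kernel's `u64`.

```c
#include <assert.h>
#include <stdint.h>

#define BTRFS_BACKREF_REV_MAX		256
#define BTRFS_BACKREF_REV_SHIFT		56
#define BTRFS_BACKREF_REV_MASK		(((uint64_t)BTRFS_BACKREF_REV_MAX - 1) << \
					 BTRFS_BACKREF_REV_SHIFT)
#define BTRFS_OLD_BACKREF_REV		0
#define BTRFS_MIXED_BACKREF_REV		1

/* Hypothetical helper: read the revision out of a header flags word. */
static inline unsigned int example_backref_rev(uint64_t flags)
{
	return (unsigned int)((flags & BTRFS_BACKREF_REV_MASK) >>
			      BTRFS_BACKREF_REV_SHIFT);
}

/* Hypothetical helper: store a revision, leaving the other bits untouched. */
static inline uint64_t example_set_backref_rev(uint64_t flags, unsigned int rev)
{
	assert(rev < BTRFS_BACKREF_REV_MAX);
	flags &= ~BTRFS_BACKREF_REV_MASK;
	return flags | ((uint64_t)rev << BTRFS_BACKREF_REV_SHIFT);
}
```

Because the revision occupies only bits 56 to 63, the low flag bits such as BTRFS_HEADER_FLAG_RELOC survive a revision update unchanged.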

/*
 * every tree block (leaf or node) starts with this header.
 */
struct btrfs_header {
	/* these first four must match the super block */
	u8 csum[BTRFS_CSUM_SIZE];
	u8 fsid[BTRFS_FSID_SIZE]; /* FS specific uuid */
	__le64 bytenr; /* which block this node is supposed to live in */
	__le64 flags;

	/* allowed to be different from the super from here on down */
	u8 chunk_tree_uuid[BTRFS_UUID_SIZE];
	__le64 generation;
	__le64 owner;
	__le32 nritems;
	u8 level;
} __attribute__ ((__packed__));

#define BTRFS_NODEPTRS_PER_BLOCK(r) (((r)->nodesize - \
				      sizeof(struct btrfs_header)) / \
				     sizeof(struct btrfs_key_ptr))
#define __BTRFS_LEAF_DATA_SIZE(bs) ((bs) - sizeof(struct btrfs_header))
#define BTRFS_LEAF_DATA_SIZE(r) (__BTRFS_LEAF_DATA_SIZE((r)->leafsize))
#define BTRFS_MAX_INLINE_DATA_SIZE(r) (BTRFS_LEAF_DATA_SIZE(r) - \
					sizeof(struct btrfs_item) - \
					sizeof(struct btrfs_file_extent_item))
#define BTRFS_MAX_XATTR_SIZE(r)	(BTRFS_LEAF_DATA_SIZE(r) - \
				 sizeof(struct btrfs_item) - \
				 sizeof(struct btrfs_dir_item))
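
The capacity macros above are plain arithmetic on packed structure sizes. The sketch below repeats that arithmetic in user space so the numbers can be checked. Everything prefixed `ex_` or `EX_` is hypothetical: the checksum, fsid and uuid sizes (32, 16, 16) and the `btrfs_disk_key` layout are assumptions, since those definitions live outside this section.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed values; the real constants are defined elsewhere in this header. */
#define EX_CSUM_SIZE	32
#define EX_FSID_SIZE	16
#define EX_UUID_SIZE	16

/* Assumed layout of struct btrfs_disk_key (not shown in this section). */
struct ex_disk_key {
	uint64_t objectid;
	uint8_t type;
	uint64_t offset;
} __attribute__ ((__packed__));

/* Mirrors struct btrfs_header with fixed-width host types. */
struct ex_header {
	uint8_t csum[EX_CSUM_SIZE];
	uint8_t fsid[EX_FSID_SIZE];
	uint64_t bytenr;
	uint64_t flags;
	uint8_t chunk_tree_uuid[EX_UUID_SIZE];
	uint64_t generation;
	uint64_t owner;
	uint32_t nritems;
	uint8_t level;
} __attribute__ ((__packed__));

/* Mirrors struct btrfs_key_ptr. */
struct ex_key_ptr {
	struct ex_disk_key key;
	uint64_t blockptr;
	uint64_t generation;
} __attribute__ ((__packed__));

/* Same arithmetic as BTRFS_NODEPTRS_PER_BLOCK, taking the size directly. */
static inline size_t ex_nodeptrs_per_block(uint32_t nodesize)
{
	assert(nodesize > sizeof(struct ex_header));
	return (nodesize - sizeof(struct ex_header)) / sizeof(struct ex_key_ptr);
}

/* Same arithmetic as __BTRFS_LEAF_DATA_SIZE. */
static inline size_t ex_leaf_data_size(uint32_t leafsize)
{
	assert(leafsize > sizeof(struct ex_header));
	return leafsize - sizeof(struct ex_header);
}
```

Under these assumed sizes the header is 101 bytes and a key pointer is 33 bytes, so a 4096-byte block leaves 3995 bytes of leaf data or room for 121 key pointers.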

/*
 * this is a very generous portion of the super block, giving us
 * room to translate 14 chunks with 3 stripes each.
 */
#define BTRFS_SYSTEM_CHUNK_ARRAY_SIZE 2048
#define BTRFS_LABEL_SIZE 256

/*
 * just in case we somehow lose the roots and are not able to mount,
 * we store an array of the roots from previous transactions
 * in the super.
 */
#define BTRFS_NUM_BACKUP_ROOTS 4
struct btrfs_root_backup {
	__le64 tree_root;
	__le64 tree_root_gen;

	__le64 chunk_root;
	__le64 chunk_root_gen;

	__le64 extent_root;
	__le64 extent_root_gen;

	__le64 fs_root;
	__le64 fs_root_gen;

	__le64 dev_root;
	__le64 dev_root_gen;

	__le64 csum_root;
	__le64 csum_root_gen;

	__le64 total_bytes;
	__le64 bytes_used;
	__le64 num_devices;
	/* future */
	__le64 unsed_64[4];

	u8 tree_root_level;
	u8 chunk_root_level;
	u8 extent_root_level;
	u8 fs_root_level;
	u8 dev_root_level;
	u8 csum_root_level;
	/* future and to align */
	u8 unused_8[10];
} __attribute__ ((__packed__));

/*
 * the super block basically lists the main trees of the FS
 * it currently lacks any block count etc etc
 */
struct btrfs_super_block {
	u8 csum[BTRFS_CSUM_SIZE];
	/* the first 4 fields must match struct btrfs_header */
	u8 fsid[BTRFS_FSID_SIZE];    /* FS specific uuid */
	__le64 bytenr; /* this block number */
	__le64 flags;

	/* allowed to be different from the btrfs_header from here on down */
	__le64 magic;
	__le64 generation;
	__le64 root;
	__le64 chunk_root;
	__le64 log_root;

	/* this will help find the new super based on the log root */
	__le64 log_root_transid;
	__le64 total_bytes;
	__le64 bytes_used;
	__le64 root_dir_objectid;
	__le64 num_devices;
	__le32 sectorsize;
	__le32 nodesize;
	__le32 leafsize;
	__le32 stripesize;
	__le32 sys_chunk_array_size;
	__le64 chunk_root_generation;
	__le64 compat_flags;
	__le64 compat_ro_flags;
	__le64 incompat_flags;
	__le16 csum_type;
	u8 root_level;
	u8 chunk_root_level;
	u8 log_root_level;
	struct btrfs_dev_item dev_item;

	char label[BTRFS_LABEL_SIZE];

	__le64 cache_generation;

	/* future expansion */
	__le64 reserved[31];
	u8 sys_chunk_array[BTRFS_SYSTEM_CHUNK_ARRAY_SIZE];
	struct btrfs_root_backup super_roots[BTRFS_NUM_BACKUP_ROOTS];
} __attribute__ ((__packed__));

/*
 * Compat flags that we support.  If any incompat flags are set other than the
 * ones specified below then we will fail to mount
 */
#define BTRFS_FEATURE_INCOMPAT_MIXED_BACKREF	(1ULL << 0)
#define BTRFS_FEATURE_INCOMPAT_DEFAULT_SUBVOL	(1ULL << 1)
#define BTRFS_FEATURE_INCOMPAT_MIXED_GROUPS	(1ULL << 2)
#define BTRFS_FEATURE_INCOMPAT_COMPRESS_LZO	(1ULL << 3)

#define BTRFS_FEATURE_COMPAT_SUPP		0ULL
#define BTRFS_FEATURE_COMPAT_RO_SUPP		0ULL
#define BTRFS_FEATURE_INCOMPAT_SUPP			\
	(BTRFS_FEATURE_INCOMPAT_MIXED_BACKREF |		\
	 BTRFS_FEATURE_INCOMPAT_DEFAULT_SUBVOL |	\
	 BTRFS_FEATURE_INCOMPAT_MIXED_GROUPS |		\
	 BTRFS_FEATURE_INCOMPAT_COMPRESS_LZO)
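
The comment above says a mount must fail when the super block carries an incompat bit outside the supported set. A minimal sketch of such a test is shown here. The two helper functions are hypothetical illustrations, not the kernel's actual mount path, and `uint64_t` stands in for a decoded `incompat_flags` value.

```c
#include <stdint.h>

#define BTRFS_FEATURE_INCOMPAT_MIXED_BACKREF	(1ULL << 0)
#define BTRFS_FEATURE_INCOMPAT_DEFAULT_SUBVOL	(1ULL << 1)
#define BTRFS_FEATURE_INCOMPAT_MIXED_GROUPS	(1ULL << 2)
#define BTRFS_FEATURE_INCOMPAT_COMPRESS_LZO	(1ULL << 3)

#define BTRFS_FEATURE_INCOMPAT_SUPP			\
	(BTRFS_FEATURE_INCOMPAT_MIXED_BACKREF |		\
	 BTRFS_FEATURE_INCOMPAT_DEFAULT_SUBVOL |	\
	 BTRFS_FEATURE_INCOMPAT_MIXED_GROUPS |		\
	 BTRFS_FEATURE_INCOMPAT_COMPRESS_LZO)

/* Hypothetical helper: the incompat bits that fall outside the supported set. */
static inline uint64_t example_unsupported_incompat(uint64_t incompat_flags)
{
	return incompat_flags & ~BTRFS_FEATURE_INCOMPAT_SUPP;
}

/* Hypothetical mount-time check: 0 if mountable, -1 if an unknown bit is set. */
static inline int example_check_incompat(uint64_t incompat_flags)
{
	return example_unsupported_incompat(incompat_flags) ? -1 : 0;
}
```

Masking with the complement of the supported set means a newly defined bit is rejected automatically until it is added to BTRFS_FEATURE_INCOMPAT_SUPP.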

/*
 * A leaf is full of items. offset and size tell us where to find
 * the item in the leaf (relative to the start of the data area)
 */
struct btrfs_item {
	struct btrfs_disk_key key;
	__le32 offset;
	__le32 size;
} __attribute__ ((__packed__));

/*
 * leaves have an item area and a data area:
 * [item0, item1....itemN] [free space] [dataN...data1, data0]
 *
 * The data is separate from the items to get the keys closer together
 * during searches.
 */
struct btrfs_leaf {
	struct btrfs_header header;
	struct btrfs_item items[];
} __attribute__ ((__packed__));
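
The layout in the comment above can be sketched as offset arithmetic: item headers grow forward from the block header, while item data grows backward from the end of the block. The code below is an illustration only. The `ex_` names are hypothetical, the 101-byte header and 25-byte item sizes are assumptions, and it assumes offsets count from the first byte after the header and that data0 ends exactly at the end of the block.

```c
#include <stdint.h>

#define EX_HEADER_SIZE	101u	/* assumed sizeof(struct btrfs_header) */
#define EX_ITEM_SIZE	25u	/* assumed sizeof(struct btrfs_item) */

/* Only the fields needed for the layout arithmetic. */
struct ex_item {
	uint32_t offset;	/* relative to the first byte after the header */
	uint32_t size;
};

/* Byte position of an item's data, counted from the start of the block. */
static inline uint32_t ex_item_data_pos(const struct ex_item *it)
{
	return EX_HEADER_SIZE + it->offset;
}

/* Lowest data offset in use; an empty leaf reports the end of the data area. */
static inline uint32_t ex_data_end(uint32_t leafsize,
				   const struct ex_item *items,
				   uint32_t nritems)
{
	return nritems ? items[nritems - 1].offset : leafsize - EX_HEADER_SIZE;
}

/* Offset for the data of a new last item of the given size. */
static inline uint32_t ex_next_data_offset(uint32_t leafsize,
					   const struct ex_item *items,
					   uint32_t nritems, uint32_t size)
{
	return ex_data_end(leafsize, items, nritems) - size;
}

/* Free bytes between the end of the item array and the lowest data. */
static inline uint32_t ex_leaf_free_space(uint32_t leafsize,
					  const struct ex_item *items,
					  uint32_t nritems)
{
	return ex_data_end(leafsize, items, nritems) - nritems * EX_ITEM_SIZE;
}
```

Keeping the fixed-size item headers contiguous at the front is what lets a search step through keys without touching the variable-size data at the back.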

/*
 * all non-leaf blocks are nodes, they hold only keys and pointers to
 * other blocks
 */
struct btrfs_key_ptr {
	struct btrfs_disk_key key;
	__le64 blockptr;
	__le64 generation;
} __attribute__ ((__packed__));

struct btrfs_node {
	struct btrfs_header header;
	struct btrfs_key_ptr ptrs[];
} __attribute__ ((__packed__));

/*
 * btrfs_paths remember the path taken from the root down to the leaf.
 * level 0 is always the leaf, and nodes[1...BTRFS_MAX_LEVEL - 1] will point
 * to any other levels that are present.
 *
 * The slots array records the index of the item or block pointer
 * used while walking the tree.
 */
struct btrfs_path {
	struct extent_buffer *nodes[BTRFS_MAX_LEVEL];
	int slots[BTRFS_MAX_LEVEL];
	/* if there is real range locking, this locks field will change */
	int locks[BTRFS_MAX_LEVEL];
	int reada;
	/* keep some upper locks as we walk down */
	int lowest_level;

	/*
	 * set by btrfs_split_item, tells search_slot to keep all locks
	 * and to force calls to keep space in the nodes
	 */
	unsigned int search_for_split:1;
	unsigned int keep_locks:1;
	unsigned int skip_locking:1;
	unsigned int leave_spinning:1;
	unsigned int search_commit_root:1;
};

/*
 * items in the extent btree are used to record the objectid of the
 * owner of the block and the number of references
 */

struct btrfs_extent_item {
	__le64 refs;
	__le64 generation;
	__le64 flags;
} __attribute__ ((__packed__));

struct btrfs_extent_item_v0 {
	__le32 refs;
} __attribute__ ((__packed__));

#define BTRFS_MAX_EXTENT_ITEM_SIZE(r) ((BTRFS_LEAF_DATA_SIZE(r) >> 4) - \
					sizeof(struct btrfs_item))

#define BTRFS_EXTENT_FLAG_DATA		(1ULL << 0)
#define BTRFS_EXTENT_FLAG_TREE_BLOCK	(1ULL << 1)

/* following flags only apply to tree blocks */

/* use full backrefs for extent pointers in the block */
#define BTRFS_BLOCK_FLAG_FULL_BACKREF	(1ULL << 8)

/*
 * this flag is only used internally by scrub and may be changed at any time
 * it is only declared here to avoid collisions
 */
#define BTRFS_EXTENT_FLAG_SUPER		(1ULL << 48)
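
The extent flags above split into a type bit (data or tree block) and block-only modifiers. The sketch below shows one way a consistency check on a decoded `flags` value might look. The function is hypothetical, and the rule that exactly one of the two type bits is set is an assumption made for illustration rather than something this header states.

```c
#include <stdbool.h>
#include <stdint.h>

#define BTRFS_EXTENT_FLAG_DATA		(1ULL << 0)
#define BTRFS_EXTENT_FLAG_TREE_BLOCK	(1ULL << 1)
#define BTRFS_BLOCK_FLAG_FULL_BACKREF	(1ULL << 8)

/* Hypothetical check: block-only flags must accompany TREE_BLOCK. */
static inline bool example_extent_flags_consistent(uint64_t flags)
{
	bool is_data = (flags & BTRFS_EXTENT_FLAG_DATA) != 0;
	bool is_tree = (flags & BTRFS_EXTENT_FLAG_TREE_BLOCK) != 0;

	/* Assumed rule: an extent is either data or a tree block, not both. */
	if (is_data == is_tree)
		return false;
	/* FULL_BACKREF only applies to tree blocks. */
	if ((flags & BTRFS_BLOCK_FLAG_FULL_BACKREF) && !is_tree)
		return false;
	return true;
}
```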
|
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits the old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update the back refs for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about the pointer's key, level and in which
tree the pointer lives. This information allows us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds a per-subvolume red-black tree to keep track of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers fall within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduces the overhead of checking back refs.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 10:45:14 -04:00
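The choice between the two back reference styles described in the commit message above can be pictured as a flag decision. This is a minimal sketch, not the kernel's implementation: the helper names `sketch_tree_block_flags` and `sketch_uses_full_backref` and the `nr_roots` parameter are hypothetical, and only the three flag values are taken from this header.

```c
#include <assert.h>
#include <stdint.h>

/* Flag values copied from this header. */
#define BTRFS_EXTENT_FLAG_DATA        (1ULL << 0)
#define BTRFS_EXTENT_FLAG_TREE_BLOCK  (1ULL << 1)
#define BTRFS_BLOCK_FLAG_FULL_BACKREF (1ULL << 8)

/*
 * Hypothetical helper: build the flags for a tree block given how many
 * roots reference it. A block referenced by a single root keeps the
 * cheaper keyed ("fuzzy") back refs; a block shared by several roots
 * is marked as using full back refs.
 */
static uint64_t sketch_tree_block_flags(unsigned int nr_roots)
{
	uint64_t flags = BTRFS_EXTENT_FLAG_TREE_BLOCK;

	if (nr_roots > 1)
		flags |= BTRFS_BLOCK_FLAG_FULL_BACKREF;
	return flags;
}

/*
 * Hypothetical helper: the full-backref bit is only meaningful on tree
 * blocks, so both bits must be set.
 */
static int sketch_uses_full_backref(uint64_t flags)
{
	return (flags & BTRFS_EXTENT_FLAG_TREE_BLOCK) &&
	       (flags & BTRFS_BLOCK_FLAG_FULL_BACKREF);
}
```

Under these assumptions, `sketch_tree_block_flags(1)` yields only the tree block bit, while any larger root count also sets the full back reference bit.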
|
|
|
struct btrfs_tree_block_info {
|
|
|
|
struct btrfs_disk_key key;
|
|
|
|
u8 level;
|
|
|
|
} __attribute__ ((__packed__));
|
|
|
|
|
|
|
|
struct btrfs_extent_data_ref {
|
|
|
|
__le64 root;
|
|
|
|
__le64 objectid;
|
|
|
|
__le64 offset;
|
|
|
|
__le32 count;
|
|
|
|
} __attribute__ ((__packed__));
|
|
|
|
|
|
|
|
struct btrfs_shared_data_ref {
|
|
|
|
__le32 count;
|
|
|
|
} __attribute__ ((__packed__));
|
|
|
|
|
|
|
|
struct btrfs_extent_inline_ref {
|
|
|
|
u8 type;
|
2009-07-22 09:59:00 -04:00
|
|
|
__le64 offset;
|
2009-06-10 10:45:14 -04:00
|
|
|
} __attribute__ ((__packed__));
|
|
|
|
|
|
|
|
/* old style backrefs item */
|
|
|
|
struct btrfs_extent_ref_v0 {
|
2007-12-11 09:25:06 -05:00
|
|
|
__le64 root;
|
|
|
|
__le64 generation;
|
|
|
|
__le64 objectid;
|
2009-06-10 10:45:14 -04:00
|
|
|
__le32 count;
|
2007-03-15 12:56:47 -04:00
|
|
|
} __attribute__ ((__packed__));
|
|
|
|
|
2009-06-10 10:45:14 -04:00
|
|
|
|
2008-03-24 15:01:56 -04:00
|
|
|
/* dev extents record free space on individual devices. The owner
|
|
|
|
* field points back to the chunk allocation mapping tree that allocated
|
2008-04-15 15:41:47 -04:00
|
|
|
* the extent. The chunk tree uuid field is a way to double check the owner
|
2008-03-24 15:01:56 -04:00
|
|
|
*/
|
|
|
|
struct btrfs_dev_extent {
|
2008-04-15 15:41:47 -04:00
|
|
|
__le64 chunk_tree;
|
|
|
|
__le64 chunk_objectid;
|
|
|
|
__le64 chunk_offset;
|
2008-03-24 15:01:56 -04:00
|
|
|
__le64 length;
|
2008-04-15 15:41:47 -04:00
|
|
|
u8 chunk_tree_uuid[BTRFS_UUID_SIZE];
|
2008-03-24 15:01:56 -04:00
|
|
|
} __attribute__ ((__packed__));
|
|
|
|
|
2007-12-12 14:38:19 -05:00
|
|
|
struct btrfs_inode_ref {
|
2008-07-24 12:12:38 -04:00
|
|
|
__le64 index;
|
2007-12-12 14:38:19 -05:00
|
|
|
__le16 name_len;
|
|
|
|
/* name goes here */
|
|
|
|
} __attribute__ ((__packed__));
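The `/* name goes here */` comment in `struct btrfs_inode_ref` means the name bytes are stored directly after the fixed header. The sketch below illustrates that layout with a stand-in struct. `sketch_inode_ref`, `sketch_inode_ref_size` and `sketch_inode_ref_name` are hypothetical names, and host integer types replace `__le64`/`__le16`, so byte order conversion is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for struct btrfs_inode_ref (host-endian types, assumption). */
struct sketch_inode_ref {
	uint64_t index;
	uint16_t name_len;
	/* name bytes follow the header */
} __attribute__ ((__packed__));

/* Total bytes occupied by one ref: packed header plus the name. */
static size_t sketch_inode_ref_size(uint16_t name_len)
{
	return sizeof(struct sketch_inode_ref) + name_len;
}

/* The name starts at the first byte after the packed header. */
static const char *sketch_inode_ref_name(const struct sketch_inode_ref *ref)
{
	return (const char *)(ref + 1);
}
```

Because the struct is packed, the header is 10 bytes, so a five-byte name gives a 15-byte record in this sketch.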
|
|
|
|
|
2008-03-24 15:01:56 -04:00
|
|
|
struct btrfs_timespec {
|
2007-03-29 15:15:27 -04:00
|
|
|
__le64 sec;
|
2007-03-15 19:03:33 -04:00
|
|
|
__le32 nsec;
|
|
|
|
} __attribute__ ((__packed__));
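As a rough illustration of how the packed, little-endian `struct btrfs_timespec` might be read from a raw buffer, the sketch below decodes the 12 on-disk bytes without relying on host byte order. All `sketch_*` names are hypothetical and are not kernel helpers; the kernel uses its own accessors.

```c
#include <assert.h>
#include <stdint.h>

/* Packed stand-in for the on-disk layout: 8 bytes sec + 4 bytes nsec. */
struct sketch_disk_timespec {
	uint64_t sec;
	uint32_t nsec;
} __attribute__ ((__packed__));

/* Host-side result of decoding. */
struct sketch_timespec {
	uint64_t sec;
	uint32_t nsec;
};

/* Assemble a little-endian 64-bit value byte by byte. */
static uint64_t sketch_get_le64(const unsigned char *p)
{
	uint64_t v = 0;

	for (int i = 7; i >= 0; i--)
		v = (v << 8) | p[i];
	return v;
}

/* Assemble a little-endian 32-bit value byte by byte. */
static uint32_t sketch_get_le32(const unsigned char *p)
{
	uint32_t v = 0;

	for (int i = 3; i >= 0; i--)
		v = (v << 8) | p[i];
	return v;
}

/* Decode 12 raw bytes: sec at offset 0, nsec at offset 8. */
static struct sketch_timespec sketch_decode_timespec(const unsigned char *raw)
{
	struct sketch_timespec ts;

	ts.sec = sketch_get_le64(raw);
	ts.nsec = sketch_get_le32(raw + 8);
	return ts;
}
```

Decoding byte by byte sidesteps both alignment and endianness concerns, at the cost of a few extra shifts.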
|
|
|
|
|
2009-01-21 10:49:16 -05:00
|
|
|
enum btrfs_compression_type {
|
2010-12-17 14:21:50 +08:00
|
|
|
BTRFS_COMPRESS_NONE = 0,
|
|
|
|
BTRFS_COMPRESS_ZLIB = 1,
|
2010-10-25 15:12:26 +08:00
|
|
|
BTRFS_COMPRESS_LZO = 2,
|
|
|
|
BTRFS_COMPRESS_TYPES = 2,
|
|
|
|
BTRFS_COMPRESS_LAST = 3,
|
2009-01-21 10:49:16 -05:00
|
|
|
};
|
Btrfs: Add zlib compression support
This is a large change for adding compression on reading and writing,
both for inline and regular extents. It does some fairly large
surgery to the writeback paths.
Compression is off by default and enabled by mount -o compress. Even
when the -o compress mount option is not used, it is possible to read
compressed extents off the disk.
If compression for a given set of pages fails to make them smaller, the
file is flagged to avoid future compression attempts.
* While finding delalloc extents, the pages are locked before being sent down
to the delalloc handler. This allows the delalloc handler to do complex things
such as cleaning the pages, marking them writeback and starting IO on their
behalf.
* Inline extents are inserted at delalloc time now. This allows us to compress
the data before inserting the inline extent, and it allows us to insert
an inline extent that spans multiple pages.
* All of the in-memory extent representations (extent_map.c, ordered-data.c etc)
are changed to record both an in-memory size and an on disk size, as well
as a flag for compression.
From a disk format point of view, the extent pointers in the file are changed
to record the on disk size of a given extent and some encoding flags.
Space in the disk format is allocated for compression encoding, as well
as encryption and a generic 'other' field. Neither the encryption nor the
'other' field is currently used.
In order to limit the amount of data read for a single random read in the
file, the size of a compressed extent is limited to 128k. This is a
software only limit; the disk format supports u64 sized compressed extents.
In order to limit the ram consumed while processing extents, the uncompressed
size of a compressed extent is limited to 256k. This is a software only limit
and will be subject to tuning later.
Checksumming is still done on compressed extents, and it is done on the
uncompressed version of the data. This way additional encodings can be
layered on without having to figure out which encoding to checksum.
Compression happens at delalloc time, which is basically single threaded because
it is usually done by a single pdflush thread. This makes it tricky to
spread the compression load across all the cpus on the box. We'll have to
look at parallel pdflush walks of dirty inodes at a later time.
Decompression is hooked into readpages and it does spread across CPUs nicely.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-10-29 14:49:59 -04:00
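The two software limits named in the commit message above (128k compressed, 256k uncompressed) and the rule that compression must actually shrink the data can be sketched as follows. The commit message does not say how the kernel enforces them, so the constant and function names here are hypothetical and the logic is an assumption for illustration only.

```c
#include <assert.h>
#include <stdint.h>

/* Limits quoted in the commit message; the names are hypothetical. */
#define SKETCH_MAX_COMPRESSED   (128ULL * 1024)
#define SKETCH_MAX_UNCOMPRESSED (256ULL * 1024)

/*
 * How many uncompressed bytes to feed into the next compressed extent:
 * never more than the 256k uncompressed limit.
 */
static uint64_t sketch_next_chunk_len(uint64_t remaining)
{
	return remaining < SKETCH_MAX_UNCOMPRESSED ?
	       remaining : SKETCH_MAX_UNCOMPRESSED;
}

/*
 * Keep the compressed result only if it is smaller than the input and
 * fits within the 128k compressed limit; otherwise the caller would
 * fall back to writing the data uncompressed.
 */
static int sketch_keep_compressed(uint64_t in_len, uint64_t out_len)
{
	return out_len < in_len && out_len <= SKETCH_MAX_COMPRESSED;
}
```

In this sketch a 1 MiB dirty range would be split into four 256k chunks, each of which is kept compressed only when the output is both smaller than the input and at most 128k.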
|
|
|
|
2007-03-15 19:03:33 -04:00
|
|
|
struct btrfs_inode_item {
|
2008-09-05 16:13:11 -04:00
|
|
|
/* nfs style generation number */
|
2007-03-15 19:03:33 -04:00
|
|
|
__le64 generation;
|
2008-09-05 16:13:11 -04:00
|
|
|
/* transid that last touched this inode */
|
|
|
|
__le64 transid;
|
2007-03-15 19:03:33 -04:00
|
|
|
__le64 size;
|
2008-10-09 11:46:29 -04:00
|
|
|
__le64 nbytes;
|
2007-04-30 15:25:45 -04:00
|
|
|
__le64 block_group;
|
2007-03-15 19:03:33 -04:00
|
|
|
__le32 nlink;
|
|
|
|
__le32 uid;
|
|
|
|
__le32 gid;
|
|
|
|
__le32 mode;
|
2008-03-24 15:01:56 -04:00
|
|
|
__le64 rdev;
|
2008-12-02 06:36:08 -05:00
|
|
|
__le64 flags;
|
2008-10-29 14:49:59 -04:00
|
|
|
|
2008-12-08 16:40:21 -05:00
|
|
|
/* modification sequence number for NFS */
|
|
|
|
__le64 sequence;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* a little future expansion, for more than this we can
|
|
|
|
* just grow the inode item and version it
|
|
|
|
*/
|
|
|
|
__le64 reserved[4];
|
2008-03-24 15:01:56 -04:00
|
|
|
struct btrfs_timespec atime;
|
|
|
|
struct btrfs_timespec ctime;
|
|
|
|
struct btrfs_timespec mtime;
|
|
|
|
struct btrfs_timespec otime;
|
2007-03-15 19:03:33 -04:00
|
|
|
} __attribute__ ((__packed__));
|
|
|
|
|
2008-09-05 16:13:11 -04:00
|
|
|
struct btrfs_dir_log_item {
|
|
|
|
__le64 end;
|
|
|
|
} __attribute__ ((__packed__));
|
|
|
|
|
2007-03-15 12:56:47 -04:00
|
|
|
struct btrfs_dir_item {
|
2007-04-06 15:37:36 -04:00
|
|
|
struct btrfs_disk_key location;
|
2008-09-05 16:13:11 -04:00
|
|
|
__le64 transid;
|
2007-11-16 11:45:54 -05:00
|
|
|
__le16 data_len;
|
2007-03-16 08:46:49 -04:00
|
|
|
__le16 name_len;
|
2007-03-15 12:56:47 -04:00
|
|
|
u8 type;
|
|
|
|
} __attribute__ ((__packed__));
|
|
|
|
|
2010-12-20 16:04:08 +08:00
|
|
|
#define BTRFS_ROOT_SUBVOL_RDONLY (1ULL << 0)
|
|
|
|
|
2007-03-15 12:56:47 -04:00
|
|
|
struct btrfs_root_item {
|
2007-04-06 15:37:36 -04:00
|
|
|
struct btrfs_inode_item inode;
|
2008-10-29 14:49:05 -04:00
|
|
|
__le64 generation;
|
2007-04-06 15:37:36 -04:00
|
|
|
__le64 root_dirid;
|
2007-10-15 16:15:53 -04:00
|
|
|
__le64 bytenr;
|
|
|
|
__le64 byte_limit;
|
|
|
|
__le64 bytes_used;
|
2008-10-30 14:20:02 -04:00
|
|
|
__le64 last_snapshot;
|
2008-12-02 06:36:08 -05:00
|
|
|
__le64 flags;
|
2007-03-15 12:56:47 -04:00
|
|
|
__le32 refs;
|
2007-06-22 14:16:25 -04:00
|
|
|
struct btrfs_disk_key drop_progress;
|
|
|
|
u8 drop_level;
|
2007-10-15 16:15:53 -04:00
|
|
|
u8 level;
|
2007-03-20 14:38:32 -04:00
|
|
|
} __attribute__ ((__packed__));
|
2007-03-15 12:56:47 -04:00
|
|
|
|
2008-11-17 20:37:39 -05:00
|
|
|
/*
|
|
|
|
* this is used for both forward and backward root refs
|
|
|
|
*/
|
|
|
|
struct btrfs_root_ref {
|
|
|
|
__le64 dirid;
|
|
|
|
__le64 sequence;
|
|
|
|
__le16 name_len;
|
|
|
|
} __attribute__ ((__packed__));
|
|
|
|
|
2008-10-30 14:25:28 -04:00
|
|
|
#define BTRFS_FILE_EXTENT_INLINE 0
|
|
|
|
#define BTRFS_FILE_EXTENT_REG 1
|
|
|
|
#define BTRFS_FILE_EXTENT_PREALLOC 2
|
2007-04-19 13:37:44 -04:00
|
|
|
|
2007-03-20 14:38:32 -04:00
|
|
|
struct btrfs_file_extent_item {
|
2008-10-29 14:49:59 -04:00
|
|
|
/*
|
|
|
|
* transaction id that created this extent
|
|
|
|
*/
|
2007-03-27 09:16:29 -04:00
|
|
|
__le64 generation;
|
2008-10-29 14:49:59 -04:00
|
|
|
/*
|
|
|
|
* max number of bytes to hold this extent in ram
|
|
|
|
* when we split a compressed extent we can't know how big
|
|
|
|
* each of the resulting pieces will be. So, this is
|
|
|
|
* an upper limit on the size of the extent in ram instead of
|
|
|
|
* an exact limit.
|
|
|
|
*/
|
|
|
|
__le64 ram_bytes;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* 32 bits for the various ways we might encode the data,
|
|
|
|
* including compression and encryption. If any of these
|
|
|
|
* are set to something a given disk format doesn't understand
|
|
|
|
* it is treated like an incompat flag for reading and writing,
|
|
|
|
* but not for stat.
|
|
|
|
*/
|
|
|
|
u8 compression;
|
|
|
|
u8 encryption;
|
|
|
|
__le16 other_encoding; /* spare for later use */
|
|
|
|
|
|
|
|
/* are we inline data or a real extent? */
|
2007-04-19 13:37:44 -04:00
|
|
|
u8 type;
|
2008-10-29 14:49:59 -04:00
|
|
|
|
2007-03-20 14:38:32 -04:00
|
|
|
/*
|
|
|
|
* disk space consumed by the extent, checksum blocks are included
|
|
|
|
* in these numbers
|
|
|
|
*/
|
2007-10-15 16:15:53 -04:00
|
|
|
__le64 disk_bytenr;
|
|
|
|
__le64 disk_num_bytes;
|
2007-03-20 14:38:32 -04:00
|
|
|
/*
|
2007-03-26 16:00:06 -04:00
|
|
|
* the logical offset in file blocks (no csums)
|
2007-03-20 14:38:32 -04:00
|
|
|
* this extent record is for. This allows a file extent to point
|
|
|
|
* into the middle of an existing extent on disk, sharing it
|
|
|
|
* between two snapshots (useful if some bytes in the middle of the
|
|
|
|
* extent have changed)
|
|
|
|
*/
|
|
|
|
__le64 offset;
|
|
|
|
/*
|
Btrfs: Add zlib compression support
This is a large change for adding compression on reading and writing,
both for inline and regular extents. It does some fairly large
surgery to the writeback paths.
Compression is off by default and enabled by mount -o compress. Even
when the -o compress mount option is not used, it is possible to read
compressed extents off the disk.
If compression for a given set of pages fails to make them smaller, the
file is flagged to avoid future compression attempts later.
* While finding delalloc extents, the pages are locked before being sent down
to the delalloc handler. This allows the delalloc handler to do complex things
such as cleaning the pages, marking them writeback and starting IO on their
behalf.
* Inline extents are inserted at delalloc time now. This allows us to compress
the data before inserting the inline extent, and it allows us to insert
an inline extent that spans multiple pages.
* All of the in-memory extent representations (extent_map.c, ordered-data.c etc)
are changed to record both an in-memory size and an on disk size, as well
as a flag for compression.
From a disk format point of view, the extent pointers in the file are changed
to record the on disk size of a given extent and some encoding flags.
Space in the disk format is allocated for compression encoding, as well
as encryption and a generic 'other' field. Neither the encryption nor the
'other' field is currently used.
In order to limit the amount of data read for a single random read in the
file, the size of a compressed extent is limited to 128k. This is a
software-only limit; the disk format supports u64-sized compressed extents.
In order to limit the RAM consumed while processing extents, the uncompressed
size of a compressed extent is limited to 256k. This is a software-only limit
and will be subject to tuning later.
Checksumming is still done on compressed extents, and it is done on the
uncompressed version of the data. This way additional encodings can be
layered on without having to figure out which encoding to checksum.
Compression happens at delalloc time, which is basically single threaded because
it is usually done by a single pdflush thread. This makes it tricky to
spread the compression load across all the CPUs on the box. We'll have to
look at parallel pdflush walks of dirty inodes at a later time.
Decompression is hooked into readpages and it does spread across CPUs nicely.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-10-29 14:49:59 -04:00
|
|
|
* the logical number of file blocks (no csums included). This
|
|
|
|
* always reflects the size uncompressed and without encoding.
|
2007-03-20 14:38:32 -04:00
|
|
|
*/
|
2007-10-15 16:15:53 -04:00
|
|
|
__le64 num_bytes;
|
2008-10-29 14:49:59 -04:00
|
|
|
|
2007-03-20 14:38:32 -04:00
|
|
|
} __attribute__ ((__packed__));
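The comments above describe how a file extent record can point into the middle of an extent on disk. The following is a minimal userspace sketch of that arithmetic, not the kernel's code. It assumes an uncompressed regular extent, uses host-endian `uint64_t` stand-ins for the `__le64` fields, and takes the file offset at which the record begins as a separate `file_start` parameter. The `sketch_` names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Host-endian stand-in for the fields shown above. The real on-disk item
 * stores little-endian values and carries further fields not modelled here.
 */
struct sketch_file_extent {
	uint64_t disk_bytenr;    /* start of the extent on disk */
	uint64_t disk_num_bytes; /* bytes the extent occupies on disk */
	uint64_t offset;         /* where this record starts inside the extent */
	uint64_t num_bytes;      /* logical bytes this record covers */
};

/*
 * Translate a file position into a disk byte address, assuming the extent
 * is uncompressed. file_start is the file offset at which this record
 * begins (an assumption of this sketch; it is passed in by the caller).
 * Returns false when pos lies outside the record.
 */
static bool sketch_file_pos_to_disk(const struct sketch_file_extent *fe,
				    uint64_t file_start, uint64_t pos,
				    uint64_t *disk_pos)
{
	if (pos < file_start || pos - file_start >= fe->num_bytes)
		return false;
	*disk_pos = fe->disk_bytenr + fe->offset + (pos - file_start);
	return true;
}
```

Because `offset` is added to `disk_bytenr`, two snapshots can share one on-disk extent while each record covers only the part of it that the snapshot still references.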
|
|
|
|
|
2007-03-29 15:15:27 -04:00
|
|
|
struct btrfs_csum_item {
|
2007-05-10 12:36:17 -04:00
|
|
|
u8 csum;
|
2007-03-29 15:15:27 -04:00
|
|
|
} __attribute__ ((__packed__));
|
|
|
|
|
2008-03-24 15:01:56 -04:00
|
|
|
/* different types of block groups (and chunks) */
|
|
|
|
#define BTRFS_BLOCK_GROUP_DATA (1 << 0)
|
|
|
|
#define BTRFS_BLOCK_GROUP_SYSTEM (1 << 1)
|
|
|
|
#define BTRFS_BLOCK_GROUP_METADATA (1 << 2)
|
2008-03-25 16:50:33 -04:00
|
|
|
#define BTRFS_BLOCK_GROUP_RAID0 (1 << 3)
|
2008-04-03 16:29:03 -04:00
|
|
|
#define BTRFS_BLOCK_GROUP_RAID1 (1 << 4)
|
2008-04-03 16:29:03 -04:00
|
|
|
#define BTRFS_BLOCK_GROUP_DUP (1 << 5)
|
2008-04-16 10:49:51 -04:00
|
|
|
#define BTRFS_BLOCK_GROUP_RAID10 (1 << 6)
|
2010-05-16 10:46:24 -04:00
|
|
|
#define BTRFS_NR_RAID_TYPES 5
|
2007-05-29 16:52:18 -04:00
|
|
|
|
2007-04-26 16:46:15 -04:00
|
|
|
struct btrfs_block_group_item {
|
|
|
|
__le64 used;
|
2008-03-24 15:01:56 -04:00
|
|
|
__le64 chunk_objectid;
|
|
|
|
__le64 flags;
|
2007-04-26 16:46:15 -04:00
|
|
|
} __attribute__ ((__packed__));
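The `flags` field combines one of the type bits (data, system, metadata) with at most one profile bit from the defines above. A small sketch of decoding such a value follows. The `SKETCH_` macros copy the bit positions from the defines but are widened to 64 bits, and the particular index order returned by `sketch_raid_index()` is an assumption chosen for illustration, sized to fit `BTRFS_NR_RAID_TYPES` (5).

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions copied from the defines above, widened to 64 bits. */
#define SKETCH_BG_DATA     (1ULL << 0)
#define SKETCH_BG_SYSTEM   (1ULL << 1)
#define SKETCH_BG_METADATA (1ULL << 2)
#define SKETCH_BG_RAID0    (1ULL << 3)
#define SKETCH_BG_RAID1    (1ULL << 4)
#define SKETCH_BG_DUP      (1ULL << 5)
#define SKETCH_BG_RAID10   (1ULL << 6)

#define SKETCH_NR_RAID_TYPES 5

/* True when the block group holds metadata (possibly mixed with data). */
static bool sketch_is_metadata(uint64_t flags)
{
	return (flags & SKETCH_BG_METADATA) != 0;
}

/*
 * Map the profile bits to a slot in a per-profile array of
 * SKETCH_NR_RAID_TYPES lists. The ordering is hypothetical; a value
 * with no profile bit set (a single copy) takes the last slot.
 */
static int sketch_raid_index(uint64_t flags)
{
	if (flags & SKETCH_BG_RAID10)
		return 0;
	if (flags & SKETCH_BG_RAID1)
		return 1;
	if (flags & SKETCH_BG_DUP)
		return 2;
	if (flags & SKETCH_BG_RAID0)
		return 3;
	return 4;
}
```

Four profile bits plus the "no profile" case give five slots, which is consistent with `BTRFS_NR_RAID_TYPES` being 5.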
|
|
|
|
|
2008-03-24 15:01:59 -04:00
|
|
|
struct btrfs_space_info {
|
|
|
|
u64 flags;
|
2009-02-20 11:00:09 -05:00
|
|
|
|
2010-10-14 14:52:27 -04:00
|
|
|
u64 total_bytes; /* total bytes in the space,
|
|
|
|
this doesn't take mirrors into account */
|
2010-05-16 10:46:24 -04:00
|
|
|
u64 bytes_used; /* total bytes used,
|
2011-04-26 23:28:26 -07:00
|
|
|
this doesn't take mirrors into account */
|
2009-02-20 11:00:09 -05:00
|
|
|
u64 bytes_pinned; /* total bytes pinned, will be freed when the
|
|
|
|
transaction finishes */
|
|
|
|
u64 bytes_reserved; /* total bytes the allocator has reserved for
|
|
|
|
current allocations */
|
|
|
|
u64 bytes_readonly; /* total bytes that are read only */
|
2010-05-16 10:49:58 -04:00
|
|
|
|
2009-02-20 11:00:09 -05:00
|
|
|
u64 bytes_may_use; /* number of bytes that may be used for
|
2009-09-11 16:12:44 -04:00
|
|
|
delalloc/allocations */
|
2010-05-16 10:46:24 -04:00
|
|
|
u64 disk_used; /* total bytes used on disk */
|
2010-10-14 14:52:27 -04:00
|
|
|
u64 disk_total; /* total bytes on disk, takes mirrors into
|
|
|
|
account */
|
2009-02-20 11:00:09 -05:00
|
|
|
|
2011-03-12 07:08:42 -05:00
|
|
|
/*
|
|
|
|
* we bump reservation progress every time we decrement
|
|
|
|
* bytes_reserved. This way people waiting for reservations
|
|
|
|
* know something good has happened and they can check
|
|
|
|
* for progress. The number here isn't to be trusted, it
|
|
|
|
* just shows reclaim activity
|
|
|
|
*/
|
|
|
|
unsigned long reservation_progress;
|
|
|
|
|
2011-05-12 18:13:12 +02:00
|
|
|
unsigned int full:1; /* indicates that we cannot allocate any more
|
2009-02-20 11:00:09 -05:00
|
|
|
chunks for this space */
|
2011-05-12 18:13:12 +02:00
|
|
|
unsigned int chunk_alloc:1; /* set if we are allocating a chunk */
|
2011-04-11 20:20:11 -04:00
|
|
|
|
2011-06-07 16:07:44 -04:00
|
|
|
unsigned int flush:1; /* set if we are trying to make space */
|
|
|
|
|
2011-05-12 18:13:12 +02:00
|
|
|
unsigned int force_alloc; /* set if we need to force a chunk
|
|
|
|
alloc for this space */
|
2009-02-20 11:00:09 -05:00
|
|
|
|
2008-03-24 15:01:59 -04:00
|
|
|
struct list_head list;
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) removed the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree that tracks the block group cache via offset. Also
added a per space_info list that tracks the block group cache for the particular
space so we can look up related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so every time we did a data allocation we'd search
through every block group that we have looking for free space. While searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degradation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-23 13:14:11 -04:00
|
|
|
|
|
|
|
/* for block groups in our same type */
|
2010-05-16 10:46:24 -04:00
|
|
|
struct list_head block_groups[BTRFS_NR_RAID_TYPES];
|
2008-09-23 13:14:11 -04:00
|
|
|
spinlock_t lock;
|
Btrfs: fix enospc when there is plenty of space
So there is an odd case where we can possibly return -ENOSPC when there is in
fact space to be had. It only happens with Metadata writes, and happens _very_
infrequently. What has to happen is that we have allocated out of
the first logical byte on the disk, which would set last_alloc to
first_logical_byte(root, 0), so search_start == orig_search_start. We then
need to allocate for normal metadata, so BTRFS_BLOCK_GROUP_METADATA |
BTRFS_BLOCK_GROUP_DUP. We will do a block lookup for the given search_start,
block_group_bits() won't match and we'll go to choose another block group.
However because search_start matches orig_search_start we go to see if we can
allocate a chunk.
If we are in the situation that we cannot allocate a chunk, we fail and ENOSPC.
This is kind of a big flaw of the way find_free_extent works, as it along with
find_free_space loop through _all_ of the block groups, not just the ones that
we want to allocate out of. This patch completely kills find_free_space and
rolls it into find_free_extent. I've introduced a sort of state machine into
this, which will make it easier to get cache miss information out of the
allocator, and will work well with my locking changes.
The basic flow is this: We have the variable loop which is 0, meaning we are
in the hint phase. We look up the block group for the hint, and look up the
space_info for what we want to allocate out of. If the block group we were
pointed at by the hint either isn't of the correct type, or just doesn't have
the space we need, we set head to space_info->block_groups, so we start at the
beginning of the block groups for this particular space info, and loop through.
This is also where we add the empty_cluster to total_needed. At this point
loop is set to 1 and we just loop through all of the block groups for this
particular space_info looking for the space we need, just as find_free_space
would have done, except we only hit the block groups we want and not _all_ of
the block groups. If we come full circle we see if we can allocate a chunk.
If we cannot, we exit with -ENOSPC. If we can, we start
over at space_info->block_groups and loop through again, with loop == 2. If we
come full circle and haven't found what we need then we exit with -ENOSPC.
I've been running this for a couple of days now and it seems stable, and I
haven't yet hit a -ENOSPC when there was plenty of space left.
Also I've made a groups_sem to handle the group list for the space_info. This
is part of my locking changes, but is relatively safe and seems better than
holding the space_info spinlock over that entire search time. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
2008-10-29 14:49:05 -04:00
|
|
|
struct rw_semaphore groups_sem;
|
2011-06-07 16:07:44 -04:00
|
|
|
wait_queue_head_t wait;
|
2008-09-23 13:14:11 -04:00
|
|
|
};
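The allocator flow described in the "fix enospc when there is plenty of space" commit message above (hint phase, a full pass over the space_info's block groups, an attempt to allocate a chunk, then a final pass) can be sketched in userspace as follows. This is a simplified illustration, not the kernel's find_free_extent: block groups are reduced to a free-byte count, locking and clusters are omitted, and the `sketch_` names and the callback type are hypothetical.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* A block group reduced to the one number this sketch needs. */
struct sketch_group {
	uint64_t free_bytes;
};

/* Hypothetical callback: try to add a chunk, return true on success. */
typedef bool (*sketch_alloc_chunk_fn)(struct sketch_group *groups, size_t n);

/* One full pass over the block groups of a single space_info. */
static int sketch_scan(const struct sketch_group *groups, size_t n,
		       uint64_t need)
{
	for (size_t i = 0; i < n; i++)
		if (groups[i].free_bytes >= need)
			return (int)i;
	return -1;
}

/* Returns the index of a group that can hold 'need' bytes, or -ENOSPC. */
static int sketch_find_free_extent(struct sketch_group *groups, size_t n,
				   size_t hint, uint64_t need,
				   sketch_alloc_chunk_fn alloc_chunk)
{
	int found;

	/* loop == 0: hint phase, only the hinted group is tried */
	if (hint < n && groups[hint].free_bytes >= need)
		return (int)hint;

	/* loop == 1: walk every group of this space_info */
	found = sketch_scan(groups, n, need);
	if (found >= 0)
		return found;

	/* came full circle: see whether a new chunk can be allocated */
	if (!alloc_chunk || !alloc_chunk(groups, n))
		return -ENOSPC;

	/* loop == 2: one more pass, now including the new space */
	found = sketch_scan(groups, n, need);
	return found >= 0 ? found : -ENOSPC;
}

/* Example callback: pretend a 1 MiB chunk was added to the last group. */
static bool sketch_grow_last_group(struct sketch_group *groups, size_t n)
{
	if (n == 0)
		return false;
	groups[n - 1].free_bytes += 1024 * 1024;
	return true;
}
```

The point of the design is that only the block groups belonging to the wanted space_info are visited, so a failed hint costs one pass over the relevant groups rather than a walk of every block group in the filesystem.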
|
|
|
|
|
2010-05-16 10:46:25 -04:00
|
|
|
struct btrfs_block_rsv {
|
|
|
|
u64 size;
|
|
|
|
u64 reserved;
|
|
|
|
struct btrfs_space_info *space_info;
|
|
|
|
spinlock_t lock;
|
|
|
|
unsigned int full:1;
|
|
|
|
};
|
|
|
|
|
2009-04-03 09:47:43 -04:00
|
|
|
/*
|
|
|
|
* free clusters are used to claim free space in relatively large chunks,
|
|
|
|
* allowing us to do less seeky writes. They are used for all metadata
|
|
|
|
* allocations and data allocations in ssd mode.
|
|
|
|
*/
|
|
|
|
struct btrfs_free_cluster {
|
|
|
|
spinlock_t lock;
|
|
|
|
spinlock_t refill_lock;
|
|
|
|
struct rb_root root;
|
|
|
|
|
|
|
|
/* largest extent in this cluster */
|
|
|
|
u64 max_size;
|
|
|
|
|
|
|
|
/* first extent starting offset */
|
|
|
|
u64 window_start;
|
|
|
|
|
|
|
|
struct btrfs_block_group_cache *block_group;
|
|
|
|
/*
|
|
|
|
* when a cluster is allocated from a block group, we put the
|
|
|
|
* cluster onto a list in the block group so that it can
|
|
|
|
* be freed before the block group is freed.
|
|
|
|
*/
|
|
|
|
struct list_head block_group_list;
|
2008-03-24 15:01:59 -04:00
|
|
|
};
|
|
|
|
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick off the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block group's caching waitqueue; the
caching kthread wakes the waiting threads every time it finds 2 meg
worth of space, and then again when it's finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change that's been put in place is we lock the super mirrors in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until it's
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
enum btrfs_caching_type {
|
|
|
|
BTRFS_CACHE_NO = 0,
|
|
|
|
BTRFS_CACHE_STARTED = 1,
|
2011-11-14 13:52:14 -05:00
|
|
|
BTRFS_CACHE_FAST = 2,
|
|
|
|
BTRFS_CACHE_FINISHED = 3,
|
2009-07-13 21:29:25 -04:00
|
|
|
};
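The "async block group caching" commit message above says the caching thread wakes waiters every time it finds 2 meg worth of free space. A minimal sketch of that batching is shown below. It is a userspace stand-in under stated assumptions: the wakeup is modelled as a counter instead of a wait queue, the threshold comparison is inclusive, and all `sketch_` names are hypothetical.

```c
#include <stdint.h>

/* Wake waiters after this much newly found free space (2 MiB). */
#define SKETCH_CACHING_WAKE_BYTES (2ULL * 1024 * 1024)

struct sketch_caching {
	uint64_t since_wake; /* free bytes found since the last wakeup */
	unsigned wakeups;    /* stands in for waking the wait queue */
};

/* Called by the caching thread each time it records a free range. */
static void sketch_found_free_space(struct sketch_caching *c, uint64_t bytes)
{
	c->since_wake += bytes;
	if (c->since_wake >= SKETCH_CACHING_WAKE_BYTES) {
		c->since_wake = 0;
		c->wakeups++; /* a real implementation would wake waiters here */
	}
}
```

Batching the wakeups this way lets allocators retry as soon as a useful amount of space is known, without waking them for every small free range the scan discovers.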
|
|
|
|
|
2010-06-21 14:48:16 -04:00
|
|
|
enum btrfs_disk_cache_state {
|
|
|
|
BTRFS_DC_WRITTEN = 0,
|
|
|
|
BTRFS_DC_ERROR = 1,
|
|
|
|
BTRFS_DC_CLEAR = 2,
|
|
|
|
BTRFS_DC_SETUP = 3,
|
|
|
|
BTRFS_DC_NEED_WRITE = 4,
|
|
|
|
};
|
|
|
|
|
2009-09-11 16:11:19 -04:00
|
|
|
struct btrfs_caching_control {
|
|
|
|
struct list_head list;
|
|
|
|
struct mutex mutex;
|
|
|
|
wait_queue_head_t wait;
|
2011-06-30 14:42:28 -04:00
|
|
|
struct btrfs_work work;
|
2009-09-11 16:11:19 -04:00
|
|
|
struct btrfs_block_group_cache *block_group;
|
|
|
|
u64 progress;
|
|
|
|
atomic_t count;
|
|
|
|
};
|
|
|
|
|
2007-04-26 16:46:15 -04:00
|
|
|
struct btrfs_block_group_cache {
|
|
|
|
struct btrfs_key key;
|
|
|
|
struct btrfs_block_group_item item;
|
2009-07-13 21:29:25 -04:00
|
|
|
struct btrfs_fs_info *fs_info;
|
2010-06-21 14:48:16 -04:00
|
|
|
struct inode *inode;
|
2008-07-22 23:06:41 -04:00
|
|
|
spinlock_t lock;
|
2007-11-16 14:57:08 -05:00
|
|
|
u64 pinned;
|
2008-09-26 10:05:48 -04:00
|
|
|
u64 reserved;
|
2009-09-11 16:11:20 -04:00
|
|
|
u64 bytes_super;
|
2008-03-24 15:01:56 -04:00
|
|
|
u64 flags;
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unwieldy on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space that's less than 4 * sectorsize goes straight into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
u64 sectorsize;
|
2011-10-06 08:58:24 -04:00
|
|
|
u64 cache_generation;
|
2010-11-20 12:03:07 +00:00
|
|
|
unsigned int ro:1;
|
|
|
|
unsigned int dirty:1;
|
|
|
|
unsigned int iref:1;
|
2010-06-21 14:48:16 -04:00
|
|
|
|
|
|
|
int disk_cache_state;
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and gives a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
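A minimal sketch of the offset-first, size-fallback lookup described in this commit message. It assumes a plain array sorted by offset in place of the two rb-trees, and the names `free_extent` and `find_free_space` are hypothetical, not taken from the btrfs source. How the fallback breaks ties is also an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical free-space record; btrfs keeps these in rb-trees. */
struct free_extent {
	uint64_t offset;	/* start byte of the free area */
	uint64_t bytes;		/* length of the free area */
};

/*
 * `ext` must be sorted by ascending offset.
 * Returns the chosen extent, or NULL when nothing is large enough.
 */
static const struct free_extent *
find_free_space(const struct free_extent *ext, size_t n,
		uint64_t hint, uint64_t bytes)
{
	const struct free_extent *best = NULL;
	size_t i;

	assert(ext != NULL || n == 0);

	/* Pass 1: first extent at or after the hint that is large enough. */
	for (i = 0; i < n; i++)
		if (ext[i].offset >= hint && ext[i].bytes >= bytes)
			return &ext[i];

	/*
	 * Pass 2 (fallback): pick the smallest extent that still fits, so
	 * small requests do not carve pieces off huge areas. Every fitting
	 * extent lies below the hint here (pass 1 would have returned it
	 * otherwise), so "<=" on an ascending array prefers the tie that is
	 * closest to the hint.
	 */
	for (i = 0; i < n; i++) {
		if (ext[i].bytes < bytes)
			continue;
		if (!best || ext[i].bytes <= best->bytes)
			best = &ext[i];
	}
	return best;
}
```

The linear scans keep the sketch short; the point of indexing by both offset and size is to make each pass a tree lookup instead.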
2008-09-23 13:14:11 -04:00
|
|
|
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we kick off the caching kthread and then attempt to make an
allocation. If we cannot, we wait on the block group's caching waitqueue; the
caching kthread wakes the waiting threads every time it finds 2 meg worth of
space, and then again when it's finished caching. This is how I tested
the speedup from this:
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change that's been put in place is we lock the super mirrors in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until it's
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
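A single-threaded sketch of the wakeup policy described in this commit message, under stated assumptions: `count_caching_wakeups` and `fully_used` are hypothetical helpers, the real code runs in a kthread and wakes a waitqueue, and the counter reset after each wakeup is a guess at the bookkeeping.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WAKE_THRESHOLD (2ULL * 1024 * 1024)	/* "2 meg" from the message */

/*
 * Walk the free extents the caching thread discovers, issue a wakeup each
 * time another 2 MiB of free space has accumulated, and issue one more
 * when caching is finished. Returns the number of wakeups issued.
 */
static unsigned int count_caching_wakeups(const uint64_t *found, size_t n)
{
	uint64_t pending = 0;
	unsigned int wakeups = 0;
	size_t i;

	assert(found != NULL || n == 0);

	for (i = 0; i < n; i++) {
		pending += found[i];
		if (pending >= WAKE_THRESHOLD) {
			wakeups++;	/* stands in for waking the waitqueue */
			pending = 0;
		}
	}
	return wakeups + 1;	/* final wakeup when caching completes */
}

/*
 * A block group whose used bytes equal its size has no free space left to
 * discover, so it can be marked cached without scanning it.
 */
static int fully_used(uint64_t used, uint64_t size)
{
	return used == size;
}
```

Waiters that still cannot allocate after a wakeup would simply go back to sleep until the next one.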
2009-07-13 21:29:25 -04:00
|
|
|
/* cache tracking stuff */
|
|
|
|
int cached;
|
2009-09-11 16:11:19 -04:00
|
|
|
struct btrfs_caching_control *caching_ctl;
|
|
|
|
u64 last_byte_to_unpin;
|
2009-07-13 21:29:25 -04:00
|
|
|
|
2008-09-23 13:14:11 -04:00
|
|
|
struct btrfs_space_info *space_info;
|
|
|
|
|
|
|
|
/* free space cache stuff */
|
2011-03-29 13:46:06 +08:00
|
|
|
struct btrfs_free_space_ctl *free_space_ctl;
|
2008-09-23 13:14:11 -04:00
|
|
|
|
|
|
|
/* block group cache stuff */
|
|
|
|
struct rb_node cache_node;
|
|
|
|
|
|
|
|
/* for block groups in the same raid type */
|
|
|
|
struct list_head list;
|
2008-12-11 16:30:39 -05:00
|
|
|
|
|
|
|
/* usage count */
|
|
|
|
atomic_t count;
|
2009-04-03 09:47:43 -04:00
|
|
|
|
|
|
|
/* List of struct btrfs_free_clusters for this block group.
 * Today it will only have one thing on it, but that may change
 */
|
|
|
|
struct list_head cluster_list;
|
2007-04-26 16:46:15 -04:00
|
|
|
};
|
2008-03-24 15:01:56 -04:00
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in a subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update the back refs for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about the pointer's key, level and in which
tree the pointer lives. This information allows us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds a per-subvolume red-black tree to keep track of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers are within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduces the overhead of checking back refs.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
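A small sketch of the copy-on-write reference counting rule described in this commit message. `struct blk` and `cow_block` are hypothetical in-memory stand-ins; the real code works on extent items and back refs on disk.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a tree block and the extents it points to. */
struct blk {
	int refs;		/* reference count of this block */
	int *child_refs;	/* reference counts of the extents it points to */
	size_t nr_children;
	int freed;		/* set once the block has been released */
};

/*
 * Copy-on-write `old_blk` into `new_blk`:
 *  - non-shared (refs == 1): the old block is freed at once and the new
 *    block inherits its references, so child counts stay untouched.
 *  - shared (refs > 1): the new block takes its own reference on every
 *    extent it points to, and the old block loses one reference.
 */
static void cow_block(struct blk *old_blk, struct blk *new_blk)
{
	size_t i;

	assert(old_blk->refs >= 1);

	new_blk->child_refs = old_blk->child_refs;
	new_blk->nr_children = old_blk->nr_children;
	new_blk->refs = 1;
	new_blk->freed = 0;

	if (old_blk->refs == 1) {
		old_blk->refs = 0;
		old_blk->freed = 1;
		return;
	}

	for (i = 0; i < new_blk->nr_children; i++)
		new_blk->child_refs[i]++;
	old_blk->refs--;
}
```

In the non-shared case nothing below the block is touched, which is what removes the need for the later dead-root walk.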
2009-06-10 10:45:14 -04:00
|
|
|
struct reloc_control;
|
2008-03-24 15:01:56 -04:00
|
|
|
struct btrfs_device;
|
2008-03-24 15:02:07 -04:00
|
|
|
struct btrfs_fs_devices;
|
btrfs: implement delayed inode items operation
Changelog V5 -> V6:
- Fix oom when the memory load is high, by storing the delayed nodes into the
root's radix tree, and letting btrfs inodes go.
Changelog V4 -> V5:
- Fix the race on adding the delayed node to the inode, which is spotted by
Chris Mason.
- Merge Chris Mason's incremental patch into this patch.
- Fix deadlock between readdir() and memory fault, which is reported by
Itaru Kitayama.
Changelog V3 -> V4:
- Fix nested lock, which is reported by Itaru Kitayama, by updating space cache
inode in time.
Changelog V2 -> V3:
- Fix the race between the delayed worker and the task which does delayed items
balance, which is reported by Tsutomu Itoh.
- Modify the patch to address David Sterba's comments.
- Fix the bug of the cpu recursion spinlock, reported by Chris Mason
Changelog V1 -> V2:
- break up the global rb-tree, use a list to manage the delayed nodes,
which is created for every directory and file, and used to manage the
delayed directory name index items and the delayed inode item.
- introduce a worker to deal with the delayed nodes.
Compared with Ext3/4, the performance of file creation and deletion on btrfs
is very poor. The reason is that btrfs must do a lot of b+ tree insertions,
such as inode item, directory name item, directory name index and so on.
If we can delay some b+ tree insertions or deletions, we can improve the
performance, so we made this patch which implements delayed directory name
index insertion/deletion and delayed inode update.
Implementation:
- introduce a delayed root object into the filesystem, that uses two lists to
manage the delayed nodes which are created for every file/directory.
One is used to manage all the delayed nodes that have delayed items. And the
other is used to manage the delayed nodes which are waiting to be dealt with
by the work thread.
- Every delayed node has two rb-trees, one is used to manage the directory name
index which is going to be inserted into the b+ tree, and the other is used to
manage the directory name index which is going to be deleted from the b+ tree.
- introduce a worker to deal with the delayed operations. This worker is used
to deal with the work of the delayed directory name index item insertion
and deletion and the delayed inode update.
When the number of delayed items is beyond the lower limit, we create works for
some delayed nodes and insert them into the work queue of the worker, and then
go back.
When the number of delayed items is beyond the upper bound, we create works for
all the delayed nodes that haven't been dealt with, and insert them into the work
queue of the worker, and then wait until the number of untreated items is below
some threshold value.
- When we want to insert a directory name index into the b+ tree, we just add the
information into the delayed inserting rb-tree.
And then we check the number of the delayed items and do delayed items
balance. (The balance policy is above.)
- When we want to delete a directory name index from the b+ tree, we search for it
in the inserting rb-tree first. If we find it, we just drop it. If not,
we add its key into the delayed deleting rb-tree.
Similar to the delayed inserting rb-tree, we also check the number of the
delayed items and do delayed items balance.
(The same as for insertion.)
- When we want to update the metadata of some inode, we cache the data of the
inode in the delayed node. The worker will flush it into the b+ tree after
dealing with the delayed insertion and deletion.
- We will move the delayed node to the tail of the list after we access the
delayed node. This way, we can cache more delayed items and merge more
inode updates.
- If we want to commit a transaction, we will deal with all the delayed nodes.
- the delayed node will be freed when we free the btrfs inode.
- Before we log the inode items, we commit all the directory name index items
and the delayed inode update.
I did a quick test by the benchmark tool[1] and found we can improve the
performance of file creation by ~15%, and file deletion by ~20%.
Before applying this patch:
Create files:
Total files: 50000
Total time: 1.096108
Average time: 0.000022
Delete files:
Total files: 50000
Total time: 1.510403
Average time: 0.000030
After applying this patch:
Create files:
Total files: 50000
Total time: 0.932899
Average time: 0.000019
Delete files:
Total files: 50000
Total time: 1.215732
Average time: 0.000024
[1] http://marc.info/?l=linux-btrfs&m=128212635122920&q=p3
Many thanks for Kitayama-san's help!
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: David Sterba <dave@jikos.cz>
Tested-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Tested-by: Itaru Kitayama <kitayama@cl.bb4u.ne.jp>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
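A rough sketch of two rules from this commit message: the balance policy (lower limit, upper bound) and the "cancel a pending insertion before queueing a deletion" rule. The threshold values, the fixed-size arrays standing in for the two rb-trees, and all names below are assumptions made for illustration only.

```c
#include <assert.h>
#include <stddef.h>

enum balance_action {
	BALANCE_NONE,		/* at or below the lower limit: keep caching */
	BALANCE_ASYNC_SOME,	/* queue work for some nodes and go back */
	BALANCE_FLUSH_AND_WAIT,	/* queue work for all nodes, then wait */
};

/* Placeholder thresholds; the message does not state the real values. */
#define DELAYED_LOWER_LIMIT 1000u
#define DELAYED_UPPER_BOUND 4000u

static enum balance_action balance_delayed_items(unsigned int nr_items)
{
	if (nr_items > DELAYED_UPPER_BOUND)
		return BALANCE_FLUSH_AND_WAIT;
	if (nr_items > DELAYED_LOWER_LIMIT)
		return BALANCE_ASYNC_SOME;
	return BALANCE_NONE;
}

#define MAX_ITEMS 16

/* Hypothetical delayed node; arrays replace the two rb-trees. */
struct delayed_node {
	unsigned long long ins[MAX_ITEMS];	/* pending index insertions */
	size_t nr_ins;
	unsigned long long del[MAX_ITEMS];	/* pending index deletions */
	size_t nr_del;
};

/*
 * Returns 1 if a pending insertion was cancelled, 0 if a deletion was
 * queued, and -1 if the deletion array is full.
 */
static int delayed_delete_index(struct delayed_node *node,
				unsigned long long index)
{
	size_t i;

	assert(node->nr_ins <= MAX_ITEMS && node->nr_del <= MAX_ITEMS);

	for (i = 0; i < node->nr_ins; i++) {
		if (node->ins[i] == index) {
			/* Never reached the b+ tree, so just drop it. */
			node->nr_ins--;
			node->ins[i] = node->ins[node->nr_ins];
			return 1;
		}
	}
	if (node->nr_del == MAX_ITEMS)
		return -1;
	node->del[node->nr_del++] = index;
	return 0;
}
```

A caller would run `balance_delayed_items()` after every queued insertion or deletion, matching the "check the number of the delayed items" step in the list.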
2011-04-22 18:12:22 +08:00
|
|
|
struct btrfs_delayed_root;
|
2007-03-20 14:38:32 -04:00
|
|
|
struct btrfs_fs_info {
|
2007-10-15 16:14:19 -04:00
|
|
|
u8 fsid[BTRFS_FSID_SIZE];
|
2008-04-15 15:41:47 -04:00
|
|
|
u8 chunk_tree_uuid[BTRFS_UUID_SIZE];
|
2007-03-15 12:56:47 -04:00
|
|
|
struct btrfs_root *extent_root;
|
|
|
|
struct btrfs_root *tree_root;
|
2008-03-24 15:01:56 -04:00
|
|
|
struct btrfs_root *chunk_root;
|
|
|
|
struct btrfs_root *dev_root;
|
2008-11-17 21:02:50 -05:00
|
|
|
struct btrfs_root *fs_root;
|
Btrfs: move data checksumming into a dedicated tree
Btrfs stores checksums for each data block. Until now, they have
been stored in the subvolume trees, indexed by the inode that is
referencing the data block. This means that when we read the inode,
we've probably read in at least some checksums as well.
But, this has a few problems:
* The checksums are indexed by logical offset in the file. When
compression is on, this means we have to do the expensive checksumming
on the uncompressed data. It would be faster if we could checksum
the compressed data instead.
* If we implement encryption, we'll be checksumming the plain text and
storing that on disk. This is significantly less secure.
* For either compression or encryption, we have to get the plain text
back before we can verify the checksum as correct. This makes the raid
layer balancing and extent moving much more expensive.
* It makes the front end caching code more complex, as we have to touch
the subvolume and inodes as we cache extents.
* There is potentially one copy of the checksum in each subvolume
referencing an extent.
The solution used here is to store the extent checksums in a dedicated
tree. This allows us to index the checksums by physical extent
start and length. It means:
* The checksum is against the data stored on disk, after any compression
or encryption is done.
* The checksum is stored in a central location, and can be verified without
following back references, or reading inodes.
This makes compression significantly faster by reducing the amount of
data that needs to be checksummed. It will also allow much faster
raid management code in general.
The checksums are indexed by a key with a fixed objectid (a magic value
in ctree.h) and offset set to the starting byte of the extent. This
allows us to copy the checksum items into the fsync log tree directly (or
any other tree), without having to invent a second format for them.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
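A sketch of the key layout described in this commit message: a fixed objectid plus an offset equal to the starting byte of the extent. The struct, the helper names and both constant values are placeholders for illustration; the real definitions live in ctree.h.

```c
#include <assert.h>
#include <stdint.h>

/* Placeholder values, not necessarily the ones ctree.h defines. */
#define EXAMPLE_CSUM_OBJECTID	((uint64_t)-10)
#define EXAMPLE_CSUM_KEY_TYPE	128

/* Hypothetical CPU-order key: (objectid, type, offset). */
struct example_key {
	uint64_t objectid;
	uint8_t type;
	uint64_t offset;
};

/* Build the checksum key for the extent starting at `disk_bytenr`. */
static struct example_key csum_key_for_extent(uint64_t disk_bytenr)
{
	struct example_key key = {
		EXAMPLE_CSUM_OBJECTID, EXAMPLE_CSUM_KEY_TYPE, disk_bytenr
	};
	return key;
}

/* Order keys by objectid, then type, then offset. */
static int example_key_cmp(const struct example_key *a,
			   const struct example_key *b)
{
	assert(a && b);
	if (a->objectid != b->objectid)
		return a->objectid < b->objectid ? -1 : 1;
	if (a->type != b->type)
		return a->type < b->type ? -1 : 1;
	if (a->offset != b->offset)
		return a->offset < b->offset ? -1 : 1;
	return 0;
}
```

Because the objectid and type are the same for every checksum item, the keys sort purely by physical start byte, and nothing in the key depends on which tree holds the item.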
2008-12-08 16:58:54 -05:00
|
|
|
struct btrfs_root *csum_root;
|
2008-09-05 16:13:11 -04:00
|
|
|
|
|
|
|
/* the log root tree is a directory of all the other log roots */
|
|
|
|
struct btrfs_root *log_root_tree;
|
2009-09-21 15:56:00 -04:00
|
|
|
|
|
|
|
spinlock_t fs_roots_radix_lock;
|
2007-04-09 10:42:37 -04:00
|
|
|
struct radix_tree_root fs_roots_radix;
|
2007-10-15 16:15:26 -04:00
|
|
|
|
2008-09-23 13:14:11 -04:00
|
|
|
/* block group cache stuff */
|
|
|
|
spinlock_t block_group_cache_lock;
|
|
|
|
struct rb_root block_group_cache_tree;
|
|
|
|
|
2011-09-26 17:12:22 -04:00
|
|
|
/* keep track of unallocated space */
|
|
|
|
spinlock_t free_chunk_lock;
|
|
|
|
u64 free_chunk_space;
|
|
|
|
|
2009-09-11 16:11:19 -04:00
|
|
|
struct extent_io_tree freed_extents[2];
|
|
|
|
struct extent_io_tree *pinned_extents;
|
2007-10-15 16:15:26 -04:00
|
|
|
|
2008-03-24 15:01:56 -04:00
|
|
|
/* logical->physical extent mapping */
|
|
|
|
struct btrfs_mapping_tree mapping_tree;
|
|
|
|
|
2011-04-22 18:12:22 +08:00
|
|
|
/*
 * block reservation for extent, checksum, root tree and
 * delayed dir index item
 */
|
2010-05-16 10:46:25 -04:00
|
|
|
struct btrfs_block_rsv global_block_rsv;
|
|
|
|
/* block reservation for delay allocation */
|
|
|
|
struct btrfs_block_rsv delalloc_block_rsv;
|
|
|
|
/* block reservation for metadata operations */
|
|
|
|
struct btrfs_block_rsv trans_block_rsv;
|
|
|
|
/* block reservation for chunk tree */
|
|
|
|
struct btrfs_block_rsv chunk_block_rsv;
|
2011-11-03 22:54:25 -04:00
|
|
|
/* block reservation for delayed operations */
|
|
|
|
struct btrfs_block_rsv delayed_block_rsv;
|
2010-05-16 10:46:25 -04:00
|
|
|
|
|
|
|
struct btrfs_block_rsv empty_block_rsv;
|
|
|
|
|
2007-03-20 15:57:25 -04:00
|
|
|
u64 generation;
|
2007-08-10 16:22:09 -04:00
|
|
|
u64 last_trans_committed;
|
2009-03-24 10:24:20 -04:00
|
|
|
|
|
|
|
/*
 * this is updated to the current trans every time a full commit
 * is required instead of the faster short fsync log commits
 */
|
|
|
|
u64 last_trans_log_full_commit;
|
2010-12-17 14:21:50 +08:00
|
|
|
unsigned long mount_opt:20;
|
|
|
|
unsigned long compress_type:4;
|
2008-01-29 16:03:38 -05:00
|
|
|
u64 max_inline;
|
2008-01-02 10:01:11 -05:00
|
|
|
u64 alloc_start;
|
2007-03-22 15:59:16 -04:00
|
|
|
struct btrfs_transaction *running_transaction;
|
2008-07-17 12:53:50 -04:00
|
|
|
wait_queue_head_t transaction_throttle;
|
2008-07-17 12:54:14 -04:00
|
|
|
wait_queue_head_t transaction_wait;
|
2010-10-29 15:37:34 -04:00
|
|
|
wait_queue_head_t transaction_blocked_wait;
|
2008-11-06 22:02:51 -05:00
|
|
|
wait_queue_head_t async_submit_wait;
|
2008-09-05 16:13:11 -04:00
|
|
|
|
2011-04-13 15:41:04 +02:00
|
|
|
struct btrfs_super_block *super_copy;
|
|
|
|
struct btrfs_super_block *super_for_commit;
|
2008-03-24 15:01:56 -04:00
|
|
|
struct block_device *__bdev;
|
2007-03-22 12:13:20 -04:00
|
|
|
struct super_block *sb;
|
2007-03-28 13:57:48 -04:00
|
|
|
struct inode *btree_inode;
|
2008-03-26 10:28:07 -04:00
|
|
|
struct backing_dev_info bdi;
|
2008-09-05 16:13:11 -04:00
|
|
|
struct mutex tree_log_mutex;
|
2008-06-25 16:01:31 -04:00
|
|
|
struct mutex transaction_kthread_mutex;
|
|
|
|
struct mutex cleaner_mutex;
|
2008-06-25 16:01:30 -04:00
|
|
|
struct mutex chunk_mutex;
|
2008-07-08 14:19:17 -04:00
|
|
|
struct mutex volume_mutex;
|
2009-03-31 13:27:11 -04:00
|
|
|
/*
 * this protects the ordered operations list only while we are
 * processing all of the entries on it. This way we make
 * sure the commit code doesn't find the list temporarily empty
 * because another function happens to be doing non-waiting preflush
 * before jumping into the main commit.
 */
|
|
|
|
struct mutex ordered_operations_mutex;
|
2009-09-11 16:11:19 -04:00
|
|
|
struct rw_semaphore extent_commit_sem;
|
2009-03-31 13:27:11 -04:00
|
|
|
|
2009-11-12 09:34:40 +00:00
|
|
|
struct rw_semaphore cleanup_work_sem;
|
2009-09-21 16:00:26 -04:00
|
|
|
|
2009-11-12 09:34:40 +00:00
|
|
|
struct rw_semaphore subvol_sem;
|
2009-09-21 16:00:26 -04:00
|
|
|
struct srcu_struct subvol_srcu;
|
|
|
|
|
2011-04-11 17:25:13 -04:00
|
|
|
spinlock_t trans_lock;
|
2011-06-13 20:00:16 -04:00
|
|
|
/*
|
|
|
|
* the reloc mutex goes with the trans lock, it is taken
|
|
|
|
* during commit to protect us from the relocation code
|
|
|
|
*/
|
|
|
|
struct mutex reloc_mutex;
|
|
|
|
|
2007-04-19 21:01:03 -04:00
|
|
|
struct list_head trans_list;
|
2007-10-15 16:19:22 -04:00
|
|
|
struct list_head hashers;
|
2007-06-08 18:11:48 -04:00
|
|
|
struct list_head dead_roots;
|
2009-09-11 16:11:19 -04:00
|
|
|
struct list_head caching_block_groups;
|
2008-09-05 16:13:11 -04:00
|
|
|
|
2009-11-12 09:36:34 +00:00
|
|
|
spinlock_t delayed_iput_lock;
|
|
|
|
struct list_head delayed_iputs;
|
|
|
|
|
2008-05-15 16:15:45 -04:00
|
|
|
atomic_t nr_async_submits;
|
2008-09-29 11:19:10 -04:00
|
|
|
atomic_t async_submit_draining;
|
2008-08-15 15:34:15 -04:00
|
|
|
atomic_t nr_async_bios;
|
2008-11-06 22:02:51 -05:00
|
|
|
atomic_t async_delalloc_pages;
|
2011-04-11 17:25:13 -04:00
|
|
|
atomic_t open_ioctl_trans;
|
2008-04-09 16:28:12 -04:00
|
|
|
|
2008-07-24 11:57:52 -04:00
|
|
|
/*
|
|
|
|
* this is used by the balancing code to wait for all the pending
|
|
|
|
* ordered extents
|
|
|
|
*/
|
|
|
|
spinlock_t ordered_extent_lock;
|
2009-03-31 13:27:11 -04:00
|
|
|
|
|
|
|
/*
|
|
|
|
* all of the data=ordered extents pending writeback
|
|
|
|
* these can span multiple transactions and basically include
|
|
|
|
* every dirty data page that isn't from nodatacow
|
|
|
|
*/
|
2008-07-24 11:57:52 -04:00
|
|
|
struct list_head ordered_extents;
|
2009-03-31 13:27:11 -04:00
|
|
|
|
|
|
|
/*
|
|
|
|
* all of the inodes that have delalloc bytes. It is possible for
|
|
|
|
* this list to be empty even when there is still dirty data=ordered
|
|
|
|
* extents waiting to finish IO.
|
|
|
|
*/
|
2008-08-04 23:17:27 -04:00
|
|
|
struct list_head delalloc_inodes;
|
2008-07-24 11:57:52 -04:00
|
|
|
|
2009-03-31 13:27:11 -04:00
|
|
|
/*
|
|
|
|
* special rename and truncate targets that must be on disk before
|
|
|
|
* we're allowed to commit. This is basically the ext3 style
|
|
|
|
* data=ordered list.
|
|
|
|
*/
|
|
|
|
struct list_head ordered_operations;
|
|
|
|
|
2008-06-11 16:50:36 -04:00
|
|
|
/*
|
|
|
|
* there is a pool of worker threads for checksumming during writes
|
|
|
|
* and a pool for checksumming after reads. This is because readers
|
|
|
|
* can run with FS locks held, and the writers may be waiting for
|
|
|
|
* those locks. We don't want ordering in the pending list to cause
|
|
|
|
* deadlocks, and so the two are serviced separately.
|
2008-06-12 14:46:17 -04:00
|
|
|
*
|
|
|
|
* A third pool does submit_bio to avoid deadlocking with the other
|
|
|
|
* two
|
2008-06-11 16:50:36 -04:00
|
|
|
*/
|
2009-10-02 19:11:56 -04:00
|
|
|
struct btrfs_workers generic_worker;
|
2008-06-11 16:50:36 -04:00
|
|
|
struct btrfs_workers workers;
|
2008-11-06 22:02:51 -05:00
|
|
|
struct btrfs_workers delalloc_workers;
|
2008-06-11 16:50:36 -04:00
|
|
|
struct btrfs_workers endio_workers;
|
Btrfs: move data checksumming into a dedicated tree
Btrfs stores checksums for each data block. Until now, they have
been stored in the subvolume trees, indexed by the inode that is
referencing the data block. This means that when we read the inode,
we've probably read in at least some checksums as well.
But, this has a few problems:
* The checksums are indexed by logical offset in the file. When
compression is on, this means we have to do the expensive checksumming
on the uncompressed data. It would be faster if we could checksum
the compressed data instead.
* If we implement encryption, we'll be checksumming the plain text and
storing that on disk. This is significantly less secure.
* For either compression or encryption, we have to get the plain text
back before we can verify the checksum as correct. This makes the raid
layer balancing and extent moving much more expensive.
* It makes the front end caching code more complex, as we have to touch
the subvolume and inodes as we cache extents.
* There is potentially one copy of the checksum in each subvolume
referencing an extent.
The solution used here is to store the extent checksums in a dedicated
tree. This allows us to index the checksums by physical extent
start and length. It means:
* The checksum is against the data stored on disk, after any compression
or encryption is done.
* The checksum is stored in a central location, and can be verified without
following back references, or reading inodes.
This makes compression significantly faster by reducing the amount of
data that needs to be checksummed. It will also allow much faster
raid management code in general.
The checksums are indexed by a key with a fixed objectid (a magic value
in ctree.h) and offset set to the starting byte of the extent. This
allows us to copy the checksum items into the fsync log tree directly (or
any other tree), without having to invent a second format for them.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
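The keying scheme described at the end of this commit message can be sketched in a few lines of C. This is only an illustration under stated assumptions: `sketch_key`, `sketch_csum_key`, `sketch_key_cmp`, and both constants are hypothetical names with placeholder values, not the definitions from ctree.h.

```c
#include <assert.h>
#include <stdint.h>

/* Placeholder values; the real magic objectid and key type live in ctree.h. */
#define SKETCH_CSUM_OBJECTID ((uint64_t)-10)
#define SKETCH_CSUM_KEY_TYPE 128

/* Simplified in-memory key: (objectid, type, offset). */
struct sketch_key {
    uint64_t objectid;
    uint8_t type;
    uint64_t offset;
};

/* Build the key for the checksum item covering an extent that starts
 * at byte extent_start: fixed objectid, offset = starting byte. */
struct sketch_key sketch_csum_key(uint64_t extent_start)
{
    struct sketch_key key;

    key.objectid = SKETCH_CSUM_OBJECTID;
    key.type = SKETCH_CSUM_KEY_TYPE;
    key.offset = extent_start;
    return key;
}

/* Order keys by objectid, then type, then offset. */
int sketch_key_cmp(const struct sketch_key *a, const struct sketch_key *b)
{
    if (a->objectid != b->objectid)
        return a->objectid < b->objectid ? -1 : 1;
    if (a->type != b->type)
        return a->type < b->type ? -1 : 1;
    if (a->offset != b->offset)
        return a->offset < b->offset ? -1 : 1;
    return 0;
}

/* Usage: checksum keys differ only in offset, so they sort by extent start. */
void sketch_csum_selfcheck(void)
{
    struct sketch_key lo = sketch_csum_key(4096);
    struct sketch_key hi = sketch_csum_key(8192);

    assert(sketch_key_cmp(&lo, &hi) < 0);
}
```

Because the objectid is fixed and does not depend on any inode or subvolume, the same key is valid in whichever tree the item is copied into, which is the property the message relies on for the fsync log tree.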
2008-12-08 16:58:54 -05:00
|
|
|
struct btrfs_workers endio_meta_workers;
|
2008-12-17 14:51:42 -05:00
|
|
|
struct btrfs_workers endio_meta_write_workers;
|
2008-07-17 12:53:50 -04:00
|
|
|
struct btrfs_workers endio_write_workers;
|
2010-07-02 12:14:14 -04:00
|
|
|
struct btrfs_workers endio_freespace_worker;
|
2008-06-12 14:46:17 -04:00
|
|
|
struct btrfs_workers submit_workers;
|
2011-06-30 14:42:28 -04:00
|
|
|
struct btrfs_workers caching_workers;
|
2011-05-23 14:30:00 +02:00
|
|
|
struct btrfs_workers readahead_workers;
|
2011-06-30 14:42:28 -04:00
|
|
|
|
2008-07-17 12:53:51 -04:00
|
|
|
/*
|
|
|
|
* fixup workers take dirty pages that didn't properly go through
|
|
|
|
* the cow mechanism and make them safe to write. It happens
|
|
|
|
* for the sys_munmap function call path
|
|
|
|
*/
|
|
|
|
struct btrfs_workers fixup_workers;
|
btrfs: implement delayed inode items operation
Changelog V5 -> V6:
- Fix oom when the memory load is high, by storing the delayed nodes into the
root's radix tree, and letting btrfs inodes go.
Changelog V4 -> V5:
- Fix the race on adding the delayed node to the inode, which was spotted by
Chris Mason.
- Merge Chris Mason's incremental patch into this patch.
- Fix deadlock between readdir() and memory fault, which was reported by
Itaru Kitayama.
Changelog V3 -> V4:
- Fix nested lock, which was reported by Itaru Kitayama, by updating space cache
inode in time.
Changelog V2 -> V3:
- Fix the race between the delayed worker and the task which does delayed items
balance, which was reported by Tsutomu Itoh.
- Modify the patch to address David Sterba's comment.
- Fix the bug of the cpu recursion spinlock, reported by Chris Mason
Changelog V1 -> V2:
- break up the global rb-tree, use a list to manage the delayed nodes,
which is created for every directory and file, and used to manage the
delayed directory name index items and the delayed inode item.
- introduce a worker to deal with the delayed nodes.
Compared with Ext3/4, file creation and deletion on btrfs perform very
poorly. The reason is that btrfs must do a lot of b+ tree insertions,
such as the inode item, directory name item, directory name index and so on.
If we delay some b+ tree insertions and deletions, we can improve the
performance, so this patch implements delayed directory name index
insertion/deletion and delayed inode update.
Implementation:
- Introduce a delayed root object into the filesystem that uses two lists to
manage the delayed nodes, which are created for every file/directory.
One list holds all the delayed nodes that have delayed items. The
other holds the delayed nodes that are waiting to be dealt with
by the worker thread.
- Every delayed node has two rb-trees: one manages the directory name
index items that are going to be inserted into the b+ tree, and the other
manages the directory name index items that are going to be deleted from the b+ tree.
- Introduce a worker to deal with the delayed operations: the delayed
directory name index insertions and deletions and the delayed inode
updates.
When the number of delayed items exceeds the lower limit, we create works for some
delayed nodes, insert them into the work queue of the worker, and then
return.
When the number of delayed items exceeds the upper bound, we create works for all
the delayed nodes that haven't been dealt with, insert them into the work
queue of the worker, and then wait until the number of untreated items drops below some
threshold value.
- When we want to insert a directory name index into the b+ tree, we just add the
information into the delayed inserting rb-tree.
Then we check the number of delayed items and do the delayed items
balance. (The balance policy is described above.)
- When we want to delete a directory name index from the b+ tree, we first search for it
in the inserting rb-tree. If we find it there, we just drop it. If not,
we add its key into the delayed deleting rb-tree.
As with the delayed inserting rb-tree, we also check the number of
delayed items and do the delayed items balance.
(The same as for insertion.)
- When we want to update the metadata of some inode, we cache the data of the
inode in the delayed node. The worker will flush it into the b+ tree after
dealing with the delayed insertions and deletions.
- We move the delayed node to the tail of the list after we access it.
This way, we can cache more delayed items and merge more
inode updates.
- If we want to commit a transaction, we deal with all the delayed nodes.
- The delayed node is freed when we free the btrfs inode.
- Before we log the inode items, we commit all the directory name index items
and the delayed inode update.
I did a quick test with the benchmark tool [1] and found we can improve the
performance of file creation by ~15%, and file deletion by ~20%.
Before applying this patch:
Create files:
Total files: 50000
Total time: 1.096108
Average time: 0.000022
Delete files:
Total files: 50000
Total time: 1.510403
Average time: 0.000030
After applying this patch:
Create files:
Total files: 50000
Total time: 0.932899
Average time: 0.000019
Delete files:
Total files: 50000
Total time: 1.215732
Average time: 0.000024
[1] http://marc.info/?l=linux-btrfs&m=128212635122920&q=p3
Many thanks for Kitayama-san's help!
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: David Sterba <dave@jikos.cz>
Tested-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Tested-by: Itaru Kitayama <kitayama@cl.bb4u.ne.jp>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
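The two-threshold balance policy from the Implementation list above can be read as a small decision function. The sketch below is a hedged illustration only: the enum, the function name `sketch_balance`, and the idea of passing the limits as parameters are assumptions made for the example, and the commit message does not state the actual threshold values.

```c
#include <assert.h>

/* What the caller should do after adding a delayed item (hypothetical names). */
enum sketch_balance_action {
    SKETCH_BALANCE_NONE,          /* below the lower limit: nothing to do */
    SKETCH_BALANCE_QUEUE_SOME,    /* queue work for some nodes, then return */
    SKETCH_BALANCE_QUEUE_ALL_WAIT /* queue work for all nodes, then wait */
};

/*
 * Decide how to balance given the current number of delayed items.
 * lower and upper are the two limits described in the commit message;
 * their concrete values are not given there.
 */
enum sketch_balance_action sketch_balance(unsigned long items,
                                          unsigned long lower,
                                          unsigned long upper)
{
    assert(lower <= upper);

    if (items > upper)
        return SKETCH_BALANCE_QUEUE_ALL_WAIT;
    if (items > lower)
        return SKETCH_BALANCE_QUEUE_SOME;
    return SKETCH_BALANCE_NONE;
}
```

The point of the middle band is that the caller hands work to the worker without blocking; only past the upper bound does the caller itself wait, which throttles producers when the worker falls behind.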
2011-04-22 18:12:22 +08:00
|
|
|
struct btrfs_workers delayed_workers;
|
2008-06-25 16:01:31 -04:00
|
|
|
struct task_struct *transaction_kthread;
|
|
|
|
struct task_struct *cleaner_kthread;
|
2008-06-11 21:47:56 -04:00
|
|
|
int thread_pool_size;
|
2008-06-11 16:50:36 -04:00
|
|
|
|
2007-08-29 15:47:34 -04:00
|
|
|
struct kobject super_kobj;
|
|
|
|
struct completion kobj_unregister;
|
2007-04-20 13:16:02 -04:00
|
|
|
int do_barriers;
|
2007-06-08 18:11:48 -04:00
|
|
|
int closing;
|
2008-09-05 16:13:11 -04:00
|
|
|
int log_root_recovering;
|
2010-05-16 10:48:46 -04:00
|
|
|
int enospc_unlink;
|
2011-04-11 17:25:13 -04:00
|
|
|
int trans_no_join;
|
2007-03-20 14:38:32 -04:00
|
|
|
|
2007-11-16 14:57:08 -05:00
|
|
|
u64 total_pinned;
|
2009-03-13 11:00:37 -04:00
|
|
|
|
|
|
|
/* protected by the delalloc lock, used to keep from writing
|
|
|
|
* metadata until there is a nice batch
|
|
|
|
*/
|
|
|
|
u64 dirty_metadata_bytes;
|
2008-03-24 15:01:56 -04:00
|
|
|
struct list_head dirty_cowonly_roots;
|
|
|
|
|
2008-03-24 15:02:07 -04:00
|
|
|
struct btrfs_fs_devices *fs_devices;
|
2009-03-10 12:39:20 -04:00
|
|
|
|
|
|
|
/*
|
|
|
|
* the space_info list is almost entirely read only. It only changes
|
|
|
|
* when we add a new raid type to the FS, and that happens
|
|
|
|
* very rarely. RCU is used to protect it.
|
|
|
|
*/
|
2008-03-24 15:01:59 -04:00
|
|
|
struct list_head space_info;
|
2009-03-10 12:39:20 -04:00
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits the old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update the back refs for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about the pointer's key, level and the
tree in which the pointer lives. This information allows us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds a per-subvolume red-black tree to keep track of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers fall within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduces the overhead of checking back refs.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
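The reference-count rule this commit message describes for cow'ing a tree block can be modelled with a toy structure. This is a minimal sketch, assuming a block is just a reference count plus pointers to the counts of the extents it points to; `sketch_block`, `sketch_cow`, and `SKETCH_MAX_PTRS` are hypothetical and do not correspond to kernel types.

```c
#include <assert.h>

#define SKETCH_MAX_PTRS 4

/* Toy tree block: its own refcount and the refcounts of the extents it points to. */
struct sketch_block {
    unsigned refs;
    int freed;
    unsigned nr_ptrs;
    unsigned *child_refs[SKETCH_MAX_PTRS];
};

/* COW old_blk into new_blk following the rule in the commit message. */
void sketch_cow(struct sketch_block *old_blk, struct sketch_block *new_blk)
{
    unsigned i;

    assert(old_blk->refs >= 1);
    assert(old_blk->nr_ptrs <= SKETCH_MAX_PTRS);

    /* The copy points to the same lower-level extents as the original. */
    new_blk->refs = 1;
    new_blk->freed = 0;
    new_blk->nr_ptrs = old_blk->nr_ptrs;
    for (i = 0; i < old_blk->nr_ptrs; i++)
        new_blk->child_refs[i] = old_blk->child_refs[i];

    if (old_blk->refs == 1) {
        /* Non-shared: free the old block at once. The new block inherits
         * its references, so lower-level counts stay untouched. */
        old_blk->refs = 0;
        old_blk->freed = 1;
    } else {
        /* Shared: every extent gains one referencing block, and the old
         * block loses the reference held by the tree being modified. */
        for (i = 0; i < new_blk->nr_ptrs; i++)
            (*new_blk->child_refs[i])++;
        old_blk->refs--;
    }
}
```

In the non-shared case no lower-level count changes at all, which is why the later walk over a "dead root" is no longer needed to undo increments.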
2009-06-10 10:45:14 -04:00
|
|
|
struct reloc_control *reloc_ctl;
|
|
|
|
|
2007-12-21 16:27:21 -05:00
|
|
|
spinlock_t delalloc_lock;
|
|
|
|
u64 delalloc_bytes;
|
2009-04-03 09:47:43 -04:00
|
|
|
|
|
|
|
/* data_alloc_cluster is only used in ssd mode */
|
|
|
|
struct btrfs_free_cluster data_alloc_cluster;
|
|
|
|
|
|
|
|
/* all metadata allocations go through this cluster */
|
|
|
|
struct btrfs_free_cluster meta_alloc_cluster;
|
2008-04-04 15:40:00 -04:00
|
|
|
|
2011-05-24 15:35:30 -04:00
|
|
|
/* auto defrag inodes go here */
|
|
|
|
spinlock_t defrag_inodes_lock;
|
|
|
|
struct rb_root defrag_inodes;
|
|
|
|
atomic_t defrag_running;
|
|
|
|
|
2008-07-28 15:32:19 -04:00
|
|
|
spinlock_t ref_cache_lock;
|
|
|
|
u64 total_ref_cache_size;
|
|
|
|
|
2008-04-04 15:40:00 -04:00
|
|
|
u64 avail_data_alloc_bits;
|
|
|
|
u64 avail_metadata_alloc_bits;
|
|
|
|
u64 avail_system_alloc_bits;
|
|
|
|
u64 data_alloc_profile;
|
|
|
|
u64 metadata_alloc_profile;
|
|
|
|
u64 system_alloc_profile;
|
2008-04-28 15:29:42 -04:00
|
|
|
|
2009-04-21 17:40:57 -04:00
|
|
|
unsigned data_chunk_allocations;
|
|
|
|
unsigned metadata_ratio;
|
|
|
|
|
2008-04-28 15:29:42 -04:00
|
|
|
void *bdev_holder;
|
2011-01-06 19:30:25 +08:00
|
|
|
|
2011-03-08 14:14:00 +01:00
|
|
|
/* private scrub information */
|
|
|
|
struct mutex scrub_lock;
|
|
|
|
atomic_t scrubs_running;
|
|
|
|
atomic_t scrub_pause_req;
|
|
|
|
atomic_t scrubs_paused;
|
|
|
|
atomic_t scrub_cancel_req;
|
|
|
|
wait_queue_head_t scrub_pause_wait;
|
|
|
|
struct rw_semaphore scrub_super_lock;
|
|
|
|
int scrub_workers_refcnt;
|
|
|
|
struct btrfs_workers scrub_workers;
|
|
|
|
|
2011-01-06 19:30:25 +08:00
|
|
|
/* filesystem state */
|
|
|
|
u64 fs_state;
|
2011-04-22 18:12:22 +08:00
|
|
|
|
|
|
|
struct btrfs_delayed_root *delayed_root;
|
2011-11-03 15:17:42 -04:00
|
|
|
|
2011-05-23 14:30:00 +02:00
|
|
|
/* readahead tree */
|
|
|
|
spinlock_t reada_lock;
|
|
|
|
struct radix_tree_root reada_tree;
|
2011-11-06 03:05:08 -05:00
|
|
|
|
2011-11-03 15:17:42 -04:00
|
|
|
/* next backup root to be overwritten */
|
|
|
|
int backup_root_index;
|
2007-11-16 14:57:08 -05:00
|
|
|
};
|
2008-03-24 15:01:56 -04:00
|
|
|
|
2007-03-20 14:38:32 -04:00
|
|
|
/*
|
|
|
|
* in ram representation of the tree. extent_root is used for all allocations
|
2007-04-25 15:52:25 -04:00
|
|
|
* and for the extent tree extent_root root.
|
2007-03-20 14:38:32 -04:00
|
|
|
*/
|
|
|
|
struct btrfs_root {
|
2007-10-15 16:14:19 -04:00
|
|
|
struct extent_buffer *node;
|
2008-06-25 16:01:30 -04:00
|
|
|
|
2007-10-15 16:14:19 -04:00
|
|
|
struct extent_buffer *commit_root;
|
2008-09-05 16:13:11 -04:00
|
|
|
struct btrfs_root *log_root;
|
Btrfs: update space balancing code
This patch updates the space balancing code to utilize the new
backref format. Before, btrfs-vol -b would break any COW links
on data blocks or metadata. This was slow and caused the amount
of space used to explode if a large number of snapshots were present.
The new code keeps the sharing of all data extents and
most of the tree blocks.
To maintain the sharing of data extents, the space balance code uses
a separate inode to hold data extent pointers, then updates the references
to point to the new location.
To maintain the sharing of tree blocks, the space balance code uses
reloc trees to relocate tree blocks in reference counted roots.
There is one reloc tree for each subvol, and all reloc trees share the
same root key objectid. Reloc trees are snapshots of the latest
committed roots of subvols (root->commit_root).
To relocate a tree block referenced by a subvol, there are two steps:
COW the block through the subvol's reloc tree, then update the block pointer in
the subvol to point to the new block. Since all reloc trees share the
same root key objectid, doing special handling for tree blocks
owned by them is easy. Once a tree block has been COWed in one
reloc tree, we can use the resulting new block directly when the
same block is required to COW again through other reloc trees.
In this way, relocated tree blocks are shared between reloc trees,
so they are also shared between subvols.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-26 10:09:34 -04:00
|
|
|
struct btrfs_root *reloc_root;
|
2008-07-28 15:32:19 -04:00
|
|
|
|
2007-03-15 12:56:47 -04:00
|
|
|
struct btrfs_root_item root_item;
|
|
|
|
struct btrfs_key root_key;
|
2007-03-20 14:38:32 -04:00
|
|
|
struct btrfs_fs_info *fs_info;
|
2008-09-11 16:17:57 -04:00
|
|
|
struct extent_io_tree dirty_log_pages;
|
|
|
|
|
2007-08-29 15:47:34 -04:00
|
|
|
struct kobject root_kobj;
|
|
|
|
struct completion kobj_unregister;
|
2008-06-25 16:01:30 -04:00
|
|
|
struct mutex objectid_mutex;
|
2009-01-21 12:54:03 -05:00
|
|
|
|
2010-05-16 10:46:25 -04:00
|
|
|
spinlock_t accounting_lock;
|
|
|
|
struct btrfs_block_rsv *block_rsv;
|
|
|
|
|
Btrfs: Cache free inode numbers in memory
Currently btrfs stores the highest objectid of the fs tree, and it always
returns (highest+1) as the inode number when we create a file. Inode numbers
are not reclaimed when we delete files, so we'll run out of inode numbers
as we keep creating and deleting files on 32-bit machines.
This fixes it, and it works similarly to how we cache free space in block
groups.
We start a kernel thread to read the file tree. By scanning inode items,
we know which chunks of inode numbers are free, and we cache them in
an rb-tree.
Because we are searching the commit root, we have to carefully handle the
cross-transaction case.
The rb-tree is a hybrid extent+bitmap tree, so if we have too many small
chunks of inode numbers, we'll use bitmaps. Initially we allow 16K ram
of extents, and a bitmap will be used if we exceed this threshold. The
extents threshold is adjusted at runtime.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
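The scanning step in this commit message (walk the inode items, record the gaps as free chunks) can be sketched as follows. It is an illustration only: it assumes the used inode numbers arrive as an ascending array, covers just the extent form and not the bitmap fallback, and `sketch_range` and `sketch_free_ranges` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A chunk of consecutive free inode numbers. */
struct sketch_range {
    uint64_t start;
    uint64_t count;
};

/*
 * Find the free chunks in [first, last], given the inode numbers in use
 * (sorted ascending). Writes at most max_out ranges to out and returns how
 * many were written; further ranges are silently dropped.
 */
size_t sketch_free_ranges(const uint64_t *used, size_t n,
                          uint64_t first, uint64_t last,
                          struct sketch_range *out, size_t max_out)
{
    size_t i, found = 0;
    uint64_t next = first; /* lowest number not yet classified */

    assert(first <= last);
    assert(last < UINT64_MAX); /* keeps next = used[i] + 1 from wrapping */

    for (i = 0; i < n && next <= last; i++) {
        if (used[i] < next)
            continue; /* below the window, or a duplicate */
        if (used[i] > last)
            break;
        if (used[i] > next && found < max_out) {
            out[found].start = next;
            out[found].count = used[i] - next;
            found++;
        }
        next = used[i] + 1;
    }

    /* Everything after the last used number is free as well. */
    if (next <= last && found < max_out) {
        out[found].start = next;
        out[found].count = last - next + 1;
        found++;
    }
    return found;
}
```

A real cache would insert each range into the rb-tree as it is found; returning them in an array just keeps the example self-contained.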
2011-04-20 10:06:11 +08:00
|
|
|
/* free ino cache stuff */
|
|
|
|
struct mutex fs_commit_mutex;
|
|
|
|
struct btrfs_free_space_ctl *free_ino_ctl;
|
|
|
|
enum btrfs_caching_type cached;
|
|
|
|
spinlock_t cache_lock;
|
|
|
|
wait_queue_head_t cache_wait;
|
|
|
|
struct btrfs_free_space_ctl *free_ino_pinned;
|
|
|
|
u64 cache_progress;
|
2011-04-20 10:33:24 +08:00
|
|
|
struct inode *cache_inode;
|
2011-04-20 10:06:11 +08:00
|
|
|
|
2008-09-05 16:13:11 -04:00
|
|
|
struct mutex log_mutex;
|
2009-01-21 12:54:03 -05:00
|
|
|
wait_queue_head_t log_writer_wait;
|
|
|
|
wait_queue_head_t log_commit_wait[2];
|
|
|
|
atomic_t log_writers;
|
|
|
|
atomic_t log_commit[2];
|
|
|
|
unsigned long log_transid;
|
2009-10-13 13:21:08 -04:00
|
|
|
unsigned long last_log_commit;
|
2009-01-21 12:54:03 -05:00
|
|
|
unsigned long log_batch;
|
2009-10-08 15:30:04 -04:00
|
|
|
pid_t log_start_pid;
|
|
|
|
bool log_multiple_pids;
|
2008-08-04 23:17:27 -04:00
|
|
|
|
2007-04-09 10:42:37 -04:00
|
|
|
u64 objectid;
|
|
|
|
u64 last_trans;
|
2007-10-15 16:14:19 -04:00
|
|
|
|
|
|
|
/* data allocations are done in sectorsize units */
|
|
|
|
u32 sectorsize;
|
|
|
|
|
|
|
|
/* node allocations are done in nodesize units */
|
|
|
|
u32 nodesize;
|
|
|
|
|
|
|
|
/* leaf allocations are done in leafsize units */
|
|
|
|
u32 leafsize;
|
|
|
|
|
2007-11-30 11:30:34 -05:00
|
|
|
u32 stripesize;
|
|
|
|
|
2007-03-20 14:38:32 -04:00
|
|
|
u32 type;
|
2009-09-21 15:56:00 -04:00
|
|
|
|
|
|
|
u64 highest_objectid;
|
2011-06-13 20:00:16 -04:00
|
|
|
|
|
|
|
/* btrfs_record_root_in_trans is a multi-step process,
|
|
|
|
* and it can race with the balancing code. But the
|
|
|
|
* race is very small, and only the first time the root
|
|
|
|
* is added to each transaction. So in_trans_setup
|
|
|
|
* is used to tell us when more checks are required
|
|
|
|
*/
|
|
|
|
unsigned long in_trans_setup;
|
2007-08-07 15:52:19 -04:00
|
|
|
int ref_cows;
|
2008-03-24 15:01:56 -04:00
|
|
|
int track_dirty;
|
2009-09-21 15:56:00 -04:00
|
|
|
int in_radix;
|
|
|
|
|
2008-06-25 16:01:31 -04:00
|
|
|
u64 defrag_trans_start;
|
2007-08-07 16:15:09 -04:00
|
|
|
struct btrfs_key defrag_progress;
|
2008-05-24 14:04:53 -04:00
|
|
|
struct btrfs_key defrag_max;
|
2007-08-07 16:15:09 -04:00
|
|
|
int defrag_running;
|
2007-08-29 15:47:34 -04:00
|
|
|
char *name;
|
2008-03-24 15:01:56 -04:00
|
|
|
|
|
|
|
/* the dirty list is only used by non-reference counted roots */
|
|
|
|
struct list_head dirty_list;
|
2008-07-24 12:17:14 -04:00
|
|
|
|
2009-06-10 10:45:14 -04:00
|
|
|
struct list_head root_list;
|
|
|
|
|
2010-05-16 10:49:58 -04:00
|
|
|
spinlock_t orphan_lock;
|
2008-07-24 12:17:14 -04:00
|
|
|
struct list_head orphan_list;
|
2010-05-16 10:49:58 -04:00
|
|
|
struct btrfs_block_rsv *orphan_block_rsv;
|
|
|
|
int orphan_item_inserted;
|
|
|
|
int orphan_cleanup_state;
|
2008-11-17 20:42:26 -05:00
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about the pointer's key, its level and the
tree in which the pointer lives. This information allows us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
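As a rough illustration of the rule above, the choice between the two kinds of back reference could be sketched as follows. The key values are the ones defined further down in this header. The helper name `pick_tree_block_ref_type` and its `nr_roots` parameter are hypothetical, and mapping the "fuzzy" ref to `BTRFS_TREE_BLOCK_REF_KEY` and the full ref to `BTRFS_SHARED_BLOCK_REF_KEY` is an assumption of this sketch, not something the commit message states.

```c
/* Key types as defined in this header. */
#define BTRFS_TREE_BLOCK_REF_KEY	176	/* assumed: "fuzzy" ref, names the owner tree */
#define BTRFS_SHARED_BLOCK_REF_KEY	182	/* assumed: full ref, names the parent block */

/*
 * Hypothetical helper: pick the back ref type for a tree block from the
 * number of roots that hold a reference on it.
 */
static int pick_tree_block_ref_type(unsigned int nr_roots)
{
	/* Common case: one root, so the cheap fuzzy back ref is enough. */
	if (nr_roots <= 1)
		return BTRFS_TREE_BLOCK_REF_KEY;

	/* Shared by several roots (snapshots): fall back to the full back ref. */
	return BTRFS_SHARED_BLOCK_REF_KEY;
}
```

The point of the split is that the expensive O(number_of_snapshots) resolution is never needed: a block shared by several roots carries a full back ref instead.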
This commit adds a per-subvolume red-black tree to keep track of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers fall within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduces the overhead of checking back refs.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 10:45:14 -04:00
|
|
|
spinlock_t inode_lock;
|
|
|
|
/* red-black tree that keeps track of in-memory inodes */
|
|
|
|
struct rb_root inode_tree;
|
|
|
|
|
btrfs: implement delayed inode items operation
Changelog V5 -> V6:
- Fix oom when the memory load is high, by storing the delayed nodes into the
root's radix tree, and letting btrfs inodes go.
Changelog V4 -> V5:
- Fix the race on adding the delayed node to the inode, which is spotted by
Chris Mason.
- Merge Chris Mason's incremental patch into this patch.
- Fix deadlock between readdir() and memory fault, which is reported by
Itaru Kitayama.
Changelog V3 -> V4:
- Fix nested lock, which is reported by Itaru Kitayama, by updating space cache
inode in time.
Changelog V2 -> V3:
- Fix the race between the delayed worker and the task which does delayed items
balance, which is reported by Tsutomu Itoh.
- Modify the patch to address David Sterba's comments.
- Fix the cpu recursion spinlock bug, reported by Chris Mason.
Changelog V1 -> V2:
- break up the global rb-tree, use a list to manage the delayed nodes,
which is created for every directory and file, and used to manage the
delayed directory name index items and the delayed inode item.
- introduce a worker to deal with the delayed nodes.
Compared with Ext3/4, the performance of file creation and deletion on btrfs
is very poor. The reason is that btrfs must do a lot of b+ tree insertions,
such as the inode item, directory name item, directory name index and so on.
If we can delay some b+ tree insertions or deletions, we can improve the
performance, so we made this patch, which implements delayed directory name
index insertion/deletion and delayed inode update.
Implementation:
- Introduce a delayed root object into the filesystem, which uses two lists to
manage the delayed nodes that are created for every file/directory.
One list manages all the delayed nodes that have delayed items. The
other manages the delayed nodes that are waiting to be dealt with
by the work thread.
- Every delayed node has two rb-trees: one manages the directory name
index items that are going to be inserted into the b+ tree, and the other
manages the directory name index items that are going to be deleted from the b+ tree.
- Introduce a worker to deal with the delayed operations. This worker
handles the delayed directory name index insertions
and deletions and the delayed inode updates.
When the number of delayed items exceeds the lower limit, we create works for some
delayed nodes, insert them into the work queue of the worker, and then
go back.
When the number of delayed items exceeds the upper bound, we create works for all
the delayed nodes that haven't been dealt with, insert them into the work
queue of the worker, and then wait until the number of untreated items drops
below some threshold value.
- When we want to insert a directory name index into the b+ tree, we just add the
information into the delayed inserting rb-tree.
We then check the number of delayed items and do the delayed items
balance. (The balance policy is described above.)
- When we want to delete a directory name index from the b+ tree, we first search
for it in the inserting rb-tree. If we find it there, we just drop it. If not,
we add its key into the delayed deleting rb-tree.
As with the delayed inserting rb-tree, we also check the number of
delayed items and do the delayed items balance.
(The same as for insertion.)
- When we want to update the metadata of some inode, we cache the data of the
inode in the delayed node. The worker will flush it into the b+ tree after
dealing with the delayed insertions and deletions.
- We move the delayed node to the tail of the list after we access the
delayed node. This way, we can cache more delayed items and merge more
inode updates.
- If we want to commit the transaction, we deal with all the delayed nodes.
- The delayed node is freed when we free the btrfs inode.
- Before we log the inode items, we commit all the directory name index items
and the delayed inode update.
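The balance policy described in the list above can be sketched as a small decision function. This is only an illustration: the threshold values, the enum and the name `delayed_balance` are all hypothetical, since the commit message gives neither the real limits nor the real function names.

```c
/* Hypothetical thresholds; the commit message does not state the real values. */
#define DELAYED_LOWER_LIMIT	64
#define DELAYED_UPPER_LIMIT	256

enum delayed_action {
	DELAYED_NOTHING,	/* at or below the lower limit: keep caching items */
	DELAYED_QUEUE_SOME,	/* queue work for some delayed nodes, then return */
	DELAYED_QUEUE_ALL_WAIT,	/* queue work for all pending nodes, then wait */
};

/* Decide what the caller should do given the current number of delayed items. */
static enum delayed_action delayed_balance(unsigned long nr_items)
{
	if (nr_items > DELAYED_UPPER_LIMIT)
		return DELAYED_QUEUE_ALL_WAIT;
	if (nr_items > DELAYED_LOWER_LIMIT)
		return DELAYED_QUEUE_SOME;
	return DELAYED_NOTHING;
}
```

Under this sketch, a task that inserts or deletes a directory name index would call `delayed_balance()` after updating the rb-tree and only block in the `DELAYED_QUEUE_ALL_WAIT` case.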
I did a quick test by the benchmark tool[1] and found we can improve the
performance of file creation by ~15%, and file deletion by ~20%.
Before applying this patch:
Create files:
Total files: 50000
Total time: 1.096108
Average time: 0.000022
Delete files:
Total files: 50000
Total time: 1.510403
Average time: 0.000030
After applying this patch:
Create files:
Total files: 50000
Total time: 0.932899
Average time: 0.000019
Delete files:
Total files: 50000
Total time: 1.215732
Average time: 0.000024
[1] http://marc.info/?l=linux-btrfs&m=128212635122920&q=p3
Many thanks for Kitayama-san's help!
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: David Sterba <dave@jikos.cz>
Tested-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Tested-by: Itaru Kitayama <kitayama@cl.bb4u.ne.jp>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-04-22 18:12:22 +08:00
|
|
|
/*
|
|
|
|
* radix tree that keeps track of delayed nodes of every inode,
|
|
|
|
* protected by inode_lock
|
|
|
|
*/
|
|
|
|
struct radix_tree_root delayed_nodes_tree;
|
2008-11-17 20:42:26 -05:00
|
|
|
/*
|
|
|
|
* right now this just gets used so that a root has its own devid
|
|
|
|
* for stat. It may be used for more later
|
|
|
|
*/
|
2011-07-07 15:44:25 -04:00
|
|
|
dev_t anon_dev;
|
2011-11-14 20:48:06 -05:00
|
|
|
|
|
|
|
int force_cow;
|
2007-03-15 12:56:47 -04:00
|
|
|
};
|
|
|
|
|
2011-05-24 15:35:30 -04:00
struct btrfs_ioctl_defrag_range_args {
	/* start of the defrag operation */
	__u64 start;

	/* number of bytes to defrag, use (u64)-1 to say all */
	__u64 len;

	/*
	 * flags for the operation, which can include turning
	 * on compression for this one defrag
	 */
	__u64 flags;

	/*
	 * any extent bigger than this will be considered
	 * already defragged. Use 0 to take the kernel default
	 * Use 1 to say every single extent must be rewritten
	 */
	__u32 extent_thresh;

	/*
	 * which compression method to use if turning on compression
	 * for this defrag operation. If unspecified, zlib will
	 * be used
	 */
	__u32 compress_type;

	/* spare for later */
	__u32 unused[4];
};
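To make the field conventions in the comments concrete, here is a minimal userspace sketch that fills in such a structure for "defragment the whole file with default settings". The struct is a local mirror using `<stdint.h>` types, and `defrag_whole_file` is a hypothetical helper name; no ioctl is issued here.

```c
#include <stdint.h>
#include <string.h>

/* Local mirror of the struct above, with <stdint.h> types (assumption). */
struct defrag_range_args {
	uint64_t start;
	uint64_t len;
	uint64_t flags;
	uint32_t extent_thresh;
	uint32_t compress_type;
	uint32_t unused[4];
};

/* Hypothetical helper: describe a defrag of the entire file. */
static void defrag_whole_file(struct defrag_range_args *args)
{
	memset(args, 0, sizeof(*args));	/* also clears flags and the spare fields */
	args->start = 0;		/* begin at the first byte */
	args->len = (uint64_t)-1;	/* (u64)-1 means "all" */
	args->extent_thresh = 0;	/* 0 takes the kernel default */
}
```

Zeroing the whole structure first keeps the spare `unused` words at a known value, which is the usual precaution when a structure reserves room for later extensions.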
|
|
|
|
|
|
|
|
|
2007-03-15 19:03:33 -04:00
|
|
|
/*
|
|
|
|
* inode items have the data typically returned from stat and store other
|
|
|
|
* info about object characteristics. There is one for every file and dir in
|
|
|
|
* the FS
|
|
|
|
*/
|
2007-04-26 16:46:15 -04:00
|
|
|
#define BTRFS_INODE_ITEM_KEY 1
|
2008-11-17 20:37:39 -05:00
|
|
|
#define BTRFS_INODE_REF_KEY 12
|
|
|
|
#define BTRFS_XATTR_ITEM_KEY 24
|
|
|
|
#define BTRFS_ORPHAN_ITEM_KEY 48
|
2007-04-26 16:46:15 -04:00
|
|
|
/* reserve 2-15 close to the inode for later flexibility */
|
2007-03-15 19:03:33 -04:00
|
|
|
|
|
|
|
/*
|
|
|
|
* dir items are the name -> inode pointers in a directory. There is one
|
|
|
|
* for every name in a directory.
|
|
|
|
*/
|
2008-11-17 20:37:39 -05:00
|
|
|
#define BTRFS_DIR_LOG_ITEM_KEY 60
|
|
|
|
#define BTRFS_DIR_LOG_INDEX_KEY 72
|
|
|
|
#define BTRFS_DIR_ITEM_KEY 84
|
|
|
|
#define BTRFS_DIR_INDEX_KEY 96
|
2007-03-15 19:03:33 -04:00
|
|
|
/*
|
2007-04-26 16:46:15 -04:00
|
|
|
* extent data is for file data
|
2007-03-15 19:03:33 -04:00
|
|
|
*/
|
2008-11-17 20:37:39 -05:00
|
|
|
#define BTRFS_EXTENT_DATA_KEY 108
|
Btrfs: move data checksumming into a dedicated tree
Btrfs stores checksums for each data block. Until now, they have
been stored in the subvolume trees, indexed by the inode that is
referencing the data block. This means that when we read the inode,
we've probably read in at least some checksums as well.
But, this has a few problems:
* The checksums are indexed by logical offset in the file. When
compression is on, this means we have to do the expensive checksumming
on the uncompressed data. It would be faster if we could checksum
the compressed data instead.
* If we implement encryption, we'll be checksumming the plain text and
storing that on disk. This is significantly less secure.
* For either compression or encryption, we have to get the plain text
back before we can verify the checksum as correct. This makes the raid
layer balancing and extent moving much more expensive.
* It makes the front end caching code more complex, as we have to touch
the subvolume and inodes as we cache extents.
* There is potentially one copy of the checksum in each subvolume
referencing an extent.
The solution used here is to store the extent checksums in a dedicated
tree. This allows us to index the checksums by physical extent
start and length. It means:
* The checksum is against the data stored on disk, after any compression
or encryption is done.
* The checksum is stored in a central location, and can be verified without
following back references, or reading inodes.
This makes compression significantly faster by reducing the amount of
data that needs to be checksummed. It will also allow much faster
raid management code in general.
The checksums are indexed by a key with a fixed objectid (a magic value
in ctree.h) and offset set to the starting byte of the extent. This
allows us to copy the checksum items into the fsync log tree directly (or
any other tree), without having to invent a second format for them.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
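The key layout described in the last paragraph of the message can be sketched like this. `struct csum_key` is a simplified stand-in for the real key structure and `make_csum_key` is a hypothetical helper. The fixed "magic" objectid is not shown in this section, so the sketch takes it as a parameter rather than guessing its value.

```c
#include <stdint.h>

#define BTRFS_EXTENT_CSUM_KEY 128	/* key type, as defined in this header */

/* Simplified in-memory key: (objectid, type, offset). */
struct csum_key {
	uint64_t objectid;
	uint8_t type;
	uint64_t offset;
};

/*
 * Hypothetical helper. csum_objectid stands for the fixed objectid the
 * commit message mentions; extent_start is the starting byte of the extent.
 */
static struct csum_key make_csum_key(uint64_t csum_objectid,
				     uint64_t extent_start)
{
	struct csum_key key;

	key.objectid = csum_objectid;		/* same for every checksum item */
	key.type = BTRFS_EXTENT_CSUM_KEY;
	key.offset = extent_start;		/* items sort by on-disk position */
	return key;
}
```

Because nothing in the key depends on a subvolume or inode, the same item can be copied into another tree unchanged, which is the property the message relies on for the fsync log tree.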
2008-12-08 16:58:54 -05:00
|
|
|
|
2007-03-29 15:15:27 -04:00
|
|
|
/*
|
2008-12-08 16:58:54 -05:00
|
|
|
* extent csums are stored in a separate tree and hold csums for
|
|
|
|
* an entire extent on disk.
|
2007-03-29 15:15:27 -04:00
|
|
|
*/
|
2008-12-08 16:58:54 -05:00
|
|
|
#define BTRFS_EXTENT_CSUM_KEY 128
|
2007-03-29 15:15:27 -04:00
|
|
|
|
2007-03-15 19:03:33 -04:00
|
|
|
/*
|
2009-04-02 16:46:06 -04:00
|
|
|
* root items point to tree roots. They are typically in the root
|
2007-03-15 19:03:33 -04:00
|
|
|
* tree used by the super block to find all the other trees
|
|
|
|
*/
|
2008-11-17 20:37:39 -05:00
#define BTRFS_ROOT_ITEM_KEY	132

/*
 * root backrefs tie subvols and snapshots to the directory entries that
 * reference them
 */
#define BTRFS_ROOT_BACKREF_KEY	144

/*
 * root refs make a fast index for listing all of the snapshots and
 * subvolumes referenced by a given root. They point directly to the
 * directory item in the root that references the subvol
 */
#define BTRFS_ROOT_REF_KEY	156

2007-03-15 19:03:33 -04:00
|
|
|
/*
|
|
|
|
* extent items are in the extent map tree. These record which blocks
|
|
|
|
* are used, and how many references there are to each block
|
|
|
|
*/
|
2008-11-17 20:37:39 -05:00
|
|
|
#define BTRFS_EXTENT_ITEM_KEY 168
|
2009-06-10 10:45:14 -04:00

#define BTRFS_TREE_BLOCK_REF_KEY	176

#define BTRFS_EXTENT_DATA_REF_KEY	178

#define BTRFS_EXTENT_REF_V0_KEY		180

#define BTRFS_SHARED_BLOCK_REF_KEY	182

#define BTRFS_SHARED_DATA_REF_KEY	184
2007-04-26 16:46:15 -04:00
|
|
|
|
|
|
|
/*
|
|
|
|
* block groups give us hints into the extent allocation trees. Which
|
|
|
|
* blocks are free etc etc
|
|
|
|
*/
|
2008-11-17 20:37:39 -05:00
|
|
|
#define BTRFS_BLOCK_GROUP_ITEM_KEY 192
|
2007-03-20 14:38:32 -04:00
|
|
|
|
2008-11-17 20:37:39 -05:00
|
|
|
#define BTRFS_DEV_EXTENT_KEY 204
|
|
|
|
#define BTRFS_DEV_ITEM_KEY 216
|
|
|
|
#define BTRFS_CHUNK_ITEM_KEY 228
|
2008-03-24 15:01:56 -04:00
|
|
|
|
2007-03-15 19:03:33 -04:00
|
|
|
/*
|
|
|
|
* string items are for debugging. They just store a short string of
|
|
|
|
* data in the FS
|
|
|
|
*/
|
2007-04-26 16:46:15 -04:00
|
|
|
#define BTRFS_STRING_ITEM_KEY 253
|
|
|
|
|
2011-06-28 15:10:37 +00:00
|
|
|
/*
|
|
|
|
* Flags for mount options.
|
|
|
|
*
|
|
|
|
* Note: don't forget to add new options to btrfs_show_options()
|
|
|
|
*/
|
2008-01-09 09:23:21 -05:00
|
|
|
#define BTRFS_MOUNT_NODATASUM (1 << 0)
|
|
|
|
#define BTRFS_MOUNT_NODATACOW (1 << 1)
|
|
|
|
#define BTRFS_MOUNT_NOBARRIER (1 << 2)
|
2008-01-18 10:54:22 -05:00
|
|
|
#define BTRFS_MOUNT_SSD (1 << 3)
|
2008-05-13 13:46:40 -04:00
|
|
|
#define BTRFS_MOUNT_DEGRADED (1 << 4)
|
Btrfs: Add zlib compression support
This is a large change for adding compression on reading and writing,
both for inline and regular extents. It does some fairly large
surgery to the writeback paths.
Compression is off by default and enabled by mount -o compress. Even
when the -o compress mount option is not used, it is possible to read
compressed extents off the disk.
If compression for a given set of pages fails to make them smaller, the
file is flagged to avoid future compression attempts later.
* While finding delalloc extents, the pages are locked before being sent down
to the delalloc handler. This allows the delalloc handler to do complex things
such as cleaning the pages, marking them writeback and starting IO on their
behalf.
* Inline extents are inserted at delalloc time now. This allows us to compress
the data before inserting the inline extent, and it allows us to insert
an inline extent that spans multiple pages.
* All of the in-memory extent representations (extent_map.c, ordered-data.c etc)
are changed to record both an in-memory size and an on disk size, as well
as a flag for compression.
From a disk format point of view, the extent pointers in the file are changed
to record the on disk size of a given extent and some encoding flags.
Space in the disk format is allocated for compression encoding, as well
as encryption and a generic 'other' field. Neither the encryption nor the
'other' field is currently used.
In order to limit the amount of data read for a single random read in the
file, the size of a compressed extent is limited to 128k. This is a
software only limit; the disk format supports u64 sized compressed extents.
In order to limit the ram consumed while processing extents, the uncompressed
size of a compressed extent is limited to 256k. This is a software only limit
and will be subject to tuning later.
Checksumming is still done on compressed extents, and it is done on the
uncompressed version of the data. This way additional encodings can be
layered on without having to figure out which encoding to checksum.
Compression happens at delalloc time, which is basically single threaded because
it is usually done by a single pdflush thread. This makes it tricky to
spread the compression load across all the cpus on the box. We'll have to
look at parallel pdflush walks of dirty inodes at a later time.
Decompression is hooked into readpages and it does spread across CPUs nicely.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-10-29 14:49:59 -04:00
|
|
|
#define BTRFS_MOUNT_COMPRESS (1 << 5)
|
2009-04-02 16:49:40 -04:00
|
|
|
#define BTRFS_MOUNT_NOTREELOG (1 << 6)
|
2009-04-02 16:59:01 -04:00
|
|
|
#define BTRFS_MOUNT_FLUSHONCOMMIT (1 << 7)
|
2009-06-09 20:28:34 -04:00
|
|
|
#define BTRFS_MOUNT_SSD_SPREAD (1 << 8)
|
2009-06-10 09:51:32 -04:00
|
|
|
#define BTRFS_MOUNT_NOSSD (1 << 9)
|
2009-10-14 09:24:59 -04:00
|
|
|
#define BTRFS_MOUNT_DISCARD (1 << 10)
|
2010-01-28 16:18:15 -05:00
|
|
|
#define BTRFS_MOUNT_FORCE_COMPRESS (1 << 11)
|
2010-06-21 14:48:16 -04:00
|
|
|
#define BTRFS_MOUNT_SPACE_CACHE (1 << 12)
|
2010-09-21 14:21:34 -04:00
|
|
|
#define BTRFS_MOUNT_CLEAR_CACHE (1 << 13)
|
2010-10-29 15:46:43 -04:00
|
|
|
#define BTRFS_MOUNT_USER_SUBVOL_RM_ALLOWED (1 << 14)
|
2011-02-16 13:10:41 -05:00
|
|
|
#define BTRFS_MOUNT_ENOSPC_DEBUG (1 << 15)
|
2011-05-24 15:35:30 -04:00
|
|
|
#define BTRFS_MOUNT_AUTO_DEFRAG (1 << 16)
|
2011-06-03 09:36:29 -04:00
|
|
|
#define BTRFS_MOUNT_INODE_MAP_CACHE (1 << 17)
|
2011-11-03 15:17:42 -04:00
|
|
|
#define BTRFS_MOUNT_RECOVERY (1 << 18)
|
2007-12-14 15:30:32 -05:00
|
|
|
|
|
|
|
#define btrfs_clear_opt(o, opt) ((o) &= ~BTRFS_MOUNT_##opt)
|
|
|
|
#define btrfs_set_opt(o, opt) ((o) |= BTRFS_MOUNT_##opt)
|
|
|
|
#define btrfs_test_opt(root, opt) ((root)->fs_info->mount_opt & \
|
|
|
|
BTRFS_MOUNT_##opt)
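The three helper macros above rely on token pasting (`BTRFS_MOUNT_##opt`) so callers write the short option name only. A minimal userspace sketch of how they behave follows. The macros and flag values are copied from this header; `struct fake_fs_info`, `struct fake_root` and `mount_opt_demo` are hypothetical stand-ins, since the real structures are far larger.

```c
#define BTRFS_MOUNT_SSD		(1 << 3)
#define BTRFS_MOUNT_COMPRESS	(1 << 5)

#define btrfs_clear_opt(o, opt)	((o) &= ~BTRFS_MOUNT_##opt)
#define btrfs_set_opt(o, opt)	((o) |= BTRFS_MOUNT_##opt)
#define btrfs_test_opt(root, opt)	((root)->fs_info->mount_opt & \
					 BTRFS_MOUNT_##opt)

/* Minimal stand-ins for the real structures (assumed shape). */
struct fake_fs_info { unsigned long mount_opt; };
struct fake_root { struct fake_fs_info *fs_info; };

/* Returns 1 if COMPRESS is still set and SSD is clear after the updates. */
static int mount_opt_demo(void)
{
	struct fake_fs_info info = { .mount_opt = 0 };
	struct fake_root root = { .fs_info = &info };

	btrfs_set_opt(info.mount_opt, SSD);	  /* sets bit 3 */
	btrfs_set_opt(info.mount_opt, COMPRESS); /* sets bit 5 */
	btrfs_clear_opt(info.mount_opt, SSD);	  /* clears bit 3 again */

	return btrfs_test_opt(&root, COMPRESS) && !btrfs_test_opt(&root, SSD);
}
```

Note that `btrfs_test_opt()` evaluates to the masked bit rather than to 0 or 1, so it is best used in a boolean context as shown, and that set/clear take the option word itself while test takes a root pointer.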
|
2008-01-08 15:54:37 -05:00
|
|
|
/*
|
|
|
|
* Inode flags
|
|
|
|
*/
|
2008-01-14 13:26:08 -05:00
|
|
|
#define BTRFS_INODE_NODATASUM (1 << 0)
|
|
|
|
#define BTRFS_INODE_NODATACOW (1 << 1)
|
|
|
|
#define BTRFS_INODE_READONLY (1 << 2)
|
2008-10-29 14:49:59 -04:00
|
|
|
#define BTRFS_INODE_NOCOMPRESS (1 << 3)
|
2008-10-30 14:25:28 -04:00
|
|
|
#define BTRFS_INODE_PREALLOC (1 << 4)
|
2009-04-17 10:37:41 +02:00
|
|
|
#define BTRFS_INODE_SYNC (1 << 5)
|
|
|
|
#define BTRFS_INODE_IMMUTABLE (1 << 6)
|
|
|
|
#define BTRFS_INODE_APPEND (1 << 7)
|
|
|
|
#define BTRFS_INODE_NODUMP (1 << 8)
|
|
|
|
#define BTRFS_INODE_NOATIME (1 << 9)
|
|
|
|
#define BTRFS_INODE_DIRSYNC (1 << 10)
|
Btrfs: Per file/directory controls for COW and compression
Data compression and data cow are controlled across the entire FS by mount
options right now. ioctls are needed to set this on a per file or per
directory basis. This has been proposed previously, but VFS developers
wanted us to use generic ioctls rather than btrfs-specific ones.
According to Chris's comment, there should be just one true compression
method (probably LZO) stored in the super. However, before that, we would
wait until that one method is stable enough to be adopted into the super.
So I list it as a long term goal, and just store it in ram today.
After applying this patch, we can use the generic "FS_IOC_SETFLAGS" ioctl to
control a file's or directory's datacow and compression attributes.
NOTE:
- The compression type is selected by these rules:
If we mount btrfs with a compress option, i.e. zlib/lzo, that type is used.
Otherwise, we'll use the default compress type (zlib today).
v1->v2:
- rebase to the latest btrfs.
v2->v3:
- fix a problem where a file set to NOCOW via mount option had that NOCOW
setting overridden by inheritance from the parent directory.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-03-22 10:12:20 +00:00
|
|
|
#define BTRFS_INODE_COMPRESS (1 << 11)
|
2009-04-17 10:37:41 +02:00
|
|
|
|
2011-03-28 02:01:25 +00:00
|
|
|
#define BTRFS_INODE_ROOT_ITEM_INIT (1 << 31)
|
|
|
|
|
2007-10-15 16:14:19 -04:00
|
|
|
/* some macros to generate set/get funcs for the struct fields. This
|
|
|
|
* assumes there is a lefoo_to_cpu for every type, so lets make a simple
|
|
|
|
* one for u8:
|
|
|
|
*/
|
|
|
|
#define le8_to_cpu(v) (v)
|
|
|
|
#define cpu_to_le8(v) (v)
|
|
|
|
#define __le8 u8
|
|
|
|
|
|
|
|
#define read_eb_member(eb, ptr, type, member, result) (			\
	read_extent_buffer(eb, (char *)(result),			\
			   ((unsigned long)(ptr)) +			\
			    offsetof(type, member),			\
			   sizeof(((type *)0)->member)))

#define write_eb_member(eb, ptr, type, member, result) (		\
	write_extent_buffer(eb, (char *)(result),			\
			   ((unsigned long)(ptr)) +			\
			    offsetof(type, member),			\
			   sizeof(((type *)0)->member)))
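In the two macros above, the `ptr` argument appears to be used as a byte offset into the extent buffer rather than a dereferenceable address: it is cast to unsigned long and added to offsetof(type, member). The sketch below reproduces that idiom in userspace. The flat demo_buffer, the demo_item layout and all demo_* names are assumptions made for illustration only.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for an extent buffer: a flat byte array. */
struct demo_buffer {
	unsigned char data[256];
};

/* Hypothetical on-disk record that lives somewhere inside the buffer. */
struct demo_item {
	uint8_t key[17];
	uint32_t offset;
	uint32_t size;
} __attribute__((packed));

/* Copy len bytes out of the buffer, starting at byte offset start. */
static void demo_read(const struct demo_buffer *b, void *dst,
		      unsigned long start, unsigned long len)
{
	memcpy(dst, b->data + start, len);
}

/* Copy len bytes into the buffer, starting at byte offset start. */
static void demo_write(struct demo_buffer *b, const void *src,
		       unsigned long start, unsigned long len)
{
	memcpy(b->data + start, src, len);
}

/*
 * ptr is never dereferenced: it only carries the record's offset, and
 * offsetof()/sizeof() pick out one member of the record.
 */
#define demo_read_member(b, ptr, type, member, result)			\
	demo_read(b, (result),						\
		  ((unsigned long)(ptr)) + offsetof(type, member),	\
		  sizeof(((type *)0)->member))

#define demo_write_member(b, ptr, type, member, result)			\
	demo_write(b, (result),						\
		   ((unsigned long)(ptr)) + offsetof(type, member),	\
		   sizeof(((type *)0)->member))
```

A caller would form the fake pointer from an offset, for example (struct demo_item *)(unsigned long)32, and pass it to demo_write_member() and demo_read_member() together with the address of a local variable of the member's type.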
|
|
|
|
|
2007-10-15 16:18:56 -04:00
|
|
|
#ifndef BTRFS_SETGET_FUNCS
|
2007-10-15 16:14:19 -04:00
|
|
|
#define BTRFS_SETGET_FUNCS(name, type, member, bits) \
|
2007-10-15 16:18:56 -04:00
|
|
|
u##bits btrfs_##name(struct extent_buffer *eb, type *s); \
|
|
|
|
void btrfs_set_##name(struct extent_buffer *eb, type *s, u##bits val);
|
|
|
|
#endif
|
2007-10-15 16:14:19 -04:00
|
|
|
|
|
|
|
#define BTRFS_SETGET_HEADER_FUNCS(name, type, member, bits) \
|
|
|
|
static inline u##bits btrfs_##name(struct extent_buffer *eb) \
|
|
|
|
{ \
|
2011-08-03 08:11:41 +00:00
|
|
|
type *p = page_address(eb->first_page); \
|
2008-02-15 10:40:52 -05:00
|
|
|
u##bits res = le##bits##_to_cpu(p->member); \
|
2007-10-15 16:18:55 -04:00
|
|
|
return res; \
|
2007-10-15 16:14:19 -04:00
|
|
|
} \
|
|
|
|
static inline void btrfs_set_##name(struct extent_buffer *eb, \
|
|
|
|
u##bits val) \
|
|
|
|
{ \
|
2011-08-03 08:11:41 +00:00
|
|
|
type *p = page_address(eb->first_page); \
|
2008-02-15 10:40:52 -05:00
|
|
|
p->member = cpu_to_le##bits(val); \
|
2007-10-15 16:14:19 -04:00
|
|
|
}
|
2007-04-26 16:46:15 -04:00
|
|
|
|
2007-10-15 16:14:19 -04:00
|
|
|
#define BTRFS_SETGET_STACK_FUNCS(name, type, member, bits)		\
static inline u##bits btrfs_##name(type *s)				\
{									\
	return le##bits##_to_cpu(s->member);				\
}									\
static inline void btrfs_set_##name(type *s, u##bits val)		\
{									\
	s->member = cpu_to_le##bits(val);				\
}
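The macro above pastes `name` and `bits` into a getter/setter pair that converts between little-endian storage and CPU order. A userspace sketch of the same token-pasting technique is given below. It replaces the kernel's le##bits##_to_cpu helpers with hand-written byte loops so that it builds without kernel headers; demo_dev_item and every demo_* name are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Load a little-endian integer of `bytes` bytes, whatever the host order. */
static inline uint64_t demo_le_load(const unsigned char *p, int bytes)
{
	uint64_t v = 0;

	for (int i = bytes - 1; i >= 0; i--)
		v = (v << 8) | p[i];
	return v;
}

/* Store the low `bytes` bytes of v in little-endian order. */
static inline void demo_le_store(unsigned char *p, uint64_t v, int bytes)
{
	for (int i = 0; i < bytes; i++) {
		p[i] = (unsigned char)(v & 0xff);
		v >>= 8;
	}
}

/* Generate demo_<name>() and demo_set_<name>() for one struct member. */
#define DEMO_SETGET_STACK_FUNCS(name, type, member, bits)		\
static inline uint##bits##_t demo_##name(const type *s)			\
{									\
	return (uint##bits##_t)demo_le_load(				\
		(const unsigned char *)s + offsetof(type, member),	\
		(bits) / 8);						\
}									\
static inline void demo_set_##name(type *s, uint##bits##_t val)		\
{									\
	demo_le_store((unsigned char *)s + offsetof(type, member),	\
		      val, (bits) / 8);					\
}

/* Hypothetical, simplified on-disk layout (not the real btrfs_dev_item). */
struct demo_dev_item {
	uint64_t devid;
	uint64_t total_bytes;
	uint32_t io_align;
	uint8_t seek_speed;
} __attribute__((packed));

DEMO_SETGET_STACK_FUNCS(stack_device_id, struct demo_dev_item, devid, 64)
DEMO_SETGET_STACK_FUNCS(stack_device_total_bytes, struct demo_dev_item,
			total_bytes, 64)
DEMO_SETGET_STACK_FUNCS(stack_device_io_align, struct demo_dev_item,
			io_align, 32)
DEMO_SETGET_STACK_FUNCS(stack_device_seek_speed, struct demo_dev_item,
			seek_speed, 8)
```

Each invocation expands to two functions, so demo_set_stack_device_io_align(&item, 4096) followed by demo_stack_device_io_align(&item) returns the stored value on both little-endian and big-endian hosts.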
|
|
|
|
|
2008-03-24 15:01:56 -04:00
|
|
|
BTRFS_SETGET_FUNCS(device_type, struct btrfs_dev_item, type, 64);
|
|
|
|
BTRFS_SETGET_FUNCS(device_total_bytes, struct btrfs_dev_item, total_bytes, 64);
|
|
|
|
BTRFS_SETGET_FUNCS(device_bytes_used, struct btrfs_dev_item, bytes_used, 64);
|
|
|
|
BTRFS_SETGET_FUNCS(device_io_align, struct btrfs_dev_item, io_align, 32);
|
|
|
|
BTRFS_SETGET_FUNCS(device_io_width, struct btrfs_dev_item, io_width, 32);
|
2008-12-08 16:40:21 -05:00
|
|
|
BTRFS_SETGET_FUNCS(device_start_offset, struct btrfs_dev_item,
|
|
|
|
start_offset, 64);
|
2008-03-24 15:01:56 -04:00
|
|
|
BTRFS_SETGET_FUNCS(device_sector_size, struct btrfs_dev_item, sector_size, 32);
|
|
|
|
BTRFS_SETGET_FUNCS(device_id, struct btrfs_dev_item, devid, 64);
|
2008-04-15 15:41:47 -04:00
|
|
|
BTRFS_SETGET_FUNCS(device_group, struct btrfs_dev_item, dev_group, 32);
|
|
|
|
BTRFS_SETGET_FUNCS(device_seek_speed, struct btrfs_dev_item, seek_speed, 8);
|
|
|
|
BTRFS_SETGET_FUNCS(device_bandwidth, struct btrfs_dev_item, bandwidth, 8);
|
2008-11-17 21:11:30 -05:00
|
|
|
BTRFS_SETGET_FUNCS(device_generation, struct btrfs_dev_item, generation, 64);
|
2008-03-24 15:01:56 -04:00
|
|
|
|
2008-03-24 15:02:07 -04:00
|
|
|
BTRFS_SETGET_STACK_FUNCS(stack_device_type, struct btrfs_dev_item, type, 64);
|
|
|
|
BTRFS_SETGET_STACK_FUNCS(stack_device_total_bytes, struct btrfs_dev_item,
|
|
|
|
total_bytes, 64);
|
|
|
|
BTRFS_SETGET_STACK_FUNCS(stack_device_bytes_used, struct btrfs_dev_item,
|
|
|
|
bytes_used, 64);
|
|
|
|
BTRFS_SETGET_STACK_FUNCS(stack_device_io_align, struct btrfs_dev_item,
|
|
|
|
io_align, 32);
|
|
|
|
BTRFS_SETGET_STACK_FUNCS(stack_device_io_width, struct btrfs_dev_item,
|
|
|
|
io_width, 32);
|
|
|
|
BTRFS_SETGET_STACK_FUNCS(stack_device_sector_size, struct btrfs_dev_item,
|
|
|
|
sector_size, 32);
|
|
|
|
BTRFS_SETGET_STACK_FUNCS(stack_device_id, struct btrfs_dev_item, devid, 64);
|
2008-04-15 15:41:47 -04:00
|
|
|
BTRFS_SETGET_STACK_FUNCS(stack_device_group, struct btrfs_dev_item,
|
|
|
|
dev_group, 32);
|
|
|
|
BTRFS_SETGET_STACK_FUNCS(stack_device_seek_speed, struct btrfs_dev_item,
|
|
|
|
seek_speed, 8);
|
|
|
|
BTRFS_SETGET_STACK_FUNCS(stack_device_bandwidth, struct btrfs_dev_item,
|
|
|
|
bandwidth, 8);
|
2008-11-17 21:11:30 -05:00
|
|
|
BTRFS_SETGET_STACK_FUNCS(stack_device_generation, struct btrfs_dev_item,
|
|
|
|
generation, 64);
|
2008-03-24 15:02:07 -04:00
|
|
|
|
2008-03-24 15:01:56 -04:00
|
|
|
static inline char *btrfs_device_uuid(struct btrfs_dev_item *d)
|
|
|
|
{
|
|
|
|
return (char *)d + offsetof(struct btrfs_dev_item, uuid);
|
|
|
|
}
|
|
|
|
|
2008-11-17 21:11:30 -05:00
|
|
|
static inline char *btrfs_device_fsid(struct btrfs_dev_item *d)
|
|
|
|
{
|
|
|
|
return (char *)d + offsetof(struct btrfs_dev_item, fsid);
|
|
|
|
}
|
|
|
|
|
2008-04-15 15:41:47 -04:00
|
|
|
BTRFS_SETGET_FUNCS(chunk_length, struct btrfs_chunk, length, 64);
|
2008-03-24 15:01:56 -04:00
|
|
|
BTRFS_SETGET_FUNCS(chunk_owner, struct btrfs_chunk, owner, 64);
|
|
|
|
BTRFS_SETGET_FUNCS(chunk_stripe_len, struct btrfs_chunk, stripe_len, 64);
|
|
|
|
BTRFS_SETGET_FUNCS(chunk_io_align, struct btrfs_chunk, io_align, 32);
|
|
|
|
BTRFS_SETGET_FUNCS(chunk_io_width, struct btrfs_chunk, io_width, 32);
|
|
|
|
BTRFS_SETGET_FUNCS(chunk_sector_size, struct btrfs_chunk, sector_size, 32);
|
|
|
|
BTRFS_SETGET_FUNCS(chunk_type, struct btrfs_chunk, type, 64);
|
|
|
|
BTRFS_SETGET_FUNCS(chunk_num_stripes, struct btrfs_chunk, num_stripes, 16);
|
2008-04-16 10:49:51 -04:00
|
|
|
BTRFS_SETGET_FUNCS(chunk_sub_stripes, struct btrfs_chunk, sub_stripes, 16);
|
2008-03-24 15:01:56 -04:00
|
|
|
BTRFS_SETGET_FUNCS(stripe_devid, struct btrfs_stripe, devid, 64);
|
|
|
|
BTRFS_SETGET_FUNCS(stripe_offset, struct btrfs_stripe, offset, 64);
|
|
|
|
|
2008-04-15 15:41:47 -04:00
|
|
|
static inline char *btrfs_stripe_dev_uuid(struct btrfs_stripe *s)
|
|
|
|
{
|
|
|
|
return (char *)s + offsetof(struct btrfs_stripe, dev_uuid);
|
|
|
|
}
|
|
|
|
|
|
|
|
BTRFS_SETGET_STACK_FUNCS(stack_chunk_length, struct btrfs_chunk, length, 64);
|
2008-03-24 15:01:56 -04:00
|
|
|
BTRFS_SETGET_STACK_FUNCS(stack_chunk_owner, struct btrfs_chunk, owner, 64);
|
|
|
|
BTRFS_SETGET_STACK_FUNCS(stack_chunk_stripe_len, struct btrfs_chunk,
|
|
|
|
stripe_len, 64);
|
|
|
|
BTRFS_SETGET_STACK_FUNCS(stack_chunk_io_align, struct btrfs_chunk,
|
|
|
|
io_align, 32);
|
|
|
|
BTRFS_SETGET_STACK_FUNCS(stack_chunk_io_width, struct btrfs_chunk,
|
|
|
|
io_width, 32);
|
|
|
|
BTRFS_SETGET_STACK_FUNCS(stack_chunk_sector_size, struct btrfs_chunk,
|
|
|
|
sector_size, 32);
|
|
|
|
BTRFS_SETGET_STACK_FUNCS(stack_chunk_type, struct btrfs_chunk, type, 64);
|
|
|
|
BTRFS_SETGET_STACK_FUNCS(stack_chunk_num_stripes, struct btrfs_chunk,
|
|
|
|
num_stripes, 16);
|
2008-04-16 10:49:51 -04:00
|
|
|
BTRFS_SETGET_STACK_FUNCS(stack_chunk_sub_stripes, struct btrfs_chunk,
|
|
|
|
sub_stripes, 16);
|
2008-03-24 15:01:56 -04:00
|
|
|
BTRFS_SETGET_STACK_FUNCS(stack_stripe_devid, struct btrfs_stripe, devid, 64);
|
|
|
|
BTRFS_SETGET_STACK_FUNCS(stack_stripe_offset, struct btrfs_stripe, offset, 64);
|
|
|
|
|
|
|
|
static inline struct btrfs_stripe *btrfs_stripe_nr(struct btrfs_chunk *c,
						   int nr)
{
	unsigned long offset = (unsigned long)c;
	offset += offsetof(struct btrfs_chunk, stripe);
	offset += nr * sizeof(struct btrfs_stripe);
	return (struct btrfs_stripe *)offset;
}
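btrfs_stripe_nr() above locates stripe number nr by plain address arithmetic: the chunk's base, plus the offset of the stripe array, plus nr stripe-sized steps. The same calculation can be sketched in userspace as follows. The demo_chunk and demo_stripe layouts are simplified assumptions, not the real on-disk structures.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stripe record. */
struct demo_stripe {
	uint64_t devid;
	uint64_t offset;
	uint8_t dev_uuid[16];
};

/* Hypothetical, simplified chunk header followed by its stripes. */
struct demo_chunk {
	uint64_t length;
	uint64_t num_stripes;
	struct demo_stripe stripe[4];
};

/* Same arithmetic as the helper above: base + array offset + nr * size. */
static struct demo_stripe *demo_stripe_nr(struct demo_chunk *c, int nr)
{
	uintptr_t addr = (uintptr_t)c;

	addr += offsetof(struct demo_chunk, stripe);
	addr += (uintptr_t)nr * sizeof(struct demo_stripe);
	return (struct demo_stripe *)addr;
}
```

With a real in-memory struct, the result equals &c->stripe[nr]. The arithmetic form still works when `c` only encodes an offset into a buffer, which is how the extent-buffer accessors in this header appear to use it.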
|
|
|
|
|
2008-04-18 10:29:38 -04:00
|
|
|
static inline char *btrfs_stripe_dev_uuid_nr(struct btrfs_chunk *c, int nr)
|
|
|
|
{
|
|
|
|
return btrfs_stripe_dev_uuid(btrfs_stripe_nr(c, nr));
|
|
|
|
}
|
|
|
|
|
2008-03-24 15:01:56 -04:00
|
|
|
static inline u64 btrfs_stripe_offset_nr(struct extent_buffer *eb,
|
|
|
|
struct btrfs_chunk *c, int nr)
|
|
|
|
{
|
|
|
|
return btrfs_stripe_offset(eb, btrfs_stripe_nr(c, nr));
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline u64 btrfs_stripe_devid_nr(struct extent_buffer *eb,
|
|
|
|
struct btrfs_chunk *c, int nr)
|
|
|
|
{
|
|
|
|
return btrfs_stripe_devid(eb, btrfs_stripe_nr(c, nr));
|
|
|
|
}
|
|
|
|
|
2007-10-15 16:14:19 -04:00
|
|
|
/* struct btrfs_block_group_item */
|
|
|
|
BTRFS_SETGET_STACK_FUNCS(block_group_used, struct btrfs_block_group_item,
|
|
|
|
used, 64);
|
|
|
|
BTRFS_SETGET_FUNCS(disk_block_group_used, struct btrfs_block_group_item,
|
|
|
|
used, 64);
|
2008-03-24 15:01:56 -04:00
|
|
|
BTRFS_SETGET_STACK_FUNCS(block_group_chunk_objectid,
|
|
|
|
struct btrfs_block_group_item, chunk_objectid, 64);
|
2008-04-15 15:41:47 -04:00
|
|
|
|
|
|
|
BTRFS_SETGET_FUNCS(disk_block_group_chunk_objectid,
|
2008-03-24 15:01:56 -04:00
|
|
|
struct btrfs_block_group_item, chunk_objectid, 64);
|
|
|
|
BTRFS_SETGET_FUNCS(disk_block_group_flags,
|
|
|
|
struct btrfs_block_group_item, flags, 64);
|
|
|
|
BTRFS_SETGET_STACK_FUNCS(block_group_flags,
|
|
|
|
struct btrfs_block_group_item, flags, 64);
|
2007-03-15 19:03:33 -04:00
|
|
|
|
2007-12-12 14:38:19 -05:00
|
|
|
/* struct btrfs_inode_ref */
|
|
|
|
BTRFS_SETGET_FUNCS(inode_ref_name_len, struct btrfs_inode_ref, name_len, 16);
|
2008-07-24 12:12:38 -04:00
|
|
|
BTRFS_SETGET_FUNCS(inode_ref_index, struct btrfs_inode_ref, index, 64);
|
2007-12-12 14:38:19 -05:00
|
|
|
|
2007-10-15 16:14:19 -04:00
|
|
|
/* struct btrfs_inode_item */
|
|
|
|
BTRFS_SETGET_FUNCS(inode_generation, struct btrfs_inode_item, generation, 64);
|
2008-12-08 16:40:21 -05:00
|
|
|
BTRFS_SETGET_FUNCS(inode_sequence, struct btrfs_inode_item, sequence, 64);
|
2008-09-05 16:13:11 -04:00
|
|
|
BTRFS_SETGET_FUNCS(inode_transid, struct btrfs_inode_item, transid, 64);
|
2007-10-15 16:14:19 -04:00
|
|
|
BTRFS_SETGET_FUNCS(inode_size, struct btrfs_inode_item, size, 64);
|
2008-10-09 11:46:29 -04:00
|
|
|
BTRFS_SETGET_FUNCS(inode_nbytes, struct btrfs_inode_item, nbytes, 64);
|
2007-10-15 16:14:19 -04:00
|
|
|
BTRFS_SETGET_FUNCS(inode_block_group, struct btrfs_inode_item, block_group, 64);
|
|
|
|
BTRFS_SETGET_FUNCS(inode_nlink, struct btrfs_inode_item, nlink, 32);
|
|
|
|
BTRFS_SETGET_FUNCS(inode_uid, struct btrfs_inode_item, uid, 32);
|
|
|
|
BTRFS_SETGET_FUNCS(inode_gid, struct btrfs_inode_item, gid, 32);
|
|
|
|
BTRFS_SETGET_FUNCS(inode_mode, struct btrfs_inode_item, mode, 32);
|
2008-03-24 15:01:56 -04:00
|
|
|
BTRFS_SETGET_FUNCS(inode_rdev, struct btrfs_inode_item, rdev, 64);
|
2008-12-02 06:36:08 -05:00
|
|
|
BTRFS_SETGET_FUNCS(inode_flags, struct btrfs_inode_item, flags, 64);
|
2007-03-15 19:03:33 -04:00
|
|
|
|
2008-03-24 15:01:56 -04:00
|
|
|
static inline struct btrfs_timespec *
|
2007-10-15 16:14:19 -04:00
|
|
|
btrfs_inode_atime(struct btrfs_inode_item *inode_item)
|
2007-03-15 19:03:33 -04:00
|
|
|
{
|
2007-10-15 16:14:19 -04:00
|
|
|
unsigned long ptr = (unsigned long)inode_item;
|
|
|
|
ptr += offsetof(struct btrfs_inode_item, atime);
|
2008-03-24 15:01:56 -04:00
|
|
|
return (struct btrfs_timespec *)ptr;
|
2007-03-15 19:03:33 -04:00
|
|
|
}
|
|
|
|
|
2008-03-24 15:01:56 -04:00
|
|
|
static inline struct btrfs_timespec *
|
2007-10-15 16:14:19 -04:00
|
|
|
btrfs_inode_mtime(struct btrfs_inode_item *inode_item)
|
2007-03-15 19:03:33 -04:00
|
|
|
{
|
2007-10-15 16:14:19 -04:00
|
|
|
unsigned long ptr = (unsigned long)inode_item;
|
|
|
|
ptr += offsetof(struct btrfs_inode_item, mtime);
|
2008-03-24 15:01:56 -04:00
|
|
|
return (struct btrfs_timespec *)ptr;
|
2007-03-15 19:03:33 -04:00
|
|
|
}
|
|
|
|
|
2008-03-24 15:01:56 -04:00
|
|
|
static inline struct btrfs_timespec *
|
2007-10-15 16:14:19 -04:00
|
|
|
btrfs_inode_ctime(struct btrfs_inode_item *inode_item)
|
2007-03-15 19:03:33 -04:00
|
|
|
{
|
2007-10-15 16:14:19 -04:00
|
|
|
unsigned long ptr = (unsigned long)inode_item;
|
|
|
|
ptr += offsetof(struct btrfs_inode_item, ctime);
|
2008-03-24 15:01:56 -04:00
|
|
|
return (struct btrfs_timespec *)ptr;
|
2007-03-15 19:03:33 -04:00
|
|
|
}
|
|
|
|
|
2008-03-24 15:01:56 -04:00
|
|
|
BTRFS_SETGET_FUNCS(timespec_sec, struct btrfs_timespec, sec, 64);
|
|
|
|
BTRFS_SETGET_FUNCS(timespec_nsec, struct btrfs_timespec, nsec, 32);
|
2007-03-22 12:13:20 -04:00
|
|
|
|
2008-03-24 15:01:56 -04:00
|
|
|
/* struct btrfs_dev_extent */
|
2008-04-15 15:41:47 -04:00
|
|
|
BTRFS_SETGET_FUNCS(dev_extent_chunk_tree, struct btrfs_dev_extent,
|
|
|
|
chunk_tree, 64);
|
|
|
|
BTRFS_SETGET_FUNCS(dev_extent_chunk_objectid, struct btrfs_dev_extent,
|
|
|
|
chunk_objectid, 64);
|
|
|
|
BTRFS_SETGET_FUNCS(dev_extent_chunk_offset, struct btrfs_dev_extent,
|
|
|
|
chunk_offset, 64);
|
2008-03-24 15:01:56 -04:00
|
|
|
BTRFS_SETGET_FUNCS(dev_extent_length, struct btrfs_dev_extent, length, 64);
|
|
|
|
|
2008-04-15 15:41:47 -04:00
|
|
|
static inline u8 *btrfs_dev_extent_chunk_tree_uuid(struct btrfs_dev_extent *dev)
|
|
|
|
{
|
|
|
|
unsigned long ptr = offsetof(struct btrfs_dev_extent, chunk_tree_uuid);
|
|
|
|
return (u8 *)((unsigned long)dev + ptr);
|
|
|
|
}
|
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allows us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds a per-subvolume red-black tree to keep track of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers fall within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduces the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 10:45:14 -04:00
|
|
|
BTRFS_SETGET_FUNCS(extent_refs, struct btrfs_extent_item, refs, 64);
|
|
|
|
BTRFS_SETGET_FUNCS(extent_generation, struct btrfs_extent_item,
|
|
|
|
generation, 64);
|
|
|
|
BTRFS_SETGET_FUNCS(extent_flags, struct btrfs_extent_item, flags, 64);
|
2007-12-11 09:25:06 -05:00
|
|
|
|
2009-06-10 10:45:14 -04:00
|
|
|
BTRFS_SETGET_FUNCS(extent_refs_v0, struct btrfs_extent_item_v0, refs, 32);
|
|
|
|
|
|
|
|
|
|
|
|
BTRFS_SETGET_FUNCS(tree_block_level, struct btrfs_tree_block_info, level, 8);
|
|
|
|
|
|
|
|
static inline void btrfs_tree_block_key(struct extent_buffer *eb,
|
|
|
|
struct btrfs_tree_block_info *item,
|
|
|
|
struct btrfs_disk_key *key)
|
|
|
|
{
|
|
|
|
read_eb_member(eb, item, struct btrfs_tree_block_info, key, key);
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline void btrfs_set_tree_block_key(struct extent_buffer *eb,
|
|
|
|
struct btrfs_tree_block_info *item,
|
|
|
|
struct btrfs_disk_key *key)
|
|
|
|
{
|
|
|
|
write_eb_member(eb, item, struct btrfs_tree_block_info, key, key);
|
|
|
|
}
|
2007-03-22 12:13:20 -04:00
|
|
|
|
2009-06-10 10:45:14 -04:00
|
|
|
BTRFS_SETGET_FUNCS(extent_data_ref_root, struct btrfs_extent_data_ref,
|
|
|
|
root, 64);
|
|
|
|
BTRFS_SETGET_FUNCS(extent_data_ref_objectid, struct btrfs_extent_data_ref,
|
|
|
|
objectid, 64);
|
|
|
|
BTRFS_SETGET_FUNCS(extent_data_ref_offset, struct btrfs_extent_data_ref,
|
|
|
|
offset, 64);
|
|
|
|
BTRFS_SETGET_FUNCS(extent_data_ref_count, struct btrfs_extent_data_ref,
|
|
|
|
count, 32);
|
|
|
|
|
|
|
|
BTRFS_SETGET_FUNCS(shared_data_ref_count, struct btrfs_shared_data_ref,
|
|
|
|
count, 32);
|
|
|
|
|
|
|
|
BTRFS_SETGET_FUNCS(extent_inline_ref_type, struct btrfs_extent_inline_ref,
|
|
|
|
type, 8);
|
|
|
|
BTRFS_SETGET_FUNCS(extent_inline_ref_offset, struct btrfs_extent_inline_ref,
|
|
|
|
offset, 64);
|
|
|
|
|
|
|
|
static inline u32 btrfs_extent_inline_ref_size(int type)
|
|
|
|
{
|
|
|
|
if (type == BTRFS_TREE_BLOCK_REF_KEY ||
|
|
|
|
type == BTRFS_SHARED_BLOCK_REF_KEY)
|
|
|
|
return sizeof(struct btrfs_extent_inline_ref);
|
|
|
|
if (type == BTRFS_SHARED_DATA_REF_KEY)
|
|
|
|
return sizeof(struct btrfs_shared_data_ref) +
|
|
|
|
sizeof(struct btrfs_extent_inline_ref);
|
|
|
|
if (type == BTRFS_EXTENT_DATA_REF_KEY)
|
|
|
|
return sizeof(struct btrfs_extent_data_ref) +
|
|
|
|
offsetof(struct btrfs_extent_inline_ref, offset);
|
|
|
|
BUG();
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
BTRFS_SETGET_FUNCS(ref_root_v0, struct btrfs_extent_ref_v0, root, 64);
|
|
|
|
BTRFS_SETGET_FUNCS(ref_generation_v0, struct btrfs_extent_ref_v0,
|
|
|
|
generation, 64);
|
|
|
|
BTRFS_SETGET_FUNCS(ref_objectid_v0, struct btrfs_extent_ref_v0, objectid, 64);
|
|
|
|
BTRFS_SETGET_FUNCS(ref_count_v0, struct btrfs_extent_ref_v0, count, 32);
|
2007-03-22 12:13:20 -04:00
|
|
|
|
2007-10-15 16:14:19 -04:00
|
|
|
/* struct btrfs_node */
|
|
|
|
BTRFS_SETGET_FUNCS(key_blockptr, struct btrfs_key_ptr, blockptr, 64);
|
2007-12-11 09:25:06 -05:00
|
|
|
BTRFS_SETGET_FUNCS(key_generation, struct btrfs_key_ptr, generation, 64);
|
2007-03-22 12:13:20 -04:00
|
|
|
|
2007-10-15 16:14:19 -04:00
|
|
|
static inline u64 btrfs_node_blockptr(struct extent_buffer *eb, int nr)
|
2007-03-13 09:49:06 -04:00
|
|
|
{
|
2007-10-15 16:14:19 -04:00
|
|
|
unsigned long ptr;
|
|
|
|
ptr = offsetof(struct btrfs_node, ptrs) +
|
|
|
|
sizeof(struct btrfs_key_ptr) * nr;
|
|
|
|
return btrfs_key_blockptr(eb, (struct btrfs_key_ptr *)ptr);
|
2007-03-13 09:49:06 -04:00
|
|
|
}
|
|
|
|
|
2007-10-15 16:14:19 -04:00
|
|
|
static inline void btrfs_set_node_blockptr(struct extent_buffer *eb,
|
|
|
|
int nr, u64 val)
|
2007-03-13 09:49:06 -04:00
|
|
|
{
|
2007-10-15 16:14:19 -04:00
|
|
|
unsigned long ptr;
|
|
|
|
ptr = offsetof(struct btrfs_node, ptrs) +
|
|
|
|
sizeof(struct btrfs_key_ptr) * nr;
|
|
|
|
btrfs_set_key_blockptr(eb, (struct btrfs_key_ptr *)ptr, val);
|
2007-03-13 09:49:06 -04:00
|
|
|
}
|
|
|
|
|
2007-12-11 09:25:06 -05:00
|
|
|
static inline u64 btrfs_node_ptr_generation(struct extent_buffer *eb, int nr)
|
|
|
|
{
|
|
|
|
unsigned long ptr;
|
|
|
|
ptr = offsetof(struct btrfs_node, ptrs) +
|
|
|
|
sizeof(struct btrfs_key_ptr) * nr;
|
|
|
|
return btrfs_key_generation(eb, (struct btrfs_key_ptr *)ptr);
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline void btrfs_set_node_ptr_generation(struct extent_buffer *eb,
|
|
|
|
int nr, u64 val)
|
|
|
|
{
|
|
|
|
unsigned long ptr;
|
|
|
|
ptr = offsetof(struct btrfs_node, ptrs) +
|
|
|
|
sizeof(struct btrfs_key_ptr) * nr;
|
|
|
|
btrfs_set_key_generation(eb, (struct btrfs_key_ptr *)ptr, val);
|
|
|
|
}
|
|
|
|
|
2007-10-15 16:18:55 -04:00
|
|
|
static inline unsigned long btrfs_node_key_ptr_offset(int nr)
|
2007-04-20 20:23:12 -04:00
|
|
|
{
|
2007-10-15 16:14:19 -04:00
|
|
|
return offsetof(struct btrfs_node, ptrs) +
|
|
|
|
sizeof(struct btrfs_key_ptr) * nr;
|
2007-04-20 20:23:12 -04:00
|
|
|
}
|
|
|
|
|
2007-11-06 15:09:29 -05:00
|
|
|
void btrfs_node_key(struct extent_buffer *eb,
|
|
|
|
struct btrfs_disk_key *disk_key, int nr);
|
|
|
|
|
2007-10-15 16:14:19 -04:00
|
|
|
static inline void btrfs_set_node_key(struct extent_buffer *eb,
|
|
|
|
struct btrfs_disk_key *disk_key, int nr)
|
2007-03-13 09:28:32 -04:00
|
|
|
{
|
2007-10-15 16:14:19 -04:00
|
|
|
unsigned long ptr;
|
|
|
|
ptr = btrfs_node_key_ptr_offset(nr);
|
|
|
|
write_eb_member(eb, (struct btrfs_key_ptr *)ptr,
|
|
|
|
struct btrfs_key_ptr, key, disk_key);
|
2007-03-13 09:28:32 -04:00
|
|
|
}
|
|
|
|
|
2007-10-15 16:14:19 -04:00
|
|
|
/* struct btrfs_item */
|
|
|
|
BTRFS_SETGET_FUNCS(item_offset, struct btrfs_item, offset, 32);
|
|
|
|
BTRFS_SETGET_FUNCS(item_size, struct btrfs_item, size, 32);
|
2007-04-20 20:23:12 -04:00
|
|
|
|
2007-10-15 16:14:19 -04:00
|
|
|
static inline unsigned long btrfs_item_nr_offset(int nr)
|
2007-03-13 09:28:32 -04:00
|
|
|
{
|
2007-10-15 16:14:19 -04:00
|
|
|
return offsetof(struct btrfs_leaf, items) +
|
|
|
|
sizeof(struct btrfs_item) * nr;
|
2007-03-13 09:28:32 -04:00
|
|
|
}
|
|
|
|
|
2007-10-15 16:14:19 -04:00
|
|
|
static inline struct btrfs_item *btrfs_item_nr(struct extent_buffer *eb,
|
|
|
|
int nr)
|
2007-03-12 20:12:07 -04:00
|
|
|
{
|
2007-10-15 16:14:19 -04:00
|
|
|
return (struct btrfs_item *)btrfs_item_nr_offset(nr);
|
2007-03-12 20:12:07 -04:00
|
|
|
}
|
|
|
|
|
2007-10-15 16:14:19 -04:00
|
|
|
static inline u32 btrfs_item_end(struct extent_buffer *eb,
|
|
|
|
struct btrfs_item *item)
|
2007-03-12 20:12:07 -04:00
|
|
|
{
|
2007-10-15 16:14:19 -04:00
|
|
|
return btrfs_item_offset(eb, item) + btrfs_item_size(eb, item);
|
2007-03-12 20:12:07 -04:00
|
|
|
}
|
|
|
|
|
2007-10-15 16:14:19 -04:00
|
|
|
static inline u32 btrfs_item_end_nr(struct extent_buffer *eb, int nr)
|
2007-03-12 20:12:07 -04:00
|
|
|
{
|
2007-10-15 16:14:19 -04:00
|
|
|
return btrfs_item_end(eb, btrfs_item_nr(eb, nr));
|
2007-03-12 20:12:07 -04:00
|
|
|
}
|
|
|
|
|
2007-10-15 16:14:19 -04:00
|
|
|
static inline u32 btrfs_item_offset_nr(struct extent_buffer *eb, int nr)
|
2007-03-12 20:12:07 -04:00
|
|
|
{
|
2007-10-15 16:14:19 -04:00
|
|
|
return btrfs_item_offset(eb, btrfs_item_nr(eb, nr));
|
2007-03-12 20:12:07 -04:00
|
|
|
}
|
|
|
|
|
2007-10-15 16:14:19 -04:00
|
|
|
static inline u32 btrfs_item_size_nr(struct extent_buffer *eb, int nr)
|
2007-03-12 20:12:07 -04:00
|
|
|
{
|
2007-10-15 16:14:19 -04:00
|
|
|
return btrfs_item_size(eb, btrfs_item_nr(eb, nr));
|
2007-03-12 20:12:07 -04:00
|
|
|
}
|
|
|
|
|
2007-10-15 16:14:19 -04:00
|
|
|
static inline void btrfs_item_key(struct extent_buffer *eb,
|
|
|
|
struct btrfs_disk_key *disk_key, int nr)
|
2007-03-15 15:18:43 -04:00
|
|
|
{
|
2007-10-15 16:14:19 -04:00
|
|
|
struct btrfs_item *item = btrfs_item_nr(eb, nr);
|
|
|
|
read_eb_member(eb, item, struct btrfs_item, key, disk_key);
|
2007-03-15 15:18:43 -04:00
|
|
|
}
|
|
|
|
|
2007-10-15 16:14:19 -04:00
|
|
|
static inline void btrfs_set_item_key(struct extent_buffer *eb,
|
|
|
|
struct btrfs_disk_key *disk_key, int nr)
|
2007-03-15 15:18:43 -04:00
|
|
|
{
|
2007-10-15 16:14:19 -04:00
|
|
|
struct btrfs_item *item = btrfs_item_nr(eb, nr);
|
|
|
|
write_eb_member(eb, item, struct btrfs_item, key, disk_key);
|
2007-03-15 15:18:43 -04:00
|
|
|
}
|
|
|
|
|
2008-09-05 16:13:11 -04:00
|
|
|
BTRFS_SETGET_FUNCS(dir_log_end, struct btrfs_dir_log_item, end, 64);
|
|
|
|
|
2008-11-17 20:37:39 -05:00
|
|
|
/*
|
|
|
|
* struct btrfs_root_ref
|
|
|
|
*/
|
|
|
|
BTRFS_SETGET_FUNCS(root_ref_dirid, struct btrfs_root_ref, dirid, 64);
|
|
|
|
BTRFS_SETGET_FUNCS(root_ref_sequence, struct btrfs_root_ref, sequence, 64);
|
|
|
|
BTRFS_SETGET_FUNCS(root_ref_name_len, struct btrfs_root_ref, name_len, 16);
|
|
|
|
|
2007-10-15 16:14:19 -04:00
|
|
|
/* struct btrfs_dir_item */
|
2007-11-16 11:45:54 -05:00
|
|
|
BTRFS_SETGET_FUNCS(dir_data_len, struct btrfs_dir_item, data_len, 16);
|
2007-10-15 16:14:19 -04:00
|
|
|
BTRFS_SETGET_FUNCS(dir_type, struct btrfs_dir_item, type, 8);
|
|
|
|
BTRFS_SETGET_FUNCS(dir_name_len, struct btrfs_dir_item, name_len, 16);
|
2008-09-05 16:13:11 -04:00
|
|
|
BTRFS_SETGET_FUNCS(dir_transid, struct btrfs_dir_item, transid, 64);
|
2007-03-15 15:18:43 -04:00
|
|
|
|
2007-10-15 16:14:19 -04:00
|
|
|
static inline void btrfs_dir_item_key(struct extent_buffer *eb,
|
|
|
|
struct btrfs_dir_item *item,
|
|
|
|
struct btrfs_disk_key *key)
|
2007-03-15 15:18:43 -04:00
|
|
|
{
|
2007-10-15 16:14:19 -04:00
|
|
|
read_eb_member(eb, item, struct btrfs_dir_item, location, key);
|
2007-03-15 15:18:43 -04:00
|
|
|
}
|
|
|
|
|
2007-10-15 16:14:19 -04:00
|
|
|
static inline void btrfs_set_dir_item_key(struct extent_buffer *eb,
|
|
|
|
struct btrfs_dir_item *item,
|
|
|
|
struct btrfs_disk_key *key)
|
2007-03-16 08:46:49 -04:00
|
|
|
{
|
2007-10-15 16:14:19 -04:00
|
|
|
write_eb_member(eb, item, struct btrfs_dir_item, location, key);
|
2007-03-16 08:46:49 -04:00
|
|
|
}
|
|
|
|
|
2010-06-21 14:48:16 -04:00
|
|
|
BTRFS_SETGET_FUNCS(free_space_entries, struct btrfs_free_space_header,
|
|
|
|
num_entries, 64);
|
|
|
|
BTRFS_SETGET_FUNCS(free_space_bitmaps, struct btrfs_free_space_header,
|
|
|
|
num_bitmaps, 64);
|
|
|
|
BTRFS_SETGET_FUNCS(free_space_generation, struct btrfs_free_space_header,
|
|
|
|
generation, 64);
|
|
|
|
|
|
|
|
static inline void btrfs_free_space_key(struct extent_buffer *eb,
|
|
|
|
struct btrfs_free_space_header *h,
|
|
|
|
struct btrfs_disk_key *key)
|
|
|
|
{
|
|
|
|
read_eb_member(eb, h, struct btrfs_free_space_header, location, key);
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline void btrfs_set_free_space_key(struct extent_buffer *eb,
|
|
|
|
struct btrfs_free_space_header *h,
|
|
|
|
struct btrfs_disk_key *key)
|
|
|
|
{
|
|
|
|
write_eb_member(eb, h, struct btrfs_free_space_header, location, key);
|
|
|
|
}
|
|
|
|
|
2007-10-15 16:14:19 -04:00
|
|
|
/* struct btrfs_disk_key */
|
|
|
|
BTRFS_SETGET_STACK_FUNCS(disk_key_objectid, struct btrfs_disk_key,
|
|
|
|
objectid, 64);
|
|
|
|
BTRFS_SETGET_STACK_FUNCS(disk_key_offset, struct btrfs_disk_key, offset, 64);
|
|
|
|
BTRFS_SETGET_STACK_FUNCS(disk_key_type, struct btrfs_disk_key, type, 8);
|
2007-03-15 15:18:43 -04:00
|
|
|
|
2007-03-12 16:22:34 -04:00
|
|
|
static inline void btrfs_disk_key_to_cpu(struct btrfs_key *cpu,
|
|
|
|
struct btrfs_disk_key *disk)
|
|
|
|
{
|
|
|
|
cpu->offset = le64_to_cpu(disk->offset);
|
2007-10-15 16:14:19 -04:00
|
|
|
cpu->type = disk->type;
|
2007-03-12 16:22:34 -04:00
|
|
|
cpu->objectid = le64_to_cpu(disk->objectid);
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline void btrfs_cpu_key_to_disk(struct btrfs_disk_key *disk,
|
|
|
|
struct btrfs_key *cpu)
|
|
|
|
{
|
|
|
|
disk->offset = cpu_to_le64(cpu->offset);
|
2007-10-15 16:14:19 -04:00
|
|
|
disk->type = cpu->type;
|
2007-03-12 16:22:34 -04:00
|
|
|
disk->objectid = cpu_to_le64(cpu->objectid);
|
|
|
|
}
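The two converters above translate between the CPU-order key and the little-endian on-disk key. A minimal user-space sketch of the same round trip follows. The `demo_*` names, the byte-array layout of `struct demo_disk_key`, and the `put_le64`/`get_le64` helpers are illustrative assumptions standing in for the kernel's `struct btrfs_disk_key`, `cpu_to_le64()` and `le64_to_cpu()`; they are not part of this header.

```c
#include <stdint.h>

/* Hypothetical stand-in for struct btrfs_key (CPU byte order). */
struct demo_key {
	uint64_t objectid;
	uint8_t type;
	uint64_t offset;
};

/* Hypothetical stand-in for struct btrfs_disk_key: 17 bytes, little-endian. */
struct demo_disk_key {
	uint8_t objectid[8];
	uint8_t type;
	uint8_t offset[8];
};

/* Store v as 8 little-endian bytes, independent of host byte order. */
static void put_le64(uint8_t *dst, uint64_t v)
{
	for (int i = 0; i < 8; i++)
		dst[i] = (uint8_t)(v >> (8 * i));
}

/* Load 8 little-endian bytes into a host-order value. */
static uint64_t get_le64(const uint8_t *src)
{
	uint64_t v = 0;

	for (int i = 0; i < 8; i++)
		v |= (uint64_t)src[i] << (8 * i);
	return v;
}

/* Same field-by-field copy as btrfs_cpu_key_to_disk(). */
static void demo_cpu_key_to_disk(struct demo_disk_key *disk,
				 const struct demo_key *cpu)
{
	put_le64(disk->offset, cpu->offset);
	disk->type = cpu->type;
	put_le64(disk->objectid, cpu->objectid);
}

/* Same field-by-field copy as btrfs_disk_key_to_cpu(). */
static void demo_disk_key_to_cpu(struct demo_key *cpu,
				 const struct demo_disk_key *disk)
{
	cpu->offset = get_le64(disk->offset);
	cpu->type = disk->type;
	cpu->objectid = get_le64(disk->objectid);
}
```

The one-byte `type` field needs no conversion, which is why both kernel helpers copy it directly.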
|
|
|
|
|
2007-10-15 16:14:19 -04:00
|
|
|
static inline void btrfs_node_key_to_cpu(struct extent_buffer *eb,
|
|
|
|
struct btrfs_key *key, int nr)
|
2007-03-23 15:56:19 -04:00
|
|
|
{
|
2007-10-15 16:14:19 -04:00
|
|
|
struct btrfs_disk_key disk_key;
|
|
|
|
btrfs_node_key(eb, &disk_key, nr);
|
|
|
|
btrfs_disk_key_to_cpu(key, &disk_key);
|
2007-03-23 15:56:19 -04:00
|
|
|
}
|
|
|
|
|
2007-10-15 16:14:19 -04:00
|
|
|
static inline void btrfs_item_key_to_cpu(struct extent_buffer *eb,
|
|
|
|
struct btrfs_key *key, int nr)
|
2007-03-23 15:56:19 -04:00
|
|
|
{
|
2007-10-15 16:14:19 -04:00
|
|
|
struct btrfs_disk_key disk_key;
|
|
|
|
btrfs_item_key(eb, &disk_key, nr);
|
|
|
|
btrfs_disk_key_to_cpu(key, &disk_key);
|
2007-03-23 15:56:19 -04:00
|
|
|
}
|
|
|
|
|
2007-10-15 16:14:19 -04:00
|
|
|
static inline void btrfs_dir_item_key_to_cpu(struct extent_buffer *eb,
|
|
|
|
struct btrfs_dir_item *item,
|
|
|
|
struct btrfs_key *key)
|
2007-04-20 20:23:12 -04:00
|
|
|
{
|
2007-10-15 16:14:19 -04:00
|
|
|
struct btrfs_disk_key disk_key;
|
|
|
|
btrfs_dir_item_key(eb, item, &disk_key);
|
|
|
|
btrfs_disk_key_to_cpu(key, &disk_key);
|
2007-04-20 20:23:12 -04:00
|
|
|
}
|
|
|
|
|
2007-08-29 15:47:34 -04:00
|
|
|
|
2007-10-15 16:14:19 -04:00
|
|
|
static inline u8 btrfs_key_type(struct btrfs_key *key)
|
2007-03-13 16:47:54 -04:00
|
|
|
{
|
2007-10-15 16:14:19 -04:00
|
|
|
return key->type;
|
2007-03-13 16:47:54 -04:00
|
|
|
}
|
|
|
|
|
2007-10-15 16:14:19 -04:00
|
|
|
static inline void btrfs_set_key_type(struct btrfs_key *key, u8 val)
|
2007-03-13 16:47:54 -04:00
|
|
|
{
|
2007-10-15 16:14:19 -04:00
|
|
|
key->type = val;
|
2007-03-13 16:47:54 -04:00
|
|
|
}
|
|
|
|
|
2007-10-15 16:14:19 -04:00
|
|
|
/* struct btrfs_header */
|
2007-10-15 16:15:53 -04:00
|
|
|
BTRFS_SETGET_HEADER_FUNCS(header_bytenr, struct btrfs_header, bytenr, 64);
|
2007-10-15 16:14:19 -04:00
|
|
|
BTRFS_SETGET_HEADER_FUNCS(header_generation, struct btrfs_header,
|
|
|
|
generation, 64);
|
|
|
|
BTRFS_SETGET_HEADER_FUNCS(header_owner, struct btrfs_header, owner, 64);
|
|
|
|
BTRFS_SETGET_HEADER_FUNCS(header_nritems, struct btrfs_header, nritems, 32);
|
2008-04-01 11:21:32 -04:00
|
|
|
BTRFS_SETGET_HEADER_FUNCS(header_flags, struct btrfs_header, flags, 64);
|
2007-10-15 16:14:19 -04:00
|
|
|
BTRFS_SETGET_HEADER_FUNCS(header_level, struct btrfs_header, level, 8);
|
2007-04-09 10:42:37 -04:00
|
|
|
|
2008-04-01 11:21:32 -04:00
|
|
|
static inline int btrfs_header_flag(struct extent_buffer *eb, u64 flag)
|
|
|
|
{
|
|
|
|
return (btrfs_header_flags(eb) & flag) == flag;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline int btrfs_set_header_flag(struct extent_buffer *eb, u64 flag)
|
|
|
|
{
|
|
|
|
u64 flags = btrfs_header_flags(eb);
|
|
|
|
btrfs_set_header_flags(eb, flags | flag);
|
|
|
|
return (flags & flag) == flag;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline int btrfs_clear_header_flag(struct extent_buffer *eb, u64 flag)
|
|
|
|
{
|
|
|
|
u64 flags = btrfs_header_flags(eb);
|
|
|
|
btrfs_set_header_flags(eb, flags & ~flag);
|
|
|
|
return (flags & flag) == flag;
|
|
|
|
}
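Both flag helpers above return whether the flag was set before the update, so a caller can use them as test-and-set or test-and-clear. The sketch below shows that behaviour on a plain `uint64_t`. The `demo_*` functions are hypothetical and operate on a local variable rather than on an extent buffer header.

```c
#include <stdint.h>

/* Set flag in *flags; return 1 if every bit of flag was already set. */
static int demo_set_flag(uint64_t *flags, uint64_t flag)
{
	uint64_t old = *flags;

	*flags = old | flag;
	return (old & flag) == flag;
}

/* Clear flag in *flags; return 1 if every bit of flag was set before. */
static int demo_clear_flag(uint64_t *flags, uint64_t flag)
{
	uint64_t old = *flags;

	*flags = old & ~flag;
	return (old & flag) == flag;
}
```

Note that neither the sketch nor the helpers above are atomic: the read and the write are separate steps, so callers presumably need their own serialization.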
|
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allows us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds a per-subvolume red-black tree to keep track of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers fall within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduces the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 10:45:14 -04:00
|
|
|
static inline int btrfs_header_backref_rev(struct extent_buffer *eb)
|
|
|
|
{
|
|
|
|
u64 flags = btrfs_header_flags(eb);
|
|
|
|
return flags >> BTRFS_BACKREF_REV_SHIFT;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline void btrfs_set_header_backref_rev(struct extent_buffer *eb,
|
|
|
|
int rev)
|
|
|
|
{
|
|
|
|
u64 flags = btrfs_header_flags(eb);
|
|
|
|
flags &= ~BTRFS_BACKREF_REV_MASK;
|
|
|
|
flags |= (u64)rev << BTRFS_BACKREF_REV_SHIFT;
|
|
|
|
btrfs_set_header_flags(eb, flags);
|
|
|
|
}
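The two helpers above keep the back reference revision in the high bits of the 64-bit header flags. The following sketch shows the same mask-and-shift packing on a plain integer. `DEMO_REV_SHIFT` and `DEMO_REV_MASK` are assumed values (top 8 bits of the word) chosen for illustration; the real `BTRFS_BACKREF_REV_SHIFT` and `BTRFS_BACKREF_REV_MASK` are defined elsewhere in this header and are not shown in this section.

```c
#include <stdint.h>

/* Assumed layout for illustration: revision lives in the top 8 bits. */
#define DEMO_REV_SHIFT 56
#define DEMO_REV_MASK  (((uint64_t)0xff) << DEMO_REV_SHIFT)

/* Extract the revision from the high bits of flags. */
static int demo_backref_rev(uint64_t flags)
{
	return (int)(flags >> DEMO_REV_SHIFT);
}

/* Replace the revision bits and leave all lower flag bits untouched. */
static uint64_t demo_set_backref_rev(uint64_t flags, int rev)
{
	flags &= ~DEMO_REV_MASK;
	flags |= (uint64_t)rev << DEMO_REV_SHIFT;
	return flags;
}
```

Clearing the mask before OR-ing in the new value is what allows a revision to be lowered as well as raised.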
|
|
|
|
|
2007-10-15 16:14:19 -04:00
|
|
|
static inline u8 *btrfs_header_fsid(struct extent_buffer *eb)
|
2007-04-09 10:42:37 -04:00
|
|
|
{
|
2007-10-15 16:14:19 -04:00
|
|
|
unsigned long ptr = offsetof(struct btrfs_header, fsid);
|
|
|
|
return (u8 *)ptr;
|
2007-04-09 10:42:37 -04:00
|
|
|
}
|
|
|
|
|
2008-04-15 15:41:47 -04:00
|
|
|
static inline u8 *btrfs_header_chunk_tree_uuid(struct extent_buffer *eb)
|
|
|
|
{
|
|
|
|
unsigned long ptr = offsetof(struct btrfs_header, chunk_tree_uuid);
|
|
|
|
return (u8 *)ptr;
|
|
|
|
}
|
|
|
|
|
2007-10-15 16:14:19 -04:00
|
|
|
static inline int btrfs_is_leaf(struct extent_buffer *eb)
|
2007-03-13 16:47:54 -04:00
|
|
|
{
|
2009-01-05 21:25:51 -05:00
|
|
|
return btrfs_header_level(eb) == 0;
|
2007-03-13 16:47:54 -04:00
|
|
|
}
|
|
|
|
|
2007-10-15 16:14:19 -04:00
|
|
|
/* struct btrfs_root_item */
|
2008-10-29 14:49:05 -04:00
|
|
|
BTRFS_SETGET_FUNCS(disk_root_generation, struct btrfs_root_item,
|
|
|
|
generation, 64);
|
2007-10-15 16:14:19 -04:00
|
|
|
BTRFS_SETGET_FUNCS(disk_root_refs, struct btrfs_root_item, refs, 32);
|
2007-10-15 16:15:53 -04:00
|
|
|
BTRFS_SETGET_FUNCS(disk_root_bytenr, struct btrfs_root_item, bytenr, 64);
|
|
|
|
BTRFS_SETGET_FUNCS(disk_root_level, struct btrfs_root_item, level, 8);
|
2007-03-13 16:47:54 -04:00
|
|
|
|
2008-10-29 14:49:05 -04:00
|
|
|
BTRFS_SETGET_STACK_FUNCS(root_generation, struct btrfs_root_item,
|
|
|
|
generation, 64);
|
2007-10-15 16:15:53 -04:00
|
|
|
BTRFS_SETGET_STACK_FUNCS(root_bytenr, struct btrfs_root_item, bytenr, 64);
|
|
|
|
BTRFS_SETGET_STACK_FUNCS(root_level, struct btrfs_root_item, level, 8);
|
2007-10-15 16:14:19 -04:00
|
|
|
BTRFS_SETGET_STACK_FUNCS(root_dirid, struct btrfs_root_item, root_dirid, 64);
|
|
|
|
BTRFS_SETGET_STACK_FUNCS(root_refs, struct btrfs_root_item, refs, 32);
|
2008-12-02 06:36:08 -05:00
|
|
|
BTRFS_SETGET_STACK_FUNCS(root_flags, struct btrfs_root_item, flags, 64);
|
2007-10-15 16:15:53 -04:00
|
|
|
BTRFS_SETGET_STACK_FUNCS(root_used, struct btrfs_root_item, bytes_used, 64);
|
|
|
|
BTRFS_SETGET_STACK_FUNCS(root_limit, struct btrfs_root_item, byte_limit, 64);
|
2008-10-30 14:20:02 -04:00
|
|
|
BTRFS_SETGET_STACK_FUNCS(root_last_snapshot, struct btrfs_root_item,
|
|
|
|
last_snapshot, 64);
|
2007-03-14 14:14:43 -04:00
|
|
|
|
2010-12-20 16:04:08 +08:00
|
|
|
static inline bool btrfs_root_readonly(struct btrfs_root *root)
|
|
|
|
{
|
|
|
|
return (btrfs_root_flags(&root->root_item) & BTRFS_ROOT_SUBVOL_RDONLY) != 0;
|
|
|
|
}
|
|
|
|
|
2011-11-03 15:17:42 -04:00
|
|
|
/* struct btrfs_root_backup */
|
|
|
|
BTRFS_SETGET_STACK_FUNCS(backup_tree_root, struct btrfs_root_backup,
|
|
|
|
tree_root, 64);
|
|
|
|
BTRFS_SETGET_STACK_FUNCS(backup_tree_root_gen, struct btrfs_root_backup,
|
|
|
|
tree_root_gen, 64);
|
|
|
|
BTRFS_SETGET_STACK_FUNCS(backup_tree_root_level, struct btrfs_root_backup,
|
|
|
|
tree_root_level, 8);
|
|
|
|
|
|
|
|
BTRFS_SETGET_STACK_FUNCS(backup_chunk_root, struct btrfs_root_backup,
|
|
|
|
chunk_root, 64);
|
|
|
|
BTRFS_SETGET_STACK_FUNCS(backup_chunk_root_gen, struct btrfs_root_backup,
|
|
|
|
chunk_root_gen, 64);
|
|
|
|
BTRFS_SETGET_STACK_FUNCS(backup_chunk_root_level, struct btrfs_root_backup,
|
|
|
|
chunk_root_level, 8);
|
|
|
|
|
|
|
|
BTRFS_SETGET_STACK_FUNCS(backup_extent_root, struct btrfs_root_backup,
|
|
|
|
extent_root, 64);
|
|
|
|
BTRFS_SETGET_STACK_FUNCS(backup_extent_root_gen, struct btrfs_root_backup,
|
|
|
|
extent_root_gen, 64);
|
|
|
|
BTRFS_SETGET_STACK_FUNCS(backup_extent_root_level, struct btrfs_root_backup,
|
|
|
|
extent_root_level, 8);
|
|
|
|
|
|
|
|
BTRFS_SETGET_STACK_FUNCS(backup_fs_root, struct btrfs_root_backup,
|
|
|
|
fs_root, 64);
|
|
|
|
BTRFS_SETGET_STACK_FUNCS(backup_fs_root_gen, struct btrfs_root_backup,
|
|
|
|
fs_root_gen, 64);
|
|
|
|
BTRFS_SETGET_STACK_FUNCS(backup_fs_root_level, struct btrfs_root_backup,
|
|
|
|
fs_root_level, 8);
|
|
|
|
|
|
|
|
BTRFS_SETGET_STACK_FUNCS(backup_dev_root, struct btrfs_root_backup,
|
|
|
|
dev_root, 64);
|
|
|
|
BTRFS_SETGET_STACK_FUNCS(backup_dev_root_gen, struct btrfs_root_backup,
|
|
|
|
dev_root_gen, 64);
|
|
|
|
BTRFS_SETGET_STACK_FUNCS(backup_dev_root_level, struct btrfs_root_backup,
|
|
|
|
dev_root_level, 8);
|
|
|
|
|
|
|
|
BTRFS_SETGET_STACK_FUNCS(backup_csum_root, struct btrfs_root_backup,
|
|
|
|
csum_root, 64);
|
|
|
|
BTRFS_SETGET_STACK_FUNCS(backup_csum_root_gen, struct btrfs_root_backup,
|
|
|
|
csum_root_gen, 64);
|
|
|
|
BTRFS_SETGET_STACK_FUNCS(backup_csum_root_level, struct btrfs_root_backup,
|
|
|
|
csum_root_level, 8);
|
|
|
|
BTRFS_SETGET_STACK_FUNCS(backup_total_bytes, struct btrfs_root_backup,
|
|
|
|
total_bytes, 64);
|
|
|
|
BTRFS_SETGET_STACK_FUNCS(backup_bytes_used, struct btrfs_root_backup,
|
|
|
|
bytes_used, 64);
|
|
|
|
BTRFS_SETGET_STACK_FUNCS(backup_num_devices, struct btrfs_root_backup,
|
|
|
|
num_devices, 64);
|
|
|
|
|
2007-10-15 16:14:19 -04:00
|
|
|
/* struct btrfs_super_block */
|
2008-12-02 07:17:45 -05:00
|
|
|
|
2007-10-15 16:15:53 -04:00
|
|
|
BTRFS_SETGET_STACK_FUNCS(super_bytenr, struct btrfs_super_block, bytenr, 64);
|
2008-05-07 11:43:44 -04:00
|
|
|
BTRFS_SETGET_STACK_FUNCS(super_flags, struct btrfs_super_block, flags, 64);
|
2007-10-15 16:14:19 -04:00
|
|
|
BTRFS_SETGET_STACK_FUNCS(super_generation, struct btrfs_super_block,
|
|
|
|
generation, 64);
|
|
|
|
BTRFS_SETGET_STACK_FUNCS(super_root, struct btrfs_super_block, root, 64);
|
2008-03-24 15:01:56 -04:00
|
|
|
BTRFS_SETGET_STACK_FUNCS(super_sys_array_size,
|
|
|
|
struct btrfs_super_block, sys_chunk_array_size, 32);
|
2008-10-29 14:49:05 -04:00
|
|
|
BTRFS_SETGET_STACK_FUNCS(super_chunk_root_generation,
|
|
|
|
struct btrfs_super_block, chunk_root_generation, 64);
|
2007-10-15 16:15:53 -04:00
|
|
|
BTRFS_SETGET_STACK_FUNCS(super_root_level, struct btrfs_super_block,
|
|
|
|
root_level, 8);
|
2008-03-24 15:01:56 -04:00
|
|
|
BTRFS_SETGET_STACK_FUNCS(super_chunk_root, struct btrfs_super_block,
|
|
|
|
chunk_root, 64);
|
|
|
|
BTRFS_SETGET_STACK_FUNCS(super_chunk_root_level, struct btrfs_super_block,
|
2008-09-05 16:13:11 -04:00
|
|
|
chunk_root_level, 8);
|
|
|
|
BTRFS_SETGET_STACK_FUNCS(super_log_root, struct btrfs_super_block,
|
|
|
|
log_root, 64);
|
2008-12-08 16:40:21 -05:00
|
|
|
BTRFS_SETGET_STACK_FUNCS(super_log_root_transid, struct btrfs_super_block,
|
|
|
|
log_root_transid, 64);
|
2008-09-05 16:13:11 -04:00
|
|
|
BTRFS_SETGET_STACK_FUNCS(super_log_root_level, struct btrfs_super_block,
|
|
|
|
log_root_level, 8);
|
2007-10-15 16:15:53 -04:00
|
|
|
BTRFS_SETGET_STACK_FUNCS(super_total_bytes, struct btrfs_super_block,
|
|
|
|
total_bytes, 64);
|
|
|
|
BTRFS_SETGET_STACK_FUNCS(super_bytes_used, struct btrfs_super_block,
|
|
|
|
bytes_used, 64);
|
2007-10-15 16:14:19 -04:00
|
|
|
BTRFS_SETGET_STACK_FUNCS(super_sectorsize, struct btrfs_super_block,
|
|
|
|
sectorsize, 32);
|
|
|
|
BTRFS_SETGET_STACK_FUNCS(super_nodesize, struct btrfs_super_block,
|
|
|
|
nodesize, 32);
|
|
|
|
BTRFS_SETGET_STACK_FUNCS(super_leafsize, struct btrfs_super_block,
|
|
|
|
leafsize, 32);
|
2007-11-30 11:30:34 -05:00
|
|
|
BTRFS_SETGET_STACK_FUNCS(super_stripesize, struct btrfs_super_block,
|
|
|
|
stripesize, 32);
|
2007-10-15 16:14:19 -04:00
|
|
|
BTRFS_SETGET_STACK_FUNCS(super_root_dir, struct btrfs_super_block,
|
|
|
|
root_dir_objectid, 64);
|
2008-03-24 15:02:07 -04:00
|
|
|
BTRFS_SETGET_STACK_FUNCS(super_num_devices, struct btrfs_super_block,
|
|
|
|
num_devices, 64);
|
2008-12-02 06:36:08 -05:00
|
|
|
BTRFS_SETGET_STACK_FUNCS(super_compat_flags, struct btrfs_super_block,
|
|
|
|
compat_flags, 64);
|
|
|
|
BTRFS_SETGET_STACK_FUNCS(super_compat_ro_flags, struct btrfs_super_block,
|
2009-12-17 21:32:27 +00:00
|
|
|
compat_ro_flags, 64);
|
2008-12-02 06:36:08 -05:00
|
|
|
BTRFS_SETGET_STACK_FUNCS(super_incompat_flags, struct btrfs_super_block,
|
|
|
|
incompat_flags, 64);
|
2008-12-02 07:17:45 -05:00
|
|
|
BTRFS_SETGET_STACK_FUNCS(super_csum_type, struct btrfs_super_block,
|
|
|
|
csum_type, 16);
|
2010-06-21 14:48:16 -04:00
|
|
|
BTRFS_SETGET_STACK_FUNCS(super_cache_generation, struct btrfs_super_block,
|
|
|
|
cache_generation, 64);
|
2008-12-02 07:17:45 -05:00
|
|
|
|
|
|
|
static inline int btrfs_super_csum_size(struct btrfs_super_block *s)
|
|
|
|
{
|
|
|
|
int t = btrfs_super_csum_type(s);
|
|
|
|
BUG_ON(t >= ARRAY_SIZE(btrfs_csum_sizes));
|
|
|
|
return btrfs_csum_sizes[t];
|
|
|
|
}
|
2007-03-21 11:12:56 -04:00
|
|
|
|
2007-10-15 16:14:19 -04:00
|
|
|
static inline unsigned long btrfs_leaf_data(struct extent_buffer *l)
|
2007-03-21 11:12:56 -04:00
|
|
|
{
|
2007-10-15 16:14:19 -04:00
|
|
|
return offsetof(struct btrfs_leaf, items);
|
2007-03-21 11:12:56 -04:00
|
|
|
}
|
|
|
|
|
2007-10-15 16:14:19 -04:00
|
|
|
/* struct btrfs_file_extent_item */
|
|
|
|
BTRFS_SETGET_FUNCS(file_extent_type, struct btrfs_file_extent_item, type, 8);
|
2007-03-20 14:38:32 -04:00
|
|
|
|
2009-01-05 21:25:51 -05:00
|
|
|
static inline unsigned long
|
|
|
|
btrfs_file_extent_inline_start(struct btrfs_file_extent_item *e)
|
2007-04-19 13:37:44 -04:00
|
|
|
{
|
2007-10-15 16:14:19 -04:00
|
|
|
unsigned long offset = (unsigned long)e;
|
2007-10-15 16:15:53 -04:00
|
|
|
offset += offsetof(struct btrfs_file_extent_item, disk_bytenr);
|
2007-10-15 16:14:19 -04:00
|
|
|
return offset;
|
2007-04-19 13:37:44 -04:00
|
|
|
}
|
|
|
|
|
|
|
|
static inline u32 btrfs_file_extent_calc_inline_size(u32 datasize)
|
|
|
|
{
|
2007-10-15 16:15:53 -04:00
|
|
|
return offsetof(struct btrfs_file_extent_item, disk_bytenr) + datasize;
|
2007-03-20 14:38:32 -04:00
|
|
|
}
|
|
|
|
|
2007-10-15 16:15:53 -04:00
|
|
|
BTRFS_SETGET_FUNCS(file_extent_disk_bytenr, struct btrfs_file_extent_item,
|
|
|
|
disk_bytenr, 64);
|
2007-10-15 16:14:19 -04:00
|
|
|
BTRFS_SETGET_FUNCS(file_extent_generation, struct btrfs_file_extent_item,
|
|
|
|
generation, 64);
|
2007-10-15 16:15:53 -04:00
|
|
|
BTRFS_SETGET_FUNCS(file_extent_disk_num_bytes, struct btrfs_file_extent_item,
|
|
|
|
disk_num_bytes, 64);
|
2007-10-15 16:14:19 -04:00
|
|
|
BTRFS_SETGET_FUNCS(file_extent_offset, struct btrfs_file_extent_item,
|
|
|
|
offset, 64);
|
2007-10-15 16:15:53 -04:00
|
|
|
BTRFS_SETGET_FUNCS(file_extent_num_bytes, struct btrfs_file_extent_item,
|
|
|
|
num_bytes, 64);
|
Btrfs: Add zlib compression support
This is a large change for adding compression on reading and writing,
both for inline and regular extents. It does some fairly large
surgery to the writeback paths.
Compression is off by default and enabled by mount -o compress. Even
when the -o compress mount option is not used, it is possible to read
compressed extents off the disk.
If compression for a given set of pages fails to make them smaller, the
file is flagged to avoid future compression attempts later.
* While finding delalloc extents, the pages are locked before being sent down
to the delalloc handler. This allows the delalloc handler to do complex things
such as cleaning the pages, marking them writeback and starting IO on their
behalf.
* Inline extents are inserted at delalloc time now. This allows us to compress
the data before inserting the inline extent, and it allows us to insert
an inline extent that spans multiple pages.
* All of the in-memory extent representations (extent_map.c, ordered-data.c etc)
are changed to record both an in-memory size and an on disk size, as well
as a flag for compression.
From a disk format point of view, the extent pointers in the file are changed
to record the on disk size of a given extent and some encoding flags.
Space in the disk format is allocated for compression encoding, as well
as encryption and a generic 'other' field. Neither the encryption nor the
'other' field is currently used.
In order to limit the amount of data read for a single random read in the
file, the size of a compressed extent is limited to 128k. This is a
software only limit, the disk format supports u64 sized compressed extents.
In order to limit the ram consumed while processing extents, the uncompressed
size of a compressed extent is limited to 256k. This is a software only limit
and will be subject to tuning later.
Checksumming is still done on compressed extents, and it is done on the
uncompressed version of the data. This way additional encodings can be
layered on without having to figure out which encoding to checksum.
Compression happens at delalloc time, which is basically single threaded because
it is usually done by a single pdflush thread. This makes it tricky to
spread the compression load across all the cpus on the box. We'll have to
look at parallel pdflush walks of dirty inodes at a later time.
Decompression is hooked into readpages and it does spread across CPUs nicely.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-10-29 14:49:59 -04:00
|
|
|
BTRFS_SETGET_FUNCS(file_extent_ram_bytes, struct btrfs_file_extent_item,
|
|
|
|
ram_bytes, 64);
|
|
|
|
BTRFS_SETGET_FUNCS(file_extent_compression, struct btrfs_file_extent_item,
|
|
|
|
compression, 8);
|
|
|
|
BTRFS_SETGET_FUNCS(file_extent_encryption, struct btrfs_file_extent_item,
|
|
|
|
encryption, 8);
|
|
|
|
BTRFS_SETGET_FUNCS(file_extent_other_encoding, struct btrfs_file_extent_item,
|
|
|
|
other_encoding, 16);
|
|
|
|
|
|
|
|
/* this returns the number of file bytes represented by the inline item.
|
|
|
|
* If an item is compressed, this is the uncompressed size
|
|
|
|
*/
|
|
|
|
static inline u32 btrfs_file_extent_inline_len(struct extent_buffer *eb,
|
|
|
|
struct btrfs_file_extent_item *e)
|
|
|
|
{
|
|
|
|
return btrfs_file_extent_ram_bytes(eb, e);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* this returns the number of bytes used by the item on disk, minus the
|
|
|
|
* size of any extent headers. If a file is compressed on disk, this is
|
|
|
|
* the compressed size
|
|
|
|
*/
|
|
|
|
static inline u32 btrfs_file_extent_inline_item_len(struct extent_buffer *eb,
|
|
|
|
struct btrfs_item *e)
|
|
|
|
{
|
|
|
|
unsigned long offset;
|
|
|
|
offset = offsetof(struct btrfs_file_extent_item, disk_bytenr);
|
|
|
|
return btrfs_item_size(eb, e) - offset;
|
|
|
|
}
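The inline helpers above all hinge on one offset: inline data begins where `disk_bytenr` would otherwise sit, so the item size is that offset plus the data length. The sketch below reproduces the arithmetic with a hypothetical packed struct. The field order in `struct demo_file_extent_item` is an assumption mirroring the accessors declared above, and `__attribute__((packed))` is a GCC/Clang extension.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the on-disk file extent item (assumed layout). */
struct demo_file_extent_item {
	uint64_t generation;
	uint64_t ram_bytes;
	uint8_t compression;
	uint8_t encryption;
	uint16_t other_encoding;
	uint8_t type;
	uint64_t disk_bytenr;	/* inline data starts at this offset */
	uint64_t disk_num_bytes;
	uint64_t offset;
	uint64_t num_bytes;
} __attribute__((packed));

/* Offset of the first inline data byte within the item. */
#define DEMO_INLINE_START offsetof(struct demo_file_extent_item, disk_bytenr)

/* Item size needed to hold datasize bytes of inline data. */
static uint32_t demo_calc_inline_size(uint32_t datasize)
{
	return (uint32_t)DEMO_INLINE_START + datasize;
}

/* Bytes of inline data stored in an item of the given total size. */
static uint32_t demo_inline_item_len(uint32_t item_size)
{
	return item_size - (uint32_t)DEMO_INLINE_START;
}
```

Under this assumed layout the header occupies 21 bytes, so 100 bytes of inline data need a 121-byte item. For a compressed inline extent, the value recovered this way is the compressed length, while `ram_bytes` holds the uncompressed length.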
|
2007-03-20 14:38:32 -04:00
|
|
|
|
2007-03-22 12:13:20 -04:00
|
|
|
static inline struct btrfs_root *btrfs_sb(struct super_block *sb)
|
|
|
|
{
|
|
|
|
return sb->s_fs_info;
|
|
|
|
}
|
|
|
|
|
2009-01-05 21:25:51 -05:00
|
|
|
static inline u32 btrfs_level_size(struct btrfs_root *root, int level)
|
|
|
|
{
|
2007-10-15 16:15:53 -04:00
|
|
|
if (level == 0)
|
|
|
|
return root->leafsize;
|
|
|
|
return root->nodesize;
|
|
|
|
}
|
|
|
|
|
2007-03-14 10:31:29 -04:00
|
|
|
/* helper function to cast into the data area of the leaf. */
|
|
|
|
#define btrfs_item_ptr(leaf, slot, type) \
|
2007-03-14 14:14:43 -04:00
|
|
|
((type *)(btrfs_leaf_data(leaf) + \
|
2007-10-15 16:14:19 -04:00
|
|
|
btrfs_item_offset_nr(leaf, slot)))
|
|
|
|
|
|
|
|
#define btrfs_item_ptr_offset(leaf, slot) \
|
|
|
|
((unsigned long)(btrfs_leaf_data(leaf) + \
|
|
|
|
btrfs_item_offset_nr(leaf, slot)))
|
2007-03-14 10:31:29 -04:00
|
|
|
|
2008-09-24 11:48:04 -04:00
|
|
|
static inline struct dentry *fdentry(struct file *file)
|
|
|
|
{
|
2007-12-18 16:15:09 -05:00
|
|
|
return file->f_path.dentry;
|
|
|
|
}
|
|
|
|
|
2010-09-16 16:19:09 -04:00
|
|
|
static inline bool btrfs_mixed_space_info(struct btrfs_space_info *space_info)
|
|
|
|
{
|
|
|
|
return ((space_info->flags & BTRFS_BLOCK_GROUP_METADATA) &&
|
|
|
|
(space_info->flags & BTRFS_BLOCK_GROUP_DATA));
|
|
|
|
}
|
|
|
|
|
2011-09-21 15:05:58 -04:00
|
|
|
static inline gfp_t btrfs_alloc_write_mask(struct address_space *mapping)
|
|
|
|
{
|
|
|
|
return mapping_gfp_mask(mapping) & ~__GFP_FS;
|
|
|
|
}
|
|
|
|
|
2007-04-17 13:26:50 -04:00
|
|
|
/* extent-tree.c */
|
btrfs: implement delayed inode items operation
Changelog V5 -> V6:
- Fix oom when the memory load is high, by storing the delayed nodes into the
root's radix tree, and letting btrfs inodes go.
Changelog V4 -> V5:
- Fix the race on adding the delayed node to the inode, which was spotted by
Chris Mason.
- Merge Chris Mason's incremental patch into this patch.
- Fix deadlock between readdir() and memory fault, which was reported by
Itaru Kitayama.
Changelog V3 -> V4:
- Fix nested lock, which was reported by Itaru Kitayama, by updating space cache
inode in time.
Changelog V2 -> V3:
- Fix the race between the delayed worker and the task which does delayed items
balance, which was reported by Tsutomu Itoh.
- Modify the patch to address David Sterba's comment.
- Fix the bug of the cpu recursion spinlock, reported by Chris Mason.
Changelog V1 -> V2:
- Break up the global rb-tree, use a list to manage the delayed nodes,
which are created for every directory and file, and used to manage the
delayed directory name index items and the delayed inode item.
- Introduce a worker to deal with the delayed nodes.
Compared with Ext3/4, the performance of file creation and deletion on btrfs
is very poor. The reason is that btrfs must do a lot of b+ tree insertions,
such as the inode item, directory name item, directory name index and so on.
If we delay some of the b+ tree insertions and deletions, we can improve the
performance, so this patch implements delayed directory name index
insertion/deletion and delayed inode update.
Implementation:
- Introduce a delayed root object into the filesystem, which uses two lists to
manage the delayed nodes that are created for every file/directory.
One list manages all the delayed nodes that have delayed items. The
other manages the delayed nodes that are waiting to be dealt with
by the worker thread.
- Every delayed node has two rb-trees: one manages the directory name
index items that are going to be inserted into the b+ tree, and the other
manages the directory name index items that are going to be deleted from it.
- Introduce a worker to deal with the delayed operations. This worker
handles the delayed directory name index insertions and deletions
and the delayed inode updates.
When the number of delayed items exceeds the lower limit, we create works for
some delayed nodes, insert them into the work queue of the worker, and then
return.
When the number of delayed items exceeds the upper bound, we create works for
all the delayed nodes that haven't been dealt with, insert them into the work
queue of the worker, and then wait until the number of untreated items drops
below some threshold value.
- When we want to insert a directory name index into the b+ tree, we just add
the information into the delayed inserting rb-tree.
Then we check the number of delayed items and do delayed items
balance. (The balance policy is above.)
- When we want to delete a directory name index from the b+ tree, we search for
it in the inserting rb-tree first. If we find it, we just drop it. If not,
we add its key to the delayed deleting rb-tree.
As with the delayed inserting rb-tree, we also check the number of
delayed items and do delayed items balance.
(The same as for insertion.)
- When we want to update the metadata of some inode, we cache the data of the
inode in the delayed node. The worker will flush it into the b+ tree after
dealing with the delayed insertions and deletions.
- We move the delayed node to the tail of the list after we access it.
This way, we can cache more delayed items and merge more
inode updates.
- If we want to commit the transaction, we deal with all the delayed nodes.
- The delayed node is freed when we free the btrfs inode.
- Before we log the inode items, we commit all the directory name index items
and the delayed inode update.
I did a quick test with the benchmark tool[1] and found we can improve the
performance of file creation by ~15%, and file deletion by ~20%.
Before applying this patch:
Create files:
Total files: 50000
Total time: 1.096108
Average time: 0.000022
Delete files:
Total files: 50000
Total time: 1.510403
Average time: 0.000030
After applying this patch:
Create files:
Total files: 50000
Total time: 0.932899
Average time: 0.000019
Delete files:
Total files: 50000
Total time: 1.215732
Average time: 0.000024
[1] http://marc.info/?l=linux-btrfs&m=128212635122920&q=p3
Many thanks for Kitayama-san's help!
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: David Sterba <dave@jikos.cz>
Tested-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Tested-by: Itaru Kitayama <kitayama@cl.bb4u.ne.jp>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-04-22 18:12:22 +08:00
|
|
|
static inline u64 btrfs_calc_trans_metadata_size(struct btrfs_root *root,
|
2011-07-15 15:16:44 +00:00
|
|
|
unsigned num_items)
|
btrfs: implement delayed inode items operation
Changelog V5 -> V6:
- Fix oom when the memory load is high, by storing the delayed nodes into the
root's radix tree, and letting btrfs inodes go.
Changelog V4 -> V5:
- Fix the race on adding the delayed node to the inode, which is spotted by
Chris Mason.
- Merge Chris Mason's incremental patch into this patch.
- Fix deadlock between readdir() and memory fault, which is reported by
Itaru Kitayama.
Changelog V3 -> V4:
- Fix nested lock, which is reported by Itaru Kitayama, by updating space cache
inode in time.
Changelog V2 -> V3:
- Fix the race between the delayed worker and the task which does delayed items
balance, which is reported by Tsutomu Itoh.
- Modify the patch address David Sterba's comment.
- Fix the bug of the cpu recursion spinlock, reported by Chris Mason
Changelog V1 -> V2:
- break up the global rb-tree, use a list to manage the delayed nodes,
which is created for every directory and file, and used to manage the
delayed directory name index items and the delayed inode item.
- introduce a worker to deal with the delayed nodes.
Compare with Ext3/4, the performance of file creation and deletion on btrfs
is very poor. the reason is that btrfs must do a lot of b+ tree insertions,
such as inode item, directory name item, directory name index and so on.
If we can do some delayed b+ tree insertion or deletion, we can improve the
performance, so we made this patch which implemented delayed directory name
index insertion/deletion and delayed inode update.
Implementation:
- introduce a delayed root object into the filesystem, that use two lists to
manage the delayed nodes which are created for every file/directory.
One list holds all the delayed nodes that have delayed items, and the
other holds the delayed nodes that are waiting to be processed by the
worker thread.
- Every delayed node has two rb-trees: one manages the directory name index
items that are going to be inserted into the b+ tree, and the other manages
the directory name index items that are going to be deleted from the b+ tree.
- Introduce a worker to handle the delayed operations: the insertion and
deletion of delayed directory name index items, and the delayed inode
update.
When the number of delayed items exceeds the lower limit, we create work
items for some delayed nodes, insert them into the worker's work queue, and
then return.
When the number of delayed items exceeds the upper bound, we create work
items for all the delayed nodes that haven't been dealt with, insert them
into the worker's work queue, and then wait until the number of untreated
items falls below some threshold value.
- When we want to insert a directory name index into the b+ tree, we just add
the information to the delayed inserting rb-tree.
Then we check the number of delayed items and do the delayed items
balance. (The balance policy is described above.)
- When we want to delete a directory name index from the b+ tree, we first
search for it in the inserting rb-tree. If we find it, we just drop it. If
not, we add its key to the delayed deleting rb-tree.
As with the delayed inserting rb-tree, we also check the number of
delayed items and do the delayed items balance.
(The same as for insertion.)
- When we want to update the metadata of an inode, we cache the inode's data
in the delayed node. The worker will flush it into the b+ tree after
dealing with the delayed insertions and deletions.
- We move the delayed node to the tail of the list after we access it. This
way, we can cache more delayed items and merge more inode updates.
- When we commit a transaction, we deal with all the delayed nodes.
- The delayed node is freed when we free the btrfs inode.
- Before we log the inode items, we commit all the directory name index items
and the delayed inode update.
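The two-threshold balance policy in the list above can be sketched as a small user-space C model. This is only an illustration under stated assumptions: the enum, the function name `delayed_balance_action`, and the `lower`/`upper` parameters are hypothetical and are not taken from the patch, which does not state its threshold values here.

```c
#include <assert.h>

/* What the caller does after adding a delayed item (hypothetical names). */
enum balance_action {
	BALANCE_NONE,       /* at or below the lower limit: nothing to do */
	BALANCE_ASYNC_SOME, /* queue work for some delayed nodes, then return */
	BALANCE_SYNC_ALL    /* queue work for all pending nodes, then wait */
};

/*
 * Pick an action from the current number of delayed items.
 * `lower` and `upper` are assumed thresholds supplied by the caller.
 */
static enum balance_action delayed_balance_action(unsigned long nr_items,
						   unsigned long lower,
						   unsigned long upper)
{
	assert(lower < upper);

	if (nr_items > upper)
		return BALANCE_SYNC_ALL;
	if (nr_items > lower)
		return BALANCE_ASYNC_SOME;
	return BALANCE_NONE;
}
```

With assumed thresholds of 100 and 1000, a count of 500 yields `BALANCE_ASYNC_SOME` (the caller queues some work and returns), while a count of 5000 yields `BALANCE_SYNC_ALL` (the caller queues everything and waits).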
I did a quick test with the benchmark tool[1] and found that this patch
improves the performance of file creation by ~15% and of file deletion by ~20%.
Before applying this patch:
Create files:
Total files: 50000
Total time: 1.096108
Average time: 0.000022
Delete files:
Total files: 50000
Total time: 1.510403
Average time: 0.000030
After applying this patch:
Create files:
Total files: 50000
Total time: 0.932899
Average time: 0.000019
Delete files:
Total files: 50000
Total time: 1.215732
Average time: 0.000024
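The ~15% and ~20% figures can be checked against the total times listed above. The helper below is a minimal sketch; the name `improvement_percent` is hypothetical, and it only assumes that both totals are expressed in the same time unit.

```c
#include <assert.h>

/* Relative improvement of `after` over `before`, in percent. */
static double improvement_percent(double before, double after)
{
	assert(before > 0.0);
	return (before - after) / before * 100.0;
}
```

Calling `improvement_percent(1.096108, 0.932899)` gives roughly 14.9 for file creation, and `improvement_percent(1.510403, 1.215732)` gives roughly 19.5 for file deletion, which is consistent with the rounded figures in the message.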
[1] http://marc.info/?l=linux-btrfs&m=128212635122920&q=p3
Many thanks for Kitayama-san's help!
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: David Sterba <dave@jikos.cz>
Tested-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Tested-by: Itaru Kitayama <kitayama@cl.bb4u.ne.jp>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-04-22 18:12:22 +08:00
|
|
|
{
|
|
|
|
return (root->leafsize + root->nodesize * (BTRFS_MAX_LEVEL - 1)) *
|
|
|
|
3 * num_items;
|
2011-08-19 10:29:59 -04:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Doing a truncate won't result in new nodes or leaves, just what we need for
|
|
|
|
* COW.
|
|
|
|
*/
|
|
|
|
static inline u64 btrfs_calc_trunc_metadata_size(struct btrfs_root *root,
|
|
|
|
unsigned num_items)
|
|
|
|
{
|
|
|
|
return (root->leafsize + root->nodesize * (BTRFS_MAX_LEVEL - 1)) *
|
|
|
|
num_items;
|
2011-04-22 18:12:22 +08:00
|
|
|
}
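The two reservation helpers above share one expression: one leaf plus one node for each of the remaining levels, multiplied by the item count, with an extra factor of 3 in the first helper. The following user-space sketch works through that arithmetic. It assumes `BTRFS_MAX_LEVEL` is 8 (it is defined elsewhere in this header), and `struct root_sizes` and the function names are hypothetical stand-ins, not kernel API. As a choice of this sketch, operands are widened to 64 bits before multiplying.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed value of the maximum b-tree height; defined elsewhere in the header. */
#define BTRFS_MAX_LEVEL 8

/* Hypothetical stand-in for the two root fields the helpers read. */
struct root_sizes {
	uint32_t leafsize;
	uint32_t nodesize;
};

/* Bytes in one root-to-leaf path: one leaf plus a node per upper level. */
static uint64_t path_bytes(const struct root_sizes *root)
{
	assert(root->leafsize > 0 && root->nodesize > 0);
	return (uint64_t)root->leafsize +
	       (uint64_t)root->nodesize * (BTRFS_MAX_LEVEL - 1);
}

/* Truncate case: only what is needed to COW one path per item. */
static uint64_t calc_trunc_metadata_size(const struct root_sizes *root,
					  unsigned num_items)
{
	return path_bytes(root) * num_items;
}

/* Mirrors the helper that multiplies the same expression by 3. */
static uint64_t calc_tripled_metadata_size(const struct root_sizes *root,
					    unsigned num_items)
{
	return path_bytes(root) * 3 * num_items;
}
```

With an assumed leaf and node size of 4096 bytes, one path is 4096 * 8 = 32768 bytes, so a truncate reservation for one item is 32768 bytes and the tripled reservation for two items is 196608 bytes.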
|
|
|
|
|
2009-04-03 09:47:43 -04:00
|
|
|
void btrfs_put_block_group(struct btrfs_block_group_cache *cache);
|
2009-03-13 10:10:06 -04:00
|
|
|
int btrfs_run_delayed_refs(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root, unsigned long count);
|
2008-09-23 13:14:14 -04:00
|
|
|
int btrfs_lookup_extent(struct btrfs_root *root, u64 start, u64 len);
|
2010-05-16 10:48:46 -04:00
|
|
|
int btrfs_lookup_extent_info(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root, u64 bytenr,
|
|
|
|
u64 num_bytes, u64 *refs, u64 *flags);
|
2009-09-11 16:11:19 -04:00
|
|
|
int btrfs_pin_extent(struct btrfs_root *root,
|
|
|
|
u64 bytenr, u64 num, int reserved);
|
2011-10-31 20:52:39 -04:00
|
|
|
int btrfs_pin_extent_for_log_replay(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
u64 bytenr, u64 num_bytes);
|
2008-10-30 14:20:02 -04:00
|
|
|
int btrfs_cross_ref_exist(struct btrfs_trans_handle *trans,
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in a subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits the old block's references. When a tree block with a
reference count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update the back refs for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about the pointer's key, its level, and the
tree in which the pointer lives. This information allows us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds a per-subvolume red-black tree to keep track of cached
inodes. The red-black tree helps the balancing code find cached
inodes whose inode numbers are within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduces the overhead of checking back refs.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. However, the pieces depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 10:45:14 -04:00
|
|
|
struct btrfs_root *root,
|
|
|
|
u64 objectid, u64 offset, u64 bytenr);
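The two copy-on-write reference-counting rules in the "Mixed back reference" commit message above can be modelled with a toy structure. This is a sketch under stated assumptions, not the kernel's implementation: `struct toy_block`, `toy_cow`, and `MAX_PTRS` are hypothetical, and the sketch assumes that the new copy of a shared block starts with a reference count of 1.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PTRS 4 /* hypothetical fan-out for the toy model */

/* Toy tree block: a reference count plus pointers to lower-level extents. */
struct toy_block {
	int refs;                         /* 0 means the block has been freed */
	size_t nr_ptrs;
	struct toy_block *ptrs[MAX_PTRS];
};

/* COW `old` into `new_blk`, following the two rules in the commit message. */
static void toy_cow(struct toy_block *old, struct toy_block *new_blk)
{
	size_t i;

	assert(old->refs >= 1);
	assert(old->nr_ptrs <= MAX_PTRS);

	/* The copy points at the same lower-level extents as the original. */
	new_blk->nr_ptrs = old->nr_ptrs;
	for (i = 0; i < old->nr_ptrs; i++)
		new_blk->ptrs[i] = old->ptrs[i];

	if (old->refs == 1) {
		/*
		 * Non-shared: the new block inherits the old block's
		 * references and the old block is freed at once. The
		 * lower-level reference counts are left untouched.
		 */
		new_blk->refs = old->refs;
		old->refs = 0;
	} else {
		/*
		 * Shared: every extent the new block points to gains one
		 * reference, and the old block loses one.
		 */
		for (i = 0; i < new_blk->nr_ptrs; i++)
			new_blk->ptrs[i]->refs++;
		new_blk->refs = 1; /* assumption of this sketch */
		old->refs--;
	}
}
```

In the non-shared case the child counts never change, which is the saving the commit message describes. Only the shared case pays for walking the pointers.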
|
2009-01-05 21:25:51 -05:00
|
|
|
struct btrfs_block_group_cache *btrfs_lookup_block_group(
|
|
|
|
struct btrfs_fs_info *info,
|
|
|
|
u64 bytenr);
|
2009-06-10 10:45:14 -04:00
|
|
|
void btrfs_put_block_group(struct btrfs_block_group_cache *cache);
|
2008-12-11 16:30:39 -05:00
|
|
|
u64 btrfs_find_block_group(struct btrfs_root *root,
|
|
|
|
u64 search_start, u64 search_hint, int owner);
|
2007-10-15 16:14:19 -04:00
|
|
|
struct extent_buffer *btrfs_alloc_free_block(struct btrfs_trans_handle *trans,
|
2009-06-10 10:45:14 -04:00
|
|
|
struct btrfs_root *root, u32 blocksize,
|
|
|
|
u64 parent, u64 root_objectid,
|
|
|
|
struct btrfs_disk_key *key, int level,
|
|
|
|
u64 hint, u64 empty_size);
|
2010-05-16 10:46:25 -04:00
|
|
|
void btrfs_free_tree_block(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
struct extent_buffer *buf,
|
|
|
|
u64 parent, int last_ref);
|
2008-08-01 15:11:20 -04:00
|
|
|
struct extent_buffer *btrfs_init_new_buffer(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
2009-02-12 14:09:45 -05:00
|
|
|
u64 bytenr, u32 blocksize,
|
|
|
|
int level);
|
2009-06-10 10:45:14 -04:00
|
|
|
int btrfs_alloc_reserved_file_extent(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
u64 root_objectid, u64 owner,
|
|
|
|
u64 offset, struct btrfs_key *ins);
|
|
|
|
int btrfs_alloc_logged_file_extent(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
u64 root_objectid, u64 owner, u64 offset,
|
|
|
|
struct btrfs_key *ins);
|
2008-07-17 12:53:50 -04:00
|
|
|
int btrfs_reserve_extent(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
u64 num_bytes, u64 min_alloc_size,
|
|
|
|
u64 empty_size, u64 hint_byte,
|
|
|
|
u64 search_end, struct btrfs_key *ins,
|
|
|
|
u64 data);
|
2007-03-16 16:20:31 -04:00
|
|
|
int btrfs_inc_ref(struct btrfs_trans_handle *trans, struct btrfs_root *root,
|
2009-06-10 10:45:14 -04:00
|
|
|
struct extent_buffer *buf, int full_backref);
|
|
|
|
int btrfs_dec_ref(struct btrfs_trans_handle *trans, struct btrfs_root *root,
|
|
|
|
struct extent_buffer *buf, int full_backref);
|
|
|
|
int btrfs_set_disk_extent_flags(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
u64 bytenr, u64 num_bytes, u64 flags,
|
|
|
|
int is_data);
|
2008-09-23 13:14:14 -04:00
|
|
|
int btrfs_free_extent(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
u64 bytenr, u64 num_bytes, u64 parent,
|
2009-06-10 10:45:14 -04:00
|
|
|
u64 root_objectid, u64 owner, u64 offset);
|
|
|
|
|
2008-08-01 15:11:20 -04:00
|
|
|
int btrfs_free_reserved_extent(struct btrfs_root *root, u64 start, u64 len);
|
2011-10-31 20:52:39 -04:00
|
|
|
int btrfs_free_and_pin_reserved_extent(struct btrfs_root *root,
|
|
|
|
u64 start, u64 len);
|
2009-09-11 16:11:19 -04:00
|
|
|
int btrfs_prepare_extent_commit(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root);
|
2007-06-28 15:57:36 -04:00
|
|
|
int btrfs_finish_extent_commit(struct btrfs_trans_handle *trans,
|
2009-09-11 16:11:19 -04:00
|
|
|
struct btrfs_root *root);
|
2007-04-17 13:26:50 -04:00
|
|
|
int btrfs_inc_extent_ref(struct btrfs_trans_handle *trans,
|
2008-09-23 13:14:14 -04:00
|
|
|
struct btrfs_root *root,
|
|
|
|
u64 bytenr, u64 num_bytes, u64 parent,
|
2009-06-10 10:45:14 -04:00
|
|
|
u64 root_objectid, u64 owner, u64 offset);
|
|
|
|
|
2007-04-26 16:46:15 -04:00
|
|
|
int btrfs_write_dirty_block_groups(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root);
|
2008-12-11 16:30:39 -05:00
|
|
|
int btrfs_extent_readonly(struct btrfs_root *root, u64 bytenr);
|
2007-04-26 16:46:15 -04:00
|
|
|
int btrfs_free_block_groups(struct btrfs_fs_info *info);
|
|
|
|
int btrfs_read_block_groups(struct btrfs_root *root);
|
Btrfs: make balance code choose more wisely when relocating
Currently, we can panic the box if the first block group we go to move is of a
type where there is no space left to move those extents. For example, if we
fill the disk up with data, and then we try to balance and we have no room to
move the data nor room to allocate new chunks, we will panic. Change this by
checking to see if we have room to move this chunk around, and if not, return
-ENOSPC and move on to the next chunk. This will make sure we remove block
groups that are movable, such as when we have a lot of empty metadata block groups,
and then that way we make room to be able to balance our data chunks as well.
Tested this with an fs that would panic on btrfs-vol -b normally, but no longer
panics with this patch.
V1->V2:
-actually search for a free extent on the device to make sure we can allocate a
chunk if need be.
-fix btrfs_shrink_device to make sure we actually try to relocate all the
chunks, and then if we can't return -ENOSPC so if we are doing a btrfs-vol -r
we don't remove the device with data still on it.
-check to make sure the block group we are going to relocate isn't the last one
in that particular space
-fix a bug in btrfs_shrink_device where we would change the device's size and
not fix it if we fail to do our relocate
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-11 16:11:19 -04:00
|
|
|
int btrfs_can_relocate(struct btrfs_root *root, u64 bytenr);
|
2008-03-24 15:01:56 -04:00
|
|
|
int btrfs_make_block_group(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root, u64 bytes_used,
|
2008-04-15 15:41:47 -04:00
|
|
|
u64 type, u64 chunk_objectid, u64 chunk_offset,
|
2008-03-24 15:01:56 -04:00
|
|
|
u64 size);
|
Btrfs: update space balancing code
This patch updates the space balancing code to utilize the new
backref format. Before, btrfs-vol -b would break any COW links
on data blocks or metadata. This was slow and caused the amount
of space used to explode if a large number of snapshots were present.
The new code keeps the sharing of all data extents and
most of the tree blocks.
To maintain the sharing of data extents, the space balance code uses
a separate inode to hold data extent pointers, then updates the references
to point to the new location.
To maintain the sharing of tree blocks, the space balance code uses
reloc trees to relocate tree blocks in reference counted roots.
There is one reloc tree for each subvol, and all reloc trees share
the same root key objectid. Reloc trees are snapshots of the latest
committed roots of subvols (root->commit_root).
To relocate a tree block referenced by a subvol, there are two steps.
COW the block through subvol's reloc tree, then update block pointer in
the subvol to point to the new block. Since all reloc trees share
the same root key objectid, special handling for tree blocks
owned by them is easy. Once a tree block has been COWed in one
reloc tree, we can use the resulting new block directly when the
same block is required to COW again through other reloc trees.
In this way, relocated tree blocks are shared between reloc trees,
so they are also shared between subvols.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-26 10:09:34 -04:00
|
|
|
int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root, u64 group_start);
|
2008-11-17 21:11:30 -05:00
|
|
|
u64 btrfs_reduce_alloc_profile(struct btrfs_root *root, u64 flags);
|
btrfs: fix wrong free space information of btrfs
When we store data with a raid profile in btrfs on two or more disks of different
sizes, the df command shows some free space in the filesystem, but in fact the
user cannot write any data. The df command shows the wrong free space
information for btrfs.
# mkfs.btrfs -d raid1 /dev/sda9 /dev/sda10
# btrfs-show
Label: none uuid: a95cd49e-6e33-45b8-8741-a36153ce4b64
Total devices 2 FS bytes used 28.00KB
devid 1 size 5.01GB used 2.03GB path /dev/sda9
devid 2 size 10.00GB used 2.01GB path /dev/sda10
# btrfs device scan /dev/sda9 /dev/sda10
# mount /dev/sda9 /mnt
# dd if=/dev/zero of=tmpfile0 bs=4K count=9999999999
(fill the filesystem)
# sync
# df -TH
Filesystem Type Size Used Avail Use% Mounted on
/dev/sda9 btrfs 17G 8.6G 5.4G 62% /mnt
# btrfs-show
Label: none uuid: a95cd49e-6e33-45b8-8741-a36153ce4b64
Total devices 2 FS bytes used 3.99GB
devid 1 size 5.01GB used 5.01GB path /dev/sda9
devid 2 size 10.00GB used 4.99GB path /dev/sda10
This happens because btrfs cannot allocate chunks when one of the paired disks has
no space left. The free space on the other disks can never be used and should
be subtracted from the total space, but btrfs doesn't subtract this space from
the total, which is confusing to the user.
This patch fixes it by calculating the free space that can be used to allocate
chunks.
Implementation:
1. Get the free space of every device and align it to the stripe length.
2. Sort the devices by free space.
3. Check the free space of each device:
3.1. If it is not zero, count the devices that have
more free space than this device.
If that number of devices is beyond the min stripe number, the free
space can be used and is added to the total free space.
If that number of devices is below the min stripe number, we cannot
use the free space and the check ends.
3.2. If the free space is zero, check the next device and go to 3.1.
This implementation is essentially a fake chunk allocation.
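The steps above can be sketched in userspace C roughly as follows. This is a simplified illustration and not the kernel code: `sketch_calc_avail_space` and its parameters are hypothetical names, each device is reduced to a single free-byte count, every fake chunk is assumed to use exactly `min_stripes` devices, and the result is the raw space before dividing by the number of copies.

```c
#include <stdint.h>
#include <stdlib.h>

/* qsort comparator: devices with more free space come first. */
static int sketch_cmp_free_desc(const void *a, const void *b)
{
	uint64_t x = *(const uint64_t *)a;
	uint64_t y = *(const uint64_t *)b;

	if (x > y)
		return -1;
	if (x < y)
		return 1;
	return 0;
}

/*
 * Hypothetical helper: estimate how many raw bytes could still be
 * allocated as chunks. free_bytes[] is modified in place.
 */
static uint64_t sketch_calc_avail_space(uint64_t *free_bytes, int nr_devices,
					int min_stripes, uint64_t stripe_len)
{
	uint64_t avail = 0;
	int i, j;

	if (min_stripes < 1 || stripe_len == 0)
		return 0;

	/* Step 1: align every device's free space down to the stripe length. */
	for (i = 0; i < nr_devices; i++)
		free_bytes[i] -= free_bytes[i] % stripe_len;

	/* Step 2: sort by free space, largest first. */
	qsort(free_bytes, nr_devices, sizeof(*free_bytes), sketch_cmp_free_desc);

	/*
	 * Step 3: start from the device with the least free space. Devices
	 * 0..i have at least as much free space as device i, so i + 1 is
	 * the number of devices that can take part in a fake chunk.
	 */
	for (i = nr_devices - 1; i + 1 >= min_stripes; i--) {
		uint64_t take = free_bytes[i];

		if (take == 0)
			continue;	/* step 3.2: nothing here, try the next device */

		/* Fake allocation: consume "take" bytes on min_stripes devices. */
		avail += take * (uint64_t)min_stripes;
		for (j = i + 1 - min_stripes; j <= i; j++)
			free_bytes[j] -= take;
	}
	return avail;
}
```

With two devices that have 10 and 5 units free and `min_stripes` of 2, the sketch reports 10 raw units (5 on each device); the remaining 5 units on the larger device are not counted because no partner device is left.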
After applying this patch, df shows the correct space information:
# df -TH
Filesystem Type Size Used Avail Use% Mounted on
/dev/sda9 btrfs 17G 8.6G 0 100% /mnt
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-01-05 10:07:31 +00:00
|
|
|
u64 btrfs_get_alloc_profile(struct btrfs_root *root, int data);
|
2009-02-20 11:00:09 -05:00
|
|
|
void btrfs_set_inode_space_info(struct btrfs_root *root, struct inode *ionde);
|
2009-03-10 12:39:20 -04:00
|
|
|
void btrfs_clear_space_info_full(struct btrfs_fs_info *info);
|
2010-05-16 10:48:47 -04:00
|
|
|
int btrfs_check_data_free_space(struct inode *inode, u64 bytes);
|
|
|
|
void btrfs_free_reserved_data_space(struct inode *inode, u64 bytes);
|
2010-05-16 10:48:46 -04:00
|
|
|
void btrfs_trans_release_metadata(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root);
|
2010-05-16 10:49:58 -04:00
|
|
|
int btrfs_orphan_reserve_metadata(struct btrfs_trans_handle *trans,
|
|
|
|
struct inode *inode);
|
|
|
|
void btrfs_orphan_release_metadata(struct inode *inode);
|
2010-05-16 10:48:46 -04:00
|
|
|
int btrfs_snap_reserve_metadata(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_pending_snapshot *pending);
|
2010-05-16 10:48:47 -04:00
|
|
|
int btrfs_delalloc_reserve_metadata(struct inode *inode, u64 num_bytes);
|
|
|
|
void btrfs_delalloc_release_metadata(struct inode *inode, u64 num_bytes);
|
|
|
|
int btrfs_delalloc_reserve_space(struct inode *inode, u64 num_bytes);
|
|
|
|
void btrfs_delalloc_release_space(struct inode *inode, u64 num_bytes);
|
2010-05-16 10:46:25 -04:00
|
|
|
void btrfs_init_block_rsv(struct btrfs_block_rsv *rsv);
|
|
|
|
struct btrfs_block_rsv *btrfs_alloc_block_rsv(struct btrfs_root *root);
|
|
|
|
void btrfs_free_block_rsv(struct btrfs_root *root,
|
|
|
|
struct btrfs_block_rsv *rsv);
|
2011-08-30 12:34:28 -04:00
|
|
|
int btrfs_block_rsv_add(struct btrfs_root *root,
|
2010-05-16 10:46:25 -04:00
|
|
|
struct btrfs_block_rsv *block_rsv,
|
2010-10-15 16:52:49 -04:00
|
|
|
u64 num_bytes);
|
Btrfs: fix delayed insertion reservation
We all keep getting those stupid warnings from use_block_rsv when running
stress.sh, and it's because the delayed insertion stuff is being stupid. It's
not the delayed insertion stuff's fault, it's all just stupid. When marking an
inode dirty for oh say updating the time on it, we just do a
btrfs_join_transaction, which doesn't reserve any space. This is stupid because
we're going to have to have space reserved to make this change, but we do it
because it's fast because chances are we're going to call it over and over again
and it doesn't matter. Well thanks to the delayed insertion stuff this is
mostly the case, so we do actually need to make this reservation. So if
trans->bytes_reserved is 0 then try to do a normal reservation. If not return
ENOSPC which will make the btrfs_dirty_inode start a proper transaction which
will let it do the whole ENOSPC dance and reserve enough space for the delayed
insertion to steal the reservation from the transaction.
The other stupid thing we do is not reserve space for the inode when writing to
the thing. Usually this is ok since we have to update the time so we'd have
already done all this work before we get to the endio stuff, so it doesn't
matter. But this is stupid because we could write the data after the
transaction commits where we changed the mtime of the inode so we have to cow
all the way down to the inode anyway. This used to be masked by the delalloc
reservation stuff, but because we delay the update it doesn't get masked in this
case. So again the delayed insertion stuff bites us in the ass. So if our
trans->block_rsv is delalloc, just steal the reservation from the delalloc
reserve. Hopefully this won't bite us in the ass, but I've said that before.
With this patch stress.sh no longer spits out those stupid warnings (famous last
words). Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-04 19:56:02 -04:00
|
|
|
int btrfs_block_rsv_add_noflush(struct btrfs_root *root,
|
|
|
|
struct btrfs_block_rsv *block_rsv,
|
|
|
|
u64 num_bytes);
|
2011-08-30 12:34:28 -04:00
|
|
|
int btrfs_block_rsv_check(struct btrfs_root *root,
|
2011-10-18 12:15:48 -04:00
|
|
|
struct btrfs_block_rsv *block_rsv, int min_factor);
|
|
|
|
int btrfs_block_rsv_refill(struct btrfs_root *root,
|
2010-05-16 10:46:25 -04:00
|
|
|
struct btrfs_block_rsv *block_rsv,
|
2011-10-18 12:15:48 -04:00
|
|
|
u64 min_reserved);
|
2011-11-18 17:43:00 +08:00
|
|
|
int btrfs_block_rsv_refill_noflush(struct btrfs_root *root,
|
|
|
|
struct btrfs_block_rsv *block_rsv,
|
|
|
|
u64 min_reserved);
|
2010-05-16 10:46:25 -04:00
|
|
|
int btrfs_block_rsv_migrate(struct btrfs_block_rsv *src_rsv,
|
|
|
|
struct btrfs_block_rsv *dst_rsv,
|
|
|
|
u64 num_bytes);
|
|
|
|
void btrfs_block_rsv_release(struct btrfs_root *root,
|
|
|
|
struct btrfs_block_rsv *block_rsv,
|
|
|
|
u64 num_bytes);
|
|
|
|
int btrfs_set_block_group_ro(struct btrfs_root *root,
|
|
|
|
struct btrfs_block_group_cache *cache);
|
|
|
|
int btrfs_set_block_group_rw(struct btrfs_root *root,
|
|
|
|
struct btrfs_block_group_cache *cache);
|
2010-06-21 14:48:16 -04:00
|
|
|
void btrfs_put_block_group_cache(struct btrfs_fs_info *info);
|
2011-01-05 10:07:31 +00:00
|
|
|
u64 btrfs_account_ro_block_groups_free_space(struct btrfs_space_info *sinfo);
|
2011-01-06 19:30:25 +08:00
|
|
|
int btrfs_error_unpin_extent_range(struct btrfs_root *root,
|
|
|
|
u64 start, u64 end);
|
|
|
|
int btrfs_error_discard_extent(struct btrfs_root *root, u64 bytenr,
|
2011-03-24 10:24:27 +00:00
|
|
|
u64 num_bytes, u64 *actual_bytes);
|
2011-02-16 13:57:04 -05:00
|
|
|
int btrfs_force_chunk_alloc(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root, u64 type);
|
2011-03-24 10:24:28 +00:00
|
|
|
int btrfs_trim_fs(struct btrfs_root *root, struct fstrim_range *range);
|
2011-01-06 19:30:25 +08:00
|
|
|
|
2011-03-07 02:13:14 +00:00
|
|
|
int btrfs_init_space_info(struct btrfs_fs_info *fs_info);
|
2007-03-26 16:00:06 -04:00
|
|
|
/* ctree.c */
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
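As a rough illustration of the two COW cases described above, here is a small userspace model. It is only a sketch: `sketch_block`, `sketch_extent` and `sketch_cow_block` are hypothetical names, and reference counts are plain integers rather than extent tree items.

```c
#include <stddef.h>

#define SKETCH_MAX_PTRS 4

/* A lower-level extent that tree blocks point to. */
struct sketch_extent {
	int refs;
};

/* A tree block with its own reference count and its child pointers. */
struct sketch_block {
	int refs;
	int freed;
	size_t nr_ptrs;
	struct sketch_extent *ptrs[SKETCH_MAX_PTRS];
};

/* COW "old" into "new_blk" following the two cases in the text above. */
static void sketch_cow_block(struct sketch_block *old,
			     struct sketch_block *new_blk)
{
	size_t i;

	new_blk->refs = 1;
	new_blk->freed = 0;
	new_blk->nr_ptrs = old->nr_ptrs;
	for (i = 0; i < old->nr_ptrs; i++)
		new_blk->ptrs[i] = old->ptrs[i];

	if (old->refs == 1) {
		/*
		 * Non-shared block: free the old block at once. The new block
		 * inherits its references, so child counts stay untouched.
		 */
		old->refs = 0;
		old->freed = 1;
	} else {
		/*
		 * Shared block: the new block adds one reference to every
		 * extent it points to, and the old block loses one reference.
		 */
		for (i = 0; i < new_blk->nr_ptrs; i++)
			new_blk->ptrs[i]->refs++;
		old->refs--;
	}
}
```

The point of the non-shared case is that no lower-level reference count is touched at all.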
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allows us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds a per-subvolume red-black tree to keep track of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers fall within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduces the overhead of checking back refs.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 10:45:14 -04:00
|
|
|
int btrfs_bin_search(struct extent_buffer *eb, struct btrfs_key *key,
|
|
|
|
int level, int *slot);
|
|
|
|
int btrfs_comp_cpu_keys(struct btrfs_key *k1, struct btrfs_key *k2);
|
2008-03-24 15:01:56 -04:00
|
|
|
int btrfs_previous_item(struct btrfs_root *root,
|
|
|
|
struct btrfs_path *path, u64 min_objectid,
|
|
|
|
int type);
|
2008-09-23 13:14:14 -04:00
|
|
|
int btrfs_set_item_key_safe(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root, struct btrfs_path *path,
|
|
|
|
struct btrfs_key *new_key);
|
2008-06-25 16:01:30 -04:00
|
|
|
struct extent_buffer *btrfs_root_node(struct btrfs_root *root);
|
|
|
|
struct extent_buffer *btrfs_lock_root_node(struct btrfs_root *root);
|
2008-06-25 16:01:31 -04:00
|
|
|
int btrfs_find_next_key(struct btrfs_root *root, struct btrfs_path *path,
|
2008-06-25 16:01:31 -04:00
|
|
|
struct btrfs_key *key, int lowest_level,
|
|
|
|
int cache_only, u64 min_trans);
|
|
|
|
int btrfs_search_forward(struct btrfs_root *root, struct btrfs_key *min_key,
|
2008-09-05 16:13:11 -04:00
|
|
|
struct btrfs_key *max_key,
|
2008-06-25 16:01:31 -04:00
|
|
|
struct btrfs_path *path, int cache_only,
|
|
|
|
u64 min_trans);
|
2007-10-15 16:14:19 -04:00
|
|
|
int btrfs_cow_block(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root, struct extent_buffer *buf,
|
|
|
|
struct extent_buffer *parent, int parent_slot,
|
2009-03-13 10:24:59 -04:00
|
|
|
struct extent_buffer **cow_ret);
|
2007-12-17 20:14:01 -05:00
|
|
|
int btrfs_copy_root(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
struct extent_buffer *buf,
|
|
|
|
struct extent_buffer **cow_ret, u64 new_root_objectid);
|
2009-06-10 10:45:14 -04:00
|
|
|
int btrfs_block_can_be_shared(struct btrfs_root *root,
|
|
|
|
struct extent_buffer *buf);
|
2007-04-16 09:22:45 -04:00
|
|
|
int btrfs_extend_item(struct btrfs_trans_handle *trans, struct btrfs_root
|
|
|
|
*root, struct btrfs_path *path, u32 data_size);
|
2007-04-17 13:26:50 -04:00
|
|
|
int btrfs_truncate_item(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
struct btrfs_path *path,
|
2007-11-01 11:28:41 -04:00
|
|
|
u32 new_size, int from_end);
|
2008-12-10 09:10:46 -05:00
|
|
|
int btrfs_split_item(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
struct btrfs_path *path,
|
|
|
|
struct btrfs_key *new_key,
|
|
|
|
unsigned long split_offset);
|
2009-11-12 09:33:58 +00:00
|
|
|
int btrfs_duplicate_item(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
struct btrfs_path *path,
|
|
|
|
struct btrfs_key *new_key);
|
2007-03-16 16:20:31 -04:00
|
|
|
int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root
|
|
|
|
*root, struct btrfs_key *key, struct btrfs_path *p, int
|
|
|
|
ins_len, int cow);
|
2007-08-07 16:15:09 -04:00
|
|
|
int btrfs_realloc_node(struct btrfs_trans_handle *trans,
|
2007-10-15 16:14:19 -04:00
|
|
|
struct btrfs_root *root, struct extent_buffer *parent,
|
2007-10-15 16:22:39 -04:00
|
|
|
int start_slot, int cache_only, u64 *last_ret,
|
|
|
|
struct btrfs_key *progress);
|
2011-04-21 01:20:15 +02:00
|
|
|
void btrfs_release_path(struct btrfs_path *p);
|
2007-04-02 10:50:19 -04:00
|
|
|
struct btrfs_path *btrfs_alloc_path(void);
|
|
|
|
void btrfs_free_path(struct btrfs_path *p);
|
Btrfs: Change btree locking to use explicit blocking points
Most of the btrfs metadata operations can be protected by a spinlock,
but some operations still need to schedule.
So far, btrfs has been using a mutex along with a trylock loop.
Most of the time it is able to avoid going for the full mutex, so
the trylock loop is a big performance gain.
This commit is step one for getting rid of the blocking locks entirely.
btrfs_tree_lock takes a spinlock, and the code explicitly switches
to a blocking lock when it starts an operation that can schedule.
We'll be able to get rid of the blocking locks in smaller pieces over time.
Tracing allows us to find the most common cause of blocking, so we
can start with the hot spots first.
The basic idea is:
btrfs_tree_lock() returns with the spin lock held
btrfs_set_lock_blocking() sets the EXTENT_BUFFER_BLOCKING bit in
the extent buffer flags, and then drops the spin lock. The buffer is
still considered locked by all of the btrfs code.
If btrfs_tree_lock gets the spinlock but finds the blocking bit set, it drops
the spin lock and waits on a wait queue for the blocking bit to go away.
Much of the code that needs to set the blocking bit finishes without actually
blocking a good percentage of the time. So, an adaptive spin is still
used against the blocking bit to avoid very high context switch rates.
btrfs_clear_lock_blocking() clears the blocking bit and returns
with the spinlock held again.
btrfs_tree_unlock() can be called on either blocking or spinning locks;
it does the right thing based on the blocking bit.
ctree.c has a helper function to set/clear all the locked buffers in a
path as blocking.
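The protocol above can be modelled in userspace roughly as follows. This is only a sketch under stated assumptions: a pthread mutex stands in for the spinlock, a condition variable stands in for the wait queue, a plain `int` stands in for the EXTENT_BUFFER_BLOCKING bit, and the adaptive spin is left out. All `sketch_*` names are hypothetical and do not exist in btrfs.

```c
#include <pthread.h>

struct sketch_eb {
	pthread_mutex_t lock;	/* stands in for the spinlock */
	pthread_cond_t wq;	/* stands in for the wait queue */
	int blocking;		/* stands in for the blocking bit */
};

/* Returns with the "spinlock" held and the blocking bit clear. */
static void sketch_tree_lock(struct sketch_eb *eb)
{
	pthread_mutex_lock(&eb->lock);
	/* If the owner went blocking, drop the lock and wait for the bit. */
	while (eb->blocking)
		pthread_cond_wait(&eb->wq, &eb->lock);
}

/* Owner only: mark the buffer blocking, then drop the "spinlock". */
static void sketch_set_lock_blocking(struct sketch_eb *eb)
{
	eb->blocking = 1;
	pthread_mutex_unlock(&eb->lock);
}

/* Owner only: retake the "spinlock" and clear the blocking bit. */
static void sketch_clear_lock_blocking(struct sketch_eb *eb)
{
	pthread_mutex_lock(&eb->lock);
	eb->blocking = 0;
}

/* Owner only: works for both the spinning and the blocking state. */
static void sketch_tree_unlock(struct sketch_eb *eb)
{
	if (eb->blocking) {
		pthread_mutex_lock(&eb->lock);
		eb->blocking = 0;
	}
	/* Wake anyone who went to sleep while the bit was set. */
	pthread_cond_broadcast(&eb->wq);
	pthread_mutex_unlock(&eb->lock);
}
```

In this model the buffer counts as locked whenever either the mutex is held by the owner or the blocking flag is set, which mirrors the statement that the buffer is still considered locked after the spin lock is dropped.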
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-02-04 09:25:08 -05:00
|
|
|
void btrfs_set_path_blocking(struct btrfs_path *p);
|
btrfs: implement delayed inode items operation
Changelog V5 -> V6:
- Fix oom when the memory load is high, by storing the delayed nodes into the
root's radix tree, and letting btrfs inodes go.
Changelog V4 -> V5:
- Fix the race on adding the delayed node to the inode, which is spotted by
Chris Mason.
- Merge Chris Mason's incremental patch into this patch.
- Fix deadlock between readdir() and memory fault, which is reported by
Itaru Kitayama.
Changelog V3 -> V4:
- Fix nested lock, which is reported by Itaru Kitayama, by updating space cache
inode in time.
Changelog V2 -> V3:
- Fix the race between the delayed worker and the task which does delayed items
balance, which is reported by Tsutomu Itoh.
- Modify the patch to address David Sterba's comment.
- Fix the bug of the cpu recursion spinlock, reported by Chris Mason
Changelog V1 -> V2:
- break up the global rb-tree, use a list to manage the delayed nodes,
which is created for every directory and file, and used to manage the
delayed directory name index items and the delayed inode item.
- introduce a worker to deal with the delayed nodes.
Compared with Ext3/4, the performance of file creation and deletion on btrfs
is very poor. The reason is that btrfs must do a lot of b+ tree insertions,
such as the inode item, directory name item, directory name index and so on.
If we can delay some b+ tree insertions or deletions, we can improve the
performance, so we made this patch, which implements delayed directory name
index insertion/deletion and delayed inode update.
Implementation:
- Introduce a delayed root object into the filesystem that uses two lists to
manage the delayed nodes, which are created for every file/directory.
One list manages all the delayed nodes that have delayed items. The
other manages the delayed nodes that are waiting to be dealt with
by the work thread.
- Every delayed node has two rb-trees. One manages the directory name
index items that are going to be inserted into the b+ tree, and the other
manages the directory name index items that are going to be deleted from the b+ tree.
- Introduce a worker to deal with the delayed operations. This worker
handles the delayed insertion and deletion of directory name index items
and the delayed inode update.
When the number of delayed items is beyond the lower limit, we create works for some
delayed nodes, insert them into the work queue of the worker, and then
go back.
When the number of delayed items is beyond the upper bound, we create works for all
the delayed nodes that haven't been dealt with, insert them into the work
queue of the worker, and then wait until the number of untreated items is below some
threshold value.
- When we want to insert a directory name index into the b+ tree, we just add the
information to the delayed inserting rb-tree.
Then we check the number of delayed items and do the delayed items
balance. (The balance policy is above.)
- When we want to delete a directory name index from the b+ tree, we search for it
in the inserting rb-tree first. If we find it, we just drop it. If not,
we add its key to the delayed deleting rb-tree.
As with the delayed inserting rb-tree, we also check the number of
delayed items and do the delayed items balance.
(The same as for insertion.)
- When we want to update the metadata of some inode, we cache the data of the
inode in the delayed node. The worker will flush it into the b+ tree after
dealing with the delayed insertion and deletion.
- We move the delayed node to the tail of the list after we access the
delayed node. This way, we can cache more delayed items and merge more
inode updates.
- If we want to commit the transaction, we deal with all the delayed nodes.
- The delayed node will be freed when we free the btrfs inode.
- Before we log the inode items, we commit all the directory name index items
and the delayed inode update.
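The insert/delete rules and the balance policy from the list above can be sketched as follows. This is a simplified model and not the btrfs implementation: small fixed arrays stand in for the two rb-trees, the limits are plain parameters, and all `sketch_*` names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_ITEMS 16

/* Stand-in for one rb-tree of directory name index keys. */
struct sketch_set {
	uint64_t keys[SKETCH_MAX_ITEMS];
	size_t nr;
};

/* One delayed node: pending insertions and pending deletions. */
struct sketch_delayed_node {
	struct sketch_set ins;
	struct sketch_set del;
};

static int sketch_set_find(const struct sketch_set *s, uint64_t key)
{
	size_t i;

	for (i = 0; i < s->nr; i++)
		if (s->keys[i] == key)
			return (int)i;
	return -1;
}

static int sketch_set_add(struct sketch_set *s, uint64_t key)
{
	if (s->nr == SKETCH_MAX_ITEMS)
		return -1;
	s->keys[s->nr++] = key;
	return 0;
}

/* Queue an index for later insertion into the b+ tree. */
static int sketch_delayed_insert_index(struct sketch_delayed_node *n,
				       uint64_t index)
{
	return sketch_set_add(&n->ins, index);
}

/*
 * Deleting an index that is still waiting to be inserted just drops
 * the pending insertion; otherwise the key is queued for deletion.
 */
static int sketch_delayed_delete_index(struct sketch_delayed_node *n,
				       uint64_t index)
{
	int pos = sketch_set_find(&n->ins, index);

	if (pos >= 0) {
		n->ins.nr--;
		n->ins.keys[pos] = n->ins.keys[n->ins.nr];
		return 0;
	}
	return sketch_set_add(&n->del, index);
}

enum sketch_balance {
	SKETCH_NONE,			/* below the lower limit */
	SKETCH_ASYNC_SOME,		/* queue works for some nodes, go back */
	SKETCH_FLUSH_ALL_AND_WAIT	/* queue works for all nodes, then wait */
};

/* Balance policy: compare the delayed item count with both limits. */
static enum sketch_balance sketch_balance_policy(size_t items, size_t lower,
						 size_t upper)
{
	if (items > upper)
		return SKETCH_FLUSH_ALL_AND_WAIT;
	if (items > lower)
		return SKETCH_ASYNC_SOME;
	return SKETCH_NONE;
}
```

An insertion followed by a deletion of the same index cancels out in memory, so neither operation ever reaches the b+ tree.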
I did a quick test by the benchmark tool[1] and found we can improve the
performance of file creation by ~15%, and file deletion by ~20%.
Before applying this patch:
Create files:
Total files: 50000
Total time: 1.096108
Average time: 0.000022
Delete files:
Total files: 50000
Total time: 1.510403
Average time: 0.000030
After applying this patch:
Create files:
Total files: 50000
Total time: 0.932899
Average time: 0.000019
Delete files:
Total files: 50000
Total time: 1.215732
Average time: 0.000024
[1] http://marc.info/?l=linux-btrfs&m=128212635122920&q=p3
Many thanks for Kitayama-san's help!
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: David Sterba <dave@jikos.cz>
Tested-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Tested-by: Itaru Kitayama <kitayama@cl.bb4u.ne.jp>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-04-22 18:12:22 +08:00
|
|
|
void btrfs_clear_path_blocking(struct btrfs_path *p,
|
2011-07-16 15:23:14 -04:00
|
|
|
struct extent_buffer *held, int held_rw);
|
Btrfs: Change btree locking to use explicit blocking points
Most of the btrfs metadata operations can be protected by a spinlock,
but some operations still need to schedule.
So far, btrfs has been using a mutex along with a trylock loop,
most of the time it is able to avoid going for the full mutex, so
the trylock loop is a big performance gain.
This commit is step one for getting rid of the blocking locks entirely.
btrfs_tree_lock takes a spinlock, and the code explicitly switches
to a blocking lock when it starts an operation that can schedule.
We'll be able get rid of the blocking locks in smaller pieces over time.
Tracing allows us to find the most common cause of blocking, so we
can start with the hot spots first.
The basic idea is:
btrfs_tree_lock() returns with the spin lock held
btrfs_set_lock_blocking() sets the EXTENT_BUFFER_BLOCKING bit in
the extent buffer flags, and then drops the spin lock. The buffer is
still considered locked by all of the btrfs code.
If btrfs_tree_lock gets the spinlock but finds the blocking bit set, it drops
the spin lock and waits on a wait queue for the blocking bit to go away.
Much of the code that needs to set the blocking bit finishes without actually
blocking a good percentage of the time. So, an adaptive spin is still
used against the blocking bit to avoid very high context switch rates.
btrfs_clear_lock_blocking() clears the blocking bit and returns
with the spinlock held again.
btrfs_tree_unlock() can be called on either blocking or spinning locks;
it does the right thing based on the blocking bit.
ctree.c has a helper function to set/clear all the locked buffers in a
path as blocking.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
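The state transitions described in the commit message above can be sketched as a small single-threaded model. This is only an illustration under stated assumptions: every `toy_` name is hypothetical, the "spinlock" is a plain flag, and there is no wait queue or adaptive spinning, so it shows the rules for the blocking bit rather than the kernel's actual implementation.

```c
#include <stdbool.h>

/* Hypothetical toy model of one extent buffer lock; not the kernel struct. */
struct toy_eb_lock {
	bool spin_held;	/* models the spinlock being held */
	bool blocking;	/* models the EXTENT_BUFFER_BLOCKING bit */
};

/*
 * Models one attempt of btrfs_tree_lock(): the lock is only obtained when
 * the spinlock is free and the blocking bit is clear. Real code would spin
 * or sleep on a wait queue instead of returning false.
 */
bool toy_tree_trylock(struct toy_eb_lock *l)
{
	if (l->spin_held || l->blocking)
		return false;
	l->spin_held = true;
	return true;
}

/* Set the blocking bit, then drop the spinlock. The buffer stays locked. */
void toy_set_lock_blocking(struct toy_eb_lock *l)
{
	l->blocking = true;
	l->spin_held = false;
}

/* Clear the blocking bit and return with the spinlock held again. */
void toy_clear_lock_blocking(struct toy_eb_lock *l)
{
	l->spin_held = true;
	l->blocking = false;
}

/* Unlock works for both states and decides based on the blocking bit. */
void toy_tree_unlock(struct toy_eb_lock *l)
{
	if (l->blocking)
		l->blocking = false;	/* real code would also wake waiters */
	else
		l->spin_held = false;
}

/* The buffer counts as locked in either the spinning or blocking state. */
bool toy_is_locked(const struct toy_eb_lock *l)
{
	return l->spin_held || l->blocking;
}
```

In this model a second locker is refused both while the spinlock is held and while the blocking bit is set, which mirrors the statement that the buffer is "still considered locked by all of the btrfs code" after the switch to blocking.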
2009-02-04 09:25:08 -05:00
|
|
|
void btrfs_unlock_up_safe(struct btrfs_path *p, int level);
|
|
|
|
|
2008-01-29 15:11:36 -05:00
|
|
|
int btrfs_del_items(struct btrfs_trans_handle *trans, struct btrfs_root *root,
|
|
|
|
struct btrfs_path *path, int slot, int nr);
|
|
|
|
static inline int btrfs_del_item(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
struct btrfs_path *path)
|
|
|
|
{
|
|
|
|
return btrfs_del_items(trans, root, path, path->slots[0], 1);
|
|
|
|
}
|
|
|
|
|
btrfs: implement delayed inode items operation
Changelog V5 -> V6:
- Fix oom when the memory load is high, by storing the delayed nodes into the
root's radix tree, and letting btrfs inodes go.
Changelog V4 -> V5:
- Fix the race on adding the delayed node to the inode, which was spotted by
Chris Mason.
- Merge Chris Mason's incremental patch into this patch.
- Fix the deadlock between readdir() and memory fault, which was reported by
Itaru Kitayama.
Changelog V3 -> V4:
- Fix the nested lock, which was reported by Itaru Kitayama, by updating the space cache
inode in time.
Changelog V2 -> V3:
- Fix the race between the delayed worker and the task which does delayed items
balance, which was reported by Tsutomu Itoh.
- Modify the patch to address David Sterba's comments.
- Fix the cpu recursion spinlock bug, reported by Chris Mason.
Changelog V1 -> V2:
- break up the global rb-tree, use a list to manage the delayed nodes,
which is created for every directory and file, and used to manage the
delayed directory name index items and the delayed inode item.
- introduce a worker to deal with the delayed nodes.
Compared with Ext3/4, the performance of file creation and deletion on btrfs
is very poor. The reason is that btrfs must do a lot of b+ tree insertions,
such as the inode item, directory name item, directory name index and so on.
If we can delay some b+ tree insertions or deletions, we can improve the
performance, so this patch implements delayed directory name
index insertion/deletion and delayed inode update.
Implementation:
- Introduce a delayed root object into the filesystem, which uses two lists to
manage the delayed nodes that are created for every file/directory.
One list manages all the delayed nodes that have delayed items. The
other manages the delayed nodes that are waiting to be dealt with
by the worker thread.
- Every delayed node has two rb-trees: one manages the directory name
index items that are going to be inserted into the b+ tree, and the other
manages the directory name index items that are going to be deleted from the b+ tree.
- Introduce a worker to deal with the delayed operations. This worker
handles the delayed directory name index item insertions
and deletions and the delayed inode updates.
When the number of delayed items exceeds the lower limit, we create works for some
delayed nodes, insert them into the work queue of the worker, and then
return.
When the number of delayed items exceeds the upper bound, we create works for all
the delayed nodes that haven't been dealt with, insert them into the work
queue of the worker, and then wait until the number of untreated items drops below some
threshold value.
- When we want to insert a directory name index into the b+ tree, we just add the
information into the delayed inserting rb-tree.
Then we check the number of delayed items and do the delayed items
balance. (The balance policy is described above.)
- When we want to delete a directory name index from the b+ tree, we first search for it
in the inserting rb-tree. If we find it there, we just drop it. If not,
we add its key to the delayed deleting rb-tree.
As with the delayed inserting rb-tree, we also check the number of
delayed items and do the delayed items balance
(the same as for insertion).
- When we want to update the metadata of some inode, we cache the data of the
inode in the delayed node. The worker will flush it into the b+ tree after
dealing with the delayed insertions and deletions.
- We move the delayed node to the tail of the list after we access the
delayed node. This way, we can cache more delayed items and merge more
inode updates.
- When we commit a transaction, we deal with all the delayed nodes.
- The delayed node is freed when we free the btrfs inode.
- Before we log the inode items, we commit all the directory name index items
and the delayed inode update.
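The balance policy and the insert/delete bookkeeping in the list above can be sketched as follows. This is a minimal model under stated assumptions: all `toy_` names are hypothetical, small fixed arrays stand in for the two rb-trees, and the limits are caller-supplied because the message does not give concrete values.

```c
#define TOY_MAX_ITEMS 16

/* What the balance step should do for a given number of delayed items. */
enum toy_balance_action {
	TOY_BALANCE_NONE,	/* below the lower limit: nothing to do */
	TOY_BALANCE_ASYNC,	/* queue works for some nodes and return */
	TOY_BALANCE_SYNC,	/* queue works for all nodes and wait */
};

enum toy_balance_action toy_balance_decide(int nr_items, int lower, int upper)
{
	if (nr_items > upper)
		return TOY_BALANCE_SYNC;
	if (nr_items > lower)
		return TOY_BALANCE_ASYNC;
	return TOY_BALANCE_NONE;
}

/* Arrays stand in for the delayed inserting and deleting rb-trees. */
struct toy_delayed_node {
	unsigned long long ins[TOY_MAX_ITEMS];
	int nr_ins;
	unsigned long long del[TOY_MAX_ITEMS];
	int nr_del;
};

/* Record a pending directory index insertion; -1 if the model is full. */
int toy_delayed_insert(struct toy_delayed_node *n, unsigned long long index)
{
	if (n->nr_ins == TOY_MAX_ITEMS)
		return -1;
	n->ins[n->nr_ins++] = index;
	return 0;
}

/*
 * Deletion first looks in the pending insertions. A hit cancels the
 * insertion, so nothing ever reaches the b+ tree. A miss is recorded
 * as a pending deletion.
 */
int toy_delayed_delete(struct toy_delayed_node *n, unsigned long long index)
{
	int i;

	for (i = 0; i < n->nr_ins; i++) {
		if (n->ins[i] == index) {
			n->ins[i] = n->ins[--n->nr_ins];
			return 0;
		}
	}
	if (n->nr_del == TOY_MAX_ITEMS)
		return -1;
	n->del[n->nr_del++] = index;
	return 0;
}
```

The cancellation path is where a create followed by a quick delete can avoid touching the b+ tree at all, which is one reason the delayed approach can help on workloads with short-lived files.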
I did a quick test with the benchmark tool[1] and found that this patch improves the
performance of file creation by ~15% and of file deletion by ~20%.
Before applying this patch:
Create files:
Total files: 50000
Total time: 1.096108
Average time: 0.000022
Delete files:
Total files: 50000
Total time: 1.510403
Average time: 0.000030
After applying this patch:
Create files:
Total files: 50000
Total time: 0.932899
Average time: 0.000019
Delete files:
Total files: 50000
Total time: 1.215732
Average time: 0.000024
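The quoted improvements can be checked from the total times listed above. The helper below is a hypothetical one-liner, not part of the patch; it expresses the saved time as a percentage of the time before the patch.

```c
/* Relative improvement, in percent, of a run that got faster. */
double toy_improvement_percent(double before, double after)
{
	return (before - after) / before * 100.0;
}
```

With the totals from the message, `toy_improvement_percent(1.096108, 0.932899)` comes out at roughly 14.9 for file creation and `toy_improvement_percent(1.510403, 1.215732)` at roughly 19.5 for file deletion, consistent with the rounded ~15% and ~20% figures.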
[1] http://marc.info/?l=linux-btrfs&m=128212635122920&q=p3
Many thanks for Kitayama-san's help!
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: David Sterba <dave@jikos.cz>
Tested-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Tested-by: Itaru Kitayama <kitayama@cl.bb4u.ne.jp>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-04-22 18:12:22 +08:00
|
|
|
int setup_items_for_insert(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root, struct btrfs_path *path,
|
|
|
|
struct btrfs_key *cpu_key, u32 *data_size,
|
|
|
|
u32 total_data, u32 total_size, int nr);
|
2007-03-16 16:20:31 -04:00
|
|
|
int btrfs_insert_item(struct btrfs_trans_handle *trans, struct btrfs_root
|
|
|
|
*root, struct btrfs_key *key, void *data, u32 data_size);
|
2008-01-29 15:15:18 -05:00
|
|
|
int btrfs_insert_empty_items(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
struct btrfs_path *path,
|
|
|
|
struct btrfs_key *cpu_key, u32 *data_size, int nr);
|
|
|
|
|
|
|
|
static inline int btrfs_insert_empty_item(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
struct btrfs_path *path,
|
|
|
|
struct btrfs_key *key,
|
|
|
|
u32 data_size)
|
|
|
|
{
|
|
|
|
return btrfs_insert_empty_items(trans, root, path, key, &data_size, 1);
|
|
|
|
}
|
|
|
|
|
2007-03-13 10:46:10 -04:00
|
|
|
int btrfs_next_leaf(struct btrfs_root *root, struct btrfs_path *path);
|
2007-12-11 09:25:06 -05:00
|
|
|
int btrfs_prev_leaf(struct btrfs_root *root, struct btrfs_path *path);
|
2007-10-15 16:14:19 -04:00
|
|
|
int btrfs_leaf_free_space(struct btrfs_root *root, struct extent_buffer *leaf);
|
2011-08-09 07:11:13 +00:00
|
|
|
void btrfs_drop_snapshot(struct btrfs_root *root,
|
|
|
|
struct btrfs_block_rsv *block_rsv, int update_ref);
|
2008-10-29 14:49:05 -04:00
|
|
|
int btrfs_drop_subtree(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
struct extent_buffer *node,
|
|
|
|
struct extent_buffer *parent);
|
2011-05-31 18:07:27 +02:00
|
|
|
static inline int btrfs_fs_closing(struct btrfs_fs_info *fs_info)
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* Get synced with close_ctree()
|
|
|
|
*/
|
|
|
|
smp_mb();
|
|
|
|
return fs_info->closing;
|
|
|
|
}
|
2011-04-13 15:41:04 +02:00
|
|
|
static inline void free_fs_info(struct btrfs_fs_info *fs_info)
|
|
|
|
{
|
|
|
|
kfree(fs_info->delayed_root);
|
|
|
|
kfree(fs_info->extent_root);
|
|
|
|
kfree(fs_info->tree_root);
|
|
|
|
kfree(fs_info->chunk_root);
|
|
|
|
kfree(fs_info->dev_root);
|
|
|
|
kfree(fs_info->csum_root);
|
|
|
|
kfree(fs_info->super_copy);
|
|
|
|
kfree(fs_info->super_for_commit);
|
|
|
|
kfree(fs_info);
|
|
|
|
}
|
2011-05-31 18:07:27 +02:00
|
|
|
|
2007-03-26 16:00:06 -04:00
|
|
|
/* root-item.c */
|
2008-11-17 21:14:24 -05:00
|
|
|
int btrfs_find_root_ref(struct btrfs_root *tree_root,
|
2009-09-21 15:56:00 -04:00
|
|
|
struct btrfs_path *path,
|
|
|
|
u64 root_id, u64 ref_id);
|
2008-11-17 20:37:39 -05:00
|
|
|
int btrfs_add_root_ref(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *tree_root,
|
2009-09-21 15:56:00 -04:00
|
|
|
u64 root_id, u64 ref_id, u64 dirid, u64 sequence,
|
|
|
|
const char *name, int name_len);
|
|
|
|
int btrfs_del_root_ref(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *tree_root,
|
|
|
|
u64 root_id, u64 ref_id, u64 dirid, u64 *sequence,
|
2008-11-17 20:37:39 -05:00
|
|
|
const char *name, int name_len);
|
2007-03-16 16:20:31 -04:00
|
|
|
int btrfs_del_root(struct btrfs_trans_handle *trans, struct btrfs_root *root,
|
|
|
|
struct btrfs_key *key);
|
|
|
|
int btrfs_insert_root(struct btrfs_trans_handle *trans, struct btrfs_root
|
|
|
|
*root, struct btrfs_key *key, struct btrfs_root_item
|
|
|
|
*item);
|
|
|
|
int btrfs_update_root(struct btrfs_trans_handle *trans, struct btrfs_root
|
|
|
|
*root, struct btrfs_key *key, struct btrfs_root_item
|
|
|
|
*item);
|
|
|
|
int btrfs_find_last_root(struct btrfs_root *root, u64 objectid, struct
|
|
|
|
btrfs_root_item *item, struct btrfs_key *key);
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in a subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits the old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update the back refs for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about the pointer's key, its level and the
tree in which the pointer lives. This information allows us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds a per-subvolume red-black tree to keep track of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers are within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduces the overhead of checking back refs.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
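The two COW cases described in the commit message above can be modeled with plain reference counters. This is a sketch under stated assumptions: `toy_block` and `toy_cow` are hypothetical names, a block holds at most four pointers, and back ref items are not modeled at all.

```c
#define TOY_MAX_PTRS 4

/* Hypothetical in-memory stand-in for a tree block and its extent refs. */
struct toy_block {
	int refs;				/* reference count of this block */
	int freed;				/* set once the block is released */
	int nr_ptrs;				/* number of valid entries in ptrs */
	struct toy_block *ptrs[TOY_MAX_PTRS];	/* extents this block points to */
};

/* COW `old` into `copy`, following the two rules from the message. */
void toy_cow(struct toy_block *old, struct toy_block *copy)
{
	int i;

	copy->refs = 1;
	copy->freed = 0;
	copy->nr_ptrs = old->nr_ptrs;
	for (i = 0; i < old->nr_ptrs; i++)
		copy->ptrs[i] = old->ptrs[i];

	if (old->refs == 1) {
		/*
		 * Non-shared block: free it at once. The copy inherits its
		 * references, so the lower level counts stay untouched.
		 */
		old->refs = 0;
		old->freed = 1;
	} else {
		/*
		 * Shared block: everything the copy points to gains one
		 * reference, and the old block loses one.
		 */
		for (i = 0; i < copy->nr_ptrs; i++)
			copy->ptrs[i]->refs++;
		old->refs--;
	}
}
```

The non-shared branch does no work on the lower level extents, which is the saving the message attributes to avoiding dead root records.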
2009-06-10 10:45:14 -04:00
|
|
|
int btrfs_find_dead_roots(struct btrfs_root *root, u64 objectid);
|
2009-09-21 16:00:26 -04:00
|
|
|
int btrfs_find_orphan_roots(struct btrfs_root *tree_root);
|
2011-07-14 21:23:06 +00:00
|
|
|
void btrfs_set_root_node(struct btrfs_root_item *item,
|
|
|
|
struct extent_buffer *node);
|
2011-03-28 02:01:25 +00:00
|
|
|
void btrfs_check_and_init_root_item(struct btrfs_root_item *item);
|
|
|
|
|
2007-03-26 16:00:06 -04:00
|
|
|
/* dir-item.c */
|
2009-01-05 21:25:51 -05:00
|
|
|
int btrfs_insert_dir_item(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root, const char *name,
|
2011-04-22 18:12:22 +08:00
|
|
|
int name_len, struct inode *dir,
|
2008-07-24 12:12:38 -04:00
|
|
|
struct btrfs_key *location, u8 type, u64 index);
|
2007-04-19 15:36:27 -04:00
|
|
|
struct btrfs_dir_item *btrfs_lookup_dir_item(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
struct btrfs_path *path, u64 dir,
|
|
|
|
const char *name, int name_len,
|
|
|
|
int mod);
|
|
|
|
struct btrfs_dir_item *
|
|
|
|
btrfs_lookup_dir_index_item(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
struct btrfs_path *path, u64 dir,
|
|
|
|
u64 objectid, const char *name, int name_len,
|
|
|
|
int mod);
|
2009-09-21 15:56:00 -04:00
|
|
|
struct btrfs_dir_item *
|
|
|
|
btrfs_search_dir_index_item(struct btrfs_root *root,
|
|
|
|
struct btrfs_path *path, u64 dirid,
|
|
|
|
const char *name, int name_len);
|
2007-04-19 15:36:27 -04:00
|
|
|
struct btrfs_dir_item *btrfs_match_dir_item_name(struct btrfs_root *root,
|
|
|
|
struct btrfs_path *path,
|
2007-03-23 15:56:19 -04:00
|
|
|
const char *name, int name_len);
|
2007-04-19 15:36:27 -04:00
|
|
|
int btrfs_delete_one_dir_name(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
struct btrfs_path *path,
|
|
|
|
struct btrfs_dir_item *di);
|
2007-11-16 11:45:54 -05:00
|
|
|
int btrfs_insert_xattr_item(struct btrfs_trans_handle *trans,
|
2009-11-12 09:35:27 +00:00
|
|
|
struct btrfs_root *root,
|
|
|
|
struct btrfs_path *path, u64 objectid,
|
|
|
|
const char *name, u16 name_len,
|
|
|
|
const void *data, u16 data_len);
|
2007-11-16 11:45:54 -05:00
|
|
|
struct btrfs_dir_item *btrfs_lookup_xattr(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
struct btrfs_path *path, u64 dir,
|
|
|
|
const char *name, u16 name_len,
|
|
|
|
int mod);
|
2011-03-16 16:47:17 -04:00
|
|
|
int verify_dir_item(struct btrfs_root *root,
|
|
|
|
struct extent_buffer *leaf,
|
|
|
|
struct btrfs_dir_item *dir_item);
|
2008-07-24 12:17:14 -04:00
|
|
|
|
|
|
|
/* orphan.c */
|
|
|
|
int btrfs_insert_orphan_item(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root, u64 offset);
|
|
|
|
int btrfs_del_orphan_item(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root, u64 offset);
|
2009-09-21 15:56:00 -04:00
|
|
|
int btrfs_find_orphan_item(struct btrfs_root *root, u64 offset);
|
2008-07-24 12:17:14 -04:00
|
|
|
|
2007-03-26 16:00:06 -04:00
|
|
|
/* inode-item.c */
|
2007-12-12 14:38:19 -05:00
|
|
|
int btrfs_insert_inode_ref(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
const char *name, int name_len,
|
2008-07-24 12:12:38 -04:00
|
|
|
u64 inode_objectid, u64 ref_objectid, u64 index);
|
2007-12-12 14:38:19 -05:00
|
|
|
int btrfs_del_inode_ref(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
const char *name, int name_len,
|
2008-07-24 12:12:38 -04:00
|
|
|
u64 inode_objectid, u64 ref_objectid, u64 *index);
|
2010-05-16 10:48:46 -04:00
|
|
|
struct btrfs_inode_ref *
|
|
|
|
btrfs_lookup_inode_ref(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
struct btrfs_path *path,
|
|
|
|
const char *name, int name_len,
|
|
|
|
u64 inode_objectid, u64 ref_objectid, int mod);
|
2007-10-15 16:14:19 -04:00
|
|
|
int btrfs_insert_empty_inode(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
struct btrfs_path *path, u64 objectid);
|
2007-03-20 15:57:25 -04:00
|
|
|
int btrfs_lookup_inode(struct btrfs_trans_handle *trans, struct btrfs_root
|
2007-04-06 15:37:36 -04:00
|
|
|
*root, struct btrfs_path *path,
|
|
|
|
struct btrfs_key *location, int mod);
|
2007-03-26 16:00:06 -04:00
|
|
|
|
|
|
|
/* file-item.c */
|
2008-12-10 09:10:46 -05:00
|
|
|
int btrfs_del_csums(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root, u64 bytenr, u64 len);
|
2008-07-31 15:42:53 -04:00
|
|
|
int btrfs_lookup_bio_sums(struct btrfs_root *root, struct inode *inode,
|
Btrfs: move data checksumming into a dedicated tree
Btrfs stores checksums for each data block. Until now, they have
been stored in the subvolume trees, indexed by the inode that is
referencing the data block. This means that when we read the inode,
we've probably read in at least some checksums as well.
But, this has a few problems:
* The checksums are indexed by logical offset in the file. When
compression is on, this means we have to do the expensive checksumming
on the uncompressed data. It would be faster if we could checksum
the compressed data instead.
* If we implement encryption, we'll be checksumming the plain text and
storing that on disk. This is significantly less secure.
* For either compression or encryption, we have to get the plain text
back before we can verify the checksum as correct. This makes the raid
layer balancing and extent moving much more expensive.
* It makes the front end caching code more complex, as we have to touch
the subvolume and inodes as we cache extents.
* There is potentially one copy of the checksum in each subvolume
referencing an extent.
The solution used here is to store the extent checksums in a dedicated
tree. This allows us to index the checksums by physical extent
start and length. It means:
* The checksum is against the data stored on disk, after any compression
or encryption is done.
* The checksum is stored in a central location, and can be verified without
following back references, or reading inodes.
This makes compression significantly faster by reducing the amount of
data that needs to be checksummed. It will also allow much faster
raid management code in general.
The checksums are indexed by a key with a fixed objectid (a magic value
in ctree.h) and offset set to the starting byte of the extent. This
allows us to copy the checksum items into the fsync log tree directly (or
any other tree), without having to invent a second format for them.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
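The indexing scheme in the last paragraph of the message above (a fixed objectid plus the extent's starting byte as the offset) can be sketched like this. Everything here is illustrative: the `toy_` names are hypothetical, the two constants are placeholders rather than values taken from ctree.h, and the comparison order (objectid, then type, then offset) is an assumption about how tree keys are sorted.

```c
#include <stdint.h>

/* Placeholder values for illustration only. */
#define TOY_CSUM_OBJECTID	((uint64_t)-10)
#define TOY_CSUM_KEY_TYPE	128

/* Simplified CPU-order key: which object, which item type, which offset. */
struct toy_key {
	uint64_t objectid;
	uint8_t type;
	uint64_t offset;
};

/* Build the key under which checksums for an extent would be stored. */
struct toy_key toy_csum_key(uint64_t extent_start)
{
	struct toy_key k;

	k.objectid = TOY_CSUM_OBJECTID;	/* same for every checksum item */
	k.type = TOY_CSUM_KEY_TYPE;
	k.offset = extent_start;	/* starting byte of the extent */
	return k;
}

/* Three-way compare: objectid first, then type, then offset. */
int toy_key_cmp(const struct toy_key *a, const struct toy_key *b)
{
	if (a->objectid != b->objectid)
		return a->objectid < b->objectid ? -1 : 1;
	if (a->type != b->type)
		return a->type < b->type ? -1 : 1;
	if (a->offset != b->offset)
		return a->offset < b->offset ? -1 : 1;
	return 0;
}
```

Because the key carries no inode or subvolume number, the same item can be copied into another tree (such as the fsync log tree) without rewriting it, which is the property the message points out.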
2008-12-08 16:58:54 -05:00
|
|
|
struct bio *bio, u32 *dst);
|
2010-05-23 11:00:55 -04:00
|
|
|
int btrfs_lookup_bio_sums_dio(struct btrfs_root *root, struct inode *inode,
|
|
|
|
struct bio *bio, u64 logical_offset, u32 *dst);
|
2007-04-17 13:26:50 -04:00
|
|
|
int btrfs_insert_file_extent(struct btrfs_trans_handle *trans,
|
Btrfs: Add zlib compression support
This is a large change for adding compression on reading and writing,
both for inline and regular extents. It does some fairly large
surgery to the writeback paths.
Compression is off by default and enabled by mount -o compress. Even
when the -o compress mount option is not used, it is possible to read
compressed extents off the disk.
If compression for a given set of pages fails to make them smaller, the
file is flagged to avoid future compression attempts later.
* While finding delalloc extents, the pages are locked before being sent down
to the delalloc handler. This allows the delalloc handler to do complex things
such as cleaning the pages, marking them writeback and starting IO on their
behalf.
* Inline extents are inserted at delalloc time now. This allows us to compress
the data before inserting the inline extent, and it allows us to insert
an inline extent that spans multiple pages.
* All of the in-memory extent representations (extent_map.c, ordered-data.c etc)
are changed to record both an in-memory size and an on disk size, as well
as a flag for compression.
From a disk format point of view, the extent pointers in the file are changed
to record the on disk size of a given extent and some encoding flags.
Space in the disk format is allocated for compression encoding, as well
as encryption and a generic 'other' field. Neither the encryption nor the
'other' field is currently used.
In order to limit the amount of data read for a single random read in the
file, the size of a compressed extent is limited to 128k. This is a
software only limit; the disk format supports u64 sized compressed extents.
In order to limit the ram consumed while processing extents, the uncompressed
size of a compressed extent is limited to 256k. This is a software only limit
and will be subject to tuning later.
Checksumming is still done on compressed extents, and it is done on the
uncompressed version of the data. This way additional encodings can be
layered on without having to figure out which encoding to checksum.
Compression happens at delalloc time, which is basically single threaded because
it is usually done by a single pdflush thread. This makes it tricky to
spread the compression load across all the cpus on the box. We'll have to
look at parallel pdflush walks of dirty inodes at a later time.
Decompression is hooked into readpages and it does spread across CPUs nicely.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-10-29 14:49:59 -04:00
|
|
|
struct btrfs_root *root,
|
|
|
|
u64 objectid, u64 pos,
|
|
|
|
u64 disk_offset, u64 disk_num_bytes,
|
|
|
|
u64 num_bytes, u64 offset, u64 ram_bytes,
|
|
|
|
u8 compression, u8 encryption, u16 other_encoding);
|
2007-03-26 16:00:06 -04:00
|
|
|
int btrfs_lookup_file_extent(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
struct btrfs_path *path, u64 objectid,
|
2007-10-15 16:15:53 -04:00
|
|
|
u64 bytenr, int mod);
|
2008-02-20 12:07:25 -05:00
|
|
|
int btrfs_csum_file_blocks(struct btrfs_trans_handle *trans,
|
2008-12-08 16:58:54 -05:00
|
|
|
struct btrfs_root *root,
|
2008-07-17 12:53:50 -04:00
|
|
|
struct btrfs_ordered_sum *sums);
|
2008-07-18 06:17:13 -04:00
|
|
|
int btrfs_csum_one_bio(struct btrfs_root *root, struct inode *inode,
|
2008-12-08 16:58:54 -05:00
|
|
|
struct bio *bio, u64 file_start, int contig);
|
2007-04-17 13:26:50 -04:00
|
|
|
struct btrfs_csum_item *btrfs_lookup_csum(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
struct btrfs_path *path,
|
Btrfs: move data checksumming into a dedicated tree
Btrfs stores checksums for each data block. Until now, they have
been stored in the subvolume trees, indexed by the inode that is
referencing the data block. This means that when we read the inode,
we've probably read in at least some checksums as well.
But, this has a few problems:
* The checksums are indexed by logical offset in the file. When
compression is on, this means we have to do the expensive checksumming
on the uncompressed data. It would be faster if we could checksum
the compressed data instead.
* If we implement encryption, we'll be checksumming the plain text and
storing that on disk. This is significantly less secure.
* For either compression or encryption, we have to get the plain text
back before we can verify the checksum as correct. This makes the raid
layer balancing and extent moving much more expensive.
* It makes the front end caching code more complex, as we have to touch
the subvolume and inodes as we cache extents.
* There is potentially one copy of the checksum in each subvolume
referencing an extent.
The solution used here is to store the extent checksums in a dedicated
tree. This allows us to index the checksums by physical extent
start and length. It means:
* The checksum is against the data stored on disk, after any compression
or encryption is done.
* The checksum is stored in a central location, and can be verified without
following back references, or reading inodes.
This makes compression significantly faster by reducing the amount of
data that needs to be checksummed. It will also allow much faster
raid management code in general.
The checksums are indexed by a key with a fixed objectid (a magic value
in ctree.h) and offset set to the starting byte of the extent. This
allows us to copy the checksum items into the fsync log tree directly (or
any other tree), without having to invent a second format for them.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-12-08 16:58:54 -05:00
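The key layout described in the commit message above can be sketched in userspace C. This is a minimal illustration, not the kernel code: `sketch_key`, `sketch_csum_key`, `sketch_key_cmp` and the two `SKETCH_` constants are hypothetical stand-ins, and the real magic objectid should be taken from ctree.h.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical placeholder values; the real constants are defined in ctree.h. */
#define SKETCH_CSUM_OBJECTID ((uint64_t)-10)
#define SKETCH_CSUM_KEY_TYPE 128

/* Simplified three-part key: (objectid, type, offset). */
struct sketch_key {
	uint64_t objectid;
	uint8_t type;
	uint64_t offset;
};

/* Build the key of the checksum item for the extent starting at bytenr. */
static struct sketch_key sketch_csum_key(uint64_t bytenr)
{
	struct sketch_key key = {
		.objectid = SKETCH_CSUM_OBJECTID, /* same for every checksum item */
		.type = SKETCH_CSUM_KEY_TYPE,
		.offset = bytenr,                 /* starting byte of the extent */
	};
	return key;
}

/* Order keys by objectid, then type, then offset. */
static int sketch_key_cmp(const struct sketch_key *a, const struct sketch_key *b)
{
	if (a->objectid != b->objectid)
		return a->objectid < b->objectid ? -1 : 1;
	if (a->type != b->type)
		return a->type < b->type ? -1 : 1;
	if (a->offset != b->offset)
		return a->offset < b->offset ? -1 : 1;
	return 0;
}
```

Because the objectid is fixed, checksum items sort purely by extent start, and the same key is meaningful in whichever tree the item is copied into.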
|
|
|
u64 bytenr, int cow);
|
2007-05-29 15:17:08 -04:00
|
|
|
int btrfs_csum_truncate(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root, struct btrfs_path *path,
|
|
|
|
u64 isize);
|
2011-03-08 14:14:00 +01:00
|
|
|
int btrfs_lookup_csums_range(struct btrfs_root *root, u64 start, u64 end,
|
|
|
|
struct list_head *list, int search_commit);
|
2007-06-12 06:35:45 -04:00
|
|
|
/* inode.c */
|
2011-07-18 13:21:36 -04:00
|
|
|
struct extent_map *btrfs_get_extent_fiemap(struct inode *inode, struct page *page,
|
|
|
|
size_t pg_offset, u64 start, u64 len,
|
|
|
|
int create);
|
2008-07-24 09:51:08 -04:00
|
|
|
|
|
|
|
/* RHEL and EL kernels have a patch that renames PG_checked to FsMisc */
|
2008-08-07 11:19:42 -04:00
|
|
|
#if defined(ClearPageFsMisc) && !defined(ClearPageChecked)
|
2008-07-24 09:51:08 -04:00
|
|
|
#define ClearPageChecked ClearPageFsMisc
|
|
|
|
#define SetPageChecked SetPageFsMisc
|
|
|
|
#define PageChecked PageFsMisc
|
|
|
|
#endif
|
|
|
|
|
2011-07-20 03:46:35 +00:00
|
|
|
/* This forces readahead on a given range of bytes in an inode */
|
|
|
|
static inline void btrfs_force_ra(struct address_space *mapping,
|
|
|
|
struct file_ra_state *ra, struct file *file,
|
|
|
|
pgoff_t offset, unsigned long req_size)
|
|
|
|
{
|
|
|
|
page_cache_sync_readahead(mapping, ra, file, offset, req_size);
|
|
|
|
}
|
|
|
|
|
2008-11-17 21:02:50 -05:00
|
|
|
struct inode *btrfs_lookup_dentry(struct inode *dir, struct dentry *dentry);
|
|
|
|
int btrfs_set_inode_index(struct inode *dir, u64 *index);
|
2008-09-05 16:13:11 -04:00
|
|
|
int btrfs_unlink_inode(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
struct inode *dir, struct inode *inode,
|
|
|
|
const char *name, int name_len);
|
|
|
|
int btrfs_add_link(struct btrfs_trans_handle *trans,
|
|
|
|
struct inode *parent_inode, struct inode *inode,
|
|
|
|
const char *name, int name_len, int add_backref, u64 index);
|
2009-09-21 15:56:00 -04:00
|
|
|
int btrfs_unlink_subvol(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
struct inode *dir, u64 objectid,
|
|
|
|
const char *name, int name_len);
|
2008-09-05 16:13:11 -04:00
|
|
|
int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
struct inode *inode, u64 new_size,
|
|
|
|
u32 min_type);
|
|
|
|
|
2009-11-12 09:36:34 +00:00
|
|
|
int btrfs_start_delalloc_inodes(struct btrfs_root *root, int delay_iput);
|
2010-02-03 19:33:23 +00:00
|
|
|
int btrfs_set_extent_delalloc(struct inode *inode, u64 start, u64 end,
|
|
|
|
struct extent_state **cached_state);
|
2008-07-22 11:18:09 -04:00
|
|
|
int btrfs_writepages(struct address_space *mapping,
|
|
|
|
struct writeback_control *wbc);
|
2008-12-11 16:30:39 -05:00
|
|
|
int btrfs_create_subvol_root(struct btrfs_trans_handle *trans,
|
2011-05-11 15:26:06 -04:00
|
|
|
struct btrfs_root *new_root, u64 new_dirid);
|
2008-03-24 15:02:07 -04:00
|
|
|
int btrfs_merge_bio_hook(struct page *page, unsigned long offset,
|
Btrfs: Add zlib compression support
This is a large change for adding compression on reading and writing,
both for inline and regular extents. It does some fairly large
surgery to the writeback paths.
Compression is off by default and enabled by mount -o compress. Even
when the -o compress mount option is not used, it is possible to read
compressed extents off the disk.
If compression for a given set of pages fails to make them smaller, the
file is flagged to avoid future compression attempts later.
* While finding delalloc extents, the pages are locked before being sent down
to the delalloc handler. This allows the delalloc handler to do complex things
such as cleaning the pages, marking them writeback and starting IO on their
behalf.
* Inline extents are inserted at delalloc time now. This allows us to compress
the data before inserting the inline extent, and it allows us to insert
an inline extent that spans multiple pages.
* All of the in-memory extent representations (extent_map.c, ordered-data.c etc)
are changed to record both an in-memory size and an on disk size, as well
as a flag for compression.
From a disk format point of view, the extent pointers in the file are changed
to record the on disk size of a given extent and some encoding flags.
Space in the disk format is allocated for compression encoding, as well
as encryption and a generic 'other' field. Neither the encryption nor the
'other' field is currently used.
In order to limit the amount of data read for a single random read in the
file, the size of a compressed extent is limited to 128k. This is a
software only limit, the disk format supports u64 sized compressed extents.
In order to limit the ram consumed while processing extents, the uncompressed
size of a compressed extent is limited to 256k. This is a software only limit
and will be subject to tuning later.
Checksumming is still done on compressed extents, and it is done on the
uncompressed version of the data. This way additional encodings can be
layered on without having to figure out which encoding to checksum.
Compression happens at delalloc time, which is basically single threaded because
it is usually done by a single pdflush thread. This makes it tricky to
spread the compression load across all the cpus on the box. We'll have to
look at parallel pdflush walks of dirty inodes at a later time.
Decompression is hooked into readpages and it does spread across CPUs nicely.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-10-29 14:49:59 -04:00
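Two rules from the commit message above lend themselves to a small sketch: the 256k cap on the uncompressed size of one compressed extent, and flagging the file when compression does not shrink the data. The names below (`sketch_file`, `sketch_chunk_len`, `sketch_keep_compressed`) are hypothetical and only model the decision, not the real writeback path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Software-only limit taken from the commit message. */
#define SKETCH_MAX_UNCOMPRESSED (256 * 1024)

/* Hypothetical per-file state. */
struct sketch_file {
	bool nocompress; /* set once compression failed to make data smaller */
};

/* Clamp how much plain data is fed into a single compressed extent. */
static size_t sketch_chunk_len(size_t remaining)
{
	return remaining < SKETCH_MAX_UNCOMPRESSED ? remaining
						   : SKETCH_MAX_UNCOMPRESSED;
}

/*
 * Return true if the compressed bytes should be stored. If compression
 * did not make the data smaller, flag the file so later writes skip the
 * compression attempt, and return false.
 */
static bool sketch_keep_compressed(struct sketch_file *f, size_t raw_len,
				   size_t comp_len)
{
	if (comp_len >= raw_len) {
		f->nocompress = true;
		return false;
	}
	return true;
}
```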
|
|
|
size_t size, struct bio *bio, unsigned long bio_flags);
|
2008-03-24 15:02:07 -04:00
|
|
|
|
2009-03-31 15:23:21 -07:00
|
|
|
int btrfs_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf);
|
2007-06-15 13:50:00 -04:00
|
|
|
int btrfs_readpage(struct file *file, struct page *page);
|
2010-06-07 11:35:40 -04:00
|
|
|
void btrfs_evict_inode(struct inode *inode);
|
2010-03-05 09:21:37 +01:00
|
|
|
int btrfs_write_inode(struct inode *inode, struct writeback_control *wbc);
|
2011-11-30 10:45:38 -05:00
|
|
|
int btrfs_dirty_inode(struct inode *inode);
|
|
|
|
int btrfs_update_time(struct file *file);
|
2007-06-12 06:35:45 -04:00
|
|
|
struct inode *btrfs_alloc_inode(struct super_block *sb);
|
|
|
|
void btrfs_destroy_inode(struct inode *inode);
|
2010-06-07 13:43:19 -04:00
|
|
|
int btrfs_drop_inode(struct inode *inode);
|
2007-06-12 06:35:45 -04:00
|
|
|
int btrfs_init_cachep(void);
|
|
|
|
void btrfs_destroy_cachep(void);
|
2008-06-10 10:07:39 -04:00
|
|
|
long btrfs_ioctl_trans_end(struct file *file);
|
2008-07-21 02:01:04 +05:30
|
|
|
struct inode *btrfs_iget(struct super_block *s, struct btrfs_key *location,
|
Btrfs: change how we mount subvolumes
This work is in preparation for being able to set a different root as the
default mounting root.
There is currently a problem with how we mount subvolumes. We cannot currently
mount a subvolume of a subvolume, you can only mount subvolumes/snapshots of the
default subvolume. So say you take a snapshot of the default subvolume and call
it snap1, and then take a snapshot of snap1 and call it snap2, so now you have
/
/snap1
/snap1/snap2
as your available volumes. Currently you can only mount / and /snap1,
you cannot mount /snap1/snap2. To fix this problem instead of passing
subvolid=<name> you must pass in subvolid=<treeid>, where <treeid> is
the tree id that gets spit out via the subvolume listing you get from
the subvolume listing patches (btrfs filesystem list). This allows us
to mount /, /snap1 and /snap1/snap2 as the root volume.
In addition to the above, we also now read the default dir item in the
tree root to get the root key that it points to. For now this just
points at what has always been the default subvolume, but later on I plan
to change it to point at whatever root you want to be the new default
root, so you can just set the default mount and not have to mount with
-o subvolid=<treeid>. I tested this out with the above scenario and it
worked perfectly. Thanks,
mount -o subvol operates inside the selected subvolid. For example:
mount -o subvol=snap1,subvolid=256 /dev/xxx /mnt
/mnt will have the snap1 directory for the subvolume with id
256.
mount -o subvol=snap /dev/xxx /mnt
/mnt will be the snap directory of whatever the default subvolume
is.
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-12-04 17:38:27 +00:00
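To illustrate the subvolid=<treeid> option described above, here is a minimal userspace sketch that pulls the tree id out of a comma separated option string. `sketch_parse_subvolid` is a hypothetical helper written for this example; it is not the option parser used in super.c.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Look for "subvolid=<treeid>" in a comma separated option string.
 * Returns 0 and stores the id on success, -1 if the option is missing
 * or its value is not a plain decimal number.
 */
static int sketch_parse_subvolid(const char *opts, uint64_t *treeid)
{
	const char *name = "subvolid=";
	size_t n = strlen(name);
	const char *p = opts;

	while (p && *p) {
		if (strncmp(p, name, n) == 0) {
			char *end;
			unsigned long long v = strtoull(p + n, &end, 10);

			/* Reject empty values and trailing junk. */
			if (end == p + n || (*end != ',' && *end != '\0'))
				return -1;
			*treeid = v;
			return 0;
		}
		/* Advance to the option after the next comma, if any. */
		p = strchr(p, ',');
		if (p)
			p++;
	}
	return -1;
}
```

With the option string from the example in the commit message, "subvol=snap1,subvolid=256", the helper yields tree id 256; "subvol=snap" alone leaves the default subvolume in effect.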
|
|
|
struct btrfs_root *root, int *was_new);
|
2007-08-27 16:49:44 -04:00
|
|
|
struct extent_map *btrfs_get_extent(struct inode *inode, struct page *page,
|
2011-04-19 14:29:38 +02:00
|
|
|
size_t pg_offset, u64 start, u64 end,
|
2007-08-27 16:49:44 -04:00
|
|
|
int create);
|
|
|
|
int btrfs_update_inode(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
struct inode *inode);
|
2008-09-26 10:05:38 -04:00
|
|
|
int btrfs_orphan_add(struct btrfs_trans_handle *trans, struct inode *inode);
|
|
|
|
int btrfs_orphan_del(struct btrfs_trans_handle *trans, struct inode *inode);
|
2011-01-31 16:22:42 -05:00
|
|
|
int btrfs_orphan_cleanup(struct btrfs_root *root);
|
2010-05-16 10:49:58 -04:00
|
|
|
void btrfs_orphan_commit_root(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root);
|
2011-01-31 15:30:16 -05:00
|
|
|
int btrfs_cont_expand(struct inode *inode, loff_t oldsize, loff_t size);
|
2009-09-21 16:00:26 -04:00
|
|
|
int btrfs_invalidate_inodes(struct btrfs_root *root);
|
2009-11-12 09:36:34 +00:00
|
|
|
void btrfs_add_delayed_iput(struct inode *inode);
|
|
|
|
void btrfs_run_delayed_iputs(struct btrfs_root *root);
|
2010-05-16 10:49:59 -04:00
|
|
|
int btrfs_prealloc_file_range(struct inode *inode, int mode,
|
|
|
|
u64 start, u64 num_bytes, u64 min_size,
|
|
|
|
loff_t actual_len, u64 *alloc_hint);
|
2010-06-21 14:48:16 -04:00
|
|
|
int btrfs_prealloc_file_range_trans(struct inode *inode,
|
|
|
|
struct btrfs_trans_handle *trans, int mode,
|
|
|
|
u64 start, u64 num_bytes, u64 min_size,
|
|
|
|
loff_t actual_len, u64 *alloc_hint);
|
2009-10-09 09:54:36 -04:00
|
|
|
extern const struct dentry_operations btrfs_dentry_operations;
|
2008-06-11 21:53:53 -04:00
|
|
|
|
|
|
|
/* ioctl.c */
|
|
|
|
long btrfs_ioctl(struct file *file, unsigned int cmd, unsigned long arg);
|
2009-04-17 10:37:41 +02:00
|
|
|
void btrfs_update_iflags(struct inode *inode);
|
|
|
|
void btrfs_inherit_iflags(struct inode *inode, struct inode *dir);
|
2011-05-24 15:35:30 -04:00
|
|
|
int btrfs_defrag_file(struct inode *inode, struct file *file,
|
|
|
|
struct btrfs_ioctl_defrag_range_args *range,
|
|
|
|
u64 newer_than, unsigned long max_pages);
|
2007-06-12 06:35:45 -04:00
|
|
|
/* file.c */
|
2011-05-24 15:35:30 -04:00
|
|
|
int btrfs_add_inode_defrag(struct btrfs_trans_handle *trans,
|
|
|
|
struct inode *inode);
|
|
|
|
int btrfs_run_defrag_inodes(struct btrfs_fs_info *fs_info);
|
2011-07-16 20:44:56 -04:00
|
|
|
int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync);
|
2008-09-26 10:05:38 -04:00
|
|
|
int btrfs_drop_extent_cache(struct inode *inode, u64 start, u64 end,
|
|
|
|
int skip_pinned);
|
2009-10-01 15:43:56 -07:00
|
|
|
extern const struct file_operations btrfs_file_operations;
|
2009-11-12 09:34:08 +00:00
|
|
|
int btrfs_drop_extents(struct btrfs_trans_handle *trans, struct inode *inode,
|
|
|
|
u64 start, u64 end, u64 *hint_byte, int drop_cache);
|
2008-10-30 14:25:28 -04:00
|
|
|
int btrfs_mark_extent_written(struct btrfs_trans_handle *trans,
|
|
|
|
struct inode *inode, u64 start, u64 end);
|
2008-06-10 10:07:39 -04:00
|
|
|
int btrfs_release_file(struct inode *inode, struct file *file);
|
2011-04-06 13:05:22 -04:00
|
|
|
void btrfs_drop_pages(struct page **pages, size_t num_pages);
|
|
|
|
int btrfs_dirty_pages(struct btrfs_root *root, struct inode *inode,
|
|
|
|
struct page **pages, size_t num_pages,
|
|
|
|
loff_t pos, size_t write_bytes,
|
|
|
|
struct extent_state **cached);
|
2008-06-10 10:07:39 -04:00
|
|
|
|
2007-08-07 16:15:09 -04:00
|
|
|
/* tree-defrag.c */
|
|
|
|
int btrfs_defrag_leaves(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root, int cache_only);
|
2007-08-29 15:47:34 -04:00
|
|
|
|
|
|
|
/* sysfs.c */
|
|
|
|
int btrfs_init_sysfs(void);
|
|
|
|
void btrfs_exit_sysfs(void);
|
|
|
|
|
2007-11-16 11:45:54 -05:00
|
|
|
/* xattr.c */
|
|
|
|
ssize_t btrfs_listxattr(struct dentry *dentry, char *buffer, size_t size);
|
2008-07-24 12:16:03 -04:00
|
|
|
|
2007-12-21 16:27:24 -05:00
|
|
|
/* super.c */
|
2008-06-10 10:40:29 -04:00
|
|
|
int btrfs_parse_options(struct btrfs_root *root, char *options);
|
2008-06-10 10:07:39 -04:00
|
|
|
int btrfs_sync_fs(struct super_block *sb, int wait);
|
2011-01-06 19:30:25 +08:00
|
|
|
void __btrfs_std_error(struct btrfs_fs_info *fs_info, const char *function,
|
|
|
|
unsigned int line, int errno);
|
|
|
|
|
|
|
|
#define btrfs_std_error(fs_info, errno) \
|
|
|
|
do { \
|
|
|
|
if ((errno)) \
|
|
|
|
__btrfs_std_error((fs_info), __func__, __LINE__, (errno));\
|
|
|
|
} while (0)
|
2008-07-24 12:16:36 -04:00
|
|
|
|
|
|
|
/* acl.c */
|
2009-10-13 13:50:18 -04:00
|
|
|
#ifdef CONFIG_BTRFS_FS_POSIX_ACL
|
2011-07-23 17:37:31 +02:00
|
|
|
struct posix_acl *btrfs_get_acl(struct inode *inode, int type);
|
2009-11-12 09:35:27 +00:00
|
|
|
int btrfs_init_acl(struct btrfs_trans_handle *trans,
|
|
|
|
struct inode *inode, struct inode *dir);
|
2008-07-24 12:16:36 -04:00
|
|
|
int btrfs_acl_chmod(struct inode *inode);
|
2011-07-14 03:17:39 +00:00
|
|
|
#else
|
2011-08-02 21:14:05 -10:00
|
|
|
#define btrfs_get_acl NULL
|
2011-07-14 03:17:39 +00:00
|
|
|
static inline int btrfs_init_acl(struct btrfs_trans_handle *trans,
|
|
|
|
struct inode *inode, struct inode *dir)
|
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
static inline int btrfs_acl_chmod(struct inode *inode)
|
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
#endif
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree that tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so every time we did a data allocation we'd search
through every block group that we have looking for free space. While searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degradation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and makes a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-23 13:14:11 -04:00
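The two-step search from point 1) above can be sketched as follows. This is a simplified model under stated assumptions: a plain array sorted by offset stands in for the two rb-trees, an area only counts as "at or after the hint" if it starts there, and all `sketch_` names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One free space area inside a block group. */
struct sketch_free {
	uint64_t offset;
	uint64_t bytes;
};

/* Pass 1: first area starting at or after the hint that is big enough. */
static const struct sketch_free *
sketch_find_by_offset(const struct sketch_free *areas, size_t n,
		      uint64_t hint, uint64_t bytes)
{
	for (size_t i = 0; i < n; i++) {
		if (areas[i].offset >= hint && areas[i].bytes >= bytes)
			return &areas[i];
	}
	return NULL;
}

/* Pass 2: the smallest area that still fits, to avoid carving up huge ones. */
static const struct sketch_free *
sketch_find_by_size(const struct sketch_free *areas, size_t n, uint64_t bytes)
{
	const struct sketch_free *best = NULL;

	for (size_t i = 0; i < n; i++) {
		if (areas[i].bytes < bytes)
			continue;
		if (!best || areas[i].bytes < best->bytes)
			best = &areas[i];
	}
	return best;
}

/* Try the hint first, then fall back on the size search. */
static const struct sketch_free *
sketch_find_free(const struct sketch_free *areas, size_t n,
		 uint64_t hint, uint64_t bytes)
{
	const struct sketch_free *hit =
		sketch_find_by_offset(areas, n, hint, bytes);

	return hit ? hit : sketch_find_by_size(areas, n, bytes);
}
```

The linear scans keep the sketch short; the point of the two rb-trees in the commit is to make both lookups logarithmic.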
|
|
|
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allows us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds a per subvolume red-black tree to keep track of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers are within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduces the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 10:45:14 -04:00
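The two reference count rules for cow'ing a tree block, as stated in the commit message above, can be modelled with a toy structure. `sketch_block` and `sketch_cow` are hypothetical; back ref updates, locking and on-disk items are deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_MAX_PTRS 4

/* Toy tree block: a reference count plus pointers to lower level blocks. */
struct sketch_block {
	int refs;
	bool freed;
	size_t nr_ptrs;
	struct sketch_block *ptrs[SKETCH_MAX_PTRS];
};

/* Copy old into new_blk and adjust reference counts per the two rules. */
static void sketch_cow(struct sketch_block *old, struct sketch_block *new_blk)
{
	new_blk->refs = 1;
	new_blk->freed = false;
	new_blk->nr_ptrs = old->nr_ptrs;
	for (size_t i = 0; i < old->nr_ptrs; i++)
		new_blk->ptrs[i] = old->ptrs[i];

	if (old->refs == 1) {
		/*
		 * Non-shared block: free it at once. The new block inherits
		 * its references, so lower level counts stay untouched.
		 */
		old->refs = 0;
		old->freed = true;
	} else {
		/*
		 * Shared block: everything the new block points to gains a
		 * reference, and the old block loses one.
		 */
		for (size_t i = 0; i < new_blk->nr_ptrs; i++)
			new_blk->ptrs[i]->refs++;
		old->refs--;
	}
}
```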
|
|
|
/* relocation.c */
|
|
|
|
int btrfs_relocate_block_group(struct btrfs_root *root, u64 group_start);
|
|
|
|
int btrfs_init_reloc_root(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root);
|
|
|
|
int btrfs_update_reloc_root(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root);
|
|
|
|
int btrfs_recover_relocation(struct btrfs_root *root);
|
|
|
|
int btrfs_reloc_clone_csums(struct inode *inode, u64 file_pos, u64 len);
|
2010-05-16 10:49:59 -04:00
|
|
|
void btrfs_reloc_cow_block(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root, struct extent_buffer *buf,
|
|
|
|
struct extent_buffer *cow);
|
|
|
|
void btrfs_reloc_pre_snapshot(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_pending_snapshot *pending,
|
|
|
|
u64 *bytes_to_reserve);
|
|
|
|
void btrfs_reloc_post_snapshot(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_pending_snapshot *pending);
|
2011-03-08 14:14:00 +01:00
|
|
|
|
|
|
|
/* scrub.c */
|
|
|
|
int btrfs_scrub_dev(struct btrfs_root *root, u64 devid, u64 start, u64 end,
|
2011-03-23 16:34:19 +01:00
|
|
|
struct btrfs_scrub_progress *progress, int readonly);
|
2011-03-08 14:14:00 +01:00
|
|
|
int btrfs_scrub_pause(struct btrfs_root *root);
|
|
|
|
int btrfs_scrub_pause_super(struct btrfs_root *root);
|
|
|
|
int btrfs_scrub_continue(struct btrfs_root *root);
|
|
|
|
int btrfs_scrub_continue_super(struct btrfs_root *root);
|
|
|
|
int btrfs_scrub_cancel(struct btrfs_root *root);
|
|
|
|
int btrfs_scrub_cancel_dev(struct btrfs_root *root, struct btrfs_device *dev);
|
|
|
|
int btrfs_scrub_cancel_devid(struct btrfs_root *root, u64 devid);
|
|
|
|
int btrfs_scrub_progress(struct btrfs_root *root, u64 devid,
|
|
|
|
struct btrfs_scrub_progress *progress);
|
|
|
|
|
btrfs: initial readahead code and prototypes
This is the implementation for the generic read ahead framework.
To trigger a readahead, btrfs_reada_add must be called. It will start
a read ahead for the given range [start, end) on tree root. The returned
handle can either be used to wait on the readahead to finish
(btrfs_reada_wait), or to send it to the background (btrfs_reada_detach).
The read ahead works as follows:
On btrfs_reada_add, the root of the tree is inserted into a radix_tree.
reada_start_machine will then search for extents to prefetch and trigger
some reads. When a read finishes for a node, all contained node/leaf
pointers that lie in the given range will also be enqueued. The reads will
be triggered in sequential order, thus giving a big win over a naive
enumeration. It will also make use of multi-device layouts. Each disk
will have its own read pointer and all disks will be utilized in parallel.
Also, no two disks will read both sides of a mirror simultaneously, as this
would waste seeking capacity. Instead both disks will read different parts
of the filesystem.
Any number of readaheads can be started in parallel. The read order will be
determined globally, i.e. 2 parallel readaheads will normally finish faster
than the 2 started one after another.
Changes v2:
- protect root->node by transaction instead of node_lock
- fix missed branches:
The readahead had a too simple check to determine if a branch from
a node should be checked or not. It now also records the upper bound
of each node to see if the requested RA range lies within.
- use KERN_CONT to debug output, to avoid line breaks
- defer reada_start_machine to worker to avoid deadlock
Changes v3:
- protect root->node by rcu
Changes v5:
- changed EIO-semantics of reada_tree_block_flagged
- remove spin_lock from reada_control and make elems an atomic_t
- remove unused read_total from reada_control
- kill reada_key_cmp, use btrfs_comp_cpu_keys instead
- use kref-style release functions where possible
- return struct reada_control * instead of void * from btrfs_reada_add
Signed-off-by: Arne Jansen <sensille@gmx.net>
2011-05-23 14:33:49 +02:00
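The "missed branches" fix from the v2 changelog above can be illustrated with a small sketch. It assumes scalar u64 keys instead of struct btrfs_key, and the helper names are hypothetical. Each child covers the keys from its own first key up to the next child's first key, or up to the node's recorded upper bound for the last child; the requested range is [start, end) with end exclusive.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Does a child covering [first, upper) intersect the request [start, end)? */
static bool sketch_branch_in_range(uint64_t first, uint64_t upper,
				   uint64_t start, uint64_t end)
{
	return first < end && start < upper;
}

/*
 * Collect the slots of all children worth reading. keys[i] is the first
 * key of child i and node_upper is the upper bound recorded for the node.
 * Returns the number of slots written to out.
 */
static size_t sketch_enqueue_children(const uint64_t *keys, size_t n,
				      uint64_t node_upper,
				      uint64_t start, uint64_t end,
				      size_t *out)
{
	size_t count = 0;

	for (size_t i = 0; i < n; i++) {
		uint64_t upper = (i + 1 < n) ? keys[i + 1] : node_upper;

		if (sketch_branch_in_range(keys[i], upper, start, end))
			out[count++] = i;
	}
	return count;
}
```

A check that only compared each child's first key against start would skip a child whose first key is below start even though its range reaches into the request; tracking the upper bound avoids that.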
|
|
|
/* reada.c */
|
|
|
|
struct reada_control {
|
|
|
|
struct btrfs_root *root; /* tree to prefetch */
|
|
|
|
struct btrfs_key key_start;
|
|
|
|
struct btrfs_key key_end; /* exclusive */
|
|
|
|
atomic_t elems;
|
|
|
|
struct kref refcnt;
|
|
|
|
wait_queue_head_t wait;
|
|
|
|
};
|
|
|
|
struct reada_control *btrfs_reada_add(struct btrfs_root *root,
|
|
|
|
struct btrfs_key *start, struct btrfs_key *end);
|
|
|
|
int btrfs_reada_wait(void *handle);
|
|
|
|
void btrfs_reada_detach(void *handle);
|
|
|
|
int btree_readahead_hook(struct btrfs_root *root, struct extent_buffer *eb,
|
|
|
|
u64 start, int err);
|
|
|
|
|
2007-02-02 09:18:22 -05:00
|
|
|
#endif
|