Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-23 13:14:11 -04:00
|
|
|
/*
|
|
|
|
* Copyright (C) 2008 Red Hat. All rights reserved.
|
|
|
|
*
|
|
|
|
* This program is free software; you can redistribute it and/or
|
|
|
|
* modify it under the terms of the GNU General Public
|
|
|
|
* License v2 as published by the Free Software Foundation.
|
|
|
|
*
|
|
|
|
* This program is distributed in the hope that it will be useful,
|
|
|
|
* but WITHOUT ANY WARRANTY; without even the implied warranty of
|
|
|
|
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
|
|
|
|
* General Public License for more details.
|
|
|
|
*
|
|
|
|
* You should have received a copy of the GNU General Public
|
|
|
|
* License along with this program; if not, write to the
|
|
|
|
* Free Software Foundation, Inc., 59 Temple Place - Suite 330,
|
|
|
|
* Boston, MA 021110-1307, USA.
|
|
|
|
*/
|
|
|
|
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
#include <linux/pagemap.h>
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-23 13:14:11 -04:00
|
|
|
#include <linux/sched.h>
|
include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.
percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.
http://userweb.kernel.org/~tj/misc/slabh-sweep.py
The script does the followings.
* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.
* When the script inserts a new include, it looks at the include
blocks and try to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.
* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.
The conversion was done in the following steps.
1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.
2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.
3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.
4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.
5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
wildly available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.
6. percpu.h was updated not to include slab.h.
7. Build test were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).
* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig
8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.
Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-03-24 17:04:11 +09:00
|
|
|
#include <linux/slab.h>
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
#include <linux/math64.h>
|
2011-08-08 08:24:46 -04:00
|
|
|
#include <linux/ratelimit.h>
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-23 13:14:11 -04:00
|
|
|
#include "ctree.h"
|
2009-04-03 09:47:43 -04:00
|
|
|
#include "free-space-cache.h"
|
|
|
|
#include "transaction.h"
|
2010-06-21 14:48:16 -04:00
|
|
|
#include "disk-io.h"
|
2011-04-01 14:55:00 +00:00
|
|
|
#include "extent_io.h"
|
Btrfs: Cache free inode numbers in memory
Currently btrfs stores the highest objectid of the fs tree, and it always
returns (highest+1) inode number when we create a file, so inode numbers
won't be reclaimed when we delete files, so we'll run out of inode numbers
as we keep create/delete files in 32bits machines.
This fixes it, and it works similarly to how we cache free space in block
cgroups.
We start a kernel thread to read the file tree. By scanning inode items,
we know which chunks of inode numbers are free, and we cache them in
an rb-tree.
Because we are searching the commit root, we have to carefully handle the
cross-transaction case.
The rb-tree is a hybrid extent+bitmap tree, so if we have too many small
chunks of inode numbers, we'll use bitmaps. Initially we allow 16K ram
of extents, and a bitmap will be used if we exceed this threshold. The
extents threshold is adjusted in runtime.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2011-04-20 10:06:11 +08:00
|
|
|
#include "inode-map.h"
|
2009-04-03 09:47:43 -04:00
|
|
|
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
#define BITS_PER_BITMAP (PAGE_CACHE_SIZE * 8)
|
|
|
|
#define MAX_CACHE_BYTES_PER_GIG (32 * 1024)
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-23 13:14:11 -04:00
|
|
|
|
2011-03-29 13:46:06 +08:00
|
|
|
static int link_free_space(struct btrfs_free_space_ctl *ctl,
|
2010-07-02 12:14:14 -04:00
|
|
|
struct btrfs_free_space *info);
|
2012-05-14 10:06:40 -04:00
|
|
|
static void unlink_free_space(struct btrfs_free_space_ctl *ctl,
|
|
|
|
struct btrfs_free_space *info);
|
2010-07-02 12:14:14 -04:00
|
|
|
|
2011-04-20 10:20:14 +08:00
|
|
|
static struct inode *__lookup_free_space_inode(struct btrfs_root *root,
|
|
|
|
struct btrfs_path *path,
|
|
|
|
u64 offset)
|
2010-06-21 14:48:16 -04:00
|
|
|
{
|
|
|
|
struct btrfs_key key;
|
|
|
|
struct btrfs_key location;
|
|
|
|
struct btrfs_disk_key disk_key;
|
|
|
|
struct btrfs_free_space_header *header;
|
|
|
|
struct extent_buffer *leaf;
|
|
|
|
struct inode *inode = NULL;
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
key.objectid = BTRFS_FREE_SPACE_OBJECTID;
|
2011-04-20 10:20:14 +08:00
|
|
|
key.offset = offset;
|
2010-06-21 14:48:16 -04:00
|
|
|
key.type = 0;
|
|
|
|
|
|
|
|
ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
|
|
|
|
if (ret < 0)
|
|
|
|
return ERR_PTR(ret);
|
|
|
|
if (ret > 0) {
|
2011-04-21 01:20:15 +02:00
|
|
|
btrfs_release_path(path);
|
2010-06-21 14:48:16 -04:00
|
|
|
return ERR_PTR(-ENOENT);
|
|
|
|
}
|
|
|
|
|
|
|
|
leaf = path->nodes[0];
|
|
|
|
header = btrfs_item_ptr(leaf, path->slots[0],
|
|
|
|
struct btrfs_free_space_header);
|
|
|
|
btrfs_free_space_key(leaf, header, &disk_key);
|
|
|
|
btrfs_disk_key_to_cpu(&location, &disk_key);
|
2011-04-21 01:20:15 +02:00
|
|
|
btrfs_release_path(path);
|
2010-06-21 14:48:16 -04:00
|
|
|
|
|
|
|
inode = btrfs_iget(root->fs_info->sb, &location, root, NULL);
|
|
|
|
if (!inode)
|
|
|
|
return ERR_PTR(-ENOENT);
|
|
|
|
if (IS_ERR(inode))
|
|
|
|
return inode;
|
|
|
|
if (is_bad_inode(inode)) {
|
|
|
|
iput(inode);
|
|
|
|
return ERR_PTR(-ENOENT);
|
|
|
|
}
|
|
|
|
|
2012-04-13 11:03:55 -04:00
|
|
|
mapping_set_gfp_mask(inode->i_mapping,
|
|
|
|
mapping_gfp_mask(inode->i_mapping) & ~__GFP_FS);
|
2011-03-31 09:43:23 +00:00
|
|
|
|
2011-04-20 10:20:14 +08:00
|
|
|
return inode;
|
|
|
|
}
|
|
|
|
|
|
|
|
struct inode *lookup_free_space_inode(struct btrfs_root *root,
|
|
|
|
struct btrfs_block_group_cache
|
|
|
|
*block_group, struct btrfs_path *path)
|
|
|
|
{
|
|
|
|
struct inode *inode = NULL;
|
2011-10-06 08:58:24 -04:00
|
|
|
u32 flags = BTRFS_INODE_NODATASUM | BTRFS_INODE_NODATACOW;
|
2011-04-20 10:20:14 +08:00
|
|
|
|
|
|
|
spin_lock(&block_group->lock);
|
|
|
|
if (block_group->inode)
|
|
|
|
inode = igrab(block_group->inode);
|
|
|
|
spin_unlock(&block_group->lock);
|
|
|
|
if (inode)
|
|
|
|
return inode;
|
|
|
|
|
|
|
|
inode = __lookup_free_space_inode(root, path,
|
|
|
|
block_group->key.objectid);
|
|
|
|
if (IS_ERR(inode))
|
|
|
|
return inode;
|
|
|
|
|
2010-06-21 14:48:16 -04:00
|
|
|
spin_lock(&block_group->lock);
|
2011-10-06 08:58:24 -04:00
|
|
|
if (!((BTRFS_I(inode)->flags & flags) == flags)) {
|
2011-06-10 15:31:13 -04:00
|
|
|
printk(KERN_INFO "Old style space inode found, converting.\n");
|
2011-10-06 08:58:24 -04:00
|
|
|
BTRFS_I(inode)->flags |= BTRFS_INODE_NODATASUM |
|
|
|
|
BTRFS_INODE_NODATACOW;
|
2011-06-10 15:31:13 -04:00
|
|
|
block_group->disk_cache_state = BTRFS_DC_CLEAR;
|
|
|
|
}
|
|
|
|
|
2011-08-29 14:06:00 -04:00
|
|
|
if (!block_group->iref) {
|
2010-06-21 14:48:16 -04:00
|
|
|
block_group->inode = igrab(inode);
|
|
|
|
block_group->iref = 1;
|
|
|
|
}
|
|
|
|
spin_unlock(&block_group->lock);
|
|
|
|
|
|
|
|
return inode;
|
|
|
|
}
|
|
|
|
|
2011-04-20 10:20:14 +08:00
|
|
|
int __create_free_space_inode(struct btrfs_root *root,
|
|
|
|
struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_path *path, u64 ino, u64 offset)
|
2010-06-21 14:48:16 -04:00
|
|
|
{
|
|
|
|
struct btrfs_key key;
|
|
|
|
struct btrfs_disk_key disk_key;
|
|
|
|
struct btrfs_free_space_header *header;
|
|
|
|
struct btrfs_inode_item *inode_item;
|
|
|
|
struct extent_buffer *leaf;
|
2011-10-06 08:58:24 -04:00
|
|
|
u64 flags = BTRFS_INODE_NOCOMPRESS | BTRFS_INODE_PREALLOC;
|
2010-06-21 14:48:16 -04:00
|
|
|
int ret;
|
|
|
|
|
2011-04-20 10:20:14 +08:00
|
|
|
ret = btrfs_insert_empty_inode(trans, root, path, ino);
|
2010-06-21 14:48:16 -04:00
|
|
|
if (ret)
|
|
|
|
return ret;
|
|
|
|
|
2011-10-06 08:58:24 -04:00
|
|
|
/* We inline crc's for the free disk space cache */
|
|
|
|
if (ino != BTRFS_FREE_INO_OBJECTID)
|
|
|
|
flags |= BTRFS_INODE_NODATASUM | BTRFS_INODE_NODATACOW;
|
|
|
|
|
2010-06-21 14:48:16 -04:00
|
|
|
leaf = path->nodes[0];
|
|
|
|
inode_item = btrfs_item_ptr(leaf, path->slots[0],
|
|
|
|
struct btrfs_inode_item);
|
|
|
|
btrfs_item_key(leaf, &disk_key, path->slots[0]);
|
|
|
|
memset_extent_buffer(leaf, 0, (unsigned long)inode_item,
|
|
|
|
sizeof(*inode_item));
|
|
|
|
btrfs_set_inode_generation(leaf, inode_item, trans->transid);
|
|
|
|
btrfs_set_inode_size(leaf, inode_item, 0);
|
|
|
|
btrfs_set_inode_nbytes(leaf, inode_item, 0);
|
|
|
|
btrfs_set_inode_uid(leaf, inode_item, 0);
|
|
|
|
btrfs_set_inode_gid(leaf, inode_item, 0);
|
|
|
|
btrfs_set_inode_mode(leaf, inode_item, S_IFREG | 0600);
|
2011-10-06 08:58:24 -04:00
|
|
|
btrfs_set_inode_flags(leaf, inode_item, flags);
|
2010-06-21 14:48:16 -04:00
|
|
|
btrfs_set_inode_nlink(leaf, inode_item, 1);
|
|
|
|
btrfs_set_inode_transid(leaf, inode_item, trans->transid);
|
2011-04-20 10:20:14 +08:00
|
|
|
btrfs_set_inode_block_group(leaf, inode_item, offset);
|
2010-06-21 14:48:16 -04:00
|
|
|
btrfs_mark_buffer_dirty(leaf);
|
2011-04-21 01:20:15 +02:00
|
|
|
btrfs_release_path(path);
|
2010-06-21 14:48:16 -04:00
|
|
|
|
|
|
|
key.objectid = BTRFS_FREE_SPACE_OBJECTID;
|
2011-04-20 10:20:14 +08:00
|
|
|
key.offset = offset;
|
2010-06-21 14:48:16 -04:00
|
|
|
key.type = 0;
|
|
|
|
|
|
|
|
ret = btrfs_insert_empty_item(trans, root, path, &key,
|
|
|
|
sizeof(struct btrfs_free_space_header));
|
|
|
|
if (ret < 0) {
|
2011-04-21 01:20:15 +02:00
|
|
|
btrfs_release_path(path);
|
2010-06-21 14:48:16 -04:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
leaf = path->nodes[0];
|
|
|
|
header = btrfs_item_ptr(leaf, path->slots[0],
|
|
|
|
struct btrfs_free_space_header);
|
|
|
|
memset_extent_buffer(leaf, 0, (unsigned long)header, sizeof(*header));
|
|
|
|
btrfs_set_free_space_key(leaf, header, &disk_key);
|
|
|
|
btrfs_mark_buffer_dirty(leaf);
|
2011-04-21 01:20:15 +02:00
|
|
|
btrfs_release_path(path);
|
2010-06-21 14:48:16 -04:00
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2011-04-20 10:20:14 +08:00
|
|
|
int create_free_space_inode(struct btrfs_root *root,
|
|
|
|
struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_block_group_cache *block_group,
|
|
|
|
struct btrfs_path *path)
|
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
u64 ino;
|
|
|
|
|
|
|
|
ret = btrfs_find_free_objectid(root, &ino);
|
|
|
|
if (ret < 0)
|
|
|
|
return ret;
|
|
|
|
|
|
|
|
return __create_free_space_inode(root, trans, path, ino,
|
|
|
|
block_group->key.objectid);
|
|
|
|
}
|
|
|
|
|
2010-06-21 14:48:16 -04:00
|
|
|
int btrfs_truncate_free_space_cache(struct btrfs_root *root,
|
|
|
|
struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_path *path,
|
|
|
|
struct inode *inode)
|
|
|
|
{
|
2011-09-11 10:52:24 -04:00
|
|
|
struct btrfs_block_rsv *rsv;
|
2011-11-02 09:29:35 -04:00
|
|
|
u64 needed_bytes;
|
2010-06-21 14:48:16 -04:00
|
|
|
loff_t oldsize;
|
|
|
|
int ret = 0;
|
|
|
|
|
2011-09-11 10:52:24 -04:00
|
|
|
rsv = trans->block_rsv;
|
2011-11-02 09:29:35 -04:00
|
|
|
trans->block_rsv = &root->fs_info->global_block_rsv;
|
|
|
|
|
|
|
|
/* 1 for slack space, 1 for updating the inode */
|
|
|
|
needed_bytes = btrfs_calc_trunc_metadata_size(root, 1) +
|
|
|
|
btrfs_calc_trans_metadata_size(root, 1);
|
|
|
|
|
|
|
|
spin_lock(&trans->block_rsv->lock);
|
|
|
|
if (trans->block_rsv->reserved < needed_bytes) {
|
|
|
|
spin_unlock(&trans->block_rsv->lock);
|
|
|
|
trans->block_rsv = rsv;
|
|
|
|
return -ENOSPC;
|
|
|
|
}
|
|
|
|
spin_unlock(&trans->block_rsv->lock);
|
2010-06-21 14:48:16 -04:00
|
|
|
|
|
|
|
oldsize = i_size_read(inode);
|
|
|
|
btrfs_i_size_write(inode, 0);
|
|
|
|
truncate_pagecache(inode, oldsize, 0);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We don't need an orphan item because truncating the free space cache
|
|
|
|
* will never be split across transactions.
|
|
|
|
*/
|
|
|
|
ret = btrfs_truncate_inode_items(trans, root, inode,
|
|
|
|
0, BTRFS_EXTENT_DATA_KEY);
|
2011-09-11 10:52:24 -04:00
|
|
|
|
2010-06-21 14:48:16 -04:00
|
|
|
if (ret) {
|
2011-11-02 09:29:35 -04:00
|
|
|
trans->block_rsv = rsv;
|
2012-03-12 16:03:00 +01:00
|
|
|
btrfs_abort_transaction(trans, root, ret);
|
2010-06-21 14:48:16 -04:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2011-04-20 10:33:24 +08:00
|
|
|
ret = btrfs_update_inode(trans, root, inode);
|
2012-03-12 16:03:00 +01:00
|
|
|
if (ret)
|
|
|
|
btrfs_abort_transaction(trans, root, ret);
|
2011-11-02 09:29:35 -04:00
|
|
|
trans->block_rsv = rsv;
|
|
|
|
|
2011-04-20 10:33:24 +08:00
|
|
|
return ret;
|
2010-06-21 14:48:16 -04:00
|
|
|
}
|
|
|
|
|
2010-08-25 16:54:15 -04:00
|
|
|
static int readahead_cache(struct inode *inode)
|
|
|
|
{
|
|
|
|
struct file_ra_state *ra;
|
|
|
|
unsigned long last_index;
|
|
|
|
|
|
|
|
ra = kzalloc(sizeof(*ra), GFP_NOFS);
|
|
|
|
if (!ra)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
file_ra_state_init(ra, inode->i_mapping);
|
|
|
|
last_index = (i_size_read(inode) - 1) >> PAGE_CACHE_SHIFT;
|
|
|
|
|
|
|
|
page_cache_sync_readahead(inode->i_mapping, ra, NULL, 0, last_index);
|
|
|
|
|
|
|
|
kfree(ra);
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2011-10-05 15:18:58 -04:00
|
|
|
struct io_ctl {
|
|
|
|
void *cur, *orig;
|
|
|
|
struct page *page;
|
|
|
|
struct page **pages;
|
|
|
|
struct btrfs_root *root;
|
|
|
|
unsigned long size;
|
|
|
|
int index;
|
|
|
|
int num_pages;
|
2011-10-06 08:58:24 -04:00
|
|
|
unsigned check_crcs:1;
|
2011-10-05 15:18:58 -04:00
|
|
|
};
|
|
|
|
|
|
|
|
static int io_ctl_init(struct io_ctl *io_ctl, struct inode *inode,
|
|
|
|
struct btrfs_root *root)
|
|
|
|
{
|
|
|
|
memset(io_ctl, 0, sizeof(struct io_ctl));
|
|
|
|
io_ctl->num_pages = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >>
|
|
|
|
PAGE_CACHE_SHIFT;
|
|
|
|
io_ctl->pages = kzalloc(sizeof(struct page *) * io_ctl->num_pages,
|
|
|
|
GFP_NOFS);
|
|
|
|
if (!io_ctl->pages)
|
|
|
|
return -ENOMEM;
|
|
|
|
io_ctl->root = root;
|
2011-10-06 08:58:24 -04:00
|
|
|
if (btrfs_ino(inode) != BTRFS_FREE_INO_OBJECTID)
|
|
|
|
io_ctl->check_crcs = 1;
|
2011-10-05 15:18:58 -04:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void io_ctl_free(struct io_ctl *io_ctl)
|
|
|
|
{
|
|
|
|
kfree(io_ctl->pages);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void io_ctl_unmap_page(struct io_ctl *io_ctl)
|
|
|
|
{
|
|
|
|
if (io_ctl->cur) {
|
|
|
|
kunmap(io_ctl->page);
|
|
|
|
io_ctl->cur = NULL;
|
|
|
|
io_ctl->orig = NULL;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
static void io_ctl_map_page(struct io_ctl *io_ctl, int clear)
|
|
|
|
{
|
|
|
|
WARN_ON(io_ctl->cur);
|
|
|
|
BUG_ON(io_ctl->index >= io_ctl->num_pages);
|
|
|
|
io_ctl->page = io_ctl->pages[io_ctl->index++];
|
|
|
|
io_ctl->cur = kmap(io_ctl->page);
|
|
|
|
io_ctl->orig = io_ctl->cur;
|
|
|
|
io_ctl->size = PAGE_CACHE_SIZE;
|
|
|
|
if (clear)
|
|
|
|
memset(io_ctl->cur, 0, PAGE_CACHE_SIZE);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void io_ctl_drop_pages(struct io_ctl *io_ctl)
|
|
|
|
{
|
|
|
|
int i;
|
|
|
|
|
|
|
|
io_ctl_unmap_page(io_ctl);
|
|
|
|
|
|
|
|
for (i = 0; i < io_ctl->num_pages; i++) {
|
2012-01-09 14:27:42 +08:00
|
|
|
if (io_ctl->pages[i]) {
|
|
|
|
ClearPageChecked(io_ctl->pages[i]);
|
|
|
|
unlock_page(io_ctl->pages[i]);
|
|
|
|
page_cache_release(io_ctl->pages[i]);
|
|
|
|
}
|
2011-10-05 15:18:58 -04:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
static int io_ctl_prepare_pages(struct io_ctl *io_ctl, struct inode *inode,
|
|
|
|
int uptodate)
|
|
|
|
{
|
|
|
|
struct page *page;
|
|
|
|
gfp_t mask = btrfs_alloc_write_mask(inode->i_mapping);
|
|
|
|
int i;
|
|
|
|
|
|
|
|
for (i = 0; i < io_ctl->num_pages; i++) {
|
|
|
|
page = find_or_create_page(inode->i_mapping, i, mask);
|
|
|
|
if (!page) {
|
|
|
|
io_ctl_drop_pages(io_ctl);
|
|
|
|
return -ENOMEM;
|
|
|
|
}
|
|
|
|
io_ctl->pages[i] = page;
|
|
|
|
if (uptodate && !PageUptodate(page)) {
|
|
|
|
btrfs_readpage(NULL, page);
|
|
|
|
lock_page(page);
|
|
|
|
if (!PageUptodate(page)) {
|
|
|
|
printk(KERN_ERR "btrfs: error reading free "
|
|
|
|
"space cache\n");
|
|
|
|
io_ctl_drop_pages(io_ctl);
|
|
|
|
return -EIO;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2011-11-15 09:31:24 -05:00
|
|
|
for (i = 0; i < io_ctl->num_pages; i++) {
|
|
|
|
clear_page_dirty_for_io(io_ctl->pages[i]);
|
|
|
|
set_page_extent_mapped(io_ctl->pages[i]);
|
|
|
|
}
|
|
|
|
|
2011-10-05 15:18:58 -04:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void io_ctl_set_generation(struct io_ctl *io_ctl, u64 generation)
|
|
|
|
{
|
2012-04-13 11:03:55 -04:00
|
|
|
__le64 *val;
|
2011-10-05 15:18:58 -04:00
|
|
|
|
|
|
|
io_ctl_map_page(io_ctl, 1);
|
|
|
|
|
|
|
|
/*
|
2011-10-06 08:58:24 -04:00
|
|
|
* Skip the csum areas. If we don't check crcs then we just have a
|
|
|
|
* 64bit chunk at the front of the first page.
|
2011-10-05 15:18:58 -04:00
|
|
|
*/
|
2011-10-06 08:58:24 -04:00
|
|
|
if (io_ctl->check_crcs) {
|
|
|
|
io_ctl->cur += (sizeof(u32) * io_ctl->num_pages);
|
|
|
|
io_ctl->size -= sizeof(u64) + (sizeof(u32) * io_ctl->num_pages);
|
|
|
|
} else {
|
|
|
|
io_ctl->cur += sizeof(u64);
|
|
|
|
io_ctl->size -= sizeof(u64) * 2;
|
|
|
|
}
|
2011-10-05 15:18:58 -04:00
|
|
|
|
|
|
|
val = io_ctl->cur;
|
|
|
|
*val = cpu_to_le64(generation);
|
|
|
|
io_ctl->cur += sizeof(u64);
|
|
|
|
}
|
|
|
|
|
|
|
|
static int io_ctl_check_generation(struct io_ctl *io_ctl, u64 generation)
|
|
|
|
{
|
2012-04-13 11:03:55 -04:00
|
|
|
__le64 *gen;
|
2011-10-05 15:18:58 -04:00
|
|
|
|
2011-10-06 08:58:24 -04:00
|
|
|
/*
|
|
|
|
* Skip the crc area. If we don't check crcs then we just have a 64bit
|
|
|
|
* chunk at the front of the first page.
|
|
|
|
*/
|
|
|
|
if (io_ctl->check_crcs) {
|
|
|
|
io_ctl->cur += sizeof(u32) * io_ctl->num_pages;
|
|
|
|
io_ctl->size -= sizeof(u64) +
|
|
|
|
(sizeof(u32) * io_ctl->num_pages);
|
|
|
|
} else {
|
|
|
|
io_ctl->cur += sizeof(u64);
|
|
|
|
io_ctl->size -= sizeof(u64) * 2;
|
|
|
|
}
|
2011-10-05 15:18:58 -04:00
|
|
|
|
|
|
|
gen = io_ctl->cur;
|
|
|
|
if (le64_to_cpu(*gen) != generation) {
|
|
|
|
printk_ratelimited(KERN_ERR "btrfs: space cache generation "
|
|
|
|
"(%Lu) does not match inode (%Lu)\n", *gen,
|
|
|
|
generation);
|
|
|
|
io_ctl_unmap_page(io_ctl);
|
|
|
|
return -EIO;
|
|
|
|
}
|
|
|
|
io_ctl->cur += sizeof(u64);
|
2011-10-06 08:58:24 -04:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void io_ctl_set_crc(struct io_ctl *io_ctl, int index)
|
|
|
|
{
|
|
|
|
u32 *tmp;
|
|
|
|
u32 crc = ~(u32)0;
|
|
|
|
unsigned offset = 0;
|
|
|
|
|
|
|
|
if (!io_ctl->check_crcs) {
|
|
|
|
io_ctl_unmap_page(io_ctl);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (index == 0)
|
2011-11-21 08:43:28 -08:00
|
|
|
offset = sizeof(u32) * io_ctl->num_pages;
|
2011-10-06 08:58:24 -04:00
|
|
|
|
|
|
|
crc = btrfs_csum_data(io_ctl->root, io_ctl->orig + offset, crc,
|
|
|
|
PAGE_CACHE_SIZE - offset);
|
|
|
|
btrfs_csum_final(crc, (char *)&crc);
|
|
|
|
io_ctl_unmap_page(io_ctl);
|
|
|
|
tmp = kmap(io_ctl->pages[0]);
|
|
|
|
tmp += index;
|
|
|
|
*tmp = crc;
|
|
|
|
kunmap(io_ctl->pages[0]);
|
|
|
|
}
|
|
|
|
|
|
|
|
static int io_ctl_check_crc(struct io_ctl *io_ctl, int index)
|
|
|
|
{
|
|
|
|
u32 *tmp, val;
|
|
|
|
u32 crc = ~(u32)0;
|
|
|
|
unsigned offset = 0;
|
|
|
|
|
|
|
|
if (!io_ctl->check_crcs) {
|
|
|
|
io_ctl_map_page(io_ctl, 0);
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (index == 0)
|
|
|
|
offset = sizeof(u32) * io_ctl->num_pages;
|
|
|
|
|
|
|
|
tmp = kmap(io_ctl->pages[0]);
|
|
|
|
tmp += index;
|
|
|
|
val = *tmp;
|
|
|
|
kunmap(io_ctl->pages[0]);
|
|
|
|
|
|
|
|
io_ctl_map_page(io_ctl, 0);
|
|
|
|
crc = btrfs_csum_data(io_ctl->root, io_ctl->orig + offset, crc,
|
|
|
|
PAGE_CACHE_SIZE - offset);
|
|
|
|
btrfs_csum_final(crc, (char *)&crc);
|
|
|
|
if (val != crc) {
|
|
|
|
printk_ratelimited(KERN_ERR "btrfs: csum mismatch on free "
|
|
|
|
"space cache\n");
|
|
|
|
io_ctl_unmap_page(io_ctl);
|
|
|
|
return -EIO;
|
|
|
|
}
|
|
|
|
|
2011-10-05 15:18:58 -04:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int io_ctl_add_entry(struct io_ctl *io_ctl, u64 offset, u64 bytes,
|
|
|
|
void *bitmap)
|
|
|
|
{
|
|
|
|
struct btrfs_free_space_entry *entry;
|
|
|
|
|
|
|
|
if (!io_ctl->cur)
|
|
|
|
return -ENOSPC;
|
|
|
|
|
|
|
|
entry = io_ctl->cur;
|
|
|
|
entry->offset = cpu_to_le64(offset);
|
|
|
|
entry->bytes = cpu_to_le64(bytes);
|
|
|
|
entry->type = (bitmap) ? BTRFS_FREE_SPACE_BITMAP :
|
|
|
|
BTRFS_FREE_SPACE_EXTENT;
|
|
|
|
io_ctl->cur += sizeof(struct btrfs_free_space_entry);
|
|
|
|
io_ctl->size -= sizeof(struct btrfs_free_space_entry);
|
|
|
|
|
|
|
|
if (io_ctl->size >= sizeof(struct btrfs_free_space_entry))
|
|
|
|
return 0;
|
|
|
|
|
2011-10-06 08:58:24 -04:00
|
|
|
io_ctl_set_crc(io_ctl, io_ctl->index - 1);
|
2011-10-05 15:18:58 -04:00
|
|
|
|
|
|
|
/* No more pages to map */
|
|
|
|
if (io_ctl->index >= io_ctl->num_pages)
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
/* map the next page */
|
|
|
|
io_ctl_map_page(io_ctl, 1);
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int io_ctl_add_bitmap(struct io_ctl *io_ctl, void *bitmap)
|
|
|
|
{
|
|
|
|
if (!io_ctl->cur)
|
|
|
|
return -ENOSPC;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If we aren't at the start of the current page, unmap this one and
|
|
|
|
* map the next one if there is any left.
|
|
|
|
*/
|
|
|
|
if (io_ctl->cur != io_ctl->orig) {
|
2011-10-06 08:58:24 -04:00
|
|
|
io_ctl_set_crc(io_ctl, io_ctl->index - 1);
|
2011-10-05 15:18:58 -04:00
|
|
|
if (io_ctl->index >= io_ctl->num_pages)
|
|
|
|
return -ENOSPC;
|
|
|
|
io_ctl_map_page(io_ctl, 0);
|
|
|
|
}
|
|
|
|
|
|
|
|
memcpy(io_ctl->cur, bitmap, PAGE_CACHE_SIZE);
|
2011-10-06 08:58:24 -04:00
|
|
|
io_ctl_set_crc(io_ctl, io_ctl->index - 1);
|
2011-10-05 15:18:58 -04:00
|
|
|
if (io_ctl->index < io_ctl->num_pages)
|
|
|
|
io_ctl_map_page(io_ctl, 0);
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void io_ctl_zero_remaining_pages(struct io_ctl *io_ctl)
|
|
|
|
{
|
2011-10-06 08:58:24 -04:00
|
|
|
/*
|
|
|
|
* If we're not on the boundary we know we've modified the page and we
|
|
|
|
* need to crc the page.
|
|
|
|
*/
|
|
|
|
if (io_ctl->cur != io_ctl->orig)
|
|
|
|
io_ctl_set_crc(io_ctl, io_ctl->index - 1);
|
|
|
|
else
|
|
|
|
io_ctl_unmap_page(io_ctl);
|
2011-10-05 15:18:58 -04:00
|
|
|
|
|
|
|
while (io_ctl->index < io_ctl->num_pages) {
|
|
|
|
io_ctl_map_page(io_ctl, 1);
|
2011-10-06 08:58:24 -04:00
|
|
|
io_ctl_set_crc(io_ctl, io_ctl->index - 1);
|
2011-10-05 15:18:58 -04:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2011-10-06 08:58:24 -04:00
|
|
|
static int io_ctl_read_entry(struct io_ctl *io_ctl,
|
|
|
|
struct btrfs_free_space *entry, u8 *type)
|
2011-10-05 15:18:58 -04:00
|
|
|
{
|
|
|
|
struct btrfs_free_space_entry *e;
|
2011-11-10 20:45:05 -05:00
|
|
|
int ret;
|
|
|
|
|
|
|
|
if (!io_ctl->cur) {
|
|
|
|
ret = io_ctl_check_crc(io_ctl, io_ctl->index);
|
|
|
|
if (ret)
|
|
|
|
return ret;
|
|
|
|
}
|
2011-10-05 15:18:58 -04:00
|
|
|
|
|
|
|
e = io_ctl->cur;
|
|
|
|
entry->offset = le64_to_cpu(e->offset);
|
|
|
|
entry->bytes = le64_to_cpu(e->bytes);
|
2011-10-06 08:58:24 -04:00
|
|
|
*type = e->type;
|
2011-10-05 15:18:58 -04:00
|
|
|
io_ctl->cur += sizeof(struct btrfs_free_space_entry);
|
|
|
|
io_ctl->size -= sizeof(struct btrfs_free_space_entry);
|
|
|
|
|
|
|
|
if (io_ctl->size >= sizeof(struct btrfs_free_space_entry))
|
2011-10-06 08:58:24 -04:00
|
|
|
return 0;
|
2011-10-05 15:18:58 -04:00
|
|
|
|
|
|
|
io_ctl_unmap_page(io_ctl);
|
|
|
|
|
2011-11-10 20:45:05 -05:00
|
|
|
return 0;
|
2011-10-05 15:18:58 -04:00
|
|
|
}
|
|
|
|
|
2011-10-06 08:58:24 -04:00
|
|
|
static int io_ctl_read_bitmap(struct io_ctl *io_ctl,
|
|
|
|
struct btrfs_free_space *entry)
|
2011-10-05 15:18:58 -04:00
|
|
|
{
|
2011-10-06 08:58:24 -04:00
|
|
|
int ret;
|
|
|
|
|
|
|
|
ret = io_ctl_check_crc(io_ctl, io_ctl->index);
|
|
|
|
if (ret)
|
|
|
|
return ret;
|
|
|
|
|
2011-10-05 15:18:58 -04:00
|
|
|
memcpy(entry->bitmap, io_ctl->cur, PAGE_CACHE_SIZE);
|
|
|
|
io_ctl_unmap_page(io_ctl);
|
2011-10-06 08:58:24 -04:00
|
|
|
|
|
|
|
return 0;
|
2011-10-05 15:18:58 -04:00
|
|
|
}
|
|
|
|
|
2012-05-14 10:06:40 -04:00
|
|
|
/*
|
|
|
|
* Since we attach pinned extents after the fact we can have contiguous sections
|
|
|
|
* of free space that are split up in entries. This poses a problem with the
|
|
|
|
* tree logging stuff since it could have allocated across what appears to be 2
|
|
|
|
* entries since we would have merged the entries when adding the pinned extents
|
|
|
|
* back to the free space cache. So run through the space cache that we just
|
|
|
|
* loaded and merge contiguous entries. This will make the log replay stuff not
|
|
|
|
* blow up and it will make for nicer allocator behavior.
|
|
|
|
*/
|
|
|
|
static void merge_space_tree(struct btrfs_free_space_ctl *ctl)
|
|
|
|
{
|
|
|
|
struct btrfs_free_space *e, *prev = NULL;
|
|
|
|
struct rb_node *n;
|
|
|
|
|
|
|
|
again:
|
|
|
|
spin_lock(&ctl->tree_lock);
|
|
|
|
for (n = rb_first(&ctl->free_space_offset); n; n = rb_next(n)) {
|
|
|
|
e = rb_entry(n, struct btrfs_free_space, offset_index);
|
|
|
|
if (!prev)
|
|
|
|
goto next;
|
|
|
|
if (e->bitmap || prev->bitmap)
|
|
|
|
goto next;
|
|
|
|
if (prev->offset + prev->bytes == e->offset) {
|
|
|
|
unlink_free_space(ctl, prev);
|
|
|
|
unlink_free_space(ctl, e);
|
|
|
|
prev->bytes += e->bytes;
|
|
|
|
kmem_cache_free(btrfs_free_space_cachep, e);
|
|
|
|
link_free_space(ctl, prev);
|
|
|
|
prev = NULL;
|
|
|
|
spin_unlock(&ctl->tree_lock);
|
|
|
|
goto again;
|
|
|
|
}
|
|
|
|
next:
|
|
|
|
prev = e;
|
|
|
|
}
|
|
|
|
spin_unlock(&ctl->tree_lock);
|
|
|
|
}
|
|
|
|
|
2011-04-20 10:20:14 +08:00
|
|
|
int __load_free_space_cache(struct btrfs_root *root, struct inode *inode,
|
|
|
|
struct btrfs_free_space_ctl *ctl,
|
|
|
|
struct btrfs_path *path, u64 offset)
|
2010-08-25 16:54:15 -04:00
|
|
|
{
|
|
|
|
struct btrfs_free_space_header *header;
|
|
|
|
struct extent_buffer *leaf;
|
2011-10-05 15:18:58 -04:00
|
|
|
struct io_ctl io_ctl;
|
2010-08-25 16:54:15 -04:00
|
|
|
struct btrfs_key key;
|
2011-10-05 15:18:58 -04:00
|
|
|
struct btrfs_free_space *e, *n;
|
2010-08-25 16:54:15 -04:00
|
|
|
struct list_head bitmaps;
|
|
|
|
u64 num_entries;
|
|
|
|
u64 num_bitmaps;
|
|
|
|
u64 generation;
|
2011-10-05 15:18:58 -04:00
|
|
|
u8 type;
|
2011-06-06 10:50:35 -04:00
|
|
|
int ret = 0;
|
2010-08-25 16:54:15 -04:00
|
|
|
|
|
|
|
INIT_LIST_HEAD(&bitmaps);
|
|
|
|
|
|
|
|
/* Nothing in the space cache, goodbye */
|
2011-04-20 10:20:14 +08:00
|
|
|
if (!i_size_read(inode))
|
2011-10-05 15:18:58 -04:00
|
|
|
return 0;
|
2010-08-25 16:54:15 -04:00
|
|
|
|
|
|
|
key.objectid = BTRFS_FREE_SPACE_OBJECTID;
|
2011-04-20 10:20:14 +08:00
|
|
|
key.offset = offset;
|
2010-08-25 16:54:15 -04:00
|
|
|
key.type = 0;
|
|
|
|
|
|
|
|
ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
|
2011-04-20 10:20:14 +08:00
|
|
|
if (ret < 0)
|
2011-10-05 15:18:58 -04:00
|
|
|
return 0;
|
2011-04-20 10:20:14 +08:00
|
|
|
else if (ret > 0) {
|
2011-05-22 12:33:42 -04:00
|
|
|
btrfs_release_path(path);
|
2011-10-05 15:18:58 -04:00
|
|
|
return 0;
|
2010-08-25 16:54:15 -04:00
|
|
|
}
|
|
|
|
|
2011-04-20 10:20:14 +08:00
|
|
|
ret = -1;
|
|
|
|
|
2010-08-25 16:54:15 -04:00
|
|
|
leaf = path->nodes[0];
|
|
|
|
header = btrfs_item_ptr(leaf, path->slots[0],
|
|
|
|
struct btrfs_free_space_header);
|
|
|
|
num_entries = btrfs_free_space_entries(leaf, header);
|
|
|
|
num_bitmaps = btrfs_free_space_bitmaps(leaf, header);
|
|
|
|
generation = btrfs_free_space_generation(leaf, header);
|
2011-05-22 12:33:42 -04:00
|
|
|
btrfs_release_path(path);
|
2010-08-25 16:54:15 -04:00
|
|
|
|
|
|
|
if (BTRFS_I(inode)->generation != generation) {
|
|
|
|
printk(KERN_ERR "btrfs: free space inode generation (%llu) did"
|
2011-04-20 10:20:14 +08:00
|
|
|
" not match free space cache generation (%llu)\n",
|
2010-08-25 16:54:15 -04:00
|
|
|
(unsigned long long)BTRFS_I(inode)->generation,
|
2011-04-20 10:20:14 +08:00
|
|
|
(unsigned long long)generation);
|
2011-10-05 15:18:58 -04:00
|
|
|
return 0;
|
2010-08-25 16:54:15 -04:00
|
|
|
}
|
|
|
|
|
|
|
|
if (!num_entries)
|
2011-10-05 15:18:58 -04:00
|
|
|
return 0;
|
2010-08-25 16:54:15 -04:00
|
|
|
|
2012-01-09 14:36:28 +08:00
|
|
|
ret = io_ctl_init(&io_ctl, inode, root);
|
|
|
|
if (ret)
|
|
|
|
return ret;
|
|
|
|
|
2010-08-25 16:54:15 -04:00
|
|
|
ret = readahead_cache(inode);
|
2011-04-20 10:20:14 +08:00
|
|
|
if (ret)
|
2010-08-25 16:54:15 -04:00
|
|
|
goto out;
|
|
|
|
|
2011-10-05 15:18:58 -04:00
|
|
|
ret = io_ctl_prepare_pages(&io_ctl, inode, 1);
|
|
|
|
if (ret)
|
|
|
|
goto out;
|
2010-08-25 16:54:15 -04:00
|
|
|
|
2011-10-06 08:58:24 -04:00
|
|
|
ret = io_ctl_check_crc(&io_ctl, 0);
|
|
|
|
if (ret)
|
|
|
|
goto free_cache;
|
|
|
|
|
2011-10-05 15:18:58 -04:00
|
|
|
ret = io_ctl_check_generation(&io_ctl, generation);
|
|
|
|
if (ret)
|
|
|
|
goto free_cache;
|
2010-08-25 16:54:15 -04:00
|
|
|
|
2011-10-05 15:18:58 -04:00
|
|
|
while (num_entries) {
|
|
|
|
e = kmem_cache_zalloc(btrfs_free_space_cachep,
|
|
|
|
GFP_NOFS);
|
|
|
|
if (!e)
|
2010-08-25 16:54:15 -04:00
|
|
|
goto free_cache;
|
|
|
|
|
2011-10-06 08:58:24 -04:00
|
|
|
ret = io_ctl_read_entry(&io_ctl, e, &type);
|
|
|
|
if (ret) {
|
|
|
|
kmem_cache_free(btrfs_free_space_cachep, e);
|
|
|
|
goto free_cache;
|
|
|
|
}
|
|
|
|
|
2011-10-05 15:18:58 -04:00
|
|
|
if (!e->bytes) {
|
|
|
|
kmem_cache_free(btrfs_free_space_cachep, e);
|
|
|
|
goto free_cache;
|
2010-08-25 16:54:15 -04:00
|
|
|
}
|
2011-10-05 15:18:58 -04:00
|
|
|
|
|
|
|
if (type == BTRFS_FREE_SPACE_EXTENT) {
|
|
|
|
spin_lock(&ctl->tree_lock);
|
|
|
|
ret = link_free_space(ctl, e);
|
|
|
|
spin_unlock(&ctl->tree_lock);
|
|
|
|
if (ret) {
|
|
|
|
printk(KERN_ERR "Duplicate entries in "
|
|
|
|
"free space cache, dumping\n");
|
|
|
|
kmem_cache_free(btrfs_free_space_cachep, e);
|
2010-08-25 16:54:15 -04:00
|
|
|
goto free_cache;
|
|
|
|
}
|
2011-10-05 15:18:58 -04:00
|
|
|
} else {
|
|
|
|
BUG_ON(!num_bitmaps);
|
|
|
|
num_bitmaps--;
|
|
|
|
e->bitmap = kzalloc(PAGE_CACHE_SIZE, GFP_NOFS);
|
|
|
|
if (!e->bitmap) {
|
|
|
|
kmem_cache_free(
|
|
|
|
btrfs_free_space_cachep, e);
|
2010-08-25 16:54:15 -04:00
|
|
|
goto free_cache;
|
|
|
|
}
|
2011-10-05 15:18:58 -04:00
|
|
|
spin_lock(&ctl->tree_lock);
|
|
|
|
ret = link_free_space(ctl, e);
|
|
|
|
ctl->total_bitmaps++;
|
|
|
|
ctl->op->recalc_thresholds(ctl);
|
|
|
|
spin_unlock(&ctl->tree_lock);
|
|
|
|
if (ret) {
|
|
|
|
printk(KERN_ERR "Duplicate entries in "
|
|
|
|
"free space cache, dumping\n");
|
2011-01-28 17:05:48 -05:00
|
|
|
kmem_cache_free(btrfs_free_space_cachep, e);
|
2010-08-25 16:54:15 -04:00
|
|
|
goto free_cache;
|
|
|
|
}
|
2011-10-05 15:18:58 -04:00
|
|
|
list_add_tail(&e->list, &bitmaps);
|
2010-08-25 16:54:15 -04:00
|
|
|
}
|
|
|
|
|
2011-10-05 15:18:58 -04:00
|
|
|
num_entries--;
|
|
|
|
}
|
2010-08-25 16:54:15 -04:00
|
|
|
|
2011-11-10 20:45:05 -05:00
|
|
|
io_ctl_unmap_page(&io_ctl);
|
|
|
|
|
2011-10-05 15:18:58 -04:00
|
|
|
/*
|
|
|
|
* We add the bitmaps at the end of the entries in order that
|
|
|
|
* the bitmap entries are added to the cache.
|
|
|
|
*/
|
|
|
|
list_for_each_entry_safe(e, n, &bitmaps, list) {
|
2010-08-25 16:54:15 -04:00
|
|
|
list_del_init(&e->list);
|
2011-10-06 08:58:24 -04:00
|
|
|
ret = io_ctl_read_bitmap(&io_ctl, e);
|
|
|
|
if (ret)
|
|
|
|
goto free_cache;
|
2010-08-25 16:54:15 -04:00
|
|
|
}
|
|
|
|
|
2011-10-05 15:18:58 -04:00
|
|
|
io_ctl_drop_pages(&io_ctl);
|
2012-05-14 10:06:40 -04:00
|
|
|
merge_space_tree(ctl);
|
2010-08-25 16:54:15 -04:00
|
|
|
ret = 1;
|
|
|
|
out:
|
2011-10-05 15:18:58 -04:00
|
|
|
io_ctl_free(&io_ctl);
|
2010-08-25 16:54:15 -04:00
|
|
|
return ret;
|
|
|
|
free_cache:
|
2011-10-05 15:18:58 -04:00
|
|
|
io_ctl_drop_pages(&io_ctl);
|
2011-04-20 10:20:14 +08:00
|
|
|
__btrfs_remove_free_space_cache(ctl);
|
2010-08-25 16:54:15 -04:00
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
2011-04-20 10:20:14 +08:00
|
|
|
int load_free_space_cache(struct btrfs_fs_info *fs_info,
|
|
|
|
struct btrfs_block_group_cache *block_group)
|
2010-07-02 12:14:14 -04:00
|
|
|
{
|
2011-03-29 13:46:06 +08:00
|
|
|
struct btrfs_free_space_ctl *ctl = block_group->free_space_ctl;
|
2011-04-20 10:20:14 +08:00
|
|
|
struct btrfs_root *root = fs_info->tree_root;
|
|
|
|
struct inode *inode;
|
|
|
|
struct btrfs_path *path;
|
2011-10-06 08:58:24 -04:00
|
|
|
int ret = 0;
|
2011-04-20 10:20:14 +08:00
|
|
|
bool matched;
|
|
|
|
u64 used = btrfs_block_group_used(&block_group->item);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If this block group has been marked to be cleared for one reason or
|
|
|
|
* another then we can't trust the on disk cache, so just return.
|
|
|
|
*/
|
2010-08-25 16:54:15 -04:00
|
|
|
spin_lock(&block_group->lock);
|
2011-04-20 10:20:14 +08:00
|
|
|
if (block_group->disk_cache_state != BTRFS_DC_WRITTEN) {
|
|
|
|
spin_unlock(&block_group->lock);
|
|
|
|
return 0;
|
|
|
|
}
|
2010-08-25 16:54:15 -04:00
|
|
|
spin_unlock(&block_group->lock);
|
2011-04-20 10:20:14 +08:00
|
|
|
|
|
|
|
path = btrfs_alloc_path();
|
|
|
|
if (!path)
|
|
|
|
return 0;
|
2012-04-12 16:03:57 -04:00
|
|
|
path->search_commit_root = 1;
|
|
|
|
path->skip_locking = 1;
|
2011-04-20 10:20:14 +08:00
|
|
|
|
|
|
|
inode = lookup_free_space_inode(root, block_group, path);
|
|
|
|
if (IS_ERR(inode)) {
|
|
|
|
btrfs_free_path(path);
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2011-10-06 08:58:24 -04:00
|
|
|
/* We may have converted the inode and made the cache invalid. */
|
|
|
|
spin_lock(&block_group->lock);
|
|
|
|
if (block_group->disk_cache_state != BTRFS_DC_WRITTEN) {
|
|
|
|
spin_unlock(&block_group->lock);
|
2012-02-14 17:12:23 +09:00
|
|
|
btrfs_free_path(path);
|
2011-10-06 08:58:24 -04:00
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
spin_unlock(&block_group->lock);
|
|
|
|
|
2011-04-20 10:20:14 +08:00
|
|
|
ret = __load_free_space_cache(fs_info->tree_root, inode, ctl,
|
|
|
|
path, block_group->key.objectid);
|
|
|
|
btrfs_free_path(path);
|
|
|
|
if (ret <= 0)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
spin_lock(&ctl->tree_lock);
|
|
|
|
matched = (ctl->free_space == (block_group->key.offset - used -
|
|
|
|
block_group->bytes_super));
|
|
|
|
spin_unlock(&ctl->tree_lock);
|
|
|
|
|
|
|
|
if (!matched) {
|
|
|
|
__btrfs_remove_free_space_cache(ctl);
|
|
|
|
printk(KERN_ERR "block group %llu has an wrong amount of free "
|
|
|
|
"space\n", block_group->key.objectid);
|
|
|
|
ret = -1;
|
|
|
|
}
|
|
|
|
out:
|
|
|
|
if (ret < 0) {
|
|
|
|
/* This cache is bogus, make sure it gets cleared */
|
|
|
|
spin_lock(&block_group->lock);
|
|
|
|
block_group->disk_cache_state = BTRFS_DC_CLEAR;
|
|
|
|
spin_unlock(&block_group->lock);
|
2011-04-20 10:33:24 +08:00
|
|
|
ret = 0;
|
2011-04-20 10:20:14 +08:00
|
|
|
|
|
|
|
printk(KERN_ERR "btrfs: failed to load free space cache "
|
|
|
|
"for block group %llu\n", block_group->key.objectid);
|
|
|
|
}
|
|
|
|
|
|
|
|
iput(inode);
|
|
|
|
return ret;
|
2010-08-25 16:54:15 -04:00
|
|
|
}
|
|
|
|
|
2011-08-30 10:19:10 -04:00
|
|
|
/**
|
|
|
|
* __btrfs_write_out_cache - write out cached info to an inode
|
|
|
|
* @root - the root the inode belongs to
|
|
|
|
* @ctl - the free space cache we are going to write out
|
|
|
|
* @block_group - the block_group for this cache if it belongs to a block_group
|
|
|
|
* @trans - the trans handle
|
|
|
|
* @path - the path to use
|
|
|
|
* @offset - the offset for the key we'll insert
|
|
|
|
*
|
|
|
|
* This function writes out a free space cache struct to disk for quick recovery
|
|
|
|
* on mount. This will return 0 if it was successfull in writing the cache out,
|
|
|
|
* and -1 if it was not.
|
|
|
|
*/
|
2011-04-20 10:20:14 +08:00
|
|
|
int __btrfs_write_out_cache(struct btrfs_root *root, struct inode *inode,
|
|
|
|
struct btrfs_free_space_ctl *ctl,
|
|
|
|
struct btrfs_block_group_cache *block_group,
|
|
|
|
struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_path *path, u64 offset)
|
2010-07-02 12:14:14 -04:00
|
|
|
{
|
|
|
|
struct btrfs_free_space_header *header;
|
|
|
|
struct extent_buffer *leaf;
|
|
|
|
struct rb_node *node;
|
|
|
|
struct list_head *pos, *n;
|
|
|
|
struct extent_state *cached_state = NULL;
|
2011-04-01 14:55:00 +00:00
|
|
|
struct btrfs_free_cluster *cluster = NULL;
|
|
|
|
struct extent_io_tree *unpin = NULL;
|
2011-10-05 15:18:58 -04:00
|
|
|
struct io_ctl io_ctl;
|
2010-07-02 12:14:14 -04:00
|
|
|
struct list_head bitmap_list;
|
|
|
|
struct btrfs_key key;
|
2012-01-10 16:41:01 +08:00
|
|
|
u64 start, extent_start, extent_end, len;
|
2010-07-02 12:14:14 -04:00
|
|
|
int entries = 0;
|
|
|
|
int bitmaps = 0;
|
2011-08-30 10:19:10 -04:00
|
|
|
int ret;
|
|
|
|
int err = -1;
|
2010-07-02 12:14:14 -04:00
|
|
|
|
|
|
|
INIT_LIST_HEAD(&bitmap_list);
|
|
|
|
|
2011-04-20 10:20:14 +08:00
|
|
|
if (!i_size_read(inode))
|
|
|
|
return -1;
|
2010-12-03 13:17:53 -05:00
|
|
|
|
2012-01-09 14:36:28 +08:00
|
|
|
ret = io_ctl_init(&io_ctl, inode, root);
|
|
|
|
if (ret)
|
|
|
|
return -1;
|
2011-04-06 13:05:22 -04:00
|
|
|
|
2011-04-01 14:55:00 +00:00
|
|
|
/* Get the cluster for this block_group if it exists */
|
2011-04-20 10:20:14 +08:00
|
|
|
if (block_group && !list_empty(&block_group->cluster_list))
|
2011-04-01 14:55:00 +00:00
|
|
|
cluster = list_entry(block_group->cluster_list.next,
|
|
|
|
struct btrfs_free_cluster,
|
|
|
|
block_group_list);
|
|
|
|
|
2011-10-05 15:18:58 -04:00
|
|
|
/* Lock all pages first so we can lock the extent safely. */
|
|
|
|
io_ctl_prepare_pages(&io_ctl, inode, 0);
|
2010-07-02 12:14:14 -04:00
|
|
|
|
|
|
|
lock_extent_bits(&BTRFS_I(inode)->io_tree, 0, i_size_read(inode) - 1,
|
2012-03-01 14:57:19 +01:00
|
|
|
0, &cached_state);
|
2010-07-02 12:14:14 -04:00
|
|
|
|
2011-10-05 10:00:18 -04:00
|
|
|
node = rb_first(&ctl->free_space_offset);
|
|
|
|
if (!node && cluster) {
|
|
|
|
node = rb_first(&cluster->root);
|
|
|
|
cluster = NULL;
|
|
|
|
}
|
|
|
|
|
2011-10-06 08:58:24 -04:00
|
|
|
/* Make sure we can fit our crcs into the first page */
|
|
|
|
if (io_ctl.check_crcs &&
|
|
|
|
(io_ctl.num_pages * sizeof(u32)) >= PAGE_CACHE_SIZE) {
|
|
|
|
WARN_ON(1);
|
|
|
|
goto out_nospc;
|
|
|
|
}
|
|
|
|
|
2011-10-05 15:18:58 -04:00
|
|
|
io_ctl_set_generation(&io_ctl, trans->transid);
|
2011-04-01 14:55:00 +00:00
|
|
|
|
2011-10-05 15:18:58 -04:00
|
|
|
/* Write out the extent entries */
|
|
|
|
while (node) {
|
|
|
|
struct btrfs_free_space *e;
|
2010-07-02 12:14:14 -04:00
|
|
|
|
2011-10-05 15:18:58 -04:00
|
|
|
e = rb_entry(node, struct btrfs_free_space, offset_index);
|
|
|
|
entries++;
|
2010-07-02 12:14:14 -04:00
|
|
|
|
2011-10-05 15:18:58 -04:00
|
|
|
ret = io_ctl_add_entry(&io_ctl, e->offset, e->bytes,
|
|
|
|
e->bitmap);
|
|
|
|
if (ret)
|
|
|
|
goto out_nospc;
|
2011-06-10 15:31:13 -04:00
|
|
|
|
2011-10-05 15:18:58 -04:00
|
|
|
if (e->bitmap) {
|
|
|
|
list_add_tail(&e->list, &bitmap_list);
|
|
|
|
bitmaps++;
|
2011-06-10 15:31:13 -04:00
|
|
|
}
|
2011-10-05 15:18:58 -04:00
|
|
|
node = rb_next(node);
|
|
|
|
if (!node && cluster) {
|
|
|
|
node = rb_first(&cluster->root);
|
|
|
|
cluster = NULL;
|
2011-04-01 14:55:00 +00:00
|
|
|
}
|
2011-10-05 15:18:58 -04:00
|
|
|
}
|
2011-04-01 14:55:00 +00:00
|
|
|
|
2011-10-05 15:18:58 -04:00
|
|
|
/*
|
|
|
|
* We want to add any pinned extents to our free space cache
|
|
|
|
* so we don't leak the space
|
|
|
|
*/
|
2012-01-10 16:41:01 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* We shouldn't have switched the pinned extents yet so this is the
|
|
|
|
* right one
|
|
|
|
*/
|
|
|
|
unpin = root->fs_info->pinned_extents;
|
|
|
|
|
|
|
|
if (block_group)
|
|
|
|
start = block_group->key.objectid;
|
|
|
|
|
2011-10-05 15:18:58 -04:00
|
|
|
while (block_group && (start < block_group->key.objectid +
|
|
|
|
block_group->key.offset)) {
|
2012-01-10 16:41:01 +08:00
|
|
|
ret = find_first_extent_bit(unpin, start,
|
|
|
|
&extent_start, &extent_end,
|
2011-10-05 15:18:58 -04:00
|
|
|
EXTENT_DIRTY);
|
|
|
|
if (ret) {
|
|
|
|
ret = 0;
|
|
|
|
break;
|
2010-07-02 12:14:14 -04:00
|
|
|
}
|
|
|
|
|
2011-10-05 15:18:58 -04:00
|
|
|
/* This pinned extent is out of our range */
|
2012-01-10 16:41:01 +08:00
|
|
|
if (extent_start >= block_group->key.objectid +
|
2011-10-05 15:18:58 -04:00
|
|
|
block_group->key.offset)
|
|
|
|
break;
|
2011-06-10 15:31:13 -04:00
|
|
|
|
2012-01-10 16:41:01 +08:00
|
|
|
extent_start = max(extent_start, start);
|
|
|
|
extent_end = min(block_group->key.objectid +
|
|
|
|
block_group->key.offset, extent_end + 1);
|
|
|
|
len = extent_end - extent_start;
|
2010-07-02 12:14:14 -04:00
|
|
|
|
2011-10-05 15:18:58 -04:00
|
|
|
entries++;
|
2012-01-10 16:41:01 +08:00
|
|
|
ret = io_ctl_add_entry(&io_ctl, extent_start, len, NULL);
|
2011-10-05 15:18:58 -04:00
|
|
|
if (ret)
|
|
|
|
goto out_nospc;
|
2010-07-02 12:14:14 -04:00
|
|
|
|
2012-01-10 16:41:01 +08:00
|
|
|
start = extent_end;
|
2011-10-05 15:18:58 -04:00
|
|
|
}
|
2010-07-02 12:14:14 -04:00
|
|
|
|
|
|
|
/* Write out the bitmaps */
|
|
|
|
list_for_each_safe(pos, n, &bitmap_list) {
|
|
|
|
struct btrfs_free_space *entry =
|
|
|
|
list_entry(pos, struct btrfs_free_space, list);
|
|
|
|
|
2011-10-05 15:18:58 -04:00
|
|
|
ret = io_ctl_add_bitmap(&io_ctl, entry->bitmap);
|
|
|
|
if (ret)
|
|
|
|
goto out_nospc;
|
2010-07-02 12:14:14 -04:00
|
|
|
list_del_init(&entry->list);
|
2011-04-06 13:05:22 -04:00
|
|
|
}
|
|
|
|
|
2010-07-02 12:14:14 -04:00
|
|
|
/* Zero out the rest of the pages just to make sure */
|
2011-10-05 15:18:58 -04:00
|
|
|
io_ctl_zero_remaining_pages(&io_ctl);
|
2010-07-02 12:14:14 -04:00
|
|
|
|
2011-10-05 15:18:58 -04:00
|
|
|
ret = btrfs_dirty_pages(root, inode, io_ctl.pages, io_ctl.num_pages,
|
|
|
|
0, i_size_read(inode), &cached_state);
|
|
|
|
io_ctl_drop_pages(&io_ctl);
|
2010-07-02 12:14:14 -04:00
|
|
|
unlock_extent_cached(&BTRFS_I(inode)->io_tree, 0,
|
|
|
|
i_size_read(inode) - 1, &cached_state, GFP_NOFS);
|
|
|
|
|
2011-08-30 10:19:10 -04:00
|
|
|
if (ret)
|
2011-06-10 15:31:13 -04:00
|
|
|
goto out;
|
2011-04-06 13:05:22 -04:00
|
|
|
|
|
|
|
|
2012-05-02 14:00:54 -04:00
|
|
|
btrfs_wait_ordered_range(inode, 0, (u64)-1);
|
2010-07-02 12:14:14 -04:00
|
|
|
|
|
|
|
key.objectid = BTRFS_FREE_SPACE_OBJECTID;
|
2011-04-20 10:20:14 +08:00
|
|
|
key.offset = offset;
|
2010-07-02 12:14:14 -04:00
|
|
|
key.type = 0;
|
|
|
|
|
2011-08-19 12:06:12 -04:00
|
|
|
ret = btrfs_search_slot(trans, root, &key, path, 0, 1);
|
2010-07-02 12:14:14 -04:00
|
|
|
if (ret < 0) {
|
2011-10-05 15:18:58 -04:00
|
|
|
clear_extent_bit(&BTRFS_I(inode)->io_tree, 0, inode->i_size - 1,
|
2011-10-06 08:58:24 -04:00
|
|
|
EXTENT_DIRTY | EXTENT_DELALLOC, 0, 0, NULL,
|
|
|
|
GFP_NOFS);
|
2011-06-10 15:31:13 -04:00
|
|
|
goto out;
|
2010-07-02 12:14:14 -04:00
|
|
|
}
|
|
|
|
leaf = path->nodes[0];
|
|
|
|
if (ret > 0) {
|
|
|
|
struct btrfs_key found_key;
|
|
|
|
BUG_ON(!path->slots[0]);
|
|
|
|
path->slots[0]--;
|
|
|
|
btrfs_item_key_to_cpu(leaf, &found_key, path->slots[0]);
|
|
|
|
if (found_key.objectid != BTRFS_FREE_SPACE_OBJECTID ||
|
2011-04-20 10:20:14 +08:00
|
|
|
found_key.offset != offset) {
|
2011-10-05 15:18:58 -04:00
|
|
|
clear_extent_bit(&BTRFS_I(inode)->io_tree, 0,
|
|
|
|
inode->i_size - 1,
|
2011-10-06 08:58:24 -04:00
|
|
|
EXTENT_DIRTY | EXTENT_DELALLOC, 0, 0,
|
|
|
|
NULL, GFP_NOFS);
|
2011-04-21 01:20:15 +02:00
|
|
|
btrfs_release_path(path);
|
2011-06-10 15:31:13 -04:00
|
|
|
goto out;
|
2010-07-02 12:14:14 -04:00
|
|
|
}
|
|
|
|
}
|
2011-10-05 16:33:53 -04:00
|
|
|
|
|
|
|
BTRFS_I(inode)->generation = trans->transid;
|
2010-07-02 12:14:14 -04:00
|
|
|
header = btrfs_item_ptr(leaf, path->slots[0],
|
|
|
|
struct btrfs_free_space_header);
|
|
|
|
btrfs_set_free_space_entries(leaf, header, entries);
|
|
|
|
btrfs_set_free_space_bitmaps(leaf, header, bitmaps);
|
|
|
|
btrfs_set_free_space_generation(leaf, header, trans->transid);
|
|
|
|
btrfs_mark_buffer_dirty(leaf);
|
2011-04-21 01:20:15 +02:00
|
|
|
btrfs_release_path(path);
|
2010-07-02 12:14:14 -04:00
|
|
|
|
2011-08-30 10:19:10 -04:00
|
|
|
err = 0;
|
2011-06-10 15:31:13 -04:00
|
|
|
out:
|
2011-10-05 15:18:58 -04:00
|
|
|
io_ctl_free(&io_ctl);
|
2011-08-30 10:19:10 -04:00
|
|
|
if (err) {
|
2011-10-05 15:18:58 -04:00
|
|
|
invalidate_inode_pages2(inode->i_mapping);
|
2010-07-02 12:14:14 -04:00
|
|
|
BTRFS_I(inode)->generation = 0;
|
|
|
|
}
|
|
|
|
btrfs_update_inode(trans, root, inode);
|
2011-08-30 10:19:10 -04:00
|
|
|
return err;
|
2011-10-05 15:18:58 -04:00
|
|
|
|
|
|
|
out_nospc:
|
|
|
|
list_for_each_safe(pos, n, &bitmap_list) {
|
|
|
|
struct btrfs_free_space *entry =
|
|
|
|
list_entry(pos, struct btrfs_free_space, list);
|
|
|
|
list_del_init(&entry->list);
|
|
|
|
}
|
|
|
|
io_ctl_drop_pages(&io_ctl);
|
|
|
|
unlock_extent_cached(&BTRFS_I(inode)->io_tree, 0,
|
|
|
|
i_size_read(inode) - 1, &cached_state, GFP_NOFS);
|
|
|
|
goto out;
|
2011-04-20 10:20:14 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
int btrfs_write_out_cache(struct btrfs_root *root,
|
|
|
|
struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_block_group_cache *block_group,
|
|
|
|
struct btrfs_path *path)
|
|
|
|
{
|
|
|
|
struct btrfs_free_space_ctl *ctl = block_group->free_space_ctl;
|
|
|
|
struct inode *inode;
|
|
|
|
int ret = 0;
|
|
|
|
|
|
|
|
root = root->fs_info->tree_root;
|
|
|
|
|
|
|
|
spin_lock(&block_group->lock);
|
|
|
|
if (block_group->disk_cache_state < BTRFS_DC_SETUP) {
|
|
|
|
spin_unlock(&block_group->lock);
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
spin_unlock(&block_group->lock);
|
|
|
|
|
|
|
|
inode = lookup_free_space_inode(root, block_group, path);
|
|
|
|
if (IS_ERR(inode))
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
ret = __btrfs_write_out_cache(root, inode, ctl, block_group, trans,
|
|
|
|
path, block_group->key.objectid);
|
2011-08-30 10:19:10 -04:00
|
|
|
if (ret) {
|
2011-04-20 10:20:14 +08:00
|
|
|
spin_lock(&block_group->lock);
|
|
|
|
block_group->disk_cache_state = BTRFS_DC_ERROR;
|
|
|
|
spin_unlock(&block_group->lock);
|
2011-04-20 10:33:24 +08:00
|
|
|
ret = 0;
|
2011-08-30 10:19:10 -04:00
|
|
|
#ifdef DEBUG
|
2012-02-07 22:21:45 +09:00
|
|
|
printk(KERN_ERR "btrfs: failed to write free space cache "
|
2011-04-20 10:20:14 +08:00
|
|
|
"for block group %llu\n", block_group->key.objectid);
|
2011-08-30 10:19:10 -04:00
|
|
|
#endif
|
2011-04-20 10:20:14 +08:00
|
|
|
}
|
|
|
|
|
2010-07-02 12:14:14 -04:00
|
|
|
iput(inode);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2011-03-29 13:46:06 +08:00
|
|
|
static inline unsigned long offset_to_bit(u64 bitmap_start, u32 unit,
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
u64 offset)
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-23 13:14:11 -04:00
|
|
|
{
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
BUG_ON(offset < bitmap_start);
|
|
|
|
offset -= bitmap_start;
|
2011-03-29 13:46:06 +08:00
|
|
|
return (unsigned long)(div_u64(offset, unit));
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
}
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-23 13:14:11 -04:00
|
|
|
|
2011-03-29 13:46:06 +08:00
|
|
|
static inline unsigned long bytes_to_bits(u64 bytes, u32 unit)
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
{
|
2011-03-29 13:46:06 +08:00
|
|
|
return (unsigned long)(div_u64(bytes, unit));
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
}
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-23 13:14:11 -04:00
|
|
|
|
2011-03-29 13:46:06 +08:00
|
|
|
static inline u64 offset_to_bitmap(struct btrfs_free_space_ctl *ctl,
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
u64 offset)
|
|
|
|
{
|
|
|
|
u64 bitmap_start;
|
|
|
|
u64 bytes_per_bitmap;
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-23 13:14:11 -04:00
|
|
|
|
2011-03-29 13:46:06 +08:00
|
|
|
bytes_per_bitmap = BITS_PER_BITMAP * ctl->unit;
|
|
|
|
bitmap_start = offset - ctl->start;
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
bitmap_start = div64_u64(bitmap_start, bytes_per_bitmap);
|
|
|
|
bitmap_start *= bytes_per_bitmap;
|
2011-03-29 13:46:06 +08:00
|
|
|
bitmap_start += ctl->start;
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-23 13:14:11 -04:00
|
|
|
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
return bitmap_start;
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-23 13:14:11 -04:00
|
|
|
}
|
|
|
|
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
static int tree_insert_offset(struct rb_root *root, u64 offset,
|
|
|
|
struct rb_node *node, int bitmap)
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-23 13:14:11 -04:00
|
|
|
{
|
|
|
|
struct rb_node **p = &root->rb_node;
|
|
|
|
struct rb_node *parent = NULL;
|
|
|
|
struct btrfs_free_space *info;
|
|
|
|
|
|
|
|
while (*p) {
|
|
|
|
parent = *p;
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
info = rb_entry(parent, struct btrfs_free_space, offset_index);
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-23 13:14:11 -04:00
|
|
|
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
if (offset < info->offset) {
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-23 13:14:11 -04:00
|
|
|
p = &(*p)->rb_left;
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
} else if (offset > info->offset) {
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-23 13:14:11 -04:00
|
|
|
p = &(*p)->rb_right;
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
} else {
|
|
|
|
/*
|
|
|
|
* we could have a bitmap entry and an extent entry
|
|
|
|
* share the same offset. If this is the case, we want
|
|
|
|
* the extent entry to always be found first if we do a
|
|
|
|
* linear search through the tree, since we want to have
|
|
|
|
* the quickest allocation time, and allocating from an
|
|
|
|
* extent is faster than allocating from a bitmap. So
|
|
|
|
* if we're inserting a bitmap and we find an entry at
|
|
|
|
* this offset, we want to go right, or after this entry
|
|
|
|
* logically. If we are inserting an extent and we've
|
|
|
|
* found a bitmap, we want to go left, or before
|
|
|
|
* logically.
|
|
|
|
*/
|
|
|
|
if (bitmap) {
|
2011-05-13 14:49:23 -04:00
|
|
|
if (info->bitmap) {
|
|
|
|
WARN_ON_ONCE(1);
|
|
|
|
return -EEXIST;
|
|
|
|
}
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
p = &(*p)->rb_right;
|
|
|
|
} else {
|
2011-05-13 14:49:23 -04:00
|
|
|
if (!info->bitmap) {
|
|
|
|
WARN_ON_ONCE(1);
|
|
|
|
return -EEXIST;
|
|
|
|
}
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
p = &(*p)->rb_left;
|
|
|
|
}
|
|
|
|
}
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-23 13:14:11 -04:00
|
|
|
}
|
|
|
|
|
|
|
|
rb_link_node(node, parent, p);
|
|
|
|
rb_insert_color(node, root);
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
2009-04-03 10:14:19 -04:00
|
|
|
* searches the tree for the given offset.
|
|
|
|
*
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
* fuzzy - If this is set, then we are trying to make an allocation, and we just
|
|
|
|
* want a section that has at least bytes size and comes at or after the given
|
|
|
|
* offset.
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-23 13:14:11 -04:00
|
|
|
*/
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
static struct btrfs_free_space *
|
2011-03-29 13:46:06 +08:00
|
|
|
tree_search_offset(struct btrfs_free_space_ctl *ctl,
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
u64 offset, int bitmap_only, int fuzzy)
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-23 13:14:11 -04:00
|
|
|
{
|
2011-03-29 13:46:06 +08:00
|
|
|
struct rb_node *n = ctl->free_space_offset.rb_node;
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
struct btrfs_free_space *entry, *prev = NULL;
|
|
|
|
|
|
|
|
/* find entry that is closest to the 'offset' */
|
|
|
|
while (1) {
|
|
|
|
if (!n) {
|
|
|
|
entry = NULL;
|
|
|
|
break;
|
|
|
|
}
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-23 13:14:11 -04:00
|
|
|
|
|
|
|
entry = rb_entry(n, struct btrfs_free_space, offset_index);
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
prev = entry;
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-23 13:14:11 -04:00
|
|
|
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
if (offset < entry->offset)
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-23 13:14:11 -04:00
|
|
|
n = n->rb_left;
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
else if (offset > entry->offset)
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-23 13:14:11 -04:00
|
|
|
n = n->rb_right;
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
else
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-23 13:14:11 -04:00
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
if (bitmap_only) {
|
|
|
|
if (!entry)
|
|
|
|
return NULL;
|
|
|
|
if (entry->bitmap)
|
|
|
|
return entry;
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-23 13:14:11 -04:00
|
|
|
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
/*
|
|
|
|
* bitmap entry and extent entry may share same offset,
|
|
|
|
* in that case, bitmap entry comes after extent entry.
|
|
|
|
*/
|
|
|
|
n = rb_next(n);
|
|
|
|
if (!n)
|
|
|
|
return NULL;
|
|
|
|
entry = rb_entry(n, struct btrfs_free_space, offset_index);
|
|
|
|
if (entry->offset != offset)
|
|
|
|
return NULL;
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-23 13:14:11 -04:00
|
|
|
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
WARN_ON(!entry->bitmap);
|
|
|
|
return entry;
|
|
|
|
} else if (entry) {
|
|
|
|
if (entry->bitmap) {
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-23 13:14:11 -04:00
|
|
|
/*
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
* if previous extent entry covers the offset,
|
|
|
|
* we should return it instead of the bitmap entry
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-23 13:14:11 -04:00
|
|
|
*/
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
n = &entry->offset_index;
|
|
|
|
while (1) {
|
|
|
|
n = rb_prev(n);
|
|
|
|
if (!n)
|
|
|
|
break;
|
|
|
|
prev = rb_entry(n, struct btrfs_free_space,
|
|
|
|
offset_index);
|
|
|
|
if (!prev->bitmap) {
|
|
|
|
if (prev->offset + prev->bytes > offset)
|
|
|
|
entry = prev;
|
|
|
|
break;
|
|
|
|
}
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-23 13:14:11 -04:00
|
|
|
}
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
}
|
|
|
|
return entry;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (!prev)
|
|
|
|
return NULL;
|
|
|
|
|
|
|
|
/* find last entry before the 'offset' */
|
|
|
|
entry = prev;
|
|
|
|
if (entry->offset > offset) {
|
|
|
|
n = rb_prev(&entry->offset_index);
|
|
|
|
if (n) {
|
|
|
|
entry = rb_entry(n, struct btrfs_free_space,
|
|
|
|
offset_index);
|
|
|
|
BUG_ON(entry->offset > offset);
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-23 13:14:11 -04:00
|
|
|
} else {
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
if (fuzzy)
|
|
|
|
return entry;
|
|
|
|
else
|
|
|
|
return NULL;
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-23 13:14:11 -04:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
if (entry->bitmap) {
|
|
|
|
n = &entry->offset_index;
|
|
|
|
while (1) {
|
|
|
|
n = rb_prev(n);
|
|
|
|
if (!n)
|
|
|
|
break;
|
|
|
|
prev = rb_entry(n, struct btrfs_free_space,
|
|
|
|
offset_index);
|
|
|
|
if (!prev->bitmap) {
|
|
|
|
if (prev->offset + prev->bytes > offset)
|
|
|
|
return prev;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
2011-03-29 13:46:06 +08:00
|
|
|
if (entry->offset + BITS_PER_BITMAP * ctl->unit > offset)
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
return entry;
|
|
|
|
} else if (entry->offset + entry->bytes > offset)
|
|
|
|
return entry;
|
|
|
|
|
|
|
|
if (!fuzzy)
|
|
|
|
return NULL;
|
|
|
|
|
|
|
|
while (1) {
|
|
|
|
if (entry->bitmap) {
|
|
|
|
if (entry->offset + BITS_PER_BITMAP *
|
2011-03-29 13:46:06 +08:00
|
|
|
ctl->unit > offset)
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
break;
|
|
|
|
} else {
|
|
|
|
if (entry->offset + entry->bytes > offset)
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
n = rb_next(&entry->offset_index);
|
|
|
|
if (!n)
|
|
|
|
return NULL;
|
|
|
|
entry = rb_entry(n, struct btrfs_free_space, offset_index);
|
|
|
|
}
|
|
|
|
return entry;
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-23 13:14:11 -04:00
|
|
|
}
|
|
|
|
|
2010-11-09 14:57:39 +08:00
|
|
|
static inline void
|
2011-03-29 13:46:06 +08:00
|
|
|
__unlink_free_space(struct btrfs_free_space_ctl *ctl,
|
2010-11-09 14:57:39 +08:00
|
|
|
struct btrfs_free_space *info)
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-23 13:14:11 -04:00
|
|
|
{
|
2011-03-29 13:46:06 +08:00
|
|
|
rb_erase(&info->offset_index, &ctl->free_space_offset);
|
|
|
|
ctl->free_extents--;
|
2010-11-09 14:57:39 +08:00
|
|
|
}
|
|
|
|
|
2011-03-29 13:46:06 +08:00
|
|
|
static void unlink_free_space(struct btrfs_free_space_ctl *ctl,
|
2010-11-09 14:57:39 +08:00
|
|
|
struct btrfs_free_space *info)
|
|
|
|
{
|
2011-03-29 13:46:06 +08:00
|
|
|
__unlink_free_space(ctl, info);
|
|
|
|
ctl->free_space -= info->bytes;
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-23 13:14:11 -04:00
|
|
|
}
|
|
|
|
|
2011-03-29 13:46:06 +08:00
|
|
|
static int link_free_space(struct btrfs_free_space_ctl *ctl,
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-23 13:14:11 -04:00
|
|
|
struct btrfs_free_space *info)
|
|
|
|
{
|
|
|
|
int ret = 0;
|
|
|
|
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
BUG_ON(!info->bitmap && !info->bytes);
|
2011-03-29 13:46:06 +08:00
|
|
|
ret = tree_insert_offset(&ctl->free_space_offset, info->offset,
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
&info->offset_index, (info->bitmap != NULL));
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-23 13:14:11 -04:00
|
|
|
if (ret)
|
|
|
|
return ret;
|
|
|
|
|
2011-03-29 13:46:06 +08:00
|
|
|
ctl->free_space += info->bytes;
|
|
|
|
ctl->free_extents++;
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2011-03-29 13:46:06 +08:00
|
|
|
static void recalculate_thresholds(struct btrfs_free_space_ctl *ctl)
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
{
|
2011-03-29 13:46:06 +08:00
|
|
|
struct btrfs_block_group_cache *block_group = ctl->private;
|
2009-09-11 16:11:20 -04:00
|
|
|
u64 max_bytes;
|
|
|
|
u64 bitmap_bytes;
|
|
|
|
u64 extent_bytes;
|
2010-11-09 14:48:01 +08:00
|
|
|
u64 size = block_group->key.offset;
|
2011-03-29 13:46:06 +08:00
|
|
|
u64 bytes_per_bg = BITS_PER_BITMAP * block_group->sectorsize;
|
|
|
|
int max_bitmaps = div64_u64(size + bytes_per_bg - 1, bytes_per_bg);
|
|
|
|
|
|
|
|
BUG_ON(ctl->total_bitmaps > max_bitmaps);
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
|
|
|
|
/*
|
|
|
|
* The goal is to keep the total amount of memory used per 1gb of space
|
|
|
|
* at or below 32k, so we need to adjust how much memory we allow to be
|
|
|
|
* used by extent based free space tracking
|
|
|
|
*/
|
2010-11-09 14:48:01 +08:00
|
|
|
if (size < 1024 * 1024 * 1024)
|
|
|
|
max_bytes = MAX_CACHE_BYTES_PER_GIG;
|
|
|
|
else
|
|
|
|
max_bytes = MAX_CACHE_BYTES_PER_GIG *
|
|
|
|
div64_u64(size, 1024 * 1024 * 1024);
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
|
2009-09-11 16:11:20 -04:00
|
|
|
/*
|
|
|
|
* we want to account for 1 more bitmap than what we have so we can make
|
|
|
|
* sure we don't go over our overall goal of MAX_CACHE_BYTES_PER_GIG as
|
|
|
|
* we add more bitmaps.
|
|
|
|
*/
|
2011-03-29 13:46:06 +08:00
|
|
|
bitmap_bytes = (ctl->total_bitmaps + 1) * PAGE_CACHE_SIZE;
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
|
2009-09-11 16:11:20 -04:00
|
|
|
if (bitmap_bytes >= max_bytes) {
|
2011-03-29 13:46:06 +08:00
|
|
|
ctl->extents_thresh = 0;
|
2009-09-11 16:11:20 -04:00
|
|
|
return;
|
|
|
|
}
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
|
2009-09-11 16:11:20 -04:00
|
|
|
/*
|
|
|
|
* we want the extent entry threshold to always be at most 1/2 the maxw
|
|
|
|
* bytes we can have, or whatever is less than that.
|
|
|
|
*/
|
|
|
|
extent_bytes = max_bytes - bitmap_bytes;
|
|
|
|
extent_bytes = min_t(u64, extent_bytes, div64_u64(max_bytes, 2));
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
|
2011-03-29 13:46:06 +08:00
|
|
|
ctl->extents_thresh =
|
2009-09-11 16:11:20 -04:00
|
|
|
div64_u64(extent_bytes, (sizeof(struct btrfs_free_space)));
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
}
|
|
|
|
|
2011-08-05 09:32:35 +00:00
|
|
|
static inline void __bitmap_clear_bits(struct btrfs_free_space_ctl *ctl,
|
|
|
|
struct btrfs_free_space *info,
|
|
|
|
u64 offset, u64 bytes)
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
{
|
2011-03-14 13:40:51 +08:00
|
|
|
unsigned long start, count;
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
|
2011-03-29 13:46:06 +08:00
|
|
|
start = offset_to_bit(info->offset, ctl->unit, offset);
|
|
|
|
count = bytes_to_bits(bytes, ctl->unit);
|
2011-03-14 13:40:51 +08:00
|
|
|
BUG_ON(start + count > BITS_PER_BITMAP);
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
|
2011-03-14 13:40:51 +08:00
|
|
|
bitmap_clear(info->bitmap, start, count);
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
|
|
|
|
info->bytes -= bytes;
|
2011-08-05 09:32:35 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
static void bitmap_clear_bits(struct btrfs_free_space_ctl *ctl,
|
|
|
|
struct btrfs_free_space *info, u64 offset,
|
|
|
|
u64 bytes)
|
|
|
|
{
|
|
|
|
__bitmap_clear_bits(ctl, info, offset, bytes);
|
2011-03-29 13:46:06 +08:00
|
|
|
ctl->free_space -= bytes;
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
}
|
|
|
|
|
2011-03-29 13:46:06 +08:00
|
|
|
static void bitmap_set_bits(struct btrfs_free_space_ctl *ctl,
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
struct btrfs_free_space *info, u64 offset,
|
|
|
|
u64 bytes)
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
{
|
2011-03-14 13:40:51 +08:00
|
|
|
unsigned long start, count;
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
|
2011-03-29 13:46:06 +08:00
|
|
|
start = offset_to_bit(info->offset, ctl->unit, offset);
|
|
|
|
count = bytes_to_bits(bytes, ctl->unit);
|
2011-03-14 13:40:51 +08:00
|
|
|
BUG_ON(start + count > BITS_PER_BITMAP);
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
|
2011-03-14 13:40:51 +08:00
|
|
|
bitmap_set(info->bitmap, start, count);
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
|
|
|
|
info->bytes += bytes;
|
2011-03-29 13:46:06 +08:00
|
|
|
ctl->free_space += bytes;
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
}
|
|
|
|
|
2011-03-29 13:46:06 +08:00
|
|
|
static int search_bitmap(struct btrfs_free_space_ctl *ctl,
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
struct btrfs_free_space *bitmap_info, u64 *offset,
|
|
|
|
u64 *bytes)
|
|
|
|
{
|
|
|
|
unsigned long found_bits = 0;
|
|
|
|
unsigned long bits, i;
|
|
|
|
unsigned long next_zero;
|
|
|
|
|
2011-03-29 13:46:06 +08:00
|
|
|
i = offset_to_bit(bitmap_info->offset, ctl->unit,
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
max_t(u64, *offset, bitmap_info->offset));
|
2011-03-29 13:46:06 +08:00
|
|
|
bits = bytes_to_bits(*bytes, ctl->unit);
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
|
|
|
|
for (i = find_next_bit(bitmap_info->bitmap, BITS_PER_BITMAP, i);
|
|
|
|
i < BITS_PER_BITMAP;
|
|
|
|
i = find_next_bit(bitmap_info->bitmap, BITS_PER_BITMAP, i + 1)) {
|
|
|
|
next_zero = find_next_zero_bit(bitmap_info->bitmap,
|
|
|
|
BITS_PER_BITMAP, i);
|
|
|
|
if ((next_zero - i) >= bits) {
|
|
|
|
found_bits = next_zero - i;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
i = next_zero;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (found_bits) {
|
2011-03-29 13:46:06 +08:00
|
|
|
*offset = (u64)(i * ctl->unit) + bitmap_info->offset;
|
|
|
|
*bytes = (u64)(found_bits) * ctl->unit;
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
return -1;
|
|
|
|
}
|
|
|
|
|
2011-03-29 13:46:06 +08:00
|
|
|
static struct btrfs_free_space *
|
|
|
|
find_free_space(struct btrfs_free_space_ctl *ctl, u64 *offset, u64 *bytes)
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
{
|
|
|
|
struct btrfs_free_space *entry;
|
|
|
|
struct rb_node *node;
|
|
|
|
int ret;
|
|
|
|
|
2011-03-29 13:46:06 +08:00
|
|
|
if (!ctl->free_space_offset.rb_node)
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
return NULL;
|
|
|
|
|
2011-03-29 13:46:06 +08:00
|
|
|
entry = tree_search_offset(ctl, offset_to_bitmap(ctl, *offset), 0, 1);
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
if (!entry)
|
|
|
|
return NULL;
|
|
|
|
|
|
|
|
for (node = &entry->offset_index; node; node = rb_next(node)) {
|
|
|
|
entry = rb_entry(node, struct btrfs_free_space, offset_index);
|
|
|
|
if (entry->bytes < *bytes)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
if (entry->bitmap) {
|
2011-03-29 13:46:06 +08:00
|
|
|
ret = search_bitmap(ctl, entry, offset, bytes);
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
if (!ret)
|
|
|
|
return entry;
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
|
|
|
*offset = entry->offset;
|
|
|
|
*bytes = entry->bytes;
|
|
|
|
return entry;
|
|
|
|
}
|
|
|
|
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
2011-03-29 13:46:06 +08:00
|
|
|
static void add_new_bitmap(struct btrfs_free_space_ctl *ctl,
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
struct btrfs_free_space *info, u64 offset)
|
|
|
|
{
|
2011-03-29 13:46:06 +08:00
|
|
|
info->offset = offset_to_bitmap(ctl, offset);
|
2009-09-11 16:11:20 -04:00
|
|
|
info->bytes = 0;
|
Btrfs: initialize new bitmaps' list
We're failing to create clusters with bitmaps because
setup_cluster_no_bitmap checks that the list is empty before inserting
the bitmap entry in the list for setup_cluster_bitmap, but the list
field is only initialized when it is restored from the on-disk free
space cache, or when it is written out to disk.
Besides a potential race condition due to the multiple use of the list
field, filesystem performance severely degrades over time: as we use
up all non-bitmap free extents, the try-to-set-up-cluster dance is
done at every metadata block allocation. For every block group, we
fail to set up a cluster, and after failing on them all up to twice,
we fall back to the much slower unclustered allocation.
To make matters worse, before the unclustered allocation, we try to
create new block groups until we reach the 1% threshold, which
introduces additional bitmaps and thus block groups that we'll iterate
over at each metadata block request.
2011-11-28 12:04:43 -02:00
|
|
|
INIT_LIST_HEAD(&info->list);
|
2011-03-29 13:46:06 +08:00
|
|
|
link_free_space(ctl, info);
|
|
|
|
ctl->total_bitmaps++;
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
|
2011-03-29 13:46:06 +08:00
|
|
|
ctl->op->recalc_thresholds(ctl);
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
}
|
|
|
|
|
2011-03-29 13:46:06 +08:00
|
|
|
static void free_bitmap(struct btrfs_free_space_ctl *ctl,
|
2010-11-09 14:50:07 +08:00
|
|
|
struct btrfs_free_space *bitmap_info)
|
|
|
|
{
|
2011-03-29 13:46:06 +08:00
|
|
|
unlink_free_space(ctl, bitmap_info);
|
2010-11-09 14:50:07 +08:00
|
|
|
kfree(bitmap_info->bitmap);
|
2011-01-28 17:05:48 -05:00
|
|
|
kmem_cache_free(btrfs_free_space_cachep, bitmap_info);
|
2011-03-29 13:46:06 +08:00
|
|
|
ctl->total_bitmaps--;
|
|
|
|
ctl->op->recalc_thresholds(ctl);
|
2010-11-09 14:50:07 +08:00
|
|
|
}
|
|
|
|
|
2011-03-29 13:46:06 +08:00
|
|
|
static noinline int remove_from_bitmap(struct btrfs_free_space_ctl *ctl,
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
struct btrfs_free_space *bitmap_info,
|
|
|
|
u64 *offset, u64 *bytes)
|
|
|
|
{
|
|
|
|
u64 end;
|
2009-07-31 11:03:58 -04:00
|
|
|
u64 search_start, search_bytes;
|
|
|
|
int ret;
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
|
|
|
|
again:
|
2011-03-29 13:46:06 +08:00
|
|
|
end = bitmap_info->offset + (u64)(BITS_PER_BITMAP * ctl->unit) - 1;
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
|
2009-07-31 11:03:58 -04:00
|
|
|
/*
|
2012-06-27 15:10:56 -04:00
|
|
|
* We need to search for bits in this bitmap. We could only cover some
|
|
|
|
* of the extent in this bitmap thanks to how we add space, so we need
|
|
|
|
* to search for as much as it as we can and clear that amount, and then
|
|
|
|
* go searching for the next bit.
|
2009-07-31 11:03:58 -04:00
|
|
|
*/
|
|
|
|
search_start = *offset;
|
2012-06-27 15:10:56 -04:00
|
|
|
search_bytes = ctl->unit;
|
2011-02-03 02:39:52 +00:00
|
|
|
search_bytes = min(search_bytes, end - search_start + 1);
|
2011-03-29 13:46:06 +08:00
|
|
|
ret = search_bitmap(ctl, bitmap_info, &search_start, &search_bytes);
|
2009-07-31 11:03:58 -04:00
|
|
|
BUG_ON(ret < 0 || search_start != *offset);
|
|
|
|
|
2012-06-27 15:10:56 -04:00
|
|
|
/* We may have found more bits than what we need */
|
|
|
|
search_bytes = min(search_bytes, *bytes);
|
|
|
|
|
|
|
|
/* Cannot clear past the end of the bitmap */
|
|
|
|
search_bytes = min(search_bytes, end - search_start + 1);
|
|
|
|
|
|
|
|
bitmap_clear_bits(ctl, bitmap_info, search_start, search_bytes);
|
|
|
|
*offset += search_bytes;
|
|
|
|
*bytes -= search_bytes;
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
|
|
|
|
if (*bytes) {
|
2009-07-31 11:03:58 -04:00
|
|
|
struct rb_node *next = rb_next(&bitmap_info->offset_index);
|
2010-11-09 14:50:07 +08:00
|
|
|
if (!bitmap_info->bytes)
|
2011-03-29 13:46:06 +08:00
|
|
|
free_bitmap(ctl, bitmap_info);
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
|
2009-07-31 11:03:58 -04:00
|
|
|
/*
|
|
|
|
* no entry after this bitmap, but we still have bytes to
|
|
|
|
* remove, so something has gone wrong.
|
|
|
|
*/
|
|
|
|
if (!next)
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
return -EINVAL;
|
|
|
|
|
2009-07-31 11:03:58 -04:00
|
|
|
bitmap_info = rb_entry(next, struct btrfs_free_space,
|
|
|
|
offset_index);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* if the next entry isn't a bitmap we need to return to let the
|
|
|
|
* extent stuff do its work.
|
|
|
|
*/
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
if (!bitmap_info->bitmap)
|
|
|
|
return -EAGAIN;
|
|
|
|
|
2009-07-31 11:03:58 -04:00
|
|
|
/*
|
|
|
|
* Ok the next item is a bitmap, but it may not actually hold
|
|
|
|
* the information for the rest of this free space stuff, so
|
|
|
|
* look for it, and if we don't find it return so we can try
|
|
|
|
* everything over again.
|
|
|
|
*/
|
|
|
|
search_start = *offset;
|
2012-06-27 15:10:56 -04:00
|
|
|
search_bytes = ctl->unit;
|
2011-03-29 13:46:06 +08:00
|
|
|
ret = search_bitmap(ctl, bitmap_info, &search_start,
|
2009-07-31 11:03:58 -04:00
|
|
|
&search_bytes);
|
|
|
|
if (ret < 0 || search_start != *offset)
|
|
|
|
return -EAGAIN;
|
|
|
|
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
goto again;
|
2010-11-09 14:50:07 +08:00
|
|
|
} else if (!bitmap_info->bytes)
|
2011-03-29 13:46:06 +08:00
|
|
|
free_bitmap(ctl, bitmap_info);
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2011-05-27 14:07:49 -04:00
|
|
|
static u64 add_bytes_to_bitmap(struct btrfs_free_space_ctl *ctl,
|
|
|
|
struct btrfs_free_space *info, u64 offset,
|
|
|
|
u64 bytes)
|
|
|
|
{
|
|
|
|
u64 bytes_to_set = 0;
|
|
|
|
u64 end;
|
|
|
|
|
|
|
|
end = info->offset + (u64)(BITS_PER_BITMAP * ctl->unit);
|
|
|
|
|
|
|
|
bytes_to_set = min(end - offset, bytes);
|
|
|
|
|
|
|
|
bitmap_set_bits(ctl, info, offset, bytes_to_set);
|
|
|
|
|
|
|
|
return bytes_to_set;
|
|
|
|
|
|
|
|
}
|
|
|
|
|
2011-03-29 13:46:06 +08:00
|
|
|
static bool use_bitmap(struct btrfs_free_space_ctl *ctl,
|
|
|
|
struct btrfs_free_space *info)
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
{
|
2011-03-29 13:46:06 +08:00
|
|
|
struct btrfs_block_group_cache *block_group = ctl->private;
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
|
|
|
|
/*
|
|
|
|
* If we are below the extents threshold then we can add this as an
|
|
|
|
* extent, and don't have to deal with the bitmap
|
|
|
|
*/
|
2011-03-29 13:46:06 +08:00
|
|
|
if (ctl->free_extents < ctl->extents_thresh) {
|
2011-03-18 16:16:21 -04:00
|
|
|
/*
|
|
|
|
* If this block group has some small extents we don't want to
|
|
|
|
* use up all of our free slots in the cache with them, we want
|
|
|
|
* to reserve them to larger extents, however if we have plent
|
|
|
|
* of cache left then go ahead an dadd them, no sense in adding
|
|
|
|
* the overhead of a bitmap if we don't have to.
|
|
|
|
*/
|
|
|
|
if (info->bytes <= block_group->sectorsize * 4) {
|
2011-03-29 13:46:06 +08:00
|
|
|
if (ctl->free_extents * 2 <= ctl->extents_thresh)
|
|
|
|
return false;
|
2011-03-18 16:16:21 -04:00
|
|
|
} else {
|
2011-03-29 13:46:06 +08:00
|
|
|
return false;
|
2011-03-18 16:16:21 -04:00
|
|
|
}
|
|
|
|
}
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
|
|
|
|
/*
|
|
|
|
* some block groups are so tiny they can't be enveloped by a bitmap, so
|
|
|
|
* don't even bother to create a bitmap for this
|
|
|
|
*/
|
|
|
|
if (BITS_PER_BITMAP * block_group->sectorsize >
|
|
|
|
block_group->key.offset)
|
2011-03-29 13:46:06 +08:00
|
|
|
return false;
|
|
|
|
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
|
2011-05-27 14:07:49 -04:00
|
|
|
static struct btrfs_free_space_op free_space_op = {
|
|
|
|
.recalc_thresholds = recalculate_thresholds,
|
|
|
|
.use_bitmap = use_bitmap,
|
|
|
|
};
|
|
|
|
|
2011-03-29 13:46:06 +08:00
|
|
|
static int insert_into_bitmap(struct btrfs_free_space_ctl *ctl,
|
|
|
|
struct btrfs_free_space *info)
|
|
|
|
{
|
|
|
|
struct btrfs_free_space *bitmap_info;
|
2011-05-27 14:07:49 -04:00
|
|
|
struct btrfs_block_group_cache *block_group = NULL;
|
2011-03-29 13:46:06 +08:00
|
|
|
int added = 0;
|
2011-05-27 14:07:49 -04:00
|
|
|
u64 bytes, offset, bytes_added;
|
2011-03-29 13:46:06 +08:00
|
|
|
int ret;
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
|
|
|
|
bytes = info->bytes;
|
|
|
|
offset = info->offset;
|
|
|
|
|
2011-03-29 13:46:06 +08:00
|
|
|
if (!ctl->op->use_bitmap(ctl, info))
|
|
|
|
return 0;
|
|
|
|
|
2011-05-27 14:07:49 -04:00
|
|
|
if (ctl->op == &free_space_op)
|
|
|
|
block_group = ctl->private;
|
2011-06-10 16:36:57 -04:00
|
|
|
again:
|
2011-05-27 14:07:49 -04:00
|
|
|
/*
|
|
|
|
* Since we link bitmaps right into the cluster we need to see if we
|
|
|
|
* have a cluster here, and if so and it has our bitmap we need to add
|
|
|
|
* the free space to that bitmap.
|
|
|
|
*/
|
|
|
|
if (block_group && !list_empty(&block_group->cluster_list)) {
|
|
|
|
struct btrfs_free_cluster *cluster;
|
|
|
|
struct rb_node *node;
|
|
|
|
struct btrfs_free_space *entry;
|
|
|
|
|
|
|
|
cluster = list_entry(block_group->cluster_list.next,
|
|
|
|
struct btrfs_free_cluster,
|
|
|
|
block_group_list);
|
|
|
|
spin_lock(&cluster->lock);
|
|
|
|
node = rb_first(&cluster->root);
|
|
|
|
if (!node) {
|
|
|
|
spin_unlock(&cluster->lock);
|
2011-06-10 16:36:57 -04:00
|
|
|
goto no_cluster_bitmap;
|
2011-05-27 14:07:49 -04:00
|
|
|
}
|
|
|
|
|
|
|
|
entry = rb_entry(node, struct btrfs_free_space, offset_index);
|
|
|
|
if (!entry->bitmap) {
|
|
|
|
spin_unlock(&cluster->lock);
|
2011-06-10 16:36:57 -04:00
|
|
|
goto no_cluster_bitmap;
|
2011-05-27 14:07:49 -04:00
|
|
|
}
|
|
|
|
|
|
|
|
if (entry->offset == offset_to_bitmap(ctl, offset)) {
|
|
|
|
bytes_added = add_bytes_to_bitmap(ctl, entry,
|
|
|
|
offset, bytes);
|
|
|
|
bytes -= bytes_added;
|
|
|
|
offset += bytes_added;
|
|
|
|
}
|
|
|
|
spin_unlock(&cluster->lock);
|
|
|
|
if (!bytes) {
|
|
|
|
ret = 1;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
}
|
2011-06-10 16:36:57 -04:00
|
|
|
|
|
|
|
no_cluster_bitmap:
|
2011-03-29 13:46:06 +08:00
|
|
|
bitmap_info = tree_search_offset(ctl, offset_to_bitmap(ctl, offset),
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
1, 0);
|
|
|
|
if (!bitmap_info) {
|
|
|
|
BUG_ON(added);
|
|
|
|
goto new_bitmap;
|
|
|
|
}
|
|
|
|
|
2011-05-27 14:07:49 -04:00
|
|
|
bytes_added = add_bytes_to_bitmap(ctl, bitmap_info, offset, bytes);
|
|
|
|
bytes -= bytes_added;
|
|
|
|
offset += bytes_added;
|
|
|
|
added = 0;
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
|
|
|
|
if (!bytes) {
|
|
|
|
ret = 1;
|
|
|
|
goto out;
|
|
|
|
} else
|
|
|
|
goto again;
|
|
|
|
|
|
|
|
new_bitmap:
|
|
|
|
if (info && info->bitmap) {
|
2011-03-29 13:46:06 +08:00
|
|
|
add_new_bitmap(ctl, info, offset);
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
added = 1;
|
|
|
|
info = NULL;
|
|
|
|
goto again;
|
|
|
|
} else {
|
2011-03-29 13:46:06 +08:00
|
|
|
spin_unlock(&ctl->tree_lock);
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
|
|
|
|
/* no pre-allocated info, allocate a new one */
|
|
|
|
if (!info) {
|
2011-01-28 17:05:48 -05:00
|
|
|
info = kmem_cache_zalloc(btrfs_free_space_cachep,
|
|
|
|
GFP_NOFS);
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
if (!info) {
|
2011-03-29 13:46:06 +08:00
|
|
|
spin_lock(&ctl->tree_lock);
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
ret = -ENOMEM;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/* allocate the bitmap */
|
|
|
|
info->bitmap = kzalloc(PAGE_CACHE_SIZE, GFP_NOFS);
|
2011-03-29 13:46:06 +08:00
|
|
|
spin_lock(&ctl->tree_lock);
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
if (!info->bitmap) {
|
|
|
|
ret = -ENOMEM;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
goto again;
|
|
|
|
}
|
|
|
|
|
|
|
|
out:
|
|
|
|
if (info) {
|
|
|
|
if (info->bitmap)
|
|
|
|
kfree(info->bitmap);
|
2011-01-28 17:05:48 -05:00
|
|
|
kmem_cache_free(btrfs_free_space_cachep, info);
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
}
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-23 13:14:11 -04:00
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2011-05-22 12:33:42 -04:00
|
|
|
static bool try_merge_free_space(struct btrfs_free_space_ctl *ctl,
|
2010-11-09 14:57:39 +08:00
|
|
|
struct btrfs_free_space *info, bool update_stat)
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-23 13:14:11 -04:00
|
|
|
{
|
2010-11-09 14:56:50 +08:00
|
|
|
struct btrfs_free_space *left_info;
|
|
|
|
struct btrfs_free_space *right_info;
|
|
|
|
bool merged = false;
|
|
|
|
u64 offset = info->offset;
|
|
|
|
u64 bytes = info->bytes;
|
2009-04-03 10:14:18 -04:00
|
|
|
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-23 13:14:11 -04:00
|
|
|
/*
|
|
|
|
* first we want to see if there is free space adjacent to the range we
|
|
|
|
* are adding, if there is remove that struct and add a new one to
|
|
|
|
* cover the entire range
|
|
|
|
*/
|
2011-03-29 13:46:06 +08:00
|
|
|
right_info = tree_search_offset(ctl, offset + bytes, 0, 0);
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
if (right_info && rb_prev(&right_info->offset_index))
|
|
|
|
left_info = rb_entry(rb_prev(&right_info->offset_index),
|
|
|
|
struct btrfs_free_space, offset_index);
|
|
|
|
else
|
2011-03-29 13:46:06 +08:00
|
|
|
left_info = tree_search_offset(ctl, offset - 1, 0, 0);
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-23 13:14:11 -04:00
|
|
|
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
if (right_info && !right_info->bitmap) {
|
2010-11-09 14:57:39 +08:00
|
|
|
if (update_stat)
|
2011-03-29 13:46:06 +08:00
|
|
|
unlink_free_space(ctl, right_info);
|
2010-11-09 14:57:39 +08:00
|
|
|
else
|
2011-03-29 13:46:06 +08:00
|
|
|
__unlink_free_space(ctl, right_info);
|
2009-04-03 10:14:18 -04:00
|
|
|
info->bytes += right_info->bytes;
|
2011-01-28 17:05:48 -05:00
|
|
|
kmem_cache_free(btrfs_free_space_cachep, right_info);
|
2010-11-09 14:56:50 +08:00
|
|
|
merged = true;
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-23 13:14:11 -04:00
|
|
|
}
|
|
|
|
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
if (left_info && !left_info->bitmap &&
|
|
|
|
left_info->offset + left_info->bytes == offset) {
|
2010-11-09 14:57:39 +08:00
|
|
|
if (update_stat)
|
2011-03-29 13:46:06 +08:00
|
|
|
unlink_free_space(ctl, left_info);
|
2010-11-09 14:57:39 +08:00
|
|
|
else
|
2011-03-29 13:46:06 +08:00
|
|
|
__unlink_free_space(ctl, left_info);
|
2009-04-03 10:14:18 -04:00
|
|
|
info->offset = left_info->offset;
|
|
|
|
info->bytes += left_info->bytes;
|
2011-01-28 17:05:48 -05:00
|
|
|
kmem_cache_free(btrfs_free_space_cachep, left_info);
|
2010-11-09 14:56:50 +08:00
|
|
|
merged = true;
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-23 13:14:11 -04:00
|
|
|
}
|
|
|
|
|
2010-11-09 14:56:50 +08:00
|
|
|
return merged;
|
|
|
|
}
|
|
|
|
|
Btrfs: Cache free inode numbers in memory
Currently btrfs stores the highest objectid of the fs tree, and it always
returns (highest+1) inode number when we create a file, so inode numbers
won't be reclaimed when we delete files, so we'll run out of inode numbers
as we keep create/delete files in 32bits machines.
This fixes it, and it works similarly to how we cache free space in block
cgroups.
We start a kernel thread to read the file tree. By scanning inode items,
we know which chunks of inode numbers are free, and we cache them in
an rb-tree.
Because we are searching the commit root, we have to carefully handle the
cross-transaction case.
The rb-tree is a hybrid extent+bitmap tree, so if we have too many small
chunks of inode numbers, we'll use bitmaps. Initially we allow 16K ram
of extents, and a bitmap will be used if we exceed this threshold. The
extents threshold is adjusted in runtime.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2011-04-20 10:06:11 +08:00
|
|
|
int __btrfs_add_free_space(struct btrfs_free_space_ctl *ctl,
|
|
|
|
u64 offset, u64 bytes)
|
2010-11-09 14:56:50 +08:00
|
|
|
{
|
|
|
|
struct btrfs_free_space *info;
|
|
|
|
int ret = 0;
|
|
|
|
|
2011-01-28 17:05:48 -05:00
|
|
|
info = kmem_cache_zalloc(btrfs_free_space_cachep, GFP_NOFS);
|
2010-11-09 14:56:50 +08:00
|
|
|
if (!info)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
info->offset = offset;
|
|
|
|
info->bytes = bytes;
|
|
|
|
|
2011-03-29 13:46:06 +08:00
|
|
|
spin_lock(&ctl->tree_lock);
|
2010-11-09 14:56:50 +08:00
|
|
|
|
2011-03-29 13:46:06 +08:00
|
|
|
if (try_merge_free_space(ctl, info, true))
|
2010-11-09 14:56:50 +08:00
|
|
|
goto link;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* There was no extent directly to the left or right of this new
|
|
|
|
* extent then we know we're going to have to allocate a new extent, so
|
|
|
|
* before we do that see if we need to drop this into a bitmap
|
|
|
|
*/
|
2011-03-29 13:46:06 +08:00
|
|
|
ret = insert_into_bitmap(ctl, info);
|
2010-11-09 14:56:50 +08:00
|
|
|
if (ret < 0) {
|
|
|
|
goto out;
|
|
|
|
} else if (ret) {
|
|
|
|
ret = 0;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
link:
|
2011-03-29 13:46:06 +08:00
|
|
|
ret = link_free_space(ctl, info);
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-23 13:14:11 -04:00
|
|
|
if (ret)
|
2011-01-28 17:05:48 -05:00
|
|
|
kmem_cache_free(btrfs_free_space_cachep, info);
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
out:
|
2011-03-29 13:46:06 +08:00
|
|
|
spin_unlock(&ctl->tree_lock);
|
2009-04-03 10:14:18 -04:00
|
|
|
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-23 13:14:11 -04:00
|
|
|
if (ret) {
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
printk(KERN_CRIT "btrfs: unable to add free space :%d\n", ret);
|
2009-04-02 17:05:11 -04:00
|
|
|
BUG_ON(ret == -EEXIST);
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-23 13:14:11 -04:00
|
|
|
}
|
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2009-04-03 10:14:18 -04:00
|
|
|
int btrfs_remove_free_space(struct btrfs_block_group_cache *block_group,
|
|
|
|
u64 offset, u64 bytes)
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-23 13:14:11 -04:00
|
|
|
{
|
2011-03-29 13:46:06 +08:00
|
|
|
struct btrfs_free_space_ctl *ctl = block_group->free_space_ctl;
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-23 13:14:11 -04:00
|
|
|
struct btrfs_free_space *info;
|
|
|
|
int ret = 0;
|
|
|
|
|
2011-03-29 13:46:06 +08:00
|
|
|
spin_lock(&ctl->tree_lock);
|
2009-04-03 10:14:18 -04:00
|
|
|
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
again:
|
2012-06-27 15:10:56 -04:00
|
|
|
if (!bytes)
|
|
|
|
goto out_lock;
|
|
|
|
|
2011-03-29 13:46:06 +08:00
|
|
|
info = tree_search_offset(ctl, offset, 0, 0);
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
if (!info) {
|
2009-07-31 11:03:58 -04:00
|
|
|
/*
|
|
|
|
* oops didn't find an extent that matched the space we wanted
|
|
|
|
* to remove, look for a bitmap instead
|
|
|
|
*/
|
2011-03-29 13:46:06 +08:00
|
|
|
info = tree_search_offset(ctl, offset_to_bitmap(ctl, offset),
|
2009-07-31 11:03:58 -04:00
|
|
|
1, 0);
|
|
|
|
if (!info) {
|
2011-11-21 09:39:11 -05:00
|
|
|
/* the tree logging code might be calling us before we
|
|
|
|
* have fully loaded the free space rbtree for this
|
|
|
|
* block group. So it is possible the entry won't
|
|
|
|
* be in the rbtree yet at all. The caching code
|
|
|
|
* will make sure not to put it in the rbtree if
|
|
|
|
* the logging code has pinned it.
|
|
|
|
*/
|
2009-07-31 11:03:58 -04:00
|
|
|
goto out_lock;
|
|
|
|
}
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
}
|
|
|
|
|
2012-06-27 15:10:56 -04:00
|
|
|
if (!info->bitmap) {
|
2011-03-29 13:46:06 +08:00
|
|
|
unlink_free_space(ctl, info);
|
2012-06-27 15:10:56 -04:00
|
|
|
if (offset == info->offset) {
|
|
|
|
u64 to_free = min(bytes, info->bytes);
|
|
|
|
|
|
|
|
info->bytes -= to_free;
|
|
|
|
info->offset += to_free;
|
|
|
|
if (info->bytes) {
|
|
|
|
ret = link_free_space(ctl, info);
|
|
|
|
WARN_ON(ret);
|
|
|
|
} else {
|
|
|
|
kmem_cache_free(btrfs_free_space_cachep, info);
|
|
|
|
}
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-23 13:14:11 -04:00
|
|
|
|
2012-06-27 15:10:56 -04:00
|
|
|
offset += to_free;
|
|
|
|
bytes -= to_free;
|
|
|
|
goto again;
|
|
|
|
} else {
|
|
|
|
u64 old_end = info->bytes + info->offset;
|
2008-09-24 11:23:25 -04:00
|
|
|
|
2012-06-27 15:10:56 -04:00
|
|
|
info->bytes = offset - info->offset;
|
2011-03-29 13:46:06 +08:00
|
|
|
ret = link_free_space(ctl, info);
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
WARN_ON(ret);
|
|
|
|
if (ret)
|
|
|
|
goto out_lock;
|
|
|
|
|
2012-06-27 15:10:56 -04:00
|
|
|
/* Not enough bytes in this entry to satisfy us */
|
|
|
|
if (old_end < offset + bytes) {
|
|
|
|
bytes -= old_end - offset;
|
|
|
|
offset = old_end;
|
|
|
|
goto again;
|
|
|
|
} else if (old_end == offset + bytes) {
|
|
|
|
/* all done */
|
|
|
|
goto out_lock;
|
|
|
|
}
|
|
|
|
spin_unlock(&ctl->tree_lock);
|
|
|
|
|
|
|
|
ret = btrfs_add_free_space(block_group, offset + bytes,
|
|
|
|
old_end - (offset + bytes));
|
|
|
|
WARN_ON(ret);
|
|
|
|
goto out;
|
|
|
|
}
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-23 13:14:11 -04:00
|
|
|
}
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
|
2011-03-29 13:46:06 +08:00
|
|
|
ret = remove_from_bitmap(ctl, info, &offset, &bytes);
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
if (ret == -EAGAIN)
|
|
|
|
goto again;
|
2012-03-12 16:03:00 +01:00
|
|
|
BUG_ON(ret); /* logic error */
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
out_lock:
|
2011-03-29 13:46:06 +08:00
|
|
|
spin_unlock(&ctl->tree_lock);
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-23 13:14:11 -04:00
|
|
|
out:
|
Btrfs: nuke fs wide allocation mutex V2
This patch removes the giant fs_info->alloc_mutex and replaces it with a bunch
of little locks.
There is now a pinned_mutex, which is used when messing with the pinned_extents
extent io tree, and the extent_ins_mutex which is used with the pending_del and
extent_ins extent io trees.
The locking for the extent tree stuff was inspired by a patch that Yan Zheng
wrote to fix a race condition, I cleaned it up some and changed the locking
around a little bit, but the idea remains the same. Basically instead of
holding the extent_ins_mutex throughout the processing of an extent on the
extent_ins or pending_del trees, we just hold it while we're searching and when
we clear the bits on those trees, and lock the extent for the duration of the
operations on the extent.
Also to keep from getting hung up waiting to lock an extent, I've added a
try_lock_extent so if we cannot lock the extent, move on to the next one in the
tree and we'll come back to that one. I have tested this heavily and it does
not appear to break anything. This has to be applied on top of my
find_free_extent redo patch.
I tested this patch on top of Yan's space reblancing code and it worked fine.
The only thing that has changed since the last version is I pulled out all my
debugging stuff, apparently I forgot to run guilt refresh before I sent the
last patch out. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
2008-10-29 14:49:05 -04:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-23 13:14:11 -04:00
|
|
|
void btrfs_dump_free_space(struct btrfs_block_group_cache *block_group,
|
|
|
|
u64 bytes)
|
|
|
|
{
|
2011-03-29 13:46:06 +08:00
|
|
|
struct btrfs_free_space_ctl *ctl = block_group->free_space_ctl;
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-23 13:14:11 -04:00
|
|
|
struct btrfs_free_space *info;
|
|
|
|
struct rb_node *n;
|
|
|
|
int count = 0;
|
|
|
|
|
2011-03-29 13:46:06 +08:00
|
|
|
for (n = rb_first(&ctl->free_space_offset); n; n = rb_next(n)) {
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-23 13:14:11 -04:00
|
|
|
info = rb_entry(n, struct btrfs_free_space, offset_index);
|
2012-07-06 03:31:36 -06:00
|
|
|
if (info->bytes >= bytes && !block_group->ro)
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-23 13:14:11 -04:00
|
|
|
count++;
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
printk(KERN_CRIT "entry offset %llu, bytes %llu, bitmap %s\n",
|
2009-04-21 12:38:29 -07:00
|
|
|
(unsigned long long)info->offset,
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
(unsigned long long)info->bytes,
|
|
|
|
(info->bitmap) ? "yes" : "no");
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-23 13:14:11 -04:00
|
|
|
}
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
printk(KERN_INFO "block group has cluster?: %s\n",
|
|
|
|
list_empty(&block_group->cluster_list) ? "no" : "yes");
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-23 13:14:11 -04:00
|
|
|
printk(KERN_INFO "%d blocks of free space at or bigger than bytes is"
|
|
|
|
"\n", count);
|
|
|
|
}
|
|
|
|
|
2011-03-29 13:46:06 +08:00
|
|
|
void btrfs_init_free_space_ctl(struct btrfs_block_group_cache *block_group)
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-23 13:14:11 -04:00
|
|
|
{
|
2011-03-29 13:46:06 +08:00
|
|
|
struct btrfs_free_space_ctl *ctl = block_group->free_space_ctl;
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-23 13:14:11 -04:00
|
|
|
|
2011-03-29 13:46:06 +08:00
|
|
|
spin_lock_init(&ctl->tree_lock);
|
|
|
|
ctl->unit = block_group->sectorsize;
|
|
|
|
ctl->start = block_group->key.objectid;
|
|
|
|
ctl->private = block_group;
|
|
|
|
ctl->op = &free_space_op;
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-23 13:14:11 -04:00
|
|
|
|
2011-03-29 13:46:06 +08:00
|
|
|
/*
|
|
|
|
* we only want to have 32k of ram per block group for keeping
|
|
|
|
* track of free space, and if we pass 1/2 of that we want to
|
|
|
|
* start converting things over to using bitmaps
|
|
|
|
*/
|
|
|
|
ctl->extents_thresh = ((1024 * 32) / 2) /
|
|
|
|
sizeof(struct btrfs_free_space);
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-23 13:14:11 -04:00
|
|
|
}
|
|
|
|
|
2009-04-03 09:47:43 -04:00
|
|
|
/*
|
|
|
|
* for a given cluster, put all of its extents back into the free
|
|
|
|
* space cache. If the block group passed doesn't match the block group
|
|
|
|
* pointed to by the cluster, someone else raced in and freed the
|
|
|
|
* cluster already. In that case, we just return without changing anything
|
|
|
|
*/
|
|
|
|
static int
|
|
|
|
__btrfs_return_cluster_to_free_space(
|
|
|
|
struct btrfs_block_group_cache *block_group,
|
|
|
|
struct btrfs_free_cluster *cluster)
|
|
|
|
{
|
2011-03-29 13:46:06 +08:00
|
|
|
struct btrfs_free_space_ctl *ctl = block_group->free_space_ctl;
|
2009-04-03 09:47:43 -04:00
|
|
|
struct btrfs_free_space *entry;
|
|
|
|
struct rb_node *node;
|
|
|
|
|
|
|
|
spin_lock(&cluster->lock);
|
|
|
|
if (cluster->block_group != block_group)
|
|
|
|
goto out;
|
|
|
|
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
cluster->block_group = NULL;
|
2009-04-03 09:47:43 -04:00
|
|
|
cluster->window_start = 0;
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
list_del_init(&cluster->block_group_list);
|
|
|
|
|
2009-04-03 09:47:43 -04:00
|
|
|
node = rb_first(&cluster->root);
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
while (node) {
|
2011-03-21 10:11:24 -04:00
|
|
|
bool bitmap;
|
|
|
|
|
2009-04-03 09:47:43 -04:00
|
|
|
entry = rb_entry(node, struct btrfs_free_space, offset_index);
|
|
|
|
node = rb_next(&entry->offset_index);
|
|
|
|
rb_erase(&entry->offset_index, &cluster->root);
|
2011-03-21 10:11:24 -04:00
|
|
|
|
|
|
|
bitmap = (entry->bitmap != NULL);
|
|
|
|
if (!bitmap)
|
2011-03-29 13:46:06 +08:00
|
|
|
try_merge_free_space(ctl, entry, false);
|
|
|
|
tree_insert_offset(&ctl->free_space_offset,
|
2011-03-21 10:11:24 -04:00
|
|
|
entry->offset, &entry->offset_index, bitmap);
|
2009-04-03 09:47:43 -04:00
|
|
|
}
|
2010-02-23 19:43:04 +00:00
|
|
|
cluster->root = RB_ROOT;
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
|
2009-04-03 09:47:43 -04:00
|
|
|
out:
|
|
|
|
spin_unlock(&cluster->lock);
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
btrfs_put_block_group(block_group);
|
2009-04-03 09:47:43 -04:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2011-05-21 09:27:38 -04:00
|
|
|
void __btrfs_remove_free_space_cache_locked(struct btrfs_free_space_ctl *ctl)
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-23 13:14:11 -04:00
|
|
|
{
|
|
|
|
struct btrfs_free_space *info;
|
|
|
|
struct rb_node *node;
|
Btrfs: Cache free inode numbers in memory
Currently btrfs stores the highest objectid of the fs tree, and it always
returns (highest+1) inode number when we create a file, so inode numbers
won't be reclaimed when we delete files, so we'll run out of inode numbers
as we keep create/delete files in 32bits machines.
This fixes it, and it works similarly to how we cache free space in block
cgroups.
We start a kernel thread to read the file tree. By scanning inode items,
we know which chunks of inode numbers are free, and we cache them in
an rb-tree.
Because we are searching the commit root, we have to carefully handle the
cross-transaction case.
The rb-tree is a hybrid extent+bitmap tree, so if we have too many small
chunks of inode numbers, we'll use bitmaps. Initially we allow 16K ram
of extents, and a bitmap will be used if we exceed this threshold. The
extents threshold is adjusted in runtime.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2011-04-20 10:06:11 +08:00
|
|
|
|
|
|
|
while ((node = rb_last(&ctl->free_space_offset)) != NULL) {
|
|
|
|
info = rb_entry(node, struct btrfs_free_space, offset_index);
|
2011-06-24 16:02:51 +00:00
|
|
|
if (!info->bitmap) {
|
|
|
|
unlink_free_space(ctl, info);
|
|
|
|
kmem_cache_free(btrfs_free_space_cachep, info);
|
|
|
|
} else {
|
|
|
|
free_bitmap(ctl, info);
|
|
|
|
}
|
Btrfs: Cache free inode numbers in memory
Currently btrfs stores the highest objectid of the fs tree, and it always
returns (highest+1) inode number when we create a file, so inode numbers
won't be reclaimed when we delete files, so we'll run out of inode numbers
as we keep create/delete files in 32bits machines.
This fixes it, and it works similarly to how we cache free space in block
cgroups.
We start a kernel thread to read the file tree. By scanning inode items,
we know which chunks of inode numbers are free, and we cache them in
an rb-tree.
Because we are searching the commit root, we have to carefully handle the
cross-transaction case.
The rb-tree is a hybrid extent+bitmap tree, so if we have too many small
chunks of inode numbers, we'll use bitmaps. Initially we allow 16K ram
of extents, and a bitmap will be used if we exceed this threshold. The
extents threshold is adjusted in runtime.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2011-04-20 10:06:11 +08:00
|
|
|
if (need_resched()) {
|
|
|
|
spin_unlock(&ctl->tree_lock);
|
|
|
|
cond_resched();
|
|
|
|
spin_lock(&ctl->tree_lock);
|
|
|
|
}
|
|
|
|
}
|
2011-05-21 09:27:38 -04:00
|
|
|
}
|
|
|
|
|
|
|
|
void __btrfs_remove_free_space_cache(struct btrfs_free_space_ctl *ctl)
|
|
|
|
{
|
|
|
|
spin_lock(&ctl->tree_lock);
|
|
|
|
__btrfs_remove_free_space_cache_locked(ctl);
|
Btrfs: Cache free inode numbers in memory
Currently btrfs stores the highest objectid of the fs tree, and it always
returns (highest+1) inode number when we create a file, so inode numbers
won't be reclaimed when we delete files, so we'll run out of inode numbers
as we keep create/delete files in 32bits machines.
This fixes it, and it works similarly to how we cache free space in block
cgroups.
We start a kernel thread to read the file tree. By scanning inode items,
we know which chunks of inode numbers are free, and we cache them in
an rb-tree.
Because we are searching the commit root, we have to carefully handle the
cross-transaction case.
The rb-tree is a hybrid extent+bitmap tree, so if we have too many small
chunks of inode numbers, we'll use bitmaps. Initially we allow 16K ram
of extents, and a bitmap will be used if we exceed this threshold. The
extents threshold is adjusted in runtime.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2011-04-20 10:06:11 +08:00
|
|
|
spin_unlock(&ctl->tree_lock);
|
|
|
|
}
|
|
|
|
|
|
|
|
void btrfs_remove_free_space_cache(struct btrfs_block_group_cache *block_group)
|
|
|
|
{
|
|
|
|
struct btrfs_free_space_ctl *ctl = block_group->free_space_ctl;
|
2009-04-03 09:47:43 -04:00
|
|
|
struct btrfs_free_cluster *cluster;
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
struct list_head *head;
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-23 13:14:11 -04:00
|
|
|
|
2011-03-29 13:46:06 +08:00
|
|
|
spin_lock(&ctl->tree_lock);
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
while ((head = block_group->cluster_list.next) !=
|
|
|
|
&block_group->cluster_list) {
|
|
|
|
cluster = list_entry(head, struct btrfs_free_cluster,
|
|
|
|
block_group_list);
|
2009-04-03 09:47:43 -04:00
|
|
|
|
|
|
|
WARN_ON(cluster->block_group != block_group);
|
|
|
|
__btrfs_return_cluster_to_free_space(block_group, cluster);
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
if (need_resched()) {
|
2011-03-29 13:46:06 +08:00
|
|
|
spin_unlock(&ctl->tree_lock);
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
cond_resched();
|
2011-03-29 13:46:06 +08:00
|
|
|
spin_lock(&ctl->tree_lock);
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
}
|
2009-04-03 09:47:43 -04:00
|
|
|
}
|
2011-05-21 09:27:38 -04:00
|
|
|
__btrfs_remove_free_space_cache_locked(ctl);
|
2011-03-29 13:46:06 +08:00
|
|
|
spin_unlock(&ctl->tree_lock);
|
2009-04-03 09:47:43 -04:00
|
|
|
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-23 13:14:11 -04:00
|
|
|
}
|
|
|
|
|
2009-04-03 10:14:18 -04:00
|
|
|
u64 btrfs_find_space_for_alloc(struct btrfs_block_group_cache *block_group,
|
|
|
|
u64 offset, u64 bytes, u64 empty_size)
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-23 13:14:11 -04:00
|
|
|
{
|
2011-03-29 13:46:06 +08:00
|
|
|
struct btrfs_free_space_ctl *ctl = block_group->free_space_ctl;
|
2009-04-03 10:14:18 -04:00
|
|
|
struct btrfs_free_space *entry = NULL;
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
u64 bytes_search = bytes + empty_size;
|
2009-04-03 10:14:18 -04:00
|
|
|
u64 ret = 0;
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-23 13:14:11 -04:00
|
|
|
|
2011-03-29 13:46:06 +08:00
|
|
|
spin_lock(&ctl->tree_lock);
|
|
|
|
entry = find_free_space(ctl, &offset, &bytes_search);
|
2009-04-03 10:14:18 -04:00
|
|
|
if (!entry)
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
goto out;
|
|
|
|
|
|
|
|
ret = offset;
|
|
|
|
if (entry->bitmap) {
|
2011-03-29 13:46:06 +08:00
|
|
|
bitmap_clear_bits(ctl, entry, offset, bytes);
|
2010-11-09 14:50:07 +08:00
|
|
|
if (!entry->bytes)
|
2011-03-29 13:46:06 +08:00
|
|
|
free_bitmap(ctl, entry);
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
} else {
|
2011-03-29 13:46:06 +08:00
|
|
|
unlink_free_space(ctl, entry);
|
2009-04-03 10:14:18 -04:00
|
|
|
entry->offset += bytes;
|
|
|
|
entry->bytes -= bytes;
|
|
|
|
if (!entry->bytes)
|
2011-01-28 17:05:48 -05:00
|
|
|
kmem_cache_free(btrfs_free_space_cachep, entry);
|
2009-04-03 10:14:18 -04:00
|
|
|
else
|
2011-03-29 13:46:06 +08:00
|
|
|
link_free_space(ctl, entry);
|
2009-04-03 10:14:18 -04:00
|
|
|
}
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-23 13:14:11 -04:00
|
|
|
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
out:
|
2011-03-29 13:46:06 +08:00
|
|
|
spin_unlock(&ctl->tree_lock);
|
Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
|
Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-23 13:14:11 -04:00
|
|
|
return ret;
|
|
|
|
}
|
2009-04-03 09:47:43 -04:00
|
|
|
|
|
|
|
/*
|
|
|
|
* given a cluster, put all of its extents back into the free space
|
|
|
|
* cache. If a block group is passed, this function will only free
|
|
|
|
* a cluster that belongs to the passed block group.
|
|
|
|
*
|
|
|
|
* Otherwise, it'll get a reference on the block group pointed to by the
|
|
|
|
* cluster and remove the cluster from it.
|
|
|
|
*/
|
|
|
|
int btrfs_return_cluster_to_free_space(
|
|
|
|
struct btrfs_block_group_cache *block_group,
|
|
|
|
struct btrfs_free_cluster *cluster)
|
|
|
|
{
|
2011-03-29 13:46:06 +08:00
|
|
|
struct btrfs_free_space_ctl *ctl;
|
2009-04-03 09:47:43 -04:00
|
|
|
int ret;
|
|
|
|
|
|
|
|
/* first, get a safe pointer to the block group */
|
|
|
|
spin_lock(&cluster->lock);
|
|
|
|
if (!block_group) {
|
|
|
|
block_group = cluster->block_group;
|
|
|
|
if (!block_group) {
|
|
|
|
spin_unlock(&cluster->lock);
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
} else if (cluster->block_group != block_group) {
|
|
|
|
/* someone else has already freed it don't redo their work */
|
|
|
|
spin_unlock(&cluster->lock);
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
atomic_inc(&block_group->count);
|
|
|
|
spin_unlock(&cluster->lock);
|
|
|
|
|
2011-03-29 13:46:06 +08:00
|
|
|
ctl = block_group->free_space_ctl;
|
|
|
|
|
2009-04-03 09:47:43 -04:00
|
|
|
/* now return any extents the cluster had on it */
|
2011-03-29 13:46:06 +08:00
|
|
|
spin_lock(&ctl->tree_lock);
|
2009-04-03 09:47:43 -04:00
|
|
|
ret = __btrfs_return_cluster_to_free_space(block_group, cluster);
|
2011-03-29 13:46:06 +08:00
|
|
|
spin_unlock(&ctl->tree_lock);
|
2009-04-03 09:47:43 -04:00
|
|
|
|
|
|
|
/* finally drop our ref */
|
|
|
|
btrfs_put_block_group(block_group);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
static u64 btrfs_alloc_from_bitmap(struct btrfs_block_group_cache *block_group,
|
|
|
|
struct btrfs_free_cluster *cluster,
|
2011-03-21 10:11:24 -04:00
|
|
|
struct btrfs_free_space *entry,
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
u64 bytes, u64 min_start)
|
|
|
|
{
|
2011-03-29 13:46:06 +08:00
|
|
|
struct btrfs_free_space_ctl *ctl = block_group->free_space_ctl;
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
int err;
|
|
|
|
u64 search_start = cluster->window_start;
|
|
|
|
u64 search_bytes = bytes;
|
|
|
|
u64 ret = 0;
|
|
|
|
|
|
|
|
search_start = min_start;
|
|
|
|
search_bytes = bytes;
|
|
|
|
|
2011-03-29 13:46:06 +08:00
|
|
|
err = search_bitmap(ctl, entry, &search_start, &search_bytes);
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
if (err)
|
2011-03-21 10:11:24 -04:00
|
|
|
return 0;
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
|
|
|
|
ret = search_start;
|
2011-08-05 09:32:35 +00:00
|
|
|
__bitmap_clear_bits(ctl, entry, ret, bytes);
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2009-04-03 09:47:43 -04:00
|
|
|
/*
|
|
|
|
* given a cluster, try to allocate 'bytes' from it, returns 0
|
|
|
|
* if it couldn't find anything suitably large, or a logical disk offset
|
|
|
|
* if things worked out
|
|
|
|
*/
|
|
|
|
u64 btrfs_alloc_from_cluster(struct btrfs_block_group_cache *block_group,
|
|
|
|
struct btrfs_free_cluster *cluster, u64 bytes,
|
|
|
|
u64 min_start)
|
|
|
|
{
|
2011-03-29 13:46:06 +08:00
|
|
|
struct btrfs_free_space_ctl *ctl = block_group->free_space_ctl;
|
2009-04-03 09:47:43 -04:00
|
|
|
struct btrfs_free_space *entry = NULL;
|
|
|
|
struct rb_node *node;
|
|
|
|
u64 ret = 0;
|
|
|
|
|
|
|
|
spin_lock(&cluster->lock);
|
|
|
|
if (bytes > cluster->max_size)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
if (cluster->block_group != block_group)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
node = rb_first(&cluster->root);
|
|
|
|
if (!node)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
entry = rb_entry(node, struct btrfs_free_space, offset_index);
|
|
|
|
while(1) {
|
2011-03-21 10:11:24 -04:00
|
|
|
if (entry->bytes < bytes ||
|
|
|
|
(!entry->bitmap && entry->offset < min_start)) {
|
2009-04-03 09:47:43 -04:00
|
|
|
node = rb_next(&entry->offset_index);
|
|
|
|
if (!node)
|
|
|
|
break;
|
|
|
|
entry = rb_entry(node, struct btrfs_free_space,
|
|
|
|
offset_index);
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
2011-03-21 10:11:24 -04:00
|
|
|
if (entry->bitmap) {
|
|
|
|
ret = btrfs_alloc_from_bitmap(block_group,
|
|
|
|
cluster, entry, bytes,
|
2012-01-26 15:01:11 -05:00
|
|
|
cluster->window_start);
|
2011-03-21 10:11:24 -04:00
|
|
|
if (ret == 0) {
|
|
|
|
node = rb_next(&entry->offset_index);
|
|
|
|
if (!node)
|
|
|
|
break;
|
|
|
|
entry = rb_entry(node, struct btrfs_free_space,
|
|
|
|
offset_index);
|
|
|
|
continue;
|
|
|
|
}
|
2012-01-26 15:01:12 -05:00
|
|
|
cluster->window_start += bytes;
|
2011-03-21 10:11:24 -04:00
|
|
|
} else {
|
|
|
|
ret = entry->offset;
|
|
|
|
|
|
|
|
entry->offset += bytes;
|
|
|
|
entry->bytes -= bytes;
|
|
|
|
}
|
2009-04-03 09:47:43 -04:00
|
|
|
|
2010-11-09 14:55:34 +08:00
|
|
|
if (entry->bytes == 0)
|
2009-04-03 09:47:43 -04:00
|
|
|
rb_erase(&entry->offset_index, &cluster->root);
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
out:
|
|
|
|
spin_unlock(&cluster->lock);
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
|
2010-11-09 14:55:34 +08:00
|
|
|
if (!ret)
|
|
|
|
return 0;
|
|
|
|
|
2011-03-29 13:46:06 +08:00
|
|
|
spin_lock(&ctl->tree_lock);
|
2010-11-09 14:55:34 +08:00
|
|
|
|
2011-03-29 13:46:06 +08:00
|
|
|
ctl->free_space -= bytes;
|
2010-11-09 14:55:34 +08:00
|
|
|
if (entry->bytes == 0) {
|
2011-03-29 13:46:06 +08:00
|
|
|
ctl->free_extents--;
|
2011-03-21 10:11:24 -04:00
|
|
|
if (entry->bitmap) {
|
|
|
|
kfree(entry->bitmap);
|
2011-03-29 13:46:06 +08:00
|
|
|
ctl->total_bitmaps--;
|
|
|
|
ctl->op->recalc_thresholds(ctl);
|
2011-03-21 10:11:24 -04:00
|
|
|
}
|
2011-01-28 17:05:48 -05:00
|
|
|
kmem_cache_free(btrfs_free_space_cachep, entry);
|
2010-11-09 14:55:34 +08:00
|
|
|
}
|
|
|
|
|
2011-03-29 13:46:06 +08:00
|
|
|
spin_unlock(&ctl->tree_lock);
|
2010-11-09 14:55:34 +08:00
|
|
|
|
2009-04-03 09:47:43 -04:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
static int btrfs_bitmap_cluster(struct btrfs_block_group_cache *block_group,
|
|
|
|
struct btrfs_free_space *entry,
|
|
|
|
struct btrfs_free_cluster *cluster,
|
2011-10-14 12:10:36 -03:00
|
|
|
u64 offset, u64 bytes,
|
|
|
|
u64 cont1_bytes, u64 min_bytes)
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
{
|
2011-03-29 13:46:06 +08:00
|
|
|
struct btrfs_free_space_ctl *ctl = block_group->free_space_ctl;
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
unsigned long next_zero;
|
|
|
|
unsigned long i;
|
2011-10-14 12:10:36 -03:00
|
|
|
unsigned long want_bits;
|
|
|
|
unsigned long min_bits;
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
unsigned long found_bits;
|
|
|
|
unsigned long start = 0;
|
|
|
|
unsigned long total_found = 0;
|
2011-03-21 10:11:24 -04:00
|
|
|
int ret;
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
|
|
|
|
i = offset_to_bit(entry->offset, block_group->sectorsize,
|
|
|
|
max_t(u64, offset, entry->offset));
|
2011-10-14 12:10:36 -03:00
|
|
|
want_bits = bytes_to_bits(bytes, block_group->sectorsize);
|
|
|
|
min_bits = bytes_to_bits(min_bytes, block_group->sectorsize);
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
|
|
|
|
again:
|
|
|
|
found_bits = 0;
|
|
|
|
for (i = find_next_bit(entry->bitmap, BITS_PER_BITMAP, i);
|
|
|
|
i < BITS_PER_BITMAP;
|
|
|
|
i = find_next_bit(entry->bitmap, BITS_PER_BITMAP, i + 1)) {
|
|
|
|
next_zero = find_next_zero_bit(entry->bitmap,
|
|
|
|
BITS_PER_BITMAP, i);
|
2011-10-14 12:10:36 -03:00
|
|
|
if (next_zero - i >= min_bits) {
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
found_bits = next_zero - i;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
i = next_zero;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (!found_bits)
|
2011-03-21 10:11:24 -04:00
|
|
|
return -ENOSPC;
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
|
2011-10-14 12:10:36 -03:00
|
|
|
if (!total_found) {
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
start = i;
|
2011-11-30 13:43:00 -05:00
|
|
|
cluster->max_size = 0;
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
}
|
|
|
|
|
|
|
|
total_found += found_bits;
|
|
|
|
|
|
|
|
if (cluster->max_size < found_bits * block_group->sectorsize)
|
|
|
|
cluster->max_size = found_bits * block_group->sectorsize;
|
|
|
|
|
2011-10-14 12:10:36 -03:00
|
|
|
if (total_found < want_bits || cluster->max_size < cont1_bytes) {
|
|
|
|
i = next_zero + 1;
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
goto again;
|
|
|
|
}
|
|
|
|
|
|
|
|
cluster->window_start = start * block_group->sectorsize +
|
|
|
|
entry->offset;
|
2011-03-29 13:46:06 +08:00
|
|
|
rb_erase(&entry->offset_index, &ctl->free_space_offset);
|
2011-03-21 10:11:24 -04:00
|
|
|
ret = tree_insert_offset(&cluster->root, entry->offset,
|
|
|
|
&entry->offset_index, 1);
|
2012-03-12 16:03:00 +01:00
|
|
|
BUG_ON(ret); /* -EEXIST; Logic error */
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
|
2011-11-10 08:29:20 -05:00
|
|
|
trace_btrfs_setup_cluster(block_group, cluster,
|
|
|
|
total_found * block_group->sectorsize, 1);
|
Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-13 21:29:25 -04:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2011-03-21 10:11:24 -04:00
|
|
|
/*
|
|
|
|
* This searches the block group for just extents to fill the cluster with.
|
2011-10-14 12:10:36 -03:00
|
|
|
* Try to find a cluster with at least bytes total bytes, at least one
|
|
|
|
* extent of cont1_bytes, and other clusters of at least min_bytes.
|
2011-03-21 10:11:24 -04:00
|
|
|
*/
|
2011-05-25 13:07:37 -04:00
|
|
|
static noinline int
|
|
|
|
setup_cluster_no_bitmap(struct btrfs_block_group_cache *block_group,
|
|
|
|
struct btrfs_free_cluster *cluster,
|
|
|
|
struct list_head *bitmaps, u64 offset, u64 bytes,
|
2011-10-14 12:10:36 -03:00
|
|
|
u64 cont1_bytes, u64 min_bytes)
|
2011-03-21 10:11:24 -04:00
|
|
|
{
|
2011-03-29 13:46:06 +08:00
|
|
|
struct btrfs_free_space_ctl *ctl = block_group->free_space_ctl;
|
2011-03-21 10:11:24 -04:00
|
|
|
struct btrfs_free_space *first = NULL;
|
|
|
|
struct btrfs_free_space *entry = NULL;
|
|
|
|
struct btrfs_free_space *last;
|
|
|
|
struct rb_node *node;
|
|
|
|
u64 window_start;
|
|
|
|
u64 window_free;
|
|
|
|
u64 max_extent;
|
2011-11-10 08:29:20 -05:00
|
|
|
u64 total_size = 0;
|
2011-03-21 10:11:24 -04:00
|
|
|
|
2011-03-29 13:46:06 +08:00
|
|
|
entry = tree_search_offset(ctl, offset, 0, 1);
|
2011-03-21 10:11:24 -04:00
|
|
|
if (!entry)
|
|
|
|
return -ENOSPC;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We don't want bitmaps, so just move along until we find a normal
|
|
|
|
* extent entry.
|
|
|
|
*/
|
2011-10-14 12:10:36 -03:00
|
|
|
while (entry->bitmap || entry->bytes < min_bytes) {
|
|
|
|
if (entry->bitmap && list_empty(&entry->list))
|
2011-05-25 13:03:16 -04:00
|
|
|
list_add_tail(&entry->list, bitmaps);
|
2011-03-21 10:11:24 -04:00
|
|
|
node = rb_next(&entry->offset_index);
|
|
|
|
if (!node)
|
|
|
|
return -ENOSPC;
|
|
|
|
entry = rb_entry(node, struct btrfs_free_space, offset_index);
|
|
|
|
}
|
|
|
|
|
|
|
|
window_start = entry->offset;
|
|
|
|
window_free = entry->bytes;
|
|
|
|
max_extent = entry->bytes;
|
|
|
|
first = entry;
|
|
|
|
last = entry;
|
|
|
|
|
2011-10-14 12:10:36 -03:00
|
|
|
for (node = rb_next(&entry->offset_index); node;
|
|
|
|
node = rb_next(&entry->offset_index)) {
|
2011-03-21 10:11:24 -04:00
|
|
|
entry = rb_entry(node, struct btrfs_free_space, offset_index);
|
|
|
|
|
2011-05-25 13:03:16 -04:00
|
|
|
if (entry->bitmap) {
|
|
|
|
if (list_empty(&entry->list))
|
|
|
|
list_add_tail(&entry->list, bitmaps);
|
2011-03-21 10:11:24 -04:00
|
|
|
continue;
|
2011-05-25 13:03:16 -04:00
|
|
|
}
|
|
|
|
|
2011-10-14 12:10:36 -03:00
|
|
|
if (entry->bytes < min_bytes)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
last = entry;
|
|
|
|
window_free += entry->bytes;
|
|
|
|
if (entry->bytes > max_extent)
|
2011-03-21 10:11:24 -04:00
|
|
|
max_extent = entry->bytes;
|
|
|
|
}
|
|
|
|
|
2011-10-14 12:10:36 -03:00
|
|
|
if (window_free < bytes || max_extent < cont1_bytes)
|
|
|
|
return -ENOSPC;
|
|
|
|
|
2011-03-21 10:11:24 -04:00
|
|
|
cluster->window_start = first->offset;
|
|
|
|
|
|
|
|
node = &first->offset_index;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* now we've found our entries, pull them out of the free space
|
|
|
|
* cache and put them into the cluster rbtree
|
|
|
|
*/
|
|
|
|
do {
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
entry = rb_entry(node, struct btrfs_free_space, offset_index);
|
|
|
|
node = rb_next(&entry->offset_index);
|
2011-10-14 12:10:36 -03:00
|
|
|
if (entry->bitmap || entry->bytes < min_bytes)
|
2011-03-21 10:11:24 -04:00
|
|
|
continue;
|
|
|
|
|
2011-03-29 13:46:06 +08:00
|
|
|
rb_erase(&entry->offset_index, &ctl->free_space_offset);
|
2011-03-21 10:11:24 -04:00
|
|
|
ret = tree_insert_offset(&cluster->root, entry->offset,
|
|
|
|
&entry->offset_index, 0);
|
2011-11-10 08:29:20 -05:00
|
|
|
total_size += entry->bytes;
|
2012-03-12 16:03:00 +01:00
|
|
|
BUG_ON(ret); /* -EEXIST; Logic error */
|
2011-03-21 10:11:24 -04:00
|
|
|
} while (node && entry != last);
|
|
|
|
|
|
|
|
cluster->max_size = max_extent;
|
2011-11-10 08:29:20 -05:00
|
|
|
trace_btrfs_setup_cluster(block_group, cluster, total_size, 0);
|
2011-03-21 10:11:24 -04:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* This specifically looks for bitmaps that may work in the cluster, we assume
|
|
|
|
* that we have already failed to find extents that will work.
|
|
|
|
*/
|
2011-05-25 13:07:37 -04:00
|
|
|
static noinline int
|
|
|
|
setup_cluster_bitmap(struct btrfs_block_group_cache *block_group,
|
|
|
|
struct btrfs_free_cluster *cluster,
|
|
|
|
struct list_head *bitmaps, u64 offset, u64 bytes,
|
2011-10-14 12:10:36 -03:00
|
|
|
u64 cont1_bytes, u64 min_bytes)
|
2011-03-21 10:11:24 -04:00
|
|
|
{
|
2011-03-29 13:46:06 +08:00
|
|
|
struct btrfs_free_space_ctl *ctl = block_group->free_space_ctl;
|
2011-03-21 10:11:24 -04:00
|
|
|
struct btrfs_free_space *entry;
|
|
|
|
int ret = -ENOSPC;
|
Btrfs: fix to search one more bitmap for cluster setup
Suppose there are two bitmaps [0, 256], [256, 512] and one extent
[100, 120] in the free space cache, and we want to setup a cluster
with offset=100, bytes=50.
In this case, there will be only one bitmap [256, 512] in the temporary
bitmaps list, and then setup_cluster_bitmap() won't search bitmap [0, 256].
The cause is, the list is constructed in setup_cluster_no_bitmap(),
and only bitmaps with bitmap_entry->offset >= offset will be added
into the list, and the very bitmap that convers offset has
bitmap_entry->offset <= offset.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-20 07:33:38 -05:00
|
|
|
u64 bitmap_offset = offset_to_bitmap(ctl, offset);
|
2011-03-21 10:11:24 -04:00
|
|
|
|
2011-03-29 13:46:06 +08:00
|
|
|
if (ctl->total_bitmaps == 0)
|
2011-03-21 10:11:24 -04:00
|
|
|
return -ENOSPC;
|
|
|
|
|
Btrfs: fix to search one more bitmap for cluster setup
Suppose there are two bitmaps [0, 256], [256, 512] and one extent
[100, 120] in the free space cache, and we want to setup a cluster
with offset=100, bytes=50.
In this case, there will be only one bitmap [256, 512] in the temporary
bitmaps list, and then setup_cluster_bitmap() won't search bitmap [0, 256].
The cause is, the list is constructed in setup_cluster_no_bitmap(),
and only bitmaps with bitmap_entry->offset >= offset will be added
into the list, and the very bitmap that convers offset has
bitmap_entry->offset <= offset.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-20 07:33:38 -05:00
|
|
|
/*
|
|
|
|
* The bitmap that covers offset won't be in the list unless offset
|
|
|
|
* is just its start offset.
|
|
|
|
*/
|
|
|
|
entry = list_first_entry(bitmaps, struct btrfs_free_space, list);
|
|
|
|
if (entry->offset != bitmap_offset) {
|
|
|
|
entry = tree_search_offset(ctl, bitmap_offset, 1, 0);
|
|
|
|
if (entry && list_empty(&entry->list))
|
|
|
|
list_add(&entry->list, bitmaps);
|
|
|
|
}
|
|
|
|
|
2011-05-25 13:03:16 -04:00
|
|
|
list_for_each_entry(entry, bitmaps, list) {
|
2012-01-26 15:01:11 -05:00
|
|
|
if (entry->bytes < bytes)
|
2011-05-25 13:03:16 -04:00
|
|
|
continue;
|
|
|
|
ret = btrfs_bitmap_cluster(block_group, entry, cluster, offset,
|
2011-10-14 12:10:36 -03:00
|
|
|
bytes, cont1_bytes, min_bytes);
|
2011-05-25 13:03:16 -04:00
|
|
|
if (!ret)
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
2011-11-20 07:33:38 -05:00
|
|
|
* The bitmaps list has all the bitmaps that record free space
|
|
|
|
* starting after offset, so no more search is required.
|
2011-05-25 13:03:16 -04:00
|
|
|
*/
|
2011-11-20 07:33:38 -05:00
|
|
|
return -ENOSPC;
|
2011-03-21 10:11:24 -04:00
|
|
|
}
|
|
|
|
|
2009-04-03 09:47:43 -04:00
|
|
|
/*
|
|
|
|
* here we try to find a cluster of blocks in a block group. The goal
|
2011-10-14 12:10:36 -03:00
|
|
|
* is to find at least bytes+empty_size.
|
2009-04-03 09:47:43 -04:00
|
|
|
* We might not find them all in one contiguous area.
|
|
|
|
*
|
|
|
|
* returns zero and sets up cluster if things worked out, otherwise
|
|
|
|
* it returns -enospc
|
|
|
|
*/
|
|
|
|
int btrfs_find_space_cluster(struct btrfs_trans_handle *trans,
|
2009-06-09 20:28:34 -04:00
|
|
|
struct btrfs_root *root,
|
2009-04-03 09:47:43 -04:00
|
|
|
struct btrfs_block_group_cache *block_group,
|
|
|
|
struct btrfs_free_cluster *cluster,
|
|
|
|
u64 offset, u64 bytes, u64 empty_size)
|
|
|
|
{
|
2011-03-29 13:46:06 +08:00
|
|
|
struct btrfs_free_space_ctl *ctl = block_group->free_space_ctl;
|
2011-05-25 13:03:16 -04:00
|
|
|
struct btrfs_free_space *entry, *tmp;
|
2011-11-20 07:33:38 -05:00
|
|
|
LIST_HEAD(bitmaps);
|
2009-04-03 09:47:43 -04:00
|
|
|
u64 min_bytes;
|
2011-10-14 12:10:36 -03:00
|
|
|
u64 cont1_bytes;
|
2009-04-03 09:47:43 -04:00
|
|
|
int ret;
|
|
|
|
|
2011-10-14 12:10:36 -03:00
|
|
|
/*
|
|
|
|
* Choose the minimum extent size we'll require for this
|
|
|
|
* cluster. For SSD_SPREAD, don't allow any fragmentation.
|
|
|
|
* For metadata, allow allocates with smaller extents. For
|
|
|
|
* data, keep it dense.
|
|
|
|
*/
|
2009-06-09 20:28:34 -04:00
|
|
|
if (btrfs_test_opt(root, SSD_SPREAD)) {
|
2011-10-14 12:10:36 -03:00
|
|
|
cont1_bytes = min_bytes = bytes + empty_size;
|
2009-06-09 20:28:34 -04:00
|
|
|
} else if (block_group->flags & BTRFS_BLOCK_GROUP_METADATA) {
|
2011-10-14 12:10:36 -03:00
|
|
|
cont1_bytes = bytes;
|
|
|
|
min_bytes = block_group->sectorsize;
|
|
|
|
} else {
|
|
|
|
cont1_bytes = max(bytes, (bytes + empty_size) >> 2);
|
|
|
|
min_bytes = block_group->sectorsize;
|
|
|
|
}
|
2009-04-03 09:47:43 -04:00
|
|
|
|
2011-03-29 13:46:06 +08:00
|
|
|
spin_lock(&ctl->tree_lock);
|
2011-03-18 15:13:42 -04:00
|
|
|
|
|
|
|
/*
|
|
|
|
* If we know we don't have enough space to make a cluster don't even
|
|
|
|
* bother doing all the work to try and find one.
|
|
|
|
*/
|
2011-10-14 12:10:36 -03:00
|
|
|
if (ctl->free_space < bytes) {
|
2011-03-29 13:46:06 +08:00
|
|
|
spin_unlock(&ctl->tree_lock);
|
2011-03-18 15:13:42 -04:00
|
|
|
return -ENOSPC;
|
|
|
|
}
|
|
|
|
|
2009-04-03 09:47:43 -04:00
|
|
|
spin_lock(&cluster->lock);
|
|
|
|
|
|
|
|
/* someone already found a cluster, hooray */
|
|
|
|
if (cluster->block_group) {
|
|
|
|
ret = 0;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
2011-11-10 08:29:20 -05:00
|
|
|
trace_btrfs_find_cluster(block_group, offset, bytes, empty_size,
|
|
|
|
min_bytes);
|
|
|
|
|
|
|
|
INIT_LIST_HEAD(&bitmaps);
|
2011-05-25 13:03:16 -04:00
|
|
|
ret = setup_cluster_no_bitmap(block_group, cluster, &bitmaps, offset,
|
2011-10-14 12:10:36 -03:00
|
|
|
bytes + empty_size,
|
|
|
|
cont1_bytes, min_bytes);
|
2011-03-21 10:11:24 -04:00
|
|
|
if (ret)
|
2011-05-25 13:03:16 -04:00
|
|
|
ret = setup_cluster_bitmap(block_group, cluster, &bitmaps,
|
2011-10-14 12:10:36 -03:00
|
|
|
offset, bytes + empty_size,
|
|
|
|
cont1_bytes, min_bytes);
|
2011-05-25 13:03:16 -04:00
|
|
|
|
|
|
|
/* Clear our temporary list */
|
|
|
|
list_for_each_entry_safe(entry, tmp, &bitmaps, list)
|
|
|
|
list_del_init(&entry->list);
|
2009-04-03 09:47:43 -04:00
|
|
|
|
2011-03-21 10:11:24 -04:00
|
|
|
if (!ret) {
|
|
|
|
atomic_inc(&block_group->count);
|
|
|
|
list_add_tail(&cluster->block_group_list,
|
|
|
|
&block_group->cluster_list);
|
|
|
|
cluster->block_group = block_group;
|
2011-11-10 08:29:20 -05:00
|
|
|
} else {
|
|
|
|
trace_btrfs_failed_cluster_setup(block_group);
|
2009-04-03 09:47:43 -04:00
|
|
|
}
|
|
|
|
out:
|
|
|
|
spin_unlock(&cluster->lock);
|
2011-03-29 13:46:06 +08:00
|
|
|
spin_unlock(&ctl->tree_lock);
|
2009-04-03 09:47:43 -04:00
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* simple code to zero out a cluster
|
|
|
|
*/
|
|
|
|
void btrfs_init_free_cluster(struct btrfs_free_cluster *cluster)
|
|
|
|
{
|
|
|
|
spin_lock_init(&cluster->lock);
|
|
|
|
spin_lock_init(&cluster->refill_lock);
|
2010-02-23 19:43:04 +00:00
|
|
|
cluster->root = RB_ROOT;
|
2009-04-03 09:47:43 -04:00
|
|
|
cluster->max_size = 0;
|
|
|
|
INIT_LIST_HEAD(&cluster->block_group_list);
|
|
|
|
cluster->block_group = NULL;
|
|
|
|
}
|
|
|
|
|
Btrfs: rewrite btrfs_trim_block_group()
There are various bugs in block group trimming:
- It may trim from offset smaller than user-specified offset.
- It may trim beyond user-specified range.
- It may leak free space for extents smaller than specified minlen.
- It may truncate the last trimmed extent thus leak free space.
- With mixed extents+bitmaps, some extents may not be trimmed.
- With mixed extents+bitmaps, some bitmaps may not be trimmed (even
none will be trimmed). Even for those trimmed, not all the free space
in the bitmaps will be trimmed.
I rewrite btrfs_trim_block_group() and break it into two functions.
One is to trim extents only, and the other is to trim bitmaps only.
Before patching:
# fstrim -v /mnt/
/mnt/: 1496465408 bytes were trimmed
After patching:
# fstrim -v /mnt/
/mnt/: 2193768448 bytes were trimmed
And this matches the total free space:
# btrfs fi df /mnt
Data: total=3.58GB, used=1.79GB
System, DUP: total=8.00MB, used=4.00KB
System: total=4.00MB, used=0.00
Metadata, DUP: total=205.12MB, used=97.14MB
Metadata: total=8.00MB, used=0.00
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2011-12-29 14:47:27 +08:00
|
|
|
static int do_trimming(struct btrfs_block_group_cache *block_group,
|
|
|
|
u64 *total_trimmed, u64 start, u64 bytes,
|
|
|
|
u64 reserved_start, u64 reserved_bytes)
|
2011-03-24 10:24:28 +00:00
|
|
|
{
|
Btrfs: rewrite btrfs_trim_block_group()
There are various bugs in block group trimming:
- It may trim from offset smaller than user-specified offset.
- It may trim beyond user-specified range.
- It may leak free space for extents smaller than specified minlen.
- It may truncate the last trimmed extent thus leak free space.
- With mixed extents+bitmaps, some extents may not be trimmed.
- With mixed extents+bitmaps, some bitmaps may not be trimmed (even
none will be trimmed). Even for those trimmed, not all the free space
in the bitmaps will be trimmed.
I rewrite btrfs_trim_block_group() and break it into two functions.
One is to trim extents only, and the other is to trim bitmaps only.
Before patching:
# fstrim -v /mnt/
/mnt/: 1496465408 bytes were trimmed
After patching:
# fstrim -v /mnt/
/mnt/: 2193768448 bytes were trimmed
And this matches the total free space:
# btrfs fi df /mnt
Data: total=3.58GB, used=1.79GB
System, DUP: total=8.00MB, used=4.00KB
System: total=4.00MB, used=0.00
Metadata, DUP: total=205.12MB, used=97.14MB
Metadata: total=8.00MB, used=0.00
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2011-12-29 14:47:27 +08:00
|
|
|
struct btrfs_space_info *space_info = block_group->space_info;
|
2011-03-24 10:24:28 +00:00
|
|
|
struct btrfs_fs_info *fs_info = block_group->fs_info;
|
Btrfs: rewrite btrfs_trim_block_group()
There are various bugs in block group trimming:
- It may trim from offset smaller than user-specified offset.
- It may trim beyond user-specified range.
- It may leak free space for extents smaller than specified minlen.
- It may truncate the last trimmed extent thus leak free space.
- With mixed extents+bitmaps, some extents may not be trimmed.
- With mixed extents+bitmaps, some bitmaps may not be trimmed (even
none will be trimmed). Even for those trimmed, not all the free space
in the bitmaps will be trimmed.
I rewrite btrfs_trim_block_group() and break it into two functions.
One is to trim extents only, and the other is to trim bitmaps only.
Before patching:
# fstrim -v /mnt/
/mnt/: 1496465408 bytes were trimmed
After patching:
# fstrim -v /mnt/
/mnt/: 2193768448 bytes were trimmed
And this matches the total free space:
# btrfs fi df /mnt
Data: total=3.58GB, used=1.79GB
System, DUP: total=8.00MB, used=4.00KB
System: total=4.00MB, used=0.00
Metadata, DUP: total=205.12MB, used=97.14MB
Metadata: total=8.00MB, used=0.00
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2011-12-29 14:47:27 +08:00
|
|
|
int ret;
|
|
|
|
int update = 0;
|
|
|
|
u64 trimmed = 0;
|
2011-03-24 10:24:28 +00:00
|
|
|
|
Btrfs: rewrite btrfs_trim_block_group()
There are various bugs in block group trimming:
- It may trim from offset smaller than user-specified offset.
- It may trim beyond user-specified range.
- It may leak free space for extents smaller than specified minlen.
- It may truncate the last trimmed extent thus leak free space.
- With mixed extents+bitmaps, some extents may not be trimmed.
- With mixed extents+bitmaps, some bitmaps may not be trimmed (even
none will be trimmed). Even for those trimmed, not all the free space
in the bitmaps will be trimmed.
I rewrite btrfs_trim_block_group() and break it into two functions.
One is to trim extents only, and the other is to trim bitmaps only.
Before patching:
# fstrim -v /mnt/
/mnt/: 1496465408 bytes were trimmed
After patching:
# fstrim -v /mnt/
/mnt/: 2193768448 bytes were trimmed
And this matches the total free space:
# btrfs fi df /mnt
Data: total=3.58GB, used=1.79GB
System, DUP: total=8.00MB, used=4.00KB
System: total=4.00MB, used=0.00
Metadata, DUP: total=205.12MB, used=97.14MB
Metadata: total=8.00MB, used=0.00
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2011-12-29 14:47:27 +08:00
|
|
|
spin_lock(&space_info->lock);
|
|
|
|
spin_lock(&block_group->lock);
|
|
|
|
if (!block_group->ro) {
|
|
|
|
block_group->reserved += reserved_bytes;
|
|
|
|
space_info->bytes_reserved += reserved_bytes;
|
|
|
|
update = 1;
|
|
|
|
}
|
|
|
|
spin_unlock(&block_group->lock);
|
|
|
|
spin_unlock(&space_info->lock);
|
|
|
|
|
|
|
|
ret = btrfs_error_discard_extent(fs_info->extent_root,
|
|
|
|
start, bytes, &trimmed);
|
|
|
|
if (!ret)
|
|
|
|
*total_trimmed += trimmed;
|
|
|
|
|
|
|
|
btrfs_add_free_space(block_group, reserved_start, reserved_bytes);
|
|
|
|
|
|
|
|
if (update) {
|
|
|
|
spin_lock(&space_info->lock);
|
|
|
|
spin_lock(&block_group->lock);
|
|
|
|
if (block_group->ro)
|
|
|
|
space_info->bytes_readonly += reserved_bytes;
|
|
|
|
block_group->reserved -= reserved_bytes;
|
|
|
|
space_info->bytes_reserved -= reserved_bytes;
|
|
|
|
spin_unlock(&space_info->lock);
|
|
|
|
spin_unlock(&block_group->lock);
|
|
|
|
}
|
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int trim_no_bitmap(struct btrfs_block_group_cache *block_group,
|
|
|
|
u64 *total_trimmed, u64 start, u64 end, u64 minlen)
|
|
|
|
{
|
|
|
|
struct btrfs_free_space_ctl *ctl = block_group->free_space_ctl;
|
|
|
|
struct btrfs_free_space *entry;
|
|
|
|
struct rb_node *node;
|
|
|
|
int ret = 0;
|
|
|
|
u64 extent_start;
|
|
|
|
u64 extent_bytes;
|
|
|
|
u64 bytes;
|
2011-03-24 10:24:28 +00:00
|
|
|
|
|
|
|
while (start < end) {
|
2011-03-29 13:46:06 +08:00
|
|
|
spin_lock(&ctl->tree_lock);
|
2011-03-24 10:24:28 +00:00
|
|
|
|
2011-03-29 13:46:06 +08:00
|
|
|
if (ctl->free_space < minlen) {
|
|
|
|
spin_unlock(&ctl->tree_lock);
|
2011-03-24 10:24:28 +00:00
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
2011-03-29 13:46:06 +08:00
|
|
|
entry = tree_search_offset(ctl, start, 0, 1);
|
Btrfs: rewrite btrfs_trim_block_group()
There are various bugs in block group trimming:
- It may trim from offset smaller than user-specified offset.
- It may trim beyond user-specified range.
- It may leak free space for extents smaller than specified minlen.
- It may truncate the last trimmed extent thus leak free space.
- With mixed extents+bitmaps, some extents may not be trimmed.
- With mixed extents+bitmaps, some bitmaps may not be trimmed (even
none will be trimmed). Even for those trimmed, not all the free space
in the bitmaps will be trimmed.
I rewrite btrfs_trim_block_group() and break it into two functions.
One is to trim extents only, and the other is to trim bitmaps only.
Before patching:
# fstrim -v /mnt/
/mnt/: 1496465408 bytes were trimmed
After patching:
# fstrim -v /mnt/
/mnt/: 2193768448 bytes were trimmed
And this matches the total free space:
# btrfs fi df /mnt
Data: total=3.58GB, used=1.79GB
System, DUP: total=8.00MB, used=4.00KB
System: total=4.00MB, used=0.00
Metadata, DUP: total=205.12MB, used=97.14MB
Metadata: total=8.00MB, used=0.00
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2011-12-29 14:47:27 +08:00
|
|
|
if (!entry) {
|
2011-03-29 13:46:06 +08:00
|
|
|
spin_unlock(&ctl->tree_lock);
|
2011-03-24 10:24:28 +00:00
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
Btrfs: rewrite btrfs_trim_block_group()
There are various bugs in block group trimming:
- It may trim from offset smaller than user-specified offset.
- It may trim beyond user-specified range.
- It may leak free space for extents smaller than specified minlen.
- It may truncate the last trimmed extent thus leak free space.
- With mixed extents+bitmaps, some extents may not be trimmed.
- With mixed extents+bitmaps, some bitmaps may not be trimmed (even
none will be trimmed). Even for those trimmed, not all the free space
in the bitmaps will be trimmed.
I rewrite btrfs_trim_block_group() and break it into two functions.
One is to trim extents only, and the other is to trim bitmaps only.
Before patching:
# fstrim -v /mnt/
/mnt/: 1496465408 bytes were trimmed
After patching:
# fstrim -v /mnt/
/mnt/: 2193768448 bytes were trimmed
And this matches the total free space:
# btrfs fi df /mnt
Data: total=3.58GB, used=1.79GB
System, DUP: total=8.00MB, used=4.00KB
System: total=4.00MB, used=0.00
Metadata, DUP: total=205.12MB, used=97.14MB
Metadata: total=8.00MB, used=0.00
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2011-12-29 14:47:27 +08:00
|
|
|
/* skip bitmaps */
|
|
|
|
while (entry->bitmap) {
|
|
|
|
node = rb_next(&entry->offset_index);
|
|
|
|
if (!node) {
|
2011-03-29 13:46:06 +08:00
|
|
|
spin_unlock(&ctl->tree_lock);
|
Btrfs: rewrite btrfs_trim_block_group()
There are various bugs in block group trimming:
- It may trim from offset smaller than user-specified offset.
- It may trim beyond user-specified range.
- It may leak free space for extents smaller than specified minlen.
- It may truncate the last trimmed extent thus leak free space.
- With mixed extents+bitmaps, some extents may not be trimmed.
- With mixed extents+bitmaps, some bitmaps may not be trimmed (even
none will be trimmed). Even for those trimmed, not all the free space
in the bitmaps will be trimmed.
I rewrite btrfs_trim_block_group() and break it into two functions.
One is to trim extents only, and the other is to trim bitmaps only.
Before patching:
# fstrim -v /mnt/
/mnt/: 1496465408 bytes were trimmed
After patching:
# fstrim -v /mnt/
/mnt/: 2193768448 bytes were trimmed
And this matches the total free space:
# btrfs fi df /mnt
Data: total=3.58GB, used=1.79GB
System, DUP: total=8.00MB, used=4.00KB
System: total=4.00MB, used=0.00
Metadata, DUP: total=205.12MB, used=97.14MB
Metadata: total=8.00MB, used=0.00
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2011-12-29 14:47:27 +08:00
|
|
|
goto out;
|
2011-03-24 10:24:28 +00:00
|
|
|
}
|
Btrfs: rewrite btrfs_trim_block_group()
There are various bugs in block group trimming:
- It may trim from offset smaller than user-specified offset.
- It may trim beyond user-specified range.
- It may leak free space for extents smaller than specified minlen.
- It may truncate the last trimmed extent thus leak free space.
- With mixed extents+bitmaps, some extents may not be trimmed.
- With mixed extents+bitmaps, some bitmaps may not be trimmed (even
none will be trimmed). Even for those trimmed, not all the free space
in the bitmaps will be trimmed.
I rewrite btrfs_trim_block_group() and break it into two functions.
One is to trim extents only, and the other is to trim bitmaps only.
Before patching:
# fstrim -v /mnt/
/mnt/: 1496465408 bytes were trimmed
After patching:
# fstrim -v /mnt/
/mnt/: 2193768448 bytes were trimmed
And this matches the total free space:
# btrfs fi df /mnt
Data: total=3.58GB, used=1.79GB
System, DUP: total=8.00MB, used=4.00KB
System: total=4.00MB, used=0.00
Metadata, DUP: total=205.12MB, used=97.14MB
Metadata: total=8.00MB, used=0.00
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2011-12-29 14:47:27 +08:00
|
|
|
entry = rb_entry(node, struct btrfs_free_space,
|
|
|
|
offset_index);
|
2011-03-24 10:24:28 +00:00
|
|
|
}
|
|
|
|
|
Btrfs: rewrite btrfs_trim_block_group()
There are various bugs in block group trimming:
- It may trim from offset smaller than user-specified offset.
- It may trim beyond user-specified range.
- It may leak free space for extents smaller than specified minlen.
- It may truncate the last trimmed extent thus leak free space.
- With mixed extents+bitmaps, some extents may not be trimmed.
- With mixed extents+bitmaps, some bitmaps may not be trimmed (even
none will be trimmed). Even for those trimmed, not all the free space
in the bitmaps will be trimmed.
I rewrite btrfs_trim_block_group() and break it into two functions.
One is to trim extents only, and the other is to trim bitmaps only.
Before patching:
# fstrim -v /mnt/
/mnt/: 1496465408 bytes were trimmed
After patching:
# fstrim -v /mnt/
/mnt/: 2193768448 bytes were trimmed
And this matches the total free space:
# btrfs fi df /mnt
Data: total=3.58GB, used=1.79GB
System, DUP: total=8.00MB, used=4.00KB
System: total=4.00MB, used=0.00
Metadata, DUP: total=205.12MB, used=97.14MB
Metadata: total=8.00MB, used=0.00
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2011-12-29 14:47:27 +08:00
|
|
|
if (entry->offset >= end) {
|
|
|
|
spin_unlock(&ctl->tree_lock);
|
|
|
|
break;
|
2011-03-24 10:24:28 +00:00
|
|
|
}
|
|
|
|
|
Btrfs: rewrite btrfs_trim_block_group()
There are various bugs in block group trimming:
- It may trim from offset smaller than user-specified offset.
- It may trim beyond user-specified range.
- It may leak free space for extents smaller than specified minlen.
- It may truncate the last trimmed extent thus leak free space.
- With mixed extents+bitmaps, some extents may not be trimmed.
- With mixed extents+bitmaps, some bitmaps may not be trimmed (even
none will be trimmed). Even for those trimmed, not all the free space
in the bitmaps will be trimmed.
I rewrite btrfs_trim_block_group() and break it into two functions.
One is to trim extents only, and the other is to trim bitmaps only.
Before patching:
# fstrim -v /mnt/
/mnt/: 1496465408 bytes were trimmed
After patching:
# fstrim -v /mnt/
/mnt/: 2193768448 bytes were trimmed
And this matches the total free space:
# btrfs fi df /mnt
Data: total=3.58GB, used=1.79GB
System, DUP: total=8.00MB, used=4.00KB
System: total=4.00MB, used=0.00
Metadata, DUP: total=205.12MB, used=97.14MB
Metadata: total=8.00MB, used=0.00
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2011-12-29 14:47:27 +08:00
|
|
|
extent_start = entry->offset;
|
|
|
|
extent_bytes = entry->bytes;
|
|
|
|
start = max(start, extent_start);
|
|
|
|
bytes = min(extent_start + extent_bytes, end) - start;
|
|
|
|
if (bytes < minlen) {
|
|
|
|
spin_unlock(&ctl->tree_lock);
|
|
|
|
goto next;
|
2011-03-24 10:24:28 +00:00
|
|
|
}
|
|
|
|
|
Btrfs: rewrite btrfs_trim_block_group()
There are various bugs in block group trimming:
- It may trim from offset smaller than user-specified offset.
- It may trim beyond user-specified range.
- It may leak free space for extents smaller than specified minlen.
- It may truncate the last trimmed extent thus leak free space.
- With mixed extents+bitmaps, some extents may not be trimmed.
- With mixed extents+bitmaps, some bitmaps may not be trimmed (even
none will be trimmed). Even for those trimmed, not all the free space
in the bitmaps will be trimmed.
I rewrite btrfs_trim_block_group() and break it into two functions.
One is to trim extents only, and the other is to trim bitmaps only.
Before patching:
# fstrim -v /mnt/
/mnt/: 1496465408 bytes were trimmed
After patching:
# fstrim -v /mnt/
/mnt/: 2193768448 bytes were trimmed
And this matches the total free space:
# btrfs fi df /mnt
Data: total=3.58GB, used=1.79GB
System, DUP: total=8.00MB, used=4.00KB
System: total=4.00MB, used=0.00
Metadata, DUP: total=205.12MB, used=97.14MB
Metadata: total=8.00MB, used=0.00
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2011-12-29 14:47:27 +08:00
|
|
|
unlink_free_space(ctl, entry);
|
|
|
|
kmem_cache_free(btrfs_free_space_cachep, entry);
|
|
|
|
|
2011-03-29 13:46:06 +08:00
|
|
|
spin_unlock(&ctl->tree_lock);
|
2011-03-24 10:24:28 +00:00
|
|
|
|
Btrfs: rewrite btrfs_trim_block_group()
There are various bugs in block group trimming:
- It may trim from offset smaller than user-specified offset.
- It may trim beyond user-specified range.
- It may leak free space for extents smaller than specified minlen.
- It may truncate the last trimmed extent thus leak free space.
- With mixed extents+bitmaps, some extents may not be trimmed.
- With mixed extents+bitmaps, some bitmaps may not be trimmed (even
none will be trimmed). Even for those trimmed, not all the free space
in the bitmaps will be trimmed.
I rewrite btrfs_trim_block_group() and break it into two functions.
One is to trim extents only, and the other is to trim bitmaps only.
Before patching:
# fstrim -v /mnt/
/mnt/: 1496465408 bytes were trimmed
After patching:
# fstrim -v /mnt/
/mnt/: 2193768448 bytes were trimmed
And this matches the total free space:
# btrfs fi df /mnt
Data: total=3.58GB, used=1.79GB
System, DUP: total=8.00MB, used=4.00KB
System: total=4.00MB, used=0.00
Metadata, DUP: total=205.12MB, used=97.14MB
Metadata: total=8.00MB, used=0.00
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2011-12-29 14:47:27 +08:00
|
|
|
ret = do_trimming(block_group, total_trimmed, start, bytes,
|
|
|
|
extent_start, extent_bytes);
|
|
|
|
if (ret)
|
|
|
|
break;
|
|
|
|
next:
|
|
|
|
start += bytes;
|
2011-03-24 10:24:28 +00:00
|
|
|
|
Btrfs: rewrite btrfs_trim_block_group()
There are various bugs in block group trimming:
- It may trim from offset smaller than user-specified offset.
- It may trim beyond user-specified range.
- It may leak free space for extents smaller than specified minlen.
- It may truncate the last trimmed extent thus leak free space.
- With mixed extents+bitmaps, some extents may not be trimmed.
- With mixed extents+bitmaps, some bitmaps may not be trimmed (even
none will be trimmed). Even for those trimmed, not all the free space
in the bitmaps will be trimmed.
I rewrite btrfs_trim_block_group() and break it into two functions.
One is to trim extents only, and the other is to trim bitmaps only.
Before patching:
# fstrim -v /mnt/
/mnt/: 1496465408 bytes were trimmed
After patching:
# fstrim -v /mnt/
/mnt/: 2193768448 bytes were trimmed
And this matches the total free space:
# btrfs fi df /mnt
Data: total=3.58GB, used=1.79GB
System, DUP: total=8.00MB, used=4.00KB
System: total=4.00MB, used=0.00
Metadata, DUP: total=205.12MB, used=97.14MB
Metadata: total=8.00MB, used=0.00
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2011-12-29 14:47:27 +08:00
|
|
|
if (fatal_signal_pending(current)) {
|
|
|
|
ret = -ERESTARTSYS;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
cond_resched();
|
|
|
|
}
|
|
|
|
out:
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int trim_bitmaps(struct btrfs_block_group_cache *block_group,
|
|
|
|
u64 *total_trimmed, u64 start, u64 end, u64 minlen)
|
|
|
|
{
|
|
|
|
struct btrfs_free_space_ctl *ctl = block_group->free_space_ctl;
|
|
|
|
struct btrfs_free_space *entry;
|
|
|
|
int ret = 0;
|
|
|
|
int ret2;
|
|
|
|
u64 bytes;
|
|
|
|
u64 offset = offset_to_bitmap(ctl, start);
|
|
|
|
|
|
|
|
while (offset < end) {
|
|
|
|
bool next_bitmap = false;
|
|
|
|
|
|
|
|
spin_lock(&ctl->tree_lock);
|
|
|
|
|
|
|
|
if (ctl->free_space < minlen) {
|
|
|
|
spin_unlock(&ctl->tree_lock);
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
entry = tree_search_offset(ctl, offset, 1, 0);
|
|
|
|
if (!entry) {
|
|
|
|
spin_unlock(&ctl->tree_lock);
|
|
|
|
next_bitmap = true;
|
|
|
|
goto next;
|
|
|
|
}
|
|
|
|
|
|
|
|
bytes = minlen;
|
|
|
|
ret2 = search_bitmap(ctl, entry, &start, &bytes);
|
|
|
|
if (ret2 || start >= end) {
|
|
|
|
spin_unlock(&ctl->tree_lock);
|
|
|
|
next_bitmap = true;
|
|
|
|
goto next;
|
|
|
|
}
|
|
|
|
|
|
|
|
bytes = min(bytes, end - start);
|
|
|
|
if (bytes < minlen) {
|
|
|
|
spin_unlock(&ctl->tree_lock);
|
|
|
|
goto next;
|
|
|
|
}
|
|
|
|
|
|
|
|
bitmap_clear_bits(ctl, entry, start, bytes);
|
|
|
|
if (entry->bytes == 0)
|
|
|
|
free_bitmap(ctl, entry);
|
|
|
|
|
|
|
|
spin_unlock(&ctl->tree_lock);
|
|
|
|
|
|
|
|
ret = do_trimming(block_group, total_trimmed, start, bytes,
|
|
|
|
start, bytes);
|
|
|
|
if (ret)
|
|
|
|
break;
|
|
|
|
next:
|
|
|
|
if (next_bitmap) {
|
|
|
|
offset += BITS_PER_BITMAP * ctl->unit;
|
|
|
|
} else {
|
|
|
|
start += bytes;
|
|
|
|
if (start >= offset + BITS_PER_BITMAP * ctl->unit)
|
|
|
|
offset += BITS_PER_BITMAP * ctl->unit;
|
2011-03-24 10:24:28 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
if (fatal_signal_pending(current)) {
|
|
|
|
ret = -ERESTARTSYS;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
cond_resched();
|
|
|
|
}
|
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
Btrfs: Cache free inode numbers in memory
Currently btrfs stores the highest objectid of the fs tree, and it always
returns (highest+1) inode number when we create a file, so inode numbers
won't be reclaimed when we delete files, so we'll run out of inode numbers
as we keep create/delete files in 32bits machines.
This fixes it, and it works similarly to how we cache free space in block
cgroups.
We start a kernel thread to read the file tree. By scanning inode items,
we know which chunks of inode numbers are free, and we cache them in
an rb-tree.
Because we are searching the commit root, we have to carefully handle the
cross-transaction case.
The rb-tree is a hybrid extent+bitmap tree, so if we have too many small
chunks of inode numbers, we'll use bitmaps. Initially we allow 16K ram
of extents, and a bitmap will be used if we exceed this threshold. The
extents threshold is adjusted in runtime.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2011-04-20 10:06:11 +08:00
|
|
|
|
Btrfs: rewrite btrfs_trim_block_group()
There are various bugs in block group trimming:
- It may trim from offset smaller than user-specified offset.
- It may trim beyond user-specified range.
- It may leak free space for extents smaller than specified minlen.
- It may truncate the last trimmed extent thus leak free space.
- With mixed extents+bitmaps, some extents may not be trimmed.
- With mixed extents+bitmaps, some bitmaps may not be trimmed (even
none will be trimmed). Even for those trimmed, not all the free space
in the bitmaps will be trimmed.
I rewrite btrfs_trim_block_group() and break it into two functions.
One is to trim extents only, and the other is to trim bitmaps only.
Before patching:
# fstrim -v /mnt/
/mnt/: 1496465408 bytes were trimmed
After patching:
# fstrim -v /mnt/
/mnt/: 2193768448 bytes were trimmed
And this matches the total free space:
# btrfs fi df /mnt
Data: total=3.58GB, used=1.79GB
System, DUP: total=8.00MB, used=4.00KB
System: total=4.00MB, used=0.00
Metadata, DUP: total=205.12MB, used=97.14MB
Metadata: total=8.00MB, used=0.00
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2011-12-29 14:47:27 +08:00
|
|
|
int btrfs_trim_block_group(struct btrfs_block_group_cache *block_group,
|
|
|
|
u64 *trimmed, u64 start, u64 end, u64 minlen)
|
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
*trimmed = 0;
|
|
|
|
|
|
|
|
ret = trim_no_bitmap(block_group, trimmed, start, end, minlen);
|
|
|
|
if (ret)
|
|
|
|
return ret;
|
|
|
|
|
|
|
|
ret = trim_bitmaps(block_group, trimmed, start, end, minlen);
|
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
Btrfs: Cache free inode numbers in memory
Currently btrfs stores the highest objectid of the fs tree, and it always
returns (highest+1) inode number when we create a file, so inode numbers
won't be reclaimed when we delete files, so we'll run out of inode numbers
as we keep create/delete files in 32bits machines.
This fixes it, and it works similarly to how we cache free space in block
cgroups.
We start a kernel thread to read the file tree. By scanning inode items,
we know which chunks of inode numbers are free, and we cache them in
an rb-tree.
Because we are searching the commit root, we have to carefully handle the
cross-transaction case.
The rb-tree is a hybrid extent+bitmap tree, so if we have too many small
chunks of inode numbers, we'll use bitmaps. Initially we allow 16K ram
of extents, and a bitmap will be used if we exceed this threshold. The
extents threshold is adjusted in runtime.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2011-04-20 10:06:11 +08:00
|
|
|
/*
|
|
|
|
* Find the left-most item in the cache tree, and then return the
|
|
|
|
* smallest inode number in the item.
|
|
|
|
*
|
|
|
|
* Note: the returned inode number may not be the smallest one in
|
|
|
|
* the tree, if the left-most item is a bitmap.
|
|
|
|
*/
|
|
|
|
u64 btrfs_find_ino_for_alloc(struct btrfs_root *fs_root)
|
|
|
|
{
|
|
|
|
struct btrfs_free_space_ctl *ctl = fs_root->free_ino_ctl;
|
|
|
|
struct btrfs_free_space *entry = NULL;
|
|
|
|
u64 ino = 0;
|
|
|
|
|
|
|
|
spin_lock(&ctl->tree_lock);
|
|
|
|
|
|
|
|
if (RB_EMPTY_ROOT(&ctl->free_space_offset))
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
entry = rb_entry(rb_first(&ctl->free_space_offset),
|
|
|
|
struct btrfs_free_space, offset_index);
|
|
|
|
|
|
|
|
if (!entry->bitmap) {
|
|
|
|
ino = entry->offset;
|
|
|
|
|
|
|
|
unlink_free_space(ctl, entry);
|
|
|
|
entry->offset++;
|
|
|
|
entry->bytes--;
|
|
|
|
if (!entry->bytes)
|
|
|
|
kmem_cache_free(btrfs_free_space_cachep, entry);
|
|
|
|
else
|
|
|
|
link_free_space(ctl, entry);
|
|
|
|
} else {
|
|
|
|
u64 offset = 0;
|
|
|
|
u64 count = 1;
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
ret = search_bitmap(ctl, entry, &offset, &count);
|
2012-03-12 16:03:00 +01:00
|
|
|
/* Logic error; Should be empty if it can't find anything */
|
Btrfs: Cache free inode numbers in memory
Currently btrfs stores the highest objectid of the fs tree, and it always
returns (highest+1) inode number when we create a file, so inode numbers
won't be reclaimed when we delete files, so we'll run out of inode numbers
as we keep create/delete files in 32bits machines.
This fixes it, and it works similarly to how we cache free space in block
cgroups.
We start a kernel thread to read the file tree. By scanning inode items,
we know which chunks of inode numbers are free, and we cache them in
an rb-tree.
Because we are searching the commit root, we have to carefully handle the
cross-transaction case.
The rb-tree is a hybrid extent+bitmap tree, so if we have too many small
chunks of inode numbers, we'll use bitmaps. Initially we allow 16K ram
of extents, and a bitmap will be used if we exceed this threshold. The
extents threshold is adjusted in runtime.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2011-04-20 10:06:11 +08:00
|
|
|
BUG_ON(ret);
|
|
|
|
|
|
|
|
ino = offset;
|
|
|
|
bitmap_clear_bits(ctl, entry, offset, 1);
|
|
|
|
if (entry->bytes == 0)
|
|
|
|
free_bitmap(ctl, entry);
|
|
|
|
}
|
|
|
|
out:
|
|
|
|
spin_unlock(&ctl->tree_lock);
|
|
|
|
|
|
|
|
return ino;
|
|
|
|
}
|
2011-04-20 10:33:24 +08:00
|
|
|
|
|
|
|
struct inode *lookup_free_ino_inode(struct btrfs_root *root,
|
|
|
|
struct btrfs_path *path)
|
|
|
|
{
|
|
|
|
struct inode *inode = NULL;
|
|
|
|
|
|
|
|
spin_lock(&root->cache_lock);
|
|
|
|
if (root->cache_inode)
|
|
|
|
inode = igrab(root->cache_inode);
|
|
|
|
spin_unlock(&root->cache_lock);
|
|
|
|
if (inode)
|
|
|
|
return inode;
|
|
|
|
|
|
|
|
inode = __lookup_free_space_inode(root, path, 0);
|
|
|
|
if (IS_ERR(inode))
|
|
|
|
return inode;
|
|
|
|
|
|
|
|
spin_lock(&root->cache_lock);
|
2011-05-31 18:07:27 +02:00
|
|
|
if (!btrfs_fs_closing(root->fs_info))
|
2011-04-20 10:33:24 +08:00
|
|
|
root->cache_inode = igrab(inode);
|
|
|
|
spin_unlock(&root->cache_lock);
|
|
|
|
|
|
|
|
return inode;
|
|
|
|
}
|
|
|
|
|
|
|
|
int create_free_ino_inode(struct btrfs_root *root,
|
|
|
|
struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_path *path)
|
|
|
|
{
|
|
|
|
return __create_free_space_inode(root, trans, path,
|
|
|
|
BTRFS_FREE_INO_OBJECTID, 0);
|
|
|
|
}
|
|
|
|
|
|
|
|
int load_free_ino_cache(struct btrfs_fs_info *fs_info, struct btrfs_root *root)
|
|
|
|
{
|
|
|
|
struct btrfs_free_space_ctl *ctl = root->free_ino_ctl;
|
|
|
|
struct btrfs_path *path;
|
|
|
|
struct inode *inode;
|
|
|
|
int ret = 0;
|
|
|
|
u64 root_gen = btrfs_root_generation(&root->root_item);
|
|
|
|
|
2011-06-03 09:36:29 -04:00
|
|
|
if (!btrfs_test_opt(root, INODE_MAP_CACHE))
|
|
|
|
return 0;
|
|
|
|
|
2011-04-20 10:33:24 +08:00
|
|
|
/*
|
|
|
|
* If we're unmounting then just return, since this does a search on the
|
|
|
|
* normal root and not the commit root and we could deadlock.
|
|
|
|
*/
|
2011-05-31 18:07:27 +02:00
|
|
|
if (btrfs_fs_closing(fs_info))
|
2011-04-20 10:33:24 +08:00
|
|
|
return 0;
|
|
|
|
|
|
|
|
path = btrfs_alloc_path();
|
|
|
|
if (!path)
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
inode = lookup_free_ino_inode(root, path);
|
|
|
|
if (IS_ERR(inode))
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
if (root_gen != BTRFS_I(inode)->generation)
|
|
|
|
goto out_put;
|
|
|
|
|
|
|
|
ret = __load_free_space_cache(root, inode, ctl, path, 0);
|
|
|
|
|
|
|
|
if (ret < 0)
|
|
|
|
printk(KERN_ERR "btrfs: failed to load free ino cache for "
|
|
|
|
"root %llu\n", root->root_key.objectid);
|
|
|
|
out_put:
|
|
|
|
iput(inode);
|
|
|
|
out:
|
|
|
|
btrfs_free_path(path);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
int btrfs_write_out_ino_cache(struct btrfs_root *root,
|
|
|
|
struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_path *path)
|
|
|
|
{
|
|
|
|
struct btrfs_free_space_ctl *ctl = root->free_ino_ctl;
|
|
|
|
struct inode *inode;
|
|
|
|
int ret;
|
|
|
|
|
2011-06-03 09:36:29 -04:00
|
|
|
if (!btrfs_test_opt(root, INODE_MAP_CACHE))
|
|
|
|
return 0;
|
|
|
|
|
2011-04-20 10:33:24 +08:00
|
|
|
inode = lookup_free_ino_inode(root, path);
|
|
|
|
if (IS_ERR(inode))
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
ret = __btrfs_write_out_cache(root, inode, ctl, NULL, trans, path, 0);
|
2011-08-30 10:19:10 -04:00
|
|
|
if (ret) {
|
|
|
|
btrfs_delalloc_release_metadata(inode, inode->i_size);
|
|
|
|
#ifdef DEBUG
|
2011-04-20 10:33:24 +08:00
|
|
|
printk(KERN_ERR "btrfs: failed to write free ino cache "
|
|
|
|
"for root %llu\n", root->root_key.objectid);
|
2011-08-30 10:19:10 -04:00
|
|
|
#endif
|
|
|
|
}
|
2011-04-20 10:33:24 +08:00
|
|
|
|
|
|
|
iput(inode);
|
|
|
|
return ret;
|
|
|
|
}
|