xfs: document pageable kernel memory

Add a discussion of pageable kernel memory, since online fsck needs quite
a bit more memory than most other parts of the filesystem to stage records
and other information.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>

@@ -413,6 +413,8 @@ Algorithms") of Srinivasan.
However, any data structure builder that maintains a resource lock for the
duration of the repair is *always* an offline algorithm.

.. _secondary_metadata:

Secondary Metadata
``````````````````

@@ -1832,3 +1834,501 @@ Scrub teardown disables all static keys obtained by ``xchk_fsgates_enable``.

For more information, please see the kernel documentation of
Documentation/staging/static-keys.rst.

.. _xfile:

Pageable Kernel Memory
----------------------

Some online checking functions work by scanning the filesystem to build a
shadow copy of an ondisk metadata structure in memory and comparing the two
copies.
For online repair to rebuild a metadata structure, it must compute the record
set that will be stored in the new structure before it can persist that new
structure to disk.
Ideally, repairs complete with a single atomic commit that introduces
a new data structure.
To meet these goals, the kernel needs to collect a large amount of information
in a place that doesn't require the correct operation of the filesystem.

Kernel memory isn't suitable because:

* Allocating a contiguous region of memory to create a C array is very
  difficult, especially on 32-bit systems.

* Linked lists of records introduce double pointer overhead which is very high
  and eliminate the possibility of indexed lookups.

* Kernel memory is pinned, which can drive the system into OOM conditions.

* The system might not have sufficient memory to stage all the information.

At any given time, online fsck does not need to keep the entire record set in
memory, which means that individual records can be paged out if necessary.
Continued development of online fsck demonstrated that the ability to perform
indexed data storage would also be very useful.
Fortunately, the Linux kernel already has a facility for byte-addressable and
pageable storage: tmpfs.
In-kernel graphics drivers (most notably i915) take advantage of tmpfs files
to store intermediate data that doesn't need to be in memory at all times, so
that usage precedent is already established.
Hence, the ``xfile`` was born!

+--------------------------------------------------------------------------+
| **Historical Sidebar**:                                                  |
+--------------------------------------------------------------------------+
| The first edition of online repair inserted records into a new btree as  |
| it found them, which failed because the filesystem could shut down with  |
| a half-rebuilt data structure, which would be live after recovery        |
| finished.                                                                |
|                                                                          |
| The second edition solved the half-rebuilt structure problem by storing  |
| everything in memory, but frequently ran the system out of memory.       |
|                                                                          |
| The third edition solved the OOM problem by using linked lists, but the  |
| memory overhead of the list pointers was extreme.                        |
+--------------------------------------------------------------------------+

xfile Access Models
```````````````````

A survey of the intended uses of xfiles suggested these use cases:

1. Arrays of fixed-sized records (space management btrees, directory and
   extended attribute entries)

2. Sparse arrays of fixed-sized records (quotas and link counts)

3. Large binary objects (BLOBs) of variable sizes (directory and extended
   attribute names and values)

4. Staging btrees in memory (reverse mapping btrees)

5. Arbitrary contents (realtime space management)

To support the first four use cases, high level data structures wrap the xfile
to share functionality between online fsck functions.
The rest of this section discusses the interfaces that the xfile presents to
four of those five higher level data structures.
The fifth use case is discussed in the :ref:`realtime summary <rtsummary>` case
study.

The most general storage interface supported by the xfile enables the reading
and writing of arbitrary quantities of data at arbitrary offsets in the xfile.
This capability is provided by the ``xfile_pread`` and ``xfile_pwrite``
functions, which behave similarly to their userspace counterparts.
XFS is very record-based, which suggests that the ability to load and store
complete records is important.
To support these cases, a pair of ``xfile_obj_load`` and ``xfile_obj_store``
functions are provided to read and persist objects into an xfile.
They are internally the same as pread and pwrite, except that they treat any
error as an out of memory error.
For online repair, squashing error conditions in this manner is an acceptable
behavior because the only reaction is to abort the operation back to userspace.
All five xfile use cases can be serviced by these four functions.

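As an illustration, a minimal sketch of record staging with these functions
might look like the following; the record type and the helper are hypothetical,
and any failure is reported as if it were an out of memory error, per the
convention above:

.. code-block:: c

        /*
         * Hypothetical example: stage a fixed-size record at a given
         * array index with the xfile object functions.  The record
         * type and the helper are illustrative, not actual kernel code.
         */
        struct xexample_rec {
                uint64_t        key;
                uint64_t        value;
        };

        static int
        xexample_stage_rec(
                struct xfile                    *xf,
                uint64_t                        idx,
                const struct xexample_rec       *rec)
        {
                loff_t  pos = idx * sizeof(struct xexample_rec);

                /* Failures are reported as ENOMEM, per the xfile contract. */
                return xfile_obj_store(xf, rec, sizeof(*rec), pos);
        }
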
However, no discussion of file access idioms is complete without answering the
question, "But what about mmap?"
It is convenient to access storage directly with pointers, just like userspace
code does with regular memory.
Online fsck must not drive the system into OOM conditions, which means that
xfiles must be responsive to memory reclamation.
tmpfs can only push a pagecache folio to the swap cache if the folio is neither
pinned nor locked, which means the xfile must not pin too many folios.

Short term direct access to xfile contents is done by locking the pagecache
folio and mapping it into kernel address space.
Programmatic access (e.g. pread and pwrite) uses this mechanism.
Folio locks are not supposed to be held for long periods of time, so long
term direct access to xfile contents is done by bumping the folio refcount,
mapping it into kernel address space, and dropping the folio lock.
These long term users *must* be responsive to memory reclaim by hooking into
the shrinker infrastructure to know when to release folios.

The ``xfile_get_page`` and ``xfile_put_page`` functions are provided to
retrieve the (locked) folio that backs part of an xfile and to release it.
The only users of these folio lease functions are the xfarray
:ref:`sorting<xfarray_sort>` algorithms and the :ref:`in-memory
btrees<xfbtree>`.

xfile Access Coordination
`````````````````````````

For security reasons, xfiles must be owned privately by the kernel.
They are marked ``S_PRIVATE`` to prevent interference from the security system,
must never be mapped into process file descriptor tables, and their pages must
never be mapped into userspace processes.

To avoid locking recursion issues with the VFS, all accesses to the shmfs file
are performed by manipulating the page cache directly.
xfile writers call the ``->write_begin`` and ``->write_end`` functions of the
xfile's address space to grab writable pages, copy the caller's buffer into the
page, and release the pages.
xfile readers call ``shmem_read_mapping_page_gfp`` to grab pages directly
before copying the contents into the caller's buffer.
In other words, xfiles ignore the VFS read and write code paths to avoid
having to create a dummy ``struct kiocb`` and to avoid taking inode and
freeze locks.
tmpfs cannot be frozen, and xfiles must not be exposed to userspace.

If an xfile is shared between threads to stage repairs, the caller must provide
its own locks to coordinate access.
For example, if a scrub function stores scan results in an xfile and needs
other threads to provide updates to the scanned data, the scrub function must
provide a lock for all threads to share.

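For instance, a minimal sketch of such coordination, with a hypothetical scan
context supplying the lock, might be:

.. code-block:: c

        /*
         * Hypothetical scan context: the mutex comes from the scrub
         * function, not from the xfile itself.
         */
        struct xexample_scan {
                struct xfile    *xf;
                struct mutex    lock;
        };

        /* Called from the scrub thread and from live update hooks. */
        static int
        xexample_scan_update(
                struct xexample_scan    *scan,
                const void              *rec,
                size_t                  len,
                loff_t                  pos)
        {
                int                     error;

                mutex_lock(&scan->lock);
                error = xfile_obj_store(scan->xf, rec, len, pos);
                mutex_unlock(&scan->lock);
                return error;
        }
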
.. _xfarray:

Arrays of Fixed-Sized Records
`````````````````````````````

In XFS, each type of indexed space metadata (free space, inodes, reference
counts, file fork space, and reverse mappings) consists of a set of fixed-size
records indexed with a classic B+ tree.
Directories have a set of fixed-size dirent records that point to the names,
and extended attributes have a set of fixed-size attribute keys that point to
names and values.
Quota counters and file link counters index records with numbers.
During a repair, scrub needs to stage new records during the gathering step and
retrieve them during the btree building step.

Although this requirement can be satisfied by calling the read and write
methods of the xfile directly, it is simpler for callers if a higher level
abstraction takes care of computing array offsets, provides iterator
functions, and deals with sparse records and sorting.
The ``xfarray`` abstraction presents a linear array for fixed-size records atop
the byte-accessible xfile.

.. _xfarray_access_patterns:

Array Access Patterns
^^^^^^^^^^^^^^^^^^^^^

Array access patterns in online fsck tend to fall into three categories.
Iteration of records is assumed to be necessary for all cases and will be
covered in the next section.

The first type of caller handles records that are indexed by position.
Gaps may exist between records, and a record may be updated multiple times
during the collection step.
In other words, these callers want a sparse linearly addressed table file.
Typical use cases are quota records and file link count records.
Access to array elements is performed programmatically via ``xfarray_load`` and
``xfarray_store`` functions, which wrap the similarly-named xfile functions to
provide loading and storing of array elements at arbitrary array indices.
Gaps are defined to be null records, and null records are defined to be a
sequence of all zero bytes.
Null records are detected by calling ``xfarray_element_is_null``.
They are created either by calling ``xfarray_unset`` to null out an existing
record or by never storing anything to an array index.

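A minimal sketch of this sparse table idiom, using a hypothetical per-inode
link count record and assuming that a load from a never-written index yields a
null record, might be:

.. code-block:: c

        /* Hypothetical record type; the array index is the inode number. */
        struct xexample_nlink {
                uint32_t        nlink;
                uint32_t        flags;
        };

        static int
        xexample_bump_nlink(
                struct xfarray          *array,
                uint64_t                ino)
        {
                struct xexample_nlink   rec;
                int                     error;

                error = xfarray_load(array, ino, &rec);
                if (error)
                        return error;

                /*
                 * Gaps read back as null (all-zeroes) records, so an
                 * inode that has never been seen starts at nlink == 0.
                 */
                rec.nlink++;
                return xfarray_store(array, ino, &rec);
        }
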
The second type of caller handles records that are not indexed by position
and do not require multiple updates to a record.
The typical use case here is rebuilding space btrees and key/value btrees.
These callers can add records to the array without caring about array indices
via the ``xfarray_append`` function, which stores a record at the end of the
array.
For callers that require records to be presentable in a specific order (e.g.
rebuilding btree data), the ``xfarray_sort`` function can arrange the sorted
records; this function will be covered later.

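Put together, the append-then-sort idiom is roughly the following sketch; the
record type and the comparison function are hypothetical:

.. code-block:: c

        /* Hypothetical: collect records in scan order, then sort them. */
        static int
        xexample_collect_and_sort(
                struct xfarray                  *array,
                const struct xexample_rec       *recs,
                unsigned int                    nr_recs)
        {
                unsigned int                    i;
                int                             error;

                for (i = 0; i < nr_recs; i++) {
                        error = xfarray_append(array, &recs[i]);
                        if (error)
                                return error;
                }

                /* xexample_rec_cmp is a caller-supplied comparator. */
                return xfarray_sort(array, xexample_rec_cmp, 0);
        }
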
The third type of caller is a bag, which is useful for counting records.
The typical use case here is constructing space extent reference counts from
reverse mapping information.
Records can be put in the bag in any order, they can be removed from the bag
at any time, and uniqueness of records is left to callers.
The ``xfarray_store_anywhere`` function is used to insert a record in any
null record slot in the bag, and the ``xfarray_unset`` function removes a
record from the bag.

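A sketch of the bag idiom, assuming the caller keeps track of slot indices for
later removal:

.. code-block:: c

        /*
         * Hypothetical bag usage.  Uniqueness checking is left to the
         * caller; the bag only guarantees that the record lands in
         * some null slot.
         */
        static int
        xexample_bag_put(
                struct xfarray                  *array,
                const struct xexample_rec       *rec)
        {
                return xfarray_store_anywhere(array, rec);
        }

        /* Removing a record nulls out its slot for reuse. */
        static int
        xexample_bag_remove(
                struct xfarray          *array,
                xfarray_idx_t           idx)
        {
                return xfarray_unset(array, idx);
        }
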
The proposed patchset is the
`big in-memory array
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=big-array>`_.

Iterating Array Elements
^^^^^^^^^^^^^^^^^^^^^^^^

Most users of the xfarray require the ability to iterate the records stored in
the array.
Callers can probe every possible array index with the following:

.. code-block:: c

        xfarray_idx_t i;

        /* rec is a caller-provided buffer sized to hold one record. */
        foreach_xfarray_idx(array, i) {
            xfarray_load(array, i, &rec);

            /* do something with rec */
        }

All users of this idiom must be prepared to handle null records or must already
know that there aren't any.

For xfarray users that want to iterate a sparse array, the ``xfarray_iter``
function ignores indices in the xfarray that have never been written to by
calling ``xfile_seek_data`` (which internally uses ``SEEK_DATA``) to skip areas
of the array that are not populated with memory pages.
Once it finds a page, it will skip the zeroed areas of the page.

.. code-block:: c

        xfarray_idx_t i = XFARRAY_CURSOR_INIT;
        int ret;

        while ((ret = xfarray_iter(array, &i, &rec)) == 1) {
            /* do something with rec */
        }

.. _xfarray_sort:

Sorting Array Elements
^^^^^^^^^^^^^^^^^^^^^^

During the fourth demonstration of online repair, a community reviewer remarked
that for performance reasons, online repair ought to load batches of records
into btree record blocks instead of inserting records into a new btree one at a
time.
The btree insertion code in XFS is responsible for maintaining correct ordering
of the records, so naturally the xfarray must also support sorting the record
set prior to bulk loading.

Case Study: Sorting xfarrays
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The sorting algorithm used in the xfarray is actually a combination of adaptive
quicksort and a heapsort subalgorithm in the spirit of
`Sedgewick <https://algs4.cs.princeton.edu/23quicksort/>`_ and
`pdqsort <https://github.com/orlp/pdqsort>`_, with customizations for the Linux
kernel.
To sort records in a reasonably short amount of time, ``xfarray`` takes
advantage of the binary subpartitioning offered by quicksort, but it also uses
heapsort to hedge against performance collapse if the chosen quicksort pivots
are poor.
Both algorithms are (in general) O(n * lg(n)), but there is a wide performance
gulf between the two implementations.

The Linux kernel already contains a reasonably fast implementation of heapsort.
It only operates on regular C arrays, which limits the scope of its usefulness.
There are two key places where the xfarray uses it:

* Sorting any record subset backed by a single xfile page.

* Loading a small number of xfarray records from potentially disparate parts
  of the xfarray into a memory buffer, and sorting the buffer.

In other words, ``xfarray`` uses heapsort to constrain the nested recursion of
quicksort, thereby mitigating quicksort's worst runtime behavior.

Choosing a quicksort pivot is a tricky business.
A good pivot splits the set to sort in half, leading to the divide and conquer
behavior that is crucial to O(n * lg(n)) performance.
A poor pivot barely splits the subset at all, leading to O(n\ :sup:`2`)
runtime.
The xfarray sort routine tries to avoid picking a bad pivot by sampling nine
records into a memory buffer and using the kernel heapsort to identify the
median of the nine.

Most modern quicksort implementations employ Tukey's "ninther" to select a
pivot from a classic C array.
Typical ninther implementations pick three unique triads of records, sort each
of the triads, and then sort the middle value of each triad to determine the
ninther value.
As stated previously, however, xfile accesses are not entirely cheap.
It turned out to be much more performant to read the nine elements into a
memory buffer, run the kernel's in-memory heapsort on the buffer, and choose
the 4th element of that buffer as the pivot.
Tukey's ninthers are described in J. W. Tukey, `The ninther, a technique for
low-effort robust (resistant) location in large samples`, in *Contributions to
Survey Sampling and Applied Statistics*, edited by H. David, (Academic Press,
1978), pp. 251–257.

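A sketch of this median-of-nine pivot selection, using the kernel's in-memory
heapsort on a sampled buffer, might look like the following; the helper and
the ``obj_size`` field are assumptions, and the subset is assumed to hold at
least nine records:

.. code-block:: c

        /*
         * Hypothetical sketch: sample nine evenly spaced records from
         * the subset being partitioned, heapsort the samples in
         * memory, and use the median (element 4 of 0-8) as the pivot.
         */
        static int
        xexample_pick_pivot(
                struct xfarray          *array,
                xfarray_idx_t           lo,
                xfarray_idx_t           hi,
                void                    *samples,  /* 9 * obj_size bytes */
                void                    *pivot,
                int                     (*cmp)(const void *, const void *))
        {
                xfarray_idx_t           step = (hi - lo) / 8;
                unsigned int            i;
                int                     error;

                for (i = 0; i < 9; i++) {
                        error = xfarray_load(array, lo + i * step,
                                        samples + i * array->obj_size);
                        if (error)
                                return error;
                }

                /* The kernel's sort() is a heapsort over a C array. */
                sort(samples, 9, array->obj_size, cmp, NULL);
                memcpy(pivot, samples + 4 * array->obj_size,
                                array->obj_size);
                return 0;
        }
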
The partitioning of quicksort is fairly textbook -- rearrange the record
subset around the pivot, then set up the current and next stack frames to
sort with the larger and the smaller halves of the pivot, respectively.
This keeps the stack space requirements to log2(record count).

As a final performance optimization, the hi and lo scanning phase of quicksort
keeps examined xfile pages mapped in the kernel for as long as possible to
reduce map/unmap cycles.
Surprisingly, this reduces overall sort runtime by nearly half again after
accounting for the application of heapsort directly onto xfile pages.

Blob Storage
````````````

Extended attributes and directories add an additional requirement for staging
records: arbitrary byte sequences of finite length.
Each directory entry record needs to store the entry name,
and each extended attribute needs to store both the attribute name and value.
The names, keys, and values can consume a large amount of memory, so the
``xfblob`` abstraction was created to simplify management of these blobs
atop an xfile.

Blob arrays provide ``xfblob_load`` and ``xfblob_store`` functions to retrieve
and persist objects.
The store function returns a magic cookie for every object that it persists.
Later, callers provide this cookie to ``xfblob_load`` to recall the object.
The ``xfblob_free`` function frees a specific blob, and the ``xfblob_truncate``
function frees them all because compaction is not needed.

The details of repairing directories and extended attributes will be discussed
in a subsequent section about atomic extent swapping.
However, it should be noted that these repair functions only use blob storage
to cache a small number of entries before adding them to a temporary ondisk
file, which is why compaction is not required.

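The cookie idiom might look like this sketch, with signatures assumed from the
description above:

.. code-block:: c

        /*
         * Hypothetical: stash a directory entry name in the blob store
         * and recall it later with the cookie returned at store time.
         */
        static int
        xexample_stash_name(
                struct xfblob           *blobs,
                const char              *name,
                uint32_t                namelen,
                xfblob_cookie           *cookie)
        {
                /* The cookie is the only handle to the stored bytes. */
                return xfblob_store(blobs, cookie, name, namelen);
        }

        static int
        xexample_recall_name(
                struct xfblob           *blobs,
                xfblob_cookie           cookie,
                char                    *name,
                uint32_t                namelen)
        {
                return xfblob_load(blobs, cookie, name, namelen);
        }
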
The proposed patchset is at the start of the
`extended attribute repair
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-xattrs>`_ series.

.. _xfbtree:

In-Memory B+Trees
`````````````````

The chapter about :ref:`secondary metadata<secondary_metadata>` mentioned that
checking and repairing of secondary metadata commonly requires coordination
between a live metadata scan of the filesystem and writer threads that are
updating that metadata.
Keeping the scan data up to date requires the ability to propagate
metadata updates from the filesystem into the data being collected by the scan.
This *can* be done by appending concurrent updates into a separate log file and
applying them before writing the new metadata to disk, but this leads to
unbounded memory consumption if the rest of the system is very busy.
Another option is to skip the side-log and commit live updates from the
filesystem directly into the scan data, which trades more overhead for a lower
maximum memory requirement.
In both cases, the data structure holding the scan results must support indexed
access to perform well.

Given that indexed lookups of scan data are required for both strategies,
online fsck employs the second strategy of committing live updates directly
into scan data.
Because xfarrays are not indexed and do not enforce record ordering, they
are not suitable for this task.
Conveniently, however, XFS has a library to create and maintain ordered reverse
mapping records: the existing rmap btree code!
If only there were a means to create one in memory.

Recall that the :ref:`xfile <xfile>` abstraction represents memory pages as a
regular file, which means that the kernel can create byte or block addressable
virtual address spaces at will.
The XFS buffer cache specializes in abstracting IO to block-oriented address
spaces, which means that adaptation of the buffer cache to interface with
xfiles enables reuse of the entire btree library.
Btrees built atop an xfile are collectively known as ``xfbtrees``.
The next few sections describe how they actually work.

The proposed patchset is the
`in-memory btree
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=in-memory-btrees>`_
series.

Using xfiles as a Buffer Cache Target
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Two modifications are necessary to support xfiles as a buffer cache target.
The first is to make it possible for the ``struct xfs_buftarg`` structure to
host the ``struct xfs_buf`` rhashtable, because normally those are held by a
per-AG structure.
The second change is to modify the buffer ``ioapply`` function to "read" cached
pages from the xfile and "write" cached pages back to the xfile.
Concurrent access to individual buffers is controlled by the ``xfs_buf`` lock,
since the xfile does not provide any locking on its own.
With this adaptation in place, users of the xfile-backed buffer cache use
exactly the same APIs as users of the disk-backed buffer cache.
The separation between xfile and buffer cache implies higher memory usage since
they do not share pages, but this property could some day enable transactional
updates to an in-memory btree.
Today, however, it simply eliminates the need for new code.

Space Management with an xfbtree
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Space management for an xfile is very simple -- each btree block is one memory
page in size.
These blocks use the same header format as an on-disk btree, but the in-memory
block verifiers ignore the checksums, assuming that xfile memory is no more
corruption-prone than regular DRAM.
Reusing existing code here is more important than absolute memory efficiency.

The very first block of an xfile backing an xfbtree contains a header block.
The header describes the owner, height, and the block number of the root
xfbtree block.

To allocate a btree block, use ``xfile_seek_data`` to find a gap in the file.
If there are no gaps, create one by extending the length of the xfile.
Preallocate space for the block with ``xfile_prealloc``, and hand back the
location.
To free an xfbtree block, use ``xfile_discard`` (which internally uses
``FALLOC_FL_PUNCH_HOLE``) to remove the memory page from the xfile.

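A sketch of that allocation strategy follows; the helper is hypothetical,
``xfile_seek_data`` is assumed to return the offset of the next memory-backed
page (or a negative errno past EOF), and each block is one page:

.. code-block:: c

        /* Hypothetical xfbtree block allocator. */
        static int
        xexample_alloc_block(
                struct xfile            *xf,
                loff_t                  *pos)
        {
                loff_t                  off = PAGE_SIZE; /* skip header */
                int                     error;

                /* Scan for the first hole, one page at a time. */
                while (xfile_seek_data(xf, off) == off)
                        off += PAGE_SIZE;

                /* Attach memory to the file at this offset. */
                error = xfile_prealloc(xf, off, PAGE_SIZE);
                if (error)
                        return error;

                *pos = off;
                return 0;
        }
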
Populating an xfbtree
^^^^^^^^^^^^^^^^^^^^^

An online fsck function that wants to create an xfbtree should proceed as
follows:

1. Call ``xfile_create`` to create an xfile.

2. Call ``xfs_alloc_memory_buftarg`` to create a buffer cache target structure
   pointing to the xfile.

3. Pass the buffer cache target, buffer ops, and other information to
   ``xfbtree_create`` to write an initial tree header and root block to the
   xfile.
   Each btree type should define a wrapper that passes necessary arguments to
   the creation function.
   For example, rmap btrees define ``xfs_rmapbt_mem_create`` to take care of
   all the necessary details for callers.
   A ``struct xfbtree`` object will be returned.

4. Pass the xfbtree object to the btree cursor creation function for the
   btree type.
   Following the example above, ``xfs_rmapbt_mem_cursor`` takes care of this
   for callers.

5. Pass the btree cursor to the regular btree functions to make queries against
   and to update the in-memory btree.
   For example, a btree cursor for an rmap xfbtree can be passed to the
   ``xfs_rmap_*`` functions just like any other btree cursor.
   See the :ref:`next section<xfbtree_commit>` for information on dealing with
   xfbtree updates that are logged to a transaction.

6. When finished, delete the btree cursor, destroy the xfbtree object, free the
   buffer target, and destroy the xfile to release all resources.
   A sketch of the full lifecycle appears below.

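The sketch below strings the six steps together for an in-memory rmap btree.
The function names for steps 2-4 come from the steps above; the exact
signatures and the teardown helpers are assumptions, and error unwinding is
omitted for brevity:

.. code-block:: c

        /* Hypothetical lifecycle of an rmap xfbtree. */
        static int
        xexample_rmap_xfbtree(
                struct xfs_mount        *mp)
        {
                struct xfs_buftarg      *btp;
                struct xfbtree          *xfbt;
                struct xfs_btree_cur    *cur;
                struct xfile            *xf;
                int                     error;

                /* Steps 1-2: create the xfile and its buffer target. */
                error = xfile_create("rmap staging", 0, &xf);
                if (error)
                        return error;
                error = xfs_alloc_memory_buftarg(mp, xf, &btp);

                /* Step 3: write the tree header and root block. */
                error = xfs_rmapbt_mem_create(mp, btp, &xfbt);

                /* Step 4: create a btree cursor for the xfbtree. */
                cur = xfs_rmapbt_mem_cursor(mp, NULL, xfbt);

                /* Step 5: make xfs_rmap_* calls against cur here. */

                /* Step 6: tear everything down (helpers assumed). */
                xfs_btree_del_cursor(cur, error);
                xfbtree_destroy(xfbt);
                xfs_free_buftarg(btp);
                xfile_destroy(xf);
                return error;
        }
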
.. _xfbtree_commit:

Committing Logged xfbtree Buffers
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Although it is a clever hack to reuse the rmap btree code to handle the staging
structure, the ephemeral nature of the in-memory btree block storage presents
some challenges of its own.
The XFS transaction manager must not commit buffer log items for buffers backed
by an xfile because the log format does not understand updates for devices
other than the data device.
An ephemeral xfbtree probably will not exist by the time the AIL checkpoints
log transactions back into the filesystem, and certainly won't exist during
log recovery.
For these reasons, any code updating an xfbtree in transaction context must
remove the buffer log items from the transaction and write the updates into the
backing xfile before committing or cancelling the transaction.

The ``xfbtree_trans_commit`` and ``xfbtree_trans_cancel`` functions implement
this functionality as follows:

1. Find each buffer log item whose buffer targets the xfile.

2. Record the dirty/ordered status of the log item.

3. Detach the log item from the buffer.

4. Queue the buffer to a special delwri list.

5. Clear the transaction dirty flag if the only dirty log items were the ones
   that were detached in step 3.

6. Submit the delwri list to commit the changes to the xfile, if the updates
   are being committed.

After removing xfile logged buffers from the transaction in this manner, the
transaction can be committed or cancelled.

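In transaction context, then, an update cycle might look like the following
sketch; the btree update calls are elided and the exact signatures are
assumptions:

.. code-block:: c

        /*
         * Hypothetical: update an rmap xfbtree from transaction
         * context, then flush the logged buffers to the xfile before
         * committing the transaction.
         */
        static int
        xexample_commit_update(
                struct xfs_trans        *tp,
                struct xfbtree          *xfbt,
                struct xfs_btree_cur    *cur)
        {
                int                     error;

                /* ... xfs_rmap_* calls against cur dirty buffers in tp ... */

                /* Steps 1-6 above: detach log items, write to the xfile. */
                error = xfbtree_trans_commit(xfbt, tp);
                if (error) {
                        xfs_trans_cancel(tp);
                        return error;
                }

                return xfs_trans_commit(tp);
        }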