// SPDX-License-Identifier: GPL-2.0-only
/* Miscellaneous routines.
 *
 * Copyright (C) 2023 Red Hat, Inc. All Rights Reserved.
 * Written by David Howells (dhowells@redhat.com)
 */

#include <linux/swap.h>
#include "internal.h"

/*
 * Make sure there's space in the rolling queue.
 *
 * netfs: Fix write oops in generic/346 (9p) and generic/074 (cifs)
 *
 * In netfslib, a buffered writeback operation has a 'write queue' of folios
 * that are being written, held in a linear sequence of folio_queue structs.
 * The 'issuer' adds new folio_queues on the leading edge of the queue and
 * populates each one progressively; the 'collector' pops them off the
 * trailing edge and discards them and the folios they point to as they are
 * consumed.
 *
 * The queue is required to always retain at least one folio_queue structure.
 * This allows the queue to be accessed without locking and with just a bit of
 * barriering.
 *
 * When a new subrequest is prepared, its ->io_iter iterator is pointed at the
 * current end of the write queue and then the iterator is extended as more
 * data is added to the queue until the subrequest is committed.
 *
 * Now, the problem is that the folio_queue at the leading edge of the write
 * queue when a subrequest is prepared might have been entirely consumed - but
 * not yet removed from the queue as it is the only remaining one and is
 * preventing the queue from collapsing.
 *
 * So, what happens is that subreq->io_iter is pointed at the spent
 * folio_queue, then a new folio_queue is added, and, at that point, the
 * collector is entirely at liberty to immediately delete the spent
 * folio_queue.
 *
 * This leaves the subreq->io_iter pointing at a freed object. If the system
 * is lucky, iterate_folioq() sees ->io_iter, sees the as-yet uncorrupted
 * freed object and advances to the next folio_queue in the queue.
 *
 * In the case seen, however, the freed object gets recycled and put back onto
 * the queue at the tail and filled to the end. This confuses
 * iterate_folioq() and it tries to step ->next, which may be NULL - resulting
 * in an oops.
 *
 * Fix this by the following means:
 *
 *  (1) When preparing a write subrequest, make sure there's a folio_queue
 *      struct with space in it at the leading edge of the queue. A function
 *      to make space is split out of the function to append a folio so that
 *      it can be called for this purpose.
 *
 *  (2) If the request struct iterator is pointing to a completely spent
 *      folio_queue when we make space, then advance the iterator to the newly
 *      allocated folio_queue. The subrequest's iterator will then be set
 *      from this.
 *
 * The oops could be triggered using the generic/346 xfstest with a filesystem
 * on 9P over TCP with cache=loose. The oops looked something like:
 *
 *   BUG: kernel NULL pointer dereference, address: 0000000000000008
 *   #PF: supervisor read access in kernel mode
 *   #PF: error_code(0x0000) - not-present page
 *   ...
 *   RIP: 0010:_copy_from_iter+0x2db/0x530
 *   ...
 *   Call Trace:
 *    <TASK>
 *   ...
 *    p9pdu_vwritef+0x3d8/0x5d0
 *    p9_client_prepare_req+0xa8/0x140
 *    p9_client_rpc+0x81/0x280
 *    p9_client_write+0xcf/0x1c0
 *    v9fs_issue_write+0x87/0xc0
 *    netfs_advance_write+0xa0/0xb0
 *    netfs_write_folio.isra.0+0x42d/0x500
 *    netfs_writepages+0x15a/0x1f0
 *    do_writepages+0xd1/0x220
 *    filemap_fdatawrite_wbc+0x5c/0x80
 *    v9fs_mmap_vm_close+0x7d/0xb0
 *    remove_vma+0x35/0x70
 *    vms_complete_munmap_vmas+0x11a/0x170
 *    do_vmi_align_munmap+0x17d/0x1c0
 *    do_vmi_munmap+0x13e/0x150
 *    __vm_munmap+0x92/0xd0
 *    __x64_sys_munmap+0x17/0x20
 *    do_syscall_64+0x80/0xe0
 *    entry_SYSCALL_64_after_hwframe+0x71/0x79
 *
 * This also fixed a similar-looking issue with cifs and generic/074.
 *
 * Fixes: cd0277ed0c18 ("netfs: Use new folio_queue data type and iterator instead of xarray iter")
 * Reported-by: kernel test robot <oliver.sang@intel.com>
 * Closes: https://lore.kernel.org/oe-lkp/202409180928.f20b5a08-oliver.sang@intel.com
 * Closes: https://lore.kernel.org/oe-lkp/202409131438.3f225fbf-oliver.sang@intel.com
 * Signed-off-by: David Howells <dhowells@redhat.com>
 * Tested-by: kernel test robot <oliver.sang@intel.com>
 * cc: Eric Van Hensbergen <ericvh@kernel.org>
 * cc: Latchesar Ionkov <lucho@ionkov.net>
 * cc: Dominique Martinet <asmadeus@codewreck.org>
 * cc: Christian Schoenebeck <linux_oss@crudebyte.com>
 * cc: Paulo Alcantara <pc@manguebit.com>
 * cc: Jeff Layton <jlayton@kernel.org>
 * cc: v9fs@lists.linux.dev
 * cc: linux-cifs@vger.kernel.org
 * cc: netfs@lists.linux.dev
 * cc: linux-fsdevel@vger.kernel.org
 * Signed-off-by: Steve French <stfrench@microsoft.com>
 */
struct folio_queue *netfs_buffer_make_space(struct netfs_io_request *rreq)
{
	struct folio_queue *tail = rreq->buffer_tail, *prev;
	unsigned int prev_nr_slots = 0;

	if (WARN_ON_ONCE(!rreq->buffer && tail) ||
	    WARN_ON_ONCE(rreq->buffer && !tail))
		return ERR_PTR(-EIO);

	prev = tail;
	if (prev) {
		if (!folioq_full(tail))
			return tail;
		prev_nr_slots = folioq_nr_slots(tail);
	}

	tail = kmalloc(sizeof(*tail), GFP_NOFS);
	if (!tail)
		return ERR_PTR(-ENOMEM);
	netfs_stat(&netfs_n_folioq);
	folioq_init(tail);
	tail->prev = prev;
	if (prev)
		/* [!] NOTE: After we set prev->next, the consumer is entirely
		 * at liberty to delete prev.
		 */
		WRITE_ONCE(prev->next, tail);

	rreq->buffer_tail = tail;
	if (!rreq->buffer) {
		rreq->buffer = tail;
		iov_iter_folio_queue(&rreq->io_iter, ITER_SOURCE, tail, 0, 0, 0);
	} else {
		/* Make sure we don't leave the master iterator pointing to a
		 * block that might get immediately consumed.
		 */
		if (rreq->io_iter.folioq == prev &&
		    rreq->io_iter.folioq_slot == prev_nr_slots) {
			rreq->io_iter.folioq = tail;
			rreq->io_iter.folioq_slot = 0;
		}
	}
	rreq->buffer_tail_slot = 0;
	return tail;
}
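
/*
 * Illustrative userspace sketch, not part of netfslib: a minimal model of
 * the rolling buffer handled by netfs_buffer_make_space() above.  The names
 * demo_segment, demo_iter, demo_queue, DEMO_NR_SLOTS and
 * demo_queue_make_space() are hypothetical, and calloc() stands in for
 * kmalloc() + folioq_init().  What it tries to show is the fix: when a new
 * segment is appended, a master iterator parked at the very end of the old,
 * full tail is moved onto the new segment, so that it never refers to a
 * segment the consumer may free as soon as ->next has been set.
 */
#include <assert.h>
#include <stdlib.h>

#define DEMO_NR_SLOTS 4

struct demo_segment {
	struct demo_segment *prev, *next;
	unsigned int nr;			/* Number of slots in use */
	void *slots[DEMO_NR_SLOTS];
};

struct demo_iter {
	struct demo_segment *seg;		/* Segment the iterator is in */
	unsigned int slot;			/* Next slot to consume */
};

struct demo_queue {
	struct demo_segment *head, *tail;
	struct demo_iter iter;			/* Master iterator */
};

/* Return a tail segment with at least one free slot, or NULL if out of memory. */
static struct demo_segment *demo_queue_make_space(struct demo_queue *q)
{
	struct demo_segment *prev = q->tail, *tail;
	unsigned int prev_nr = 0;

	if (prev) {
		if (prev->nr < DEMO_NR_SLOTS)
			return prev;		/* Current tail still has room */
		prev_nr = prev->nr;
	}

	tail = calloc(1, sizeof(*tail));
	if (!tail)
		return NULL;
	tail->prev = prev;
	if (prev)
		prev->next = tail;		/* From here on, prev may be consumed */
	q->tail = tail;

	if (!q->head) {
		/* First segment: point the master iterator at it. */
		q->head = tail;
		q->iter.seg = tail;
		q->iter.slot = 0;
	} else if (q->iter.seg == prev && q->iter.slot == prev_nr) {
		/* Iterator sits at the end of the spent segment: move it on. */
		q->iter.seg = tail;
		q->iter.slot = 0;
	}
	return tail;
}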

/*
 * Append a folio to the rolling queue.
 */
int netfs_buffer_append_folio(struct netfs_io_request *rreq, struct folio *folio,
			      bool needs_put)
{
	struct folio_queue *tail;
	unsigned int slot, order = folio_order(folio);

	tail = netfs_buffer_make_space(rreq);
	if (IS_ERR(tail))
		return PTR_ERR(tail);

	rreq->io_iter.count += PAGE_SIZE << order;

	slot = folioq_append(tail, folio);
	/* Store the counter after setting the slot. */
	smp_store_release(&rreq->buffer_tail_slot, slot);
	return 0;
}
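
/*
 * Illustrative userspace sketch, not netfslib code: the publication pattern
 * used by netfs_buffer_append_folio() above, rewritten with C11 atomics.  The
 * writer fills a slot first and only then publishes with a release store, so
 * a reader that loads the published value with an acquire load (assumed here
 * to be how the consumer side pairs with smp_store_release()) is guaranteed
 * to see the slot contents.  As a simplification the sketch publishes a slot
 * count, whereas the original stores the index of the appended slot.  All
 * demo_* names are hypothetical.
 */
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define DEMO_PUB_SLOTS 8

struct demo_pub_queue {
	void *slots[DEMO_PUB_SLOTS];
	atomic_uint nr_published;	/* Count of slots safe to read */
	unsigned int nr_filled;		/* Writer-private fill count */
};

/* Writer side: fill the next slot, then publish it.  Returns -1 if full. */
static int demo_pub_append(struct demo_pub_queue *q, void *item)
{
	if (q->nr_filled >= DEMO_PUB_SLOTS)
		return -1;
	q->slots[q->nr_filled++] = item;		/* 1: store the item */
	atomic_store_explicit(&q->nr_published, q->nr_filled,
			      memory_order_release);	/* 2: then publish it */
	return 0;
}

/* Reader side: only slots below the acquired count may be dereferenced. */
static void *demo_pub_get(struct demo_pub_queue *q, unsigned int slot)
{
	unsigned int nr = atomic_load_explicit(&q->nr_published,
					       memory_order_acquire);

	return slot < nr ? q->slots[slot] : NULL;
}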

/*
 * Delete the head of a rolling queue.
 */
struct folio_queue *netfs_delete_buffer_head(struct netfs_io_request *wreq)
{
	struct folio_queue *head = wreq->buffer, *next = head->next;

	if (next)
		next->prev = NULL;
	netfs_stat_d(&netfs_n_folioq);
	kfree(head);
	wreq->buffer = next;
	return next;
}
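
/*
 * Illustrative userspace sketch, not netfslib code, of the head-deletion step
 * performed by netfs_delete_buffer_head() above: unlink the oldest segment,
 * clear the new head's back-pointer so nothing refers to freed memory, free
 * the old head and return the new one (NULL once the queue is empty).  As in
 * the original, the caller must not call it on an empty queue.  The demo_*
 * names are hypothetical and free() stands in for kfree().
 */
#include <assert.h>
#include <stdlib.h>

struct demo_dseg {
	struct demo_dseg *prev, *next;
};

struct demo_dqueue {
	struct demo_dseg *head;
};

static struct demo_dseg *demo_delete_head(struct demo_dqueue *q)
{
	struct demo_dseg *head = q->head, *next = head->next;

	if (next)
		next->prev = NULL;	/* Drop the link back to the old head */
	free(head);
	q->head = next;
	return next;
}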

/*
 * Clear out a rolling queue.
 */
void netfs_clear_buffer(struct netfs_io_request *rreq)
{
	struct folio_queue *p;

	while ((p = rreq->buffer)) {
		rreq->buffer = p->next;
		for (int slot = 0; slot < folioq_count(p); slot++) {
			struct folio *folio = folioq_folio(p, slot);
			if (!folio)
				continue;
			if (folioq_is_marked(p, slot)) {
				trace_netfs_folio(folio, netfs_folio_trace_put);
				folio_put(folio);
			}
		}
		netfs_stat_d(&netfs_n_folioq);
		kfree(p);
	}
}
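
/*
 * Illustrative userspace sketch, not netfslib code, of the teardown done by
 * netfs_clear_buffer() above: walk the segments from the head, drop a
 * reference only on the occupied slots that are marked as holding one (the
 * role folioq_is_marked() plays in the original), and free each segment.  A
 * plain reference counter stands in for folio_put(); the demo_* names and the
 * bitmask used for the marks are assumptions of the sketch.
 */
#include <assert.h>
#include <stdlib.h>

#define DEMO_CSLOTS 4

struct demo_obj {
	int refs;
};

struct demo_cseg {
	struct demo_cseg *next;
	unsigned int nr;			/* Slots in use */
	unsigned int marks;			/* Bit n set: slot n owns a ref */
	struct demo_obj *slots[DEMO_CSLOTS];
};

static void demo_clear(struct demo_cseg **headp)
{
	struct demo_cseg *p;

	while ((p = *headp)) {
		*headp = p->next;
		for (unsigned int slot = 0; slot < p->nr; slot++) {
			struct demo_obj *obj = p->slots[slot];

			if (!obj)
				continue;	/* Empty slot: nothing to release */
			if (p->marks & (1u << slot))
				obj->refs--;	/* Stand-in for folio_put() */
		}
		free(p);
	}
}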

/*
 * Reset the subrequest iterator to refer just to the region remaining to be
 * read. The iterator may or may not have been advanced by socket ops or
 * extraction ops to an extent that may or may not match the amount actually
 * read.
 */
void netfs_reset_iter(struct netfs_io_subrequest *subreq)
{
	struct iov_iter *io_iter = &subreq->io_iter;
	size_t remain = subreq->len - subreq->transferred;

	if (io_iter->count > remain)
		iov_iter_advance(io_iter, io_iter->count - remain);
	else if (io_iter->count < remain)
		iov_iter_revert(io_iter, remain - io_iter->count);
	iov_iter_truncate(&subreq->io_iter, remain);
}
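
/*
 * Illustrative userspace sketch, not netfslib code, of the arithmetic in
 * netfs_reset_iter() above, using a toy iterator that is just a position and
 * a byte count.  Advancing moves the position forward and shrinks the count;
 * reverting does the opposite, so both keep pos + count fixed.  If that end
 * point coincides with the end of the subrequest (an assumption of this
 * model), the reset leaves the iterator at start + transferred with exactly
 * len - transferred bytes left, however far socket or extraction code had
 * moved it.  The demo_* names are hypothetical.
 */
#include <assert.h>
#include <stddef.h>

struct demo_xiter {
	size_t pos;		/* Offset of the next byte to transfer */
	size_t count;		/* Bytes left in the iterator */
};

static void demo_advance(struct demo_xiter *it, size_t n)
{
	it->pos += n;
	it->count -= n;
}

static void demo_revert(struct demo_xiter *it, size_t n)
{
	it->pos -= n;
	it->count += n;
}

static void demo_truncate(struct demo_xiter *it, size_t n)
{
	if (it->count > n)
		it->count = n;
}

/* Same decision logic as netfs_reset_iter(), applied to the toy iterator. */
static void demo_reset(struct demo_xiter *it, size_t len, size_t transferred)
{
	size_t remain = len - transferred;

	if (it->count > remain)
		demo_advance(it, it->count - remain);
	else if (it->count < remain)
		demo_revert(it, remain - it->count);
	demo_truncate(it, remain);
}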

/**
 * netfs_dirty_folio - Mark folio dirty and pin a cache object for writeback
 * @mapping: The mapping the folio belongs to.
 * @folio: The folio being dirtied.
 *
 * Set the dirty flag on a folio and pin an in-use cache object in memory so
 * that writeback can later write to it. This is intended to be called from
 * the filesystem's ->dirty_folio() method.
 *
 * Return: true if the dirty flag was set on the folio, false otherwise.
 */
bool netfs_dirty_folio(struct address_space *mapping, struct folio *folio)
{
	struct inode *inode = mapping->host;
	struct netfs_inode *ictx = netfs_inode(inode);
	struct fscache_cookie *cookie = netfs_i_cookie(ictx);
	bool need_use = false;

	_enter("");

	if (!filemap_dirty_folio(mapping, folio))
		return false;
	if (!fscache_cookie_valid(cookie))
		return true;

	if (!(inode->i_state & I_PINNING_NETFS_WB)) {
		spin_lock(&inode->i_lock);
		if (!(inode->i_state & I_PINNING_NETFS_WB)) {
			inode->i_state |= I_PINNING_NETFS_WB;
			need_use = true;
		}
		spin_unlock(&inode->i_lock);

		if (need_use)
			fscache_use_cookie(cookie, true);
	}
	return true;
}
EXPORT_SYMBOL(netfs_dirty_folio);
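
/*
 * Illustrative userspace sketch, not netfslib code, of the pinning logic in
 * netfs_dirty_folio() above: a cheap unlocked test of the "pinned" flag,
 * then a re-test under the lock, so that exactly one caller sets the flag and
 * takes the single cache-use reference (fscache_use_cookie() in the
 * original), and that reference is taken after the lock is dropped.  The
 * demo_* names are hypothetical; a C11 atomic_flag spinlock stands in for
 * inode->i_lock and an atomic bool for the I_PINNING_NETFS_WB bit, so the
 * unlocked read is well defined in C11.
 */
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct demo_inode {
	atomic_flag lock;		/* Stand-in for inode->i_lock */
	atomic_bool pinned;		/* Stand-in for I_PINNING_NETFS_WB */
	int cookie_users;		/* Stand-in for the cookie use count */
};

static void demo_lock(atomic_flag *l)
{
	while (atomic_flag_test_and_set_explicit(l, memory_order_acquire))
		;	/* Spin until the holder releases the lock */
}

static void demo_unlock(atomic_flag *l)
{
	atomic_flag_clear_explicit(l, memory_order_release);
}

static void demo_pin_for_writeback(struct demo_inode *inode)
{
	bool need_use = false;

	if (!atomic_load_explicit(&inode->pinned, memory_order_relaxed)) {
		demo_lock(&inode->lock);
		/* Re-check: another caller may have pinned it meanwhile. */
		if (!atomic_load_explicit(&inode->pinned, memory_order_relaxed)) {
			atomic_store_explicit(&inode->pinned, true,
					      memory_order_relaxed);
			need_use = true;
		}
		demo_unlock(&inode->lock);

		if (need_use)
			inode->cookie_users++;	/* fscache_use_cookie() */
	}
}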

/**
 * netfs_unpin_writeback - Unpin writeback resources
 * @inode: The inode on which the cookie resides
 * @wbc: The writeback control
 *
 * Unpin the writeback resources pinned by netfs_dirty_folio(). This is
 * intended to be called as/by the netfs's ->write_inode() method.
 */
int netfs_unpin_writeback(struct inode *inode, struct writeback_control *wbc)
{
	struct fscache_cookie *cookie = netfs_i_cookie(netfs_inode(inode));

	if (wbc->unpinned_netfs_wb)
		fscache_unuse_cookie(cookie, NULL, NULL);
	return 0;
}
EXPORT_SYMBOL(netfs_unpin_writeback);

/**
 * netfs_clear_inode_writeback - Clear writeback resources pinned by an inode
 * @inode: The inode to clean up
 * @aux: Auxiliary data to apply to the inode
 *
 * Clear any writeback resources held by an inode when the inode is evicted.
 * This must be called before clear_inode() is called.
 */
void netfs_clear_inode_writeback(struct inode *inode, const void *aux)
{
	struct fscache_cookie *cookie = netfs_i_cookie(netfs_inode(inode));

	if (inode->i_state & I_PINNING_NETFS_WB) {
		loff_t i_size = i_size_read(inode);
		fscache_unuse_cookie(cookie, aux, &i_size);
	}
}
EXPORT_SYMBOL(netfs_clear_inode_writeback);

/**
 * netfs_invalidate_folio - Invalidate or partially invalidate a folio
 * @folio: Folio proposed for release
 * @offset: Offset of the invalidated region
 * @length: Length of the invalidated region
 *
 * Invalidate part or all of a folio for a network filesystem. The folio will
 * be removed afterwards if the invalidated region covers the entire folio.
 */
void netfs_invalidate_folio(struct folio *folio, size_t offset, size_t length)
{
netfs: Replace PG_fscache by setting folio->private and marking dirty
When dirty data is being written to the cache, setting/waiting on/clearing
the fscache flag is always done in tandem with setting/waiting on/clearing
the writeback flag. The netfslib buffered write routines wait on and set
both flags and the write request cleanup clears both flags, so the fscache
flag is almost superfluous.
The reason it isn't superfluous is because the fscache flag is also used to
indicate that data just read from the server is being written to the cache.
The flag is used to prevent a race involving overlapping direct-I/O writes
to the cache.
Change this to indicate that a page is in need of being copied to the cache
by placing a magic value in folio->private and marking the folios dirty.
Then when the writeback code sees a folio marked in this way, it only
writes it to the cache and not to the server.
If a folio that has this magic value set is modified, the value is just
replaced and the folio will then be uploaded too.
With this, PG_fscache is no longer required by the netfslib core, 9p and
afs.
Ceph and nfs, however, still need to use the old PG_fscache-based tracking.
To deal with this, a flag, NETFS_ICTX_USE_PGPRIV2, now has to be set on the
flags in the netfs_inode struct for those filesystems. This reenables the
use of PG_fscache in that inode. 9p and afs use the netfslib write helpers
so get switched over; cifs, for the moment, does page-by-page manual access
to the cache, so doesn't use PG_fscache and is unaffected.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: Matthew Wilcox (Oracle) <willy@infradead.org>
cc: Eric Van Hensbergen <ericvh@kernel.org>
cc: Latchesar Ionkov <lucho@ionkov.net>
cc: Dominique Martinet <asmadeus@codewreck.org>
cc: Christian Schoenebeck <linux_oss@crudebyte.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Ilya Dryomov <idryomov@gmail.com>
cc: Xiubo Li <xiubli@redhat.com>
cc: Steve French <sfrench@samba.org>
cc: Paulo Alcantara <pc@manguebit.com>
cc: Ronnie Sahlberg <ronniesahlberg@gmail.com>
cc: Shyam Prasad N <sprasad@microsoft.com>
cc: Tom Talpey <tom@talpey.com>
cc: Bharath SM <bharathsm@microsoft.com>
cc: Trond Myklebust <trond.myklebust@hammerspace.com>
cc: Anna Schumaker <anna@kernel.org>
cc: netfs@lists.linux.dev
cc: v9fs@lists.linux.dev
cc: linux-afs@lists.infradead.org
cc: ceph-devel@vger.kernel.org
cc: linux-cifs@vger.kernel.org
cc: linux-nfs@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
2024-03-19 10:00:09 +00:00
	struct netfs_folio *finfo;
netfs: Fix trimming of streaming-write folios in netfs_inval_folio()
When netfslib writes to a folio that it doesn't have data for, but that
data exists on the server, it will make a 'streaming write' whereby it
stores data in a folio that is marked dirty, but not uptodate. When it
does this, it attaches a record to folio->private to track the dirty
region.
When truncate() or fallocate() wants to invalidate part of such a folio, it
will call into ->invalidate_folio(), specifying the part of the folio that
is to be invalidated. netfs_invalidate_folio(), on behalf of the
filesystem, must then determine how to trim the streaming write record. In
a couple of cases, however, it does this incorrectly (the reduce-length and
move-start cases are switched over and don't, in any case, calculate the
value correctly).
Fix this by making the logic tree more obvious and fixing the cases.
Fixes: 9ebff83e6481 ("netfs: Prep to use folio->private for write grouping and streaming write")
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20240823200819.532106-5-dhowells@redhat.com
cc: Matthew Wilcox (Oracle) <willy@infradead.org>
cc: Pankaj Raghav <p.raghav@samsung.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
cc: netfs@lists.linux.dev
cc: linux-mm@kvack.org
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-23 21:08:12 +01:00
	struct netfs_inode *ctx = netfs_inode(folio_inode(folio));
2023-09-29 17:28:25 +01:00
	size_t flen = folio_size(folio);

2024-07-18 21:07:32 +01:00
	_enter("{%lx},%zx,%zx", folio->index, offset, length);
2021-08-20 17:08:30 +01:00
2024-08-23 21:08:12 +01:00
	if (offset == 0 && length == flen) {
		unsigned long long i_size = i_size_read(&ctx->inode);
		unsigned long long fpos = folio_pos(folio), end;

		end = umin(fpos + flen, i_size);
		if (fpos < i_size && end > ctx->zero_point)
			ctx->zero_point = end;
	}

2024-08-14 21:38:21 +01:00
	folio_wait_private_2(folio); /* [DEPRECATED] */

2023-09-29 17:28:25 +01:00
	if (!folio_test_private(folio))
		return;

	finfo = netfs_folio_info(folio);

	if (offset == 0 && length >= flen)
		goto erase_completely;

	if (finfo) {
		/* We have a partially uptodate page from a streaming write. */
		unsigned int fstart = finfo->dirty_offset;
		unsigned int fend = fstart + finfo->dirty_len;
2024-08-23 21:08:12 +01:00
		unsigned int iend = offset + length;

2023-09-29 17:28:25 +01:00
		if (offset >= fend)
			return;
2024-08-23 21:08:12 +01:00
		if (iend <= fstart)
			return;

		/* The invalidation region overlaps the data. If the region
		 * covers the start of the data, we either move along the start
		 * or just erase the data entirely.
		 */
		if (offset <= fstart) {
			if (iend >= fend)
				goto erase_completely;
			/* Move the start of the data; what survives is [iend, fend). */
			finfo->dirty_len = fend - iend;
			finfo->dirty_offset = iend;
			return;
		}

		/* Reduce the length of the data if the invalidation region
		 * covers the tail part.
		 */
		if (iend >= fend) {
			finfo->dirty_len = offset - fstart;
2023-09-29 17:28:25 +01:00
			return;
2024-08-23 21:08:12 +01:00
		}

2023-09-29 17:28:25 +01:00
		/* A partial write was split. The caller has already zeroed
		 * it, so just absorb the hole.
		 */
	}
	return;

erase_completely:
	netfs_put_group(netfs_folio_group(folio));
	folio_detach_private(folio);
	folio_clear_uptodate(folio);
	kfree(finfo);
	return;
2021-08-20 17:08:30 +01:00
}
EXPORT_SYMBOL(netfs_invalidate_folio);
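The streaming-write trimming in netfs_invalidate_folio() reduces to interval arithmetic. It compares the dirty range [fstart, fend) against the invalidated range [offset, iend). The sketch below isolates that decision tree as a standalone C function so the cases can be checked in user space. struct model_finfo and enum model_trim are hypothetical stand-ins for struct netfs_folio and for the function's control flow. The whole-folio shortcut (offset == 0 && length >= flen) is left to the caller. In this sketch the head-trim case records the surviving bytes [iend, fend), that is, dirty_offset = iend.

```c
#include <stddef.h>

/* Hypothetical stand-in for the dirty-range fields of struct netfs_folio. */
struct model_finfo {
	unsigned int dirty_offset;
	unsigned int dirty_len;
};

enum model_trim {
	MODEL_UNTOUCHED,  /* invalidation misses the dirty range */
	MODEL_TRIMMED,    /* record adjusted in place */
	MODEL_ERASE,      /* whole record must be dropped */
	MODEL_HOLE,       /* hole punched in the middle; record kept as-is */
};

static enum model_trim model_trim(struct model_finfo *f, size_t offset,
				  size_t length)
{
	unsigned int fstart = f->dirty_offset;
	unsigned int fend = fstart + f->dirty_len;
	unsigned int iend = offset + length;

	/* No overlap between [offset, iend) and [fstart, fend). */
	if (offset >= fend || iend <= fstart)
		return MODEL_UNTOUCHED;

	if (offset <= fstart) {
		if (iend >= fend)
			return MODEL_ERASE;
		/* Head removed: surviving data is [iend, fend). */
		f->dirty_offset = iend;
		f->dirty_len = fend - iend;
		return MODEL_TRIMMED;
	}

	if (iend >= fend) {
		/* Tail removed: surviving data is [fstart, offset). */
		f->dirty_len = offset - fstart;
		return MODEL_TRIMMED;
	}

	/* Middle removed: the caller has zeroed the hole; keep the record. */
	return MODEL_HOLE;
}
```

For a dirty range [100, 300), invalidating [0, 150) leaves [150, 300). Invalidating [250, 350) leaves [100, 250).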

/**
 * netfs_release_folio - Try to release a folio
 * @folio: Folio proposed for release
 * @gfp: Flags qualifying the release
 *
 * Request release of a folio and clean up its private state if it's not busy.
 * Returns true if the folio can now be released, false if not.
 */
bool netfs_release_folio(struct folio *folio, gfp_t gfp)
{
	struct netfs_inode *ctx = netfs_inode(folio_inode(folio));
netfs: Optimise away reads above the point at which there can be no data
Track the file position above which the server is not expected to have any
data (the "zero point") and preemptively assume that we can satisfy
requests by filling them with zeroes locally rather than attempting to
download them if they're over that line - even if we've written data back
to the server. Assume that any data that was written back above that
position is held in the local cache. Note that we have to split requests
that straddle the line.
Make use of this to optimise away some reads from the server. We need to
set the zero point in the following circumstances:
(1) When we see an extant remote inode and have no cache for it, we set
the zero_point to i_size.
(2) On local inode creation, we set zero_point to 0.
(3) On local truncation down, we reduce zero_point to the new i_size if
the new i_size is lower.
(4) On local truncation up, we don't change zero_point.
(5) On local modification, we don't change zero_point.
(6) On remote invalidation, we set zero_point to the new i_size.
(7) If stored data is discarded from the pagecache or culled from fscache,
we must set zero_point above that if the data also got written to the
server.
(8) If dirty data is written back to the server, but not fscache, we must
set zero_point above that.
(9) If a direct I/O write is made, set zero_point above that.
Assuming the above, any read from the server at or above the zero_point
position will return all zeroes.
The zero_point value can be stored in the cache, provided the above rules
are applied to it by any code that culls part of the local cache.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
2023-11-24 13:39:02 +00:00
	unsigned long long end;

2024-08-23 21:08:11 +01:00
	if (folio_test_dirty(folio))
		return false;

2024-08-23 21:08:12 +01:00
	end = umin(folio_pos(folio) + folio_size(folio), i_size_read(&ctx->inode));
2023-11-24 13:39:02 +00:00
	if (end > ctx->zero_point)
		ctx->zero_point = end;

2021-08-20 17:08:30 +01:00
	if (folio_test_private(folio))
		return false;
2024-08-14 21:38:21 +01:00
	if (unlikely(folio_test_private_2(folio))) { /* [DEPRECATED] */
		if (current_is_kswapd() || !(gfp & __GFP_FS))
			return false;
		folio_wait_private_2(folio);
	}
2021-08-20 17:08:30 +01:00
	fscache_note_page_release(netfs_i_cookie(ctx));
	return true;
}
EXPORT_SYMBOL(netfs_release_folio);
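Two places update zero_point: netfs_release_folio() and the whole-folio branch of netfs_invalidate_folio(). Both follow rule (7) of the "Optimise away reads" commit text: when clean data leaves the pagecache, the zero point is raised to the end of that data, capped at i_size, and never lowered. The invalidate path additionally skips folios that start at or beyond i_size. The sketch below models those two updates and the read-side test from the same commit text: a read at or above zero_point can be satisfied with zeroes. The model_* functions are hypothetical and take plain integers in place of the folio and netfs_inode.

```c
#include <stdbool.h>

/* Hypothetical helpers; *zero_point stands in for netfs_inode::zero_point. */
static unsigned long long model_umin(unsigned long long a, unsigned long long b)
{
	return a < b ? a : b;
}

/* Mirrors the update in netfs_release_folio(). */
static void model_release_bump(unsigned long long *zero_point,
			       unsigned long long fpos, unsigned long long flen,
			       unsigned long long i_size)
{
	unsigned long long end = model_umin(fpos + flen, i_size);

	if (end > *zero_point)
		*zero_point = end;
}

/* Mirrors the whole-folio branch of netfs_invalidate_folio(). */
static void model_full_invalidate_bump(unsigned long long *zero_point,
				       unsigned long long fpos,
				       unsigned long long flen,
				       unsigned long long i_size)
{
	unsigned long long end = model_umin(fpos + flen, i_size);

	if (fpos < i_size && end > *zero_point)
		*zero_point = end;
}

/* Reads at or above zero_point can be filled with zeroes locally. */
static bool model_read_needs_server(unsigned long long pos,
				    unsigned long long zero_point)
{
	return pos < zero_point;
}
```

The cap at i_size keeps the zero point from running past the end of the file. The monotonic check means that dropping an earlier folio never pulls it back down.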