Overhaul the third party-induced invalidation handling, making use of the
previously added volume-level event counters (cb_scrub and cb_ro_snapshot)
that are now being parsed out of the VolSync record returned by the
fileserver in many of its replies.
This allows better handling of RO (and Backup) volumes. Since these are
snapshots of a RW volume that are updated atomically and simultaneously
across all servers that host them, they only require a single callback
promise for the entire volume. The current upstream code assumes that RO volumes
operate in the same manner as RW volumes, and that each file has its own
individual callback - which means that it does a status fetch for *every*
file in a RO volume, whether or not the volume got "released" (volume
callback breaks can occur for other reasons too, such as the volumeserver
taking ownership of a volume from a fileserver).
To this end, make the following changes:
(1) Change the meaning of the volume's cb_v_break counter so that it is
now a hint that we need to issue a status fetch to work out the state
of a volume. cb_v_break is incremented by volume break callbacks and
by server initialisation callbacks.
(2) Add a second counter, cb_v_check, to the afs_volume struct such that
if this differs from cb_v_break, we need to do a check. When the
check is complete, cb_v_check is advanced to what cb_v_break was at
the start of the status fetch.
(3) Move the list of mmap'd vnodes to the volume and trigger removal of
PTEs that map to files on a volume break rather than on a server
break.
(4) When a server reinitialisation callback comes in, use the
server-to-volume reverse mapping added in a preceding patch to iterate
over all the volumes using that server and clear the volume callback
promises for that server and the general volume promise as a whole to
trigger reanalysis.
(5) Replace the AFS_VNODE_CB_PROMISED flag with an AFS_NO_CB_PROMISE
(TIME64_MIN) value in the cb_expires_at field, reducing the number of
checks we need to make.
(6) Change afs_check_validity() to quickly see if various event counters
have been incremented or if the vnode or volume callback promise is
due to expire/has expired without making any changes to the state.
That is now left to afs_validate() as this may get more complicated in
future as we may have to examine server records too.
(7) Overhaul afs_validate() so that it does a single status fetch if we
need to check the state of either the vnode or the volume - and do so
under appropriate locking. The function does the following steps:
(A) If the vnode/volume is no longer seen as valid, then we take the
vnode validation lock and, if the volume promise has expired, the
volume check lock also. The latter prevents redundant checks being
made to find out if a new version of the volume got released.
(B) If a previous RPC call found that the volsync changed unexpectedly
or that a RO volume was updated, then we unmap all PTEs pointing to
the file to stop mmap being used for access.
(C) If the vnode is still seen to be of uncertain validity, then we
perform an FS.FetchStatus RPC op to jointly update the volume status
and the vnode status. This assessment is done as part of parsing the
reply:
If the RO volume creation timestamp advances, cb_ro_snapshot is
incremented; if either the creation or update timestamp changes in
an unexpected way, the cb_scrub counter is incremented.
If the Data Version returned doesn't match the copy we have
locally, then we ask for the pagecache to be zapped. This takes
care of handling RO updates.
(D) If cb_scrub differs between volume and vnode, the vnode's
pagecache is zapped and the vnode's cb_scrub is updated unless the
file is marked as having been deleted.
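To illustrate how the counters from (1) and (2) and the AFS_NO_CB_PROMISE
value from (5) combine into the quick check of (6), here is a minimal sketch
(the field names and types are illustrative assumptions, not the exact
implementation):

    /* Sketch: does this vnode still look valid, without changing any state? */
    static bool afs_check_validity_sketch(const struct afs_vnode *vnode)
    {
            const struct afs_volume *volume = vnode->volume;

            /* A volume callback break or a server reinit bumped cb_v_break,
             * but we haven't caught up with it yet.
             */
            if (atomic_read(&volume->cb_v_check) !=
                atomic_read(&volume->cb_v_break))
                    return false;

            /* AFS_NO_CB_PROMISE is TIME64_MIN, so "no promise" and "promise
             * expired" collapse into a single comparison.
             */
            if (atomic64_read(&vnode->cb_expires_at) <= ktime_get_real_seconds())
                    return false;

            return true;
    }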
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
A number of fileserver RPC operations return a VolSync record as part of
their reply that gives some information about the state of the volume being
accessed, including:
(1) A volume Creation timestamp. For an RW volume, this is the time at
which the volume was created; if it changes, the RW volume was
presumably restored from a backup and all cached data should be
scrubbed as Data Version numbers could regress on the files in the
volume.
For an RO volume, this is the time it was last snapshotted from the RW
volume. It is expected to advance each time this happens; if it
regresses, cached data should be scrubbed.
(2) A volume Update timestamp (Auristor only). For an RW volume, this is
updated any time any change is made to a volume or its contents. If
it regresses, all cached data must be scrubbed.
For an RO volume, this is a copy of the RW volume's Update timestamp
at the point of snapshotting. It can be used as a version number when
checking to see if a callback on a RO volume was due to a snapshot.
If it regresses, all cached data must be scrubbed.
but this is currently not made use of by the in-kernel afs filesystem.
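For reference, the record as tracked on the client might boil down to
something like this sketch (the update field is the addition described in (1)
below; both fields start out as TIME64_MIN until a reply has been seen):

    struct afs_volsync {
            time64_t    creation;   /* Volume creation timestamp */
            time64_t    update;     /* Volume update timestamp (AuriStor only) */
    };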
Make the afs filesystem use this by:
(1) Add an update time field to the afs_volsync struct and use a value of
TIME64_MIN in both that and the creation time to indicate that they
are unset.
(2) Add creation and update time fields to the afs_volume struct and use
this to track the two timestamps.
(3) Add a volsync_lock mutex to the afs_volume struct to control
modification access for when we detect a change in these values.
(4) Add a 'pre-op volsync' struct to the afs_operation struct to record
the state of the volume tracking before the op.
(5) Add a new counter, cb_scrub, to the afs_volume struct to count events
that require all data to be scrubbed. A copy is placed in the
afs_vnode struct (inode) and if they no longer match, a scrub takes
place.
(6) When the result of an operation is being parsed, parse the VolSync
data too, if it is provided. Note that the two timestamps are handled
separately, since they don't work in quite the same way.
- If the afs_volume tracking is unset, just set it and do nothing
else.
- If the result timestamps are the same as the ones in afs_volume, do
nothing.
- If the timestamps regress, increment cb_scrub if we haven't already
done so.
- If the creation timestamp on a RW volume changes, increment cb_scrub
if we haven't already done so.
- If the creation timestamp on a RO volume advances, update the server
list and see if the current server has been excluded, if so reissue
the op. Once over half of the replication sites have been updated,
increment cb_ro_snapshot to indicate updates may be required and
switch over to excluding unupdated replication sites.
- If the creation timestamp on a Backup volume advances, just
increment cb_ro_snapshot to trigger updates.
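A condensed sketch of the creation-timestamp handling described above
(simplified: the 'over half of the replication sites' logic for RO volumes is
reduced to a comment, and the names are assumptions):

    /* Sketch: called with volume->volsync_lock held. */
    static void afs_sketch_note_creation(struct afs_volume *volume,
                                         time64_t creation, bool is_rw)
    {
            if (volume->creation_time == TIME64_MIN) {
                    volume->creation_time = creation;   /* first sighting */
                    return;
            }
            if (creation == volume->creation_time)
                    return;                             /* nothing changed */

            if (is_rw || creation < volume->creation_time) {
                    /* RW volume restored from backup, or RO snapshot
                     * regressed: everything we have cached is suspect.
                     */
                    atomic_inc(&volume->cb_scrub);
            } else {
                    /* RO/Backup snapshot advanced.  (The real logic also
                     * refreshes the server list and waits for over half of
                     * the replication sites to update before doing this.)
                     */
                    atomic_inc(&volume->cb_ro_snapshot);
            }
            volume->creation_time = creation;
    }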
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Don't leave servers that are marked VLSF_DONTUSE or VLSF_NEWREPSITE out of
the server list for a volume; rather, mark the DONTUSE ones excluded and then
mark either the NEWREPSITE ones excluded (if fewer than 50% of the usable
servers have been updated) or the !NEWREPSITE ones excluded (otherwise).
Mark the server list as a whole with a 3-state flag to indicate whether we
think the RW volume is being replicated to the RO volume, and, if so,
whether we should switch to using updated replication sites
(VLSF_NEWREPSITE) or stick with the old for now.
This processing is pushed up from the VLDB RPC reply parser to the code
that generates the server list from that information.
Doing this allows the old list to be kept with just the exclusion flags
replaced and to keep the server records pinned and maintained.
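A minimal sketch of the exclusion marking (the afs_server_entry layout, the
vlflags member and the excluded flag are assumptions for illustration):

    static void afs_sketch_mark_exclusions(struct afs_server_list *slist)
    {
            unsigned int i, usable = 0, updated = 0;
            bool use_new;

            for (i = 0; i < slist->nr_servers; i++) {
                    struct afs_server_entry *se = &slist->servers[i];

                    se->excluded = se->vlflags & VLSF_DONTUSE;
                    if (se->excluded)
                            continue;
                    usable++;
                    if (se->vlflags & VLSF_NEWREPSITE)
                            updated++;
            }

            /* Switch to the updated replication sites only once more than
             * half of the usable servers carry the new snapshot.
             */
            use_new = updated * 2 > usable;

            for (i = 0; i < slist->nr_servers; i++) {
                    struct afs_server_entry *se = &slist->servers[i];

                    if (se->excluded)
                            continue;
                    se->excluded = use_new ?
                            !(se->vlflags & VLSF_NEWREPSITE) :
                             (se->vlflags & VLSF_NEWREPSITE);
            }
    }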
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Fix the comment in afs_do_lookup() that says that slot 0 is used for the
fid being looked up and slot 1 is used for the directory. It's actually
done the other way round.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Apply server breaks to mmap'd files that are being used from that server
directly from the call processor work function rather than punting them off
to a workqueue. The work item, afs_server_init_callback(), currently bumps
each individual inode off to its own work item, introducing a potentially
lengthy delay. This reduces that delay at the cost of extending the time we
take to reply to the CB.InitCallBack3 notification RPC from the server.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Move the code that does validity checking of vnodes and volumes with
respect to third-party changes into its own file.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Defer volume record destruction to a workqueue so that afs_put_volume()
doesn't run the destruction process in the callback workqueue while the
server is holding up other clients, waiting for us to reply to a
CB.CallBack notification RPC.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Make it possible to find the afs_volume structs that are using an
afs_server struct to aid in breaking volume callbacks.
The way this is done is that each afs_volume already has an array of
afs_server_entry records that point to the servers where that volume might
be found. An afs_volume backpointer and a list node is added to each entry
and each entry is then added to an RCU-traversable list on the afs_server
to which it points.
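Conceptually, the linkage looks something like this (a sketch; the real
structs carry more state):

    struct afs_server_entry {
            struct afs_server   *server;
            struct afs_volume   *volume;    /* NEW: backpointer to the volume */
            struct list_head    slink;      /* NEW: link in server->volumes */
            /* ... */
    };

    struct afs_server {
            struct list_head    volumes;    /* NEW: list of afs_server_entry
                                             * records pointing at this server,
                                             * traversable under RCU */
            /* ... */
    };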
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Combine the endpoint state bool-type members into a bitmask so that some of
them can be waited upon more easily.
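For example, something along these lines (the bit names are assumptions),
which lets waiters use wait_on_bit() and friends rather than polling separate
booleans:

    #define AFS_ESTATE_RESPONDED    0       /* A probe got an answer */
    #define AFS_ESTATE_SUPERSEDED   1       /* A newer probe state exists */

    struct afs_endpoint_state {
            unsigned long   flags;
            /* ... */
    };

    /* e.g. wait for any response to arrive: */
    wait_on_bit(&estate->flags, AFS_ESTATE_RESPONDED, TASK_UNINTERRUPTIBLE);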
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Keep a record of the current fileserver endpoint state, including the probe
state, and replace it when a new probe is started rather than just
squelching the old state and overwriting it. Clearance of the old state
can cause a race if there's another thread also currently trying to
communicate with that server.
It appears that this race might be the culprit for some occasions where
kafs complains about invalid data in the RPC reply because the rotation
algorithm fell all the way through without actually issuing an RPC call and
the error return got filled in from the probe state (which has a zero error
recorded). Whatever happens to be in the caller's reply buffer is then
taken as the response.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
When probing all the addresses for a volume location server, dispatch them
in order of descending priority to try and get a response from the
highest-priority one first.
Also add a tracepoint to show the transmission and completion of the
probes.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
When probing all the addresses for a fileserver, dispatch them in order of
descending priority to try and get a response from the highest-priority one
first.
Also add a tracepoint to show the transmission and completion of the
probes.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Add a field to each address in an address list (afs_addr_list struct) that
records the current priority for that address according to the address
preference table. We don't want to do this every time we use an address
list, so the version number of the address preference table is recorded in
the address list too and we only re-mark the list when we see the version
change.
These numbers are then displayed through /proc/net/afs/servers.
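Roughly speaking (a sketch; the field and helper names are assumptions):

    struct afs_addr_list {
            /* ... */
            unsigned int    addr_pref_version;      /* Version of the pref table
                                                     * the prio values reflect */
            struct {
                    /* ... */
                    unsigned short  prio;           /* Current preference value */
            } addrs[];
    };

    /* Re-mark the list only when the preference table has changed: */
    if (alist->addr_pref_version != READ_ONCE(net->addr_pref_version))
            afs_sketch_remark_addrs(net, alist);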
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
AFS servers may have multiple addresses, but the client can't easily judge
between them as to which one is best. For instance, an address that has a
larger RTT might actually have a better bandwidth because it goes through a
switch rather than being directly connected - but we can't work this out
dynamically unless we push through sufficient data that we can measure it.
To allow the administrator to configure this, add a list of preference
weightings for server addresses by IPv4/IPv6 address or subnet and allow
this to be viewed through a procfile and altered by writing text commands
to that same file. Preference rules can be added/updated by:
echo "add <proto> <addr>[/<subnet>] <prior>" >/proc/fs/afs/addr_prefs
echo "add udp 1.2.3.4 1000" >/proc/fs/afs/addr_prefs
echo "add udp 192.168.0.0/16 3000" >/proc/fs/afs/addr_prefs
echo "add udp 1001:2002:0:6::/64 4000" >/proc/fs/afs/addr_prefs
and removed by:
echo "del <proto> <addr>[/<subnet>]" >/proc/fs/afs/addr_prefs
echo "del udp 1.2.3.4" >/proc/fs/afs/addr_prefs
where the priority is a number between 0 and 65535.
The list is split between IPv4 and IPv6 addresses and each sublist is kept
in numerical order, with rules that would otherwise match but have
different subnet masking being ordered with the most specific submatch
first.
A subsequent patch will apply these rules.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Remove afs_cmp_addr_list() as it was never implemented.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
In /proc/net/afs/servers, show the cell name and the last error for each
address in the server's list.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Fold the afs_addr_cursor struct into the afs_operation struct and the
afs_vl_cursor struct and fold its operations into their callers also.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Use the rxrpc_peer plus the service ID as the call address instead of
passing in a sockaddr_srx down to rxrpc. The peer record is obtained by
using rxrpc_kernel_get_peer(). This avoids the need to repeatedly look up
the peer and allows rxrpc to hold on to resources for it.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Rename the ->index and ->untried fields of the afs_vl_cursor and
afs_operation struct to ->server_index and ->untried_servers to avoid
confusion with address iteration fields when those get folded in.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Add a tracepoint to track the lifetime of the afs_addr_list struct.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Simplify error handling a bit by moving it from the afs_addr_cursor struct
to the afs_operation and afs_vl_cursor structs and using the error
prioritisation function for accumulating errors from multiple sources (AFS
tries to rotate between multiple fileservers, some of which may be
inaccessible or in some state of offlinedness).
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Don't put the afs_call struct in afs_wait_for_call_to_complete() but rather
have the caller do it. This will allow the caller to fish stuff out of the
afs_call struct rather than the afs_addr_cursor struct, thereby allowing a
subsequent patch to subsume it.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Wrap most op->error accesses with inline funcs which will make it easier
for a subsequent patch to replace op->error with something else. Two
functions are added to this end:
(1) afs_op_error() - Get the error code.
(2) afs_op_set_error() - Set the error code.
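For now the wrappers are trivial, something like the following sketch, which
makes the later change of representation a one-place edit:

    static inline int afs_op_error(const struct afs_operation *op)
    {
            return op->error;
    }

    static inline void afs_op_set_error(struct afs_operation *op, int error)
    {
            op->error = error;
    }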
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Set op->nr_iterations to -1 to indicate that we need to begin fileserver
iteration rather than setting error to SHRT_MAX. This makes it easier to
eliminate the address cursor.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
When processing the result of a call, handle the VIO and UAEIO aborts
specifically rather than leaving them to a default case. Rather than
erroring out unconditionally, try another server if the volume has more than
one available; otherwise return -EREMOTEIO, as sketched below.
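A sketch of the intended handling in the rotation code's abort-code switch
(the labels and field names are assumptions):

    case VIO:           /* Volume I/O error on the fileserver */
    case UAEIO:         /* Unified abort code equivalent of EIO */
            if (op->server_list->nr_servers > 1)
                    goto try_next_server;   /* another replica may work */
            afs_op_set_error(op, -EREMOTEIO);
            goto failed;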
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Rename the failed member of struct afs_addr_list to probe_failed as it's
specifically related to probe failures.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
In the rotation algorithms for iterating over volume location servers and
file servers, don't skip servers from which we got a valid response to a
probe (either a reply DATA packet or an ABORT) even if we didn't manage to
get an RTT reading.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Change rxrpc's API such that:
(1) A new function, rxrpc_kernel_lookup_peer(), is provided to look up an
rxrpc_peer record for a remote address and a corresponding function,
rxrpc_kernel_put_peer(), is provided to dispose of it again.
(2) When setting up a call, the rxrpc_peer object used during a call is
now passed in rather than being set up by rxrpc_connect_call(). For
afs, this meant passing it to rxrpc_kernel_begin_call() rather than
the full address (the service ID then has to be passed in as a
separate parameter).
(3) A new function, rxrpc_kernel_remote_addr(), is added so that afs can
get a pointer to the transport address for display purposes, and
another, rxrpc_kernel_remote_srx(), to gain a pointer to the full
rxrpc address.
(4) The function to retrieve the RTT from a call, rxrpc_kernel_get_srtt(),
is then altered to take a peer. This now returns the RTT or -1 if
there are insufficient samples.
(5) Rename rxrpc_kernel_get_peer() to rxrpc_kernel_call_get_peer().
(6) Provide a new function, rxrpc_kernel_get_peer(), to get a ref on a
peer the caller already has.
This allows the afs filesystem to pin the rxrpc_peer records that it is
using, allowing faster lookups and pointer comparisons rather than
comparing sockaddr_rxrpc contents. It also makes it easier to get hold of
the RTT. The following changes are made to afs:
(1) The afs_addr_list struct's addrs[] elements now hold a peer struct pointer
and a service ID rather than a sockaddr_rxrpc.
(2) When displaying the transport address, rxrpc_kernel_remote_addr() is
used.
(3) The port arg is removed from afs_alloc_addrlist() since it's always
overridden.
(4) afs_merge_fs_addr4() and afs_merge_fs_addr6() do peer lookup and may
now return an error that must be handled.
(5) afs_find_server() now takes a peer pointer to specify the address.
(6) afs_find_server(), afs_compare_fs_alists() and afs_merge_fs_addr[46]()
now do peer pointer comparison rather than address comparison.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Turn the afs_addr_list address array into an array of structs, thereby
allowing per-address (such as RTT) info to be added.
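Roughly (a sketch):

    /* Before: a bare array of transport addresses. */
    struct afs_addr_list {
            /* ... */
            struct sockaddr_rxrpc   addrs[];
    };

    /* After: each address gets its own struct with room for extra info. */
    struct afs_address {
            struct sockaddr_rxrpc   srx;
            /* per-address data (RTT, last error, priority, ...) can go here */
    };

    struct afs_addr_list {
            /* ... */
            struct afs_address      addrs[];
    };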
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Add some comments on AFS abort code handling in the rotation algorithm and
adjust the errors produced to match.
Reported-by: Jeffrey E Altman <jaltman@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeffrey Altman <jaltman@auristor.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
David Howells says:
(3) afs_check_validity().
(4) afs_getattr().
These are both pretty short, so your solution is probably good for them.
That said, afs_vnode_commit_status() can spend a long time under the
write lock - and pretty much every file RPC op returns a status update.
Change these functions to use read_seqbegin(). This simplifies the code
and doesn't change the current behaviour: the "seq" counter is always even,
so read_seqbegin_or_lock() can never take the lock.
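That reduces the read side to the plain retry-loop form (sketch; the lock
name is assumed):

    unsigned int seq;

    do {
            seq = read_seqbegin(&vnode->cb_lock);
            /* ... sample the callback/validity state; no writes ... */
    } while (read_seqretry(&vnode->cb_lock, seq));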
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/20231130115617.GA21584@redhat.com/
David Howells says:
(5) afs_find_server().
There could be a lot of servers in the list and each server can have
multiple addresses, so I think this would be better with an exclusive
second pass.
The server list isn't likely to change all that often, but when it does
change, there's a good chance several servers are going to be
added/removed one after the other. Further, this is only going to be
used for incoming cache management/callback requests from the server,
which hopefully aren't going to happen too often - but it is remotely
drivable.
(6) afs_find_server_by_uuid().
Similarly to (5), there could be a lot of servers to search through, but
they are in a tree not a flat list, so it should be faster to process.
Again, it's not likely to change that often and, again, when it does
change it's likely to involve multiple changes. This can be driven
remotely by an incoming cache management request but is mostly going to
be driven by setting up or reconfiguring a volume's server list -
something that also isn't likely to happen often.
Make the "seq" counter odd on the 2nd pass, otherwise read_seqbegin_or_lock()
never takes the lock.
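The resulting pattern is roughly (sketch; the lock name is assumed):

    int seq = 1;

    rcu_read_lock();
    do {
            seq++;  /* 2 on the first (lockless) pass, otherwise odd (locked) */
            read_seqbegin_or_lock(&net->fs_addr_lock, &seq);
            /* ... search for the server ... */
    } while (need_seqretry(&net->fs_addr_lock, seq));
    done_seqretry(&net->fs_addr_lock, seq);
    rcu_read_unlock();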
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/20231130115614.GA21581@redhat.com/
David Howells says:
(2) afs_lookup_volume_rcu().
There can be a lot of volumes known by a system. A thousand would
require a 10-step walk and this is drivable by remote operation, so I
think this should probably take a lock on the second pass too.
Make the "seq" counter odd on the 2nd pass, otherwise read_seqbegin_or_lock()
never takes the lock.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/20231130115606.GA21571@redhat.com/
When an afs_volume struct is put, its refcount is reduced to 0 before
the cell->volume_lock is taken and the volume removed from the
cell->volumes tree.
Unfortunately, this means that the lookup code can race and see a volume
with a zero ref in the tree, resulting in a use-after-free:
refcount_t: addition on 0; use-after-free.
WARNING: CPU: 3 PID: 130782 at lib/refcount.c:25 refcount_warn_saturate+0x7a/0xda
...
RIP: 0010:refcount_warn_saturate+0x7a/0xda
...
Call Trace:
afs_get_volume+0x3d/0x55
afs_create_volume+0x126/0x1de
afs_validate_fc+0xfe/0x130
afs_get_tree+0x20/0x2e5
vfs_get_tree+0x1d/0xc9
do_new_mount+0x13b/0x22e
do_mount+0x5d/0x8a
__do_sys_mount+0x100/0x12a
do_syscall_64+0x3a/0x94
entry_SYSCALL_64_after_hwframe+0x62/0x6a
Fix this by:
(1) When putting, use a flag to indicate if the volume has been removed
from the tree and skip the rb_erase if it has.
(2) When looking up, use a conditional ref increment and if it fails
because the refcount is 0, replace the node in the tree and set the
removal flag.
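The lookup side then becomes, roughly (sketch; the flag and field names are
assumptions):

    volume = rb_entry(parent, struct afs_volume, cell_node);
    if (!refcount_inc_not_zero(&volume->ref)) {
            /* Racing with the final put: the old record is about to be
             * destroyed, so mark it as already removed from the tree and
             * put our new candidate in its place.
             */
            set_bit(AFS_VOLUME_RM_TREE, &volume->flags);
            rb_replace_node_rcu(&volume->cell_node, &candidate->cell_node,
                                &cell->volumes);
            volume = candidate;
    }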
Fixes: 20325960f875 ("afs: Reorganise volume and server trees to be rooted on the cell")
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeffrey Altman <jaltman@auristor.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In afs_update_cell(), ret is the result of the DNS lookup and the errors
are to be handled by a switch - however, the value gets clobbered in
between by setting it to -ENOMEM in case afs_alloc_vlserver_list()
fails.
Fix this by moving the setting of -ENOMEM into the error handling for
OOM failure. Further, only do it if we don't have an alternative error
to return.
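In other words, something like this (sketch; the label is an assumption):

    /* ret currently holds the result of the DNS lookup. */
    if (!vllist) {                  /* afs_alloc_vlserver_list() failed */
            if (ret >= 0)
                    ret = -ENOMEM;  /* don't clobber a real DNS error */
            goto out_wake;
    }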
Found by Linux Verification Center (linuxtesting.org) with SVACE. Based
on a patch from Anastasia Belova [1].
Fixes: d5c32c89b208 ("afs: Fix cell DNS lookup")
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeffrey Altman <jaltman@auristor.com>
cc: Anastasia Belova <abelova@astralinux.ru>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
cc: lvc-project@linuxtesting.org
Link: https://lore.kernel.org/r/20231221085849.1463-1-abelova@astralinux.ru/ [1]
Link: https://lore.kernel.org/r/1700862.1703168632@warthog.procyon.org.uk/ # v1
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In the afs dynamic root directory, the ->lookup() function does a DNS check
on the cell being asked for and if the DNS upcall reports an error it will
report an error back to userspace (typically ENOENT).
However, if a failed DNS upcall returns a new-style result, it will return
a valid result, with the status field set appropriately to indicate the
type of failure - and in that case, dns_query() doesn't return an error and
we let stat() complete with no error - which can cause confusion in
userspace as subsequent calls that trigger d_automount then fail with
ENOENT.
Fix this by checking the status result from a valid dns_query() and
returning an error if it indicates a failure.
Fixes: bbb4c4323a4d ("dns: Allow the dns resolver to retrieve a server set")
Reported-by: Markus Suvanto <markus.suvanto@gmail.com>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=216637
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Markus Suvanto <markus.suvanto@gmail.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Fix the afs dynamic root's d_delete function to always delete unused
dentries rather than only deleting them if they're positive. With things
as they stand upstream, negative dentries stemming from failed DNS lookups
stick around preventing retries.
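A sketch of the wanted behaviour (the generic helper always_delete_dentry()
does exactly this):

    /* Sketch: delete any unused dynroot dentry, positive or negative. */
    static int afs_dynroot_d_delete_sketch(const struct dentry *dentry)
    {
            return 1;
    }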
Fixes: 66c7e1d319a5 ("afs: Split the dynroot stuff out and give it its own ops tables")
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Markus Suvanto <markus.suvanto@gmail.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
If an AFS cell that has an unreachable (eg. ENETUNREACH) server listed (VL
server or fileserver), an asynchronous probe to one of its addresses may
fail immediately because sendmsg() returns an error. When this happens, a
refcount underflow can happen if certain events hit a very small window.
The way this occurs is:
(1) There are two levels of "call" object, the afs_call and the
rxrpc_call. Each of them can be transitioned to a "completed" state
in the event of success or failure.
(2) Asynchronous afs_calls are self-referential whilst they are active to
prevent them from evaporating when they're not being processed. This
reference is disposed of when the afs_call is completed.
Note that an afs_call may only be completed once; once completed,
completing it again will do nothing.
(3) When a call transmission is made, the app-side rxrpc code queues a Tx
buffer for the rxrpc I/O thread to transmit. The I/O thread invokes
sendmsg() to transmit it - and in the case of failure, it transitions
the rxrpc_call to the completed state.
(4) When an rxrpc_call is completed, the app layer is notified. In this
case, the app is kafs and it schedules a work item to process events
pertaining to an afs_call.
(5) When the afs_call event processor is run, it goes down through the
RPC-specific handler to afs_extract_data() to retrieve data from rxrpc
- and, in this case, it picks up the error from the rxrpc_call and
returns it.
The error is then propagated to the afs_call and that is completed
too. At this point the self-reference is released.
(6) If the rxrpc I/O thread manages to complete the rxrpc_call within the
window between rxrpc_send_data() queuing the request packet and
checking for call completion on the way out, then
rxrpc_kernel_send_data() will return the error from sendmsg() to the
app.
(7) Then afs_make_call() will see an error and will jump to the error
handling path which will attempt to clean up the afs_call.
(8) The problem comes when the error handling path in afs_make_call()
tries to unconditionally drop an async afs_call's self-reference.
This self-reference, however, may already have been dropped by
afs_extract_data() completing the afs_call
(9) The refcount underflows when we return to afs_do_probe_vlserver() and
that tries to drop its reference on the afs_call.
Fix this by making afs_make_call() attempt to complete the afs_call rather
than unconditionally putting it. That way, if afs_extract_data() manages
to complete the call first, afs_make_call() won't do anything.
The bug can be forced by making do_udp_sendmsg() return -ENETUNREACH and
sticking an msleep() in rxrpc_send_data() after the 'success:' label to
widen the race window.
The error message looks something like:
refcount_t: underflow; use-after-free.
WARNING: CPU: 3 PID: 720 at lib/refcount.c:28 refcount_warn_saturate+0xba/0x110
...
RIP: 0010:refcount_warn_saturate+0xba/0x110
...
afs_put_call+0x1dc/0x1f0 [kafs]
afs_fs_get_capabilities+0x8b/0xe0 [kafs]
afs_fs_probe_fileserver+0x188/0x1e0 [kafs]
afs_lookup_server+0x3bf/0x3f0 [kafs]
afs_alloc_server_list+0x130/0x2e0 [kafs]
afs_create_volume+0x162/0x400 [kafs]
afs_get_tree+0x266/0x410 [kafs]
vfs_get_tree+0x25/0xc0
fc_mount+0xe/0x40
afs_d_automount+0x1b3/0x390 [kafs]
__traverse_mounts+0x8f/0x210
step_into+0x340/0x760
path_openat+0x13a/0x1260
do_filp_open+0xaf/0x160
do_sys_openat2+0xaf/0x170
or something like:
refcount_t: underflow; use-after-free.
...
RIP: 0010:refcount_warn_saturate+0x99/0xda
...
afs_put_call+0x4a/0x175
afs_send_vl_probes+0x108/0x172
afs_select_vlserver+0xd6/0x311
afs_do_cell_detect_alias+0x5e/0x1e9
afs_cell_detect_alias+0x44/0x92
afs_validate_fc+0x9d/0x134
afs_get_tree+0x20/0x2e6
vfs_get_tree+0x1d/0xc9
fc_mount+0xe/0x33
afs_d_automount+0x48/0x9d
__traverse_mounts+0xe0/0x166
step_into+0x140/0x274
open_last_lookups+0x1c1/0x1df
path_openat+0x138/0x1c3
do_filp_open+0x55/0xb4
do_sys_openat2+0x6c/0xb6
Fixes: 34fa47612bfe ("afs: Fix race in async call refcounting")
Reported-by: Bill MacAllister <bill@ca-zephyr.org>
Closes: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1052304
Suggested-by: Jeffrey E Altman <jaltman@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeffrey Altman <jaltman@auristor.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/2633992.1702073229@warthog.procyon.org.uk/ # v1
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mark a superblock that is for an R/O or Backup volume as SB_RDONLY when
mounting it.
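For instance (sketch; the superblock-info member names are assumptions):

    /* When filling the superblock: */
    if (as->volume && as->volume->type != AFSVL_RWVOL)
            sb->s_flags |= SB_RDONLY;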
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
AFS doesn't really do locking on R/O volumes as fileservers don't maintain
state with each other and thus a lock on a R/O volume file on one
fileserver will not be visible to someone looking at the same file on
another fileserver.
Further, the server may return an error if you try it.
Fix this by doing what other AFS clients do: handle file locking on R/O
volume files entirely within the client and don't touch the server.
Fixes: 6c6c1d63c243 ("afs: Provide mount-time configurable byte-range file locking emulation")
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Make AFS return error ENOENT if no cell SRV or AFSDB DNS record (or
cellservdb config file record) can be found rather than returning
EDESTADDRREQ.
Also add cell name lookup info to the cursor dump.
Fixes: d5c32c89b208 ("afs: Fix cell DNS lookup")
Reported-by: Markus Suvanto <markus.suvanto@gmail.com>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=216637
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
When kafs tries to look up a cell in the DNS or the local config, it will
translate a lookup failure into EDESTADDRREQ whereas OpenAFS translates it
into ENOENT. Applications such as West expect the latter behaviour and
fail if they see the former.
This can be seen by trying to mount an unknown cell:
# mount -t afs %example.com:cell.root /mnt
mount: /mnt: mount(2) system call failed: Destination address required.
Fixes: 4d673da14533 ("afs: Support the AFS dynamic root")
Reported-by: Markus Suvanto <markus.suvanto@gmail.com>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=216637
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeffrey Altman <jaltman@auristor.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
afs_server_list is accessed with the rcu_read_lock() held from
volume->servers, so it needs to be cleaned up correctly.
Fix this by using kfree_rcu() instead of kfree().
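That is, roughly (sketch):

    struct afs_server_list {
            struct rcu_head         rcu;    /* so freeing can wait out readers */
            /* ... */
    };

    /* In afs_put_serverlist() (sketch): */
    kfree_rcu(slist, rcu);                  /* instead of kfree(slist) */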
Fixes: 8a070a964877 ("afs: Detect cell aliases 1 - Cells with root volumes")
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Merge tag 'asm-generic-6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic
Pull ia64 removal and asm-generic updates from Arnd Bergmann:
- The ia64 architecture gets its well-earned retirement as planned,
now that there is one last (mostly) working release that will be
maintained as an LTS kernel.
- The architecture specific system call tables are updated for the
added map_shadow_stack() syscall and to remove references to the
long-gone sys_lookup_dcookie() syscall.
* tag 'asm-generic-6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic:
hexagon: Remove unusable symbols from the ptrace.h uapi
asm-generic: Fix spelling of architecture
arch: Reserve map_shadow_stack() syscall number for all architectures
syscalls: Cleanup references to sys_lookup_dcookie()
Documentation: Drop or replace remaining mentions of IA64
lib/raid6: Drop IA64 support
Documentation: Drop IA64 from feature descriptions
kernel: Drop IA64 support from sig_fault handlers
arch: Remove Itanium (IA-64) architecture
Merge tag 'hardening-v6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull hardening updates from Kees Cook:
"One of the more voluminous set of changes is for adding the new
__counted_by annotation[1] to gain run-time bounds checking of
dynamically sized arrays with UBSan.
- Add LKDTM test for stuck CPUs (Mark Rutland)
- Improve LKDTM selftest behavior under UBSan (Ricardo Cañuelo)
- Refactor more 1-element arrays into flexible arrays (Gustavo A. R.
Silva)
- Analyze and replace strlcpy and strncpy uses (Justin Stitt, Azeem
Shaikh)
- Convert group_info.usage to refcount_t (Elena Reshetova)
- Add __counted_by annotations (Kees Cook, Gustavo A. R. Silva)
- Add Kconfig fragment for basic hardening options (Kees Cook, Lukas
Bulwahn)
- Fix randstruct GCC plugin performance mode to stay in groups (Kees
Cook)
- Fix strtomem() compile-time check for small sources (Kees Cook)"
* tag 'hardening-v6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (56 commits)
hwmon: (acpi_power_meter) replace open-coded kmemdup_nul
reset: Annotate struct reset_control_array with __counted_by
kexec: Annotate struct crash_mem with __counted_by
virtio_console: Annotate struct port_buffer with __counted_by
ima: Add __counted_by for struct modsig and use struct_size()
MAINTAINERS: Include stackleak paths in hardening entry
string: Adjust strtomem() logic to allow for smaller sources
hardening: x86: drop reference to removed config AMD_IOMMU_V2
randstruct: Fix gcc-plugin performance mode to stay in group
mailbox: zynqmp: Annotate struct zynqmp_ipi_pdata with __counted_by
drivers: thermal: tsens: Annotate struct tsens_priv with __counted_by
irqchip/imx-intmux: Annotate struct intmux_data with __counted_by
KVM: Annotate struct kvm_irq_routing_table with __counted_by
virt: acrn: Annotate struct vm_memory_region_batch with __counted_by
hwmon: Annotate struct gsc_hwmon_platform_data with __counted_by
sparc: Annotate struct cpuinfo_tree with __counted_by
isdn: kcapi: replace deprecated strncpy with strscpy_pad
isdn: replace deprecated strncpy with strscpy
NFS/flexfiles: Annotate struct nfs4_ff_layout_segment with __counted_by
nfs41: Annotate struct nfs4_file_layout_dsaddr with __counted_by
...
Merge tag 'vfs-6.7.ctime' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs
Pull vfs inode time accessor updates from Christian Brauner:
"This finishes the conversion of all inode time fields to accessor
functions as discussed on list. Changing timestamps manually as we
used to do before is error prone. Using accessors function makes this
robust.
It does not contain the switch of the time fields to discrete 64 bit
integers to replace struct timespec and free up space in struct inode.
But after this, the switch can be trivially made and the patch should
only affect the vfs if we decide to do it"
* tag 'vfs-6.7.ctime' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: (86 commits)
fs: rename inode i_atime and i_mtime fields
security: convert to new timestamp accessors
selinux: convert to new timestamp accessors
apparmor: convert to new timestamp accessors
sunrpc: convert to new timestamp accessors
mm: convert to new timestamp accessors
bpf: convert to new timestamp accessors
ipc: convert to new timestamp accessors
linux: convert to new timestamp accessors
zonefs: convert to new timestamp accessors
xfs: convert to new timestamp accessors
vboxsf: convert to new timestamp accessors
ufs: convert to new timestamp accessors
udf: convert to new timestamp accessors
ubifs: convert to new timestamp accessors
tracefs: convert to new timestamp accessors
sysv: convert to new timestamp accessors
squashfs: convert to new timestamp accessors
server: convert to new timestamp accessors
client: convert to new timestamp accessors
...
This makes it harder for accidental or malicious changes to be applied to
afs_xattr_handlers at runtime.
Cc: David Howells <dhowells@redhat.com>
Cc: Marc Dionne <marc.dionne@auristor.com>
Cc: linux-afs@lists.infradead.org
Signed-off-by: Wedson Almeida Filho <walmeida@microsoft.com>
Link: https://lore.kernel.org/r/20230930050033.41174-5-wedsonaf@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
Prepare for the coming implementation by GCC and Clang of the __counted_by
attribute. Flexible array members annotated with __counted_by can have
their accesses bounds-checked at run-time via CONFIG_UBSAN_BOUNDS
(for array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family
functions).
As found with Coccinelle[1], add __counted_by for struct afs_addr_list.
[1] https://github.com/kees/kernel-tools/blob/trunk/coccinelle/examples/counted_by.cocci
Cc: David Howells <dhowells@redhat.com>
Cc: Marc Dionne <marc.dionne@auristor.com>
Cc: linux-afs@lists.infradead.org
Reviewed-by: "Gustavo A. R. Silva" <gustavoars@kernel.org>
Link: https://lore.kernel.org/r/20230915201449.never.649-kees@kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Prepare for the coming implementation by GCC and Clang of the __counted_by
attribute. Flexible array members annotated with __counted_by can have
their accesses bounds-checked at run-time via CONFIG_UBSAN_BOUNDS
(for array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family
functions).
As found with Coccinelle[1], add __counted_by for struct afs_permits.
[1] https://github.com/kees/kernel-tools/blob/trunk/coccinelle/examples/counted_by.cocci
Cc: David Howells <dhowells@redhat.com>
Cc: Marc Dionne <marc.dionne@auristor.com>
Cc: linux-afs@lists.infradead.org
Reviewed-by: "Gustavo A. R. Silva" <gustavoars@kernel.org>
Link: https://lore.kernel.org/r/20230915201456.never.529-kees@kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>