linux-next

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git synced 2025-01-16 05:26:07 +00:00

Author	SHA1	Message	Date
Mark Tinguely	1e7882da23	ocfs2: convert ocfs2_readpage_inline() to take a folio Save a couple of calls to compound_head() by using a folio throughout this function. Link: https://lkml.kernel.org/r/20241205171653.3179945-8-willy@infradead.org Signed-off-by: Mark Tinguely <mark.tinguely@oracle.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Joel Becker <jlbec@evilplan.org> Cc: Jun Piao <piaojun@huawei.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Mark Fasheh <mark@fasheh.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-12-18 19:51:42 -08:00
Matthew Wilcox (Oracle)	a13febb586	ocfs2: pass mmap_folio around instead of mmap_page Saves a few hidden calls to compound_head() and accesses to page->mapping. Link: https://lkml.kernel.org/r/20241205171653.3179945-7-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Joel Becker <jlbec@evilplan.org> Cc: Jun Piao <piaojun@huawei.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Mark Tinguely <mark.tinguely@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-12-18 19:51:42 -08:00
Mark Tinguely	2efa52f72c	ocfs2: use a folio in ocfs2_write_begin_inline() Retrieve a folio from the page cache instead of a page and use that folio throughout the function. Saves a couple of calls to compound_head(). Link: https://lkml.kernel.org/r/20241205171653.3179945-6-willy@infradead.org Signed-off-by: Mark Tinguely <mark.tinguely@oracle.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Joel Becker <jlbec@evilplan.org> Cc: Jun Piao <piaojun@huawei.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Mark Fasheh <mark@fasheh.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-12-18 19:51:42 -08:00
Mark Tinguely	228eb729c5	ocfs2: use a folio in ocfs2_zero_new_buffers() Convert to the new APIs, saving at least one hidden call to compound_head(). Link: https://lkml.kernel.org/r/20241205171653.3179945-5-willy@infradead.org Signed-off-by: Mark Tinguely <mark.tinguely@oracle.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Joel Becker <jlbec@evilplan.org> Cc: Jun Piao <piaojun@huawei.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Mark Fasheh <mark@fasheh.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-12-18 19:51:42 -08:00
Mark Tinguely	b7d5a5b0cb	ocfs2: convert w_target_page to w_target_folio Pass a folio around instead of a page. Saves a few hidden calls to compound_head() and removes a call to kmap_atomic(). Link: https://lkml.kernel.org/r/20241205171653.3179945-4-willy@infradead.org Signed-off-by: Mark Tinguely <mark.tinguely@oracle.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Joel Becker <jlbec@evilplan.org> Cc: Jun Piao <piaojun@huawei.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Mark Fasheh <mark@fasheh.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-12-18 19:51:41 -08:00
Matthew Wilcox (Oracle)	b1658cc5ef	ocfs2: convert ocfs2_page_mkwrite() to use a folio Pass the folio into __ocfs2_page_mkwrite() and use it throughout. Does not attempt to support large folios. Link: https://lkml.kernel.org/r/20241205171653.3179945-3-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Joel Becker <jlbec@evilplan.org> Cc: Jun Piao <piaojun@huawei.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Mark Tinguely <mark.tinguely@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-12-18 19:51:41 -08:00
Matthew Wilcox (Oracle)	d7c30202f9	ocfs2: handle a symlink read error correctly Patch series "Convert ocfs2 to use folios". Mark did a conversion of ocfs2 to use folios and sent it to me as a giant patch for review ;-) So I've redone it as individual patches, and credited Mark for the patches where his code is substantially the same. It's not a bad way to do it; his patch had some bugs and my patches had some bugs. Hopefully all our bugs were different from each other. And hopefully Mark likes all the changes I made to his code! This patch (of 23): If we can't read the buffer, be sure to unlock the page before returning. Link: https://lkml.kernel.org/r/20241205171653.3179945-1-willy@infradead.org Link: https://lkml.kernel.org/r/20241205171653.3179945-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Jun Piao <piaojun@huawei.com> Cc: Mark Tinguely <mark.tinguely@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-12-18 19:51:41 -08:00
pangliyuan	848ad53b3f	Squashfs: don't allocate fragment caches more than fragments Sometimes the actual number of fragments in the image is between 0 and SQUASHFS_CACHED_FRAGMENTS, which causes additional fragment caches to be allocated. Set the number of fragment caches to the minimum of the fragment count and SQUASHFS_CACHED_FRAGMENTS. Link: https://lkml.kernel.org/r/20241210090842.160853-1-pangliyuan1@huawei.com Signed-off-by: pangliyuan <pangliyuan1@huawei.com> Reviewed-by: Phillip Lougher <phillip@squashfs.org.uk> Cc: <wangfangpeng1@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-12-18 19:51:41 -08:00
Eric Sandeen	fb666bddd9	ocfs2: convert to the new mount API Convert ocfs2 to the new mount API. Link: https://lkml.kernel.org/r/20241028144443.609151-3-sandeen@redhat.com Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Tested-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com> Tested-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Joel Becker <jlbec@evilplan.org> Cc: Jun Piao <piaojun@huawei.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Mark Fasheh <mark@fasheh.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-12-18 19:51:38 -08:00
Eric Sandeen	a7fed3f8b2	dlmfs: convert to the new mount API Patch series "ocfs2, dlmfs: convert to the new mount API". This patch (of 2): Convert dlmfs to the new mount API. Link: https://lkml.kernel.org/r/20241028144443.609151-1-sandeen@redhat.com Link: https://lkml.kernel.org/r/20241028144443.609151-2-sandeen@redhat.com Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Tested-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com> Tested-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Jun Piao <piaojun@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-12-18 19:51:38 -08:00
Easwar Hariharan	ea7ec1e01e	ceph: convert timeouts to secs_to_jiffies() Commit b35108a51cf7 ("jiffies: Define secs_to_jiffies()") introduced secs_to_jiffies(). As the value here is a multiple of 1000, use secs_to_jiffies() instead of msecs_to_jiffies to avoid the multiplication. This is converted using scripts/coccinelle/misc/secs_to_jiffies.cocci with the following Coccinelle rules: @@ constant C; @@ - msecs_to_jiffies(C * 1000) + secs_to_jiffies(C) @@ constant C; @@ - msecs_to_jiffies(C * MSEC_PER_SEC) + secs_to_jiffies(C) Link: https://lkml.kernel.org/r/20241210-converge-secs-to-jiffies-v3-17-ddfefd7e9f2a@linux.microsoft.com Signed-off-by: Easwar Hariharan <eahariha@linux.microsoft.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andrew Lunn <andrew+netdev@lunn.ch> Cc: Anna-Maria Behnsen <anna-maria@linutronix.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Daniel Mack <daniel@zonque.org> Cc: David Airlie <airlied@gmail.com> Cc: David S. Miller <davem@davemloft.net> Cc: Dick Kennedy <dick.kennedy@broadcom.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Florian Fainelli <florian.fainelli@broadcom.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Haojian Zhuang <haojian.zhuang@gmail.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Jack Wang <jinpu.wang@cloud.ionos.com> Cc: Jakub Kicinski <kuba@kernel.org> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: James Smart <james.smart@broadcom.com> Cc: Jaroslav Kysela <perex@perex.cz> Cc: Jeff Johnson <jjohnson@kernel.org> Cc: Jeff Johnson <quic_jjohnson@quicinc.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jeroen de Borst <jeroendb@google.com> Cc: Jiri Kosina <jikos@kernel.org> Cc: Joe Lawrence <joe.lawrence@redhat.com> Cc: Johan Hedberg <johan.hedberg@gmail.com> Cc: Josh Poimboeuf <jpoimboe@kernel.org> Cc: Jozsef Kadlecsik <kadlec@netfilter.org> Cc: Julia Lawall <julia.lawall@inria.fr> Cc: Kalle Valo <kvalo@kernel.org> Cc: Louis Peens <louis.peens@corigine.com> Cc: Lucas De Marchi <lucas.demarchi@intel.com> Cc: Luiz Augusto von Dentz <luiz.dentz@gmail.com> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Marcel Holtmann <marcel@holtmann.org> Cc: Martin K. Petersen <martin.petersen@oracle.com> Cc: Maxime Ripard <mripard@kernel.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Miroslav Benes <mbenes@suse.cz> Cc: Naveen N Rao <naveen@kernel.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Nicolas Palix <nicolas.palix@imag.fr> Cc: Oded Gabbay <ogabbay@kernel.org> Cc: Ofir Bitton <obitton@habana.ai> Cc: Pablo Neira Ayuso <pablo@netfilter.org> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Petr Mladek <pmladek@suse.com> Cc: Praveen Kaligineedi <pkaligineedi@google.com> Cc: Ray Jui <rjui@broadcom.com> Cc: Robert Jarzmik <robert.jarzmik@free.fr> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Roger Pau Monné <roger.pau@citrix.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Scott Branden <sbranden@broadcom.com> Cc: Shailend Chand <shailend@google.com> Cc: Simona Vetter <simona@ffwll.ch> Cc: Simon Horman <horms@kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Takashi Iwai <tiwai@suse.com> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Thomas Zimmermann <tzimmermann@suse.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Xiubo Li <xiubli@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-12-18 19:51:37 -08:00
Daniel Yang	21110531e0	ocfs2: replace deprecated simple_strtol with kstrtol simple_strtol() ignores overflows and has an awkward interface for error checking. Replacing it with the recommended kstrtol() function leads to clearer error checking and safer conversions. Link: https://lkml.kernel.org/r/20241115080018.5372-1-danielyangkang@gmail.com Signed-off-by: Daniel Yang <danielyangkang@gmail.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Jun Piao <piaojun@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-12-18 19:51:26 -08:00
Dmitry Antipov	d0ad1997d3	ocfs2: miscellaneous spelling fixes Correct spelling here and there as suggested by codespell. Link: https://lkml.kernel.org/r/20241115151013.1404929-1-dmantipov@yandex.ru Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru> Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Jun Piao <piaojun@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-12-18 19:51:26 -08:00
Daniel Yang	1b777e4ae1	ocfs2: heartbeat: replace simple_strtoul with kstrtoul simple_strtoul() is deprecated due to ignoring overflows and also requires clunkier error checking. Replacing with kstrtoul() leads to safer code and cleaner error checking. Link: https://lkml.kernel.org/r/20241117215219.4012-1-danielyangkang@gmail.com Signed-off-by: Daniel Yang <danielyangkang@gmail.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Jun Piao <piaojun@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-12-18 19:51:25 -08:00
Joanne Koong	aca5246a78	fuse: remove tmp folio for writebacks and internal rb tree In the current FUSE writeback design (see commit 3be5a52b30aa ("fuse: support writable mmap")), a temp page is allocated for every dirty page to be written back, the contents of the dirty page are copied over to the temp page, and the temp page gets handed to the server to write back. This is done so that writeback may be immediately cleared on the dirty page, and this in turn is done for two reasons: a) in order to mitigate the following deadlock scenario that may arise if reclaim waits on writeback on the dirty page to complete: * single-threaded FUSE server is in the middle of handling a request that needs a memory allocation * memory allocation triggers direct reclaim * direct reclaim waits on a folio under writeback * the FUSE server can't write back the folio since it's stuck in direct reclaim b) in order to unblock internal (eg sync, page compaction) waits on writeback without needing the server to complete writing back to disk, which may take an indeterminate amount of time. With a recent change that added AS_WRITEBACK_INDETERMINATE and mitigates the situations described above, FUSE writeback does not need to use temp pages if it sets AS_WRITEBACK_INDETERMINATE on its inode mappings. This commit sets AS_WRITEBACK_INDETERMINATE on the inode mappings and removes the temporary pages + extra copying and the internal rb tree. 
fio benchmarks -- (using averages observed from 10 runs, throwing away outliers) Setup: sudo mount -t tmpfs -o size=30G tmpfs ~/tmp_mount ./libfuse/build/example/passthrough_ll -o writeback -o max_threads=4 -o source=~/tmp_mount ~/fuse_mount fio --name=writeback --ioengine=sync --rw=write --bs={1k,4k,1M} --size=2G --numjobs=2 --ramp_time=30 --group_reporting=1 --directory=/root/fuse_mount bs = 1k 4k 1M Before 351 MiB/s 1818 MiB/s 1851 MiB/s After 341 MiB/s 2246 MiB/s 2685 MiB/s % diff -3% 23% 45% Link: https://lkml.kernel.org/r/20241122232359.429647-6-joannelkoong@gmail.com Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com> Acked-by: Miklos Szeredi <mszeredi@redhat.com> Cc: Bernd Schubert <bernd.schubert@fastmail.fm> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-12-18 19:51:07 -08:00
Joanne Koong	0da1c1a78d	fs/writeback: in wait_sb_inodes(), skip wait for AS_WRITEBACK_INDETERMINATE mappings For filesystems with the AS_WRITEBACK_INDETERMINATE flag set, writeback operations may take an indeterminate time to complete. For example, writing data back to disk in FUSE filesystems depends on the userspace server successfully completing writeback. In this commit, wait_sb_inodes() skips waiting on writeback if the inode's mapping has AS_WRITEBACK_INDETERMINATE set, else sync(2) may take an indeterminate amount of time to complete. If the caller wishes to ensure the data for a mapping with the AS_WRITEBACK_INDETERMINATE flag set has actually been written back to disk, they should use fsync(2)/fdatasync(2) instead. Link: https://lkml.kernel.org/r/20241122232359.429647-4-joannelkoong@gmail.com Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com> Acked-by: Miklos Szeredi <mszeredi@redhat.com> Cc: Bernd Schubert <bernd.schubert@fastmail.fm> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-12-18 19:51:06 -08:00
Uros Bizjak	9c4ba50565	percpu: use TYPEOF_UNQUAL() in variable declarations Use TYPEOF_UNQUAL() to declare variables as a corresponding type without named address space qualifier to avoid "`__seg_gs' specified for auto variable `var'" errors. Link: https://lkml.kernel.org/r/20241208204708.3742696-4-ubizjak@gmail.com Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Acked-by: Nadav Amit <nadav.amit@gmail.com> Acked-by: Christoph Lameter <cl@linux.com> Acked-by: Dennis Zhou <dennis@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Arnd Bergmann <arnd@arndb.de> Cc: "David S. Miller" <davem@davemloft.net> Cc: Eric Dumazet <edumazet@google.com> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Will Deacon <will@kernel.org> Cc: Waiman Long <longman@redhat.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-12-18 19:51:04 -08:00
Lorenzo Stoakes	567308fe23	mm: perform all memfd seal checks in a single place We no longer actually need to perform these checks in the f_op->mmap() hook any longer. We already moved the operation which clears VM_MAYWRITE on a read-only mapping of a write-sealed memfd in order to work around the restrictions imposed by commit 5de195060b2e ("mm: resolve faulty mmap_region() error path behaviour"). There is no reason for us not to simply go ahead and additionally check to see if any pre-existing seals are in place here rather than defer this to the f_op->mmap() hook. By doing this we remove more logic from shmem_mmap() which doesn't belong there, as well as doing the same for hugetlbfs_file_mmap(). We also remove dubious shared logic in mm.h which simply does not belong there either. It makes sense to do these checks at the earliest opportunity, we know these are shmem (or hugetlbfs) mappings whose relevant VMA flags will not change from the invoking do_mmap() so there is simply no need to wait. This also means the implementation of further memfd seal flags can be done within mm/memfd.c and also have the opportunity to modify VMA flags as necessary early in the mapping logic. Link: https://lkml.kernel.org/r/20241206212846.210835-1-lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Tested-by: Isaac J. Manjarres <isaacmanjarres@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Jeff Xu <jeffxu@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-12-18 19:50:51 -08:00
David Hildenbrand	361469d853	virtio-mem: support CONFIG_PROC_VMCORE_DEVICE_RAM Let's implement the get_device_ram() vmcore callback, so architectures that select NEED_PROC_VMCORE_NEED_DEVICE_RAM, like s390 soon, can include that memory in a crash dump. Merge ranges, and process ranges that might contain a mixture of plugged and unplugged, to reduce the total number of ranges. Link: https://lkml.kernel.org/r/20241204125444.1734652-12-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Baoquan He <bhe@redhat.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: Cornelia Huck <cohuck@redhat.com> Cc: Dave Young <dyoung@redhat.com> Cc: Eric Farman <farman@linux.ibm.com> Cc: Eugenio Pérez <eperezma@redhat.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Janosch Frank <frankja@linux.ibm.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Thomas Huth <thuth@redhat.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-12-18 19:50:45 -08:00
David Hildenbrand	49c43251cd	fs/proc/vmcore: introduce PROC_VMCORE_DEVICE_RAM to detect device RAM ranges in 2nd kernel s390 allocates+prepares the elfcore hdr in the dump (2nd) kernel, not in the crashed kernel. RAM provided by memory devices such as virtio-mem can only be detected using the device driver; when vmcore_init() is called, these device drivers are usually not loaded yet, or the devices did not get probed yet. Consequently, on s390 these RAM ranges will not be included in the crash dump, which makes the dump partially corrupt and is unfortunate. Instead of deferring the vmcore_init() call, to an (unclear?) later point, let's reuse the vmcore_cb infrastructure to obtain device RAM ranges as the device drivers probe the device and get access to this information. Then, we'll add these ranges to the vmcore, adding more PT_LOAD entries and updating the offsets+vmcore size. Use a separate Kconfig option to be set by an architecture to include this code only if the arch really needs it. Further, we'll make the config depend on the relevant drivers (i.e., virtio_mem) once they implement support (next). The alternative of having a PROVIDE_PROC_VMCORE_DEVICE_RAM config option was dropped for now for simplicity. The current target use case is s390, which only creates an elf64 elfcore, so focusing on elf64 is sufficient. Link: https://lkml.kernel.org/r/20241204125444.1734652-9-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Baoquan He <bhe@redhat.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: Cornelia Huck <cohuck@redhat.com> Cc: Dave Young <dyoung@redhat.com> Cc: Eric Farman <farman@linux.ibm.com> Cc: Eugenio Pérez <eperezma@redhat.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Janosch Frank <frankja@linux.ibm.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Thomas Huth <thuth@redhat.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-12-18 19:50:44 -08:00
David Hildenbrand	adf278f7aa	fs/proc/vmcore: factor out freeing a list of vmcore ranges Let's factor it out into include/linux/crash_dump.h, from where we can use it also outside of vmcore.c later. Link: https://lkml.kernel.org/r/20241204125444.1734652-8-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Baoquan He <bhe@redhat.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: Cornelia Huck <cohuck@redhat.com> Cc: Dave Young <dyoung@redhat.com> Cc: Eric Farman <farman@linux.ibm.com> Cc: Eugenio Pérez <eperezma@redhat.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Janosch Frank <frankja@linux.ibm.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Thomas Huth <thuth@redhat.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-12-18 19:50:44 -08:00
David Hildenbrand	fbdd8ccedf	fs/proc/vmcore: factor out allocating a vmcore range and adding it to a list Let's factor it out into include/linux/crash_dump.h, from where we can use it also outside of vmcore.c later. Link: https://lkml.kernel.org/r/20241204125444.1734652-7-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Baoquan He <bhe@redhat.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: Cornelia Huck <cohuck@redhat.com> Cc: Dave Young <dyoung@redhat.com> Cc: Eric Farman <farman@linux.ibm.com> Cc: Eugenio Pérez <eperezma@redhat.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Janosch Frank <frankja@linux.ibm.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Thomas Huth <thuth@redhat.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-12-18 19:50:43 -08:00
David Hildenbrand	be433ab8f5	fs/proc/vmcore: move vmcore definitions out of kcore.h These vmcore defines are not related to /proc/kcore, move them out. We'll move "struct vmcoredd_node" to vmcore.c, because it is only used internally. While "struct vmcore" is only used internally for now, we're planning on using it from inline functions in crash_dump.h next, so move it to crash_dump.h. While at it, rename "struct vmcore" to "struct vmcore_range", which is a more suitable name and will make the usage of it outside of vmcore.c clearer. Link: https://lkml.kernel.org/r/20241204125444.1734652-6-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Baoquan He <bhe@redhat.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: Cornelia Huck <cohuck@redhat.com> Cc: Dave Young <dyoung@redhat.com> Cc: Eric Farman <farman@linux.ibm.com> Cc: Eugenio Pérez <eperezma@redhat.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Janosch Frank <frankja@linux.ibm.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Thomas Huth <thuth@redhat.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-12-18 19:50:43 -08:00
David Hildenbrand	86b8a87028	fs/proc/vmcore: prefix all pr_* with "vmcore:" Let's use "vmcore: " as a prefix, converting the single "Kdump: vmcore not initialized" one to effectively be "vmcore: not initialized". Link: https://lkml.kernel.org/r/20241204125444.1734652-5-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Baoquan He <bhe@redhat.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: Cornelia Huck <cohuck@redhat.com> Cc: Dave Young <dyoung@redhat.com> Cc: Eric Farman <farman@linux.ibm.com> Cc: Eugenio Pérez <eperezma@redhat.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Janosch Frank <frankja@linux.ibm.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Thomas Huth <thuth@redhat.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-12-18 19:50:43 -08:00
David Hildenbrand	d788736b68	fs/proc/vmcore: disallow vmcore modifications while the vmcore is open The vmcoredd_update_size() call and its effects (size/offset changes) are currently completely unsynchronized, and will cause trouble when performed concurrently, or when done while someone is already reading the vmcore. Let's protect all vmcore modifications by the vmcore_mutex, disallow vmcore modifications while the vmcore is open, and warn on vmcore modifications after the vmcore was already opened once: modifications while the vmcore is open are unsafe, and modifications after the vmcore was opened indicate trouble. Properly synchronize against concurrent opening of the vmcore. No need to grab the mutex during mmap()/read(): after we opened the vmcore, modifications are impossible. It's worth noting that modifications after the vmcore was opened are completely unexpected, so failing if open, and warning if already opened (+closed again) is good enough. This change not only handles concurrent adding of device dumps + concurrent reading of the vmcore properly, it also prepares for other mechanisms that will modify the vmcore. Link: https://lkml.kernel.org/r/20241204125444.1734652-4-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Baoquan He <bhe@redhat.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: Cornelia Huck <cohuck@redhat.com> Cc: Dave Young <dyoung@redhat.com> Cc: Eric Farman <farman@linux.ibm.com> Cc: Eugenio Pérez <eperezma@redhat.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Janosch Frank <frankja@linux.ibm.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Thomas Huth <thuth@redhat.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-12-18 19:50:43 -08:00
David Hildenbrand	46de2311a6	fs/proc/vmcore: replace vmcoredd_mutex by vmcore_mutex Now that we have a mutex that synchronizes against opening of the vmcore, let's use that one to replace vmcoredd_mutex: there is no need to have two separate ones. This is a preparation for properly preventing vmcore modifications after the vmcore was opened. Link: https://lkml.kernel.org/r/20241204125444.1734652-3-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Baoquan He <bhe@redhat.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: Cornelia Huck <cohuck@redhat.com> Cc: Dave Young <dyoung@redhat.com> Cc: Eric Farman <farman@linux.ibm.com> Cc: Eugenio Pérez <eperezma@redhat.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Janosch Frank <frankja@linux.ibm.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Thomas Huth <thuth@redhat.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-12-18 19:50:42 -08:00
David Hildenbrand	5b5b813820	fs/proc/vmcore: convert vmcore_cb_lock into vmcore_mutex Patch series "fs/proc/vmcore: kdump support for virtio-mem on s390", v2. The only "different than everything else" thing about virtio-mem on s390 is kdump: The crash (2nd) kernel allocates+prepares the elfcore hdr during fs_init()->vmcore_init()->elfcorehdr_alloc(). Consequently, the kdump kernel must detect memory ranges of the crashed kernel to include via PT_LOAD in the vmcore. On other architectures, all RAM regions (boot + hotplugged) can easily be observed on the old (to crash) kernel (e.g., using /proc/iomem) to create the elfcore hdr. On s390, information about "ordinary" memory (heh, "storage") can be obtained by querying the hypervisor/ultravisor via SCLP/diag260, and that information is stored early during boot in the "physmem" memblock data structure. But virtio-mem memory is always detected by a device driver, which is usually built as a module. So in the crash kernel, this memory can only be properly detected once the virtio-mem driver has started up. The virtio-mem driver already supports the "kdump mode", where it won't hotplug any memory but instead queries the device to implement the pfn_is_ram() callback, to avoid reading unplugged memory holes when reading the vmcore. With this series, if the virtio-mem driver is included in the kdump initrd -- which dracut already takes care of under Fedora/RHEL -- it will now detect the device RAM ranges on s390 once it probes the devices, to add them to the vmcore using the same callback mechanism we already have for pfn_is_ram(). To add these device RAM ranges to the vmcore ("patch the vmcore"), we will add new PT_LOAD entries that describe these memory ranges, and update all offsets and the vmcore size so it is all consistent. My testing when creating+analyzing crash dumps with hotplugged virtio-mem memory (incl. holes) did not reveal any surprises.
This patch (of 12): We want to protect vmcore modifications from concurrent opening of the vmcore, and also serialize vmcore modification. (a) We can currently modify the vmcore after it was opened. This can happen if a vmcoredd is added after the vmcore module was initialized and already opened by user space. We want to fix that and prepare for new code wanting to serialize against concurrent opening. (b) To handle it cleanly we need to protect the modifications against concurrent opening. As the modifications end up allocating memory and can sleep, we cannot rely on the spinlock. Let's convert the spinlock into a mutex to prepare for further changes. Link: https://lkml.kernel.org/r/20241204125444.1734652-1-david@redhat.com Link: https://lkml.kernel.org/r/20241204125444.1734652-2-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Baoquan He <bhe@redhat.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: Cornelia Huck <cohuck@redhat.com> Cc: Dave Young <dyoung@redhat.com> Cc: Eric Farman <farman@linux.ibm.com> Cc: Eugenio Pérez <eperezma@redhat.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Janosch Frank <frankja@linux.ibm.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Thomas Huth <thuth@redhat.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-12-18 19:50:42 -08:00
Lorenzo Stoakes	5302e93f34	mm: abstract get_arg_page() stack expansion and mmap read lock Right now fs/exec.c invokes expand_downwards(), an otherwise internal implementation detail of the VMA logic, in order to ensure that an arg page can be obtained by get_user_pages_remote(). In order to be able to move the stack expansion logic into mm/vma.c to make it available to userland testing we need to find an alternative approach here. We do so by providing the mmap_read_lock_maybe_expand() function which also helpfully documents what get_arg_page() is doing here and adds an additional check against VM_GROWSDOWN to make explicit that the stack expansion logic is only invoked when the VMA is indeed a downward-growing stack. This allows expand_downwards() to become a static function. Importantly, the VMA referenced by mmap_read_lock_maybe_expand() must NOT be currently user-visible in any way, that is, placed within an rmap or VMA tree. It must be a newly allocated VMA. This is the case when exec invokes this function. Link: https://lkml.kernel.org/r/5295d1c70c58e6aa63d14be68d4e1de9fa1c8e6d.1733248985.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Kees Cook <kees@kernel.org> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-12-18 19:50:39 -08:00
Dennis Lam	1aefbedee7	ocfs2: fix slab-use-after-free due to dangling pointer dqi_priv When mounting ocfs2 and then remounting it as read-only, a slab-use-after-free occurs after the user uses a syscall to quota_getnextquota. Specifically, sb_dqinfo(sb, type)->dqi_priv is the dangling pointer. During the remounting process, the pointer dqi_priv is freed but is never set to NULL, leaving it to be accessed. Additionally, the read-only option for remounting sets the DQUOT_SUSPENDED flag instead of setting the DQUOT_USAGE_ENABLED flags. Moreover, later in the process of getting the next quota, the function ocfs2_get_next_id is called and only checks the quota usage flags and not the quota suspended flags. To fix this, I set dqi_priv to NULL when it is freed after remounting with read-only and put a check for DQUOT_SUSPENDED in ocfs2_get_next_id. Link: https://lkml.kernel.org/r/20241218023924.22821-2-dennis.lamerice@gmail.com Fixes: 8f9e8f5fcc05 ("ocfs2: Fix Q_GETNEXTQUOTA for filesystem without quotas") Signed-off-by: Dennis Lam <dennis.lamerice@gmail.com> Reported-by: syzbot+d173bf8a5a7faeede34c@syzkaller.appspotmail.com Tested-by: syzbot+d173bf8a5a7faeede34c@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/6731d26f.050a0220.1fb99c.014b.GAE@google.com/T/ Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Jun Piao <piaojun@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-12-18 19:49:59 -08:00
David Hildenbrand	a45c68ddb3	fs/proc/task_mmu: fix pagemap flags with PMD THP entries on 32bit Entries (including flags) are u64, even on 32bit. So right now we are cutting off the flags on 32bit. This way, for example the cow selftest complains about: # ./cow ... Bail Out! read and ioctl return unmatched results for populated: 0 1 Link: https://lkml.kernel.org/r/20241217195000.1734039-1-david@redhat.com Fixes: 2c1f057e5be6 ("fs/proc/task_mmu: properly detect PM_MMAP_EXCLUSIVE per page of PMD-mapped THPs") Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-12-18 19:49:58 -08:00
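The truncation described in the entry above can be illustrated with a small userspace sketch. It assumes a flag that lives in the upper half of a 64-bit entry (bit 56 is used here purely as an example position); DEMO_FLAG_HIGH, flags_via_32bit() and flags_via_64bit() are hypothetical names and are not taken from fs/proc/task_mmu.c.

```c
#include <stdint.h>

/* Hypothetical stand-in for a pagemap flag located in the upper 32 bits. */
#define DEMO_FLAG_HIGH (UINT64_C(1) << 56)

/* Models keeping the flags in a 32-bit variable before building the entry. */
uint64_t flags_via_32bit(uint64_t flags)
{
    uint32_t narrow = (uint32_t)flags; /* the upper 32 bits are discarded here */

    return narrow;
}

/* Models keeping the flags in a 64-bit variable: nothing is lost. */
uint64_t flags_via_64bit(uint64_t flags)
{
    uint64_t wide = flags;

    return wide;
}
```

With a 32-bit intermediate, a high flag silently disappears while low bits survive, which matches the kind of read/ioctl mismatch the selftest reports.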
Dmitry Antipov	cb11e33358	ocfs2: fix directory entry check in ocfs2_search_dirblock() Syzbot has reported the following KASAN splat: BUG: KASAN: slab-use-after-free in ocfs2_search_dirblock+0x26b/0x830 Read of size 1 at addr ffff888012009982 by task repro/5388 ... Call Trace: <TASK> dump_stack_lvl+0x241/0x360 ? __pfx_dump_stack_lvl+0x10/0x10 ? __pfx__printk+0x10/0x10 ? _printk+0xd5/0x120 ? __virt_addr_valid+0x183/0x530 ? __virt_addr_valid+0x183/0x530 print_report+0x169/0x550 ? __virt_addr_valid+0x183/0x530 ? __virt_addr_valid+0x183/0x530 ? __virt_addr_valid+0x45f/0x530 ? __phys_addr+0xba/0x170 ? ocfs2_search_dirblock+0x26b/0x830 kasan_report+0x143/0x180 ? ocfs2_search_dirblock+0x26b/0x830 ocfs2_search_dirblock+0x26b/0x830 ? ocfs2_read_inode_block+0x14c/0x1e0 ? __pfx_ocfs2_search_dirblock+0x10/0x10 ? validate_chain+0x11e/0x5900 ocfs2_find_entry+0x1169/0x2780 ? mark_lock+0x9a/0x350 ? __lock_acquire+0x137a/0x2040 ? __pfx_ocfs2_find_entry+0x10/0x10 ? __pfx_lock_acquire+0x10/0x10 ? ocfs2_inode_lock_full_nested+0x17b/0x1c10 ? __pfx_lock_release+0x10/0x10 ? do_raw_spin_lock+0x14f/0x370 ? do_raw_spin_unlock+0x58/0x8b0 ? _raw_spin_unlock+0x28/0x50 ? ocfs2_inode_lock_full_nested+0xb2f/0x1c10 ? __pfx_ocfs2_inode_lock_full_nested+0x10/0x10 ocfs2_find_files_on_disk+0xff/0x360 ocfs2_lookup_ino_from_name+0xb1/0x1e0 ? __pfx_ocfs2_lookup_ino_from_name+0x10/0x10 ocfs2_lookup+0x292/0xa60 ? __pfx_ocfs2_lookup+0x10/0x10 ? from_kgid+0x1a7/0x730 ? make_vfsgid+0x46/0x90 ? HAS_UNMAPPED_ID+0xf9/0x150 ? inode_permission+0xff/0x460 ? __pfx_ocfs2_permission+0x10/0x10 ? bpf_lsm_inode_create+0x9/0x10 ? security_inode_create+0xc2/0x110 ? __pfx_ocfs2_lookup+0x10/0x10 path_openat+0x11ce/0x3470 ? __pfx_path_openat+0x10/0x10 do_filp_open+0x235/0x490 ? __pfx_do_filp_open+0x10/0x10 ? _raw_spin_unlock+0x28/0x50 ? alloc_fd+0x5a1/0x640 do_sys_openat2+0x13e/0x1d0 ? mntput_no_expire+0xc2/0x850 ? __pfx_do_sys_openat2+0x10/0x10 ? __pfx_mntput_no_expire+0x10/0x10 __x64_sys_openat+0x247/0x2a0 ? 
__pfx___x64_sys_openat+0x10/0x10 ? do_syscall_64+0x100/0x230 ? do_syscall_64+0xb6/0x230 do_syscall_64+0xf3/0x230 entry_SYSCALL_64_after_hwframe+0x77/0x7f ... </TASK> This happens when 'ocfs2_search_dirblock()' attempts to jump over a (presumably invalid) on-disk directory entry whose size exceeds 'sizeof(struct ocfs2_dir_entry)', thus touching memory used by others (including the previously freed one). So just bail out if such a directory entry is found. Link: https://lkml.kernel.org/r/20241119170745.464799-1-dmantipov@yandex.ru Fixes: ccd979bdbce9 ("[PATCH] OCFS2: The Second Oracle Cluster Filesystem") Reported-by: syzbot+b9704899e166798d57c9@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=b9704899e166798d57c9 Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <jiangqi903@gmail.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Jun Piao <piaojun@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-12-18 19:49:55 -08:00
Ryusuke Konishi	6309b8ce98	nilfs2: fix buffer head leaks in calls to truncate_inode_pages() When block_invalidatepage was converted to block_invalidate_folio, the fallback to block_invalidatepage in folio_invalidate() if the address_space_operations method invalidatepage (currently invalidate_folio) was not set, was removed. Unfortunately, some pseudo-inodes in nilfs2 use empty_aops set by inode_init_always_gfp() as is, or explicitly set it to address_space_operations. Therefore, with this change, block_invalidatepage() is no longer called from folio_invalidate(), and as a result, the buffer_head structures attached to these pages/folios are no longer freed via try_to_free_buffers(). Thus, these buffer heads are now leaked by truncate_inode_pages(), which cleans up the page cache from inode evict(), etc. Three types of caches use empty_aops: gc inode caches and the DAT shadow inode used by GC, and b-tree node caches. Of these, b-tree node caches explicitly call invalidate_mapping_pages() during cleanup, which involves calling try_to_free_buffers(), so the leak was not visible during normal operation but worsened when GC was performed. Fix this issue by using address_space_operations with invalidate_folio set to block_invalidate_folio instead of empty_aops, which will ensure the same behavior as before. Link: https://lkml.kernel.org/r/20241212164556.21338-1-konishi.ryusuke@gmail.com Fixes: 7ba13abbd31e ("fs: Turn block_invalidatepage into block_invalidate_folio") Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: <stable@vger.kernel.org> [5.18+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-12-18 19:04:45 -08:00
Edward Adam Davis	901ce9705f	nilfs2: prevent use of deleted inode syzbot reported a WARNING in nilfs_rmdir. [1] Because the inode bitmap is corrupted, an inode with an inode number that should exist as a ".nilfs" file was reassigned by nilfs_mkdir for "file0", causing an inode duplication during execution. And this causes an underflow of i_nlink in rmdir operations. The inode is used twice by the same task to unmount and remove directories ".nilfs" and "file0", which triggers the warning in nilfs_rmdir. To avoid this issue, check i_nlink in nilfs_iget(): if it is 0, the inode has been deleted, and iput is executed to reclaim it. [1] WARNING: CPU: 1 PID: 5824 at fs/inode.c:407 drop_nlink+0xc4/0x110 fs/inode.c:407 ... Call Trace: <TASK> nilfs_rmdir+0x1b0/0x250 fs/nilfs2/namei.c:342 vfs_rmdir+0x3a3/0x510 fs/namei.c:4394 do_rmdir+0x3b5/0x580 fs/namei.c:4453 __do_sys_rmdir fs/namei.c:4472 [inline] __se_sys_rmdir fs/namei.c:4470 [inline] __x64_sys_rmdir+0x47/0x50 fs/namei.c:4470 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f Link: https://lkml.kernel.org/r/20241209065759.6781-1-konishi.ryusuke@gmail.com Fixes: d25006523d0b ("nilfs2: pathname operations") Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Reported-by: syzbot+9260555647a5132edd48@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=9260555647a5132edd48 Tested-by: syzbot+9260555647a5132edd48@syzkaller.appspotmail.com Signed-off-by: Edward Adam Davis <eadavis@qq.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-12-18 19:04:44 -08:00
Kefeng Wang	8aca2bc96c	mm: use aligned address in clear_gigantic_page() In the current kernel, hugetlb_no_page() calls folio_zero_user() with the fault address, which may not be aligned to the huge page size. folio_zero_user() may then call clear_gigantic_page() with that address, while clear_gigantic_page() requires the address to be huge page size aligned. This may cause memory corruption or an information leak. Additionally, use the more obvious name 'addr_hint' instead of 'addr' for clear_gigantic_page(). Link: https://lkml.kernel.org/r/20241028145656.932941-1-wangkefeng.wang@huawei.com Fixes: 78fefd04c123 ("mm: memory: convert clear_huge_page() to folio_zero_user()") Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-12-18 19:04:42 -08:00
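The alignment requirement in the entry above comes down to rounding a fault address down to the start of the huge page that contains it. A minimal userspace sketch, assuming a power-of-two page size; demo_align_down() is a hypothetical helper and is not the code from the patch itself.

```c
#include <stdint.h>

/*
 * Round addr down to a multiple of size.
 * Assumption: size is a non-zero power of two (e.g. 1 GiB for a gigantic page).
 */
uint64_t demo_align_down(uint64_t addr, uint64_t size)
{
    return addr & ~(size - 1);
}
```

For example, with a 1 GiB page size, an address of 0x40001234 rounds down to 0x40000000, so a routine that clears the whole page from that base never starts in the middle of the page.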
Heming Zhao	7782e3b3b0	ocfs2: fix the space leak in LA when releasing LA Commit 30dd3478c3cd ("ocfs2: correctly use ocfs2_find_next_zero_bit()") introduced an issue: ocfs2_sync_local_to_main() ignores the last contiguous free bits, which causes an OCFS2 volume to lose the last free clusters of the LA window during the release routine. Please note, because commit dfe6c5692fb5 ("ocfs2: fix the la space leak when unmounting an ocfs2 volume") was reverted, this commit is a replacement fix for commit dfe6c5692fb5. Link: https://lkml.kernel.org/r/20241205104835.18223-3-heming.zhao@suse.com Fixes: 30dd3478c3cd ("ocfs2: correctly use ocfs2_find_next_zero_bit()") Signed-off-by: Heming Zhao <heming.zhao@suse.com> Suggested-by: Joseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Jun Piao <piaojun@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-12-18 19:04:41 -08:00
Heming Zhao	1a72d2ebee	ocfs2: revert "ocfs2: fix the la space leak when unmounting an ocfs2 volume" Patch series "Revert ocfs2 commit dfe6c5692fb5 and provide a new fix". SUSE QA team detected a mistake in my commit dfe6c5692fb5 ("ocfs2: fix the la space leak when unmounting an ocfs2 volume"). I am very sorry for my error. (If my eyes are correct) From the mailing list mails, this patch shouldn't be applied to 4.19 5.4 5.10 5.15 6.1 6.6, and these branches should perform a revert operation. Reason for revert: In commit dfe6c5692fb5, I mistakenly wrote: "This bug has existed since the initial OCFS2 code.". The statement is wrong. The correct introduction commit is 30dd3478c3cd. IOW, if the branch doesn't include 30dd3478c3cd, dfe6c5692fb5 should also not be included. This reverts commit dfe6c5692fb5 ("ocfs2: fix the la space leak when unmounting an ocfs2 volume"). In commit dfe6c5692fb5, the commit log "This bug has existed since the initial OCFS2 code." is wrong. The correct introduction commit is 30dd3478c3cd ("ocfs2: correctly use ocfs2_find_next_zero_bit()"). The influence of commit dfe6c5692fb5 is that it provides a correct fix for the latest kernel. However, it shouldn't be pushed to stable branches. Let's use this commit to revert all branches that include dfe6c5692fb5 and use a new fix method to fix commit 30dd3478c3cd. Link: https://lkml.kernel.org/r/20241205104835.18223-1-heming.zhao@suse.com Link: https://lkml.kernel.org/r/20241205104835.18223-2-heming.zhao@suse.com Fixes: dfe6c5692fb5 ("ocfs2: fix the la space leak when unmounting an ocfs2 volume") Signed-off-by: Heming Zhao <heming.zhao@suse.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Jun Piao <piaojun@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-12-18 19:04:41 -08:00
Kees Cook	c7c1167fcb	Merge branch 'for-next/topic/execve/core' into for-next/execve	2024-12-18 17:01:53 -08:00
Mickaël Salaün	a5874fde3c	exec: Add a new AT_EXECVE_CHECK flag to execveat(2) Add a new AT_EXECVE_CHECK flag to execveat(2) to check if a file would be allowed for execution. The main use case is for script interpreters and dynamic linkers to check execution permission according to the kernel's security policy. Another use case is to add context to access logs e.g., which script (instead of interpreter) accessed a file. As any executable code, scripts could also use this check [1]. This is different from faccessat(2) + X_OK which only checks a subset of access rights (i.e. inode permission and mount options for regular files), but not the full context (e.g. all LSM access checks). The main use case for access(2) is for SUID processes to (partially) check access on behalf of their caller. The main use case for execveat(2) + AT_EXECVE_CHECK is to check if a script execution would be allowed, according to all the different restrictions in place. Because the use of AT_EXECVE_CHECK follows the exact kernel semantic as for a real execution, user space gets the same error codes. An interesting point of using execveat(2) instead of openat2(2) is that it decouples the check from the enforcement. Indeed, the security check can be logged (e.g. with audit) without blocking an execution environment not yet ready to enforce a strict security policy. LSMs can control or log execution requests with security_bprm_creds_for_exec(). However, to enforce a consistent and complete access control (e.g. on binary's dependencies) LSMs should restrict file executability, or measure executed files, with security_file_open() by checking file->f_flags & __FMODE_EXEC. Because AT_EXECVE_CHECK is dedicated to user space interpreters, it doesn't make sense for the kernel to parse the checked files, look for interpreters known to the kernel (e.g. ELF, shebang), and return ENOEXEC if the format is unknown. Because of that, security_bprm_check() is never called when AT_EXECVE_CHECK is used. 
It should be noted that script interpreters cannot directly use execveat(2) (without this new AT_EXECVE_CHECK flag) because this could lead to unexpected behaviors e.g., `python script.sh` could lead to Bash being executed to interpret the script. Unlike the kernel, script interpreters may just interpret the shebang as a simple comment, which should not change for backward compatibility reasons. Because scripts or libraries files might not currently have the executable permission set, or because we might want specific users to be allowed to run arbitrary scripts, the following patch provides a dynamic configuration mechanism with the SECBIT_EXEC_RESTRICT_FILE and SECBIT_EXEC_DENY_INTERACTIVE securebits. This is a redesign of the CLIP OS 4's O_MAYEXEC: `f5cb330d6b/1901_open_mayexec.patch` This patch has been used for more than a decade with customized script interpreters. Some examples can be found here: https://github.com/clipos-archive/clipos4_portage-overlay/search?q=O_MAYEXEC Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Kees Cook <keescook@chromium.org> Acked-by: Paul Moore <paul@paul-moore.com> Reviewed-by: Serge Hallyn <serge@hallyn.com> Reviewed-by: Jeff Xu <jeffxu@chromium.org> Tested-by: Jeff Xu <jeffxu@chromium.org> Link: https://docs.python.org/3/library/io.html#io.open_code [1] Signed-off-by: Mickaël Salaün <mic@digikod.net> Link: https://lore.kernel.org/r/20241212174223.389435-2-mic@digikod.net Signed-off-by: Kees Cook <kees@kernel.org>	2024-12-18 17:00:29 -08:00
Linus Torvalds	eabcdba3ad	for-6.13-rc3-tag Merge tag 'for-6.13-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - tree-checker catches invalid number of inline extent references - zoned mode fixes: - enhance zone append IO command so it also detects emulated writes - handle bio splitting at sectorsize boundary - when deleting a snapshot, fix a condition for visiting nodes in reloc trees * tag 'for-6.13-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: tree-checker: reject inline extent items with 0 ref count btrfs: split bios to the fs sector size boundary btrfs: use bio_is_zone_append() in the completion handler btrfs: fix improper generation check in snapshot delete	2024-12-18 14:17:21 -08:00
Matthew Wilcox (Oracle)	1f2bf7049f	ntfs3: Remove an access to page->index Convert the first page passed to ni_write_frame() to a folio and use folio_pos() on that instead of open-coding the access to folio->index, cast & shift. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>	2024-12-18 16:42:01 +03:00
Jan Kara	71358f64c4	Merge inotify strcpy hardening.	2024-12-18 11:35:20 +01:00
Kees Cook	b8f2688258	inotify: Use strscpy() for event->name copies Since we have already allocated "len + 1" space for event->name, make sure that name->name cannot ever accidentally cause a copy overflow by calling strscpy() instead of the unbounded strcpy() routine. This assists in the ongoing efforts to remove the unsafe strcpy() API[1] from the kernel. Link: https://github.com/KSPP/linux/issues/88 [1] Signed-off-by: Kees Cook <kees@kernel.org> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20241216224507.work.859-kees@kernel.org	2024-12-18 11:33:40 +01:00
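The difference between an unbounded and a bounded copy, as discussed in the entry above, can be sketched in userspace C. strscpy() is a kernel-only API, so the helper below is a hypothetical analogue (demo_bounded_copy() is not a real library function): it never writes past the destination size, always NUL-terminates, and reports truncation with -1 rather than a kernel error code.

```c
#include <stddef.h>

/*
 * Hypothetical userspace analogue of a bounded string copy.
 * Copies at most size - 1 bytes from src, always NUL-terminates dst when
 * size > 0, and returns the number of bytes copied, or -1 if src did not
 * fit (or size was 0).
 */
long demo_bounded_copy(char *dst, const char *src, size_t size)
{
    size_t i;

    if (size == 0)
        return -1;

    /* Stop at the end of src or one byte before the end of dst. */
    for (i = 0; i < size - 1 && src[i] != '\0'; i++)
        dst[i] = src[i];
    dst[i] = '\0';

    /* If src still has data at position i, the copy was truncated. */
    return src[i] == '\0' ? (long)i : -1;
}
```

With a buffer allocated as "len + 1" bytes, a source of exactly len characters copies in full, and anything longer is cut off instead of overflowing the buffer.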
David Sterba	223e2d308d	Merge branch 'misc-next' into for-next-next-v6.13-20241218	2024-12-18 02:43:55 +01:00
David Sterba	e6880d4a24	Merge branch 'misc-6.13' into for-next-next-v6.13-20241218	2024-12-18 02:43:55 +01:00
Anand Jain	45d8d33c88	btrfs: modload to print RAID1 balancing status Modified the Btrfs loading message to include the RAID1 balancing status if the experimental feature is enabled. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-12-18 02:42:33 +01:00
Anand Jain	e51f52e1d3	btrfs: enable RAID1 balancing configuration via modprobe parameter This update allows configuring the `raid1-balancing` methods using a modprobe parameter when experimental mode CONFIG_BTRFS_EXPERIMENTAL is enabled. Examples: - Set the RAID1 balancing method to round-robin with a custom `min_contiguous_read` of 192k: $ modprobe btrfs raid1-balancing=round-robin:196608 - Set the round-robin balancing method with the default `min_contiguous_read` of 256k: $ modprobe btrfs raid1-balancing=round-robin - Set the `devid` balancing method, defaulting to the latest device: $ modprobe btrfs raid1-balancing=devid Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-12-18 02:41:50 +01:00
Anand Jain	e2f11776f9	btrfs: expose experimental mode in module information Commit c9c49e8f157e ("btrfs: split out CONFIG_BTRFS_EXPERIMENTAL from CONFIG_BTRFS_DEBUG") introduces a way to enable or disable experimental features, print its status during module load, like so: Btrfs loaded, experimental=on, debug=on, assert=on, zoned=yes, fsverity=yes Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-12-18 02:41:50 +01:00
Anand Jain	2c0cf2b44b	btrfs: add RAID1 preferred read device When there's stale data on a mirrored device, this feature lets you choose which device to read from. Mainly used for testing. echo "devid:<devid-value>" > /sys/fs/btrfs/<UUID>/read_policy Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-12-18 02:41:50 +01:00
Anand Jain	eff96dae96	btrfs: introduce RAID1 round-robin read balancing This feature balances I/O across the striped devices when reading from RAID1 blocks. echo round-robin[:min_contiguous_read] > /sys/fs/btrfs/<uuid>/read_policy The min_contiguous_read parameter defines the minimum read size before switching to the next mirrored device. This setting is optional, with a default value of 256 KiB. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-12-18 02:41:49 +01:00
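One way to picture the role of min_contiguous_read in the entry above is a selector that stays on one mirror for a fixed amount of data before moving to the next. This is only a sketch under that assumption, not the btrfs implementation; demo_pick_mirror() and its parameters are hypothetical.

```c
#include <stdint.h>

/*
 * Hypothetical round-robin selector.
 * Assumptions: bytes_read is a running total of bytes read so far,
 * min_contiguous_read is non-zero, and num_mirrors is non-zero.
 * The mirror index advances once per min_contiguous_read bytes.
 */
unsigned int demo_pick_mirror(uint64_t bytes_read,
                              uint64_t min_contiguous_read,
                              unsigned int num_mirrors)
{
    return (unsigned int)((bytes_read / min_contiguous_read) % num_mirrors);
}
```

With two mirrors and the default of 256 KiB (262144 bytes), the first 256 KiB go to mirror 0, the next 256 KiB to mirror 1, and so on.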
Anand Jain	6b471f9f5c	btrfs: handle value associated with raid1 balancing parameter This change enables specifying additional configuration values alongside the raid1 balancing / read policy in a single input string. Updated btrfs_read_policy_to_enum() to parse and handle a value associated with the policy in the format `policy:value`; the value part, if present, is converted to a 64-bit integer. Update btrfs_read_policy_store() to accommodate the new parameter. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-12-18 02:41:49 +01:00
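The `policy:value` format described in the entry above can be parsed along the following lines. This is a userspace sketch under stated assumptions, not the body of btrfs_read_policy_to_enum(): demo_parse_policy() is a hypothetical helper, the value is assumed to be an unsigned decimal number, and a missing value is reported as 0.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical parser for "name[:value]". The string is split in place.
 * On success returns 0, leaves the NUL-terminated policy name in str and
 * stores the value in *value (0 when no value was given).
 * Returns -1 if the value part is empty or not a plain decimal number.
 */
int demo_parse_policy(char *str, uint64_t *value)
{
    char *sep = strchr(str, ':');
    char *end;

    *value = 0;
    if (!sep)
        return 0;               /* policy name only, no value */

    *sep = '\0';                /* terminate the name at the separator */
    if (sep[1] < '0' || sep[1] > '9')
        return -1;              /* rejects empty, signed and non-numeric values */

    errno = 0;
    *value = strtoull(sep + 1, &end, 10);
    if (errno != 0 || *end != '\0')
        return -1;              /* overflow or trailing garbage */

    return 0;
}
```

Given the input "round-robin:196608", the name becomes "round-robin" and the value 196608; given "devid", the name is unchanged and the value stays 0.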

96176 Commits