Add a function which can be used to read fragments in the readahead call.
This function is necessary because filesystems built with the -tailends
(or -always-use-fragments) option may have fragments present which cannot
be currently handled.
Link: https://lkml.kernel.org/r/20220617083810.337573-5-hsinyi@chromium.org
Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk>
Signed-off-by: Hsin-Yi Wang <hsinyi@chromium.org>
Cc: Hou Tao <houtao1@huawei.com>
Cc: kernel test robot <lkp@intel.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miao Xie <miaoxie@huawei.com>
Cc: Xiongwei Song <Xiongwei.Song@windriver.com>
Cc: Zhang Yi <yi.zhang@huawei.com>
Cc: Zheng Liang <zhengliang6@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Implement readahead callback for squashfs. It will read datablocks which
cover pages in readahead request. For a few cases it will not mark page
as uptodate, including:
- file end is 0.
- zero filled blocks.
- current batch of pages isn't in the same datablock.
- decompressor error.
Otherwise pages will be marked as uptodate. The unhandled pages will be
updated by readpage later.
Link: https://lkml.kernel.org/r/20220617083810.337573-4-hsinyi@chromium.org
Signed-off-by: Hsin-Yi Wang <hsinyi@chromium.org>
Suggested-by: Matthew Wilcox <willy@infradead.org>
Reported-by: Matthew Wilcox <willy@infradead.org>
Reported-by: Phillip Lougher <phillip@squashfs.org.uk>
Reported-by: Xiongwei Song <Xiongwei.Song@windriver.com>
Reported-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Hou Tao <houtao1@huawei.com>
Cc: kernel test robot <lkp@intel.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Miao Xie <miaoxie@huawei.com>
Cc: Zhang Yi <yi.zhang@huawei.com>
Cc: Zheng Liang <zhengliang6@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Squashfs_readahead uses the "file direct" version of the page actor, and
so build it unconditionally.
Link: https://lkml.kernel.org/r/20220617083810.337573-3-hsinyi@chromium.org
Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk>
Signed-off-by: Hsin-Yi Wang <hsinyi@chromium.org>
Reported-by: kernel test robot <lkp@intel.com>
Cc: Hou Tao <houtao1@huawei.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miao Xie <miaoxie@huawei.com>
Cc: Xiongwei Song <Xiongwei.Song@windriver.com>
Cc: Zhang Yi <yi.zhang@huawei.com>
Cc: Zheng Liang <zhengliang6@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "Implement readahead for squashfs", v7.
Commit 9eec1d897139("squashfs: provide backing_dev_info in order to
disable read-ahead") mitigates the performance drop issue for squashfs by
closing readahead for it.
This series implements readahead callback for squashfs.
This patch (of 4):
This reverts 9eec1d897139e5 ("squashfs: provide backing_dev_info in order
to disable read-ahead").
Revert closing the readahead to squashfs since the readahead callback for
squashfs is implemented.
Link: https://lkml.kernel.org/r/20220617083810.337573-1-hsinyi@chromium.org
Link: https://lkml.kernel.org/r/20220617083810.337573-2-hsinyi@chromium.org
Signed-off-by: Hsin-Yi Wang <hsinyi@chromium.org>
Suggested-by: Xiongwei Song <Xiongwei.Song@windriver.com>
Cc: Phillip Lougher <phillip@squashfs.org.uk>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Zheng Liang <zhengliang6@huawei.com>
Cc: Zhang Yi <yi.zhang@huawei.com>
Cc: Hou Tao <houtao1@huawei.com>
Cc: Miao Xie <miaoxie@huawei.com>
Cc: kernel test robot <lkp@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
In this condition, it will cause an bug on error.
ocfs2_mkdir()
->ocfs2_mknod()
->ocfs2_mknod_locked()
->__ocfs2_mknod_locked()
//Assume inode->i_generation is genN.
->inode->i_generation = osb->s_next_generation++;
// The inode lockres has been initialized.
->ocfs2_populate_inode()
->ocfs2_create_new_inode_locks()
->An error happened, returned value is non-zero
// free the start_bit x in bg_blkno
->ocfs2_free_suballoc_bits()
->... /* Another process execute mkdir success in this place,
and it occupied the start_bit x in bg_blkno
which has been freed before. Its inode->i_generation
is genN + 1 */
->iput(inode)
->evict()
->ocfs2_evict_inode()
->ocfs2_delete_inode()
->ocfs2_inode_lock()
->ocfs2_inode_lock_update()
/* Bug on here, genN != genN + 1 */
->mlog_bug_on_msg(inode->i_generation !=
le32_to_cpu(fe->i_generation))
So, we need not to reclaim the inode when the inode->ip_inode_lockres
has been initialized. It will be freed in iput().
Link: http://lkml.kernel.org/r/ef080ca3-5d74-e276-17a1-d9e7c7e662c9@huawei.com
Fixes: b1529a41f777 ("ocfs2: should reclaim the inode if '__ocfs2_mknod_locked' returns an error")
Signed-off-by: Yan Wang <wangyan122@huawei.com>
Reviewed-by: Jun Piao <piaojun@huawei.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
In this condition, the inode can not be wiped when error happened.
ocfs2_mkdir()
->ocfs2_mknod()
->ocfs2_mknod_locked()
->__ocfs2_mknod_locked()
->ocfs2_set_links_count() // i_links_count is 2
-> ... // an error accrue, goto roll_back or leave.
->ocfs2_commit_trans()
->iput(inode)
->evict()
->ocfs2_evict_inode()
->ocfs2_delete_inode()
->ocfs2_inode_lock()
->ocfs2_inode_lock_update()
->ocfs2_refresh_inode()
->set_nlink(); // inode->i_nlink is 2 now.
/* if wipe is 0, it will goto bail_unlock_inode */
->ocfs2_query_inode_wipe()
->if (inode->i_nlink) return; // wipe is 0.
/* inode can not be wiped */
->ocfs2_wipe_inode()
So, we need clear links before the transaction committed.
Link: http://lkml.kernel.org/r/d8147c41-fb2b-bdf7-b660-1f3c8448c33f@huawei.com
Signed-off-by: Yan Wang <wangyan122@huawei.com>
Reviewed-by: Jun Piao <piaojun@huawei.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Running reflink from multiple nodes simultaneously to clone a file to the
same directory probably triggers a deadlock issue. For example, there is
a three node ocfs2 cluster, each node mounts the ocfs2 file system to
/mnt/shared, and run the reflink command from each node repeatedly, like
reflink "/mnt/shared/test" \
"/mnt/shared/.snapshots/test.`date +%m%d%H%M%S`.`hostname`"
then, reflink command process will be hung on each node, and you
can't list this file system directory.
The problematic reflink command process is blocked at one node,
task:reflink state:D stack: 0 pid: 1283 ppid: 4154
Call Trace:
__schedule+0x2fd/0x750
schedule+0x2f/0xa0
schedule_timeout+0x1cc/0x310
? ocfs2_control_cfu+0x50/0x50 [ocfs2_stack_user]
? 0xffffffffc0e3e000
wait_for_completion+0xba/0x140
? wake_up_q+0xa0/0xa0
__ocfs2_cluster_lock.isra.41+0x3b5/0x820 [ocfs2]
? ocfs2_inode_lock_full_nested+0x1fc/0x960 [ocfs2]
ocfs2_inode_lock_full_nested+0x1fc/0x960 [ocfs2]
ocfs2_init_security_and_acl+0xbe/0x1d0 [ocfs2]
ocfs2_reflink+0x436/0x4c0 [ocfs2]
? ocfs2_reflink_ioctl+0x2ca/0x360 [ocfs2]
ocfs2_reflink_ioctl+0x2ca/0x360 [ocfs2]
ocfs2_ioctl+0x25e/0x670 [ocfs2]
do_vfs_ioctl+0xa0/0x680
ksys_ioctl+0x70/0x80
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x5b/0x1e0
The other reflink command processes are blocked at other nodes,
task:reflink state:D stack: 0 pid:29759 ppid: 4088
Call Trace:
__schedule+0x2fd/0x750
schedule+0x2f/0xa0
schedule_timeout+0x1cc/0x310
? ocfs2_control_cfu+0x50/0x50 [ocfs2_stack_user]
? 0xffffffffc0b19000
wait_for_completion+0xba/0x140
? wake_up_q+0xa0/0xa0
__ocfs2_cluster_lock.isra.41+0x3b5/0x820 [ocfs2]
? ocfs2_inode_lock_full_nested+0x1fc/0x960 [ocfs2]
ocfs2_inode_lock_full_nested+0x1fc/0x960 [ocfs2]
ocfs2_mv_orphaned_inode_to_new+0x87/0x7e0 [ocfs2]
ocfs2_reflink+0x335/0x4c0 [ocfs2]
? ocfs2_reflink_ioctl+0x2ca/0x360 [ocfs2]
ocfs2_reflink_ioctl+0x2ca/0x360 [ocfs2]
ocfs2_ioctl+0x25e/0x670 [ocfs2]
do_vfs_ioctl+0xa0/0x680
ksys_ioctl+0x70/0x80
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x5b/0x1e0
or
task:reflink state:D stack: 0 pid:18465 ppid: 4156
Call Trace:
__schedule+0x302/0x940
? usleep_range+0x80/0x80
schedule+0x46/0xb0
schedule_timeout+0xff/0x140
? ocfs2_control_cfu+0x50/0x50 [ocfs2_stack_user]
? 0xffffffffc0c3b000
__wait_for_common+0xb9/0x170
__ocfs2_cluster_lock.constprop.0+0x1d6/0x860 [ocfs2]
? ocfs2_wait_for_recovery+0x49/0xd0 [ocfs2]
? ocfs2_inode_lock_full_nested+0x30f/0xa50 [ocfs2]
ocfs2_inode_lock_full_nested+0x30f/0xa50 [ocfs2]
ocfs2_inode_lock_tracker+0xf2/0x2b0 [ocfs2]
? dput+0x32/0x2f0
ocfs2_permission+0x45/0xe0 [ocfs2]
inode_permission+0xcc/0x170
link_path_walk.part.0.constprop.0+0x2a2/0x380
? path_init+0x2c1/0x3f0
path_parentat+0x3c/0x90
filename_parentat+0xc1/0x1d0
? filename_lookup+0x138/0x1c0
filename_create+0x43/0x160
ocfs2_reflink_ioctl+0xe6/0x380 [ocfs2]
ocfs2_ioctl+0x1ea/0x2c0 [ocfs2]
? do_sys_openat2+0x81/0x150
__x64_sys_ioctl+0x82/0xb0
do_syscall_64+0x61/0xb0
The deadlock is caused by multiple acquiring the destination directory
inode dlm lock in ocfs2_reflink function, we should acquire this directory
inode dlm lock at the beginning, and hold this dlm lock until end of the
function.
Link: https://lkml.kernel.org/r/20210729110230.18983-1-ghe@suse.com
Signed-off-by: Gang He <ghe@suse.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
In order to identify the type of memory a process has pinned through its
open fds, add the file path to fdinfo output. This allows identifying
memory types based on common prefixes: e.g. "/memfd...", "/dmabuf...",
"/dev/ashmem...".
To be cautious, only expose the paths for anonymous inodes, and this also
avoids printing path names with strange characters.
Access to /proc/<pid>/fdinfo is governed by PTRACE_MODE_READ_FSCREDS the
same as /proc/<pid>/maps which also exposes the file path of mappings; so
the security permissions for accessing path is consistent with that of
/proc/<pid>/maps.
Link: https://lkml.kernel.org/r/20220623220613.3014268-3-kaleshsingh@google.com
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian König <christian.koenig@amd.com>
Cc: Christoph Anton Mitterer <mail@christoph.anton.mitterer.name>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Colin Cross <ccross@google.com>
Cc: David Laight <David.Laight@ACULAB.COM>
Cc: Ioannis Ilkos <ilkos@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Paul Gortmaker<paul.gortmaker@windriver.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Stephen Brennan <stephen.s.brennan@oracle.com>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: T.J. Mercier <tjmercier@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "procfs: Add file path and size to /proc/<pid>/fdinfo", v2.
Processes can pin shared memory by keeping a handle to it through a
file descriptor; for instance dmabufs, memfd, and ashmem (in Android).
In the case of a memory leak, to identify the process pinning the
memory, userspace needs to:
- Iterate the /proc/<pid>/fd/* for each process
- Do a readlink on each entry to identify the type of memory from
the file path.
- stat() each entry to get the size of the memory.
The file permissions on /proc/<pid>/fd/* only allows for the owner
or root to perform the operations above; and so is not suitable for
capturing the system-wide state in a production environment.
This issue was addressed for dmabufs by making /proc/*/fdinfo/*
accessible to a process with PTRACE_MODE_READ_FSCREDS credentials[1]
To allow the same kind of tracking for other types of shared memory,
add the following fields to /proc/<pid>/fdinfo/<fd>:
path - This allows identifying the type of memory based on common
prefixes: e.g. "/memfd...", "/dmabuf...", "/dev/ashmem..."
This was not an issued when dmabuf tracking was introduced
because the exp_name field of dmabuf fdinfo could be used
to distinguish dmabuf fds from other types.
size - To track the amount of memory that is being pinned.
dmabufs expose size as an additional field in fdinfo. Remove
this and make it a common field for all fds.
Access to /proc/<pid>/fdinfo is governed by PTRACE_MODE_READ_FSCREDS
-- the same as for /proc/<pid>/maps which also exposes the path and
size for mapped memory regions.
This allows for a system process with PTRACE_MODE_READ_FSCREDS to
account the pinned per-process memory via fdinfo.
This patch (of 2):
To be able to account the amount of memory a process is keeping pinned by
open file descriptors add a 'size' field to fdinfo output.
dmabufs fds already expose a 'size' field for this reason, remove this and
make it a common field for all fds. This allows tracking of other types
of memory (e.g. memfd and ashmem in Android).
Link: https://lkml.kernel.org/r/20220623220613.3014268-1-kaleshsingh@google.com
Link: https://lkml.kernel.org/r/20220623220613.3014268-2-kaleshsingh@google.com
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Stephen Brennan <stephen.s.brennan@oracle.com>
Cc: David Laight <David.Laight@ACULAB.COM>
Cc: Ioannis Ilkos <ilkos@google.com>
Cc: T.J. Mercier <tjmercier@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Anton Mitterer <mail@christoph.anton.mitterer.name>
Cc: Colin Cross <ccross@google.com>
Cc: Paul Gortmaker<paul.gortmaker@windriver.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Pss is the sum of the sizes of clean and dirty private pages, and the
proportional sizes of clean and dirty shared pages:
Private = Private_Dirty + Private_Clean
Shared_Proportional = Shared_Dirty_Proportional + Shared_Clean_Proportional
Pss = Private + Shared_Proportional
The Shared*Proportional fields are not present in smaps, so it is not
always possible to determine how much of the Pss is from dirty pages and
how much is from clean pages. This information can be useful for
measuring memory usage for the purpose of optimisation, since clean pages
can usually be discarded by the kernel immediately while dirty pages
cannot.
The smaps routines in the kernel already have access to this data, so add
a Pss_Dirty to show it to userspace. Pss_Clean is not added since it can
be calculated from Pss and Pss_Dirty.
Link: https://lkml.kernel.org/r/20220620081251.2928103-1-vincent.whitchurch@axis.com
Signed-off-by: Vincent Whitchurch <vincent.whitchurch@axis.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
When we use objcg APIs to charge the LRU pages, the page will not hold a
reference to the memcg associated with the page. So the caller of the
{folio,page}_memcg() should hold an rcu read lock or obtain a reference to
the memcg associated with the page to protect memcg from being released.
So introduce get_mem_cgroup_from_{page,folio}() to obtain a reference to
the memory cgroup associated with the page.
In this patch, make all the callers hold an rcu read lock or obtain a
reference to the memcg to protect memcg from being released when the LRU
pages reparented.
We do not need to adjust the callers of {folio,page}_memcg() during the
whole process of mem_cgroup_move_task(). Because the cgroup migration and
memory cgroup offlining are serialized by @cgroup_mutex. In this routine,
the LRU pages cannot be reparented to its parent memory cgroup. So
{folio,page}_memcg() is stable and cannot be released.
This is a preparation for reparenting the LRU pages.
Link: https://lkml.kernel.org/r/20220621125658.64935-8-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Waiman Long <longman@redhat.com>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The page fault path checks THP eligibility with __transhuge_page_enabled()
which does the similar thing as hugepage_vma_check(), so use
hugepage_vma_check() instead.
However page fault allows DAX and !anon_vma cases, so added a new flag,
in_pf, to hugepage_vma_check() to make page fault work correctly.
The in_pf flag is also used to skip shmem and file THP for page fault
since shmem handles THP in its own shmem_fault() and file THP allocation
on fault is not supported yet.
Also remove hugepage_vma_enabled() since hugepage_vma_check() is the only
caller now, it is not necessary to have a helper function.
Link: https://lkml.kernel.org/r/20220616174840.1202070-6-shy828301@gmail.com
Signed-off-by: Yang Shi <shy828301@gmail.com>
Reviewed-by: Zach O'Keefe <zokeefe@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The transparent_hugepage_active() was introduced to show THP eligibility
bit in smaps in proc, smaps is the only user. But it actually does the
similar check as hugepage_vma_check() which is used by khugepaged. We
definitely don't have to maintain two similar checks, so kill
transparent_hugepage_active().
This patch also fixed the wrong behavior for VM_NO_KHUGEPAGED vmas.
Also move hugepage_vma_check() to huge_memory.c and huge_mm.h since it
is not only for khugepaged anymore.
Link: https://lkml.kernel.org/r/20220616174840.1202070-5-shy828301@gmail.com
Signed-off-by: Yang Shi <shy828301@gmail.com>
Reviewed-by: Zach O'Keefe <zokeefe@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Historically, it has been shown that intercepting kernel faults with
userfaultfd (thereby forcing the kernel to wait for an arbitrary amount of
time) can be exploited, or at least can make some kinds of exploits
easier. So, in 37cd0575b8 "userfaultfd: add UFFD_USER_MODE_ONLY" we
changed things so, in order for kernel faults to be handled by
userfaultfd, either the process needs CAP_SYS_PTRACE, or this sysctl must
be configured so that any unprivileged user can do it.
In a typical implementation of a hypervisor with live migration (take
QEMU/KVM as one such example), we do indeed need to be able to handle
kernel faults. But, both options above are less than ideal:
- Toggling the sysctl increases attack surface by allowing any
unprivileged user to do it.
- Granting the live migration process CAP_SYS_PTRACE gives it this
ability, but *also* the ability to "observe and control the
execution of another process [...], and examine and change [its]
memory and registers" (from ptrace(2)). This isn't something we need
or want to be able to do, so granting this permission violates the
"principle of least privilege".
This is all a long winded way to say: we want a more fine-grained way to
grant access to userfaultfd, without granting other additional permissions
at the same time.
To achieve this, add a /dev/userfaultfd misc device. This device provides
an alternative to the userfaultfd(2) syscall for the creation of new
userfaultfds. The idea is, any userfaultfds created this way will be able
to handle kernel faults, without the caller having any special
capabilities. Access to this mechanism is instead restricted using e.g.
standard filesystem permissions.
Link: https://lkml.kernel.org/r/20220601210951.3916598-3-axelrasmussen@google.com
Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Charan Teja Kalla <charante@codeaurora.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dmitry V. Levin <ldv@altlinux.org>
Cc: Gleb Fotengauer-Malinovskiy <glebfm@altlinux.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nadav Amit <namit@vmware.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Shuah Khan <skhan@linuxfoundation.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Zhang Yi <yi.zhang@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Introduce xfs_mmaplock_two_inodes_and_break_dax_layout() for dax files who
are going to be deduped. After that, call compare range function only
when files are both DAX or not.
Link: https://lkml.kernel.org/r/20220603053738.1218681-15-ruansy.fnst@fujitsu.com
Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dan Williams <dan.j.wiliams@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Goldwyn Rodrigues <rgoldwyn@suse.com>
Cc: Goldwyn Rodrigues <rgoldwyn@suse.de>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Ritesh Harjani <riteshh@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
In fsdax mode, WRITE and ZERO on a shared extent need CoW performed.
After that, new allocated extents needs to be remapped to the file. So,
add a CoW identification in ->iomap_begin(), and implement ->iomap_end()
to do the remapping work.
Link: https://lkml.kernel.org/r/20220603053738.1218681-14-ruansy.fnst@fujitsu.com
Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dan Williams <dan.j.wiliams@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Goldwyn Rodrigues <rgoldwyn@suse.com>
Cc: Goldwyn Rodrigues <rgoldwyn@suse.de>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Ritesh Harjani <riteshh@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
With dax we cannot deal with readpage() etc. So, we create a dax
comparison function which is similar with vfs_dedupe_file_range_compare().
And introduce dax_remap_file_range_prep() for filesystem use.
Link: https://lkml.kernel.org/r/20220603053738.1218681-13-ruansy.fnst@fujitsu.com
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dan Williams <dan.j.wiliams@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Goldwyn Rodrigues <rgoldwyn@suse.de>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Ritesh Harjani <riteshh@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Punch hole on a reflinked file needs dax_iomap_cow_copy() too. Otherwise,
data in not aligned area will be not correct. So, add the CoW operation
for not aligned case in dax_memzero().
Link: https://lkml.kernel.org/r/20220603053738.1218681-12-ruansy.fnst@fujitsu.com
Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dan Williams <dan.j.wiliams@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Goldwyn Rodrigues <rgoldwyn@suse.com>
Cc: Goldwyn Rodrigues <rgoldwyn@suse.de>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Replace the existing entry to the newly allocated one in case of CoW.
Also, we mark the entry as PAGECACHE_TAG_TOWRITE so writeback marks this
entry as writeprotected. This helps us snapshots so new write pagefaults
after snapshots trigger a CoW.
Link: https://lkml.kernel.org/r/20220603053738.1218681-11-ruansy.fnst@fujitsu.com
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dan Williams <dan.j.wiliams@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Goldwyn Rodrigues <rgoldwyn@suse.de>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
In the case where the iomap is a write operation and iomap is not equal to
srcmap after iomap_begin, we consider it is a CoW operation.
In this case, the destination (iomap->addr) points to a newly allocated
extent. It is needed to copy the data from srcmap to the extent. In
theory, it is better to copy the head and tail ranges which is outside of
the non-aligned area instead of copying the whole aligned range. But in
dax page fault, it will always be an aligned range. So copy the whole
range in this case.
Link: https://lkml.kernel.org/r/20220603053738.1218681-10-ruansy.fnst@fujitsu.com
Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dan Williams <dan.j.wiliams@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Goldwyn Rodrigues <rgoldwyn@suse.com>
Cc: Goldwyn Rodrigues <rgoldwyn@suse.de>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Ritesh Harjani <riteshh@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Add address output in dax_iomap_pfn() in order to perform a memcpy() in
CoW case. Since this function both output address and pfn, rename it to
dax_iomap_direct_access().
Link: https://lkml.kernel.org/r/20220603053738.1218681-9-ruansy.fnst@fujitsu.com
Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dan Williams <dan.j.wiliams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Goldwyn Rodrigues <rgoldwyn@suse.com>
Cc: Goldwyn Rodrigues <rgoldwyn@suse.de>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Introduce a PAGE_MAPPING_DAX_COW flag to support association with CoW file
mappings. In this case, since the dax-rmap has already took the
responsibility to look up for shared files by given dax page, the
page->mapping is no longer to used for rmap but for marking that this dax
page is shared. And to make sure disassociation works fine, we use
page->index as refcount, and clear page->mapping to the initial state when
page->index is decreased to 0.
With the help of this new flag, it is able to distinguish normal case and
CoW case, and keep the warning in normal case.
Link: https://lkml.kernel.org/r/20220603053738.1218681-8-ruansy.fnst@fujitsu.com
Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dan Williams <dan.j.wiliams@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Goldwyn Rodrigues <rgoldwyn@suse.com>
Cc: Goldwyn Rodrigues <rgoldwyn@suse.de>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Ritesh Harjani <riteshh@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Introduce xfs_notify_failure.c to handle failure related works, such as
implement ->notify_failure(), register/unregister dax holder in xfs, and
so on.
If the rmap feature of XFS enabled, we can query it to find files and
metadata which are associated with the corrupt data. For now all we do is
kill processes with that file mapped into their address spaces, but future
patches could actually do something about corrupt metadata.
After that, the memory failure needs to notify the processes who are using
those files.
Link: https://lkml.kernel.org/r/20220603053738.1218681-7-ruansy.fnst@fujitsu.com
Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dan Williams <dan.j.wiliams@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Goldwyn Rodrigues <rgoldwyn@suse.com>
Cc: Goldwyn Rodrigues <rgoldwyn@suse.de>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Ritesh Harjani <riteshh@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The current dax_lock_page() locks dax entry by obtaining mapping and index
in page. To support 1-to-N RMAP in NVDIMM, we need a new function to lock
a specific dax entry corresponding to this file's mapping,index. And
output the page corresponding to the specific dax entry for caller use.
Link: https://lkml.kernel.org/r/20220603053738.1218681-5-ruansy.fnst@fujitsu.com
Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dan Williams <dan.j.wiliams@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Goldwyn Rodrigues <rgoldwyn@suse.com>
Cc: Goldwyn Rodrigues <rgoldwyn@suse.de>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Ritesh Harjani <riteshh@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>