.. _mmu_notifier:

When do you need to notify inside page table lock?
===================================================

When clearing a pte/pmd we are given a choice to notify the event under the
page table lock, through the notify versions of \*_clear_flush, which call
mmu_notifier_invalidate_range(). That notification is not necessary in all
cases.

For secondary TLBs (non-CPU TLBs) such as an IOMMU TLB or a device TLB (when
the device uses something like ATS/PASID to get the IOMMU to walk the CPU page
table to access a process virtual address space), there are only 2 cases when
you need to notify those secondary TLBs while holding the page table lock when
clearing a pte/pmd:

  A) the page backing the address is freed before
     mmu_notifier_invalidate_range_end()
  B) a page table entry is updated to point to a new page (COW, write fault
     on zero page, __replace_page(), ...)

Case A is obvious: you do not want to take the risk of the device writing to
a page that might now be used by some completely different task.

Case B is more subtle. For correctness it requires the following sequence to
happen:

  - take page table lock
  - clear page table entry and notify ([pmd/pte]p_huge_clear_flush_notify())
  - set page table entry to point to new page

If clearing the page table entry is not followed by a notify before setting
the new pte/pmd value, then you can break a memory model like C11 or C++11 for
the device.

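
The case B sequence can be sketched with a toy userspace model. This is not
kernel code: ``toy_pte``, ``toy_dev_tlb``, ``toy_clear_flush_notify()`` and
``toy_cow_update()`` are hypothetical names made up for illustration, a boolean
flag stands in for the page table lock, and a single cached translation stands
in for the secondary TLB.

.. code-block:: c

   #include <assert.h>
   #include <stdbool.h>
   #include <stddef.h>

   /* Toy model only: none of these types or helpers exist in the kernel. */
   struct toy_page { int data; };

   struct toy_pte {
       bool locked;            /* stands in for the page table lock */
       struct toy_page *page;  /* NULL means the entry is cleared */
   };

   struct toy_dev_tlb {
       bool valid;
       struct toy_page *page;  /* translation cached by the device */
   };

   static void toy_lock(struct toy_pte *pte)
   {
       assert(!pte->locked);
       pte->locked = true;
   }

   static void toy_unlock(struct toy_pte *pte)
   {
       assert(pte->locked);
       pte->locked = false;
   }

   /* Device access: use the cached translation, else walk the page table. */
   static struct toy_page *toy_dev_translate(struct toy_dev_tlb *tlb,
                                             struct toy_pte *pte)
   {
       if (!tlb->valid) {
           tlb->page = pte->page;
           tlb->valid = (tlb->page != NULL);
       }
       return tlb->page;
   }

   /* Clear the entry and invalidate the device TLB while the lock is held. */
   static struct toy_page *toy_clear_flush_notify(struct toy_pte *pte,
                                                  struct toy_dev_tlb *tlb)
   {
       struct toy_page *old = pte->page;

       assert(pte->locked);
       pte->page = NULL;
       tlb->valid = false;     /* the "notify" part of the step */
       return old;
   }

   /* Case B sequence: lock, clear and notify, set the new page, unlock. */
   static void toy_cow_update(struct toy_pte *pte, struct toy_dev_tlb *tlb,
                              struct toy_page *new_page)
   {
       toy_lock(pte);
       toy_clear_flush_notify(pte, tlb);
       pte->page = new_page;
       toy_unlock(pte);
   }

In this model, once ``toy_cow_update()`` returns, the next call to
``toy_dev_translate()`` misses in the toy TLB and picks up the new page instead
of reusing the stale translation.
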
Consider the following scenario (the device uses a feature similar to
ATS/PASID):

Two addresses addrA and addrB such that \|addrA - addrB\| >= PAGE_SIZE; we
assume they are write protected for COW (other cases of B apply too).

::

 [Time N] --------------------------------------------------------------------
 CPU-thread-0  {try to write to addrA}
 CPU-thread-1  {try to write to addrB}
 CPU-thread-2  {}
 CPU-thread-3  {}
 DEV-thread-0  {read addrA and populate device TLB}
 DEV-thread-2  {read addrB and populate device TLB}
 [Time N+1] ------------------------------------------------------------------
 CPU-thread-0  {COW_step0: {mmu_notifier_invalidate_range_start(addrA)}}
 CPU-thread-1  {COW_step0: {mmu_notifier_invalidate_range_start(addrB)}}
 CPU-thread-2  {}
 CPU-thread-3  {}
 DEV-thread-0  {}
 DEV-thread-2  {}
 [Time N+2] ------------------------------------------------------------------
 CPU-thread-0  {COW_step1: {update page table to point to new page for addrA}}
 CPU-thread-1  {COW_step1: {update page table to point to new page for addrB}}
 CPU-thread-2  {}
 CPU-thread-3  {}
 DEV-thread-0  {}
 DEV-thread-2  {}
 [Time N+3] ------------------------------------------------------------------
 CPU-thread-0  {preempted}
 CPU-thread-1  {preempted}
 CPU-thread-2  {write to addrA which is a write to new page}
 CPU-thread-3  {}
 DEV-thread-0  {}
 DEV-thread-2  {}
 [Time N+3] ------------------------------------------------------------------
 CPU-thread-0  {preempted}
 CPU-thread-1  {preempted}
 CPU-thread-2  {}
 CPU-thread-3  {write to addrB which is a write to new page}
 DEV-thread-0  {}
 DEV-thread-2  {}
 [Time N+4] ------------------------------------------------------------------
 CPU-thread-0  {preempted}
 CPU-thread-1  {COW_step3: {mmu_notifier_invalidate_range_end(addrB)}}
 CPU-thread-2  {}
 CPU-thread-3  {}
 DEV-thread-0  {}
 DEV-thread-2  {}
 [Time N+5] ------------------------------------------------------------------
 CPU-thread-0  {preempted}
 CPU-thread-1  {}
 CPU-thread-2  {}
 CPU-thread-3  {}
 DEV-thread-0  {read addrA from old page}
 DEV-thread-2  {read addrB from new page}

So here, because at time N+2 the clearing of the page table entry was not
paired with a notification to invalidate the secondary TLB, the device sees the
new value for addrB before seeing the new value for addrA. This breaks total
memory ordering for the device.

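
The ordering problem can be reproduced in a small single-threaded simulation.
The sketch below is a toy model and not kernel code: ``toy_state``,
``toy_cow_step1()``, ``toy_range_end()`` and ``toy_run_scenario()`` are
hypothetical names, the interleaving of the threads is hard-coded, and the
clear, notify and set steps are collapsed into one function. Pages hold a
single integer, 0 before the CPU writes and 1 after.

.. code-block:: c

   #include <assert.h>
   #include <stdbool.h>

   /* Toy model only: indices 0 and 1 stand for addrA and addrB. */
   enum { ADDR_A, ADDR_B, NR_ADDR };

   struct toy_state {
       int old_page[NR_ADDR];     /* content of the original pages */
       int new_page[NR_ADDR];     /* content of the COW copies */
       bool pte_is_new[NR_ADDR];  /* page table points to the new page */
       bool tlb_valid[NR_ADDR];   /* device TLB caches a translation */
       bool tlb_is_new[NR_ADDR];  /* cached translation targets the new page */
   };

   struct toy_result { int a; int b; };

   /* Device read: reuse the cached translation, else walk the page table. */
   static int toy_dev_read(struct toy_state *s, int addr)
   {
       if (!s->tlb_valid[addr]) {
           s->tlb_is_new[addr] = s->pte_is_new[addr];
           s->tlb_valid[addr] = true;
       }
       return s->tlb_is_new[addr] ? s->new_page[addr] : s->old_page[addr];
   }

   /* COW page table update, with or without a notify under the lock. */
   static void toy_cow_step1(struct toy_state *s, int addr, bool notify)
   {
       s->new_page[addr] = s->old_page[addr];  /* copy the page */
       s->pte_is_new[addr] = true;
       if (notify)
           s->tlb_valid[addr] = false;         /* invalidate device TLB */
   }

   /* Stand-in for the invalidation done at range_end time. */
   static void toy_range_end(struct toy_state *s, int addr)
   {
       s->tlb_valid[addr] = false;
   }

   static struct toy_result toy_run_scenario(bool notify_under_lock)
   {
       struct toy_state s = { .old_page = { 0, 0 } };
       struct toy_result r;

       /* Time N: the device reads both addresses and fills its TLB. */
       toy_dev_read(&s, ADDR_A);
       toy_dev_read(&s, ADDR_B);

       /* Time N+2: both entries are switched to the COW copies. */
       toy_cow_step1(&s, ADDR_A, notify_under_lock);
       toy_cow_step1(&s, ADDR_B, notify_under_lock);

       /* Other CPU threads write addrA first, then addrB (new pages). */
       s.new_page[ADDR_A] = 1;
       s.new_page[ADDR_B] = 1;

       /* Only the addrB thread reaches range_end; the other is preempted. */
       toy_range_end(&s, ADDR_B);

       /* The device reads both addresses again. */
       r.a = toy_dev_read(&s, ADDR_A);
       r.b = toy_dev_read(&s, ADDR_B);
       return r;
   }

In this model ``toy_run_scenario(false)`` leaves the device with the old value
for addrA and the new value for addrB, while ``toy_run_scenario(true)`` lets it
observe the new value for both.
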
When changing a pte to write protect it, or to point it to a new write
protected page with the same content (KSM), it is fine to delay the
mmu_notifier_invalidate_range() call to mmu_notifier_invalidate_range_end(),
outside the page table lock. This is true even if the thread doing the page
table update is preempted right after releasing the page table lock but before
calling mmu_notifier_invalidate_range_end().
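
One way to read the KSM case is that a stale secondary TLB entry still leads to
a page with the same bytes, so a device read cannot tell the difference until
the delayed invalidation happens. The sketch below illustrates that reading
with a toy model; ``toy_page``, ``toy_ksm_merge()`` and ``toy_dev_read()`` are
hypothetical names, not kernel interfaces, and the model assumes the old page
is not freed or modified before the delayed invalidation (otherwise case A
applies).

.. code-block:: c

   #include <assert.h>
   #include <stdbool.h>
   #include <string.h>

   #define TOY_PAGE_SIZE 16

   /* Toy model only: a "page" is a small byte array. */
   struct toy_page { unsigned char data[TOY_PAGE_SIZE]; };

   /*
    * KSM-like merge: repoint the entry to a shared page with identical
    * content. No device TLB invalidation happens here; it is delayed.
    */
   static bool toy_ksm_merge(struct toy_page **pte, struct toy_page *shared)
   {
       if (memcmp((*pte)->data, shared->data, TOY_PAGE_SIZE) != 0)
           return false;   /* content differs: not a merge candidate */
       *pte = shared;
       return true;
   }

   /* What a device read returns through a given translation. */
   static unsigned char toy_dev_read(const struct toy_page *translation,
                                     int off)
   {
       assert(off >= 0 && off < TOY_PAGE_SIZE);
       return translation->data[off];
   }

Reading through a translation captured before ``toy_ksm_merge()`` and through
the updated entry returns the same byte at every offset in this model.
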