2024-04-15 16:35:23 +00:00
|
|
|
|
.. SPDX-License-Identifier: GPL-2.0
|
|
|
|
|
|
|
|
|
|
=====================
|
|
|
|
|
Introduction of mseal
|
|
|
|
|
=====================
|
|
|
|
|
|
|
|
|
|
:Author: Jeff Xu <jeffxu@chromium.org>
|
|
|
|
|
|
|
|
|
|
Modern CPUs support memory permissions such as RW and NX bits. The memory
|
|
|
|
|
permission feature improves security stance on memory corruption bugs, i.e.
|
|
|
|
|
the attacker can’t just write to arbitrary memory and point the code to it,
|
|
|
|
|
the memory has to be marked with X bit, or else an exception will happen.
|
|
|
|
|
|
|
|
|
|
Memory sealing additionally protects the mapping itself against
|
|
|
|
|
modifications. This is useful to mitigate memory corruption issues where a
|
|
|
|
|
corrupted pointer is passed to a memory management system. For example,
|
|
|
|
|
such an attacker primitive can break control-flow integrity guarantees
|
|
|
|
|
since read-only memory that is supposed to be trusted can become writable
|
|
|
|
|
or .text pages can get remapped. Memory sealing can automatically be
|
|
|
|
|
applied by the runtime loader to seal .text and .rodata pages and
|
|
|
|
|
applications can additionally seal security critical data at runtime.
|
|
|
|
|
|
|
|
|
|
A similar feature already exists in the XNU kernel with the
|
|
|
|
|
VM_FLAGS_PERMANENT flag [1] and on OpenBSD with the mimmutable syscall [2].
|
|
|
|
|
|
2024-10-08 04:09:41 +00:00
|
|
|
|
SYSCALL
|
|
|
|
|
=======
|
|
|
|
|
mseal syscall signature
|
|
|
|
|
-----------------------
|
|
|
|
|
``int mseal(void \* addr, size_t len, unsigned long flags)``
|
|
|
|
|
|
|
|
|
|
**addr**/**len**: virtual memory address range.
|
|
|
|
|
The address range set by **addr**/**len** must meet:
|
|
|
|
|
- The start address must be in an allocated VMA.
|
|
|
|
|
- The start address must be page aligned.
|
|
|
|
|
- The end address (**addr** + **len**) must be in an allocated VMA.
|
|
|
|
|
- no gap (unallocated memory) between start and end address.
|
|
|
|
|
|
|
|
|
|
The ``len`` will be paged aligned implicitly by the kernel.
|
|
|
|
|
|
|
|
|
|
**flags**: reserved for future use.
|
|
|
|
|
|
|
|
|
|
**Return values**:
|
|
|
|
|
- **0**: Success.
|
|
|
|
|
- **-EINVAL**:
|
|
|
|
|
* Invalid input ``flags``.
|
|
|
|
|
* The start address (``addr``) is not page aligned.
|
|
|
|
|
* Address range (``addr`` + ``len``) overflow.
|
|
|
|
|
- **-ENOMEM**:
|
|
|
|
|
* The start address (``addr``) is not allocated.
|
|
|
|
|
* The end address (``addr`` + ``len``) is not allocated.
|
|
|
|
|
* A gap (unallocated memory) between start and end address.
|
|
|
|
|
- **-EPERM**:
|
|
|
|
|
* sealing is supported only on 64-bit CPUs, 32-bit is not supported.
|
|
|
|
|
|
|
|
|
|
**Note about error return**:
|
|
|
|
|
- For above error cases, users can expect the given memory range is
|
|
|
|
|
unmodified, i.e. no partial update.
|
|
|
|
|
- There might be other internal errors/cases not listed here, e.g.
|
|
|
|
|
error during merging/splitting VMAs, or the process reaching the maximum
|
|
|
|
|
number of supported VMAs. In those cases, partial updates to the given
|
|
|
|
|
memory range could happen. However, those cases should be rare.
|
|
|
|
|
|
|
|
|
|
**Architecture support**:
|
|
|
|
|
mseal only works on 64-bit CPUs, not 32-bit CPUs.
|
|
|
|
|
|
|
|
|
|
**Idempotent**:
|
|
|
|
|
users can call mseal multiple times. mseal on an already sealed memory
|
|
|
|
|
is a no-action (not error).
|
|
|
|
|
|
|
|
|
|
**no munseal**
|
|
|
|
|
Once mapping is sealed, it can't be unsealed. The kernel should never
|
|
|
|
|
have munseal, this is consistent with other sealing feature, e.g.
|
|
|
|
|
F_SEAL_SEAL for file.
|
|
|
|
|
|
|
|
|
|
Blocked mm syscall for sealed mapping
|
|
|
|
|
-------------------------------------
|
|
|
|
|
It might be important to note: **once the mapping is sealed, it will
|
|
|
|
|
stay in the process's memory until the process terminates**.
|
|
|
|
|
|
|
|
|
|
Example::
|
|
|
|
|
|
|
|
|
|
*ptr = mmap(0, 4096, PROT_READ, MAP_ANONYMOUS | MAP_PRIVATE, 0, 0);
|
|
|
|
|
rc = mseal(ptr, 4096, 0);
|
|
|
|
|
/* munmap will fail */
|
|
|
|
|
rc = munmap(ptr, 4096);
|
|
|
|
|
assert(rc < 0);
|
|
|
|
|
|
|
|
|
|
Blocked mm syscall:
|
|
|
|
|
- munmap
|
|
|
|
|
- mmap
|
|
|
|
|
- mremap
|
|
|
|
|
- mprotect and pkey_mprotect
|
|
|
|
|
- some destructive madvise behaviors: MADV_DONTNEED, MADV_FREE,
|
|
|
|
|
MADV_DONTNEED_LOCKED, MADV_FREE, MADV_DONTFORK, MADV_WIPEONFORK
|
|
|
|
|
|
|
|
|
|
The first set of syscalls to block is munmap, mremap, mmap. They can
|
|
|
|
|
either leave an empty space in the address space, therefore allowing
|
|
|
|
|
replacement with a new mapping with new set of attributes, or can
|
|
|
|
|
overwrite the existing mapping with another mapping.
|
|
|
|
|
|
|
|
|
|
mprotect and pkey_mprotect are blocked because they changes the
|
|
|
|
|
protection bits (RWX) of the mapping.
|
|
|
|
|
|
|
|
|
|
Certain destructive madvise behaviors, specifically MADV_DONTNEED,
|
|
|
|
|
MADV_FREE, MADV_DONTNEED_LOCKED, and MADV_WIPEONFORK, can introduce
|
|
|
|
|
risks when applied to anonymous memory by threads lacking write
|
|
|
|
|
permissions. Consequently, these operations are prohibited under such
|
|
|
|
|
conditions. The aforementioned behaviors have the potential to modify
|
|
|
|
|
region contents by discarding pages, effectively performing a memset(0)
|
|
|
|
|
operation on the anonymous memory.
|
|
|
|
|
|
|
|
|
|
Kernel will return -EPERM for blocked syscalls.
|
|
|
|
|
|
|
|
|
|
When blocked syscall return -EPERM due to sealing, the memory regions may
|
|
|
|
|
or may not be changed, depends on the syscall being blocked:
|
|
|
|
|
|
|
|
|
|
- munmap: munmap is atomic. If one of VMAs in the given range is
|
|
|
|
|
sealed, none of VMAs are updated.
|
|
|
|
|
- mprotect, pkey_mprotect, madvise: partial update might happen, e.g.
|
|
|
|
|
when mprotect over multiple VMAs, mprotect might update the beginning
|
|
|
|
|
VMAs before reaching the sealed VMA and return -EPERM.
|
|
|
|
|
- mmap and mremap: undefined behavior.
|
|
|
|
|
|
|
|
|
|
Use cases
|
|
|
|
|
=========
|
2024-04-15 16:35:23 +00:00
|
|
|
|
- glibc:
|
|
|
|
|
The dynamic linker, during loading ELF executables, can apply sealing to
|
2024-10-08 04:09:41 +00:00
|
|
|
|
mapping segments.
|
2024-04-15 16:35:23 +00:00
|
|
|
|
|
2024-10-08 04:09:41 +00:00
|
|
|
|
- Chrome browser: protect some security sensitive data structures.
|
2024-04-15 16:35:23 +00:00
|
|
|
|
|
2024-10-08 04:09:41 +00:00
|
|
|
|
When not to use mseal
|
|
|
|
|
=====================
|
|
|
|
|
Applications can apply sealing to any virtual memory region from userspace,
|
|
|
|
|
but it is *crucial to thoroughly analyze the mapping's lifetime* prior to
|
|
|
|
|
apply the sealing. This is because the sealed mapping *won’t be unmapped*
|
|
|
|
|
until the process terminates or the exec system call is invoked.
|
2024-04-15 16:35:23 +00:00
|
|
|
|
|
|
|
|
|
For example:
|
2024-10-08 04:09:41 +00:00
|
|
|
|
- aio/shm
|
|
|
|
|
aio/shm can call mmap and munmap on behalf of userspace, e.g.
|
|
|
|
|
ksys_shmdt() in shm.c. The lifetimes of those mapping are not tied to
|
|
|
|
|
the lifetime of the process. If those memories are sealed from userspace,
|
|
|
|
|
then munmap will fail, causing leaks in VMA address space during the
|
|
|
|
|
lifetime of the process.
|
|
|
|
|
|
|
|
|
|
- ptr allocated by malloc (heap)
|
|
|
|
|
Don't use mseal on the memory ptr return from malloc().
|
|
|
|
|
malloc() is implemented by allocator, e.g. by glibc. Heap manager might
|
|
|
|
|
allocate a ptr from brk or mapping created by mmap.
|
|
|
|
|
If an app calls mseal on a ptr returned from malloc(), this can affect
|
|
|
|
|
the heap manager's ability to manage the mappings; the outcome is
|
|
|
|
|
non-deterministic.
|
|
|
|
|
|
|
|
|
|
Example::
|
|
|
|
|
|
|
|
|
|
ptr = malloc(size);
|
|
|
|
|
/* don't call mseal on ptr return from malloc. */
|
|
|
|
|
mseal(ptr, size);
|
|
|
|
|
/* free will success, allocator can't shrink heap lower than ptr */
|
|
|
|
|
free(ptr);
|
|
|
|
|
|
|
|
|
|
mseal doesn't block
|
|
|
|
|
===================
|
|
|
|
|
In a nutshell, mseal blocks certain mm syscall from modifying some of VMA's
|
|
|
|
|
attributes, such as protection bits (RWX). Sealed mappings doesn't mean the
|
|
|
|
|
memory is immutable.
|
2024-04-15 16:35:23 +00:00
|
|
|
|
|
|
|
|
|
As Jann Horn pointed out in [3], there are still a few ways to write
|
2024-10-08 04:09:41 +00:00
|
|
|
|
to RO memory, which is, in a way, by design. And those could be blocked
|
|
|
|
|
by different security measures.
|
2024-04-15 16:35:23 +00:00
|
|
|
|
|
|
|
|
|
Those cases are:
|
|
|
|
|
|
2024-10-08 04:09:41 +00:00
|
|
|
|
- Write to read-only memory through /proc/self/mem interface (FOLL_FORCE).
|
|
|
|
|
- Write to read-only memory through ptrace (such as PTRACE_POKETEXT).
|
|
|
|
|
- userfaultfd.
|
2024-04-15 16:35:23 +00:00
|
|
|
|
|
|
|
|
|
The idea that inspired this patch comes from Stephen Röttger’s work in V8
|
|
|
|
|
CFI [4]. Chrome browser in ChromeOS will be the first user of this API.
|
|
|
|
|
|
2024-10-08 04:09:41 +00:00
|
|
|
|
Reference
|
|
|
|
|
=========
|
|
|
|
|
- [1] https://github.com/apple-oss-distributions/xnu/blob/1031c584a5e37aff177559b9f69dbd3c8c3fd30a/osfmk/mach/vm_statistics.h#L274
|
|
|
|
|
- [2] https://man.openbsd.org/mimmutable.2
|
|
|
|
|
- [3] https://lore.kernel.org/lkml/CAG48ez3ShUYey+ZAFsU2i1RpQn0a5eOs2hzQ426FkcgnfUGLvA@mail.gmail.com
|
|
|
|
|
- [4] https://docs.google.com/document/d/1O2jwK4dxI3nRcOJuPYkonhTkNQfbmwdvxQMyXgeaRHo/edit#heading=h.bvaojj9fu6hc
|