Linux kernel source tree
Go to file
Christian Brauner 2a010c4128 fs: don't block i_writecount during exec
Back in 2021 we already discussed removing deny_write_access() for
executables. Back then I was hesistant because I thought that this might
cause issues in userspace. But even back then I had started taking some
notes on what could potentially depend on this and I didn't come up with
a lot so I've changed my mind and I would like to try this.

Here are some of the notes that I took:

(1) The deny_write_access() mechanism is causing really pointless issues
    such as [1]. If a thread in a thread-group opens a file writable,
    then writes some stuff, then closing the file descriptor and then
    calling execve() they can fail the execve() with ETXTBUSY because
    another thread in the thread-group could have concurrently called
    fork(). Multi-threaded libraries such as go suffer from this.

(2) There are userspace attacks that rely on overwriting the binary of a
    running process. These attacks are _mitigated_ but _not at all
    prevented_ from ocurring by the deny_write_access() mechanism.

    I'll go over some details. The clearest example of such attacks was
    the attack against runC in CVE-2019-5736 (cf. [3]).

    An attack could compromise the runC host binary from inside a
    _privileged_ runC container. The malicious binary could then be used
    to take over the host.

    (It is crucial to note that this attack is _not_ possible with
     unprivileged containers. IOW, the setup here is already insecure.)

    The attack can be made when attaching to a running container or when
    starting a container running a specially crafted image. For example,
    when runC attaches to a container the attacker can trick it into
    executing itself.

    This could be done by replacing the target binary inside the
    container with a custom binary pointing back at the runC binary
    itself. As an example, if the target binary was /bin/bash, this
    could be replaced with an executable script specifying the
    interpreter path #!/proc/self/exe.

    As such when /bin/bash is executed inside the container, instead the
    target of /proc/self/exe will be executed. That magic link will
    point to the runc binary on the host. The attacker can then proceed
    to write to the target of /proc/self/exe to try and overwrite the
    runC binary on the host.

    However, this will not succeed because of deny_write_access(). Now,
    one might think that this would prevent the attack but it doesn't.

    To overcome this, the attacker has multiple ways:
    * Open a file descriptor to /proc/self/exe using the O_PATH flag and
      then proceed to reopen the binary as O_WRONLY through
      /proc/self/fd/<nr> and try to write to it in a busy loop from a
      separate process. Ultimately it will succeed when the runC binary
      exits. After this the runC binary is compromised and can be used
      to attack other containers or the host itself.
    * Use a malicious shared library annotating a function in there with
      the constructor attribute making the malicious function run as an
      initializor. The malicious library will then open /proc/self/exe
      for creating a new entry under /proc/self/fd/<nr>. It'll then call
      exec to a) force runC to exit and b) hand the file descriptor off
      to a program that then reopens /proc/self/fd/<nr> for writing
      (which is now possible because runC has exited) and overwriting
      that binary.

    To sum up: the deny_write_access() mechanism doesn't prevent such
    attacks in insecure setups. It just makes them minimally harder.
    That's all.

    The only way back then to prevent this is to create a temporary copy
    of the calling binary itself when it starts or attaches to
    containers. So what I did back then for LXC (and Aleksa for runC)
    was to create an anonymous, in-memory file using the memfd_create()
    system call and to copy itself into the temporary in-memory file,
    which is then sealed to prevent further modifications. This sealed,
    in-memory file copy is then executed instead of the original on-disk
    binary.

    Any compromising write operations from a privileged container to the
    host binary will then write to the temporary in-memory binary and
    not to the host binary on-disk, preserving the integrity of the host
    binary. Also as the temporary, in-memory binary is sealed, writes to
    this will also fail.

    The point is that deny_write_access() is uselss to prevent these
    attacks.

(3) Denying write access to an inode because it's currently used in an
    exec path could easily be done on an LSM level. It might need an
    additional hook but that should be about it.

(4) The MAP_DENYWRITE flag for mmap() has been deprecated a long time
    ago so while we do protect the main executable the bigger portion of
    the things you'd think need protecting such as the shared libraries
    aren't. IOW, we let anyone happily overwrite shared libraries.

(5) We removed all remaining uses of VM_DENYWRITE in [2]. That means:
    (5.1) We removed the legacy uselib() protection for preventing
          overwriting of shared libraries. Nobody cared in 3 years.
    (5.2) We allow write access to the elf interpreter after exec
          completed treating it on a par with shared libraries.

Yes, someone in userspace could potentially be relying on this. It's not
completely out of the realm of possibility but let's find out if that's
actually the case and not guess.

Link: https://github.com/golang/go/issues/22315 [1]
Link: 49624efa65 ("Merge tag 'denywrite-for-5.15' of git://github.com/davidhildenbrand/linux") [2]
Link: https://unit42.paloaltonetworks.com/breaking-docker-via-runc-explaining-cve-2019-5736 [3]
Link: https://lwn.net/Articles/866493
Link: https://github.com/golang/go/issues/22220
Link: 5bf8c0cf09/src/cmd/go/internal/work/buildid.go (L724)
Link: 5bf8c0cf09/src/cmd/go/internal/work/exec.go (L1493)
Link: 5bf8c0cf09/src/cmd/go/internal/script/cmds.go (L457)
Link: 5bf8c0cf09/src/cmd/go/internal/test/test.go (L1557)
Link: 5bf8c0cf09/src/os/exec/lp_linux_test.go (L61)
Link: https://github.com/buildkite/agent/pull/2736
Link: https://github.com/rust-lang/rust/issues/114554
Link: https://bugs.openjdk.org/browse/JDK-8068370
Link: https://github.com/dotnet/runtime/issues/58964
Link: https://lore.kernel.org/r/20240531-vfs-i_writecount-v1-1-a17bea7ee36b@kernel.org
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-06-03 15:52:10 +02:00
arch 16 hotfixes, 11 of which are cc:stable. 2024-05-25 15:10:33 -07:00
block block-6.10-20240523 2024-05-23 13:44:47 -07:00
certs kbuild: use $(src) instead of $(srctree)/$(src) for source directory 2024-05-10 04:34:52 +09:00
crypto This push fixes a bug in the new ecc P521 code as well as a buggy 2024-05-20 08:47:54 -07:00
Documentation 16 hotfixes, 11 of which are cc:stable. 2024-05-25 15:10:33 -07:00
drivers ipmi: Mostly updates for deprecated interfaces 2024-05-25 14:32:29 -07:00
fs fs: don't block i_writecount during exec 2024-06-03 15:52:10 +02:00
include fhandle: relax open_by_handle_at() permission checks 2024-05-28 15:57:23 +02:00
init Driver core changes for 6.10-rc1 2024-05-22 12:13:40 -07:00
io_uring io_uring-6.10-20240523 2024-05-23 13:41:49 -07:00
ipc Mainly singleton patches, documented in their respective changelogs. 2024-05-19 14:02:03 -07:00
kernel fs: don't block i_writecount during exec 2024-06-03 15:52:10 +02:00
lib 16 hotfixes, 11 of which are cc:stable. 2024-05-25 15:10:33 -07:00
LICENSES LICENSES: Add the copyleft-next-0.3.1 license 2022-11-08 15:44:01 +01:00
mm 16 hotfixes, 11 of which are cc:stable. 2024-05-25 15:10:33 -07:00
net NFS client updates for Linux 6.10 2024-05-23 13:51:09 -07:00
rust LoongArch changes for v6.10 2024-05-22 09:43:07 -07:00
samples tracing/treewide: Remove second parameter of __assign_str() 2024-05-22 20:14:47 -04:00
scripts - A series ("kbuild: enable more warnings by default") from Arnd 2024-05-22 18:59:29 -07:00
security hardening fixes for v6.10-rc1 2024-05-24 08:33:44 -07:00
sound sound fixes for 6.10-rc1 2024-05-24 08:48:51 -07:00
tools Revert a patch causing a regression as described in the cset: 2024-05-26 09:54:26 -07:00
usr kbuild: use $(src) instead of $(srctree)/$(src) for source directory 2024-05-10 04:34:52 +09:00
virt The usual shower of singleton fixes and minor series all over MM, 2024-05-19 09:21:03 -07:00
.clang-format clang-format: Update with v6.7-rc4's for_each macro list 2023-12-08 23:54:38 +01:00
.cocciconfig scripts: add Linux .cocciconfig for coccinelle 2016-07-22 12:13:39 +02:00
.editorconfig Add .editorconfig file for basic formatting 2023-12-28 16:22:47 +09:00
.get_maintainer.ignore Add Jeff Kirsher to .get_maintainer.ignore 2024-03-08 11:36:54 +00:00
.gitattributes .gitattributes: set diff driver for Rust source code files 2023-05-31 17:48:25 +02:00
.gitignore kbuild: create a list of all built DTB files 2024-02-19 18:20:39 +09:00
.mailmap mailmap: update email address for Satya Priya 2024-05-24 11:55:05 -07:00
.rustfmt.toml rust: add .rustfmt.toml 2022-09-28 09:02:20 +02:00
COPYING COPYING: state that all contributions really are covered by this file 2020-02-10 13:32:20 -08:00
CREDITS MAINTAINERS: Drop Gustavo Pimentel as PCI DWC Maintainer 2024-03-27 13:41:02 -05:00
Kbuild Kbuild updates for v6.1 2022-10-10 12:00:45 -07:00
Kconfig kbuild: ensure full rebuild when the compiler is updated 2020-05-12 13:28:33 +09:00
MAINTAINERS RTC for 6.10 2024-05-25 13:33:53 -07:00
Makefile Linux 6.10-rc1 2024-05-26 15:20:12 -07:00
README README: Fix spelling 2024-03-18 03:36:32 -06:00

Linux kernel
============

There are several guides for kernel developers and users. These guides can
be rendered in a number of formats, like HTML and PDF. Please read
Documentation/admin-guide/README.rst first.

In order to build the documentation, use ``make htmldocs`` or
``make pdfdocs``.  The formatted documentation can also be read online at:

    https://www.kernel.org/doc/html/latest/

There are various text files in the Documentation/ subdirectory,
several of them using the reStructuredText markup notation.

Please read the Documentation/process/changes.rst file, as it contains the
requirements for building and running the kernel, and information about
the problems which may result by upgrading your kernel.