Commit Graph

100634 Commits

Author SHA1 Message Date
Philipp Rudo
8aec395b84 kernel/kexec_file.c: use read-only sections in arch_kexec_apply_relocations*
When the relocations are applied to the purgatory only the section the
relocations are applied to is writable.  The other sections, i.e.  the
symtab and .rel/.rela, are in read-only kexec_purgatory.  Highlight this
by marking the corresponding variables as 'const'.

While at it also change the signatures of arch_kexec_apply_relocations* to
take section pointers instead of just the index of the relocation section.
This removes the second lookup and sanity check of the sections in arch
code.

Link: http://lkml.kernel.org/r/20180321112751.22196-6-prudo@linux.vnet.ibm.com
Signed-off-by: Philipp Rudo <prudo@linux.vnet.ibm.com>
Acked-by: Dave Young <dyoung@redhat.com>
Cc: AKASHI Takahiro <takahiro.akashi@linaro.org>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-13 17:10:28 -07:00
Philipp Rudo
65c225d328 kernel/kexec_file.c: make purgatory_info->ehdr const
The kexec_purgatory buffer is read-only.  Thus all pointers into
kexec_purgatory are read-only, too.  Point this out by explicitly
marking purgatory_info->ehdr as 'const' and update the comments in
purgatory_info.

Link: http://lkml.kernel.org/r/20180321112751.22196-4-prudo@linux.vnet.ibm.com
Signed-off-by: Philipp Rudo <prudo@linux.vnet.ibm.com>
Acked-by: Dave Young <dyoung@redhat.com>
Cc: AKASHI Takahiro <takahiro.akashi@linaro.org>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-13 17:10:28 -07:00
Philipp Rudo
ee6ebeda8d include/linux/kexec.h: silence compile warnings
Patch series "kexec_file: Clean up purgatory load", v2.

Following the discussion with Dave and AKASHI, here are the common code
patches extracted from my recent patch set (Add kexec_file_load support
to s390) [1].  The patches were extracted to allow upstream integration
together with AKASHI's common code patches before the arch code gets
adjusted to the new base.

The reason for this series is to prepare common code for adding
kexec_file_load to s390 as well as cleaning up the mis-use of the
sh_offset field during purgatory load.  In detail this series contains:

Patch #1&2: Minor cleanups/fixes.

Patch #3-9: Clean up the purgatory load/relocation code.  Especially
remove the mis-use of the purgatory_info->sechdrs->sh_offset field,
currently holding a pointer into either kexec_purgatory (ro) or
purgatory_buf (rw) depending on the section.  With these patches the
section address will be calculated verbosely and sh_offset will contain
the offset of the section in the stripped purgatory binary
(purgatory_buf).

Patch #10: Allows architectures to set the purgatory load address.  This
patch is important for s390 as the kernel and purgatory have to be
loaded to fixed addresses.  In current code this is impossible as the
purgatory load is opaque to the architecture.

Patch #11: Moves x86 purgatories sha implementation to common lib/
directory to allow reuse in other architectures.

This patch (of 11)

When building the kernel with CONFIG_KEXEC_FILE enabled gcc prints a
compile warning multiple times.

  In file included from <path>/linux/init/initramfs.c:526:0:
  <path>/include/linux/kexec.h:120:9: warning: `struct kimage' declared inside parameter list [enabled by default]
           unsigned long cmdline_len);
           ^

This is because the typedefs for kexec_file_load uses struct kimage
before it is declared.  Fix this by simply forward declaring struct
kimage.

Link: http://lkml.kernel.org/r/20180321112751.22196-2-prudo@linux.vnet.ibm.com
Signed-off-by: Philipp Rudo <prudo@linux.vnet.ibm.com>
Acked-by: Dave Young <dyoung@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: AKASHI Takahiro <takahiro.akashi@linaro.org>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-13 17:10:27 -07:00
AKASHI Takahiro
babac4a84a kexec_file, x86: move re-factored code to generic side
In the previous patches, commonly-used routines, exclude_mem_range() and
prepare_elf64_headers(), were carved out.  Now place them in kexec
common code.  A prefix "crash_" is given to each of their names to avoid
possible name collisions.

Link: http://lkml.kernel.org/r/20180306102303.9063-8-takahiro.akashi@linaro.org
Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org>
Acked-by: Dave Young <dyoung@redhat.com>
Tested-by: Dave Young <dyoung@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Baoquan He <bhe@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-13 17:10:27 -07:00
AKASHI Takahiro
9ec4ecef0a kexec_file,x86,powerpc: factor out kexec_file_ops functions
As arch_kexec_kernel_image_{probe,load}(),
arch_kimage_file_post_load_cleanup() and arch_kexec_kernel_verify_sig()
are almost duplicated among architectures, they can be commonalized with
an architecture-defined kexec_file_ops array.  So let's factor them out.

Link: http://lkml.kernel.org/r/20180306102303.9063-3-takahiro.akashi@linaro.org
Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org>
Acked-by: Dave Young <dyoung@redhat.com>
Tested-by: Dave Young <dyoung@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-13 17:10:27 -07:00
Linus Torvalds
681857ef0d Merge branch 'parisc-4.17-2' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux
Pull parisc updates from Helge Deller:

 - fix panic when halting system via "shutdown -h now"

 - drop own coding in favour of generic CONFIG_COMPAT_BINFMT_ELF
   implementation

 - add FPE_CONDTRAP constant: last outstanding parisc-specific cleanup
   for Eric Biedermans siginfo patches

 - move some functions to .init and some to .text.hot linker sections

* 'parisc-4.17-2' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
  parisc: Prevent panic at system halt
  parisc: Switch to generic COMPAT_BINFMT_ELF
  parisc: Move cache flush functions into .text.hot section
  parisc/signal: Add FPE_CONDTRAP for conditional trap handling
2018-04-12 17:07:04 -07:00
Linus Torvalds
4ac1800f81 We decided to request the latest three patches to be merged into this
merge window while it's still open.
 
 1. The first patch adds a new function to lockref: lockref_put_not_zero
 2. The second patch fixes GFS2's glock dump code so it uses the new lockref
    function. This fixes a problem whereby lock dumps could miss glocks.
 3. I made a minor patch to update some comments and fix the lock ordering
    text in our gfs2-glocks.txt Documentation file.
 -----BEGIN PGP SIGNATURE-----
 
 iQEcBAABAgAGBQJaz6pdAAoJENeLYdPf93o71wMH/0cEo34xWiScRM07EgLmZZ3q
 YXMvpTvrwK+9i2u8anxiX1smezHeS+7jPrYOG8AGu3IZvKYGTDOwoIY9pxESy5gs
 1Rf60s6pPE/dkTSqPaNNuBxPrM1yVyRWOPx04LxC5BCXhsS/6U2RS9ElxGDe7Nyq
 P66z1wfm63+erDR7mKSuOL3Ejtglj2EPcrAupaBlRS0wjdUQ9ORyrZBpT6JMOWqd
 HWjchrzWVAqx+iyLHlKZjTyPHsPaUBaj1fuv/Vcgu5sJmEJ9mF4s/GQTdwIzi8ip
 ByD7MfilyrT7dxRm1uw8OJ7TvqNeaCtxsyNGGBOlSx81s/pk5Vhs8bevnczNvi8=
 =jWsi
 -----END PGP SIGNATURE-----

Merge tag 'gfs2-4.17.fixes2' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2

Pull more gfs2 updates from Bob Peterson:
 "We decided to request the latest three patches to be merged into this
  merge window while it's still open.

   - The first patch adds a new function to lockref:
     lockref_put_not_zero

   - The second patch fixes GFS2's glock dump code so it uses the new
     lockref function. This fixes a problem whereby lock dumps could
     miss glocks.

   - I made a minor patch to update some comments and fix the lock
     ordering text in our gfs2-glocks.txt Documentation file"

* tag 'gfs2-4.17.fixes2' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
  GFS2: Minor improvements to comments and documentation
  gfs2: Stop using rhashtable_walk_peek
  lockref: Add lockref_put_not_zero
2018-04-12 13:00:44 -07:00
Linus Torvalds
a1bf4c7da6 NFS client updates for Linux 4.17
Stable bugfixes:
 - xprtrdma: Fix corner cases when handling device removal # v4.12+
 - xprtrdma: Fix latency regression on NUMA NFS/RDMA clients # v4.15+
 
 Features:
 - New sunrpc tracepoint for RPC pings
 - Finer grained NFSv4 attribute checking
 - Don't unnecessarily return NFS v4 delegations
 
 Other bugfixes and cleanups:
 - Several other small NFSoRDMA cleanups
 - Improvements to the sunrpc RTT measurements
 - A few sunrpc tracepoint cleanups
 - Various fixes for NFS v4 lock notifications
 - Various sunrpc and NFS v4 XDR encoding cleanups
 - Switch to the ida_simple API
 - Fix NFSv4.1 exclusive create
 - Forget acl cache after setattr operation
 - Don't advance the nfs_entry readdir cookie if xdr decoding fails
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAlrNG1IACgkQ18tUv7Cl
 QOvotw//fQoUgQ/AOJGlZo/4ws2mGJN3dfwwKM8xYOnHaxppOYubZRHwvswK8d22
 +XR/Q6IVbUxI3mJluv1L0d9CJT06s3c9CO90McIJbk4CWihGP19bNIY4JiPlzrbv
 4FDiyOvMBej2UXbHX5EzKj0srxyBoEVf3iUAIa6DaHi3c6EIUo6fP3d2eRNJStqd
 WMyZs+nqr2W9biyClxntT7l/Sk+o+4I7M3Oo9pjjS+PiePYdaMrL5T1kPeHaJshF
 GMGXkbvVdqpDRiXX84R9+2/nuSiA15eEnaR94UNvs84oLR3qob3ZhxhudqFdSPrX
 RS6E7m34gY/EaQm/wbB26PZm+3jHd4Pqm5SKLbyFfoCmG6oMwBvXNRJZas1DFaHM
 CMOECvfAr6kixVLkAN0MNQ2Ku/FuJ52OLP1dRLmxsblocnhEPujc6RSz6Ju/v3a0
 adbpmJMA2IoSGgXMu3g1VGnjHfMj7ZmjtpigXVvlcUqQGCL7t4ngh23cpeTQeJ76
 bMwSHUQu18NbmtJjBTE+PIm7mdCrpQD7ZuOPWpK62zxLYUnnv7nm75m84DrDru7d
 XAmrCmdUJNrVWQs6BAtCXgO4PZ6xNGLosb0xTQXTAQYftc+DRJ9SW/VGc0Mp1L9m
 0G0iz++b8cy4Pih5UCDJcCkpjCIvHLcn72zn1kbufWqG3xr2koc=
 =IlWo
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-4.17-1' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client updates from Anna Schumaker:
 "Stable bugfixes:
   - xprtrdma: Fix corner cases when handling device removal # v4.12+
   - xprtrdma: Fix latency regression on NUMA NFS/RDMA clients # v4.15+

  Features:
   - New sunrpc tracepoint for RPC pings
   - Finer grained NFSv4 attribute checking
   - Don't unnecessarily return NFS v4 delegations

  Other bugfixes and cleanups:
   - Several other small NFSoRDMA cleanups
   - Improvements to the sunrpc RTT measurements
   - A few sunrpc tracepoint cleanups
   - Various fixes for NFS v4 lock notifications
   - Various sunrpc and NFS v4 XDR encoding cleanups
   - Switch to the ida_simple API
   - Fix NFSv4.1 exclusive create
   - Forget acl cache after setattr operation
   - Don't advance the nfs_entry readdir cookie if xdr decoding fails"

* tag 'nfs-for-4.17-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (47 commits)
  NFS: advance nfs_entry cookie only after decoding completes successfully
  NFSv3/acl: forget acl cache after setattr
  NFSv4.1: Fix exclusive create
  NFSv4: Declare the size up to date after it was set.
  nfs: Use ida_simple API
  NFSv4: Fix the nfs_inode_set_delegation() arguments
  NFSv4: Clean up CB_GETATTR encoding
  NFSv4: Don't ask for attributes when ACCESS is protected by a delegation
  NFSv4: Add a helper to encode/decode struct timespec
  NFSv4: Clean up encode_attrs
  NFSv4; Clean up XDR encoding of type bitmap4
  NFSv4: Allow GFP_NOIO sleeps in decode_attr_owner/decode_attr_group
  SUNRPC: Add a helper for encoding opaque data inline
  SUNRPC: Add helpers for decoding opaque and string types
  NFSv4: Ignore change attribute invalidations if we hold a delegation
  NFS: More fine grained attribute tracking
  NFS: Don't force unnecessary cache invalidation in nfs_update_inode()
  NFS: Don't redirty the attribute cache in nfs_wcc_update_inode()
  NFS: Don't force a revalidation of all attributes if change is missing
  NFS: Convert NFS_INO_INVALID flags to unsigned long
  ...
2018-04-12 12:55:50 -07:00
Linus Torvalds
7214dd4ea9 Merge branch 'work.thaw' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs thaw updates from Al Viro:
 "An ancient series that has fallen through the cracks in the previous
  cycle"

* 'work.thaw' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  buffer.c: call thaw_super during emergency thaw
  vfs: factor sb iteration out of do_emergency_remount
2018-04-12 12:28:32 -07:00
Linus Torvalds
19e8a2f875 Merge branch 'afs-dh' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull AFS updates from Al Viro:
 "The AFS series posted by dhowells depended upon lookup_one_len()
  rework; now that prereq is in the mainline, that series had been
  rebased on top of it and got some exposure and testing..."

* 'afs-dh' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  afs: Do better accretion of small writes on newly created content
  afs: Add stats for data transfer operations
  afs: Trace protocol errors
  afs: Locally edit directory data for mkdir/create/unlink/...
  afs: Adjust the directory XDR structures
  afs: Split the directory content defs into a header
  afs: Fix directory handling
  afs: Split the dynroot stuff out and give it its own ops tables
  afs: Keep track of invalid-before version for dentry coherency
  afs: Rearrange status mapping
  afs: Make it possible to get the data version in readpage
  afs: Init inode before accessing cache
  afs: Introduce a statistics proc file
  afs: Dump bad status record
  afs: Implement @cell substitution handling
  afs: Implement @sys substitution handling
  afs: Prospectively look up extra files when doing a single lookup
  afs: Don't over-increment the cell usage count when pinning it
  afs: Fix checker warnings
  vfs: Remove the const from dir_context::actor
2018-04-12 11:59:06 -07:00
Linus Torvalds
5d1365940a Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) In ip_gre tunnel, handle the conflict between TUNNEL_{SEQ,CSUM} and
    GSO/LLTX properly. From Sabrina Dubroca.

 2) Stop properly on error in lan78xx_read_otp(), from Phil Elwell.

 3) Don't uncompress in slip before rstate is initialized, from Tejaswi
    Tanikella.

 4) When using 1.x firmware on aquantia, issue a deinit before we
    hardware reset the chip, otherwise we break dirty wake WOL. From
    Igor Russkikh.

 5) Correct log check in vhost_vq_access_ok(), from Stefan Hajnoczi.

 6) Fix ethtool -x crashes in bnxt_en, from Michael Chan.

 7) Fix races in l2tp tunnel creation and duplicate tunnel detection,
    from Guillaume Nault.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (22 commits)
  l2tp: fix race in duplicate tunnel detection
  l2tp: fix races in tunnel creation
  tun: send netlink notification when the device is modified
  tun: set the flags before registering the netdevice
  lan78xx: Don't reset the interface on open
  bnxt_en: Fix NULL pointer dereference at bnxt_free_irq().
  bnxt_en: Need to include RDMA rings in bnxt_check_rings().
  bnxt_en: Support max-mtu with VF-reps
  bnxt_en: Ignore src port field in decap filter nodes
  bnxt_en: do not allow wildcard matches for L2 flows
  bnxt_en: Fix ethtool -x crash when device is down.
  vhost: return bool from *_access_ok() functions
  vhost: fix vhost_vq_access_ok() log check
  vhost: Fix vhost_copy_to_user()
  net: aquantia: oops when shutdown on already stopped device
  net: aquantia: Regression on reset with 1.x firmware
  cdc_ether: flag the Cinterion AHS8 modem by gemalto as WWAN
  slip: Check if rstate is initialized before uncompressing
  lan78xx: Avoid spurious kevent 4 "error"
  lan78xx: Correctly indicate invalid OTP
  ...
2018-04-12 11:09:05 -07:00
Linus Torvalds
67a7a8fff8 xen: fixes for 4.17-rc1
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEhRJncuj2BJSl0Jf3sN6d1ii/Ey8FAlrPnM8ACgkQsN6d1ii/
 Ey9Kzwf/eQVb6zzn7FDHAb6pLaZ5i2xi2xohsKmhAVQIEa94rZ3mLoRegtnIfyjO
 RcjjSAzHSZO9NQgNA2ALdu6bBdzu4/ywQEQCnY2Gqxp0ocG/+k3p/FqLHZGdcqPo
 e3gpcVxHSFWUCCGm1t3umI25driqrUq4xa6UFi2IB4djDvTrK/JsSygKx6GiVujL
 2eV7v7rgqaaVZQyo8iOd+LlWuKZewKLfnALUDC21X5J2HmvfoyTdn85kldzbiIsG
 YR7mcfgAtAVTyCfgXI3eqAGpRFEyqR4ga87oahdV3/iW+4wreh4hm2Xd/IETXklv
 Epxyet8IlMB9886PuZhZqgnW6o1RDA==
 =z3bP
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-4.17-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip

Pull xen fixes from Juergen Gross:
 "A few fixes of Xen related core code and drivers"

* tag 'for-linus-4.17-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
  xen/pvh: Indicate XENFEAT_linux_rsdp_unrestricted to Xen
  xen/acpi: off by one in read_acpi_id()
  xen/acpi: upload _PSD info for non Dom0 CPUs too
  x86/xen: Delay get_cpu_cap until stack canary is established
  xen: xenbus_dev_frontend: Verify body of XS_TRANSACTION_END
  xen: xenbus: Catch closing of non existent transactions
  xen: xenbus_dev_frontend: Fix XS_TRANSACTION_END handling
2018-04-12 11:04:35 -07:00
Linus Torvalds
cb098d50ec * Fix 2032 time access issues and new compiler warnings
* minor regression test cleanup
    * formatting fixes for end user use of kdb
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJaz048AAoJEIciOldedpOjdSEP/i07tDKf/A7cFIsRgJgXO4hV
 M3fB3Kzr1DYrrfhWtWfjez/H7ScmYgNSwH7lsP8YibrpvwwxXblsE67zlg7w3oll
 qaGx7zVvBRwHo/0xCJicM7sb3Ey5KX3/ycCpRTmJvj+ywnKlMed6oTU/N9V7mBR0
 ScFpst/omZEkJzYJQwkZPpW8A1zxWYKp/F3g8jAOSz50/S2RWjzSFfg7Efm7+ND7
 IRo/Qcvj+gRxTJyEHxS0wU2EO1egnGLjHmzl1PZMq5X0WsWSUYJ7s6faYh/geuiD
 KFsIapYhRm3SEtgFmCnrVySk3GfdjaU+XDRPzSQk9qehySxU/oZdZbwtaI8YFo3t
 HvoMyvZg4B3BSU1s4WqGyo97Ug2T3z58V2mnfU0IiDH5wiiFg3uCNoBY7CQXG+GP
 wzPheSD+rWVAlcKuuNOQfufIkHrtWhJzjOPsVs4GfgOnZg6T1N7p40+i+hW6JNNi
 K2NTTc7o/SZ7P7de5RibuaGnvE9zCVPpag27Zsasvhrh3BKriBv1ijYUXVbgoImL
 sCFnERUYnR2M4iIAX2oMXyyW5KoiNJWCr+XaEmaYeoCOCcO2FQwo6J3SiNf2WZ4K
 BXZ4LlvTFqG1ew/GCcWxenCo5mtEqPvt9eyAF2R0CCgiP4m2SG6sEB4JkvJBvoI9
 ZtJBLWguNYJyBwbKqKaq
 =zz/y
 -----END PGP SIGNATURE-----

Merge tag 'for_linus-4.16' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/kgdb

Pull kdb updates from Jason Wessel:

 - fix 2032 time access issues and new compiler warnings

 - minor regression test cleanup

 - formatting fixes for end user use of kdb

* tag 'for_linus-4.16' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/kgdb:
  kdb: use memmove instead of overlapping memcpy
  kdb: use ktime_get_mono_fast_ns() instead of ktime_get_ts()
  kdb: bl: don't use tab character in output
  kdb: drop newline in unknown command output
  kdb: make "mdr" command repeat
  kdb: use __ktime_get_real_seconds instead of __current_kernel_time
  misc: kgdbts: Display progress of asynchronous tests
2018-04-12 10:21:19 -07:00
Andreas Gruenbacher
450b1f6f56 lockref: Add lockref_put_not_zero
Put a lockref unless the lockref is dead or its count would become zero.
This is the same as lockref_put_or_lock except that the lock is never
left held.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2018-04-12 09:41:19 -07:00
Linus Torvalds
c17b0aadb7 asm-generic fixes for v4.17-rc1
I have one regression fix for a minor build problem after the architecture
 removal series, plus a rework of the barriers in the readl/writel
 functions, thanks to work by Sinan Kaya:
 
 This started from a discussion on the linuxpcc and rdma mailing lists
 [1]. To summarize, we decided that architectures are responsible to
 serialize readl() and writel() accesses on a device MMIO space relative
 to DMA performed by that device.
 
 This series provides a pessimistic implementation of that behavior for
 asm-generic/io.h, which is in turn used by a number of architectures
 (h8300, microblaze, nios2, openrisc, s390, sparc, um, unicore32, and
 xtensa). Some of those presumably need no extra barriers, or something
 weaker than rmb()/wmb(), and they are advised to override the new default
 for better performance.
 
 For inb()/outb(), the same barriers are used, but architectures might
 want to add another barrier to outb() here if that can guarantee
 non-posted behavior (some architectures can, others cannot do that).
 
 The readl_relaxed()/writel_relaxed() family of functions retains the
 existing behavior with no extra barriers.
 
 [1]: https://lists.ozlabs.org/pipermail/linuxppc-dev/2018-March/170481.html
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJazitHAAoJEGCrR//JCVInd0wP/iMzr1HWDgMjeeuxekFjwWDg
 9fL+BFt1afeYb4wniqJcF7ymLow/H5Fbhj4dwM1p34De+CZ3+3JGNyK8qzoeKPjR
 I2U5QqjWCHWDqpWRGWxO28dbs5/1EoW1zgctTNMUPHiamnomz9XIn0xaVKpu4HZ3
 OtaeJm8seKTSj1+A2fye9sDpqMUJuVcnZAWJgqMJ8T98uMBOiJYWHftnFEJpSlwG
 SJSt4AYsJnE+3BFawX1g3VWrHn9WN1uwVasJ1INFkLYNuLMYaK7RYjoBWNwHW+RQ
 luq4xZE+HZehyZptilfs05x2IlhGSOVN5m0nVM2if9aXoEoO1UdaySbwO6Ukq085
 VyfCzY+k4l0v44o4JqaSyAFLEae0809E6cQcGg3cjdstQv1Q3cgAJ96myP0x+QTw
 b0xJGoo46eOfqpK4njARyjTSceYPgzkB5Dqngg9rCuh+EogotWpRRDB6zoeGGRK8
 oOzMp0qLsAZFcYvjft5h0Cp6X51qfyJpBkJkvnASmF4yJPZlpCRGux+HM3jFb9bV
 zbH+KPqTa47OmOK8MNIaFHMR1yMgZU6B2oEwFDEaG0M+6FC5irMSkgcDwIIMJXlJ
 wLp7+4WhwFzFDe1mp/tKM5V4h9D6vQtSUjgOJffhxRXqCMkxc7eABmYBBkjMCsca
 ibKXyZN16d1kRU9j7upb
 =oBQh
 -----END PGP SIGNATURE-----

Merge tag 'asm-generic' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic

Pull asm-generic fixes from Arnd Bergmann:
 "I have one regression fix for a minor build problem after the
  architecture removal series, plus a rework of the barriers in the
  readl/writel functions, thanks to work by Sinan Kaya:

  This started from a discussion on the linuxpcc and rdma mailing
  lists[1]. To summarize, we decided that architectures are responsible
  to serialize readl() and writel() accesses on a device MMIO space
  relative to DMA performed by that device.

  This series provides a pessimistic implementation of that behavior for
  asm-generic/io.h, which is in turn used by a number of architectures
  (h8300, microblaze, nios2, openrisc, s390, sparc, um, unicore32, and
  xtensa). Some of those presumably need no extra barriers, or something
  weaker than rmb()/wmb(), and they are advised to override the new
  default for better performance.

  For inb()/outb(), the same barriers are used, but architectures might
  want to add another barrier to outb() here if that can guarantee
  non-posted behavior (some architectures can, others cannot do that).

  The readl_relaxed()/writel_relaxed() family of functions retains the
  existing behavior with no extra barriers"

[1] https://lists.ozlabs.org/pipermail/linuxppc-dev/2018-March/170481.html

* tag 'asm-generic' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic:
  io: change writeX_relaxed() to remove barriers
  io: change readX_relaxed() to remove barriers
  dts: remove cris & metag dts hard link file
  io: change inX() to have their own IO barrier overrides
  io: change outX() to have their own IO barrier overrides
  io: define stronger ordering for the default writeX() implementation
  io: define stronger ordering for the default readX() implementation
  io: define several IO & PIO barrier types for the asm-generic version
2018-04-12 09:15:48 -07:00
Linus Torvalds
e241e3f2bf virtio: feature
This adds reporting hugepage stats to virtio-balloon.
 
 Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 
 iQEcBAABAgAGBQJaziF/AAoJECgfDbjSjVRpVu8H/Aw8MRgCDNx85w6HdruPeJWx
 NzRGAlZLaCnTc23PJ+bcAeribyPSeuTIj3M7QOMaY1fVGV8MmpQfS5lzdvmL9vJ/
 Lug/7f+QNYLlao1QlszVg+4n79BRtXvH6qOdS+nj8zvTbm/pCr3ec/yrBv4Rfqy5
 TWrZcceQ7Jhw/7EF7AFUxkmw2/TpRV/4yF9wOgDabshAytdN3PAzs38IYtOa+BLp
 bUiJTXGPeYe0M4qkZ6zfwU2fLZqc2DCSFAagPb8jU46OfcViH8/fYfPbm5kQ7X81
 LlSOg/ui6+ZJPHWzDjDy8N/HWpi0Qqbbdne60pKJC7dPlyQMRb2m5w6TqivmPyg=
 =QwFg
 -----END PGP SIGNATURE-----

Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost

Pull virtio update from Michael Tsirkin:
 "This adds reporting hugepage stats to virtio-balloon"

* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
  virtio_balloon: export hugetlb page allocation counts
2018-04-11 18:58:27 -07:00
Linus Torvalds
e5c372280b IOMMU Updates for Linux v4.17
These updates come with:
 
 	- OF_IOMMU support for the Rockchip iommu driver so that it can
 	  use generic DT bindings
 
 	- Rework of locking in the AMD IOMMU interrupt remapping code to
 	  make it work better in RT kernels
 
 	- Support for improved iotlb flushing in the AMD IOMMU driver
 
 	- Support for 52-bit physical and virtual addressing in the
 	  ARM-SMMU
 
 	- Various other small fixes and cleanups
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABAgAGBQJazi2BAAoJECvwRC2XARrjYVwP/AyXK7CjRvaiHFopIUO0WpwY
 V3GiKrODtSNHqPSKuFnqIssIhxZPw/SKFz6E/pe08pZ/pxHYTxeTL78Wz7D1+4Sp
 n0YokSM5qLb660OTQVnyKNCku8cEMCb9hkQ/75SFgwcILQYF93cZBDIdBn93OKVO
 6xAOE+tqd8Daulnk0YpdiCTFTJPzYHPl6B7scoUav26uaKxWeMJxeYe+EXC+4WQG
 U1u/jDiVXyllzGgRqqfrmO4L2acmsK8HL97hD4+m1URJKDlb8ho6xwaRThFZWqXS
 SbrYnvH0ruWGrLiQKmVUssw8FqbcXCzq3236g2O8jE4jqWSm70twg+q31iMjwD7v
 bwsJGMkk7aLrquv9Zpaylpf8tRECk5bjhTFC2zB0pdum5XLx47j0IHKWMLPYhkCz
 E0pBefvuhoSTbt/5X0urSRzH2Hk4ljEsM+QjlfH8SN3ALTljFjay607wbxC7t35M
 LEL5AuNsDDBddoJIi9D13CdJEZa4lps8dbpB8m40lQVvmiLPLcKraaG0RfKQ397T
 wsxhsDOQYM2FCwfUP3n8RTsMKRIp/UVkKY+2G7AsKofciSeulK6nDbrV7jFnitx4
 vTxbRgpNejJpqzKZG/W9lCGWk1BhmQK/Cbu6JW5IA4+ew9omWkFp61U6rtc645Te
 6cNEYBiMz/RZIiC2b18J
 =kte5
 -----END PGP SIGNATURE-----

Merge tag 'iommu-updates-v4.17' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu

Pull IOMMU updates from Joerg Roedel:

 - OF_IOMMU support for the Rockchip iommu driver so that it can use
   generic DT bindings

 - rework of locking in the AMD IOMMU interrupt remapping code to make
   it work better in RT kernels

 - support for improved iotlb flushing in the AMD IOMMU driver

 - support for 52-bit physical and virtual addressing in the ARM-SMMU

 - various other small fixes and cleanups

* tag 'iommu-updates-v4.17' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (53 commits)
  iommu/io-pgtable-arm: Avoid warning with 32-bit phys_addr_t
  iommu/rockchip: Support sharing IOMMU between masters
  iommu/rockchip: Add runtime PM support
  iommu/rockchip: Fix error handling in init
  iommu/rockchip: Use OF_IOMMU to attach devices automatically
  iommu/rockchip: Use IOMMU device for dma mapping operations
  dt-bindings: iommu/rockchip: Add clock property
  iommu/rockchip: Control clocks needed to access the IOMMU
  iommu/rockchip: Fix TLB flush of secondary IOMMUs
  iommu/rockchip: Use iopoll helpers to wait for hardware
  iommu/rockchip: Fix error handling in attach
  iommu/rockchip: Request irqs in rk_iommu_probe()
  iommu/rockchip: Fix error handling in probe
  iommu/rockchip: Prohibit unbind and remove
  iommu/amd: Return proper error code in irq_remapping_alloc()
  iommu/amd: Make amd_iommu_devtable_lock a spin_lock
  iommu/amd: Drop the lock while allocating new irq remap table
  iommu/amd: Factor out setting the remap table for a devid
  iommu/amd: Use `table' instead `irt' as variable name in amd_iommu_update_ga()
  iommu/amd: Remove the special case from alloc_irq_table()
  ...
2018-04-11 18:50:41 -07:00
Linus Torvalds
1fe43114ea More power management updates for 4.17-rc1
- Rework the idle loop in order to prevent CPUs from spending too
    much time in shallow idle states by making it stop the scheduler
    tick before putting the CPU into an idle state only if the idle
    duration predicted by the idle governor is long enough.  That
    required the code to be reordered to invoke the idle governor
    before stopping the tick, among other things (Rafael Wysocki,
    Frederic Weisbecker, Arnd Bergmann).
 
  - Add the missing description of the residency sysfs attribute to
    the cpuidle documentation (Prashanth Prakash).
 
  - Finalize the cpufreq cleanup moving frequency table validation
    from drivers to the core (Viresh Kumar).
 
  - Fix a clock leak regression in the armada-37xx cpufreq driver
    (Gregory Clement).
 
  - Fix the initialization of the CPU performance data structures
    for shared policies in the CPPC cpufreq driver (Shunyong Yang).
 
  - Clean up the ti-cpufreq, intel_pstate and CPPC cpufreq drivers
    a bit (Viresh Kumar, Rafael Wysocki).
 
  - Mark the expected switch fall-throughs in the PM QoS core (Gustavo
    Silva).
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJazfv7AAoJEILEb/54YlRx/kYP+gPOX5O5cFF22Y2xvDHPMWjm
 D/3Nc2aRo+5DuHHECSIJ3ZVQzVoamN5zQ1KbsBRV0bJgwim4fw4M199Jr/0I2nES
 1pkByuxLrAtwb83uX3uBIQnwgKOAwRftOTeVaFaMoXgIbyUqK7ZFkGq0xQTnKqor
 6+J+78O7wMaIZ0YXQP98BC6g96vs/f+ICrh7qqY85r4NtO/thTA1IKevBmlFeIWR
 yVhEYgwSFBaWehKK8KgbshmBBEk3qzDOYfwZF/JprPhiN/6madgHgYjHC8Seok5c
 QUUTRlyO1ULTQe4JulyJUKobx7HE9u/FXC0RjbBiKPnYR4tb9Hd8OpajPRZo96AT
 8IQCdzL2Iw/ZyQsmQZsWeO1HwPTwVlF/TO2gf6VdQtH221izuHG025p8/RcZe6zb
 fTTFhh6/tmBvmOlbKMwxaLbGbwcj/5W5GvQXlXAtaElLobwwNEcEyVfF4jo4Zx/U
 DQc7agaAps67lcgFAqNDy0PoU6bxV7yoiAIlTJHO9uyPkDNyIfb0ZPlmdIi3xYZd
 tUD7C+VBezrNCkw7JWL1xXLFfJ5X7K6x5bi9I7TBj1l928Hak0dwzs7KlcNBtF1Y
 SwnJsNa3kxunGsPajya8dy5gdO0aFeB9Bse0G429+ugk2IJO/Q9M9nQUArJiC9Xl
 Gw1bw5Ynv6lx+r5EqxHa
 =Pnk4
 -----END PGP SIGNATURE-----

Merge tag 'pm-4.17-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull more power management updates from Rafael Wysocki:
 "These include one big-ticket item which is the rework of the idle loop
  in order to prevent CPUs from spending too much time in shallow idle
  states. It reduces idle power on some systems by 10% or more and may
  improve performance of workloads in which the idle loop overhead
  matters. This has been in the works for several weeks and it has been
  tested and reviewed quite thoroughly.

  Also included are changes that finalize the cpufreq cleanup moving
  frequency table validation from drivers to the core, a few fixes and
  cleanups of cpufreq drivers, a cpuidle documentation update and a PM
  QoS core update to mark the expected switch fall-throughs in it.

  Specifics:

   - Rework the idle loop in order to prevent CPUs from spending too
     much time in shallow idle states by making it stop the scheduler
     tick before putting the CPU into an idle state only if the idle
     duration predicted by the idle governor is long enough.

     That required the code to be reordered to invoke the idle governor
     before stopping the tick, among other things (Rafael Wysocki,
     Frederic Weisbecker, Arnd Bergmann).

   - Add the missing description of the residency sysfs attribute to the
     cpuidle documentation (Prashanth Prakash).

   - Finalize the cpufreq cleanup moving frequency table validation from
     drivers to the core (Viresh Kumar).

   - Fix a clock leak regression in the armada-37xx cpufreq driver
     (Gregory Clement).

   - Fix the initialization of the CPU performance data structures for
     shared policies in the CPPC cpufreq driver (Shunyong Yang).

   - Clean up the ti-cpufreq, intel_pstate and CPPC cpufreq drivers a
     bit (Viresh Kumar, Rafael Wysocki).

   - Mark the expected switch fall-throughs in the PM QoS core (Gustavo
     Silva)"

* tag 'pm-4.17-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (23 commits)
  tick-sched: avoid a maybe-uninitialized warning
  cpufreq: Drop cpufreq_table_validate_and_show()
  cpufreq: SCMI: Don't validate the frequency table twice
  cpufreq: CPPC: Initialize shared perf capabilities of CPUs
  cpufreq: armada-37xx: Fix clock leak
  cpufreq: CPPC: Don't set transition_latency
  cpufreq: ti-cpufreq: Use builtin_platform_driver()
  cpufreq: intel_pstate: Do not include debugfs.h
  PM / QoS: mark expected switch fall-throughs
  cpuidle: Add definition of residency to sysfs documentation
  time: hrtimer: Use timerqueue_iterate_next() to get to the next timer
  nohz: Avoid duplication of code related to got_idle_tick
  nohz: Gather tick_sched booleans under a common flag field
  cpuidle: menu: Avoid selecting shallow states with stopped tick
  cpuidle: menu: Refine idle state selection for running tick
  sched: idle: Select idle state before stopping the tick
  time: hrtimer: Introduce hrtimer_next_event_without()
  time: tick-sched: Split tick_nohz_stop_sched_tick()
  cpuidle: Return nohz hint from cpuidle_select()
  jiffies: Introduce USER_TICK_USEC and redefine TICK_USEC
  ...
2018-04-11 17:03:20 -07:00
Linus Torvalds
8837c70d53 Merge branch 'akpm' (patches from Andrew)
Merge more updates from Andrew Morton:

 - almost all of the rest of MM

 - kasan updates

 - lots of procfs work

 - misc things

 - lib/ updates

 - checkpatch

 - rapidio

 - ipc/shm updates

 - the start of willy's XArray conversion

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (140 commits)
  page cache: use xa_lock
  xarray: add the xa_lock to the radix_tree_root
  fscache: use appropriate radix tree accessors
  export __set_page_dirty
  unicore32: turn flush_dcache_mmap_lock into a no-op
  arm64: turn flush_dcache_mmap_lock into a no-op
  mac80211_hwsim: use DEFINE_IDA
  radix tree: use GFP_ZONEMASK bits of gfp_t for flags
  linux/const.h: refactor _BITUL and _BITULL a bit
  linux/const.h: move UL() macro to include/linux/const.h
  linux/const.h: prefix include guard of uapi/linux/const.h with _UAPI
  xen, mm: allow deferred page initialization for xen pv domains
  elf: enforce MAP_FIXED on overlaying elf segments
  fs, elf: drop MAP_FIXED usage from elf_map
  mm: introduce MAP_FIXED_NOREPLACE
  MAINTAINERS: update bouncing aacraid@adaptec.com addresses
  fs/dcache.c: add cond_resched() in shrink_dentry_list()
  include/linux/kfifo.h: fix comment
  ipc/shm.c: shm_split(): remove unneeded test for NULL shm_file_data.vm_ops
  kernel/sysctl.c: add kdoc comments to do_proc_do{u}intvec_minmax_conv_param
  ...
2018-04-11 10:51:26 -07:00
Matthew Wilcox
b93b016313 page cache: use xa_lock
Remove the address_space ->tree_lock and use the xa_lock newly added to
the radix_tree_root.  Rename the address_space ->page_tree to ->i_pages,
since we don't really care that it's a tree.

[willy@infradead.org: fix nds32, fs/dax.c]
  Link: http://lkml.kernel.org/r/20180406145415.GB20605@bombadil.infradead.orgLink: http://lkml.kernel.org/r/20180313132639.17387-9-willy@infradead.org
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:39 -07:00
Matthew Wilcox
f6bb2a2c0b xarray: add the xa_lock to the radix_tree_root
This results in no change in structure size on 64-bit machines as it
fits in the padding between the gfp_t and the void *.  32-bit machines
will grow the structure from 8 to 12 bytes.  Almost all radix trees are
protected with (at least) a spinlock, so as they are converted from
radix trees to xarrays, the data structures will shrink again.

Initialising the spinlock requires a name for the benefit of lockdep, so
RADIX_TREE_INIT() now needs to know the name of the radix tree it's
initialising, and so do IDR_INIT() and IDA_INIT().

Also add the xa_lock() and xa_unlock() family of wrappers to make it
easier to use the lock.  If we could rely on -fplan9-extensions in the
compiler, we could avoid all of this syntactic sugar, but that wasn't
added until gcc 4.6.

Link: http://lkml.kernel.org/r/20180313132639.17387-8-willy@infradead.org
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:39 -07:00
Matthew Wilcox
f82b376413 export __set_page_dirty
XFS currently contains a copy-and-paste of __set_page_dirty().  Export
it from buffer.c instead.

Link: http://lkml.kernel.org/r/20180313132639.17387-6-willy@infradead.org
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Acked-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:39 -07:00
Matthew Wilcox
fa290cda10 radix tree: use GFP_ZONEMASK bits of gfp_t for flags
Patch series "XArray", v9.  (First part thereof).

This patchset is, I believe, appropriate for merging for 4.17.  It
contains the XArray implementation, to eventually replace the radix
tree, and converts the page cache to use it.

This conversion keeps the radix tree and XArray data structures in sync
at all times.  That allows us to convert the page cache one function at
a time and should allow for easier bisection.  Other than renaming some
elements of the structures, the data structures are fundamentally
unchanged; a radix tree walk and an XArray walk will touch the same
number of cachelines.  I have changes planned to the XArray data
structure, but those will happen in future patches.

Improvements the XArray has over the radix tree:

 - The radix tree provides operations like other trees do; 'insert' and
   'delete'. But what most users really want is an automatically
   resizing array, and so it makes more sense to give users an API that
   is like an array -- 'load' and 'store'. We still have an 'insert'
   operation for users that really want that semantic.

 - The XArray considers locking as part of its API. This simplifies a
   lot of users who formerly had to manage their own locking just for
   the radix tree. It also improves code generation as we can now tell
   RCU that we're holding a lock and it doesn't need to generate as much
   fencing code. The other advantage is that tree nodes can be moved
   (not yet implemented).

 - GFP flags are now parameters to calls which may need to allocate
   memory. The radix tree forced users to decide what the allocation
   flags would be at creation time. It's much clearer to specify them at
   allocation time.

 - Memory is not preloaded; we don't tie up dozens of pages on the off
   chance that the slab allocator fails. Instead, we drop the lock,
   allocate a new node and retry the operation. We have to convert all
   the radix tree, IDA and IDR preload users before we can realise this
   benefit, but I have not yet found a user which cannot be converted.

 - The XArray provides a cmpxchg operation. The radix tree forces users
   to roll their own (and at least four have).

 - Iterators take a 'max' parameter. That simplifies many users and will
   reduce the amount of iteration done.

 - Iteration can proceed backwards. We only have one user for this, but
   since it's called as part of the pagefault readahead algorithm, that
   seemed worth mentioning.

 - RCU-protected pointers are not exposed as part of the API. There are
   some fun bugs where the page cache forgets to use rcu_dereference()
   in the current codebase.

 - Value entries gain an extra bit compared to radix tree exceptional
   entries. That gives us the extra bit we need to put huge page swap
   entries in the page cache.

 - Some iterators now take a 'filter' argument instead of having
   separate iterators for tagged/untagged iterations.

The page cache is improved by this:

 - Shorter, easier to read code

 - More efficient iterations

 - Reduction in size of struct address_space

 - Fewer walks from the top of the data structure; the XArray API
   encourages staying at the leaf node and conducting operations there.

This patch (of 8):

None of these bits may be used for slab allocations, so we can use them
as radix tree flags as long as we mask them off before passing them to
the slab allocator. Move the IDR flag from the high bits to the
GFP_ZONEMASK bits.

Link: http://lkml.kernel.org/r/20180313132639.17387-3-willy@infradead.org
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Acked-by: Jeff Layton <jlayton@kernel.org>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:39 -07:00
Masahiro Yamada
21e7bc600e linux/const.h: refactor _BITUL and _BITULL a bit
Minor cleanups available by _UL and _ULL.

Link: http://lkml.kernel.org/r/1519301715-31798-5-git-send-email-yamada.masahiro@socionext.com
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
Cc: Russell King <rmk+kernel@armlinux.org.uk>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:38 -07:00
Masahiro Yamada
2dd8a62c64 linux/const.h: move UL() macro to include/linux/const.h
ARM, ARM64 and UniCore32 duplicate the definition of UL():

  #define UL(x) _AC(x, UL)

This is not actually arch-specific, so it will be useful to move it to a
common header.  Currently, we only have the uapi variant for
linux/const.h, so I am creating include/linux/const.h.

I also added _UL(), _ULL() and ULL() because _AC() is mostly used in
the form either _AC(..., UL) or _AC(..., ULL).  I expect they will be
replaced in follow-up cleanups.  The underscore-prefixed ones should
be used for exported headers.

Link: http://lkml.kernel.org/r/1519301715-31798-4-git-send-email-yamada.masahiro@socionext.com
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Acked-by: Guan Xuetao <gxt@mprc.pku.edu.cn>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Russell King <rmk+kernel@armlinux.org.uk>
Cc: David Howells <dhowells@redhat.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:38 -07:00
Masahiro Yamada
2a6cc8a6c0 linux/const.h: prefix include guard of uapi/linux/const.h with _UAPI
Patch series "linux/const.h: cleanups of macros such as UL(), _BITUL(),
BIT() etc", v3.

ARM, ARM64, UniCore32 define UL() as a shorthand of _AC(..., UL).  More
architectures may introduce it in the future.

UL() is arch-agnostic, and useful. So let's move it to
include/linux/const.h

Currently, <asm/memory.h> must be included to use UL().  It pulls in more
bloats just for defining some bit macros.

I posted V2 one year ago.

The previous posts are:
https://patchwork.kernel.org/patch/9498273/
https://patchwork.kernel.org/patch/9498275/
https://patchwork.kernel.org/patch/9498269/
https://patchwork.kernel.org/patch/9498271/

At that time, what blocked this series was a comment from
David Howells:
  You need to be very careful doing this.  Some userspace stuff
  depends on the guard macro names on the kernel header files.

(https://patchwork.kernel.org/patch/9498275/)

Looking at the code closer, I noticed this is not a problem.

See the following line.
https://github.com/torvalds/linux/blob/v4.16-rc2/scripts/headers_install.sh#L40

scripts/headers_install.sh rips off _UAPI prefix from guard macro names.

I ran "make headers_install" and confirmed the result is what I expect.

So, we can prefix the include guard of include/uapi/linux/const.h,
and add a new include/linux/const.h.

This patch (of 4):

I am going to add include/linux/const.h for the kernel space.

Add _UAPI to the include guard of include/uapi/linux/const.h to
prepare for that.

Please notice the guard name of the exported one will be kept as-is.
So, this commit has no impact to the userspace even if some userspace
stuff depends on the guard macro names.

scripts/headers_install.sh processes exported headers by SED, and
rips off "_UAPI" from guard macro names.

  #ifndef _UAPI_LINUX_CONST_H
  #define _UAPI_LINUX_CONST_H

will be turned into

  #ifndef _LINUX_CONST_H
  #define _LINUX_CONST_H

Link: http://lkml.kernel.org/r/1519301715-31798-2-git-send-email-yamada.masahiro@socionext.com
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Russell King <rmk+kernel@armlinux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:38 -07:00
Michal Hocko
4ed2863951 fs, elf: drop MAP_FIXED usage from elf_map
Both load_elf_interp and load_elf_binary rely on elf_map to map segments
on a controlled address and they use MAP_FIXED to enforce that.  This is
however dangerous thing prone to silent data corruption which can be
even exploitable.

Let's take CVE-2017-1000253 as an example.  At the time (before commit
eab09532d4: "binfmt_elf: use ELF_ET_DYN_BASE only for PIE")
ELF_ET_DYN_BASE was at TASK_SIZE / 3 * 2 which is not that far away from
the stack top on 32b (legacy) memory layout (only 1GB away).  Therefore
we could end up mapping over the existing stack with some luck.

The issue has been fixed since then (a87938b2e2: "fs/binfmt_elf.c: fix
bug in loading of PIE binaries"), ELF_ET_DYN_BASE moved moved much
further from the stack (eab09532d4 and later by c715b72c1b: "mm:
revert x86_64 and arm64 ELF_ET_DYN_BASE base changes") and excessive
stack consumption early during execve fully stopped by da029c11e6
("exec: Limit arg stack to at most 75% of _STK_LIM").  So we should be
safe and any attack should be impractical.  On the other hand this is
just too subtle assumption so it can break quite easily and hard to
spot.

I believe that the MAP_FIXED usage in load_elf_binary (et. al) is still
fundamentally dangerous.  Moreover it shouldn't be even needed.  We are
at the early process stage and so there shouldn't be unrelated mappings
(except for stack and loader) existing so mmap for a given address should
succeed even without MAP_FIXED.  Something is terribly wrong if this is
not the case and we should rather fail than silently corrupt the
underlying mapping.

Address this issue by changing MAP_FIXED to the newly added
MAP_FIXED_NOREPLACE.  This will mean that mmap will fail if there is an
existing mapping clashing with the requested one without clobbering it.

[mhocko@suse.com: fix build]
[akpm@linux-foundation.org: coding-style fixes]
[avagin@openvz.org: don't use the same value for MAP_FIXED_NOREPLACE and MAP_SYNC]
  Link: http://lkml.kernel.org/r/20171218184916.24445-1-avagin@openvz.org
Link: http://lkml.kernel.org/r/20171213092550.2774-3-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrei Vagin <avagin@openvz.org>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Cc: Joel Stanley <joel@jms.id.au>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:38 -07:00
Michal Hocko
a4ff8e8620 mm: introduce MAP_FIXED_NOREPLACE
Patch series "mm: introduce MAP_FIXED_NOREPLACE", v2.

This has started as a follow up discussion [3][4] resulting in the
runtime failure caused by hardening patch [5] which removes MAP_FIXED
from the elf loader because MAP_FIXED is inherently dangerous as it
might silently clobber an existing underlying mapping (e.g.  stack).
The reason for the failure is that some architectures enforce an
alignment for the given address hint without MAP_FIXED used (e.g.  for
shared or file backed mappings).

One way around this would be excluding those archs which do alignment
tricks from the hardening [6].  The patch is really trivial but it has
been objected, rightfully so, that this screams for a more generic
solution.  We basically want a non-destructive MAP_FIXED.

The first patch introduced MAP_FIXED_NOREPLACE which enforces the given
address but unlike MAP_FIXED it fails with EEXIST if the given range
conflicts with an existing one.  The flag is introduced as a completely
new one rather than a MAP_FIXED extension because of the backward
compatibility.  We really want a never-clobber semantic even on older
kernels which do not recognize the flag.  Unfortunately mmap sucks
wrt flags evaluation because we do not EINVAL on unknown flags.  On
those kernels we would simply use the traditional hint based semantic so
the caller can still get a different address (which sucks) but at least
not silently corrupt an existing mapping.  I do not see a good way
around that.  Except we won't export expose the new semantic to the
userspace at all.

It seems there are users who would like to have something like that.
Jemalloc has been mentioned by Michael Ellerman [7]

Florian Weimer has mentioned the following:
: glibc ld.so currently maps DSOs without hints.  This means that the kernel
: will map right next to each other, and the offsets between them a completely
: predictable.  We would like to change that and supply a random address in a
: window of the address space.  If there is a conflict, we do not want the
: kernel to pick a non-random address. Instead, we would try again with a
: random address.

John Hubbard has mentioned CUDA example
: a) Searches /proc/<pid>/maps for a "suitable" region of available
: VA space.  "Suitable" generally means it has to have a base address
: within a certain limited range (a particular device model might
: have odd limitations, for example), it has to be large enough, and
: alignment has to be large enough (again, various devices may have
: constraints that lead us to do this).
:
: This is of course subject to races with other threads in the process.
:
: Let's say it finds a region starting at va.
:
: b) Next it does:
:     p = mmap(va, ...)
:
: *without* setting MAP_FIXED, of course (so va is just a hint), to
: attempt to safely reserve that region. If p != va, then in most cases,
: this is a failure (almost certainly due to another thread getting a
: mapping from that region before we did), and so this layer now has to
: call munmap(), before returning a "failure: retry" to upper layers.
:
:     IMPROVEMENT: --> if instead, we could call this:
:
:             p = mmap(va, ... MAP_FIXED_NOREPLACE ...)
:
:         , then we could skip the munmap() call upon failure. This
:         is a small thing, but it is useful here. (Thanks to Piotr
:         Jaroszynski and Mark Hairgrove for helping me get that detail
:         exactly right, btw.)
:
: c) After that, CUDA suballocates from p, via:
:
:      q = mmap(sub_region_start, ... MAP_FIXED ...)
:
: Interestingly enough, "freeing" is also done via MAP_FIXED, and
: setting PROT_NONE to the subregion. Anyway, I just included (c) for
: general interest.

Atomic address range probing in the multithreaded programs in general
sounds like an interesting thing to me.

The second patch simply replaces MAP_FIXED use in elf loader by
MAP_FIXED_NOREPLACE.  I believe other places which rely on MAP_FIXED
should follow.  Actually real MAP_FIXED usages should be docummented
properly and they should be more of an exception.

[1] http://lkml.kernel.org/r/20171116101900.13621-1-mhocko@kernel.org
[2] http://lkml.kernel.org/r/20171129144219.22867-1-mhocko@kernel.org
[3] http://lkml.kernel.org/r/20171107162217.382cd754@canb.auug.org.au
[4] http://lkml.kernel.org/r/1510048229.12079.7.camel@abdul.in.ibm.com
[5] http://lkml.kernel.org/r/20171023082608.6167-1-mhocko@kernel.org
[6] http://lkml.kernel.org/r/20171113094203.aofz2e7kueitk55y@dhcp22.suse.cz
[7] http://lkml.kernel.org/r/87efp1w7vy.fsf@concordia.ellerman.id.au

This patch (of 2):

MAP_FIXED is used quite often to enforce mapping at the particular range.
The main problem of this flag is, however, that it is inherently dangerous
because it unmaps existing mappings covered by the requested range.  This
can cause silent memory corruptions.  Some of them even with serious
security implications.  While the current semantic might be really
desiderable in many cases there are others which would want to enforce the
given range but rather see a failure than a silent memory corruption on a
clashing range.  Please note that there is no guarantee that a given range
is obeyed by the mmap even when it is free - e.g.  arch specific code is
allowed to apply an alignment.

Introduce a new MAP_FIXED_NOREPLACE flag for mmap to achieve this
behavior.  It has the same semantic as MAP_FIXED wrt.  the given address
request with a single exception that it fails with EEXIST if the requested
address is already covered by an existing mapping.  We still do rely on
get_unmaped_area to handle all the arch specific MAP_FIXED treatment and
check for a conflicting vma after it returns.

The flag is introduced as a completely new one rather than a MAP_FIXED
extension because of the backward compatibility.  We really want a
never-clobber semantic even on older kernels which do not recognize the
flag.  Unfortunately mmap sucks wrt.  flags evaluation because we do not
EINVAL on unknown flags.  On those kernels we would simply use the
traditional hint based semantic so the caller can still get a different
address (which sucks) but at least not silently corrupt an existing
mapping.  I do not see a good way around that.

[mpe@ellerman.id.au: fix whitespace]
[fail on clashing range with EEXIST as per Florian Weimer]
[set MAP_FIXED before round_hint_to_min as per Khalid Aziz]
Link: http://lkml.kernel.org/r/20171213092550.2774-2-mhocko@kernel.org
Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Cc: Khalid Aziz <khalid.aziz@oracle.com>
Cc: Russell King - ARM Linux <linux@armlinux.org.uk>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Cc: Joel Stanley <joel@jms.id.au>
Cc: Kees Cook <keescook@chromium.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Jason Evans <jasone@google.com>
Cc: David Goldblatt <davidtgoldblatt@gmail.com>
Cc: Edward Tomasz Napierała <trasz@FreeBSD.org>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:38 -07:00
Valentin Vidic
de99626c2e include/linux/kfifo.h: fix comment
Clean up unusual formatting in the note about locking.

Link: http://lkml.kernel.org/r/20180324002630.13046-1-Valentin.Vidic@CARNet.hr
Signed-off-by: Valentin Vidic <Valentin.Vidic@CARNet.hr>
Cc: Stefani Seibold <stefani@seibold.net>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Sean Young <sean@mess.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:38 -07:00
Davidlohr Bueso
23c8cec8cf ipc/msg: introduce msgctl(MSG_STAT_ANY)
There is a permission discrepancy when consulting msq ipc object
metadata between /proc/sysvipc/msg (0444) and the MSG_STAT shmctl
command.  The later does permission checks for the object vs S_IRUGO.
As such there can be cases where EACCESS is returned via syscall but the
info is displayed anyways in the procfs files.

While this might have security implications via info leaking (albeit no
writing to the msq metadata), this behavior goes way back and showing
all the objects regardless of the permissions was most likely an
overlook - so we are stuck with it.  Furthermore, modifying either the
syscall or the procfs file can cause userspace programs to break (ie
ipcs).  Some applications require getting the procfs info (without root
privileges) and can be rather slow in comparison with a syscall -- up to
500x in some reported cases for shm.

This patch introduces a new MSG_STAT_ANY command such that the msq ipc
object permissions are ignored, and only audited instead.  In addition,
I've left the lsm security hook checks in place, as if some policy can
block the call, then the user has no other choice than just parsing the
procfs file.

Link: http://lkml.kernel.org/r/20180215162458.10059-4-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Reported-by: Robert Kettler <robert.kettler@outlook.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:37 -07:00
Davidlohr Bueso
a280d6dc77 ipc/sem: introduce semctl(SEM_STAT_ANY)
There is a permission discrepancy when consulting shm ipc object
metadata between /proc/sysvipc/sem (0444) and the SEM_STAT semctl
command.  The later does permission checks for the object vs S_IRUGO.
As such there can be cases where EACCESS is returned via syscall but the
info is displayed anyways in the procfs files.

While this might have security implications via info leaking (albeit no
writing to the sma metadata), this behavior goes way back and showing
all the objects regardless of the permissions was most likely an
overlook - so we are stuck with it.  Furthermore, modifying either the
syscall or the procfs file can cause userspace programs to break (ie
ipcs).  Some applications require getting the procfs info (without root
privileges) and can be rather slow in comparison with a syscall -- up to
500x in some reported cases for shm.

This patch introduces a new SEM_STAT_ANY command such that the sem ipc
object permissions are ignored, and only audited instead.  In addition,
I've left the lsm security hook checks in place, as if some policy can
block the call, then the user has no other choice than just parsing the
procfs file.

Link: http://lkml.kernel.org/r/20180215162458.10059-3-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Reported-by: Robert Kettler <robert.kettler@outlook.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:37 -07:00
Davidlohr Bueso
c21a6970ae ipc/shm: introduce shmctl(SHM_STAT_ANY)
Patch series "sysvipc: introduce STAT_ANY commands", v2.

The following patches adds the discussed (see [1]) new command for shm
as well as for sems and msq as they are subject to the same
discrepancies for ipc object permission checks between the syscall and
via procfs.  These new commands are justified in that (1) we are stuck
with this semantics as changing syscall and procfs can break userland;
and (2) some users can benefit from performance (for large amounts of
shm segments, for example) from not having to parse the procfs
interface.

Once merged, I will submit the necesary manpage updates.  But I'm thinking
something like:

: diff --git a/man2/shmctl.2 b/man2/shmctl.2
: index 7bb503999941..bb00bbe21a57 100644
: --- a/man2/shmctl.2
: +++ b/man2/shmctl.2
: @@ -41,6 +41,7 @@
:  .\" 2005-04-25, mtk -- noted aberrant Linux behavior w.r.t. new
:  .\"	attaches to a segment that has already been marked for deletion.
:  .\" 2005-08-02, mtk: Added IPC_INFO, SHM_INFO, SHM_STAT descriptions.
: +.\" 2018-02-13, dbueso: Added SHM_STAT_ANY description.
:  .\"
:  .TH SHMCTL 2 2017-09-15 "Linux" "Linux Programmer's Manual"
:  .SH NAME
: @@ -242,6 +243,18 @@ However, the
:  argument is not a segment identifier, but instead an index into
:  the kernel's internal array that maintains information about
:  all shared memory segments on the system.
: +.TP
: +.BR SHM_STAT_ANY " (Linux-specific)"
: +Return a
: +.I shmid_ds
: +structure as for
: +.BR SHM_STAT .
: +However, the
: +.I shm_perm.mode
: +is not checked for read access for
: +.IR shmid ,
: +resembing the behaviour of
: +/proc/sysvipc/shm.
:  .PP
:  The caller can prevent or allow swapping of a shared
:  memory segment with the following \fIcmd\fP values:
: @@ -287,7 +300,7 @@ operation returns the index of the highest used entry in the
:  kernel's internal array recording information about all
:  shared memory segments.
:  (This information can be used with repeated
: -.B SHM_STAT
: +.B SHM_STAT/SHM_STAT_ANY
:  operations to obtain information about all shared memory segments
:  on the system.)
:  A successful
: @@ -328,7 +341,7 @@ isn't accessible.
:  \fIshmid\fP is not a valid identifier, or \fIcmd\fP
:  is not a valid command.
:  Or: for a
: -.B SHM_STAT
: +.B SHM_STAT/SHM_STAT_ANY
:  operation, the index value specified in
:  .I shmid
:  referred to an array slot that is currently unused.

This patch (of 3):

There is a permission discrepancy when consulting shm ipc object metadata
between /proc/sysvipc/shm (0444) and the SHM_STAT shmctl command.  The
later does permission checks for the object vs S_IRUGO.  As such there can
be cases where EACCESS is returned via syscall but the info is displayed
anyways in the procfs files.

While this might have security implications via info leaking (albeit no
writing to the shm metadata), this behavior goes way back and showing all
the objects regardless of the permissions was most likely an overlook - so
we are stuck with it.  Furthermore, modifying either the syscall or the
procfs file can cause userspace programs to break (ie ipcs).  Some
applications require getting the procfs info (without root privileges) and
can be rather slow in comparison with a syscall -- up to 500x in some
reported cases.

This patch introduces a new SHM_STAT_ANY command such that the shm ipc
object permissions are ignored, and only audited instead.  In addition,
I've left the lsm security hook checks in place, as if some policy can
block the call, then the user has no other choice than just parsing the
procfs file.

[1] https://lkml.org/lkml/2017/12/19/220

Link: http://lkml.kernel.org/r/20180215162458.10059-2-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Robert Kettler <robert.kettler@outlook.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:37 -07:00
Kees Cook
c31dbb146d exec: pin stack limit during exec
Since the stack rlimit is used in multiple places during exec and it can
be changed via other threads (via setrlimit()) or processes (via
prlimit()), the assumption that the value doesn't change cannot be made.
This leads to races with mm layout selection and argument size
calculations.  This changes the exec path to use the rlimit stored in
bprm instead of in current.  Before starting the thread, the bprm stack
rlimit is stored back to current.

Link: http://lkml.kernel.org/r/1518638796-20819-4-git-send-email-keescook@chromium.org
Fixes: 64701dee41 ("exec: Use sane stack rlimit under secureexec")
Signed-off-by: Kees Cook <keescook@chromium.org>
Reported-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
Reported-by: Andy Lutomirski <luto@kernel.org>
Reported-by: Brad Spengler <spender@grsecurity.net>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Ben Hutchings <ben@decadent.org.uk>
Cc: Greg KH <greg@kroah.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: "Jason A. Donenfeld" <Jason@zx2c4.com>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Willy Tarreau <w@1wt.eu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:37 -07:00
Kees Cook
b838383133 exec: introduce finalize_exec() before start_thread()
Provide a final callback into fs/exec.c before start_thread() takes
over, to handle any last-minute changes, like the coming restoration of
the stack limit.

Link: http://lkml.kernel.org/r/1518638796-20819-3-git-send-email-keescook@chromium.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ben Hutchings <ben@decadent.org.uk>
Cc: Ben Hutchings <ben.hutchings@codethink.co.uk>
Cc: Brad Spengler <spender@grsecurity.net>
Cc: Greg KH <greg@kroah.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: "Jason A. Donenfeld" <Jason@zx2c4.com>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Willy Tarreau <w@1wt.eu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:37 -07:00
Kees Cook
8f2af155b5 exec: pass stack rlimit into mm layout functions
Patch series "exec: Pin stack limit during exec".

Attempts to solve problems with the stack limit changing during exec
continue to be frustrated[1][2].  In addition to the specific issues
around the Stack Clash family of flaws, Andy Lutomirski pointed out[3]
other places during exec where the stack limit is used and is assumed to
be unchanging.  Given the many places it gets used and the fact that it
can be manipulated/raced via setrlimit() and prlimit(), I think the only
way to handle this is to move away from the "current" view of the stack
limit and instead attach it to the bprm, and plumb this down into the
functions that need to know the stack limits.  This series implements
the approach.

[1] 04e35f4495 ("exec: avoid RLIMIT_STACK races with prlimit()")
[2] 779f4e1c6c ("Revert "exec: avoid RLIMIT_STACK races with prlimit()"")
[3] to security@kernel.org, "Subject: existing rlimit races?"

This patch (of 3):

Since it is possible that the stack rlimit can change externally during
exec (either via another thread calling setrlimit() or another process
calling prlimit()), provide a way to pass the rlimit down into the
per-architecture mm layout functions so that the rlimit can stay in the
bprm structure instead of sitting in the signal structure until exec is
finalized.

Link: http://lkml.kernel.org/r/1518638796-20819-2-git-send-email-keescook@chromium.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Ben Hutchings <ben@decadent.org.uk>
Cc: Willy Tarreau <w@1wt.eu>
Cc: Hugh Dickins <hughd@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: "Jason A. Donenfeld" <Jason@zx2c4.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Greg KH <greg@kroah.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ben Hutchings <ben.hutchings@codethink.co.uk>
Cc: Brad Spengler <spender@grsecurity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:37 -07:00
Alexey Dobriyan
0965232035 seq_file: allocate seq_file from kmem_cache
For fine-grained debugging and usercopy protection.

Link: http://lkml.kernel.org/r/20180310085027.GA17121@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Glauber Costa <glommer@gmail.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:36 -07:00
Kees Cook
2cfe0d3009 task_struct: only use anon struct under randstruct plugin
The original intent for always adding the anonymous struct in
task_struct was to make sure we had compiler coverage.

However, this caused pathological padding of 40 bytes at the start of
task_struct.  Instead, move the anonymous struct to being only used when
struct layout randomization is enabled.

Link: http://lkml.kernel.org/r/20180327213609.GA2964@beast
Fixes: 29e48ce87f ("task_struct: Allow randomized")
Signed-off-by: Kees Cook <keescook@chromium.org>
Reported-by: Peter Zijlstra <peterz@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:35 -07:00
Alexey Dobriyan
3ea056c504 uts: create "struct uts_namespace" from kmem_cache
So "struct uts_namespace" can enjoy fine-grained SLAB debugging and
usercopy protection.

I'd prefer shorter name "utsns" but there is "user_namespace" already.

Link: http://lkml.kernel.org/r/20180228215158.GA23146@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:35 -07:00
Kees Cook
bc4f2f5469 taint: add taint for randstruct
Since the randstruct plugin can intentionally produce extremely unusual
kernel structure layouts (even performance pathological ones), some
maintainers want to be able to trivially determine if an Oops is coming
from a randstruct-built kernel, so as to keep their sanity when
debugging.  This adds the new flag and initializes taint_mask
immediately when built with randstruct.

Link: http://lkml.kernel.org/r/1519084390-43867-4-git-send-email-keescook@chromium.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:35 -07:00
Kees Cook
47d4b263a2 taint: convert to indexed initialization
This converts to using indexed initializers instead of comments, adds a
comment on why the taint flags can't be an enum, and make sure that no
one forgets to update the taint_flags when adding new bits.

Link: http://lkml.kernel.org/r/1519084390-43867-2-git-send-email-keescook@chromium.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:34 -07:00
Andrei Vagin
d1be35cb6f proc: add seq_put_decimal_ull_width to speed up /proc/pid/smaps
seq_put_decimal_ull_w(m, str, val, width) prints a decimal number with a
specified minimal field width.

It is equivalent of seq_printf(m, "%s%*d", str, width, val), but it
works much faster.

== test_smaps.py
  num = 0
  with open("/proc/1/smaps") as f:
          for x in xrange(10000):
                  data = f.read()
                  f.seek(0, 0)
==

== Before patch ==
  $ time python test_smaps.py
  real    0m4.593s
  user    0m0.398s
  sys     0m4.158s

== After patch ==
  $ time python test_smaps.py
  real    0m3.828s
  user    0m0.413s
  sys     0m3.408s

$ perf -g record python test_smaps.py
== Before patch ==
-   79.01%     3.36%  python   [kernel.kallsyms]    [k] show_smap.isra.33
   - 75.65% show_smap.isra.33
      + 48.85% seq_printf
      + 15.75% __walk_page_range
      + 9.70% show_map_vma.isra.23
        0.61% seq_puts

== After patch ==
-   75.51%     4.62%  python   [kernel.kallsyms]    [k] show_smap.isra.33
   - 70.88% show_smap.isra.33
      + 24.82% seq_put_decimal_ull_w
      + 19.78% __walk_page_range
      + 12.74% seq_printf
      + 11.08% show_map_vma.isra.23
      + 1.68% seq_puts

[akpm@linux-foundation.org: fix drivers/of/unittest.c build]
Link: http://lkml.kernel.org/r/20180212074931.7227-1-avagin@openvz.org
Signed-off-by: Andrei Vagin <avagin@openvz.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:33 -07:00
Andrei Vagin
0e3dc01914 procfs: add seq_put_hex_ll to speed up /proc/pid/maps
seq_put_hex_ll() prints a number in hexadecimal notation and works
faster than seq_printf().

== test.py
  num = 0
  with open("/proc/1/maps") as f:
          while num < 10000 :
                  data = f.read()
                  f.seek(0, 0)
                 num = num + 1
==

== Before patch ==
  $  time python test.py

  real	0m1.561s
  user	0m0.257s
  sys	0m1.302s

== After patch ==
  $ time python test.py

  real	0m0.986s
  user	0m0.279s
  sys	0m0.707s

$ perf -g record python test.py:

== Before patch ==
-   67.42%     2.82%  python   [kernel.kallsyms] [k] show_map_vma.isra.22
   - 64.60% show_map_vma.isra.22
      - 44.98% seq_printf
         - seq_vprintf
            - vsnprintf
               + 14.85% number
               + 12.22% format_decode
                 5.56% memcpy_erms
      + 15.06% seq_path
      + 4.42% seq_pad
   + 2.45% __GI___libc_read

== After patch ==
-   47.35%     3.38%  python   [kernel.kallsyms] [k] show_map_vma.isra.23
   - 43.97% show_map_vma.isra.23
      + 20.84% seq_path
      - 15.73% show_vma_header_prefix
           10.55% seq_put_hex_ll
         + 2.65% seq_put_decimal_ull
           0.95% seq_putc
      + 6.96% seq_pad
   + 2.94% __GI___libc_read

[avagin@openvz.org: use unsigned int instead of int where it is suitable]
  Link: http://lkml.kernel.org/r/20180214025619.4005-1-avagin@openvz.org
[avagin@openvz.org: v2]
  Link: http://lkml.kernel.org/r/20180117082050.25406-1-avagin@openvz.org
Link: http://lkml.kernel.org/r/20180112185812.7710-1-avagin@openvz.org
Signed-off-by: Andrei Vagin <avagin@openvz.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:32 -07:00
Joonsoo Kim
bad8c6c0b1 mm/cma: manage the memory of the CMA area by using the ZONE_MOVABLE
Patch series "mm/cma: manage the memory of the CMA area by using the
ZONE_MOVABLE", v2.

0. History

This patchset is the follow-up of the discussion about the "Introduce
ZONE_CMA (v7)" [1].  Please reference it if more information is needed.

1. What does this patch do?

This patch changes the management way for the memory of the CMA area in
the MM subsystem.  Currently the memory of the CMA area is managed by
the zone where their pfn is belong to.  However, this approach has some
problems since MM subsystem doesn't have enough logic to handle the
situation that different characteristic memories are in a single zone.
To solve this issue, this patch try to manage all the memory of the CMA
area by using the MOVABLE zone.  In MM subsystem's point of view,
characteristic of the memory on the MOVABLE zone and the memory of the
CMA area are the same.  So, managing the memory of the CMA area by using
the MOVABLE zone will not have any problem.

2. Motivation

There are some problems with current approach.  See following.  Although
these problem would not be inherent and it could be fixed without this
conception change, it requires many hooks addition in various code path
and it would be intrusive to core MM and would be really error-prone.
Therefore, I try to solve them with this new approach.  Anyway,
following is the problems of the current implementation.

o CMA memory utilization

First, following is the freepage calculation logic in MM.

 - For movable allocation: freepage = total freepage
 - For unmovable allocation: freepage = total freepage - CMA freepage

Freepages on the CMA area is used after the normal freepages in the zone
where the memory of the CMA area is belong to are exhausted.  At that
moment that the number of the normal freepages is zero, so

 - For movable allocation: freepage = total freepage = CMA freepage
 - For unmovable allocation: freepage = 0

If unmovable allocation comes at this moment, allocation request would
fail to pass the watermark check and reclaim is started.  After reclaim,
there would exist the normal freepages so freepages on the CMA areas
would not be used.

FYI, there is another attempt [2] trying to solve this problem in lkml.
And, as far as I know, Qualcomm also has out-of-tree solution for this
problem.

Useless reclaim:

There is no logic to distinguish CMA pages in the reclaim path.  Hence,
CMA page is reclaimed even if the system just needs the page that can be
usable for the kernel allocation.

Atomic allocation failure:

This is also related to the fallback allocation policy for the memory of
the CMA area.  Consider the situation that the number of the normal
freepages is *zero* since the bunch of the movable allocation requests
come.  Kswapd would not be woken up due to following freepage
calculation logic.

- For movable allocation: freepage = total freepage = CMA freepage

If atomic unmovable allocation request comes at this moment, it would
fails due to following logic.

- For unmovable allocation: freepage = total freepage - CMA freepage = 0

It was reported by Aneesh [3].

Useless compaction:

Usual high-order allocation request is unmovable allocation request and
it cannot be served from the memory of the CMA area.  In compaction,
migration scanner try to migrate the page in the CMA area and make
high-order page there.  As mentioned above, it cannot be usable for the
unmovable allocation request so it's just waste.

3. Current approach and new approach

Current approach is that the memory of the CMA area is managed by the
zone where their pfn is belong to.  However, these memory should be
distinguishable since they have a strong limitation.  So, they are
marked as MIGRATE_CMA in pageblock flag and handled specially.  However,
as mentioned in section 2, the MM subsystem doesn't have enough logic to
deal with this special pageblock so many problems raised.

New approach is that the memory of the CMA area is managed by the
MOVABLE zone.  MM already have enough logic to deal with special zone
like as HIGHMEM and MOVABLE zone.  So, managing the memory of the CMA
area by the MOVABLE zone just naturally work well because constraints
for the memory of the CMA area that the memory should always be
migratable is the same with the constraint for the MOVABLE zone.

There is one side-effect for the usability of the memory of the CMA
area.  The use of MOVABLE zone is only allowed for a request with
GFP_HIGHMEM && GFP_MOVABLE so now the memory of the CMA area is also
only allowed for this gfp flag.  Before this patchset, a request with
GFP_MOVABLE can use them.  IMO, It would not be a big issue since most
of GFP_MOVABLE request also has GFP_HIGHMEM flag.  For example, file
cache page and anonymous page.  However, file cache page for blockdev
file is an exception.  Request for it has no GFP_HIGHMEM flag.  There is
pros and cons on this exception.  In my experience, blockdev file cache
pages are one of the top reason that causes cma_alloc() to fail
temporarily.  So, we can get more guarantee of cma_alloc() success by
discarding this case.

Note that there is no change in admin POV since this patchset is just
for internal implementation change in MM subsystem.  Just one minor
difference for admin is that the memory stat for CMA area will be
printed in the MOVABLE zone.  That's all.

4. Result

Following is the experimental result related to utilization problem.

8 CPUs, 1024 MB, VIRTUAL MACHINE
make -j16

<Before>
  CMA area:               0 MB            512 MB
  Elapsed-time:           92.4		186.5
  pswpin:                 82		18647
  pswpout:                160		69839

<After>
  CMA        :            0 MB            512 MB
  Elapsed-time:           93.1		93.4
  pswpin:                 84		46
  pswpout:                183		92

akpm: "kernel test robot" reported a 26% improvement in
vm-scalability.throughput:
http://lkml.kernel.org/r/20180330012721.GA3845@yexl-desktop

[1]: lkml.kernel.org/r/1491880640-9944-1-git-send-email-iamjoonsoo.kim@lge.com
[2]: https://lkml.org/lkml/2014/10/15/623
[3]: http://www.spinics.net/lists/linux-mm/msg100562.html

Link: http://lkml.kernel.org/r/1512114786-5085-2-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Tested-by: Tony Lindgren <tony@atomide.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Laura Abbott <lauraa@codeaurora.org>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:32 -07:00
Joonsoo Kim
d3cda2337b mm/page_alloc: don't reserve ZONE_HIGHMEM for ZONE_MOVABLE request
Freepage on ZONE_HIGHMEM doesn't work for kernel memory so it's not that
important to reserve.  When ZONE_MOVABLE is used, this problem would
theorectically cause to decrease usable memory for GFP_HIGHUSER_MOVABLE
allocation request which is mainly used for page cache and anon page
allocation.  So, fix it by setting 0 to
sysctl_lowmem_reserve_ratio[ZONE_HIGHMEM].

And, defining sysctl_lowmem_reserve_ratio array by MAX_NR_ZONES - 1 size
makes code complex.  For example, if there is highmem system, following
reserve ratio is activated for *NORMAL ZONE* which would be easyily
misleading people.

 #ifdef CONFIG_HIGHMEM
 32
 #endif

This patch also fixes this situation by defining
sysctl_lowmem_reserve_ratio array by MAX_NR_ZONES and place "#ifdef" to
right place.

Link: http://lkml.kernel.org/r/1504672525-17915-1-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Tested-by: Tony Lindgren <tony@atomide.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Laura Abbott <lauraa@codeaurora.org>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Will Deacon <will.deacon@arm.com>
Cc: <linux-api@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:32 -07:00
Michal Hocko
94723aafb9 mm: unclutter THP migration
THP migration is hacked into the generic migration with rather
surprising semantic.  The migration allocation callback is supposed to
check whether the THP can be migrated at once and if that is not the
case then it allocates a simple page to migrate.  unmap_and_move then
fixes that up by spliting the THP into small pages while moving the head
page to the newly allocated order-0 page.  Remaning pages are moved to
the LRU list by split_huge_page.  The same happens if the THP allocation
fails.  This is really ugly and error prone [1].

I also believe that split_huge_page to the LRU lists is inherently wrong
because all tail pages are not migrated.  Some callers will just work
around that by retrying (e.g.  memory hotplug).  There are other pfn
walkers which are simply broken though.  e.g. madvise_inject_error will
migrate head and then advances next pfn by the huge page size.
do_move_page_to_node_array, queue_pages_range (migrate_pages, mbind),
will simply split the THP before migration if the THP migration is not
supported then falls back to single page migration but it doesn't handle
tail pages if the THP migration path is not able to allocate a fresh THP
so we end up with ENOMEM and fail the whole migration which is a
questionable behavior.  Page compaction doesn't try to migrate large
pages so it should be immune.

This patch tries to unclutter the situation by moving the special THP
handling up to the migrate_pages layer where it actually belongs.  We
simply split the THP page into the existing list if unmap_and_move fails
with ENOMEM and retry.  So we will _always_ migrate all THP subpages and
specific migrate_pages users do not have to deal with this case in a
special way.

[1] http://lkml.kernel.org/r/20171121021855.50525-1-zi.yan@sent.com

Link: http://lkml.kernel.org/r/20180103082555.14592-4-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Zi Yan <zi.yan@cs.rutgers.edu>
Cc: Andrea Reale <ar@linux.vnet.ibm.com>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:32 -07:00
Michal Hocko
666feb21a0 mm, migrate: remove reason argument from new_page_t
No allocation callback is using this argument anymore.  new_page_node
used to use this parameter to convey node_id resp.  migration error up
to move_pages code (do_move_page_to_node_array).  The error status never
made it into the final status field and we have a better way to
communicate node id to the status field now.  All other allocation
callbacks simply ignored the argument so we can drop it finally.

[mhocko@suse.com: fix migration callback]
  Link: http://lkml.kernel.org/r/20180105085259.GH2801@dhcp22.suse.cz
[akpm@linux-foundation.org: fix alloc_misplaced_dst_page()]
[mhocko@kernel.org: fix build]
  Link: http://lkml.kernel.org/r/20180103091134.GB11319@dhcp22.suse.cz
Link: http://lkml.kernel.org/r/20180103082555.14592-3-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Zi Yan <zi.yan@cs.rutgers.edu>
Cc: Andrea Reale <ar@linux.vnet.ibm.com>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:32 -07:00
Johannes Weiner
e27be240df mm: memcg: make sure memory.events is uptodate when waking pollers
Commit a983b5ebee ("mm: memcontrol: fix excessive complexity in
memory.stat reporting") added per-cpu drift to all memory cgroup stats
and events shown in memory.stat and memory.events.

For memory.stat this is acceptable.  But memory.events issues file
notifications, and somebody polling the file for changes will be
confused when the counters in it are unchanged after a wakeup.

Luckily, the events in memory.events - MEMCG_LOW, MEMCG_HIGH, MEMCG_MAX,
MEMCG_OOM - are sufficiently rare and high-level that we don't need
per-cpu buffering for them: MEMCG_HIGH and MEMCG_MAX would be the most
frequent, but they're counting invocations of reclaim, which is a
complex operation that touches many shared cachelines.

This splits memory.events from the generic VM events and tracks them in
their own, unbuffered atomic counters.  That's also cleaner, as it
eliminates the ugly enum nesting of VM and cgroup events.

[hannes@cmpxchg.org: "array subscript is above array bounds"]
  Link: http://lkml.kernel.org/r/20180406155441.GA20806@cmpxchg.org
Link: http://lkml.kernel.org/r/20180405175507.GA24817@cmpxchg.org
Fixes: a983b5ebee ("mm: memcontrol: fix excessive complexity in memory.stat reporting")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Tejun Heo <tj@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:31 -07:00
Arnd Bergmann
9d8a463a70 mm/hmm: fix header file if/else/endif maze, again
The last fix was still wrong, as we need the inline dummy functions also
for the case that CONFIG_HMM is enabled but CONFIG_HMM_MIRROR is not:

  kernel/fork.o: In function `__mmdrop':
  fork.c:(.text+0x14f6): undefined reference to `hmm_mm_destroy'

This adds back the second copy of the dummy functions, hopefully
this time in the right place.

Link: http://lkml.kernel.org/r/20180404110236.804484-1-arnd@arndb.de
Fixes: 8900d06a277a ("mm/hmm: fix header file if/else/endif maze")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:31 -07:00
Jérôme Glisse
f88a1e90c6 mm/hmm: use device driver encoding for HMM pfn
Users of hmm_vma_fault() and hmm_vma_get_pfns() provide a flags array and
pfn shift value allowing them to define their own encoding for HMM pfn
that are fill inside the pfns array of the hmm_range struct.  With this
device driver can get pfn that match their own private encoding out of HMM
without having to do any conversion.

[rcampbell@nvidia.com: don't ignore specific pte fault flag in hmm_vma_fault()]
  Link: http://lkml.kernel.org/r/20180326213009.2460-2-jglisse@redhat.com
[rcampbell@nvidia.com: clarify fault logic for device private memory]
  Link: http://lkml.kernel.org/r/20180326213009.2460-3-jglisse@redhat.com
Link: http://lkml.kernel.org/r/20180323005527.758-16-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:31 -07:00
Jérôme Glisse
2aee09d8c1 mm/hmm: change hmm_vma_fault() to allow write fault on page basis
This changes hmm_vma_fault() to not take a global write fault flag for a
range but instead rely on caller to populate HMM pfns array with proper
fault flag ie HMM_PFN_VALID if driver want read fault for that address or
HMM_PFN_VALID and HMM_PFN_WRITE for write.

Moreover by setting HMM_PFN_DEVICE_PRIVATE the device driver can ask for
device private memory to be migrated back to system memory through page
fault.

This is more flexible API and it better reflects how device handles and
reports fault.

Link: http://lkml.kernel.org/r/20180323005527.758-15-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:31 -07:00