80184 Commits

Author SHA1 Message Date
Kirill A. Shutemov
c13985fa80 page-flags: define behavior of Xen-related flags on compound pages
PG_pinned and PG_savepinned are about page table's pages which are never
compound.

I'm not so sure about PG_foreign, but it seems we shouldn't see compound
pages there too.

Let's use PF_NO_COMPOUND for all of them.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-15 17:56:32 -08:00
Kirill A. Shutemov
dcb351cd09 page-flags: define behavior SL*B-related flags on compound pages
SL*B uses compound pages and marks head pages with PG_slab.
__SetPageSlab() and __ClearPageSlab() are never called for tail pages.

The same situation with PG_slob_free in SLOB allocator.

PF_NO_TAIL is appropriate for these flags.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-15 17:56:32 -08:00
Kirill A. Shutemov
8cb38fabb6 page-flags: define behavior of LRU-related flags on compound pages
Only head pages are ever on LRU.  Let's use PF_HEAD policy to avoid any
confusion for all LRU-related flags.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-15 17:56:32 -08:00
Kirill A. Shutemov
df8c94d13c page-flags: define behavior of FS/IO-related flags on compound pages
It seems we don't have compound page on FS/IO path currently.  Use
PF_NO_COMPOUND to catch if we have.

The odd exception is PG_dirty: sound uses compound pages and maps them
with PTEs.  PF_NO_COMPOUND triggers VM_BUG_ON() in set_page_dirty() on
handling shared fault.  Let's use PF_HEAD for PG_dirty.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-15 17:56:32 -08:00
Kirill A. Shutemov
48c935ad88 page-flags: define PG_locked behavior on compound pages
lock_page() must operate on the whole compound page.  It doesn't make
much sense to lock part of compound page.  Change code to use head
page's PG_locked, if tail page is passed.

This patch also gets rid of custom helper functions --
__set_page_locked() and __clear_page_locked().  They are replaced with
helpers generated by __SETPAGEFLAG/__CLEARPAGEFLAG.  Tail pages to these
helper would trigger VM_BUG_ON().

SLUB uses PG_locked as a bit spin locked.  IIUC, tail pages should never
appear there.  VM_BUG_ON() is added to make sure that this assumption is
correct.

[akpm@linux-foundation.org: fix fs/cifs/file.c]
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-15 17:56:32 -08:00
Kirill A. Shutemov
95ad97554a page-flags: introduce page flags policies wrt compound pages
This patch adds a third argument to macros which create function
definitions for page flags.  This argument defines how page-flags
helpers behave on compound functions.

For now we define four policies:

 - PF_ANY: the helper function operates on the page it gets, regardless
   if it's non-compound, head or tail.

 - PF_HEAD: the helper function operates on the head page of the
   compound page if it gets tail page.

 - PF_NO_TAIL: only head and non-compond pages are acceptable for this
   helper function.

 - PF_NO_COMPOUND: only non-compound pages are acceptable for this
   helper function.

For now we use policy PF_ANY for all helpers, which matches current
behaviour.

We do not enforce the policy for TESTPAGEFLAG, because we have flags
checked for random pages all over the kernel.  Noticeable exception to
this is PageTransHuge() which triggers VM_BUG_ON() for tail page.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-15 17:56:32 -08:00
Kirill A. Shutemov
0e6d31a733 page-flags: move code around
The preparation patch: we are going to use compound_head(), PageTail()
and PageCompound() to define page-flags helpers.

Let's define them before macros.

We cannot user PageHead() helper in PageCompound() as it's not yet
defined -- use test_bit(PG_head, &page->flags) instead.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-15 17:56:32 -08:00
Kirill A. Shutemov
d8c1bdeb5d page-flags: trivial cleanup for PageTrans* helpers
Use TESTPAGEFLAG_FALSE() to get it a bit cleaner.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-15 17:56:32 -08:00
Linus Torvalds
4e5448a31d Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:
 "A quick set of bug fixes after there initial networking merge:

  1) Netlink multicast group storage allocator only was tested with
     nr_groups equal to 1, make it work for other values too.  From
     Matti Vaittinen.

  2) Check build_skb() return value in macb and hip04_eth drivers, from
     Weidong Wang.

  3) Don't leak x25_asy on x25_asy_open() failure.

  4) More DMA map/unmap fixes in 3c59x from Neil Horman.

  5) Don't clobber IP skb control block during GSO segmentation, from
     Konstantin Khlebnikov.

  6) ECN helpers for ipv6 don't fixup the checksum, from Eric Dumazet.

  7) Fix SKB segment utilization estimation in xen-netback, from David
     Vrabel.

  8) Fix lockdep splat in bridge addrlist handling, from Nikolay
     Aleksandrov"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (26 commits)
  bgmac: Fix reversed test of build_skb() return value.
  bridge: fix lockdep addr_list_lock false positive splat
  net: smsc: Add support h8300
  xen-netback: free queues after freeing the net device
  xen-netback: delete NAPI instance when queue fails to initialize
  xen-netback: use skb to determine number of required guest Rx requests
  net: sctp: Move sequence start handling into sctp_transport_get_idx()
  ipv6: update skb->csum when CE mark is propagated
  net: phy: turn carrier off on phy attach
  net: macb: clear interrupts when disabling them
  sctp: support to lookup with ep+paddr in transport rhashtable
  net: hns: fixes no syscon error when init mdio
  dts: hisi: fixes no syscon fault when init mdio
  net: preserve IP control block during GSO segmentation
  fsl/fman: Delete one function call "put_device" in dtsec_config()
  hip04_eth: fix missing error handle for build_skb failed
  3c59x: fix another page map/single unmap imbalance
  3c59x: balance page maps and unmaps
  x25_asy: Free x25_asy on x25_asy_open() failure.
  mlxsw: fix SWITCHDEV_OBJ_ID_PORT_MDB
  ...
2016-01-15 13:33:12 -08:00
Linus Torvalds
f689b742f2 powerpc updates for 4.5
- Ground work for the new Power9 MMU from Aneesh Kumar K.V
  - Optimise FP/VMX/VSX context switching from Anton Blanchard
 
  - Various cleanups from Krzysztof Kozlowski, John Ogness, Rashmica Gupta,
    Russell Currey, Gavin Shan, Daniel Axtens, Michael Neuling, Andrew Donnellan
  - Allow wrapper to work on non-english system from Laurent Vivier
  - Add rN aliases to the pt_regs_offset table from Rashmica Gupta
  - Fix module autoload for rackmeter & axonram drivers from Luis de Bethencourt
  - Include KVM guest test in all interrupt vectors from Paul Mackerras
  - Fix DSCR inheritance over fork() from Anton Blanchard
  - Make value-returning atomics & {cmp}xchg* & their atomic_ versions fully ordered from Boqun Feng
  - Print MSR TM bits in oops messages from Michael Neuling
  - Add TM signal return & invalid stack selftests from Michael Neuling
  - Limit EPOW reset event warnings from Vipin K Parashar
  - Remove the Cell QPACE code from Rashmica Gupta
  - Append linux_banner to exception information in xmon from Rashmica Gupta
  - Add selftest to check if VSRs are corrupted from Rashmica Gupta
  - Remove broken GregorianDay() from Daniel Axtens
  - Import Anton's context_switch2 benchmark into selftests from Michael Ellerman
  - Add selftest script to test HMI functionality from Daniel Axtens
  - Remove obsolete OPAL v2 support from Stewart Smith
  - Make enter_rtas() private from Michael Ellerman
  - PPR exception cleanups from Michael Ellerman
  - Add page soft dirty tracking from Laurent Dufour
  - Add support for Nvlink NPUs from Alistair Popple
  - Add support for kexec on 476fpe from Alistair Popple
  - Enable kernel CPU dlpar from sysfs from Nathan Fontenot
  - Copy only required pieces of the mm_context_t to the paca from Michael Neuling
  - Add a kmsg_dumper that flushes OPAL console output on panic from Russell Currey
  - Implement save_stack_trace_regs() to enable kprobe stack tracing from Steven Rostedt
  - Add HWCAP bits for Power9 from Michael Ellerman
  - Fix _PAGE_PTE breaking swapoff from Aneesh Kumar K.V
  - Fix _PAGE_SWP_SOFT_DIRTY breaking swapoff from Hugh Dickins
  - scripts/recordmcount.pl: support data in text section on powerpc from Ulrich Weigand
  - Handle R_PPC64_ENTRY relocations in modules from Ulrich Weigand
 
  - cxl: Fix possible idr warning when contexts are released from Vaibhav Jain
  - cxl: use correct operator when writing pcie config space values from Andrew Donnellan
  - cxl: Fix DSI misses when the context owning task exits from Vaibhav Jain
  - cxl: fix build for GCC 4.6.x from Brian Norris
  - cxl: use -Werror only with CONFIG_PPC_WERROR from Brian Norris
  - cxl: Enable PCI device ID for future IBM CXL adapter from Uma Krishnan
 
  - Freescale updates from Scott: Highlights include moving QE code out of
    arch/powerpc (to be shared with arm), device tree updates, and minor fixes.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJWmIxeAAoJEFHr6jzI4aWAA+cQAIXAw4WfVWJ2V4ZK+1eKfB57
 fdXG71PuXG+WYIWy71ly8keLHdzzD1NQ2OUB64bUVRq202nRgVc15ZYKRJ/FE/sP
 SkxaQ2AG/2kI2EflWshOi0Lu9qaZ+LMHJnszIqE/9lnGSB2kUI/cwsSXgziiMKXR
 XNci9v14SdDd40YV/6BSZXoxApwyq9cUbZ7rnzFLmz4hrFuKmB/L3LABDF8QcpH7
 sGt/YaHGOtqP0UX7h5KQTFLGe1OPvK6NWixSXeZKQ71ED6cho1iKUEOtBA9EZeIN
 QM5JdHFWgX8MMRA0OHAgidkSiqO38BXjmjkVYWoIbYz7Zax3ThmrDHB4IpFwWnk3
 l7WBykEXY7KEqpZzbh0GFGehZWzVZvLnNgDdvpmpk/GkPzeYKomBj7ZZfm3H1yGD
 BTHPwuWCTX+/K75yEVNO8aJO12wBg7DRl4IEwBgqhwU8ga4FvUOCJkm+SCxA1Dnn
 qlpS7qPwTXNIEfKMJcxp5X0KiwDY1EoOotd4glTN0jbeY5GEYcxe+7RQ302GrYxP
 zcc8EGLn8h6BtQvV3ypNHF5l6QeTW/0ZlO9c236tIuUQ5gQU39SQci7jQKsYjSzv
 BB1XdLHkbtIvYDkmbnr1elbeJCDbrWL9rAXRUTRyfuCzaFWTfZmfVNe8c8qwDMLk
 TUxMR/38aI7bLcIQjwj9
 =R5bX
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-4.5-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc updates from Michael Ellerman:
 "Core:
   - Ground work for the new Power9 MMU from Aneesh Kumar K.V
   - Optimise FP/VMX/VSX context switching from Anton Blanchard

  Misc:
   - Various cleanups from Krzysztof Kozlowski, John Ogness, Rashmica
     Gupta, Russell Currey, Gavin Shan, Daniel Axtens, Michael Neuling,
     Andrew Donnellan
   - Allow wrapper to work on non-english system from Laurent Vivier
   - Add rN aliases to the pt_regs_offset table from Rashmica Gupta
   - Fix module autoload for rackmeter & axonram drivers from Luis de
     Bethencourt
   - Include KVM guest test in all interrupt vectors from Paul Mackerras
   - Fix DSCR inheritance over fork() from Anton Blanchard
   - Make value-returning atomics & {cmp}xchg* & their atomic_ versions
     fully ordered from Boqun Feng
   - Print MSR TM bits in oops messages from Michael Neuling
   - Add TM signal return & invalid stack selftests from Michael Neuling
   - Limit EPOW reset event warnings from Vipin K Parashar
   - Remove the Cell QPACE code from Rashmica Gupta
   - Append linux_banner to exception information in xmon from Rashmica
     Gupta
   - Add selftest to check if VSRs are corrupted from Rashmica Gupta
   - Remove broken GregorianDay() from Daniel Axtens
   - Import Anton's context_switch2 benchmark into selftests from
     Michael Ellerman
   - Add selftest script to test HMI functionality from Daniel Axtens
   - Remove obsolete OPAL v2 support from Stewart Smith
   - Make enter_rtas() private from Michael Ellerman
   - PPR exception cleanups from Michael Ellerman
   - Add page soft dirty tracking from Laurent Dufour
   - Add support for Nvlink NPUs from Alistair Popple
   - Add support for kexec on 476fpe from Alistair Popple
   - Enable kernel CPU dlpar from sysfs from Nathan Fontenot
   - Copy only required pieces of the mm_context_t to the paca from
     Michael Neuling
   - Add a kmsg_dumper that flushes OPAL console output on panic from
     Russell Currey
   - Implement save_stack_trace_regs() to enable kprobe stack tracing
     from Steven Rostedt
   - Add HWCAP bits for Power9 from Michael Ellerman
   - Fix _PAGE_PTE breaking swapoff from Aneesh Kumar K.V
   - Fix _PAGE_SWP_SOFT_DIRTY breaking swapoff from Hugh Dickins
   - scripts/recordmcount.pl: support data in text section on powerpc
     from Ulrich Weigand
   - Handle R_PPC64_ENTRY relocations in modules from Ulrich Weigand

  cxl:
   - cxl: Fix possible idr warning when contexts are released from
     Vaibhav Jain
   - cxl: use correct operator when writing pcie config space values
     from Andrew Donnellan
   - cxl: Fix DSI misses when the context owning task exits from Vaibhav
     Jain
   - cxl: fix build for GCC 4.6.x from Brian Norris
   - cxl: use -Werror only with CONFIG_PPC_WERROR from Brian Norris
   - cxl: Enable PCI device ID for future IBM CXL adapter from Uma
     Krishnan

  Freescale:
   - Freescale updates from Scott: Highlights include moving QE code out
     of arch/powerpc (to be shared with arm), device tree updates, and
     minor fixes"

* tag 'powerpc-4.5-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (149 commits)
  powerpc/module: Handle R_PPC64_ENTRY relocations
  scripts/recordmcount.pl: support data in text section on powerpc
  powerpc/powernv: Fix OPAL_CONSOLE_FLUSH prototype and usages
  powerpc/mm: fix _PAGE_SWP_SOFT_DIRTY breaking swapoff
  powerpc/mm: Fix _PAGE_PTE breaking swapoff
  cxl: Enable PCI device ID for future IBM CXL adapter
  cxl: use -Werror only with CONFIG_PPC_WERROR
  cxl: fix build for GCC 4.6.x
  powerpc: Add HWCAP bits for Power9
  powerpc/powernv: Reserve PE#0 on NPU
  powerpc/powernv: Change NPU PE# assignment
  powerpc/powernv: Fix update of NVLink DMA mask
  powerpc/powernv: Remove misleading comment in pci.c
  powerpc: Implement save_stack_trace_regs() to enable kprobe stack tracing
  powerpc: Fix build break due to paca mm_context_t changes
  cxl: Fix DSI misses when the context owning task exits
  MAINTAINERS: Update Scott Wood's e-mail address
  powerpc/powernv: Fix minor off-by-one error in opal_mce_check_early_recovery()
  powerpc: Fix style of self-test config prompts
  powerpc/powernv: Only delay opal_rtc_read() retry when necessary
  ...
2016-01-15 13:18:47 -08:00
Linus Torvalds
37cea93b99 VFIO updates for v4.5-rc1
- Fixes in AMD xgbe reset, spapr structure padding, type 1 flags
    (Dan Carpenter, Alexey Kardashevskiy, Pierre Morel)
  - Re-introduce no-iommu mode, with a user this time (Alex Williamson)
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJWmVq0AAoJECObm247sIsiAZQP/2oPDw+KyCApPlIRIr7dQ2Xu
 TdHmGed1QlWaXdt5YWutaglIhxwUZppLtfQrwM9gPsRZygfZ2A/lhx8F5+i+8eib
 kN1CUKX+HV9hl8mDQLRmjwjuFEQ9bwXaOFaq+TBpTki+T/8ENW/69IvMa0oFttQi
 CYiCsAveADMmxzjOpfKCszpKaEn2v4zpghDmuvdCNbu7IZOT/X7BNG9YUJ9e7F+m
 /x1L9LhC7VllnLmJb/8gyCU0omoW0Z/DzrQQ0DgH9KZSRgU/9f2Sfj1Y6WytRAwJ
 xBMCnjElkJxl8QjKQ3nrdoYyu+hTgm/qYpQ2PlGDybVKYhiUlQ4/on9GUY5I5iFc
 L3vQIkY3zmruMrdA/syAB9b3ku0lFelvbUW4aJ/1OME9rpmgEkvzEJmJ931cbrL1
 vLt2wbSDAbiW3sHKMEA70tcKbWWSpXX6e0QlBLtTYFwuhtUSDMKffkqfokTQrMuw
 Hy0+FCCXYcckxp3qRQXfzpp9c4vM5dJ6bHTVMhZ6Akvhl+BQb4m5J1WpYxYqCqBK
 oJ0u92bGHOBvcXwxZ8KwkP+tjClFxe2mTpmPWfSZ0rj3Z5rHLfK8eCEQ/KgQGiGI
 7F2pnxdeAYlhgEULAh+9VtfsH3tEZCpEY68dLZAyBGn2kEAijtgOMaCeX7+GahVz
 nV89Yczp3Wyaa70idiQS
 =Z9Kq
 -----END PGP SIGNATURE-----

Merge tag 'vfio-v4.5-rc1' of git://github.com/awilliam/linux-vfio

Pull VFIO updates from Alex Williamson:

 - Fixes in AMD xgbe reset, spapr structure padding, type 1 flags (Dan
   Carpenter, Alexey Kardashevskiy, Pierre Morel)

 - Re-introduce no-iommu mode, with a user this time (Alex Williamson)

* tag 'vfio-v4.5-rc1' of git://github.com/awilliam/linux-vfio:
  vfio/iommu_type1: make use of info.flags
  vfio: Include No-IOMMU mode
  vfio: Add explicit alignments in vfio_iommu_spapr_tce_create
  VFIO: platform: reset: fix a warning message condition
2016-01-15 13:01:01 -08:00
Linus Torvalds
cc80fe0eef Smaller bugfixes and cleanup, including a fix for a failures of
kerberized NFSv4.1 mounts, and Scott Mayhew's work addressing ACK storms
 that can affect some high-availability NFS setups.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJWmVZHAAoJECebzXlCjuG+Cn4P/3zwSuwIeLuv9b89vzFXU8Xv
 AbBWHk7WkFXJQGTKdclYjwxqU+l15D5lYHCae1cuD5eXdviraxXf7EcnqrMhJUc0
 oRiQx0rAwlEkKUAVrxGCFP7WKjlX3TsEBV6wPpTCP3BEMzTPDEeaDek7+hICFkLF
 9a/miEXAopm3jxP7WNmXEkdKpFEHklDDwtv6Av7iIKCW6+7XCGp7Prqo4NQKAKp6
 hjE+nvt2HiD06MZhUeyb14cn6547smzt1rbSfK4IB4yHMwLyaoqPrT7ekDh9LDrE
 uGgo+Y2PBbEcTAE6tJ88EjZx7cMCFPn0te+eKPgnpPy9RqrNqSxj5N/b7JAecKgW
 a/09BtvFOoYs8fO5ovqeRY5THrE3IRyMIwn4gt7fCYaaAbG3dwGKG1uklTAVXtb1
 95DkhOb8He2VhOCCoJ6ybbTnRfjB6b/cv7ZuEGlQfvTE+BtU3Jj9I76ruWFhb3zd
 HM1dRI20UfwL/0Y8yYhZ+/rje9SSk2jOmVgSCqY9hnCmEqOqOdUU0X/uumIWaBym
 zfGx9GIM0jQuYVdLQRXtJJbUgJUUN3MilGyU5wx7YoXip5guqTalXqAdQpShzXeW
 s1ATYh/mY5X9ig51KogkkVlm9bXDQAzJBAnDRpLtJZqy5Cgkrj9RSu0ExN1Rmlhw
 LKQCddBQxUSWJ+XWycgK
 =G7V3
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-4.5' of git://linux-nfs.org/~bfields/linux

Pull nfsd updates from Bruce Fields:
 "Smaller bugfixes and cleanup, including a fix for a failures of
  kerberized NFSv4.1 mounts, and Scott Mayhew's work addressing ACK
  storms that can affect some high-availability NFS setups"

* tag 'nfsd-4.5' of git://linux-nfs.org/~bfields/linux:
  nfsd: add new io class tracepoint
  nfsd: give up on CB_LAYOUTRECALLs after two lease periods
  nfsd: Fix nfsd leaks sunrpc module references
  lockd: constify nlmsvc_binding structure
  lockd: use to_delayed_work
  nfsd: use to_delayed_work
  Revert "svcrdma: Do not send XDR roundup bytes for a write chunk"
  lockd: Register callbacks on the inetaddr_chain and inet6addr_chain
  nfsd: Register callbacks on the inetaddr_chain and inet6addr_chain
  sunrpc: Add a function to close temporary transports immediately
  nfsd: don't base cl_cb_status on stale information
  nfsd4: fix gss-proxy 4.1 mounts for some AD principals
  nfsd: fix unlikely NULL deref in mach_creds_match
  nfsd: minor consolidation of mach_cred handling code
  nfsd: helper for dup of possibly NULL string
  svcrpc: move some initialization to common code
  nfsd: fix a warning message
  nfsd: constify nfsd4_callback_ops structure
  nfsd: recover: constify nfsd4_client_tracking_ops structures
  svcrdma: Do not send XDR roundup bytes for a write chunk
2016-01-15 12:49:44 -08:00
Linus Torvalds
3c28c9ccaf md updates for 4.5
Mostly clustered-raid1 and raid5 journal updates.
 one Y2038 fix and other minor stuff.
 
 One patch removes me from the MAINTAINERS file and adds a record of
 my md maintainership to Credits.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJWmJEhAAoJEDnsnt1WYoG5raQQAI9lBrHO+Q8C8RImPsemLX0X
 ypjH38XUwEwKNYYfsCVI7PKAqCl7r8ITzY054gKsU0iHAfqLQlEN8aMz0v0fJQhg
 Msb7utrEMQ0UERwNcc+3J78ffFAdWrkHVd64Ley0h/pizFPSlL0K2RuIGTBc9sGX
 Hz2Ci11Ch7FdK7C/Zl7I6tK1pkthu3hBXYEZyg1GngRRhZEJj2U7mBmy1E37NA72
 o66B5r5FSlnIA8MAo/EAViCxtMJKBPRWU/WnkMhOJ1Yyw/FwMpbM2prLBLtYFqwF
 OLZOLuDUHY5HxdX2U+3R0hBzF78aozcH6od60SWg7wOmI/IkXYiYFujlxMd132FE
 OT+aa+UHHDEkATTSyt98OmxIkQ8uqKiNsSYqBk9lpNAPtmEbhqRX4RAOdrqP0G83
 DX7iyZpAK4YhB4BkJxMtNdSIOnss1TwfOdKyvoBZYmY6bTKh7p+dpw4cvIjV4VDi
 p6+BUQdJQ7mHRLV9QI4IuG52AJO8cRGc1OVvqLEMzO8uZlpyxX9nJrSqeP/dKKfa
 pJ5pYssilXEeKCDnODGqSRdt9aU4ENDW/oIkAW2U3cnSHUwBoMLF1WJ+M3Atbm+s
 i3/iDp26SnSiHM+DVHije5v0OGOroYdJwKDIFWToElcfc9Q5IDHU+KP8oeuPqqOS
 WA08l+zj+ahfP7Yu1DUC
 =gl+r
 -----END PGP SIGNATURE-----

Merge tag 'md/4.5' of git://neil.brown.name/md

Pull md updates from Neil Brown:
 "Mostly clustered-raid1 and raid5 journal updates.  one Y2038 fix and
  other minor stuff.

  One patch removes me from the MAINTAINERS file and adds a record of my
  md maintainership to Credits"

Many thanks to Neil, who has been around for a _looong_ time.

* tag 'md/4.5' of git://neil.brown.name/md: (26 commits)
  md/raid: only permit hot-add of compatible integrity profiles
  Remove myself as MD Maintainer, and add to Credits.
  raid5-cache: handle journal hotadd in quiesce
  MD: add journal with array suspended
  md: set MD_HAS_JOURNAL in correct places
  md: Remove 'ready' field from mddev.
  md: remove unnecesary md_new_event_inintr
  raid5: allow r5l_io_unit allocations to fail
  raid5-cache: use a mempool for the metadata block
  raid5-cache: use a bio_set
  raid5-cache: add journal hot add/remove support
  drivers: md: use ktime_get_real_seconds()
  md: avoid warning for 32-bit sector_t
  raid5-cache: free meta_page earlier
  raid5-cache: simplify r5l_move_io_unit_list
  md: update comment for md_allow_write
  md-cluster: update comments for MD_CLUSTER_SEND_LOCKED_ALREADY
  md-cluster: Protect communication with mutexes
  md-cluster: Defer MD reloading to mddev->thread
  md-cluster: update the documentation
  ...
2016-01-15 12:28:00 -08:00
Linus Torvalds
4b43ea2a7c regulator: Update for v4.5
Aside from a fix for a spurious warning (which caused more problems than
 it fixed in the fixing really) this is all driver updates, including new
 drivers for Dialog PV88060/90 and TI LM363x and TPS65086 devices.  The
 qcom_smd driver has had PM8916 and PMA8084 support added.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJWlUWiAAoJECTWi3JdVIfQtywH/0joibyUbL34arXcPNCUUhlo
 UnhTgT1kfOoS01ODt9gU49X4/8SRL9HdjcrIV8yTz4Ldb8by8Rd3ZrkO//9zs7BF
 5PIqIzEw6FT/Ovsj8Tc35irccMSrwtTIGRrA8zrM/WtWQWcIUWFxMw+j/lewIvvH
 jHZNTTLL5XemObSTfvzhbS6PZ8RcvRVXLNJjZYYO5xAEHNg3i5Hxao5p3QKzVJ8m
 P5n4jSiTY/pVePcs9dzzQTGjlzkE11KX7DBxnmXOSPZsArSaRIC53DCmqu/JoG+o
 3JJBkmRx7IJb0651hclFaUl8Wl56FSst+zigeAj82b301xpVR30iG3xiwYK/AT8=
 =NQ6n
 -----END PGP SIGNATURE-----

Merge tag 'regulator-v4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator

Pull regulator updates from Mark Brown:
 "Aside from a fix for a spurious warning (which caused more problems
  than it fixed in the fixing really) this is all driver updates,
  including new drivers for Dialog PV88060/90 and TI LM363x and TPS65086
  devices.  The qcom_smd driver has had PM8916 and PMA8084 support
  added"

* tag 'regulator-v4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator: (36 commits)
  regulator: core: remove some dead code
  regulator: core: use dev_to_rdev
  regulator: lp872x: Get rid of duplicate reference to DVS GPIO
  regulator: lp872x: Add missing of_match in regulators descriptions
  regulator: axp20x: Fix GPIO LDO enable value for AXP22x
  regulator: lp8788: constify regulator_ops structures
  regulator: wm8*: constify regulator_ops structures
  regulator: da9*: constify regulator_ops structures
  regulator: mt6311: Use REGCACHE_RBTREE
  regulator: tps65917/palmas: Add bypass ops for LDOs with bypass capability
  regulator: qcom-smd: Add support for PMA8084
  regulator: qcom-smd: Add PM8916 support
  soc: qcom: documentation: Update SMD/RPM Docs
  regulator: pv88090: logical vs bitwise AND typo
  regulator: pv88090: Fix irq leak
  regulator: pv88090: new regulator driver
  regulator: wm831x-ldo: Use platform_register/unregister_drivers()
  regulator: wm831x-dcdc: Use platform_register/unregister_drivers()
  regulator: lp8788-ldo: Use platform_register/unregister_drivers()
  regulator: core: Fix nested locking of supplies
  ...
2016-01-15 12:14:47 -08:00
Eric Dumazet
34ae6a1aa0 ipv6: update skb->csum when CE mark is propagated
When a tunnel decapsulates the outer header, it has to comply
with RFC 6080 and eventually propagate CE mark into inner header.

It turns out IP6_ECN_set_ce() does not correctly update skb->csum
for CHECKSUM_COMPLETE packets, triggering infamous "hw csum failure"
messages and stack traces.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-15 15:07:23 -05:00
Linus Torvalds
1d3671df72 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull UDF fixes and quota cleanups from Jan Kara:
 "Several UDF fixes and some minor quota cleanups"

* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  udf: Check output buffer length when converting name to CS0
  udf: Prevent buffer overrun with multi-byte characters
  quota: constify qtree_fmt_operations structures
  udf: avoid uninitialized variable use
  udf: Fix lost indirect extent block
  udf: Factor out code for creating indirect extent
  udf: limit the maximum number of indirect extents in a row
  udf: limit the maximum number of TD redirections
  fs: make quota/dquot.c explicitly non-modular
  fs: make quota/netlink.c explicitly non-modular
2016-01-15 11:51:51 -08:00
Linus Torvalds
875fc4f5dd Merge branch 'akpm' (patches from Andrew)
Merge first patch-bomb from Andrew Morton:

 - A few hotfixes which missed 4.4 becasue I was asleep.  cc'ed to
   -stable

 - A few misc fixes

 - OCFS2 updates

 - Part of MM.  Including pretty large changes to page-flags handling
   and to thp management which have been buffered up for 2-3 cycles now.

  I have a lot of MM material this time.

[ It turns out the THP part wasn't quite ready, so that got dropped from
  this series  - Linus ]

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (117 commits)
  zsmalloc: reorganize struct size_class to pack 4 bytes hole
  mm/zbud.c: use list_last_entry() instead of list_tail_entry()
  zram/zcomp: do not zero out zcomp private pages
  zram: pass gfp from zcomp frontend to backend
  zram: try vmalloc() after kmalloc()
  zram/zcomp: use GFP_NOIO to allocate streams
  mm: add tracepoint for scanning pages
  drivers/base/memory.c: fix kernel warning during memory hotplug on ppc64
  mm/page_isolation: use macro to judge the alignment
  mm: fix noisy sparse warning in LIBCFS_ALLOC_PRE()
  mm: rework virtual memory accounting
  include/linux/memblock.h: fix ordering of 'flags' argument in comments
  mm: move lru_to_page to mm_inline.h
  Documentation/filesystems: describe the shared memory usage/accounting
  memory-hotplug: don't BUG() in register_memory_resource()
  hugetlb: make mm and fs code explicitly non-modular
  mm/swapfile.c: use list_for_each_entry_safe in free_swap_count_continuations
  mm: /proc/pid/clear_refs: no need to clear VM_SOFTDIRTY in clear_soft_dirty_pmd()
  mm: make sure isolate_lru_page() is never called for tail page
  vmstat: make vmstat_updater deferrable again and shut down on idle
  ...
2016-01-15 11:41:44 -08:00
Konstantin Khlebnikov
9207f9d45b net: preserve IP control block during GSO segmentation
Skb_gso_segment() uses skb control block during segmentation.
This patch adds 32-bytes room for previous control block which
will be copied into all resulting segments.

This patch fixes kernel crash during fragmenting forwarded packets.
Fragmentation requires valid IP CB in skb for clearing ip options.
Also patch removes custom save/restore in ovs code, now it's redundant.

Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
Link: http://lkml.kernel.org/r/CALYGNiP-0MZ-FExV2HutTvE9U-QQtkKSoE--KN=JQE5STYsjAA@mail.gmail.com
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-15 14:35:24 -05:00
Linus Torvalds
7d1fc01afc Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
Pull trivial tree updates from Jiri Kosina.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial:
  floppy: make local variable non-static
  exynos: fixes an incorrect header guard
  dt-bindings: fixes some incorrect header guards
  cpufreq-dt: correct dead link in documentation
  cpufreq: ARM big LITTLE: correct dead link in documentation
  treewide: Fix typos in printk
  Documentation: filesystem: Fix typo in fs/eventfd.c
  fs/super.c: use && instead of & for warn_on condition
  Documentation: fix sysfs-ptp
  lib: scatterlist: fix Kconfig description
2016-01-14 17:04:19 -08:00
Linus Torvalds
0f0836b7eb Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching
Pull livepatching updates from Jiri Kosina:

 - RO/NX attribute fixes for patch module relocations from Josh
   Poimboeuf.  As part of this effort, module.c has been cleaned up as
   well and livepatching is piggy-backing on this cleanup.  Rusty is OK
   with this whole lot going through livepatching tree.

 - symbol disambiguation support from Chris J Arges.  That series is
   also

        Reviewed-by: Miroslav Benes <mbenes@suse.cz>

   but this came in only after I've alredy pushed out.  Didn't want to
   rebase because of that, hence I am mentioning it here.

 - symbol lookup fix from Miroslav Benes

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching:
  livepatch: Cleanup module page permission changes
  module: keep percpu symbols in module's symtab
  module: clean up RO/NX handling.
  module: use a structure to encapsulate layout.
  gcov: use within_module() helper.
  module: Use the same logic for setting and unsetting RO/NX
  livepatch: function,sympos scheme in livepatch sysfs directory
  livepatch: add sympos as disambiguator field to klp_reloc
  livepatch: add old_sympos as disambiguator field to klp_func
2016-01-14 16:38:02 -08:00
Linus Torvalds
c2848f2eef Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid
Pull HID updates from Jiri Kosina:

 - appoint Benjamin Tissoires as co-maintainer / designated reviewer

 - sysfs report_descriptor visibility fix for unclaimed devices, from
   Andy Lutomirski

 - suspend/resume fixes for Sony driver from Frank Praznik

 - IRQ deadlock fix from Ioan-Adrian Ratiu

 - hid-i2c fixes affecting (at least) Yoga 900 from Mika Westerberg and
   Srinivas Pandruvada

 - a lot of new device support (especially, but not limited to, Wacom)
   and assorted small misc fixes

 - almost complete G920 support; the only bit that is missing is
   switching the device to HID mode automatically; Simon Wood and Michal
   Maly are working on it.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid: (46 commits)
  Revert "INPUT: xpad: switch Logitech G920 Wheel into HID mode"
  HID: sensor-hub: Add quirk for Lenovo Yoga 900 with ITE Chips
  HID: Add new PID for Microchip Pick16F1454
  HID: wacom: Use correct report to query pen ID from INTUOSHT2 devices
  HID: i2c-hid: Prevent sending reports from racing with device reset
  HID: use kobj_to_dev()
  HID: wiimote: use dev_to_wii()
  HID: add a new helper to_hid_driver()
  HID: use to_hid_device()
  HID: move to_hid_device() to hid.h
  HID: usbhid: use to_usb_device
  HID: corsair: Convert to use module_hid_driver
  HID: input: ignore the battery in OKLICK Laser BTmouse
  HID: wacom: Fix pad button range for CINTIQ_COMPANION_2
  HID: wacom: Fix touchring value reporting
  HID: wacom: Report 'strip2' values in ABS_RY
  HID: wacom: Limit touchstrip data to 13 bits
  HID: wacom: bitwise vs logical ORs
  HID: wacom: Apply lowres quirk to BAMBOO_TOUCH devices
  HID: enable hid device to suspend/resume asynchronously
  ...
2016-01-14 16:20:42 -08:00
Linus Torvalds
75f26df6ae NFS client updates for Linux 4.5
Highlights include:
 
 Stable fixes:
 - Fix a regression in the SunRPC socket polling code
 - Fix the attribute cache revalidation code
 - Fix race in __update_open_stateid()
 - Fix an lo->plh_block_lgets imbalance in layoutreturn
 - Fix an Oopsable typo in ff_mirror_match_fh()
 
 Features:
 - pNFS layout recall performance improvements.
 - pNFS/flexfiles: Support server-supplied layoutstats sampling period
 
 Bugfixes + cleanups:
 - NFSv4: Don't perform cached access checks before we've OPENed the file
 - Fix starvation issues with background flushes
 - Reclaim writes should be flushed as unstable writes if there are already
   entries in the commit lists
 - Various bugfixes from Chuck to fix NFS/RDMA send queue ordering problems
 - Ensure that we propagate fatal layoutget errors back to the application
 - Fixes for sundry flexfiles layoutstats bugs
 - Fix files/flexfiles to not cache invalidated layouts in the DS commit buckets
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJWmAvPAAoJEGcL54qWCgDysScP/jnaRdQO+VTXTMtKcPiR7ujd
 LBcx3lrI1jsLYjlrBguTh9ROGt0maX1TAu/rsLuo4j/0wQC6dsQw+vFjfkI4CzSn
 4htK0f4hNjA29iOAjMaziAzsQW9eJ97Nn0HU4XD43OeK7PGh5e93Xk26Va4cO18P
 PqSam+FJoXpUSEWOzNzDwjTZTt4Voo3yJDqDTa8dU0x8c1qBktslo2n0WCntBxMn
 IbEDdBEIaUZmYCNhu2Sq1SLwYPatLg1Orfq3quMFJjzEeUbd0lVQno4C1fjjuACt
 DNXUgZDH0uR3U3naMXrdkqQ02GHEY9G0CO4a6q0Evsbm15wQuY6GMioxR0+ll7rX
 TeZGBUMq3cRFDR+/m1gTBZFjo7BUPE9LKXUazINVaoaJMYqpFunhI8V31ghx8/z8
 0kiracIEPXaIGmQ5S151+IDETpw9nntipCzdnduVefB2EAfXPeDzF7uFQPm+mvgx
 R4YuAFrlbcIZ/lZRYy5z6Fj3KLnytSOjzgXC5daxPQVt92QumQTQ6HC5jL25zVKb
 KOeSWHrFel7M+miL96ERvcS2vi+IDzPH9YbE9YTWbLW9LMBOYQKsukf1aaV9CwC4
 9OiNMYGQIGtmjbzIOlRcpVTAsXj+P6UVuwCfGTpQOm1Qa1fDbU+xSLkc62gg3WRa
 3E/3RMr1iXD8u1Kiz8hb
 =RBmi
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-4.5-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client updates from Trond Myklebust:
 "Highlights include:

  Stable fixes:
   - Fix a regression in the SunRPC socket polling code
   - Fix the attribute cache revalidation code
   - Fix race in __update_open_stateid()
   - Fix an lo->plh_block_lgets imbalance in layoutreturn
   - Fix an Oopsable typo in ff_mirror_match_fh()

  Features:
   - pNFS layout recall performance improvements.
   - pNFS/flexfiles: Support server-supplied layoutstats sampling period

  Bugfixes + cleanups:
   - NFSv4: Don't perform cached access checks before we've OPENed the
     file
   - Fix starvation issues with background flushes
   - Reclaim writes should be flushed as unstable writes if there are
     already entries in the commit lists
   - Various bugfixes from Chuck to fix NFS/RDMA send queue ordering
     problems
   - Ensure that we propagate fatal layoutget errors back to the
     application
   - Fixes for sundry flexfiles layoutstats bugs
   - Fix files/flexfiles to not cache invalidated layouts in the DS
     commit buckets"

* tag 'nfs-for-4.5-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (68 commits)
  NFS: Fix a compile warning about unused variable in nfs_generic_pg_pgios()
  NFSv4: Fix a compile warning about no prototype for nfs4_ioctl()
  NFS: Use wait_on_atomic_t() for unlock after readahead
  SUNRPC: Fixup socket wait for memory
  NFSv4.1/pNFS: Cleanup constify struct pnfs_layout_range arguments
  NFSv4.1/pnfs: Cleanup copying of pnfs_layout_range structures
  NFSv4.1/pNFS: Cleanup pnfs_mark_matching_lsegs_invalid()
  NFSv4.1/pNFS: Fix a race in initiate_file_draining()
  NFSv4.1/pNFS: pnfs_error_mark_layout_for_return() must always return layout
  NFSv4.1/pNFS: pnfs_mark_matching_lsegs_return() should set the iomode
  NFSv4.1/pNFS: Use nfs4_stateid_copy for copying stateids
  NFSv4.1/pNFS: Don't pass stateids by value to pnfs_send_layoutreturn()
  NFS: Relax requirements in nfs_flush_incompatible
  NFSv4.1/pNFS: Don't queue up a new commit if the layout segment is invalid
  NFS: Allow multiple commit requests in flight per file
  NFS/pNFS: Fix up pNFS write reschedule layering violations and bugs
  SUNRPC: Fix a missing break in rpc_anyaddr()
  pNFS/flexfiles: Fix an Oopsable typo in ff_mirror_match_fh()
  NFS: Fix attribute cache revalidation
  NFS: Ensure we revalidate attributes before using execute_ok()
  ...
2016-01-14 16:08:23 -08:00
Ebru Akagunduz
7d2eba0557 mm: add tracepoint for scanning pages
This patch series makes swapin readahead up to a certain number to gain
more thp performance and adds tracepoint for khugepaged_scan_pmd,
collapse_huge_page, __collapse_huge_page_isolate.

This patch series was written to deal with programs that access most,
but not all, of their memory after they get swapped out.  Currently
these programs do not get their memory collapsed into THPs after the
system swapped their memory out, while they would get THPs before
swapping happened.

This patch series was tested with a test program, it allocates 400MB of
memory, writes to it, and then sleeps.  I force the system to swap out
all.  Afterwards, the test program touches the area by writing and
leaves a piece of it without writing.  This shows how much swap in
readahead made by the patch.

Test results:

                        After swapped out
-------------------------------------------------------------------
              | Anonymous | AnonHugePages | Swap      | Fraction  |
-------------------------------------------------------------------
With patch    | 90076 kB    | 88064 kB    | 309928 kB |    %99    |
-------------------------------------------------------------------
Without patch | 194068 kB | 192512 kB     | 205936 kB |    %99    |
-------------------------------------------------------------------

                        After swapped in
-------------------------------------------------------------------
              | Anonymous | AnonHugePages | Swap      | Fraction  |
-------------------------------------------------------------------
With patch    | 201408 kB | 198656 kB     | 198596 kB |    %98    |
-------------------------------------------------------------------
Without patch | 292624 kB | 192512 kB     | 107380 kB |    %65    |
-------------------------------------------------------------------

This patch (of 3):

Using static tracepoints, data of functions is recorded.  It is good to
automatize debugging without doing a lot of changes in the source code.

This patch adds tracepoint for khugepaged_scan_pmd, collapse_huge_page
and __collapse_huge_page_isolate.

[dan.carpenter@oracle.com: add a missing tab]
Signed-off-by: Ebru Akagunduz <ebru.akagunduz@gmail.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Xie XiuQi <xiexiuqi@huawei.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14 16:00:49 -08:00
Joshua Clayton
543dfb2df8 mm: fix noisy sparse warning in LIBCFS_ALLOC_PRE()
Running sparse on drivers/staging/lustre results in dozens of warnings:
include/linux/gfp.h:281:41: warning: odd constant _Bool cast (400000
becomes 1)

Use "!!" to explicitly convert to bool and get rid of the warning.

Signed-off-by: Joshua Clayton <stillcompiling@gmail.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14 16:00:49 -08:00
Konstantin Khlebnikov
8463833590 mm: rework virtual memory accounting
When inspecting a vague code inside prctl(PR_SET_MM_MEM) call (which
testing the RLIMIT_DATA value to figure out if we're allowed to assign
new @start_brk, @brk, @start_data, @end_data from mm_struct) it's been
commited that RLIMIT_DATA in a form it's implemented now doesn't do
anything useful because most of user-space libraries use mmap() syscall
for dynamic memory allocations.

Linus suggested to convert RLIMIT_DATA rlimit into something suitable
for anonymous memory accounting.  But in this patch we go further, and
the changes are bundled together as:

 * keep vma counting if CONFIG_PROC_FS=n, will be used for limits
 * replace mm->shared_vm with better defined mm->data_vm
 * account anonymous executable areas as executable
 * account file-backed growsdown/up areas as stack
 * drop struct file* argument from vm_stat_account
 * enforce RLIMIT_DATA for size of data areas

This way code looks cleaner: now code/stack/data classification depends
only on vm_flags state:

 VM_EXEC & ~VM_WRITE            -> code  (VmExe + VmLib in proc)
 VM_GROWSUP | VM_GROWSDOWN      -> stack (VmStk)
 VM_WRITE & ~VM_SHARED & !stack -> data  (VmData)

The rest (VmSize - VmData - VmStk - VmExe - VmLib) could be called
"shared", but that might be strange beast like readonly-private or VM_IO
area.

 - RLIMIT_AS            limits whole address space "VmSize"
 - RLIMIT_STACK         limits stack "VmStk" (but each vma individually)
 - RLIMIT_DATA          now limits "VmData"

Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
Cc: Vegard Nossum <vegard.nossum@oracle.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Willy Tarreau <w@1wt.eu>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Kees Cook <keescook@google.com>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Pavel Emelyanov <xemul@virtuozzo.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14 16:00:49 -08:00
Florian Fainelli
d30b5545bd include/linux/memblock.h: fix ordering of 'flags' argument in comments
for_each_free_mem_range() and for_each_free_mem_range_reverse() both
accept a 'flags' argument, the comment surrounding the macro placed the
'flags' documentation at the very end, while 'flags' is in fact the 3rd
argument to the macro, so let's preserve natural ordering here.

Fixes: fc6daaf931518 ("mm/memblock: add extra "flags" to memblock to allow selection of memory based on attribute")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14 16:00:49 -08:00
Geliang Tang
d72ee91113 mm: move lru_to_page to mm_inline.h
Move lru_to_page() from internal.h to mm_inline.h.

Signed-off-by: Geliang Tang <geliangtang@163.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14 16:00:49 -08:00
Christoph Lameter
0eb77e9880 vmstat: make vmstat_updater deferrable again and shut down on idle
Currently the vmstat updater is not deferrable as a result of commit
ba4877b9ca51 ("vmstat: do not use deferrable delayed work for
vmstat_update").  This in turn can cause multiple interruptions of the
applications because the vmstat updater may run at

Make vmstate_update deferrable again and provide a function that folds
the differentials when the processor is going to idle mode thus
addressing the issue of the above commit in a clean way.

Note that the shepherd thread will continue scanning the differentials
from another processor and will reenable the vmstat workers if it
detects any changes.

Fixes: ba4877b9ca51 ("vmstat: do not use deferrable delayed work for vmstat_update")
Signed-off-by: Christoph Lameter <cl@linux.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14 16:00:49 -08:00
Johannes Weiner
ef12947c9c mm: memcontrol: switch to the updated jump-label API
According to <linux/jump_label.h> the direct use of struct static_key is
deprecated.  Update the socket and slab accounting code accordingly.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: David S. Miller <davem@davemloft.net>
Reported-by: Jason Baron <jbaron@akamai.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14 16:00:49 -08:00
Johannes Weiner
8e8ae64524 mm: memcontrol: hook up vmpressure to socket pressure
Let the networking stack know when a memcg is under reclaim pressure so
that it can clamp its transmit windows accordingly.

Whenever the reclaim efficiency of a cgroup's LRU lists drops low enough
for a MEDIUM or HIGH vmpressure event to occur, assert a pressure state
in the socket and tcp memory code that tells it to curb consumption
growth from sockets associated with said control group.

Traditionally, vmpressure reports for the entire subtree of a memcg
under pressure, which drops useful information on the individual groups
reclaimed.  However, it's too late to change the userinterface, so add a
second reporting mode that reports on the level of reclaim instead of at
the level of pressure, and use that report for sockets.

vmpressure events are naturally edge triggered, so for hysteresis assert
socket pressure for a second to allow for subsequent vmpressure events
to occur before letting the socket code return to normal.

This will likely need finetuning for a wider variety of workloads, but
for now stick to the vmpressure presets and keep hysteresis simple.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: David S. Miller <davem@davemloft.net>
Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14 16:00:49 -08:00
Johannes Weiner
f7e1cb6ec5 mm: memcontrol: account socket memory in unified hierarchy memory controller
Socket memory can be a significant share of overall memory consumed by
common workloads.  In order to provide reasonable resource isolation in
the unified hierarchy, this type of memory needs to be included in the
tracking/accounting of a cgroup under active memory resource control.

Overhead is only incurred when a non-root control group is created AND
the memory controller is instructed to track and account the memory
footprint of that group.  cgroup.memory=nosocket can be specified on the
boot commandline to override any runtime configuration and forcibly
exclude socket memory from active memory resource control.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: David S. Miller <davem@davemloft.net>
Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14 16:00:49 -08:00
Johannes Weiner
80e95fe0fd mm: memcontrol: generalize the socket accounting jump label
The unified hierarchy memory controller is going to use this jump label
as well to control the networking callbacks.  Move it to the memory
controller code and give it a more generic name.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14 16:00:49 -08:00
Johannes Weiner
baac50bbc3 net: tcp_memcontrol: simplify linkage between socket and page counter
There won't be any separate counters for socket memory consumed by
protocols other than TCP in the future.  Remove the indirection and link
sockets directly to their owning memory cgroup.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14 16:00:49 -08:00
Johannes Weiner
e805605c72 net: tcp_memcontrol: sanitize tcp memory accounting callbacks
There won't be a tcp control soft limit, so integrating the memcg code
into the global skmem limiting scheme complicates things unnecessarily.
Replace this with simple and clear charge and uncharge calls--hidden
behind a jump label--to account skb memory.

Note that this is not purely aesthetic: as a result of shoehorning the
per-memcg code into the same memory accounting functions that handle the
global level, the old code would compare the per-memcg consumption
against the smaller of the per-memcg limit and the global limit.  This
allowed the total consumption of multiple sockets to exceed the global
limit, as long as the individual sockets stayed within bounds.  After
this change, the code will always compare the per-memcg consumption to
the per-memcg limit, and the global consumption to the global limit, and
thus close this loophole.

Without a soft limit, the per-memcg memory pressure state in sockets is
generally questionable.  However, we did it until now, so we continue to
enter it when the hard limit is hit, and packets are dropped, to let
other sockets in the cgroup know that they shouldn't grow their transmit
windows, either.  However, keep it simple in the new callback model and
leave memory pressure lazily when the next packet is accepted (as
opposed to doing it synchroneously when packets are processed).  When
packets are dropped, network performance will already be in the toilet,
so that should be a reasonable trade-off.

As described above, consumption is now checked on the per-memcg level
and the global level separately.  Likewise, memory pressure states are
maintained on both the per-memcg level and the global level, and a
socket is considered under pressure when either level asserts as much.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14 16:00:49 -08:00
Johannes Weiner
80f23124f5 net: tcp_memcontrol: simplify the per-memcg limit access
tcp_memcontrol replicates the global sysctl_mem limit array per cgroup,
but it only ever sets these entries to the value of the memory_allocated
page_counter limit.  Use the latter directly.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14 16:00:49 -08:00
Johannes Weiner
af95d7df40 net: tcp_memcontrol: remove dead per-memcg count of allocated sockets
The number of allocated sockets is used for calculations in the soft
limit phase, where packets are accepted but the socket is under memory
pressure.
 Since there is no soft limit phase in tcp_memcontrol, and memory
pressure is only entered when packets are already dropped, this is
actually dead code.  Remove it.

As this is the last user of parent_cg_proto(), remove that too.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: David S. Miller <davem@davemloft.net>
Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14 16:00:49 -08:00
Johannes Weiner
3d596f7b90 net: tcp_memcontrol: protect all tcp_memcontrol calls by jump-label
Move the jump-label from sock_update_memcg() and sock_release_memcg() to
the callsite, and so eliminate those function calls when socket
accounting is not enabled.

This also eliminates the need for dummy functions because the calls will
be optimized away if the Kconfig options are not enabled.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: David S. Miller <davem@davemloft.net>
Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14 16:00:49 -08:00
Johannes Weiner
931f3f4beb net: tcp_memcontrol: remove bogus hierarchy pressure propagation
When a cgroup currently breaches its socket memory limit, it enters
memory pressure mode for itself and its *ancestors*.  This throttles
transmission in unrelated sibling and cousin subtrees that have nothing
to do with the breached limit.

On the contrary, breaching a limit should make that group and its
*children* enter memory pressure mode.  But this happens already, albeit
lazily: if an ancestor limit is breached, siblings will enter memory
pressure on their own once the next packet arrives for them.

So no additional hierarchy code is needed.  Remove the bogus stuff.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: David S. Miller <davem@davemloft.net>
Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14 16:00:49 -08:00
Johannes Weiner
8c2c2358b2 net: tcp_memcontrol: properly detect ancestor socket pressure
When charging socket memory, the code currently checks only the local
page counter for excess to determine whether the memcg is under socket
pressure.  But even if the local counter is fine, one of the ancestors
could have breached its limit, which should also force this child to
enter socket pressure.  This currently doesn't happen.

Fix this by using page_counter_try_charge() first.  If that fails, it
means that either the local counter or one of the ancestors are in
excess of their limit, and the child should enter socket pressure.

Fixes: 3e32cb2e0a12 ("mm: memcontrol: lockless page counters")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: David S. Miller <davem@davemloft.net>
Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14 16:00:49 -08:00
Johannes Weiner
7d828602e5 mm: memcontrol: export root_mem_cgroup
A later patch will need this symbol in files other than memcontrol.c, so
export it now and replace mem_cgroup_root_css at the same time.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: David S. Miller <davem@davemloft.net>
Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14 16:00:49 -08:00
Alexander Kuleshov
8c9c1701c7 mm/memblock: introduce for_each_memblock_type()
We already have the for_each_memblock() macro in <linux/memblock.h>
which provides ability to iterate over memblock regions of a known type.
The for_each_memblock() macro allows us to pass the pointer to the
struct memblock_type, instead we need to pass name of the type.

This patch introduces a new macro for_each_memblock_type() which allows
us iterate over memblock regions with the given type when the type is
unknown.

Signed-off-by: Alexander Kuleshov <kuleshovmail@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14 16:00:49 -08:00
Johannes Weiner
a8d0143730 mm: page_alloc: generalize the dirty balance reserve
The dirty balance reserve that dirty throttling has to consider is
merely memory not available to userspace allocations.  There is nothing
writeback-specific about it.  Generalize the name so that it's reusable
outside of that context.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14 16:00:49 -08:00
Michal Hocko
c20cd45eb0 mm: allow GFP_{FS,IO} for page_cache_read page cache allocation
page_cache_read has been historically using page_cache_alloc_cold to
allocate a new page.  This means that mapping_gfp_mask is used as the
base for the gfp_mask.  Many filesystems are setting this mask to
GFP_NOFS to prevent from fs recursion issues.  page_cache_read is called
from the vm_operations_struct::fault() context during the page fault.
This context doesn't need the reclaim protection normally.

ceph and ocfs2 which call filemap_fault from their fault handlers seem
to be OK because they are not taking any fs lock before invoking generic
implementation.  xfs which takes XFS_MMAPLOCK_SHARED is safe from the
reclaim recursion POV because this lock serializes truncate and punch
hole with the page faults and it doesn't get involved in the reclaim.

There is simply no reason to deliberately use a weaker allocation
context when a __GFP_FS | __GFP_IO can be used.  The GFP_NOFS protection
might be even harmful.  There is a push to fail GFP_NOFS allocations
rather than loop within allocator indefinitely with a very limited
reclaim ability.  Once we start failing those requests the OOM killer
might be triggered prematurely because the page cache allocation failure
is propagated up the page fault path and end up in
pagefault_out_of_memory.

We cannot play with mapping_gfp_mask directly because that would be racy
wrt.  parallel page faults and it might interfere with other users who
really rely on NOFS semantic from the stored gfp_mask.  The mask is also
inode proper so it would even be a layering violation.  What we can do
instead is to push the gfp_mask into struct vm_fault and allow fs layer
to overwrite it should the callback need to be called with a different
allocation context.

Initialize the default to (mapping_gfp_mask | __GFP_FS | __GFP_IO)
because this should be safe from the page fault path normally.  Why do
we care about mapping_gfp_mask at all then? Because this doesn't hold
only reclaim protection flags but it also might contain zone and
movability restrictions (GFP_DMA32, __GFP_MOVABLE and others) so we have
to respect those.

Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Jan Kara <jack@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14 16:00:49 -08:00
Daniel Cashman
d07e22597d mm: mmap: add new /proc tunable for mmap_base ASLR
Address Space Layout Randomization (ASLR) provides a barrier to
exploitation of user-space processes in the presence of security
vulnerabilities by making it more difficult to find desired code/data
which could help an attack.  This is done by adding a random offset to
the location of regions in the process address space, with a greater
range of potential offset values corresponding to better protection/a
larger search-space for brute force, but also to greater potential for
fragmentation.

The offset added to the mmap_base address, which provides the basis for
the majority of the mappings for a process, is set once on process exec
in arch_pick_mmap_layout() and is done via hard-coded per-arch values,
which reflect, hopefully, the best compromise for all systems.  The
trade-off between increased entropy in the offset value generation and
the corresponding increased variability in address space fragmentation
is not absolute, however, and some platforms may tolerate higher amounts
of entropy.  This patch introduces both new Kconfig values and a sysctl
interface which may be used to change the amount of entropy used for
offset generation on a system.

The direct motivation for this change was in response to the
libstagefright vulnerabilities that affected Android, specifically to
information provided by Google's project zero at:

  http://googleprojectzero.blogspot.com/2015/09/stagefrightened.html

The attack presented therein, by Google's project zero, specifically
targeted the limited randomness used to generate the offset added to the
mmap_base address in order to craft a brute-force-based attack.
Concretely, the attack was against the mediaserver process, which was
limited to respawning every 5 seconds, on an arm device.  The hard-coded
8 bits used resulted in an average expected success rate of defeating
the mmap ASLR after just over 10 minutes (128 tries at 5 seconds a
piece).  With this patch, and an accompanying increase in the entropy
value to 16 bits, the same attack would take an average expected time of
over 45 hours (32768 tries), which makes it both less feasible and more
likely to be noticed.

The introduced Kconfig and sysctl options are limited by per-arch
minimum and maximum values, the minimum of which was chosen to match the
current hard-coded value and the maximum of which was chosen so as to
give the greatest flexibility without generating an invalid mmap_base
address, generally a 3-4 bits less than the number of bits in the
user-space accessible virtual address space.

When decided whether or not to change the default value, a system
developer should consider that mmap_base address could be placed
anywhere up to 2^(value) bits away from the non-randomized location,
which would introduce variable-sized areas above and below the mmap_base
address such that the maximum vm_area_struct size may be reduced,
preventing very large allocations.

This patch (of 4):

ASLR only uses as few as 8 bits to generate the random offset for the
mmap base address on 32 bit architectures.  This value was chosen to
prevent a poorly chosen value from dividing the address space in such a
way as to prevent large allocations.  This may not be an issue on all
platforms.  Allow the specification of a minimum number of bits so that
platforms desiring greater ASLR protection may determine where to place
the trade-off.

Signed-off-by: Daniel Cashman <dcashman@google.com>
Cc: Russell King <linux@arm.linux.org.uk>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Don Zickus <dzickus@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Heinrich Schuchardt <xypron.glpk@gmx.de>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Mark Salyzyn <salyzyn@android.com>
Cc: Jeff Vander Stoep <jeffv@google.com>
Cc: Nick Kralevich <nnk@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Hector Marco-Gisbert <hecmargi@upv.es>
Cc: Borislav Petkov <bp@suse.de>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14 16:00:49 -08:00
Vladimir Davydov
9ee11ba425 memcg: do not allow to disable tcp accounting after limit is set
There are two bits defined for cg_proto->flags - MEMCG_SOCK_ACTIVATED
and MEMCG_SOCK_ACTIVE - both are set in tcp_update_limit, but the former
is never cleared while the latter can be cleared by unsetting the limit.
This allows to disable tcp socket accounting for new sockets after it
was enabled by writing -1 to memory.kmem.tcp.limit_in_bytes while still
guaranteeing that memcg_socket_limit_enabled static key will be
decremented on memcg destruction.

This functionality looks dubious, because it is not clear what a use
case would be.  By enabling tcp accounting a user accepts the price.  If
they then find the performance degradation unacceptable, they can always
restart their workload with tcp accounting disabled.  It does not seem
there is any need to flip it while the workload is running.

Besides, it contradicts to how kmem accounting API works: writing
whatever to memory.kmem.limit_in_bytes enables kmem accounting for the
cgroup in question, after which it cannot be disabled.  Therefore one
might expect that writing -1 to memory.kmem.tcp.limit_in_bytes just
enables socket accounting w/o limiting it, which might be useful by
itself, but it isn't true.

Since this API peculiarity is not documented anywhere, I propose to drop
it.  This will allow to simplify the code by dropping cg_proto->flags.

Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14 16:00:49 -08:00
David Rientjes
244d63ee34 mm, vmalloc: remove VM_VPAGES
VM_VPAGES is unnecessary, it's easier to check is_vmalloc_addr() when
reading /proc/vmallocinfo.

[akpm@linux-foundation.org: remove VM_VPAGES reference via kvfree()]
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14 16:00:49 -08:00
Jerome Marchand
eca56ff906 mm, shmem: add internal shmem resident memory accounting
Currently looking at /proc/<pid>/status or statm, there is no way to
distinguish shmem pages from pages mapped to a regular file (shmem pages
are mapped to /dev/zero), even though their implication in actual memory
use is quite different.

The internal accounting currently counts shmem pages together with
regular files.  As a preparation to extend the userspace interfaces,
this patch adds MM_SHMEMPAGES counter to mm_rss_stat to account for
shmem pages separately from MM_FILEPAGES.  The next patch will expose it
to userspace - this patch doesn't change the exported values yet, by
adding up MM_SHMEMPAGES to MM_FILEPAGES at places where MM_FILEPAGES was
used before.  The only user-visible change after this patch is the OOM
killer message that separates the reported "shmem-rss" from "file-rss".

[vbabka@suse.cz: forward-porting, tweak changelog]
Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14 16:00:49 -08:00
Vlastimil Babka
48131e03ca mm, proc: reduce cost of /proc/pid/smaps for unpopulated shmem mappings
Following the previous patch, further reduction of /proc/pid/smaps cost
is possible for private writable shmem mappings with unpopulated areas
where the page walk invokes the .pte_hole function.  We can use radix
tree iterator for each such area instead of calling find_get_entry() in
a loop.  This is possible at the extra maintenance cost of introducing
another shmem function shmem_partial_swap_usage().

To demonstrate the diference, I have measured this on a process that
creates a private writable 2GB mapping of a partially swapped out
/dev/shm/file (which cannot employ the optimizations from the prvious
patch) and doesn't populate it at all.  I time how long does it take to
cat /proc/pid/smaps of this process 100 times.

Before this patch:

real    0m3.831s
user    0m0.180s
sys     0m3.212s

After this patch:

real    0m1.176s
user    0m0.180s
sys     0m0.684s

The time is similar to the case where a radix tree iterator is employed
on the whole mapping.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14 16:00:49 -08:00
Vlastimil Babka
6a15a37097 mm, proc: reduce cost of /proc/pid/smaps for shmem mappings
The previous patch has improved swap accounting for shmem mapping, which
however made /proc/pid/smaps more expensive for shmem mappings, as we
consult the radix tree for each pte_none entry, so the overal complexity
is O(n*log(n)).

We can reduce this significantly for mappings that cannot contain COWed
pages, because then we can either use the statistics tha shmem object
itself tracks (if the mapping contains the whole object, or the swap
usage of the whole object is zero), or use the radix tree iterator,
which is much more effective than repeated find_get_entry() calls.

This patch therefore introduces a function shmem_swap_usage(vma) and
makes /proc/pid/smaps use it when possible.  Only for writable private
mappings of shmem objects (i.e.  tmpfs files) with the shmem object
itself (partially) swapped outwe have to resort to the find_get_entry()
approach.

Hopefully such mappings are relatively uncommon.

To demonstrate the diference, I have measured this on a process that
creates a 2GB mapping and dirties single pages with a stride of 2MB, and
time how long does it take to cat /proc/pid/smaps of this process 100
times.

Private writable mapping of a /dev/shm/file (the most complex case):

real    0m3.831s
user    0m0.180s
sys     0m3.212s

Shared mapping of an almost full mapping of a partially swapped /dev/shm/file
(which needs to employ the radix tree iterator).

real    0m1.351s
user    0m0.096s
sys     0m0.768s

Same, but with /dev/shm/file not swapped (so no radix tree walk needed)

real    0m0.935s
user    0m0.128s
sys     0m0.344s

Private anonymous mapping:

real    0m0.949s
user    0m0.116s
sys     0m0.348s

The cost is now much closer to the private anonymous mapping case, unless
the shmem mapping is private and writable.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14 16:00:49 -08:00
Yaowei Bai
5b80287a65 mm/mmzone.c: memmap_valid_within() can be boolean
Make memmap_valid_within return bool due to this particular function
only using either one or zero as its return value.

No functional change.

Signed-off-by: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14 16:00:49 -08:00