commit 731bd6a93a upstream.
For non-eager fpu mode, thread's fpu state is allocated during the first
fpu usage (in the context of device not available exception). This
(math_state_restore()) can be a blocking call and hence we enable
interrupts (which were originally disabled when the exception happened),
allocate memory and disable interrupts etc.
But the eager-fpu mode, call's the same math_state_restore() from
kernel_fpu_end(). The assumption being that tsk_used_math() is always
set for the eager-fpu mode and thus avoid the code path of enabling
interrupts, allocating fpu state using blocking call and disable
interrupts etc.
But the below issue was noticed by Maarten Baert, Nate Eldredge and
few others:
If a user process dumps core on an ecrypt fs while aesni-intel is loaded,
we get a BUG() in __find_get_block() complaining that it was called with
interrupts disabled; then all further accesses to our ecrypt fs hang
and we have to reboot.
The aesni-intel code (encrypting the core file that we are writing) needs
the FPU and quite properly wraps its code in kernel_fpu_{begin,end}(),
the latter of which calls math_state_restore(). So after kernel_fpu_end(),
interrupts may be disabled, which nobody seems to expect, and they stay
that way until we eventually get to __find_get_block() which barfs.
For eager fpu, most the time, tsk_used_math() is true. At few instances
during thread exit, signal return handling etc, tsk_used_math() might
be false.
In kernel_fpu_end(), for eager-fpu, call math_state_restore()
only if tsk_used_math() is set. Otherwise, don't bother. Kernel code
path which cleared tsk_used_math() knows what needs to be done
with the fpu state.
Reported-by: Maarten Baert <maarten-baert@hotmail.com>
Reported-by: Nate Eldredge <nate@thatsmathematics.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Suresh Siddha <sbsiddha@gmail.com>
Link: http://lkml.kernel.org/r/1391410583.3801.6.camel@europa
Cc: George Spelvin <linux@horizon.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 596f3142d2 upstream.
We always disable cr8 intercept in its handler, but only re-enable it
if handling KVM_REQ_EVENT, so there can be a window where we do not
intercept cr8 writes, which allows an interrupt to disrupt a higher
priority task.
Fix this by disabling intercepts in the same function that re-enables
them when needed. This fixes BSOD in Windows 2008.
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 847d7970de upstream.
For systems with multiple servers and routed fabric, all
northbridges get assigned to the first server. Fix this by also
using the node reported from the PCI bus. For single-fabric
systems, the northbriges are on PCI bus 0 by definition, which
are on NUMA node 0 by definition, so this is invarient on most
systems.
Tested on fam10h and fam15h single and multi-fabric systems and
candidate for stable.
Signed-off-by: Daniel J Blueman <daniel@numascale.com>
Acked-by: Steffen Persvold <sp@numascale.com>
Acked-by: Borislav Petkov <bp@suse.de>
Link: http://lkml.kernel.org/r/1394710981-3596-1-git-send-email-daniel@numascale.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit b01d4e6893 upstream.
It's an enum, not a #define, you can't use it in asm files.
Introduced in commit 5fa10196bd ("x86: Ignore NMIs that come in during
early boot"), and sadly I didn't compile-test things like I should have
before pushing out.
My weak excuse is that the x86 tree generally doesn't introduce stupid
things like this (and the ARM pull afterwards doesn't cause me to do a
compile-test either, since I don't cross-compile).
Cc: Don Zickus <dzickus@redhat.com>
Cc: H. Peter Anvin <hpa@linux.intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 5fa10196bd upstream.
Don Zickus reports:
A customer generated an external NMI using their iLO to test kdump
worked. Unfortunately, the machine hung. Disabling the nmi_watchdog
made things work.
I speculated the external NMI fired, caused the machine to panic (as
expected) and the perf NMI from the watchdog came in and was latched.
My guess was this somehow caused the hang.
----
It appears that the latched NMI stays latched until the early page
table generation on 64 bits, which causes exceptions to happen which
end in IRET, which re-enable NMI. Therefore, ignore NMIs that come in
during early execution, until we have proper exception handling.
Reported-and-tested-by: Don Zickus <dzickus@redhat.com>
Link: http://lkml.kernel.org/r/1394221143-29713-1-git-send-email-dzickus@redhat.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 26e61e8939 upstream.
Vince "Super Tester" Weaver reported a new round of syscall fuzzing (Trinity) failures,
with perf WARN_ON()s triggering. He also provided traces of the failures.
This is I think the relevant bit:
> pec_1076_warn-2804 [000] d... 147.926153: x86_pmu_disable: x86_pmu_disable
> pec_1076_warn-2804 [000] d... 147.926153: x86_pmu_state: Events: {
> pec_1076_warn-2804 [000] d... 147.926156: x86_pmu_state: 0: state: .R config: ffffffffffffffff ( (null))
> pec_1076_warn-2804 [000] d... 147.926158: x86_pmu_state: 33: state: AR config: 0 (ffff88011ac99800)
> pec_1076_warn-2804 [000] d... 147.926159: x86_pmu_state: }
> pec_1076_warn-2804 [000] d... 147.926160: x86_pmu_state: n_events: 1, n_added: 0, n_txn: 1
> pec_1076_warn-2804 [000] d... 147.926161: x86_pmu_state: Assignment: {
> pec_1076_warn-2804 [000] d... 147.926162: x86_pmu_state: 0->33 tag: 1 config: 0 (ffff88011ac99800)
> pec_1076_warn-2804 [000] d... 147.926163: x86_pmu_state: }
> pec_1076_warn-2804 [000] d... 147.926166: collect_events: Adding event: 1 (ffff880119ec8800)
So we add the insn:p event (fd[23]).
At this point we should have:
n_events = 2, n_added = 1, n_txn = 1
> pec_1076_warn-2804 [000] d... 147.926170: collect_events: Adding event: 0 (ffff8800c9e01800)
> pec_1076_warn-2804 [000] d... 147.926172: collect_events: Adding event: 4 (ffff8800cbab2c00)
We try and add the {BP,cycles,br_insn} group (fd[3], fd[4], fd[15]).
These events are 0:cycles and 4:br_insn, the BP event isn't x86_pmu so
that's not visible.
group_sched_in()
pmu->start_txn() /* nop - BP pmu */
event_sched_in()
event->pmu->add()
So here we should end up with:
0: n_events = 3, n_added = 2, n_txn = 2
4: n_events = 4, n_added = 3, n_txn = 3
But seeing the below state on x86_pmu_enable(), the must have failed,
because the 0 and 4 events aren't there anymore.
Looking at group_sched_in(), since the BP is the leader, its
event_sched_in() must have succeeded, for otherwise we would not have
seen the sibling adds.
But since neither 0 or 4 are in the below state; their event_sched_in()
must have failed; but I don't see why, the complete state: 0,0,1:p,4
fits perfectly fine on a core2.
However, since we try and schedule 4 it means the 0 event must have
succeeded! Therefore the 4 event must have failed, its failure will
have put group_sched_in() into the fail path, which will call:
event_sched_out()
event->pmu->del()
on 0 and the BP event.
Now x86_pmu_del() will reduce n_events; but it will not reduce n_added;
giving what we see below:
n_event = 2, n_added = 2, n_txn = 2
> pec_1076_warn-2804 [000] d... 147.926177: x86_pmu_enable: x86_pmu_enable
> pec_1076_warn-2804 [000] d... 147.926177: x86_pmu_state: Events: {
> pec_1076_warn-2804 [000] d... 147.926179: x86_pmu_state: 0: state: .R config: ffffffffffffffff ( (null))
> pec_1076_warn-2804 [000] d... 147.926181: x86_pmu_state: 33: state: AR config: 0 (ffff88011ac99800)
> pec_1076_warn-2804 [000] d... 147.926182: x86_pmu_state: }
> pec_1076_warn-2804 [000] d... 147.926184: x86_pmu_state: n_events: 2, n_added: 2, n_txn: 2
> pec_1076_warn-2804 [000] d... 147.926184: x86_pmu_state: Assignment: {
> pec_1076_warn-2804 [000] d... 147.926186: x86_pmu_state: 0->33 tag: 1 config: 0 (ffff88011ac99800)
> pec_1076_warn-2804 [000] d... 147.926188: x86_pmu_state: 1->0 tag: 1 config: 1 (ffff880119ec8800)
> pec_1076_warn-2804 [000] d... 147.926188: x86_pmu_state: }
> pec_1076_warn-2804 [000] d... 147.926190: x86_pmu_enable: S0: hwc->idx: 33, hwc->last_cpu: 0, hwc->last_tag: 1 hwc->state: 0
So the problem is that x86_pmu_del(), when called from a
group_sched_in() that fails (for whatever reason), and without x86_pmu
TXN support (because the leader is !x86_pmu), will corrupt the n_added
state.
Reported-and-Tested-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Dave Jones <davej@redhat.com>
Link: http://lkml.kernel.org/r/20140221150312.GF3104@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit c091c71ad2 upstream.
GFP_ATOMIC is not a single gfp flag, but a macro which expands to the other
flags, where meaningful is the LACK of __GFP_WAIT flag. To check if caller
wants to perform an atomic allocation, the code must test for a lack of the
__GFP_WAIT flag. This patch fixes the issue introduced in v3.5-rc1.
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 1b385cbdd7 upstream.
Commit e504c9098e (kvm, vmx: Fix lazy FPU on nested guest, 2013-11-13)
highlighted a real problem, but the fix was subtly wrong.
nested_read_cr0 is the CR0 as read by L2, but here we want to look at
the CR0 value reflecting L1's setup. In other words, L2 might think
that TS=0 (so nested_read_cr0 has the bit clear); but if L1 is actually
running it with TS=1, we should inject the fault into L1.
The effective value of CR0 in L2 is contained in vmcs12->guest_cr0, use
it.
Fixes: e504c9098e
Reported-by: Kashyap Chamarty <kchamart@redhat.com>
Reported-by: Stefan Bader <stefan.bader@canonical.com>
Tested-by: Kashyap Chamarty <kchamart@redhat.com>
Tested-by: Anthoine Bourgeois <bourgeois@bertin.fr>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit a08d3b3b99 upstream.
The problem occurs when the guest performs a pusha with the stack
address pointing to an mmio address (or an invalid guest physical
address) to start with, but then extending into an ordinary guest
physical address. When doing repeated emulated pushes
emulator_read_write sets mmio_needed to 1 on the first one. On a
later push when the stack points to regular memory,
mmio_nr_fragments is set to 0, but mmio_is_needed is not set to 0.
As a result, KVM exits to userspace, and then returns to
complete_emulated_mmio. In complete_emulated_mmio
vcpu->mmio_cur_fragment is incremented. The termination condition of
vcpu->mmio_cur_fragment == vcpu->mmio_nr_fragments is never achieved.
The code bounces back and fourth to userspace incrementing
mmio_cur_fragment past it's buffer. If the guest does nothing else it
eventually leads to a a crash on a memcpy from invalid memory address.
However if a guest code can cause the vm to be destroyed in another
vcpu with excellent timing, then kvm_clear_async_pf_completion_queue
can be used by the guest to control the data that's pointed to by the
call to cancel_work_item, which can be used to gain execution.
Fixes: f78146b0f9
Signed-off-by: Andrew Honig <ahonig@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 87fbb2ac60 upstream.
When the conversion was made to remove stop machine and use the breakpoint
logic instead, the modification of the function graph caller is still
done directly as though it was being done under stop machine.
As it is not converted via stop machine anymore, there is a possibility
that the code could be layed across cache lines and if another CPU is
accessing that function graph call when it is being updated, it could
cause a General Protection Fault.
Convert the update of the function graph caller to use the breakpoint
method as well.
Cc: H. Peter Anvin <hpa@zytor.com>
Fixes: 08d636b6d4 "ftrace/x86: Have arch x86_64 use breakpoints instead of stop machine"
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 4640c7ee9b upstream.
If CONFIG_X86_SMAP is disabled, smap_violation() tests for conditions
which are incorrect (as the AC flag doesn't matter), causing spurious
faults.
The dynamic disabling of SMAP (nosmap on the command line) is fine
because it disables X86_FEATURE_SMAP, therefore causing the
static_cpu_has() to return false.
Found by Fengguang Wu's test system.
[ v3: move all predicates into smap_violation() ]
[ v2: use IS_ENABLED() instead of #ifdef ]
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Link: http://lkml.kernel.org/r/20140213124550.GA30497@localhost
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 03bbd596ac upstream.
If SMAP support is not compiled into the kernel, don't enable SMAP in
CR4 -- in fact, we should clear it, because the kernel doesn't contain
the proper STAC/CLAC instructions for SMAP support.
Found by Fengguang Wu's test system.
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Link: http://lkml.kernel.org/r/20140213124550.GA30497@localhost
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit a9c8e4beee upstream.
Steven Noonan forwarded a users report where they had a problem starting
vsftpd on a Xen paravirtualized guest, with this in dmesg:
BUG: Bad page map in process vsftpd pte:8000000493b88165 pmd:e9cc01067
page:ffffea00124ee200 count:0 mapcount:-1 mapping: (null) index:0x0
page flags: 0x2ffc0000000014(referenced|dirty)
addr:00007f97eea74000 vm_flags:00100071 anon_vma:ffff880e98f80380 mapping: (null) index:7f97eea74
CPU: 4 PID: 587 Comm: vsftpd Not tainted 3.12.7-1-ec2 #1
Call Trace:
dump_stack+0x45/0x56
print_bad_pte+0x22e/0x250
unmap_single_vma+0x583/0x890
unmap_vmas+0x65/0x90
exit_mmap+0xc5/0x170
mmput+0x65/0x100
do_exit+0x393/0x9e0
do_group_exit+0xcc/0x140
SyS_exit_group+0x14/0x20
system_call_fastpath+0x1a/0x1f
Disabling lock debugging due to kernel taint
BUG: Bad rss-counter state mm:ffff880e9ca60580 idx:0 val:-1
BUG: Bad rss-counter state mm:ffff880e9ca60580 idx:1 val:1
The issue could not be reproduced under an HVM instance with the same
kernel, so it appears to be exclusive to paravirtual Xen guests. He
bisected the problem to commit 1667918b64 ("mm: numa: clear numa
hinting information on mprotect") that was also included in 3.12-stable.
The problem was related to how xen translates ptes because it was not
accounting for the _PAGE_NUMA bit. This patch splits pte_present to add
a pteval_present helper for use by xen so both bare metal and xen use
the same code when checking if a PTE is present.
[mgorman@suse.de: wrote changelog, proposed minor modifications]
[akpm@linux-foundation.org: fix typo in comment]
Reported-by: Steven Noonan <steven@uplinklabs.net>
Tested-by: Steven Noonan <steven@uplinklabs.net>
Signed-off-by: Elena Ufimtseva <ufimtseva@gmail.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit f98b7a772a upstream.
There was a large performance regression that was bisected to
commit 611ae8e3 ("x86/tlb: enable tlb flush range support for
x86"). This patch simply changes the default balance point
between a local and global flush for IvyBridge.
In the interest of allowing the tests to be reproduced, this
patch was tested using mmtests 0.15 with the following
configurations
configs/config-global-dhp__tlbflush-performance
configs/config-global-dhp__scheduler-performance
configs/config-global-dhp__network-performance
Results are from two machines
Ivybridge 4 threads: Intel(R) Core(TM) i3-3240 CPU @ 3.40GHz
Ivybridge 8 threads: Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz
Page fault microbenchmark showed nothing interesting.
Ebizzy was configured to run multiple iterations and threads.
Thread counts ranged from 1 to NR_CPUS*2. For each thread count,
it ran 100 iterations and each iteration lasted 10 seconds.
Ivybridge 4 threads
3.13.0-rc7 3.13.0-rc7
vanilla altshift-v3
Mean 1 6395.44 ( 0.00%) 6789.09 ( 6.16%)
Mean 2 7012.85 ( 0.00%) 8052.16 ( 14.82%)
Mean 3 6403.04 ( 0.00%) 6973.74 ( 8.91%)
Mean 4 6135.32 ( 0.00%) 6582.33 ( 7.29%)
Mean 5 6095.69 ( 0.00%) 6526.68 ( 7.07%)
Mean 6 6114.33 ( 0.00%) 6416.64 ( 4.94%)
Mean 7 6085.10 ( 0.00%) 6448.51 ( 5.97%)
Mean 8 6120.62 ( 0.00%) 6462.97 ( 5.59%)
Ivybridge 8 threads
3.13.0-rc7 3.13.0-rc7
vanilla altshift-v3
Mean 1 7336.65 ( 0.00%) 7787.02 ( 6.14%)
Mean 2 8218.41 ( 0.00%) 9484.13 ( 15.40%)
Mean 3 7973.62 ( 0.00%) 8922.01 ( 11.89%)
Mean 4 7798.33 ( 0.00%) 8567.03 ( 9.86%)
Mean 5 7158.72 ( 0.00%) 8214.23 ( 14.74%)
Mean 6 6852.27 ( 0.00%) 7952.45 ( 16.06%)
Mean 7 6774.65 ( 0.00%) 7536.35 ( 11.24%)
Mean 8 6510.50 ( 0.00%) 6894.05 ( 5.89%)
Mean 12 6182.90 ( 0.00%) 6661.29 ( 7.74%)
Mean 16 6100.09 ( 0.00%) 6608.69 ( 8.34%)
Ebizzy hits the worst case scenario for TLB range flushing every
time and it shows for these Ivybridge CPUs at least that the
default choice is a poor on. The patch addresses the problem.
Next was a tlbflush microbenchmark written by Alex Shi at
http://marc.info/?l=linux-kernel&m=133727348217113 . It
measures access costs while the TLB is being flushed. The
expectation is that if there are always full TLB flushes that
the benchmark would suffer and it benefits from range flushing
There are 320 iterations of the test per thread count. The
number of entries is randomly selected with a min of 1 and max
of 512. To ensure a reasonably even spread of entries, the full
range is broken up into 8 sections and a random number selected
within that section.
iteration 1, random number between 0-64
iteration 2, random number between 64-128 etc
This is still a very weak methodology. When you do not know
what are typical ranges, random is a reasonable choice but it
can be easily argued that the opimisation was for smaller ranges
and an even spread is not representative of any workload that
matters. To improve this, we'd need to know the probability
distribution of TLB flush range sizes for a set of workloads
that are considered "common", build a synthetic trace and feed
that into this benchmark. Even that is not perfect because it
would not account for the time between flushes but there are
limits of what can be reasonably done and still be doing
something useful. If a representative synthetic trace is
provided then this benchmark could be revisited and the shift values retuned.
Ivybridge 4 threads
3.13.0-rc7 3.13.0-rc7
vanilla altshift-v3
Mean 1 10.50 ( 0.00%) 10.50 ( 0.03%)
Mean 2 17.59 ( 0.00%) 17.18 ( 2.34%)
Mean 3 22.98 ( 0.00%) 21.74 ( 5.41%)
Mean 5 47.13 ( 0.00%) 46.23 ( 1.92%)
Mean 8 43.30 ( 0.00%) 42.56 ( 1.72%)
Ivybridge 8 threads
3.13.0-rc7 3.13.0-rc7
vanilla altshift-v3
Mean 1 9.45 ( 0.00%) 9.36 ( 0.93%)
Mean 2 9.37 ( 0.00%) 9.70 ( -3.54%)
Mean 3 9.36 ( 0.00%) 9.29 ( 0.70%)
Mean 5 14.49 ( 0.00%) 15.04 ( -3.75%)
Mean 8 41.08 ( 0.00%) 38.73 ( 5.71%)
Mean 13 32.04 ( 0.00%) 31.24 ( 2.49%)
Mean 16 40.05 ( 0.00%) 39.04 ( 2.51%)
For both CPUs, average access time is reduced which is good as
this is the benchmark that was used to tune the shift values in
the first place albeit it is now known *how* the benchmark was
used.
The scheduler benchmarks were somewhat inconclusive. They
showed gains and losses and makes me reconsider how stable those
benchmarks really are or if something else might be interfering
with the test results recently.
Network benchmarks were inconclusive. Almost all results were
flat except for netperf-udp tests on the 4 thread machine.
These results were unstable and showed large variations between
reboots. It is unknown if this is a recent problems but I've
noticed before that netperf-udp results tend to vary.
Based on these results, changing the default for Ivybridge seems
like a logical choice.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Tested-by: Davidlohr Bueso <davidlohr@hp.com>
Reviewed-by: Alex Shi <alex.shi@linaro.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/n/tip-cqnadffh1tiqrshthRj3Esge@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 24f91eba18 upstream.
The SOFT_DIRTY bit shows that the content of memory was changed after a
defined point in the past. mprotect() doesn't change the content of
memory, so it must not change the SOFT_DIRTY bit.
This bug causes a malfunction: on the first iteration all pages are
dumped. On other iterations only pages with the SOFT_DIRTY bit are
dumped. So if the SOFT_DIRTY bit is cleared from a page by mistake, the
page is not dumped and its content will be restored incorrectly.
This patch does nothing with _PAGE_SWP_SOFT_DIRTY, becase pte_modify()
is called only for present pages.
Fixes commit 0f8975ec4d ("mm: soft-dirty bits for user memory changes
tracking").
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 51c71a3bba upstream.
The user has the option of disabling the platform driver:
00:02.0 Unassigned class [ff80]: XenSource, Inc. Xen Platform Device (rev 01)
which is used to unplug the emulated drivers (IDE, Realtek 8169, etc)
and allow the PV drivers to take over. If the user wishes
to disable that they can set:
xen_platform_pci=0
(in the guest config file)
or
xen_emul_unplug=never
(on the Linux command line)
except it does not work properly. The PV drivers still try to
load and since the Xen platform driver is not run - and it
has not initialized the grant tables, most of the PV drivers
stumble upon:
input: Xen Virtual Keyboard as /devices/virtual/input/input5
input: Xen Virtual Pointer as /devices/virtual/input/input6M
------------[ cut here ]------------
kernel BUG at /home/konrad/ssd/konrad/linux/drivers/xen/grant-table.c:1206!
invalid opcode: 0000 [#1] SMP
Modules linked in: xen_kbdfront(+) xenfs xen_privcmd
CPU: 6 PID: 1389 Comm: modprobe Not tainted 3.13.0-rc1upstream-00021-ga6c892b-dirty #1
Hardware name: Xen HVM domU, BIOS 4.4-unstable 11/26/2013
RIP: 0010:[<ffffffff813ddc40>] [<ffffffff813ddc40>] get_free_entries+0x2e0/0x300
Call Trace:
[<ffffffff8150d9a3>] ? evdev_connect+0x1e3/0x240
[<ffffffff813ddd0e>] gnttab_grant_foreign_access+0x2e/0x70
[<ffffffffa0010081>] xenkbd_connect_backend+0x41/0x290 [xen_kbdfront]
[<ffffffffa0010a12>] xenkbd_probe+0x2f2/0x324 [xen_kbdfront]
[<ffffffff813e5757>] xenbus_dev_probe+0x77/0x130
[<ffffffff813e7217>] xenbus_frontend_dev_probe+0x47/0x50
[<ffffffff8145e9a9>] driver_probe_device+0x89/0x230
[<ffffffff8145ebeb>] __driver_attach+0x9b/0xa0
[<ffffffff8145eb50>] ? driver_probe_device+0x230/0x230
[<ffffffff8145eb50>] ? driver_probe_device+0x230/0x230
[<ffffffff8145cf1c>] bus_for_each_dev+0x8c/0xb0
[<ffffffff8145e7d9>] driver_attach+0x19/0x20
[<ffffffff8145e260>] bus_add_driver+0x1a0/0x220
[<ffffffff8145f1ff>] driver_register+0x5f/0xf0
[<ffffffff813e55c5>] xenbus_register_driver_common+0x15/0x20
[<ffffffff813e76b3>] xenbus_register_frontend+0x23/0x40
[<ffffffffa0015000>] ? 0xffffffffa0014fff
[<ffffffffa001502b>] xenkbd_init+0x2b/0x1000 [xen_kbdfront]
[<ffffffff81002049>] do_one_initcall+0x49/0x170
.. snip..
which is hardly nice. This patch fixes this by having each
PV driver check for:
- if running in PV, then it is fine to execute (as that is their
native environment).
- if running in HVM, check if user wanted 'xen_emul_unplug=never',
in which case bail out and don't load any PV drivers.
- if running in HVM, and if PCI device 5853:0001 (xen_platform_pci)
does not exist, then bail out and not load PV drivers.
- (v2) if running in HVM, and if the user wanted 'xen_emul_unplug=ide-disks',
then bail out for all PV devices _except_ the block one.
Ditto for the network one ('nics').
- (v2) if running in HVM, and if the user wanted 'xen_emul_unplug=unnecessary'
then load block PV driver, and also setup the legacy IDE paths.
In (v3) make it actually load PV drivers.
Reported-by: Sander Eikelenboom <linux@eikelenboom.it
Reported-by: Anthony PERARD <anthony.perard@citrix.com>
Reported-and-Tested-by: Fabio Fantoni <fabio.fantoni@m2r.biz>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
[v2: Add extra logic to handle the myrid ways 'xen_emul_unplug'
can be used per Ian and Stefano suggestion]
[v3: Make the unnecessary case work properly]
[v4: s/disks/ide-disks/ spotted by Fabio]
Reviewed-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com> [for PCI parts]
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 3b56496865 upstream.
This adds the workaround for erratum 793 as a precaution in case not
every BIOS implements it. This addresses CVE-2013-6885.
Erratum text:
[Revision Guide for AMD Family 16h Models 00h-0Fh Processors,
document 51810 Rev. 3.04 November 2013]
793 Specific Combination of Writes to Write Combined Memory Types and
Locked Instructions May Cause Core Hang
Description
Under a highly specific and detailed set of internal timing
conditions, a locked instruction may trigger a timing sequence whereby
the write to a write combined memory type is not flushed, causing the
locked instruction to stall indefinitely.
Potential Effect on System
Processor core hang.
Suggested Workaround
BIOS should set MSR
C001_1020[15] = 1b.
Fix Planned
No fix planned
[ hpa: updated description, fixed typo in MSR name ]
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: http://lkml.kernel.org/r/20140114230711.GS29865@pd.tnic
Tested-by: Aravind Gopalakrishnan <aravind.gopalakrishnan@amd.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 77f01bdfa5 upstream.
When Hyper-V hypervisor leaves are present, KVM must relocate
its own leaves at 0x40000100, because Windows does not look for
Hyper-V leaves at indices other than 0x40000000. In this case,
the KVM features are at 0x40000101, but the old code would always
look at 0x40000001.
Fix by using kvm_cpuid_base(). This also requires making the
function non-inline, since kvm_cpuid_base() is static.
Fixes: 1085ba7f55
Cc: mtosatti@redhat.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 1c300a4077 upstream.
It is unnecessary to go through hypervisor_cpuid_base every time
a leaf is found (which will be every time a feature is requested
after the next patch).
Fixes: 1085ba7f55
Cc: mtosatti@redhat.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit a7f84f03f6 upstream.
Current code check boot service region with kernel text region by:
start+size >= __pa_symbol(_text)
The end of the above region should be start + size - 1 instead.
I see this problem in ovmf + Fedora 19 grub boot:
text start: 1000000 md start: 800000 md size: 800000
Signed-off-by: Dave Young <dyoung@redhat.com>
Acked-by: Borislav Petkov <bp@suse.de>
Acked-by: Toshi Kani <toshi.kani@hp.com>
Tested-by: Toshi Kani <toshi.kani@hp.com>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Pull perf fixes from Ingo Molnar:
- an s2ram related fix on AMD systems
- a perf fault handling bug that is relatively old but which has become
much easier to trigger in v3.13 after commit e00b12e64b ("perf/x86:
Further optimize copy_from_user_nmi()")
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/amd/ibs: Fix waking up from S3 for AMD family 10h
x86, mm, perf: Allow recursive faults from interrupts
Pull networking fixes from David Miller:
1) The value choosen for the new SO_MAX_PACING_RATE socket option on
parisc was very poorly choosen, let's fix it while we still can.
From Eric Dumazet.
2) Our generic reciprocal divide was found to handle some edge cases
incorrectly, part of this is encoded into the BPF as deep as the JIT
engines themselves. Just use a real divide throughout for now.
From Eric Dumazet.
3) Because the initial lookup is lockless, the TCP metrics engine can
end up creating two entries for the same lookup key. Fix this by
doing a second lookup under the lock before we actually create the
new entry. From Christoph Paasch.
4) Fix scatter-gather list init in usbnet driver, from Bjørn Mork.
5) Fix unintended 32-bit truncation in cxgb4 driver's bit shifting.
From Dan Carpenter.
6) Netlink socket dumping uses the wrong socket state for timewait
sockets. Fix from Neal Cardwell.
7) Fix netlink memory leak in ieee802154_add_iface(), from Christian
Engelmayer.
8) Multicast forwarding in ipv4 can overflow the per-rule reference
counts, causing all multicast traffic to cease. Fix from Hannes
Frederic Sowa.
9) via-rhine needs to stop all TX queues when it resets the device,
from Richard Weinberger.
10) Fix RDS per-cpu accesses broken by the this_cpu_* conversions. From
Gerald Schaefer.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
s390/bpf,jit: fix 32 bit divisions, use unsigned divide instructions
parisc: fix SO_MAX_PACING_RATE typo
ipv6: simplify detection of first operational link-local address on interface
tcp: metrics: Avoid duplicate entries with the same destination-IP
net: rds: fix per-cpu helper usage
e1000e: Fix compilation warning when !CONFIG_PM_SLEEP
bpf: do not use reciprocal divide
be2net: add dma_mapping_error() check for dma_map_page()
bnx2x: Don't release PCI bars on shutdown
net,via-rhine: Fix tx_timeout handling
batman-adv: fix batman-adv header overhead calculation
qlge: Fix vlan netdev features.
net: avoid reference counter overflows on fib_rules in multicast forwarding
dm9601: add USB IDs for new dm96xx variants
MAINTAINERS: add virtio-dev ML for virtio
ieee802154: Fix memory leak in ieee802154_add_iface()
net: usbnet: fix SG initialisation
inet_diag: fix inet_diag_dump_icsk() to use correct state for timewait sockets
cxgb4: silence shift wrapping static checker warning
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.14 (GNU/Linux)
iQIcBAABAgAGBQJS174dAAoJEBvWZb6bTYbywN0P/jiaK+ySw4YTxGggkRhi1WHy
KnELWnPTpA1duucioBS8KSZDXN0SSf4KLio/ArOJee/uU9loY4HWMwpKjBTZkEU4
lWI0WIagllxqBnxfFgdAVS6B/EUqie3qulZ8Hi72OoAEzUnLDsGAIIBzR5fdb1jg
xkmdtPB2Lg+c+Lsuhg0MJegoymoLhEF82k/rBUEAzpjsL5N9DUt4n2rVLbECkfzm
kLC31epho+PaSAmepAk5dbfh1H7ea4OV5HZeyyuIZqGVUgc95OU7VV3eIK+pI7Vu
C6PvHaOnF6VHvnMN4tFSDRvADsb9TnxQN9zYiC10opjzdTwUiA+Y+jbIjyRkAtNY
JfDRSIh3+w93K0sv5eICBp3cXcy584I+N5VC3EY0imv4DZNvlQfEWu1t9qSBhaAQ
DQnIRhY1BpfK4Uu9MVBu8lTeo0VuefupFZ3vqlEQJ8ht2jgnYdQQMOaPNuzOP7NS
WbOoDWGVfpfWN09mCV9OJ9WKjxO0nUeS4/xBqWIhrQKQY3oSvdjxIbqw7E3lj4Sp
0uwzGqbwcJanYR0IYNmBfuyRtwk8SqgvIGGLtG4cnlAAXuwaG/v/+FpVCTCstB9n
Ljxp9Y8dqAyTjrdw4Irp2rsIhbJpW6Ob4zcVeAfcNWjut+DBhbWl6VSe0k48A06h
gO+vkx+qAg7FiF/vnRVN
=ifJc
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM fix from Paolo Bonzini:
"Fix for a brown paper bag bug. Thanks to Drew Jones for noticing"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
kvm: x86: fix apic_base enable check
On AMD family 10h we see following error messages while waking up from
S3 for all non-boot CPUs leading to a failed IBS initialization:
Enabling non-boot CPUs ...
smpboot: Booting Node 0 Processor 1 APIC 0x1
[Firmware Bug]: cpu 1, try to use APIC500 (LVT offset 0) for vector 0x400, but the register is already in use for vector 0xf9 on another cpu
perf: IBS APIC setup failed on cpu #1
process: Switch to broadcast mode on CPU1
CPU1 is up
...
ACPI: Waking up from system sleep state S3
Reason for this is that during suspend the LVT offset for the IBS
vector gets lost and needs to be reinialized while resuming.
The offset is read from the IBSCTL msr. On family 10h the offset needs
to be 1 as offset 0 is used for the MCE threshold interrupt, but
firmware assings it for IBS to 0 too. The kernel needs to reprogram
the vector. The msr is a readonly node msr, but a new value can be
written via pci config space access. The reinitialization is
implemented for family 10h in setup_ibs_ctl() which is forced during
IBS setup.
This patch fixes IBS setup after waking up from S3 by adding
resume/supend hooks for the boot cpu which does the offset
reinitialization.
Marking it as stable to let distros pick up this fix.
Signed-off-by: Robert Richter <rric@kernel.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: <stable@vger.kernel.org> v3.2..
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1389797849-5565-1-git-send-email-rric.net@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Waiman managed to trigger a PMI while in a emulate_vsyscall() fault,
the PMI in turn managed to trigger a fault while obtaining a stack
trace. This triggered the sig_on_uaccess_error recursive fault logic
and killed the process dead.
Fix this by explicitly excluding interrupts from the recursive fault
logic.
Reported-and-Tested-by: Waiman Long <waiman.long@hp.com>
Fixes: e00b12e64b ("perf/x86: Further optimize copy_from_user_nmi()")
Cc: Aswin Chandramouleeswaran <aswin@hp.com>
Cc: Scott J Norton <scott.norton@hp.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20140110200603.GJ7572@laptop.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull locking fixes from Ingo Molnar:
"Two fixes from lockdep coverage of seqlocks, which fix deadlocks on
lockdep-enabled ARM systems"
* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched_clock: Disable seqlock lockdep usage in sched_clock()
seqlock: Use raw_ prefix instead of _no_lockdep
At first Jakub Zawadzki noticed that some divisions by reciprocal_divide
were not correct. (off by one in some cases)
http://www.wireshark.org/~darkjames/reciprocal-buggy.c
He could also show this with BPF:
http://www.wireshark.org/~darkjames/set-and-dump-filter-k-bug.c
The reciprocal divide in linux kernel is not generic enough,
lets remove its use in BPF, as it is not worth the pain with
current cpus.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Jakub Zawadzki <darkjames-ws@darkjames.pl>
Cc: Mircea Gherzan <mgherzan@gmail.com>
Cc: Daniel Borkmann <dxchgb@gmail.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Matt Evans <matt@ozlabs.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit e66d2ae7c6 moved the assignment
vcpu->arch.apic_base = value above a condition with
(vcpu->arch.apic_base ^ value), causing that check
to always fail. Use old_value, vcpu->arch.apic_base's
old value, in the condition instead.
Signed-off-by: Andrew Jones <drjones@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Linus disliked the _no_lockdep() naming, so instead
use the more-consistent raw_* prefix to the non-lockdep
enabled seqcount methods.
This also adds raw_ methods for the write operations
as well, which will be utilized in a following patch.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Stephen Boyd <sboyd@codeaurora.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Krzysztof Hałasa <khalasa@piap.pl>
Cc: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Cc: Willy Tarreau <w@1wt.eu>
Link: http://lkml.kernel.org/r/1388704274-5278-1-git-send-email-john.stultz@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Before we do an EMMS in the AMD FXSAVE information leak workaround we
need to clear any pending exceptions, otherwise we trap with a
floating-point exception inside this code.
Reported-by: halfdog <me@halfdog.net>
Tested-by: Borislav Petkov <bp@suse.de>
Link: http://lkml.kernel.org/r/CA%2B55aFxQnY_PCG_n4=0w-VG=YLXL-yr7oMxyy0WU2gCBAf3ydg@mail.gmail.com
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Function tracing callbacks expect to have the ftrace_ops that registered it
passed to them, not the address of the variable that holds the ftrace_ops
that registered it.
Use a mov instead of a lea to store the ftrace_ops into the parameter
of the function tracing callback.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Link: http://lkml.kernel.org/r/20131113152004.459787f9@gandalf.local.home
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: <stable@vger.kernel.org> # v3.8+
Three reasons for doing this: 1. arch.walk_mmu points to arch.mmu anyway
in case nested EPT wasn't in use. 2. this aligns VMX with SVM. But 3. is
most important: nested_cpu_has_ept(vmcs12) queries the VMCS page, and if
one guest VCPU manipulates the page of another VCPU in L2, we may be
fooled to skip over the nested_ept_uninit_mmu_context, leaving mmu in
nested state. That can crash the host later on if nested_ept_get_cr3 is
invoked while L1 already left vmxon and nested.current_vmcs12 became
NULL therefore.
Cc: stable@kernel.org
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Update arch.apic_base before triggering recalculate_apic_map. Otherwise
the recalculation will work against the previous state of the APIC and
will fail to build the correct map when an APIC is hardware-enabled
again.
This fixes a regression of 1e08ec4a13.
Cc: stable@vger.kernel.org
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Pull x86 fixes from Peter Anvin:
"There is a small EFI fix and a big power regression fix in this batch.
My queue also had a fix for downing a CPU when there are insufficient
number of IRQ vectors available, but I'm holding that one for now due
to recent bug reports"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/efi: Don't select EFI from certain special ACPI drivers
x86 idle: Repair large-server 50-watt idle-power regression
Linux 3.10 changed the timing of how thread_info->flags is touched:
x86: Use generic idle loop
(7d1a941731)
This caused Intel NHM-EX and WSM-EX servers to experience a large number
of immediate MONITOR/MWAIT break wakeups, which caused cpuidle to demote
from deep C-states to shallow C-states, which caused these platforms
to experience a significant increase in idle power.
Note that this issue was already present before the commit above,
however, it wasn't seen often enough to be noticed in power measurements.
Here we extend an errata workaround from the Core2 EX "Dunnington"
to extend to NHM-EX and WSM-EX, to prevent these immediate
returns from MWAIT, reducing idle power on these platforms.
While only acpi_idle ran on Dunnington, intel_idle
may also run on these two newer systems.
As of today, there are no other models that are known
to need this tweak.
Link: http://lkml.kernel.org/r/CAJvTdK=%2BaNN66mYpCGgbHGCHhYQAKx-vB0kJSWjVpsNb_hOAtQ@mail.gmail.com
Signed-off-by: Len Brown <len.brown@intel.com>
Link: http://lkml.kernel.org/r/baff264285f6e585df757d58b17788feabc68918.1387403066.git.len.brown@intel.com
Cc: <stable@vger.kernel.org> # 3.12.x, 3.11.x, 3.10.x
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
There are a few subtle races, between change_protection_range (used by
mprotect and change_prot_numa) on one side, and NUMA page migration and
compaction on the other side.
The basic race is that there is a time window between when the PTE gets
made non-present (PROT_NONE or NUMA), and the TLB is flushed.
During that time, a CPU may continue writing to the page.
This is fine most of the time, however compaction or the NUMA migration
code may come in, and migrate the page away.
When that happens, the CPU may continue writing, through the cached
translation, to what is no longer the current memory location of the
process.
This only affects x86, which has a somewhat optimistic pte_accessible.
All other architectures appear to be safe, and will either always flush,
or flush whenever there is a valid mapping, even with no permissions
(SPARC).
The basic race looks like this:
CPU A CPU B CPU C
load TLB entry
make entry PTE/PMD_NUMA
fault on entry
read/write old page
start migrating page
change PTE/PMD to new page
read/write old page [*]
flush TLB
reload TLB from new entry
read/write new page
lose data
[*] the old page may belong to a new user at this point!
The obvious fix is to flush remote TLB entries, by making sure that
pte_accessible aware of the fact that PROT_NONE and PROT_NUMA memory may
still be accessible if there is a TLB flush pending for the mm.
This should fix both NUMA migration and compaction.
[mgorman@suse.de: fix build]
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Alex Thorlton <athorlton@sgi.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Base pages are unmapped and flushed from cache and TLB during normal
page migration and replaced with a migration entry that causes any
parallel NUMA hinting fault or gup to block until migration completes.
THP does not unmap pages due to a lack of support for migration entries
at a PMD level. This allows races with get_user_pages and
get_user_pages_fast which commit 3f926ab945 ("mm: Close races between
THP migration and PMD numa clearing") made worse by introducing a
pmd_clear_flush().
This patch forces get_user_page (fast and normal) on a pmd_numa page to
go through the slow get_user_page path where it will serialise against
THP migration and properly account for the NUMA hinting fault. On the
migration side the page table lock is taken for each PTE update.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Alex Thorlton <athorlton@sgi.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull scheduler fixes from Ingo Molnar:
"Three fixes for scheduler crashes, each triggers in relatively rare,
hardware environment dependent situations"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/fair: Rework sched_fair time accounting
math64: Add mul_u64_u32_shr()
sched: Remove PREEMPT_NEED_RESCHED from generic code
sched: Initialize power_orig for overlapping groups
Pull x86 fixes from Peter Anvin:
"This is a pretty small batch:
The biggest single change is to stop using EFI time services on 32-bit
platforms. This matches our current behavior on 64-bit platforms as
we already had ruled them out there as being too unreliable. Turns
out that affects 32-bit platforms, too.
One NULL pointer fix for SGI UV.
Two minor build fixes, one of which only affects icc and the other
which affects icc and future versions or nonstandard default settings
of gcc"
* 'x86/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, efi: Don't use (U)EFI time services on 32 bit
x86, build, icc: Remove uninitialized_var() from compiler-intel.h
x86/UV: Fix NULL pointer dereference in uv_flush_tlb_others() if the 'nobau' boot option is used
x86, build: Pass in additional -mno-mmx, -mno-sse options
from Google for reporting them.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIcBAABAgAGBQJSqi7GAAoJEBvWZb6bTYbyy94P/jdBo/J+4zxujJNDfw9D15xP
81/ByzZ1qxAZhrKKCqlOMWEYIOhEV6sjoJayMMIPkV0i9aYfOl3N4OUTGx8xuDhl
eIIQDRQdnFmqi69R2inBTxFYb8uGngsJwGF0iuiIImg/gJvoIAfywFADFPPUbtRP
BQQ69IHSCR/rblGVW3hyio7Y/dFtE4dqNYKTH7pamkSVdCz4j3FdVPz+COcXMsc+
wOhphbe0zRnrq8MmwsqMXKefSJtihD34wx+M85tiltGKXx4Jumi3eQcfFTnMCbH1
loA6fGLztXuyul5kpkaLdvoYgvxZDueZ7pO0OO1Wqh60T6OyDRqc/jKohdbzI/g3
/2OCZ7P8yHgxJb1tLAZBr3aWwCQtRhlF8O6eP+bBPQo8Di5Z6xYHDVggvLpHCE7f
KRQy1V1ooXbZ1UoytqA0QauCXURUb1jC+tzuZvZzcJN6oFojY8ojL1oVLlW0iDt6
WYzS6YAmIo5jeJ2qvP42dLG8n4kijkQ1gQgBsI8rfsDOYGXJe8TWu7O2aD1rs8Jz
d7aPgL+zz8K7wwZgG+U2PTjzkDOuyjRbhNEi7jrCVio6hxvvdQARiLsi+0Q+QUjF
Xk0iiSsseCBcFWj6sDnTPn10YnnXyIj6eDM1OImdd+/2VVqnUIiqzwpUsNr3yzVc
a+bZbYEsCUP0MwqlmCcA
=auYv
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
"Four security fixes for KVM on x86. Thanks to Andrew Honig and Lars
Bull from Google for reporting them"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: x86: fix guest-initiated crash with x2apic (CVE-2013-6376)
KVM: x86: Convert vapic synchronization to _cached functions (CVE-2013-6368)
KVM: x86: Fix potential divide by 0 in lapic (CVE-2013-6367)
KVM: Improve create VCPU parameter (CVE-2013-4587)
A guest can cause a BUG_ON() leading to a host kernel crash.
When the guest writes to the ICR to request an IPI, while in x2apic
mode the following things happen, the destination is read from
ICR2, which is a register that the guest can control.
kvm_irq_delivery_to_apic_fast uses the high 16 bits of ICR2 as the
cluster id. A BUG_ON is triggered, which is a protection against
accessing map->logical_map with an out-of-bounds access and manages
to avoid that anything really unsafe occurs.
The logic in the code is correct from real HW point of view. The problem
is that KVM supports only one cluster with ID 0 in clustered mode, but
the code that has the bug does not take this into account.
Reported-by: Lars Bull <larsbull@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In kvm_lapic_sync_from_vapic and kvm_lapic_sync_to_vapic there is the
potential to corrupt kernel memory if userspace provides an address that
is at the end of a page. This patches concerts those functions to use
kvm_write_guest_cached and kvm_read_guest_cached. It also checks the
vapic_address specified by userspace during ioctl processing and returns
an error to userspace if the address is not a valid GPA.
This is generally not guest triggerable, because the required write is
done by firmware that runs before the guest. Also, it only affects AMD
processors and oldish Intel that do not have the FlexPriority feature
(unless you disable FlexPriority, of course; then newer processors are
also affected).
Fixes: b93463aa59 ('KVM: Accelerated apic support')
Reported-by: Andrew Honig <ahonig@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Andrew Honig <ahonig@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Under guest controllable circumstances apic_get_tmcct will execute a
divide by zero and cause a crash. If the guest cpuid support
tsc deadline timers and performs the following sequence of requests
the host will crash.
- Set the mode to periodic
- Set the TMICT to 0
- Set the mode bits to 11 (neither periodic, nor one shot, nor tsc deadline)
- Set the TMICT to non-zero.
Then the lapic_timer.period will be 0, but the TMICT will not be. If the
guest then reads from the TMCCT then the host will perform a divide by 0.
This patch ensures that if the lapic_timer.period is 0, then the division
does not occur.
Reported-by: Andrew Honig <ahonig@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Andrew Honig <ahonig@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Introduce mul_u64_u32_shr() as proposed by Andy a while back; it
allows using 64x64->128 muls on 64bit archs and recent GCC
which defines __SIZEOF_INT128__ and __int128.
(This new method will be used by the scheduler.)
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: fweisbec@gmail.com
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/n/tip-hxjoeuzmrcaumR0uZwjpe2pv@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
While hunting a preemption issue with Alexander, Ben noticed that the
currently generic PREEMPT_NEED_RESCHED stuff is horribly broken for
load-store architectures.
We currently rely on the IPI to fold TIF_NEED_RESCHED into
PREEMPT_NEED_RESCHED, but when this IPI lands while we already have
a load for the preempt-count but before the store, the store will erase
the PREEMPT_NEED_RESCHED change.
The current preempt-count only works on load-store archs because
interrupts are assumed to be completely balanced wrt their preempt_count
fiddling; the previous preempt_count load will match the preempt_count
state after the interrupt and therefore nothing gets lost.
This patch removes the PREEMPT_NEED_RESCHED usage from generic code and
pushes it into x86 arch code; the generic code goes back to relying on
TIF_NEED_RESCHED.
Boot tested on x86_64 and compile tested on ppc64.
Reported-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Reported-and-Tested-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20131128132641.GP10022@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
UEFI time services are often broken once we're in virtual mode. We were
already refusing to use them on 64-bit systems, but it turns out that
they're also broken on some 32-bit firmware, including the Dell Venue.
Disable them for now, we can revisit once we have the 1:1 mappings code
incorporated.
Signed-off-by: Matthew Garrett <matthew.garrett@nebula.com>
Link: http://lkml.kernel.org/r/1385754283-2464-1-git-send-email-matthew.garrett@nebula.com
Cc: <stable@vger.kernel.org>
Cc: Matt Fleming <matt.fleming@intel.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
The SGI UV tlb shootdown code panics the system with a NULL
pointer deference if 'nobau' is specified on the boot
commandline.
uv_flush_tlb_other() gets called for every flush, whether the
BAU is disabled or not. It should not be keeping the s_enters
statistic while the BAU is disabled.
The panic occurs because during initialization
init_per_cpu_tunables() does not set the bcp->statp pointer if
'nobau' was specified.
Signed-off-by: Cliff Wickman <cpw@sgi.com>
Cc: <stable@vger.kernel.org> # 3.12.x
Link: http://lkml.kernel.org/r/E1VnzBi-0005yF-MU@eag09.americas.sgi.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>