x86/HVM: fix AMD ECS handling for Fam10

The involved comparison was, very likely inadvertently, converted from
>= to > when making changes unrelated to the actual family range.

Fixes: 9841eb71ea87 ("x86/cpuid: Drop a guests cached x86 family and model information")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Paul Durrant <paul@xen.org>
master commit: 5d515b1c296ebad6889748ea1e49e063453216a3
master date: 2020-04-01 12:28:30 +0200

x86/ucode/amd: Fix potential buffer overrun with equiv table handling

find_equiv_cpu_id() loops until it finds a 0 installed_cpu entry. Well formed
AMD microcode containers have this property.

Extend the checking in install_equiv_cpu_table() to reject tables which don't
have a sentinel at the end.
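
A minimal sketch of the kind of check described above, not the actual
install_equiv_cpu_table() code. The struct layout and the helper name
equiv_table_has_sentinel() are assumptions made for illustration; only the
"installed_cpu == 0 terminates the table" property comes from the text.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for an equivalence table entry (layout is assumed). */
struct equiv_entry {
    uint32_t installed_cpu;
    uint16_t equiv_cpu;
};

/*
 * Accept a table only if it is non-empty, a whole number of entries long,
 * and its final entry is the installed_cpu == 0 sentinel. A lookup loop
 * that stops at the first zero entry can then never run off the end.
 */
static bool equiv_table_has_sentinel(const struct equiv_entry *tbl,
                                     size_t bytes)
{
    size_t nr = bytes / sizeof(*tbl);

    if (tbl == NULL || nr == 0 || bytes % sizeof(*tbl) != 0)
        return false;

    return tbl[nr - 1].installed_cpu == 0;
}
```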

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 1f97b6b9f1b5978659c5735954c37c130e7bb151
master date: 2020-03-27 13:13:26 +0000

x86/ucode: Fix error paths in apply_microcode()

In the unlikely case that patch application completes, but the resulting
revision isn't expected, sig->rev doesn't get updated to match reality.

It will get adjusted the next time collect_cpu_info() gets called, but in the
meantime Xen might operate on a stale value. Nothing good will come of this.

Rewrite the logic to always update the stashed revision, before worrying about
whether the attempt was a success or failure.
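
The ordering described above can be sketched roughly as follows. This is an
illustration only: struct cpu_sig and record_apply_result() are hypothetical
names, and the real apply_microcode() reads the revision back from hardware
rather than taking it as a parameter.

```c
#include <stdint.h>

/* Hypothetical stand-in for the per-CPU signature record. */
struct cpu_sig {
    uint32_t rev;
};

/*
 * hw_rev:     revision read back after the load attempt
 * wanted_rev: revision carried by the patch that was applied
 *
 * The stashed revision is updated first, so it matches reality whether or
 * not the attempt is then judged a success.
 */
static int record_apply_result(struct cpu_sig *sig, uint32_t hw_rev,
                               uint32_t wanted_rev)
{
    sig->rev = hw_rev;

    return hw_rev == wanted_rev ? 0 : -1;
}
```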

Take the opportunity to make the printk() messages as consistent as possible.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Wei Liu <wl@xen.org>
master commit: d2a0a96cf76603b2e2b87c3ce80c3f9d098327d4
master date: 2020-03-26 18:57:45 +0000

x86/shim: fix ballooning up the guest

args.preempted is meaningless here as it doesn't signal whether the
hypercall was preempted before. Use start_extent instead which is
correct (as long as the hypercall was invoked in a "normal" way).

Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 76dbabb59eeaa78e9f57407e5b15a6606488333e
master date: 2020-03-18 12:55:54 +0100

x86/vPMU: don't blindly assume IA32_PERF_CAPABILITIES MSR exists

Just like VMX'es lbr_tsx_fixup_check() the respective CPUID bit should
be consulted first.
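
As a rough sketch of "consult the CPUID bit first": the PDCM feature flag
(CPUID leaf 1, ECX bit 15) enumerates IA32_PERF_CAPABILITIES. The helper name
below is hypothetical and the caller is assumed to supply the raw ECX value;
this is not the vPMU code itself.

```c
#include <stdbool.h>
#include <stdint.h>

/* CPUID.1:ECX.PDCM, which enumerates IA32_PERF_CAPABILITIES (MSR 0x345). */
#define CPUID1_ECX_PDCM (1u << 15)

/*
 * Returns true only when the MSR is enumerated. A caller would skip the
 * MSR read entirely (and treat the capabilities as 0) when this is false.
 */
static bool perf_capabilities_present(uint32_t cpuid1_ecx)
{
    return (cpuid1_ecx & CPUID1_ECX_PDCM) != 0;
}
```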

Reported-by: Farrah Chen <farrah.chen@intel.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 15c39c7c913f26fba40231e103ce1ffa6101e7c9
master date: 2020-02-26 17:35:48 +0100

AMD/IOMMU: fix off-by-one in amd_iommu_get_paging_mode() callers

amd_iommu_get_paging_mode() expects a count, not a "maximum possible"
value. Prior to b4f042236ae0 dropping the reference, the use of our
misnamed "max_page" in amd_iommu_domain_init() may have led to such a
misunderstanding. In an attempt to avoid such confusion in the future,
rename the function's parameter and - while at it - convert it to an
inline function.
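
To illustrate why a count and a "maximum possible" value differ here, a
simplified model of a level calculation follows. It assumes 512 entries per
table; paging_levels() is a hypothetical name and the sketch omits the error
handling a real implementation needs.

```c
#define PTE_PER_TABLE_SHIFT 9
#define PTE_PER_TABLE       (1UL << PTE_PER_TABLE_SHIFT)

/* Number of page table levels needed to cover nr_entries entries. */
static unsigned int paging_levels(unsigned long nr_entries)
{
    unsigned int level = 1;

    if (nr_entries == 0)
        return 0;

    while (nr_entries > PTE_PER_TABLE) {
        /* Round up to whole tables, then count tables at the next level. */
        nr_entries = (nr_entries + PTE_PER_TABLE - 1) >> PTE_PER_TABLE_SHIFT;
        level++;
    }

    return level;
}
```

With this model, 512 entries fit in one level, but a caller that passes the
highest index 512 (meaning 513 entries) gets one level where two are needed.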

Also replace a literal 4 by an expression tying it to a wider use
constant, just like amd_iommu_quarantine_init() does.

Fixes: ea38867831da ("x86 / iommu: set up a scratch page in the quarantine domain")
Fixes: b4f042236ae0 ("AMD/IOMMU: Cease using a dynamic height for the IOMMU pagetables")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: b75b3c62fe4afe381c6f74a07f614c0b39fe2f5d
master date: 2020-03-16 11:24:29 +0100

x86/msr: Virtualise MSR_PLATFORM_ID properly

This is an Intel-only, read-only MSR related to microcode loading. Expose it
in similar circumstances as the PATCHLEVEL MSR.

This should have been alongside c/s 013896cb8b2 "x86/msr: Fix handling of
MSR_AMD_PATCHLEVEL/MSR_IA32_UCODE_REV"

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 691265f96097d4fe3e46ff4267451d49b30143e6
master date: 2020-02-20 17:29:50 +0000

VT-d: check all of an RMRR for being E820-reserved

Checking just the first and last page is not sufficient (and redundant
for single-page regions). As we don't need to care about IA64 anymore,
use an x86-specific function to get this done without looping over each
individual page.
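
A sketch of a whole-range check, as opposed to testing only the first and
last page. The map layout and the name range_all_of_type() are assumptions
for illustration; Xen's own x86 helper is not reproduced here, and address
overflow is ignored for brevity.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum { MAP_RAM = 1, MAP_RESERVED = 2 };

struct map_entry {
    uint64_t addr, size;
    int type;
};

/* True if every byte of [start, end) lies in some entry of the given type. */
static bool range_all_of_type(const struct map_entry *map, size_t nr,
                              uint64_t start, uint64_t end, int type)
{
    uint64_t cur = start;

    while (cur < end) {
        bool advanced = false;

        for (size_t i = 0; i < nr; i++) {
            uint64_t lo = map[i].addr, hi = map[i].addr + map[i].size;

            if (map[i].type == type && lo <= cur && cur < hi) {
                cur = hi;      /* covered up to the end of this entry */
                advanced = true;
                break;
            }
        }

        if (!advanced)
            return false;      /* hole or entry of another type at cur */
    }

    return true;
}
```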

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
master commit: d6573bc6e6b7d95bb9de8471a6bfd7048ebc50f3
master date: 2020-02-18 16:21:19 +0100

x86/time: report correct frequency of Xen PV clocksource

The value of the counter represents the number of nanoseconds
since host boot. That means the correct frequency is always 1GHz.

This inconsistency caused time to go slower in PV shim on most
platforms.

Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: c52bd545de461127f3ca67c48e8fef7145402035
master date: 2020-02-14 18:01:52 +0000

x86/shim: suspend and resume platform time correctly

Similarly to S3, platform time needs to be saved on guest suspend
and restored on resume respectively. This should account for expected
jumps in PV clock counter value after resume. time_suspend/resume()
are safe to use in PVH setting as is since any existing operations
with PIT/HPET that they do would simply be ignored if PIT/HPET is
not present.

Additionally, add resume callback for Xen PV clocksource to avoid
its breakage on migration.

Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: a7a3ecd82e289a9a2ecc1d3b5128580e0b577cc7
master date: 2020-02-14 18:01:52 +0000

x86/smp: reset x2apic_enabled in smp_send_stop()

Just before smp_send_stop() re-enables interrupts when shutting down
for reboot or kexec, it calls __stop_this_cpu() which in turn calls
disable_local_APIC(), which puts the APIC back into the mode Xen found
it in at boot.

If that means turning x2APIC off and going back into xAPIC mode, then
a timer interrupt occurring just after interrupts come back on will
lead to a #GP when apic_timer_interrupt() attempts to ack the IRQ
through the EOI register in x2APIC MSR 0x80b:

  (XEN) Executing kexec image on cpu0
  (XEN) ----[ Xen-4.14-unstable  x86_64  debug=n   Not tainted ]----
  (XEN) CPU:    0
  (XEN) RIP:    e008:[<ffff82d08026c139>] apic_timer_interrupt+0x29/0x40
  (XEN) RFLAGS: 0000000000010046   CONTEXT: hypervisor
  (XEN) rax: 0000000000000000   rbx: 00000000000000fa   rcx: 000000000000080b
  ...
  (XEN) Xen code around <ffff82d08026c139> (apic_timer_interrupt+0x29/0x40):
  (XEN)  c0 b9 0b 08 00 00 89 c2 <0f> 30 31 ff e9 0e c9 fb ff 0f 1f 40 00 66 2e 0f
  ...
  (XEN) Xen call trace:
  (XEN)    [<ffff82d08026c139>] R apic_timer_interrupt+0x29/0x40
  (XEN)    [<ffff82d080283825>] S do_IRQ+0x95/0x750
  ...
  (XEN)    [<ffff82d0802a0ad2>] S smp_send_stop+0x42/0xd0

We can't clear the global x2apic_enabled variable in disable_local_APIC()
itself because that runs on each CPU. Instead, correct it (by using
current_local_apic_mode()) in smp_send_stop() while interrupts are still
disabled immediately after calling __stop_this_cpu() for the boot CPU,
after all other CPUs have been stopped.

cf: d639bdd9bbe ("x86/apic: Disable the LAPIC later in smp_send_stop()")
    ... which didn't quite fix it completely.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 8b1002ab037aeacdece7723c07ab35ca16c1e22e
master date: 2020-02-14 18:01:52 +0000

xen/pvh: Fix segment selector ABI

The written ABI states that %es will be set up, but libxc doesn't do so. In
practice, it breaks `rep movs` inside guests before they reload %es.

The written ABI doesn't mention %ss, but libxc does set it up. Having %ds
different to %ss is obnoxious to work with, as different registers have
different implicit segments.

Modify the spec to state that %ss is set up as a flat read/write segment.
This a) matches the Multiboot 1 spec, b) matches what is set up in practice,
and c) is the more sane behaviour for guests to use.

Fixes: 68e1183411b ('libxc: introduce a xc_dom_arch for hvm-3.0-x86_32 guests')
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Wei Liu <wl@xen.org>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
x86/pvh: Adjust dom0's starting state

Fixes: b25fb1a04e "xen/pvh: Fix segment selector ABI"
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Wei Liu <wl@xen.org>
Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: b25fb1a04e99cc03359eade1affb56ef0eee766f
master date: 2020-02-10 15:26:09 +0000
master commit: 6ee10313623c1f41fc72fe12372e176e744463c1
master date: 2020-02-11 11:04:26 +0000

xmalloc: guard against integer overflow

There are hypercall handling paths (EFI ones are what this was found
with) needing to allocate buffers of a caller specified size. This is
generally fine, as our page allocator enforces an upper bound on all
allocations. However, certain extremely large sizes could, when adding
in allocator overhead, result in an apparently tiny allocation size,
which would typically result in either a successful allocation, but a
severe buffer overrun when using that memory block, or in a crash right
in the allocator code.
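
The failure mode is plain unsigned wrap-around: size plus overhead can wrap
to a small number. A minimal guard might look like the following; the helper
name and the idea of passing the overhead as a parameter are assumptions, not
the xmalloc code.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Compute size + overhead into *total, refusing requests where the
 * addition would wrap around SIZE_MAX.
 */
static bool alloc_total_size(size_t size, size_t overhead, size_t *total)
{
    if (size > SIZE_MAX - overhead)
        return false;          /* would overflow: reject the request */

    *total = size + overhead;
    return true;
}
```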

Reported-by: Ilja Van Sprundel <ivansprundel@ioactive.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
master commit: cf38b4926e2b55d1d7715cff5095a7444f5ed42d
master date: 2020-02-06 09:53:12 +0100

EFI: don't leak heap contents through XEN_EFI_get_next_variable_name

Commit 1f4eb9d27d0e ("EFI: fix getting EFI variable list on some
systems") switched to using the caller provided size for the copy-out
without making sure the copied buffer is properly scrubbed.

Reported-by: Ilja Van Sprundel <ivansprundel@ioactive.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
master commit: 4783ee894f6bfb0f4deec9f1fe8e7faceafaa1a2
master date: 2020-02-06 09:52:33 +0100

EFI: re-check {get,set}-variable name strings after copying in

A malicious guest given permission to invoke XENPF_efi_runtime_call may
play with the strings underneath Xen sizing them and copying them in.
Guard against this by re-checking the copied in data for consistency
with the initial sizing. At the same time also check that the actual
copy-in is in fact successful, and switch to the lighter weight non-
checking flavor of the function.
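
The re-check can be pictured as follows. This sketch uses plain char strings
instead of the EFI 16-bit characters, and copied_name_consistent() is a
hypothetical helper: it only shows the idea that the private copy must still
have exactly the length found when the guest buffer was first sized.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * copy:      private copy of sized_len + 1 bytes taken from guest memory
 * sized_len: string length measured before the copy was made
 *
 * The first NUL in the copy must sit exactly at index sized_len; anything
 * else means the guest changed the string between sizing and copying.
 */
static bool copied_name_consistent(const char *copy, size_t sized_len)
{
    const char *nul = memchr(copy, '\0', sized_len + 1);

    return nul != NULL && (size_t)(nul - copy) == sized_len;
}
```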

Reported-by: Ilja Van Sprundel <ivansprundel@ioactive.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
master commit: ad38db5852f0e30d90c93c6a62b754f2861549e0
master date: 2020-02-06 09:51:17 +0100

xen/x86: domctl: Don't leak data via XEN_DOMCTL_gethvmcontext

The HVM context may not fill up the full buffer passed by the caller.
While we correctly report the size of the context, we will still be
copying back the full size of the buffer.

As the buffer is allocated through xmalloc(), we will be copying some
bits from the previous allocation.

Only copy back the part of the buffer used by the HVM context to prevent
any leak.
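
In miniature, the fix amounts to bounding the copy by what was actually
written. The helper below is a hypothetical illustration using memcpy() in
place of the guest copy-out primitive.

```c
#include <stddef.h>
#include <string.h>

/*
 * Copy back only the bytes the context actually used, never the whole
 * allocation, so stale heap contents past `used` stay private.
 * Returns the number of bytes copied.
 */
static size_t copy_back_used(void *dst, const void *src, size_t buf_size,
                             size_t used)
{
    size_t n = used < buf_size ? used : buf_size;

    memcpy(dst, src, n);
    return n;
}
```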

Note that per XSA-72, this is not a security issue.

Signed-off-by: Julien Grall <jgrall@amazon.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 41d8869003e96d8b7250ad1d0246371d6929aca6
master date: 2020-01-31 18:51:38 +0000

x86/suspend: disable watchdog before calling console_start_sync()

... and enable it after exiting S-state. Otherwise accumulated
output in serial buffer might easily trigger the watchdog if it's
still enabled after entering sync transmission mode.

The issue was observed on machines which, unfortunately, generate non-0
output in CPU offline callbacks.

Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 5e08f5f56c9955d853c26c985b6fb1fb45d0355d
master date: 2020-01-29 15:06:10 +0100

x86/apic: fix disabling LVT0 in disconnect_bsp_APIC

The Intel SDM states:

"When an illegal vector value (0 to 15) is written to a LVT entry and
the delivery mode is Fixed (bits 8-11 equal 0), the APIC may signal an
illegal vector error, without regard to whether the mask bit is set or
whether an interrupt is actually seen on the input."

And that's exactly what's currently done in disconnect_bsp_APIC when
virt_wire_setup is true and LVT LINT0 is being masked. By writing only
APIC_LVT_MASKED Xen is actually setting the vector to 0 and the
delivery mode to Fixed (0), and hence it triggers an APIC error even
when the LVT entry is masked.
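
To make the bit layout concrete, here is a small predicate over an LVT
value. It follows the SDM text quoted above (vector in bits 0-7, delivery
mode taken from bits 8-11 as quoted, mask in bit 16). The helper name is
hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define APIC_LVT_MASKED   (1u << 16)
#define APIC_VECTOR_MASK  0xffu
#define APIC_MODE_MASK    0xf00u   /* bits 8-11, as in the quoted text */
#define APIC_MODE_EXTINT  0x700u

/*
 * True if writing val to an LVT entry may raise an illegal vector error:
 * Fixed delivery mode with a vector in the range 0 to 15. The mask bit
 * deliberately plays no part in the decision.
 */
static bool lvt_value_has_illegal_vector(uint32_t val)
{
    return (val & APIC_MODE_MASK) == 0 && (val & APIC_VECTOR_MASK) < 16;
}
```

Writing only APIC_LVT_MASKED is flagged by this predicate, while a masked
entry that keeps a non-Fixed mode or a vector of 16 or above is not.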

This would usually manifest when Xen is being shut down, as that's
where disconnect_bsp_APIC is called:

(XEN) APIC error on CPU0: 40(00)

Fix this by calling clear_local_APIC prior to setting the LVT LINT
registers which already clear LVT LINT0, and hence the troublesome
write can be avoided as the register is already cleared.

Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 782b48b7f7319c07b044606d67a60875e53dd05b
master date: 2020-01-29 14:47:00 +0100

VT-d: don't pass bridge devices to domain_context_mapping_one()

When passed a non-NULL pdev, the function does an owner check when it
finds an already existing context mapping. Bridges, however, don't get
passed through to guests, and hence their owner is always going to be
Dom0, leading to the assignment of all but one of the functions of
multi-function PCI devices behind bridges to fail.

Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
master commit: a4d457fd59f4ebfb524aec82cb6a3030087914ca
master date: 2020-01-22 16:39:58 +0100

x86/sm{e, a}p: do not enable SMEP/SMAP in PV shim by default on AMD

Due to AMD and Hygon being unable to selectively trap CR4 bit modifications,
running a 32-bit PV guest inside PV shim comes with a significant performance
hit. Moreover, for SMEP in particular every time CR4.SMEP changes on context
switch to/from 32-bit PV guest, it gets trapped by L0 Xen which then
tries to perform global TLB invalidation for PV shim domain. This usually
results in eventual hang of a PV shim with at least several vCPUs.

Since the overall security risk is generally lower for shim Xen, it being
there more as a defense-in-depth mechanism, choose to disable SMEP/SMAP in
it by default on AMD and Hygon unless the user chooses otherwise.

Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: b05ec9263e56ef0784da766e829cfe08569d1d88
master date: 2020-01-17 16:18:20 +0100

x86/time: update TSC stamp on restore from deep C-state

If ITSC is not available on the CPU (e.g. if running nested as PV shim)
then X86_FEATURE_NONSTOP_TSC is not advertised in certain cases, i.e. on
all AMD and some old Intel processors. In that case the TSC needs to
be restored on the CPU from platform time by Xen upon exiting C-states.

As platform time might be behind the last TSC stamp recorded for the
current CPU, the invariant of the TSC stamp always being behind the local
TSC counter is violated. This can make get_s_time() go negative, resulting
in an eventual system hang or crash.

Fix this issue by updating local TSC stamp along with TSC counter write.

Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: bbf283f853f8c0e4d29248dd44d3b0e0abc07629
master date: 2020-01-17 16:11:20 +0100

IRQ: u16 is too narrow for an event channel number

FIFO event channels allow ports up to 2^17, so we need to use a wider
field in struct pirq. Move "masked" such that it may share the 8-byte
slot with struct arch_pirq on 64-bit arches, rather than leaving a
7-byte hole in all cases.
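
The narrowing problem is easy to demonstrate in isolation. The helper below
is purely illustrative (struct pirq is not reproduced); it shows what a
16-bit field does to a port number above 65535.

```c
#include <stdint.h>

/* What a u16 field keeps of a port number: only the low 16 bits. */
static uint16_t store_port_u16(uint32_t port)
{
    return (uint16_t)port;
}
```

Port 65535 survives the round trip, whereas port 70000 comes back as 4464
(70000 - 65536), which is a different, valid-looking port.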

Take the opportunity and also add a comment regarding "arch" placement
within the structure.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Arm: fix build after 892b9dcebdb7

"IRQ: u16 is too narrow for an event channel number" introduced a use of
evtchn_port_t, but its typedef apparently surfaces indirectly here only
on x86.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 892b9dcebdb7f646657e11cfdd95a385107bbefa
master date: 2020-01-14 12:03:47 +0100
master commit: b4194711ffaffa5e63d986338fb8d4020fa6bad1
master date: 2020-01-14 16:06:27 +0100

x86: clear per cpu stub page information in cpu_smpboot_free()

cpu_smpboot_free() removes the stubs for the cpu going offline, but it
isn't clearing the related percpu variables. This will result in
crashes when a stub page is released due to all related cpus having gone
offline and one of those cpus coming online later.

Fix that by clearing stubs.addr and stubs.mfn in order to allocate a
new stub page when needed, irrespective of whether the CPU gets parked
or removed.

Fixes: 2e6c8f182c9c50 ("x86: distinguish CPU offlining from CPU removal")
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Wei Liu <wl@xen.org>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Tested-by: Tao Xu <tao3.xu@intel.com>
master commit: 774901788c5614798931a1cb2e20dd8b885f97ab
master date: 2020-01-09 11:07:38 +0100

xen/arm: Place a speculation barrier sequence following an eret instruction

Some CPUs can speculate past an ERET instruction and potentially perform
speculative accesses to memory before processing the exception return.
Since the register state is often controlled by lower privilege level
at the point of an ERET, this could potentially be used as part of a
side-channel attack.

Newer CPUs may implement a new SB barrier instruction which acts
as an architected speculation barrier. For current CPUs, the sequence
DSB; ISB is known to prevent speculation.

The latter sequence is heavier than SB but it would never be executed
(this is speculation after all!).

Introduce a new macro 'sb' that could be used when a speculation barrier
is required. For now it is using dsb; isb but this could easily be
updated to cater for SB in the future.

This is XSA-312.

Signed-off-by: Julien Grall <julien@xen.org>

lz4: fix system halt at boot kernel on x86_64

Sometimes, on x86_64, decompression fails with the following
error:

Decompressing Linux...

Decoding failed

-- System halted

This condition is not needed for a 64-bit kernel (from commit d5e7caf):

if( ... ||
(op + COPYLENGTH) > oend)
goto _output_error

The LZ4_SECURECOPY() macro tests op and does not copy any data
when op exceeds the value.

added by analogy to lz4_uncompress_unknownoutputsize(...)

Signed-off-by: Krzysztof Kolasa <kkolasa@winsoft.pl>
[Linux commit 99b7e93c95c78952724a9783de6c78def8fbfc3f]

The offending commit in our case is fcc17f96c277 ("LZ4 : fix the data
abort issue").

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 5d90ff79542ab9c6eebe5c315c68c196bcf353b9
master date: 2019-12-09 14:02:35 +0100

lz4: refine commit 9143a6c55ef7 for the 64-bit case

I clearly went too far there: While the LZ4_WILDCOPY() instances indeed
need prior guarding, LZ4_SECURECOPY() needs this only in the 32-bit case
(where it simply aliases LZ4_WILDCOPY()). "cpy" can validly point
(slightly) below "op" in these cases, due to

cpy = op + length - (STEPSIZE - 4);

where length can be as low as 0 and STEPSIZE is 8. However, instead of
removing the check via "#if !LZ4_ARCH64", refine it such that it would
also properly work in the 64-bit case, aborting decompression instead
of continuing on bogus input.
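
The arithmetic behind "cpy can point below op" can be checked with offsets
instead of pointers. This only models the quoted expression with STEPSIZE 8;
cpy_offset() is a hypothetical helper and the refined check itself is not
reproduced.

```c
#include <stddef.h>

#define STEPSIZE 8

/* Offset form of: cpy = op + length - (STEPSIZE - 4); */
static ptrdiff_t cpy_offset(ptrdiff_t op, size_t length)
{
    return op + (ptrdiff_t)length - (STEPSIZE - 4);
}
```

For length 0 the result is op - 4, so cpy sits four bytes below op; from
length 4 upwards it is at or above op.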

Reported-by: Mark Pryor <pryorm09@gmail.com>
Reported-by: Jeremi Piotrowski <jeremi.piotrowski@gmail.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Tested-by: Mark Pryor <pryorm09@gmail.com>
Tested-by: Jeremi Piotrowski <jeremi.piotrowski@gmail.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 2d7572cdfa4d481c1ca246aa1ce5239ccae7eb59
master date: 2019-12-09 14:01:25 +0100

x86/tlbflush: do not toggle the PGE CR4 bit unless necessary

When PCID is not available Xen does a full tlbflush by toggling the
PGE bit in CR4. This is not necessary if PGE is not enabled, since a
flush can be performed by writing to CR3 in that case.

Change the code in do_tlb_flush to only toggle the PGE bit in CR4 if
it's already enabled, otherwise do the tlb flush by writing to CR3.
This is relevant when running virtualized, since hypervisors don't
usually trap accesses to CR3 when using hardware assisted paging, but
do trap accesses to CR4, especially on AMD hardware, which makes such
accesses much more expensive.
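
The decision described above reduces to a test of CR4.PGE (bit 7). The enum
and function names in this sketch are hypothetical, and the actual register
writes are left out so that the logic can be read on its own.

```c
#include <stdint.h>

#define X86_CR4_PGE (1u << 7)

enum flush_method {
    FLUSH_VIA_CR3_WRITE,   /* reload CR3; enough when PGE is off */
    FLUSH_VIA_CR4_PGE,     /* toggle CR4.PGE to drop global entries too */
};

/* Pick the cheapest full flush for the non-PCID case. */
static enum flush_method choose_full_flush(uint32_t cr4)
{
    return (cr4 & X86_CR4_PGE) ? FLUSH_VIA_CR4_PGE : FLUSH_VIA_CR3_WRITE;
}
```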

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: b5087a31efee7a4e34c958b88671ac6669501b09
master date: 2019-12-03 14:15:35 +0100

x86: avoid HPET use on certain Intel platforms

Linux commit fc5db58539b49351e76f19817ed1102bf7c712d0 says

"Some Coffee Lake platforms have a skewed HPET timer once the SoCs entered
PC10, which in consequence marks TSC as unstable because HPET is used as
watchdog clocksource for TSC."

Follow this for Xen as well. Looking at its patch context made me notice
they have a pre-existing quirk for Bay Trail as well. The comment there,
however, points at a Cherry Trail document. Looking at the datasheets of
both, there appear to be similar issues, so go beyond Linux'es coverage
and exclude both. Also key the disable on the PCI IDs of the actual
affected devices, rather than those of 00:00.0.

Apply the workarounds only when the use of HPET was not explicitly
requested on the command line and when use of (deep) C-states was not
disabled.

Adjust a few types in touched or nearby code at the same time.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: d5294a302c8441191d47888452958aea25243723
master date: 2019-12-03 14:14:44 +0100

gnttab: make sure grant map operations don't skip their IOMMU part

Two almost simultaneous mapping requests need to make sure that at the
completion of the earlier one IOMMU mappings (established explicitly
here in the PV case) have been put in place. Ever since the splitting
of the grant table lock a violation of this has been possible (using
simplified pin counts, as it doesn't matter whether we talk about read
or write mappings here):

initial state: act->pin = 0

vCPU A: progress the operation past the dropping of the locks after the
        act->pin updates (act->pin = 1, old_pin = 0, act_pin = 1)

vCPU B: progress the operation past the dropping of the locks after the
        act->pin updates (act->pin = 2, old_pin = 1, act_pin = 2)

vCPU B: (re-)acquire both gt locks, mapkind() returns 0, but both
        iommu_legacy_map() invocations get skipped due to non-zero
        old_pin

vCPU B: return to caller without IOMMU mapping

vCPU A: (re-)acquire both gt locks, mapkind() returns 0,
        iommu_legacy_map() gets invoked

With the locks dropped intermediately, whether to invoke
iommu_legacy_map() must depend on only the return value of mapkind()
and of course the kind of mapping request being processed, just like
is already the case in unmap_common().
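
A heavily simplified model of the two decision shapes, using the same
simplified pin counts as the scenario above and ignoring the read/write
distinction. Both function names are hypothetical and neither is the grant
table code.

```c
#include <stdbool.h>

/* Pre-fix shape: also gated on the pin count seen before this request. */
static bool need_iommu_map_old(unsigned int old_pin, unsigned int kind)
{
    return old_pin == 0 && kind == 0;
}

/* Post-fix shape: only the result of mapkind() matters. */
static bool need_iommu_map_new(unsigned int kind)
{
    return kind == 0;
}
```

For vCPU B in the scenario (old_pin = 1, mapkind() returning 0) the first
form skips the IOMMU mapping while the second performs it.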

Also fix the style of the adjacent comment, and correct a nearby one
still referring to a prior name of what is now mapkind().

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 921f1f42260c7967bf18f8a143d39511d163c421
master date: 2019-12-03 14:13:40 +0100

x86/psr: fix bug which may cause crash

During testing, we found a crash on Xen with the trace below.
(XEN) Xen call trace:
(XEN)    [<ffff82d0802a065a>] R psr.c#l3_cdp_write_msr+0x1e/0x22
(XEN)    [<ffff82d0802a0858>] F psr.c#do_write_psr_msrs+0x6d/0x109
(XEN)    [<ffff82d08023e000>] F smp_call_function_interrupt+0x5a/0xac
(XEN)    [<ffff82d0802a2b89>] F call_function_interrupt+0x20/0x34
(XEN)    [<ffff82d080282c64>] F do_IRQ+0x175/0x6ae
(XEN)    [<ffff82d08038b8ba>] F common_interrupt+0x10a/0x120
(XEN)    [<ffff82d0802ec616>] F cpu_idle.c#acpi_idle_do_entry+0x9d/0xb1
(XEN)    [<ffff82d0802ecc01>] F cpu_idle.c#acpi_processor_idle+0x41d/0x626
(XEN)    [<ffff82d08027353b>] F domain.c#idle_loop+0xa5/0xa7
(XEN)
(XEN)
(XEN) ****************************************
(XEN) Panic on CPU 20:
(XEN) GENERAL PROTECTION FAULT
(XEN) [error_code=0000]
(XEN) ****************************************

The bug happens when CDP and MBA co-exist and MBA COS_MAX is bigger
than CDP COS_MAX. E.g. MBA has 8 COS registers but CDP only has 6.
When setting MBA throttling value for the 7th guest, the value array
would be:
    +------------------+------------------+--------------+
    | Data default val | Code default val | MBA throttle |
    +------------------+------------------+--------------+

Then, COS id 7 will be selected for writing the values. We should
avoid writing CDP data/code values to COS id 7 MSR because it
exceeds the CDP COS_MAX.
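
The guard can be sketched as a per-feature range check before each MSR
write. The helper name is hypothetical, and COS_MAX is taken here to be the
highest valid COS id (an assumption: 6 registers give a COS_MAX of 5, and 8
registers give 7).

```c
#include <stdbool.h>

/* A feature's MSR may only be written for COS ids it actually implements. */
static bool cos_valid_for_feature(unsigned int cos, unsigned int cos_max)
{
    return cos <= cos_max;
}
```

With those numbers, COS id 7 is valid for MBA but not for CDP, so only the
MBA throttle value would be written.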

Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 42c8cdc039d6dc7d6aea8008bb24622eaf4b7bc8
master date: 2019-12-02 15:15:18 +0000

x86 / iommu: set up a scratch page in the quarantine domain

This patch introduces a new iommu_op to facilitate a per-implementation
quarantine set up, and then further code for x86 implementations
(amd and vtd) to set up a read-only scratch page to serve as the source
for DMA reads whilst a device is assigned to dom_io. DMA writes will
continue to fault as before.

The reason for doing this is that some hardware may continue to re-try
DMA (despite FLR) in the event of an error, or even BME being cleared, and
will fail to deal with DMA read faults gracefully. Having a scratch page
mapped will allow pending DMA reads to complete and thus such buggy
hardware will eventually be quiesced.

NOTE: These modifications are restricted to x86 implementations only as
      the buggy h/w I am aware of is only used with Xen in an x86
      environment. ARM may require similar code but, since I am not
      aware of the need, this patch does not modify any ARM implementation.

Signed-off-by: Paul Durrant <pdurrant@amazon.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: ea38867831da67eed0e9c61672c8941016b63dd9
master date: 2019-11-29 18:27:54 +0000

xen/x86: vpmu: Unmap per-vCPU PMU page when the domain is destroyed

A guest will set up a shared page with the hypervisor for each vCPU via
XENPMU_init. The page will then get mapped in the hypervisor and only
released when XENPMU_finish is called.

This means that if the guest fails to invoke XENPMU_finish, e.g. if it is
destroyed rather than cleanly shut down, the page will stay mapped in the
hypervisor. One of the consequences is the domain can never be fully
destroyed as a page reference is still held.

As Xen should never rely on the guest to correctly clean up any
allocation in the hypervisor, we should also unmap such pages during the
domain destruction if there are any left.

We can re-use the same logic as in pvpmu_finish(). To avoid
duplication, move the logic into a new function that can also be called
from vpmu_destroy().

NOTE: - The call to vpmu_destroy() must also be moved from
        arch_vcpu_destroy() into domain_relinquish_resources() such that
        the reference on the mapped page does not prevent domain_destroy()
        (which calls arch_vcpu_destroy()) from being called.
      - Whilst it appears that vpmu_arch_destroy() is idempotent it is
        by no means obvious. Hence make sure the VPMU_CONTEXT_ALLOCATED
        flag is cleared at the end of vpmu_arch_destroy().
      - This is not an XSA because vPMU is not security supported (see
        XSA-163).

Signed-off-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Paul Durrant <pdurrant@amazon.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: be18e39d2f69038804b27c30026754deaeefa543
master date: 2019-11-29 18:23:24 +0000

x86/svm: Write the correct %eip into the outgoing task

The TASK_SWITCH vmexit has fault semantics, and doesn't provide any NRIPs
assistance with instruction length. As a result, any instruction-induced task
switch has the outgoing task's %eip pointing at the instruction which caused
the switch, rather than after it.

This causes callers of task gates to livelock (repeatedly execute the call/jmp
to enter the task), and any restartable task to become a nop after its first
use (the (re)entry state points at the iret used to exit the task).

32bit Windows in particular is known to use task gates for NMI handling, and
to use NMI IPIs.

In the task switch handler, distinguish instruction-induced from
interrupt/exception-induced task switches, and decode the instruction under
%rip to calculate its length.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 1d758bc6d1a8c0f658a874470c349ee4e27aee46
master date: 2019-11-28 17:14:38 +0000

x86/svm: Always intercept ICEBP

ICEBP isn't handled well by SVM.

The VMexit state for a #DB-vectored TASK_SWITCH has %rip pointing to the
appropriate instruction boundary (fault or trap, as appropriate), except for
an ICEBP-induced #DB TASK_SWITCH, where %rip points at the ICEBP instruction
rather than after it.  As ICEBP isn't distinguished in the vectoring event
type, the state is ambiguous.

To add to the confusion, an ICEBP which occurs due to Introspection
intercepting the instruction, or from x86_emulate() will have %rip updated as
a consequence of partial emulation required to inject an ICEBP event in the
first place.

We could in principle spot the non-injected case in the TASK_SWITCH handler,
but this still results in complexity if the ICEBP instruction also has an
Instruction Breakpoint active on it (which genuinely has fault semantics).

Unconditionally intercept ICEBP.  This does have NRIPs support as it is an
instruction intercept, which allows us to move %rip forwards appropriately
before the TASK_SWITCH intercept is hit.  This makes #DB-vectored switches
have consistent behaviour however the ICEBP #DB came about, and avoids special
cases in the TASK_SWITCH intercept.

This in turn allows for the removal of the conditional
hvm_set_icebp_interception() logic used by the monitor subsystem, as ICEBP's
will now always be submitted for monitoring checks.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Alexandru Isaila <aisaila@bitdefender.com>
Reviewed-by: Petre Pircalabu <ppircalabu@bitdefender.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: e2585f8c2e0d43d350503ff2b2be252adc6b7239
master date: 2019-11-28 17:14:38 +0000

x86/vtx: Fix fault semantics for early task switch failures

The VT-x task switch handler adds inst_len to %rip before calling
hvm_task_switch(), which is problematic in two ways:

1) Early faults (i.e. ones delivered in the context of the old task) get
    delivered with trap semantics, and break restartability.

2) The addition isn't truncated to 32 bits.  In the corner case of a task
    switch instruction crossing the 4G->0 boundary taking an early fault (with
    trap semantics), a VMEntry failure will occur due to %rip being out of
    range.

Instead, pass the instruction length into hvm_task_switch() and write it into
the outgoing TSS only, leaving %rip in its original location.
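
A sketch of the value written into the outgoing TSS, assuming the 32-bit
truncation that point 2) above calls for. outgoing_tss_eip() is a
hypothetical helper; %rip itself is left untouched by it.

```c
#include <stdint.h>

/*
 * %eip to record for the outgoing task: the address after the switching
 * instruction, wrapped to 32 bits so a crossing of the 4G->0 boundary
 * stays in range.
 */
static uint32_t outgoing_tss_eip(uint64_t rip, unsigned int inst_len)
{
    return (uint32_t)(rip + inst_len);
}
```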

For now, pass 0 on the SVM side.  This highlights a separate preexisting bug
which will be addressed in the following patch.

While adjusting call sites, drop the unnecessary uint16_t cast.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
master commit: 943c74bc0ee5044a826e428a3b2ffbdf9a43628d
master date: 2019-11-28 17:14:38 +0000

x86/vmx: always sync PIR to IRR before vmentry

When using posted interrupts on Intel hardware it's possible that the
vCPU resumes execution with a stale local APIC IRR register because
depending on the interrupts to be injected vlapic_has_pending_irq
might not be called, and thus PIR won't be synced into IRR.

Fix this by making sure PIR is always synced to IRR in
hvm_vcpu_has_pending_irq regardless of what interrupts are pending.
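
As a rough model of the intended ordering (not the actual Xen code), the
sketch below syncs unconditionally before looking for the highest pending
vector.  All type and function names are hypothetical, and the 256-bit PIR
and IRR are modelled as four 64-bit words.

```c
#include <stdatomic.h>
#include <stdint.h>

#define NR_VECTORS 256
#define NR_WORDS   (NR_VECTORS / 64)

/* Hypothetical model of the posted interrupt request bitmap. */
struct pi_desc_sketch {
    _Atomic uint64_t pir[NR_WORDS];
};

/* Hypothetical model of the emulated local APIC IRR. */
struct vlapic_sketch {
    uint64_t irr[NR_WORDS];
};

/* Atomically drain PIR and merge the drained bits into IRR. */
static void sync_pir_to_irr(struct pi_desc_sketch *pi,
                            struct vlapic_sketch *vlapic)
{
    for ( unsigned int i = 0; i < NR_WORDS; i++ )
    {
        uint64_t pending = atomic_exchange(&pi->pir[i], (uint64_t)0);

        vlapic->irr[i] |= pending;
    }
}

/* Highest vector set in IRR, or -1 if none. */
static int highest_irr(const struct vlapic_sketch *vlapic)
{
    for ( int i = NR_WORDS - 1; i >= 0; i-- )
        for ( int bit = 63; bit >= 0; bit-- )
            if ( vlapic->irr[i] & (UINT64_C(1) << bit) )
                return i * 64 + bit;

    return -1;
}

/* Sync first, on every path, and only then inspect IRR. */
static int has_pending_irq_sketch(struct pi_desc_sketch *pi,
                                  struct vlapic_sketch *vlapic)
{
    sync_pir_to_irr(pi, vlapic);

    return highest_irr(vlapic);
}
```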

Reported-by: Joe Jin <joe.jin@oracle.com>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Tested-by: Joe Jin <joe.jin@oracle.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: 56348df32bbc782e63b6e3fb978b80e015ae76e7
master date: 2019-11-28 11:58:25 +0100

x86/domctl: have XEN_DOMCTL_getpageframeinfo3 preemptible

This hypercall can take a long time to finish because it attempts to
grab the `hostp2m' lock up to 1024 times. The accumulated wait for the
lock can take several seconds.

This can easily happen with a guest with 32 vcpus and plenty of RAM,
during localhost migration.

While the patch doesn't fix the problem with the lock contention and
the fact that the `hostp2m' lock is currently global (and not on a
single page), it is still an improvement to the hypercall. It will in
particular, down the road, allow dropping the arbitrary limit of 1024
entries per request.
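
The general shape of a preemptible batch operation can be sketched as
follows.  This is a minimal model under stated assumptions, not the domctl
implementation: struct batch, process_batch(), RC_RESTART and the budget
parameter (standing in for a real preemption check) are all hypothetical.

```c
#include <stddef.h>

#define RC_RESTART (-1)   /* Hypothetical stand-in for -ERESTART. */

/* Progress lives in the request, so a continuation can resume from it. */
struct batch {
    size_t start;   /* First entry not yet processed. */
    size_t num;     /* Total number of entries requested. */
};

static int process_batch(struct batch *b, unsigned int *out, size_t budget)
{
    size_t done = 0;

    while ( b->start < b->num )
    {
        /* Stand-in for a preemption check between entries. */
        if ( done == budget )
            return RC_RESTART;

        out[b->start] = (unsigned int)b->start * 2;   /* Placeholder work. */
        b->start++;
        done++;
    }

    return 0;
}
```

Because the caller re-invokes the operation with the updated start index,
no single invocation has to hold up the CPU for the whole request.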

Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: 48599114d3ca24157c25f6684bb9322f6dca12bb
master date: 2019-11-26 14:16:09 +0100

x86/tss: Fix clang build following c/s 7888440625

Clang-3.5 from Debian Jessie fails with:

  smpboot.c:829:29: error: statement expression not allowed at file scope
          BUILD_BUG_ON(sizeof(this_cpu(tss_page)) != PAGE_SIZE);
                              ^
  /local/xen.git/xen/include/asm/percpu.h:14:7: note: expanded from macro
          'this_cpu'
      (*RELOC_HIDE(&per_cpu__##var, get_cpu_info()->per_cpu_offset))
        ^
  /local/xen.git/xen/include/xen/compiler.h:98:3: note: expanded from macro
          'RELOC_HIDE'
    ({ unsigned long __ptr;                       \
    ^
  /local/xen.git/xen/include/xen/lib.h:26:53: note: expanded from macro
          'BUILD_BUG_ON'
  #define BUILD_BUG_ON(cond) ((void)BUILD_BUG_ON_ZERO(cond))
                                                      ^
  /local/xen.git/xen/include/xen/lib.h:25:57: note: expanded from macro
          'BUILD_BUG_ON_ZERO'
  #define BUILD_BUG_ON_ZERO(cond) sizeof(struct { int:-!!(cond); })
                                                          ^
  1 error generated.
  /local/xen.git/xen/Rules.mk:202: recipe for target 'smpboot.o' failed

This is obviously a compiler bug because the BUILD_BUG_ON() is not at file
scope.  However, it can be worked around by using a local variable.

Spotted by Gitlab CI.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Wei Liu <wl@xen.org>
master commit: 1722da6c0c6f6b7b320bdd239c46c0cb1048f804
master date: 2019-08-14 12:04:20 +0100

x86: Don't increase ApicIdCoreSize past 7

Changeset ca2eee92df44 ("x86, hvm: Expose host core/HT topology to HVM
guests") attempted to "fake up" a topology which would induce guest
operating systems to not treat vcpus as sibling hyperthreads.  This
involved actually reporting hyperthreading as available, but giving
vcpus every other ApicId; which in turn led to doubling the ApicIds
per core by bumping the ApicIdCoreSize by one.  In particular, Ryzen
3xxx series processors, and reportedly EPYC "Rome" cpus -- have an
ApicIdCoreSize of 7; the "fake" topology increases this to 8.

Unfortunately, Windows running on modern AMD hardware -- including
Ryzen 3xxx series processors, and reportedly EPYC "Rome" cpus --
doesn't seem to cope with this value being higher than 7.  (Linux
guests have so far continued to cope.)

A "proper" fix is complicated and it's too late to fix it either for
4.13, or to backport to supported branches.  As a short-term fix,
limit this value to 7.
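
A minimal sketch of such a cap is shown below.  It assumes ApicIdCoreSize
is the 4-bit field in bits 15:12 of CPUID leaf 0x80000008 ECX; the macro
and function names are hypothetical and do not come from the Xen sources.

```c
#include <stdint.h>

#define APIC_ID_SIZE_SHIFT 12                                /* Assumed. */
#define APIC_ID_SIZE_MASK  (UINT32_C(0xf) << APIC_ID_SIZE_SHIFT)
#define APIC_ID_SIZE_LIMIT 7

/* Bump the field by one for the "fake" topology, but never past 7. */
static uint32_t bump_apic_id_core_size(uint32_t ecx)
{
    uint32_t size = (ecx & APIC_ID_SIZE_MASK) >> APIC_ID_SIZE_SHIFT;

    if ( size < APIC_ID_SIZE_LIMIT )
        size++;

    /* All other bits of the register are passed through unchanged. */
    return (ecx & ~APIC_ID_SIZE_MASK) | (size << APIC_ID_SIZE_SHIFT);
}
```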

This does mean that a Linux guest, booted on such a system without
this change, and then migrating to a system with this change, with
more than 64 vcpus, would see an apparent topology change.  This is a
low enough risk in practice that enabling this limit unilaterally, to
allow other guests to boot without manual intervention, is worth it.

Reported-by: Steven Haigh <netwiz@crc.id.au>
Reported-by: Andreas Kinzler <hfp@posteo.de>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: 8c79c129a6db2220c1089e0ce5fa49e7298b1d3e
master date: 2019-11-26 10:33:52 +0000

AMD/IOMMU: Cease using a dynamic height for the IOMMU pagetables

update_paging_mode() has multiple bugs:

1) Booting with iommu=debug will cause it to inform you that it was called
    without the pdev_list lock held.
2) When growing by more than a single level, it leaks the newly allocated
    table(s) in the case of a further error.

Furthermore, the choice of default level for a domain has issues:

1) All HVM guests grow from 2 to 3 levels during construction because of the
    position of the VRAM just below the 4G boundary, so defaulting to 2 is a
    waste of effort.
2) The limit for PV guests doesn't take memory hotplug into account, and
    isn't dynamic at runtime like HVM guests.  This means that a PV guest may
    get RAM which it can't map in the IOMMU.

The dynamic height is a property unique to AMD, and adds a substantial
quantity of complexity for what is a marginal performance improvement.  Remove
the complexity by removing the dynamic height.

PV guests now get 3 or 4 levels based on any hotplug regions in the host.
This only makes a difference for hardware which previously had all RAM below
the 512G boundary, and a hotplug region above.

HVM guests now get 4 levels (which will be sufficient until 256TB guests
become a thing), because we don't currently have the information to know when
3 would be safe to use.
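
The 512G and 256TB figures follow from simple arithmetic, assuming 4k pages
(12 offset bits) and 512-entry tables (9 bits per level): n levels cover
2^(12 + 9n) bytes.  The helper below is a hypothetical illustration of that
calculation, not code from the IOMMU driver.

```c
#include <stdint.h>

#define PAGE_SHIFT_SKETCH 12   /* 4k pages. */
#define LEVEL_BITS_SKETCH 9    /* 512 entries per table. */

/* Smallest number of levels able to map the given highest address. */
static unsigned int levels_needed(uint64_t max_addr)
{
    unsigned int levels = 1;

    /* The cap of 6 keeps the shift count below 64. */
    while ( levels < 6 &&
            (max_addr >> (PAGE_SHIFT_SKETCH + LEVEL_BITS_SKETCH * levels)) )
        levels++;

    return levels;
}
```

With these assumptions, 3 levels reach up to 2^39 - 1 (just under 512G) and
4 levels reach up to 2^48 - 1 (just under 256TB).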

The overhead of this extra level is not expected to be noticeable.  It costs
one page (4k) per domain, and one extra IO-TLB paging structure cache entry
which is very hot and less likely to be evicted.

This is XSA-311.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: b4f042236ae0bb6725b3e8dd40af5a2466a6f971
master date: 2019-12-11 14:55:32 +0100

x86/mm: relinquish_memory: Grab an extra type ref when setting PGT_partial

The PGT_partial bit in page->type_info holds both a type count and a
general ref count.  During domain tear-down, when free_page_type()
returns -ERESTART, relinquish_memory() correctly handles the general
ref count, but fails to grab an extra type count when setting
PGT_partial.  When this bit is eventually cleared, type_count underflows
and triggers the following BUG in page_alloc.c:free_domheap_pages():

    BUG_ON((pg[i].u.inuse.type_info & PGT_count_mask) != 0);

As far as we can tell, this page underflow cannot be exploited in any
other way: The page can't be used as a pagetable by the dying domain
because it's dying; it can't be used as a pagetable by any other
domain since it belongs to the dying domain; and ownership can't
transfer to any other domain without hitting the BUG_ON() in
free_domheap_pages().

(steal_page() won't work on a page in this state, since it requires
PGC_allocated to be set, and PGC_allocated will already have been
cleared.)

Fix this by grabbing an extra type ref if setting PGT_partial in
relinquish_memory.

This is part of XSA-310.

Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: 66bdc16aeed8ddb2ae724adc5ea6bde0dea78c3d
master date: 2019-12-11 14:55:08 +0100

x86/mm: alloc/free_lN_table: Retain partial_flags on -EINTR

When validating or de-validating pages (in alloc_lN_table and
free_lN_table respectively), the `partial_flags` local variable is
used to keep track of whether the "current" PTE started the entire
operation in a "may be partial" state.

One of the patches in XSA-299 addressed the fact that it is possible
for a previously-partially-validated entry to subsequently be found to
have invalid entries (indicated by returning -EINVAL); in which case
page->partial_flags needs to be set to indicate that the current PTE
may have the partial bit set (and thus _put_page_type() should be
called with PTF_partial_set).

Unfortunately, the patches in XSA-299 assumed that once
put_page_from_lNe() returned -ERESTART on a page, it was not possible
for it to return -EINTR.  This turns out to be true for
alloc_lN_table() and free_lN_table(), but not for _get_page_type() and
_put_page_type(): both can return -EINTR when called on pages with
PGT_partial set.  In these cases, the page's PGT_partial will still be
set; failing to set partial_flags appropriately may allow an attacker
to do a privilege escalation similar to those described in XSA-299.

Fix this by always copying the local partial_flags variable into
page->partial_flags when exiting early.

NB that on the "get" side, no adjustment to nr_validated_entries is
needed: whether pte[i] is partially validated or entirely
un-validated, we want nr_validated_entries = i.  On the "put" side,
however, we need to adjust nr_validated_entries appropriately: if
pte[i] is entirely validated, we want nr_validated_entries = i + 1; if
pte[i] is partially validated, we want nr_validated_entries = i.
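
The rule in the previous paragraph can be written down as two tiny helpers.
The function names are hypothetical; only the arithmetic is taken from the
description above.

```c
#include <stdbool.h>

/*
 * "get" (validation) side: whether pte[i] is partially validated or
 * entirely un-validated, entries 0..i-1 are the fully validated ones.
 */
static unsigned int nr_validated_after_get(unsigned int i)
{
    return i;
}

/*
 * "put" (de-validation) side: a fully validated pte[i] still counts,
 * a partially validated one does not.
 */
static unsigned int nr_validated_after_put(unsigned int i, bool partial)
{
    return partial ? i : i + 1;
}
```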

This is part of XSA-310.

Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 4e70f4476c0c543559f971faecdd5f1300cddb0a
master date: 2019-12-11 14:54:43 +0100

x86/mm: Set old_guest_table when destroying vcpu pagetables

Changeset 6c4efc1eba ("x86/mm: Don't drop a type ref unless you held a
ref to begin with"), part of XSA-299, changed the calling discipline
of put_page_type() such that if put_page_type() returned -ERESTART
(indicating a partially de-validated page), subsequent calls to
put_page_type() must be called with PTF_partial_set.  If called on a
partially de-validated page but without PTF_partial_set, Xen will
BUG(), because to do otherwise would risk opening up the kind of
privilege escalation bug described in XSA-299.

One place this was missed was in vcpu_destroy_pagetables().
put_page_and_type_preemptible() is called, but on -ERESTART, the
entire operation is simply restarted, causing put_page_type() to be
called on a partially de-validated page without PTF_partial_set.  The
result was that if such an operation were interrupted, Xen would hit a
BUG().

Fix this by having vcpu_destroy_pagetables() consistently pass off
interrupted de-validations to put_old_page_type():
- Unconditionally clear references to the page, even if
  put_page_and_type failed
- Set old_guest_table and old_guest_table_partial appropriately

While here, do some refactoring:

- Move clearing of arch.cr3 to the top of the function

- Now that clearing is unconditional, move the unmap to the same
   conditional as the l4tab mapping.  This also allows us to reduce
   the scope of the l4tab variable.

- Avoid code duplication by looping to drop references on
   guest_table_user

This is part of XSA-310.

Reported-by: Sarah Newman <srn@prgmr.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: ececa12b2c4c8e4433e4f9be83f5c668ae36fe08
master date: 2019-12-11 14:54:13 +0100

x86/mm: Don't reset linear_pt_count on partial validation

"Linear pagetables" is a technique which involves either pointing a
pagetable at itself, or at another pagetable of the same or higher level.
Xen has limited support for linear pagetables: A page may either point
to itself, or point to another page of the same level (i.e., L2 to L2,
L3 to L3, and so on).

XSA-240 introduced an additional restriction that limited the "depth"
of such chains by allowing pages to either *point to* other pages of
the same level, or *be pointed to* by other pages of the same level,
but not both. To implement this, we keep track of the number of
outstanding times a page points to or is pointed to by another page
table, to prevent both from happening at the same time.
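
One way to model such a counter is a single signed value whose sign
records the direction.  The sketch below is an illustration only: the
struct, the sign convention (positive for outgoing entries, negative for
incoming uses) and the non-atomic updates are assumptions, not a copy of
the Xen implementation.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-page bookkeeping. */
struct pt_page_sketch {
    int linear_pt_count;   /* > 0: points to others, < 0: is pointed to. */
};

/* Record that this page points to another page of the same level. */
static bool inc_linear_entries(struct pt_page_sketch *pg)
{
    if ( pg->linear_pt_count < 0 || pg->linear_pt_count == INT_MAX )
        return false;   /* Already pointed to, or would overflow. */
    pg->linear_pt_count++;
    return true;
}

/* Drop one outgoing entry; fails if none was recorded. */
static bool dec_linear_entries(struct pt_page_sketch *pg)
{
    if ( pg->linear_pt_count <= 0 )
        return false;
    pg->linear_pt_count--;
    return true;
}

/* Record that another page of the same level points to this page. */
static bool inc_linear_uses(struct pt_page_sketch *pg)
{
    if ( pg->linear_pt_count > 0 || pg->linear_pt_count == INT_MIN )
        return false;   /* Already points to others, or would overflow. */
    pg->linear_pt_count--;
    return true;
}

/* Drop one incoming use; fails if none was recorded. */
static bool dec_linear_uses(struct pt_page_sketch *pg)
{
    if ( pg->linear_pt_count >= 0 )
        return false;
    pg->linear_pt_count++;
    return true;
}
```

In this model, resetting the counter to zero while entries are still
outstanding makes a later decrement fail, and it would also let the page
start being pointed to while it still points elsewhere.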

Unfortunately, the original commit introducing this reset this count
when resuming validation of a partially-validated pagetable, dropping
some "linear_pt_entry" counts.

On debug builds on systems where guests used this feature, this might
lead to crashes that look like this:

Assertion 'oc > 0' failed at mm.c:874

Worse, if an attacker could engineer such a situation to occur, they
might be able to make loops or other arbitrary chains of linear
pagetables, leading to the denial-of-service situation outlined in
XSA-240.

This is XSA-309.

Reported-by: Manuel Bouyer <bouyer@antioche.eu.org>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 7473efd12fb7a6548f5303f1f4c5cb521543a813
master date: 2019-12-11 14:10:27 +0100

x86/vtx: Work around SingleStep + STI/MovSS VMEntry failures

See patch comment for technical details.

Concerning the timeline, this was first discovered in the aftermath of
XSA-156 which caused #DB to be intercepted unconditionally, but only in
its SingleStep + STI form which is restricted to privileged software.

After working with Intel and identifying the problematic vmentry check,
this workaround was suggested, and the patch was posted in an RFC
series. Outstanding work for that series (not breaking Introspection)
is still pending, and this fix from it (which wouldn't have been good
enough in its original form) wasn't committed.

A vmentry failure was reported to xen-devel, and debugging identified
this bug in its SingleStep + MovSS form by way of INT1, which does not
involve the use of any privileged instructions, and proving this to be a
security issue.

This is XSA-308.

Reported-by: Håkon Alstadheim <hakon@alstadheim.priv.no>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>
master commit: 1d3eb8259804e5bec991a3462d69ba6bd80bb40e
master date: 2019-12-11 14:09:30 +0100

x86+Arm32: make find_next_{,zero_}bit() have well defined behavior

These functions getting used with the 2nd and 3rd arguments being equal
wasn't well defined: Arm64 reliably returns the value of the 2nd
argument in this case, while on x86 for bitmaps up to 64 bits wide the
return value was undefined (due to the undefined behavior of a shift of
a value by the number of bits it's wide) when the incoming value was 64.
On Arm32 an actual out of bounds access would happen when the
size/offset value is a multiple of 32; if this access doesn't fault, the
return value would have been sufficiently correct afaict.

Make the functions consistently tolerate the last two arguments being
equal (and in fact the 3rd argument being greater than or equal to the 2nd),
in favor of finding and fixing all the use sites that violate the
original more strict assumption.
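
A portable, deliberately simple sketch of the tolerated behaviour follows.
It is not the optimised per-architecture code; the function name is
hypothetical, and the argument order (bitmap, size, offset) is assumed to
match the description above.

```c
#include <limits.h>

#define BITS_PER_LONG_SKETCH (sizeof(unsigned long) * CHAR_BIT)

/* Returns the index of the next set bit at or after offset, else size. */
static unsigned int find_next_bit_sketch(const unsigned long *addr,
                                         unsigned int size,
                                         unsigned int offset)
{
    /* offset >= size is tolerated: no memory access, no shift. */
    if ( offset >= size )
        return size;

    for ( unsigned int i = offset; i < size; i++ )
        if ( addr[i / BITS_PER_LONG_SKETCH] &
             (1UL << (i % BITS_PER_LONG_SKETCH)) )
            return i;

    return size;
}
```

Checking the bounds before touching the bitmap avoids both the undefined
shift and the out of bounds access mentioned above.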

This is XSA-307.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Julien Grall <julien@xen.org>
master commit: 7442006b9f0940fb36f1f8470a416ec836e0d2ce
master date: 2019-12-11 14:06:18 +0100

update Xen version to 4.11.4-pre

xen:arm: Populate arm64 image header

This patch adds image size and flags to XEN image header. It uses
those fields according to the updated Linux kernel image definition.

With this patch the bootloader can now place the XEN image anywhere in system
RAM at a 2MB aligned address without worrying about relocation.
For instance, it fixes the XEN boot on Amlogic SoC where the bootloader (U-BOOT)
always relocates the XEN image to an address range reserved for firmware data.

Signed-off-by: Amit Singh Tomar <amittomer25@gmail.com>
Reviewed-by: Andre Przywara <andre.przywara@arm.com>
Acked-by: Julien Grall <julien.grall@arm.com>
(cherry picked from commit 17bd254a508f4174fe0d56a9f1b9892b7649b4b9)

update Xen version to 4.11.3

IOMMU: default to always quarantining PCI devices

XSA-302 relies on the use of libxl's "assignable-add" feature to prepare
devices to be assigned to untrusted guests.

Unfortunately, this is not considered a strictly required step for
device assignment. The PCI passthrough documentation on the wiki
describes alternate ways of preparing devices for assignment, and
libvirt uses its own ways as well. Hosts where these alternate methods
are used will still leave the system in a vulnerable state after the
device comes back from a guest.

Default to always quarantining PCI devices, but provide a command line
option to revert back to prior behavior (such that people who both
sufficiently trust their guests and want to be able to use devices in
Dom0 again after they had been in use by a guest wouldn't need to
"manually" move such devices back from DomIO to Dom0).

This is XSA-306.

Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Wei Liu <wl@xen.org>
master commit: ba2ab00bbb8c74e311a252d816d68dee47c779a0
master date: 2019-11-26 14:15:01 +0100

x86/mm: Adjust linear uses / entries when a page loses validation

"Linear pagetables" is a technique which involves either pointing a
pagetable at itself, or at another pagetable of the same or higher level.
Xen has limited support for linear pagetables: A page may either point
to itself, or point to another page of the same level (i.e., L2 to L2,
L3 to L3, and so on).

XSA-240 introduced an additional restriction that limited the "depth"
of such chains by allowing pages to either *point to* other pages of
the same level, or *be pointed to* by other pages of the same level,
but not both. To implement this, we keep track of the number of
outstanding times a page points to or is pointed to by another page
table, to prevent both from happening at the same time.

Additionally, XSA-299 introduced a mode whereby if a page was known to
have been only partially validated, _put_page_type() would be called
with PTF_partial_set, indicating that if the page had been
de-validated by someone else, the type count should be left alone.

Unfortunately, this change did not account for the required accounting
for linear page table uses and entries; in the case that a previously
partially-devalidated pagetable was fully-devalidated by someone else,
the linear_pt_counts are not updated.

This could happen in one of two places:

1. In the case a partially-devalidated page was re-validated by
someone else

2. During domain tear-down, when pages are force-invalidated while
leaving the type count intact.

The second could be ignored, since at that point the pages can no
longer be abused; but the first requires handling. Note however that
this would not be a security issue: having the counts be too high is
overly strict (i.e., will prevent a page from being used in a way
which is perfectly safe), but shouldn't cause any other issues.

Fix this by adjusting the linear counts when a page loses validation,
regardless of whether the de-validation completed or was only partial.

Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 77beba7c921a286c31a2a76f26500047f353614a
master date: 2019-11-25 10:58:27 +0000

x86/vvmx: Fix livelock with XSA-304 fix

It turns out that the XSA-304 / CVE-2018-12207 fix of disabling executable
superpages doesn't work well with the nested p2m code.

Nested virt is experimental and not security supported, but is useful for
development purposes. In order to not regress the status quo, disable the
XSA-304 workaround until the nested p2m code can be improved.

Introduce a per-domain exec_sp control and set it based on the current
opt_ept_exec_sp setting. Take the opportunity to omit a PVH hardware domain
from the performance hit, because it is already permitted to DoS the system in
such ways as issuing a reboot.

When nested virt is enabled on a domain, force it to use executable
superpages and rebuild the p2m.

Having the setting per-domain involves rearranging the internals of
parse_ept_param_runtime() but it still retains the same overall semantics -
for each applicable domain whose setting needs to change, rebuild the p2m.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>
master commit: 183f354e1430087879de071f0c7122e42703916e
master date: 2019-11-23 14:06:24 +0000

x86/livepatch: Prevent patching with active waitqueues

The safety of livepatching depends on every stack having been unwound, but
there is one corner case where this is not true. The Sharing/Paging/Monitor
infrastructure may use waitqueues, which copy the stack frame sideways and
longjmp() to a different vcpu.

This case is rare, and can be worked around by pausing the offending
domain(s), waiting for their rings to drain, then performing a livepatch.

In the case that there is an active waitqueue, fail the livepatch attempt with
-EBUSY, which is preferable to the fireworks which occur from trying to unwind
the old stack frame at a later point.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Ross Lagerwall <ross.lagerwall@citrix.com>
master commit: ca4cd3668237d50a0b33b48e7de7f93d9475120d
master date: 2019-11-22 17:05:43 +0000

x86/vlapic: allow setting APIC_SPIV_FOCUS_DISABLED in x2APIC mode

Current code unconditionally prevents setting APIC_SPIV_FOCUS_DISABLED
regardless of the processor model, which is not correct according to
the specification.

This issue was discovered while trying to boot a pvshim with x2APIC
enabled.

Always allow setting APIC_SPIV_FOCUS_DISABLED: the local APIC
provided to guests is emulated by Xen, and as such doesn't depend on
the features found on the hardware processor. Note for example that
Xen offers x2APIC support to guests even when the underlying hardware
doesn't have such feature.
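
The resulting check can be sketched as a mask test.  The bit positions
below follow the usual local APIC register layout (vector in bits 7:0,
software enable in bit 8, focus processor checking disable in bit 9); the
function name is hypothetical, and any other bits a real implementation
may accept are left out of this model.

```c
#include <stdbool.h>
#include <stdint.h>

#define APIC_VECTOR_MASK         UINT32_C(0x000000ff)
#define APIC_SPIV_APIC_ENABLED   (UINT32_C(1) << 8)
#define APIC_SPIV_FOCUS_DISABLED (UINT32_C(1) << 9)

/* Accept a SPIV write iff it only touches bits the model knows about. */
static bool spiv_write_allowed(uint32_t val)
{
    const uint32_t writable = APIC_VECTOR_MASK |
                              APIC_SPIV_APIC_ENABLED |
                              APIC_SPIV_FOCUS_DISABLED;

    return !(val & ~writable);
}
```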

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: d7cd999faa1edf745a7597db811956cb882a5436
master date: 2019-11-22 17:52:59 +0100

xen: Add missing va_end() in hypercall_create_continuation()

The documentation requires va_start() to always be matched with a
corresponding va_end(). However, this is not the case in the path used
for bad format.
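
The usual way to keep va_start() and va_end() paired is to route every
exit, including the error path, through one label.  The function below is
a self-contained illustration with a hypothetical name and format syntax;
it is not hypercall_create_continuation().

```c
#include <stdarg.h>

/* Sums the variadic arguments described by fmt: 'i' = int, 'l' = long. */
static int sum_by_format(long *total, const char *fmt, ...)
{
    va_list args;
    int rc = 0;

    *total = 0;
    va_start(args, fmt);

    for ( const char *p = fmt; *p; p++ )
    {
        switch ( *p )
        {
        case 'i':
            *total += va_arg(args, int);
            break;

        case 'l':
            *total += va_arg(args, long);
            break;

        default:
            /* Bad format: fail, but still leave via the common exit. */
            rc = -1;
            goto out;
        }
    }

 out:
    va_end(args);   /* Reached on success and on the bad format path. */
    return rc;
}
```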

This was introduced by XSA-296.

Coverity-ID: 1488727
Fixes: 0bf9f8d3e3 ("xen/hypercall: Don't use BUG() for parameter checking in hypercall_create_continuation()")
Signed-off-by: Julien Grall <julien@xen.org>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: df7a19338a892b5cf585fd2bee8584cb15e0cace
master date: 2019-11-21 15:50:01 +0000

x86: fix race to build arch/x86/efi/relocs-dummy.o

With $(TARGET).efi depending on efi/relocs-dummy.o, arch/x86/Makefile
will attempt to build that object. This may result in a dependency file
being generated that has relocs-dummy.o depending on efi/relocs-dummy.S.

Then, when arch/x86/efi/Makefile tries to build relocs-dummy.o, well
efi/relocs-dummy.S doesn't exist.

Have only one makefile responsible for building relocs-dummy.o.

Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

x86/Makefile: remove $(guard) use from $(TARGET).efi target

Following the patch 65d104984c04 ("x86: fix race to build
arch/x86/efi/relocs-dummy.o"), the error message
"nm: 'efi/relocs-dummy.o': No such file"
started to appear on systems which can't build the .efi target. This is
because relocs-dummy.o isn't built anymore.
The error is printed by the evaluation of VIRT_BASE and ALT_BASE which
aren't used anyway.

But, we don't need that file as we don't want to build `$(TARGET).efi'
anyway. On such systems, $(guard) evaluates to the shell builtin ':',
which prevents any of the shell commands in `$(TARGET).efi' from being
executed.

Even if $(guard) is evaluated upon use, it depends on $(XEN_BUILD_PE)
which is evaluated at the assignment. So, we can replace $(guard) in
$(TARGET).efi by having two different rules depending on
$(XEN_BUILD_PE) instead.

The change with this patch is that none of the dependencies of
$(TARGET).efi will be built if the linker doesn't support PE
and VIRT_BASE and ALT_BASE don't get evaluated anymore, so nm will not
complain about the missing relocs-dummy.o file anymore.

Since prelink-efi.o isn't built on systems that can't build
$(TARGET).efi anymore, we can remove the $(guard) variable everywhere.

Reported-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 65d104984c04e69234f77bd3b8f8c0ef85b3f7fa
master date: 2019-11-15 14:18:16 +0100
master commit: 7059afb202ff0d82a6fa94f7ef84e4bb3139914e
master date: 2019-11-20 17:12:12 +0100

x86emul: 16-bit XBEGIN does not truncate rIP

SDM rev 071 points out this fact explicitly.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: a72c508656c0a0fa573890b290064e6035971f86
master date: 2019-11-15 14:15:31 +0100

AMD/IOMMU: don't needlessly trigger errors/crashes when unmapping a page

Unmapping a page which has never been mapped should be a no-op (note how
it already is in case there was no root page table allocated). There's
in particular no need to grow the number of page table levels in use,
and there's also no need to allocate intermediate page tables except
when needing to split a large page.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: ad591454f069647c36a7daaa9ec23384c0263f0b
master date: 2019-11-12 11:08:34 +0100

x86/ioapic: fix clear_IO_APIC_pin write of raw entries

clear_IO_APIC_pin can be called after the iommu has been enabled, and
using raw reads and writes to modify IO-APIC entries that have been
setup to use interrupt remapping can lead to issues as some of the
fields have different meaning when the IO-APIC entry is setup to point
to an interrupt remapping table entry.

The following ASSERT in AMD IOMMU code triggers afterwards as a result
of the raw changes to IO-APIC entries performed by clear_IO_APIC_pin.

(XEN) [   10.082154] ENABLING IO-APIC IRQs
(XEN) [   10.087789]  -> Using new ACK method
(XEN) [   10.093738] Assertion 'get_rte_index(rte) == offset' failed at iommu_intr.c:328

Fix this by making sure that modifications to entries are performed in
non raw mode when fields are affected which may either have changed
meaning with interrupt remapping, or which may need mirroring into
IRTEs.

Reported-by: Sergey Dyasli <sergey.dyasli@citrix.com>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: dedcb1087dfeae0bbd9eea465a57f25b13e40585
master date: 2019-11-12 11:07:40 +0100

x86/shim: copy back the result of EVTCHNOP_status

The event channel data was not copied back to guest memory, fix this
by doing the copy.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Wei Liu <wl@xen.org>
master commit: 0f45bbbc404e2d1257476f9caa6644c209ec2c90
master date: 2019-11-01 10:48:04 +0000

x86/vtx: Fixes to Haswell/Broadwell LBR TSX errata

Cross reference and list all errata, now that they are published.

These errata are specific to Haswell/Broadwell. They should have model and
vendor checks, as Intel isn't the only vendor to implement VT-x.

All affected models use the same MSR indices, so these can be hard coded
rather than looking up and storing constant values.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
master commit: f51d4a19427674491eaecef85c551613450188c5
master date: 2019-10-29 19:27:40 +0000

x86/vtx: Corrections to BDF93 errata workaround

At the time of fixing c/s 20f1976b44, no obvious errata had been published,
and BDF14 looked like the most obvious candidate.  Subsequently, BDF93 has
been published and it is obviously this.

The erratum states that LER_TO_LIP is the only affected MSR.  The provisional
fix in Xen adjusted LER_FROM_LIP, but this is not correct.  The FROM MSRs are
intended to have TSX metadata, and for steppings with TSX enabled, it will
corrupt the value the guest sees, while for parts with TSX disabled, it is
redundant with FIXUP_TSX.  Drop the LER_FROM_LIP adjustment.

Replace BDF14 references with BDF93, drop the redundant 'bdw_erratum_' prefix,
and use an Intel vendor check, as other vendors implement VT-x.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
master commit: 1a3b393129c1dcfec418f9b0ee92d126c2ae8141
master date: 2019-10-29 19:27:40 +0000

x86: fix off-by-one in is_xen_fixed_mfn()

__2M_rwdata_end marks the first byte after the Xen image, not its last
byte. Subtract 1 to obtain the upper bound to compare against. (Note
that instead switching from <= to < is less desirable, as in principle
__pa() might return rubbish for addresses outside of the Xen image.)
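
The corrected bound can be sketched with plain physical addresses.  The
function and parameter names are hypothetical, 4k pages are assumed, and
the real code operates on linker symbols via __pa() rather than on
integers passed in by the caller.

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT_SKETCH 12   /* 4k pages. */

/*
 * start_pa is the first byte of the image, end_pa the first byte AFTER
 * it.  Subtracting 1 turns end_pa into the last byte of the image before
 * it is converted to a frame number.
 */
static bool is_fixed_mfn_sketch(uint64_t mfn, uint64_t start_pa,
                                uint64_t end_pa)
{
    return mfn >= (start_pa >> PAGE_SHIFT_SKETCH) &&
           mfn <= ((end_pa - 1) >> PAGE_SHIFT_SKETCH);
}
```

For an image spanning 0x200000 up to (but excluding) 0x400000, frame 0x3ff
is inside and frame 0x400 is not; without the subtraction, frame 0x400
would wrongly be reported as part of the image.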

Since the & needs to be dropped from the line in question, also drop it
from the adjacent one.

Reported-by: Julien Grall <julien.grall@arm.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 9633929824204ca7a6d60d083466de79993d60f1
master date: 2019-10-25 10:38:58 +0200

x86/tsc: update vcpu time info on guest TSC adjustments

If a HVM/PVH guest writes to MSR_IA32_TSC{_ADJUST} and thus changes
the value of the time stamp counter the vcpu time info must also be
updated, or the time calculated by the guest using the Xen PV clock
interface will be skewed.
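
To see why, recall how a guest derives time from the per-vcpu time info:
it scales the TSC delta since the last snapshot and adds it to the
snapshot's system time.  The sketch below models that calculation with
hypothetical names; the struct is a simplified stand-in for the real
shared structure, and tsc_shift is assumed to stay within -63..63.

```c
#include <stdint.h>

/* Simplified, hypothetical stand-in for the per-vcpu time info. */
struct time_info_sketch {
    uint64_t tsc_timestamp;       /* TSC value at the snapshot. */
    uint64_t system_time;         /* Nanoseconds at the snapshot. */
    uint32_t tsc_to_system_mul;   /* 32.32 fixed point ns per tick. */
    int8_t   tsc_shift;           /* Pre-scaling shift of the delta. */
};

/* (delta << shift) * mul >> 32, without a 128-bit intermediate. */
static uint64_t scale_delta(uint64_t delta, uint32_t mul, int8_t shift)
{
    if ( shift < 0 )
        delta >>= -shift;
    else
        delta <<= shift;

    return (delta >> 32) * mul + (((delta & 0xffffffffu) * mul) >> 32);
}

/* Time as computed by the guest from a TSC reading. */
static uint64_t guest_time_ns(const struct time_info_sketch *info,
                              uint64_t tsc)
{
    return info->system_time +
           scale_delta(tsc - info->tsc_timestamp,
                       info->tsc_to_system_mul, info->tsc_shift);
}
```

If the guest rewrites its TSC while tsc_timestamp still holds the old
snapshot, the delta no longer corresponds to elapsed time, so the result
is wrong until the snapshot is refreshed.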

Update the vcpu time info when the guest writes to either MSR_IA32_TSC
or MSR_IA32_TSC_ADJUST. This fixes lockups seen when running the
pv-shim on AMD hardware, since the shim will aggressively try to keep
TSCs in sync by periodically writing to MSR_IA32_TSC if the TSC is not
reliable.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Wei Liu <wl@xen.org>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 7eee9c16d6405a1a1f2e8c6472923db842c90cfb
master date: 2019-10-23 17:01:56 +0100

x86/vvmx: Fix the use of RDTSCP when it is intercepted at L0

Linux has started using RDTSCP as of v5.1.  This has highlighted a bug in Xen,
where virtual vmexit simply gives up.

  (XEN) d1v1 Unhandled nested vmexit: reason 51
  (XEN) domain_crash called from vvmx.c:2671
  (XEN) Domain 1 (vcpu#1) crashed on cpu#2:

Handle RDTSCP in the virtual vmexit handler in the same way as RDTSC
intercepts.

Reported-by: Sarah Newman <srn@prgmr.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Tested-by: Chris Brannon <cmb@prgmr.com>
Reviewed-by: Wei Liu <wl@xen.org>
master commit: 9257c218e56e9902b78662e5852d69329b9cc204
master date: 2019-10-23 16:43:48 +0100

x86/spec-ctrl: Mitigate the TSX Asynchronous Abort sidechannel

See patch documentation and comments.

This is part of XSA-305 / CVE-2019-11135

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

x86/tsx: Introduce tsx= to use MSR_TSX_CTRL when available

To protect against the TSX Async Abort speculative vulnerability, Intel have
released new microcode for affected parts which introduces the MSR_TSX_CTRL
control, which allows TSX to be turned off.  This will be architectural on
future parts.

Introduce tsx= to provide a global on/off for TSX, including its enumeration
via CPUID.  Provide stub virtualisation of this MSR, as it is not exposed to
guests at the moment.
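
As a rough sketch of what a global on/off could translate to, the helper
below computes a new MSR_TSX_CTRL value from a boolean. The MSR index and
the two bit positions are taken from Intel's documentation; tsx_ctrl_value()
is a hypothetical helper and not the function the patch adds.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MSR_TSX_CTRL          0x00000122
#define TSX_CTRL_RTM_DISABLE  (1u << 0)  /* RTM transactions always abort */
#define TSX_CTRL_CPUID_CLEAR  (1u << 1)  /* hide RTM/HLE from CPUID */

/* Compute the value to write, preserving all other bits of 'cur'. */
uint64_t tsx_ctrl_value(uint64_t cur, bool tsx_enabled)
{
    const uint64_t bits = TSX_CTRL_RTM_DISABLE | TSX_CTRL_CPUID_CLEAR;

    cur &= ~bits;
    if (!tsx_enabled)
        cur |= bits;

    return cur;
}
```

Both bits are toggled together here because the option is described as
covering TSX itself and its CPUID enumeration.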

VMs may have booted before microcode is loaded, or before hosts have rebooted,
and they still want to migrate freely.  A VM which booted seeing TSX can
migrate safely to hosts with TSX disabled - TSX will start unconditionally
aborting, but still behave in a manner compatible with the ABI.

The guest-visible behaviour is equivalent to late loading the microcode and
setting the RTM_DISABLE bit in the course of live patching.

This is part of XSA-305 / CVE-2019-11135

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

x86/vtx: Allow runtime modification of the exec-sp setting

See patch for details.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>

x86/vtx: Disable executable EPT superpages to work around CVE-2018-12207

CVE-2018-12207 covers a set of errata on various Intel processors, whereby a
machine check exception can be generated in a corner case when an executable
mapping changes size or cacheability without TLB invalidation.  HVM guest
kernels can trigger this to DoS the host.

To mitigate, in affected hardware, all EPT superpages are marked NX.  When an
instruction fetch violation is observed against the superpage, the superpage
is shattered to 4k and has execute permissions restored.  This prevents the
guest kernel from being able to create the necessary preconditions in the iTLB
to exploit the vulnerability.
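
The mitigation can be modelled with a toy 2M range, as in the sketch below.
All names (ept_range, map_2m, handle_fetch_violation) are hypothetical and
the permission bits are simplified; this models the idea only, not Xen's
EPT code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define EPT_R 1u
#define EPT_W 2u
#define EPT_X 4u
#define NR_4K_PER_2M 512

struct ept_range {
    bool    superpage;           /* one 2M entry, or 512 4k entries */
    uint8_t wanted;              /* permissions the mapping should have */
    uint8_t sp_perms;            /* permissions installed on the 2M entry */
    uint8_t pte[NR_4K_PER_2M];   /* permissions of the 4k entries */
};

/* Install a 2M mapping; without exec_sp the superpage is marked NX. */
void map_2m(struct ept_range *r, uint8_t wanted, bool exec_sp)
{
    r->superpage = true;
    r->wanted = wanted;
    r->sp_perms = exec_sp ? wanted : (uint8_t)(wanted & ~EPT_X);
}

/* Returns true if the fault was fixed up and the guest can retry. */
bool handle_fetch_violation(struct ept_range *r)
{
    unsigned int i;

    /* Only NX superpages that should have been executable are shattered. */
    if (!r->superpage || (r->sp_perms & EPT_X) || !(r->wanted & EPT_X))
        return false;

    for (i = 0; i < NR_4K_PER_2M; i++)
        r->pte[i] = r->wanted;   /* 4k entries get execute restored */
    r->superpage = false;

    return true;
}
```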

This does come with a workload-dependent performance overhead, caused by
increased TLB pressure.  Performance can be restored, if guest kernels are
trusted not to mount an attack, by specifying ept=exec-sp on the command line.

This is part of XSA-304 / CVE-2018-12207

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

x86/vtd: Hide superpage support for SandyBridge IOMMUs

Something causes SandyBridge IOMMUs to choke when sharing EPT pagetables, and
an EPT superpage gets shattered. The root cause is still under investigation,
but the end result is unusable in combination with CVE-2018-12207 protections.

This is part of XSA-304 / CVE-2018-12207

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

xen/arm64: Don't blindly unmask interrupts on trap without a change of level

Some of the traps without a change of the level (i.e. hypervisor ->
hypervisor) will unmask interrupts regardless of their state in the
interrupted context.

One of the consequences is that IRQs will be unmasked when receiving a
synchronous exception (used by WARN*()). This could result in unexpected
behavior such as deadlock (if a lock was shared with interrupts).

In a nutshell, interrupts should only be unmasked when it is safe to
do so. Xen only unmasks IRQ and Abort interrupts, so the logic can stay
simple:
    - hyp_error: All the interrupts are now kept masked. SError should
      be pretty rare and if one ever happens then we most likely want to
      avoid any other interrupts being generated. The potential main
      "caller" is during virtual SError synchronization on the exit
      path from the guest (see check_pending_vserror).

    - hyp_sync: The interrupts state is inherited from the interrupted
      context.

    - hyp_irq: All the interrupts but IRQ state are inherited from the
      interrupted context. IRQ is kept masked.
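
The per-vector policy above can be sketched as a pure function from the
interrupted context's saved state to the mask bits the handler runs with.
The A and I bit positions match the AArch64 SPSR layout; the enum and
handler_mask() are hypothetical names, and the real change is made in the
assembly entry code rather than in C.

```c
#include <assert.h>
#include <stdint.h>

#define PSR_ABT_MASK (1u << 8)   /* SError (Abort) mask bit, "A" */
#define PSR_IRQ_MASK (1u << 7)   /* IRQ mask bit, "I" */

enum hyp_vector { HYP_ERROR, HYP_SYNC, HYP_IRQ };

/* Returns the A/I mask bits in force while the handler runs. */
uint32_t handler_mask(enum hyp_vector v, uint32_t spsr)
{
    const uint32_t all = PSR_ABT_MASK | PSR_IRQ_MASK;

    switch (v) {
    case HYP_SYNC:    /* inherit everything from the interrupted context */
        return spsr & all;
    case HYP_IRQ:     /* inherit Abort state, keep IRQ masked */
        return (spsr & PSR_ABT_MASK) | PSR_IRQ_MASK;
    case HYP_ERROR:   /* keep everything masked */
    default:
        return all;
    }
}
```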

This is part of XSA-303.

Reported-by: Julien Grall <Julien.Grall@arm.com>
Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Reviewed-by: Andre Przywara <andre.przywara@arm.com>
master commit: 3ed885a8874003f6011460f4f46d1d130dd6b2db
master date: 2019-10-31 16:22:55 +0100

xen/arm32: Don't blindly unmask interrupts on trap without a change of level

Exception vectors will unmask interrupts regardless of their state in
the interrupted context.

One of the consequences is that IRQs will be unmasked when receiving an
undefined instruction exception (used by WARN*) from the hypervisor.
This could result in unexpected behavior such as deadlock (if a lock was
shared with interrupts).

In a nutshell, interrupts should only be unmasked when it is safe to do so.
Xen only unmasks IRQ and Abort interrupts, so the logic can stay simple.

As exception vectors may be shared between guest and hypervisor, we now
need to have a different policy for the interrupts.

On exception from hypervisor, each vector will select the list of
interrupts to inherit from the interrupted context. Any interrupts not
listed will be kept masked.

On exception from the guest, the Abort and IRQ will be unmasked
depending on the exact vector.

The interrupts will be kept unmasked when the vector cannot be used by
either guest or hypervisor.

Note that each vector is no longer preceded by ALIGN. This is fine
because the alignment is already bigger than what we need.

This is part of XSA-303.

Reported-by: Julien Grall <Julien.Grall@arm.com>
Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Reviewed-by: Andre Przywara <andre.przywara@arm.com>
master commit: 61b683571f0abd12395b1454cd055f2ad9bb3a37
master date: 2019-10-31 16:22:34 +0100

xen/arm32: entry: Fold the macro SAVE_ALL in the macro vector

Follow-up rework will require the macro vector to distinguish between
a trap from a guest vs while in the hypervisor.

The macro SAVE_ALL already has code to distinguish between the two and
it is only called by the vector macro. So fold the former into the
latter. This will help to avoid duplicating the check.

This is part of XSA-303.

Reported-by: Julien Grall <Julien.Grall@arm.com>
Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Reviewed-by: Andre Przywara <andre.przywara@arm.com>
master commit: a7b81b021ead23bffb5affcac05edfc0a84d129d
master date: 2019-10-31 16:21:33 +0100

xen/arm32: entry: Split __DEFINE_ENTRY_TRAP in two

The preprocessing macro __DEFINE_ENTRY_TRAP is used to generate trap
entry functions. While the macro is fairly small today, follow-up patches
will increase the size significantly.

In general, assembly macros are more readable as they allow you to name
parameters and avoid '\'. So the actual implementation of the trap is
now switched to an assembly macro.

This is part of XSA-303.

Reported-by: Julien Grall <Julien.Grall@arm.com>
Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Reviewed-by: Andre Przywara <andre.przywara@arm.com>
master commit: 6eeef7ecaeec002bb1da7e20c9cfaec5549bd940
master date: 2019-10-31 16:20:58 +0100

passthrough: quarantine PCI devices

When a PCI device is assigned to an untrusted domain, it is possible for
that domain to program the device to DMA to an arbitrary address. The
IOMMU is used to protect the host from malicious DMA by making sure that
the device addresses can only target memory assigned to the guest. However,
when the guest domain is torn down the device is assigned back to dom0,
thus allowing any in-flight DMA to potentially target critical host data.

This patch introduces a 'quarantine' for PCI devices using dom_io. When
the toolstack makes a device assignable (by binding it to pciback), it
will now also assign it to DOMID_IO and the device will only be assigned
back to dom0 when the device is made unassignable again. Whilst the device
is assignable it will only ever transfer between dom_io and guest domains.
dom_io is actually only used as a sentinel domain for quarantining purposes;
it is not configured with any IOMMU mappings. Assignment to dom_io simply
means that the device's initiator (requestor) identifier is not present in
the IOMMU's device table and thus any DMA transactions issued will be
terminated with a fault condition.
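
The ownership rules described above amount to a small state machine. The
sketch below models it with hypothetical names (pci_dev_model,
make_assignable() and friends); it models the policy only and has nothing
to do with the actual IOMMU programming.

```c
#include <assert.h>
#include <stdbool.h>

enum owner { OWNER_DOM0, OWNER_DOM_IO, OWNER_GUEST };

struct pci_dev_model {
    enum owner owner;
    bool assignable;
};

/* Toolstack binds the device to pciback: park it in quarantine. */
void make_assignable(struct pci_dev_model *d)
{
    d->assignable = true;
    d->owner = OWNER_DOM_IO;
}

/* Only quarantined, assignable devices may go to a guest. */
int assign_to_guest(struct pci_dev_model *d)
{
    if (!d->assignable || d->owner != OWNER_DOM_IO)
        return -1;
    d->owner = OWNER_GUEST;
    return 0;
}

/* Guest teardown returns the device to quarantine, never to dom0. */
void guest_teardown(struct pci_dev_model *d)
{
    if (d->owner == OWNER_GUEST)
        d->owner = OWNER_DOM_IO;
}

/* Only a quarantined device can be handed back to dom0. */
int make_unassignable(struct pci_dev_model *d)
{
    if (d->owner != OWNER_DOM_IO)
        return -1;
    d->assignable = false;
    d->owner = OWNER_DOM0;
    return 0;
}

/* dom_io has no IOMMU mappings, so DMA from a quarantined device faults. */
bool dma_permitted(const struct pci_dev_model *d)
{
    return d->owner != OWNER_DOM_IO;
}
```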

In addition, a fix to assignment handling is made for VT-d.  Failure
during the assignment step should not lead to a device still being
associated with its prior owner. Hand the device to DomIO temporarily,
until the assignment step has completed successfully.  Also remove the PI
hooks from the source domain earlier.

Failure of the recovery reassign_device_ownership() must not go silent:
there may, e.g., still be left-over RMRR mappings in the domain to which
assignment has failed, and hence we can't allow that domain to continue
executing.

NOTE: This patch also includes one printk() cleanup; the
      "XEN_DOMCTL_assign_device: " tag is dropped in iommu_do_pci_domctl(),
      since similar printk()-s elsewhere also don't log such a tag.

This is XSA-302.

Signed-off-by: Paul Durrant <paul.durrant@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
master commit: 319f9a0ba94c7db505cd5dd9cb0b037ab1aa8e12
master date: 2019-10-31 16:20:05 +0100

xen/arm: p2m: Don't check the return of p2m_get_root_pointer() with BUG_ON()

It turns out that the BUG_ON() was actually reachable with well-crafted
hypercalls. The BUG_ON() is here to catch logical errors, so
crashing Xen is a bit over the top.

While all the holes should now be fixed, it would be better to downgrade
the BUG_ON() to something less fatal to prevent any more DoS.

The BUG_ON() in p2m_get_entry() is now replaced by ASSERT_UNREACHABLE()
to catch mistakes in debug builds and return INVALID_MFN in production
builds. The interface also requires page_order to be set to give an idea of
the size of the "hole". So 'level' is now set so we report a hole the size
of an entry of the root page-table. This stays in line with what happens
when the GFN is higher than p2m->max_mapped_gfn.

The BUG_ON() in p2m_resolve_translation_fault() is now replaced by
ASSERT_UNREACHABLE() to catch mistakes in debug builds and just report a
fault in production builds.

This is part of XSA-301.

Reported-by: Julien Grall <Julien.Grall@arm.com>
Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
master commit: 31b4f4ab6634f85163656b470dffc6d974917853
master date: 2019-10-31 16:19:14 +0100

xen/arm: p2m: Avoid off-by-one check on p2m->max_mapped_gfn

The code base uses the field p2m->max_mapped_gfn inconsistently.
Some of the users expect that p2m->max_mapped_gfn contains the highest
mapped GFN while others expect highest + 1.

p2m->max_mapped_gfn is set as highest + 1; because of that the sanity
check on the GFN in p2m_resolve_translation_fault() and
p2m_get_entry() can be bypassed when GFN == p2m->max_mapped_gfn.

p2m_get_root_pointer(p2m->max_mapped_gfn) may return NULL if it is
outside of the supported address range and therefore the BUG_ON() could be
hit.

The current value held in p2m->max_mapped_gfn is inconsistent with the
expectation of the common code (see domain_get_maximum_gpfn()) and also
the documentation of the field.

Rather than changing the check in p2m_resolve_translation_fault() and
p2m_get_entry(), p2m->max_mapped_gfn now contains the highest
mapped GFN and the callers assuming "highest + 1" are now adjusted.

Take the opportunity to use 1UL rather than 1 as page_order could
theoretically be big enough to overflow a 32-bit integer.

Lastly, update the documentation of the field max_mapped_gfn to reflect how
it is computed.
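
A minimal sketch of the "highest mapped GFN, inclusive" convention, assuming
a stripped-down p2m structure. p2m_model, note_mapping() and gfn_in_range()
are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdbool.h>

struct p2m_model {
    unsigned long max_mapped_gfn;   /* highest mapped GFN, inclusive */
};

/* Record a mapping of 2^page_order frames starting at gfn. */
void note_mapping(struct p2m_model *p2m, unsigned long gfn,
                  unsigned int page_order)
{
    /* 1UL, not 1: the shift must be done in unsigned long width. */
    unsigned long last = gfn + (1UL << page_order) - 1;

    if (last > p2m->max_mapped_gfn)
        p2m->max_mapped_gfn = last;
}

/* With an inclusive maximum, the sanity check is a plain <=. */
bool gfn_in_range(const struct p2m_model *p2m, unsigned long gfn)
{
    return gfn <= p2m->max_mapped_gfn;
}
```

With "highest + 1" stored instead, the same <= check would wrongly accept
the first unmapped GFN, which is the bypass described above.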

This is part of XSA-301.

Reported-by: Julien Grall <Julien.Grall@arm.com>
Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
master commit: 6e8e163b46d0823526f1afbbe6f66c668fc811d1
master date: 2019-10-31 16:18:38 +0100

xen/arm: p2m: Avoid aliasing guest physical frame

The P2M helpers implementation is quite lax and will end up ignoring
the unused top bits of a guest physical frame.

This effectively means that p2m_set_entry() will create a mapping for a
different frame (it is always equal to gfn & (mask unused bits)). Yet
p2m->max_mapped_gfn will be updated using the original frame.

At the moment, p2m_get_entry() and p2m_resolve_translation_fault()
assume that p2m_get_root_pointer() will always return a non-NULL pointer
when the GFN is smaller than p2m->max_mapped_gfn.

Unfortunately, because of the aliasing described above, it would be
possible to set p2m->max_mapped_gfn high enough that it covers a frame that
would lead p2m_get_root_pointer() to return NULL.

As we don't sanity check the guest physical frame provided by a guest, a
malicious guest could craft a series of hypercalls that will hit the
BUG_ON() and therefore DoS Xen.

To prevent aliasing, the function p2m_get_root_pointer() is now reworked
to return NULL if any of the unused top bits are not zero. The caller
can then decide what the appropriate action is. Since the two paths
(i.e. P2M_ROOT_PAGES == 1 and P2M_ROOT_PAGES != 1) are now very
similar, take the opportunity to consolidate them, making the code a
bit simpler.
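
The difference between the lax and the strict lookup can be sketched as
follows. LPAE_SHIFT is 9 for 4K pages with 512-entry tables; the function
names, the parameters, and the assumption that the number of root pages is
a power of two are all illustrative rather than taken from the patch.

```c
#include <assert.h>

#define LPAE_SHIFT 9   /* log2 of entries per page-table page */

/*
 * Lax lookup, as described above: bits above the root tables are silently
 * dropped, so distinct GFNs alias the same root page.
 * Assumes nr_root_pages is a power of two.
 */
unsigned long root_index_lax(unsigned long gfn, unsigned int root_level_order,
                             unsigned long nr_root_pages)
{
    return (gfn >> (root_level_order + LPAE_SHIFT)) & (nr_root_pages - 1);
}

/*
 * Strict lookup: any bit set above the range covered by the root pages
 * makes the lookup fail. Returns -1 where the real helper returns NULL.
 */
long root_index_strict(unsigned long gfn, unsigned int root_level_order,
                       unsigned long nr_root_pages)
{
    unsigned long root_table = gfn >> (root_level_order + LPAE_SHIFT);

    if (root_table >= nr_root_pages)
        return -1;

    return (long)root_table;
}
```

Keeping root_table as unsigned long matters for the same reason as in the
patch: the shifted GFN may not fit in an unsigned int.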

With this change, p2m_get_entry() will not try to insert a mapping as
the root pointer is invalid.

Note that root_table is now switched to unsigned long as unsigned int is
not enough to hold part of a GFN.

This is part of XSA-301.

Reported-by: Julien Grall <Julien.Grall@arm.com>
Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
master commit: 88aaf40eeff771c546ad3bbb02000171648a89f7
master date: 2019-10-31 16:17:33 +0100

x86/mm: Don't drop a type ref unless you held a ref to begin with

Validation and de-validation of pagetable trees may take arbitrarily
large amounts of time, and so must be preemptible.  This is indicated
by setting the PGT_partial bit in the type_info, and setting
nr_validated_entries and partial_flags appropriately.  Specifically,
if the entry at [nr_validated_entries] is partially validated,
partial_flags should have the PGT_partial_set bit set, and the entry
should hold a general reference count.  During de-validation,
put_page_type() is called on partially validated entries.

Unfortunately, there are a number of issues with the current algorithm.

First, doing a "normal" put_page_type() is not safe when no type ref
is held: there is nothing to stop another vcpu from coming along and
picking up validation again: at which point the put_page_type may drop
the only page ref on an in-use page.  Some examples are listed in the
appendix.

The core issue is that put_page_type() is being called both to clean
up PGT_partial, and to drop a type count; and has no way of knowing
which is which; and so if in between, PGT_partial is cleared,
put_page_type() will drop the type ref erroneously.

What is needed is to distinguish between two states:
- Dropping a type ref which is held
- Cleaning up a page which has been partially de/validated

Fix this by telling put_page_type() which of the two activities you
intend.

When cleaning up a partial de/validation, take no action unless you
find a page partially validated.

If put_page_type() is called without PTF_partial_set, and finds the
page in a PGT_partial state anyway, then there's certainly been a
misaccounting somewhere, and carrying on would almost certainly cause
a security issue, so crash the host instead.

In put_page_from_lNe, pass partial_flags on to _put_page_type().

old_guest_table may be set either with a fully validated page (when
using the "deferred put" pattern), or with a partially validated page
(when a normal "de-validation" is interrupted, or when a validation
fails part-way through due to invalid entries).  Add a flag,
old_guest_table_partial, to indicate which of these it is, and use
that to pass the appropriate flag to _put_page_type().
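
The distinction between the two activities can be modelled as below. This
is a deliberately simplified sketch: page_model, put_page_type_model() and
the result codes are hypothetical, the flag value is illustrative, and the
real function crashes the host where the model returns PUT_WOULD_CRASH.

```c
#include <assert.h>
#include <stdbool.h>

#define PTF_partial_set (1u << 0)   /* illustrative bit value */

enum put_result { PUT_DONE, PUT_NOTHING_TO_DO, PUT_WOULD_CRASH };

struct page_model {
    unsigned int type_count;
    bool partial;                   /* stands in for PGT_partial */
};

enum put_result put_page_type_model(struct page_model *pg, unsigned int flags)
{
    if (flags & PTF_partial_set) {
        /* Caller is cleaning up a partial de/validation. */
        if (!pg->partial)
            return PUT_NOTHING_TO_DO;   /* someone else finished the job */
        pg->partial = false;
        pg->type_count = 0;
        return PUT_DONE;
    }

    /* Caller claims to hold a real type reference. */
    if (pg->partial || pg->type_count == 0)
        return PUT_WOULD_CRASH;         /* misaccounting somewhere */
    pg->type_count--;

    return PUT_DONE;
}
```

In the model, two vcpus that both recorded the same partially validated
page no longer underflow the count: the second cleanup finds nothing to do.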

While here, delete stray trailing whitespace.

This is part of XSA-299.

Reported-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
-----
Appendix:

Suppose page A, when interpreted as an l3 pagetable, contains all
valid entries; and suppose A[x] points to page B, which when
interpreted as an l2 pagetable, contains all valid entries.

P1: PIN_L3_TABLE
  A -> PGT_l3_table | 1 | valid
  B -> PGT_l2_table | 1 | valid

P1: UNPIN_TABLE
  > Arrange to interrupt after B has been de-validated
  B:
    type_info -> PGT_l2_table | 0
  A:
    type_info -> PGT_l3_table | 1 | partial
    nr_validated_entries -> (less than x)

P2: mod_l4_entry to point to A
  > Arrange for this to be interrupted while B is being validated
  B:
    type_info -> PGT_l2_table | 1 | partial
    (nr_validated_entries &c set as appropriate)
  A:
    type_info -> PGT_l3_table | 1 | partial
    nr_validated_entries -> x
    partial_pte = 1

P3: mod_l3_entry some other unrelated l3 to point to B:
  B:
    type_info -> PGT_l2_table | 1

P1: Restart UNPIN_TABLE

At this point, since A.nr_validated_entries == x and A.partial_pte !=
0, free_l3_table() will call put_page_from_l3e() on pl3e[x], dropping
its type count to 0 while it's still being pointed to by some other l3.

A similar issue arises with old_guest_table.  Consider the following
scenario:

Suppose A is a page which, when interpreted as an l2, has valid entries
until entry x, which is invalid.

V1:  PIN_L2_TABLE(A)
  <Validate until we try to validate [x], get -EINVAL>
  A -> PGT_l2_table | 1 | PGT_partial
  V1 -> old_guest_table = A
  <delayed>

V2: PIN_L2_TABLE(A)
  <Pick up where V1 left off, try to re-validate [x], get -EINVAL>
  A -> PGT_l2_table | 1 | PGT_partial
  V2 -> old_guest_table = A
  <restart>
  put_old_guest_table()
    _put_page_type(A)
      A -> PGT_l2_table | 0

V1: <restart>
  put_old_guest_table()
    _put_page_type(A) # UNDERFLOW

Indeed, it is possible to engineer for old_guest_table for every vcpu
a guest has to point to the same page.
master commit: c40b33d72630dcfa506d6fd856532d6152cb97dc
master date: 2019-10-31 16:16:37 +0100

x86/mm: Fix nested de-validation on error

If an invalid entry is discovered when validating a page-table tree,
the entire tree which has so far been validated must be de-validated.
Since this may take a long time, alloc_l[2-4]_table() set current
vcpu's old_guest_table immediately; put_old_guest_table() will make
sure that put_page_type() will be called to finish off the
de-validation before any other MMU operations can happen on the vcpu.

The invariant for partial pages should be:

* Entries [0, nr_validated_ptes) should be completely validated;
  put_page_type() will de-validate these.

* If [nr_validated_ptes] is partially validated, partial_flags should
  set PTF_partial_set.  put_page_type() will be called on this page to
  finish off devalidation, and the appropriate refcount adjustments
  will be done.

alloc_l[2-3]_table() indicates partial validation to its callers by
setting current->old_guest_table.

Unfortunately, this is mishandled.

Take the case where validating lNe[x] returns an error.

First, alloc_l3_table() doesn't check old_guest_table at all; as a
result, partial_flags is not set when it should be.  nr_validated_ptes
is set to x; and since PTF_partial_set is clear, de-validation resumes at
nr_validated_ptes-1.  This means that the l2 page at pl3e[x] will not
have put_page_type() called on it when de-validating the rest of the
l3: it will be stuck in the PGT_partial state until the domain is
destroyed, or until it is re-used as an l2.  (Any other page type will
fail.)

Worse, alloc_l4_table(), rather than setting PTF_partial_set as it
should, sets nr_validated_ptes to x+1.  When de-validating, since
partial is 0, this will correctly resume calling put_page_type at [x];
but, if the put_page_type() is never called, but instead
get_page_type() is called, validation will pick up at [x+1],
neglecting to validate [x].  If the rest of the validation succeeds,
the l4 will be validated even though [x] is invalid.

Fix this in both cases by setting PTF_partial_set if old_guest_table
is set.

While here, add some safety catches:
- old_guest_table must point to the page contained in
  [nr_validated_ptes].
- alloc_l1_page shouldn't set old_guest_table

If we experience one of these situations in production builds, it's
safer to avoid calling put_page_type for the pages in question.  If
they have PGT_partial set, they will be cleaned up on domain
destruction; if not, we have no idea whether a type count is safe to
drop.  Retaining an extra type ref that should have been dropped may
trigger a BUG() on the free_domain_page() path, but dropping a type
count that shouldn't be dropped may cause a privilege escalation.

This is part of XSA-299.

Reported-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 3c15a2d8cc1981f369cc9542f028054d0dfb325b
master date: 2019-10-31 16:16:13 +0100

x86/mm: Properly handle linear pagetable promotion failures

In order to allow recursive pagetable promotions and demotions to be
interrupted, Xen must keep track of the state of the sub-pages
promoted or demoted.  This is stored in two elements in the page
struct: nr_entries_validated and partial_flags.

The rule is that entries [0, nr_entries_validated) should always be
validated and hold a general reference count.  If partial_flags is
zero, then [nr_entries_validated] is not validated and no reference
count is held.  If PTF_partial_set is set, then [nr_entries_validated]
is partially validated, and a general reference count is held.

Unfortunately, in cases where an entry began with PTF_partial_set set,
and get_page_from_lNe() returns -EINVAL, the PTF_partial_set bit is
erroneously dropped.  (This scenario can be engineered mainly by the
use of interleaving of promoting and demoting a page which has "linear
pagetable" entries; see the appendix for a sketch.)  This means that
we will "leak" a general reference count on the page in question,
preventing the page from being freed.

Fix this by setting page->partial_flags to the partial_flags local
variable.

This is part of XSA-299.

Reported-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
-----
Appendix

Suppose A and B can both be promoted to L2 pages, and A[x] points to B.

V1: PIN_L2 B.
  B.type_count = 1 | PGT_validated
  B.count = 2 | PGC_allocated

V1: MOD_L3_ENTRY pointing something to A.
  In the process of validating A[x], grab an extra type / ref on B:
  B.type_count = 2 | PGT_validated
  B.count = 3 | PGC_allocated
  A.type_count = 1 | PGT_validated
  A.count = 2 | PGC_allocated

V1: UNPIN B.
  B.type_count = 1 | PGT_validated
  B.count = 2 | PGC_allocated

V1: MOD_L3_ENTRY removing the reference to A.
  De-validate A, down to A[x], which points to B.
  Drop the final type on B.  Arrange to be interrupted.
  B.type_count = 1 | PGT_partial
  B.count = 2 | PGC_allocated
  A.type_count = 1 | PGT_partial
  A.nr_validated_entries = x
  A.partial_pte = -1

V2: MOD_L3_ENTRY adds a reference to A.

At this point, get_page_from_l2e(A[x]) tries
get_page_and_type_from_mfn(), which fails because it's the wrong type;
and get_l2_linear_pagetable() also fails, because B isn't validated as
an l2 anymore.
master commit: 2f126247ef49c2ba52bae29a2ab371059ede67c0
master date: 2019-10-31 16:15:48 +0100

x86/mm: Collapse PTF_partial_set and PTF_partial_general_ref into one

...now that they are equivalent. No functional change intended.

Reported-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: d28fe10c50e59569c050878226dcd95dc741810f
master date: 2019-10-31 16:15:11 +0100

x86/mm: Always retain a general ref on partial

In order to allow recursive pagetable promotions and demotions to be
interrupted, Xen must keep track of the state of the sub-pages
promoted or demoted.  This is stored in two elements in the page struct:
nr_entries_validated and partial_flags.

The rule is that entries [0, nr_entries_validated) should always be
validated and hold a general reference count.  If partial_flags is
zero, then [nr_entries_validated] is not validated and no reference
count is held.  If PTF_partial_set is set, then [nr_entries_validated]
is partially validated.

At the moment, a distinction is made between promotion and demotion
with regard to whether the entry itself "holds" a general reference
count: when entry promotion is interrupted (i.e., returns -ERESTART),
the entry is not considered to hold a reference; when entry demotion
is interrupted, the entry is still considered to hold a general
reference.

PTF_partial_general_ref is used to distinguish between these cases.
If clear, it's a partial promotion => no general reference count held
by the entry; if set, it's partial demotion, so a general reference
count held.  Because promotions and demotions can be interleaved, this
value is passed to get_page_and_type_from_mfn and put_page_from_l*e,
to be able to properly handle reference counts.

Unfortunately, because a refcount is not held, it is possible to
engineer a situation where PTF_partial_set is set but the page in
question has been assigned to another domain.  A sketch is provided in
the appendix.

Fix this by having the parent page table entry hold a general
reference count whenever PTF_partial_set is set.  (For clarity of
change, keep two separate flags.  These will be collapsed in a
subsequent changeset.)

This has two basic implications.  On the put_page_from_lNe() side,
this means that the (partial_set && !partial_ref) case can never happen,
and no longer needs to be special-cased.

Secondly, because both flags are set together, there's no need to carry over
existing bits from partial_pte.

(NB there is still another issue with calling _put_page_type() on a
page which had PGT_partial set; that will be handled in a subsequent
patch.)

On the get_page_and_type_from_mfn() side, we need to distinguish
between callers which hold a reference on partial (i.e.,
alloc_lN_table()), and those which do not (new_cr3, PIN_LN_TABLE, and
so on): pass a flag if the type should be retained on interruption.

NB that since l1 promotion can't be preempted, get_page_from_l2e
can't return -ERESTART.

This is part of XSA-299.

Reported-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
-----
* Appendix: Engineering PTF_partial_set while a page belongs to a
  foreign domain

Suppose A is a page which can be promoted to an l3, and B is a page
which can be promoted to an l2, and A[x] points to B.  B has
PGC_allocated set but no other general references.

V1:  PIN_L3 A.
  A is validated, B is validated.
  A.type_count = 1 | PGT_validated | PGT_pinned
  B.type_count = 1 | PGT_validated
  B.count = 2 | PGC_allocated (A[x] holds a general ref)

V1: UNPIN A.
  A begins de-validation.
  Arrange to be interrupted when i < x
  V1->old_guest_table = A
  V1->old_guest_table_ref_held = false
  A.type_count = 1 | PGT_partial
  A.nr_validated_entries = i < x
  B.type_count = 0
  B.count = 1 | PGC_allocated

V2: MOD_L4_ENTRY to point some l4e to A.
  Picks up re-validation of A.
  Arrange to be interrupted halfway through B's validation
  B.type_count = 1 | PGT_partial
  B.count = 2 | PGC_allocated (PGT_partial holds a general ref)
  A.type_count = 1 | PGT_partial
  A.nr_validated_entries = x
  A.partial_pte = PTF_partial_set

V3: MOD_L3_ENTRY to point some other l3e (not in A) to B.
  Validates B.
  B.type_count = 1 | PGT_validated
  B.count = 2 | PGC_allocated ("other l3e" holds a general ref)

V3: MOD_L3_ENTRY to clear l3e pointing to B.
  Devalidates B.
  B.type_count = 0
  B.count = 1 | PGC_allocated

V3: decrease_reservation(B)
  Clears PGC_allocated
  B.count = 0 => B is freed

B gets assigned to a different domain

V1: Restarts UNPIN of A
  put_old_guest_table(A)
    ...
      free_l3_table(A)

Now since A.partial_flags has PTF_partial_set, free_l3_table() will
call put_page_from_l3e() on A[x], which points to B, while B is owned
by another domain.

If A[x] held a general refcount for B on partial validation, as it does
for partial de-validation, then B would still have a reference count of
1 after PGC_allocated was freed; so B wouldn't be freed until after
put_page_from_l3e() had happened on A[x].
master commit: 18b0ab697830a46ce3dacaf9210799322cb3732c
master date: 2019-10-31 16:14:36 +0100

x86/mm: Have alloc_l[23]_table clear partial_flags when preempting

In order to allow recursive pagetable promotions and demotions to be
interrupted, Xen must keep track of the state of the sub-pages
promoted or demoted.  This is stored in two elements in the page
struct: nr_entries_validated and partial_flags.

The rule is that entries [0, nr_entries_validated) should always be
validated and hold a general reference count.  If partial_flags is
zero, then [nr_entries_validated] is not validated and no reference
count is held.  If PTF_partial_set is set, then [nr_entries_validated]
is partially validated.

At the moment, a distinction is made between promotion and demotion
with regard to whether the entry itself "holds" a general reference
count: when entry promotion is interrupted (i.e., returns -ERESTART),
the entry is not considered to hold a reference; when entry demotion
is interrupted, the entry is still considered to hold a general
reference.

PTF_partial_general_ref is used to distinguish between these cases.
If clear, it's a partial promotion => no general reference count held
by the entry; if set, it's partial demotion, so a general reference
count held.  Because promotions and demotions can be interleaved, this
value is passed to get_page_and_type_from_mfn and put_page_from_l*e,
to be able to properly handle reference counts.

Unfortunately, when alloc_l[23]_table check hypercall_preempt_check()
and return -ERESTART, they set nr_entries_validated, but don't clear
partial_flags.

If we were picking up from a previously-interrupted promotion, that
means that PTF_partial_set would be set even though
[nr_entries_validated] was not partially validated.  This means that
if the page in this state were de-validated, put_page_type() would
erroneously be called on that entry.

Perhaps worse, if we were racing with a de-validation, then we might
leave both PTF_partial_set and PTF_partial_general_ref; and when
de-validation picked up again, both the type and the general ref would
be erroneously dropped from [nr_entries_validated].

In a sense, the real issue here is code duplication.  Rather than
duplicate the interruption code, set rc to -EINTR and fall through to
the code which already handles that case correctly.

Given the logic at this point, it should be impossible for
partial_flags to be non-zero; add an ASSERT() to catch any changes.
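
A sketch of the "set rc to -EINTR and fall through" structure, under heavy
simplification. The RC_* constants stand in for the errno values, the two
callbacks stand in for hypercall_preempt_check() and the per-entry
validation, and table_page / alloc_table_model() are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

#define PTF_partial_set (1u << 0)   /* illustrative bit value */
#define RC_OK        0
#define RC_EINTR    (-4)            /* stand-in for -EINTR */
#define RC_ERESTART (-85)           /* stand-in for -ERESTART */

struct table_page {
    unsigned int nr_validated_ptes;
    unsigned int partial_flags;
};

typedef bool (*preempt_fn)(unsigned int i);
typedef int (*validate_fn)(unsigned int i);

int alloc_table_model(struct table_page *pg, unsigned int nr_entries,
                      preempt_fn preempt, validate_fn validate)
{
    unsigned int i;
    int rc;

    for (i = pg->nr_validated_ptes; i < nr_entries; i++) {
        /* Preemption is just another result, handled in one place below. */
        if (i > pg->nr_validated_ptes && preempt(i))
            rc = RC_EINTR;
        else
            rc = validate(i);

        if (rc == RC_ERESTART) {            /* entry i is half done */
            pg->nr_validated_ptes = i;
            pg->partial_flags = PTF_partial_set;
        } else if (rc == RC_EINTR && i) {   /* entry i was not touched */
            pg->nr_validated_ptes = i;
            pg->partial_flags = 0;
            rc = RC_ERESTART;
        }
        if (rc)
            return rc;
    }

    pg->nr_validated_ptes = nr_entries;
    pg->partial_flags = 0;
    return RC_OK;
}

/* Example callbacks: preempt before entry 1; every entry validates. */
bool preempt_at_one(unsigned int i) { return i == 1; }
bool never_preempt(unsigned int i) { (void)i; return false; }
int validate_ok(unsigned int i) { (void)i; return RC_OK; }
```

Resuming a page that was left with PTF_partial_set at entry 0 and getting
preempted before entry 1 now leaves partial_flags clear, since entry 1 has
not been touched.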

This is part of XSA-299.

Reported-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: ff0b9a5d69b744a99e8bbeac820a985db5a3bf8e
master date: 2019-10-31 16:14:14 +0100

x86/mm: Rework get_page_and_type_from_mfn conditional

Make it easier to read by declaring the conditions in which we will
retain the ref, rather than the conditions under which we release it.

The only way (page == current->arch.old_guest_table) can be true is if
preemptible is true; so remove this from the query itself, and add an
ASSERT() to that effect on the opposite path.

No functional change intended.

NB that alloc_lN_table() mishandle the "linear pt failure" situation
described in the comment; this will be addressed in a future patch.

This is part of XSA-299.

Reported-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 2aab06d742e13d7a9d248f1fc7f0ec62b295ada1
master date: 2019-10-31 16:13:23 +0100

x86/mm: Use flags for _put_page_type rather than a boolean

This is mainly in preparation for _put_page_type taking the
partial_flags value in the future. It also makes it easier to read in
the caller (since you see a flag name rather than `true` or `false`).

No functional change intended.

This is part of XSA-299.

Reported-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 0121588ec0f6950ed65d906d860df49be2c8e655
master date: 2019-10-31 16:12:53 +0100

x86/mm: Separate out partial_pte tristate into individual flags

At the moment, partial_pte is a tri-state that contains two distinct bits
of information:

1. If zero, the pte at index [nr_validated_ptes] is un-validated.  If
   non-zero, the pte was last seen with PGT_partial set.

2. If positive, the pte at index [nr_validated_ptes] does not hold a
   general reference count.  If negative, it does.

To make future patches more clear, separate out this functionality
into two distinct, named bits: PTF_partial_set (for #1) and
PTF_partial_general_ref (for #2).
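
As a rough illustration of the mapping described above (a sketch only, not
the Xen code: the bit positions and the helper name tristate_to_flags() are
assumptions made for this example), the old tri-state could translate to
the two flags like this:

```c
#include <assert.h>

/* Flag names come from the text above; the bit positions are assumed. */
#define PTF_partial_set         (1u << 0)
#define PTF_partial_general_ref (1u << 1)

/* Hypothetical helper: translate the old tri-state into the new flags. */
unsigned int tristate_to_flags(int partial_pte)
{
    if (partial_pte == 0)
        return 0;                   /* entry is un-validated */
    if (partial_pte > 0)
        return PTF_partial_set;     /* partial, no general ref held */
    /* Negative: partial, and a general reference is held. */
    return PTF_partial_set | PTF_partial_general_ref;
}
```

With this encoding, the old '-1' condition corresponds to both bits being
set at once.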

Additionally, a number of functions which need this information also
take other flags to control behavior (such as `preemptible` and
`defer`).  These are hard to read in the caller (since you only see
'true' or 'false'), and ugly when many are added together.  In
preparation for adding yet another flag in a future patch, collapse
all of these into a single `flag` variable.

NB that this does make checking for what was previously the '-1'
condition a bit more ugly in the put_page_from_lNe functions (since
you have to check for both partial_set and general ref); but this
clause will go away in a future patch.

Also note that the original comment had an off-by-one error:
partial_flags (like partial_pte before it) concerns
plNe[nr_validated_ptes], not plNe[nr_validated_ptes+1].

No functional change intended.

This is part of XSA-299.

Reported-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 1b6fa638d21006d3c0a3038132c6cb326d8bba08
master date: 2019-10-31 16:12:14 +0100

x86/mm: Don't re-set PGT_pinned on a partially de-validated page

When unpinning pagetables, if an operation is interrupted,
relinquish_memory() re-sets PGT_pinned so that the un-pin will
be picked up again when the hypercall restarts.

This is appropriate when put_page_and_type_preemptible() returns
-EINTR, which indicates that the page is back in its initial state
(i.e., completely validated). However, for -ERESTART, this leads to a
state where a page has both PGT_pinned and PGT_partial set.

This happens to work at the moment, although it's not really a
"canonical" state; but in subsequent patches, where we need to make a
distinction in handling between PGT_validated and PGT_partial pages,
this causes issues.

Move to a "canonical" state by:
- Only re-setting PGT_pinned on -EINTR
- Re-dropping the refcount held by PGT_pinned on -ERESTART

In the latter case, the PGT_partial bit will be cleared further down
with the rest of the other PGT_partial pages.
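
The two cases can be pictured with a toy model. Everything below is an
assumption for illustration (the struct, the RC_* stand-in codes and
after_unpin() do not exist in Xen); it only shows that the pin is restored
on the "nothing changed" result and that the pin's reference is dropped on
the "partially done" result:

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in error codes; the numeric values are arbitrary for this sketch. */
enum { RC_OK = 0, RC_EINTR = -1, RC_ERESTART = -2 };

/* Toy page state: only the two bits discussed above plus a refcount. */
struct toy_page {
    bool pinned;    /* models PGT_pinned */
    bool partial;   /* models PGT_partial */
    int  refs;      /* reference count; the pin accounts for one of them */
};

/*
 * Hypothetical unpin step. The caller has already cleared 'pinned' and
 * passes in the result of the preemptible put operation.
 */
void after_unpin(struct toy_page *pg, int rc)
{
    switch (rc) {
    case RC_EINTR:
        /* Page is fully validated again: restore the pin for the retry. */
        pg->pinned = true;
        break;
    case RC_ERESTART:
        /* Page is partially de-validated: keep it unpinned and drop the
         * reference the pin was holding, so it is never pinned+partial. */
        assert(pg->partial);
        pg->refs--;
        break;
    default:
        break;
    }
}
```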

While here, clean up some trailing whitespace.

This is part of XSA-299.

Reported-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: bf656e02d8e7f49b484e2587aef4f18deda6e2ab
master date: 2019-10-31 16:11:46 +0100

x86/mm: L1TF checks don't leave a partial entry

On detection of a potential L1TF issue, most validation code returns
-ERESTART to allow the switch to shadow mode to happen and cause the
original operation to be restarted.

However, in the validation code, the return value -ERESTART has been
repurposed to indicate 1) the function has partially completed
something which needs to be undone, and 2) calling put_page_type()
should cleanly undo it.  This causes problems in several places.

For L1 tables, on receiving an -ERESTART return from alloc_l1_table(),
alloc_page_type() will set PGT_partial on the page.  If for some
reason the original operation never restarts, then on domain
destruction, relinquish_memory() will call free_page_type() on the
page.

Unfortunately, alloc_ and free_l1_table() aren't set up to deal with
PGT_partial.  When returning a failure, alloc_l1_table() always
de-validates whatever it's validated so far, and free_l1_table()
always devalidates the whole page.  This means that if
relinquish_memory() calls free_page_type() on an L1 that didn't
complete due to an L1TF, it will call put_page_from_l1e() on "page
entries" that have never been validated.

For L2+ tables, setting rc to -ERESTART causes the rest of the
alloc_lN_table() function to *think* that the entry in question will
have PGT_partial set.  This will cause it to set partial_pte = 1.  If
relinquish_memory() then calls free_page_type() on one of those pages,
then free_lN_table() will call put_page_from_lNe() on the entry when
it shouldn't.

Rather than indicating -ERESTART, indicate -EINTR.  This is the code
to indicate that nothing has changed from when you started the call
(which is effectively how alloc_l1_table() handles errors).

mod_lN_entry() shouldn't have any of these types of problems, so leave
potential changes there for a clean-up patch later.

This is part of XSA-299.

Reported-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 3165ffef09e89d38f84d26051f606d2c1421aea3
master date: 2019-10-31 16:11:12 +0100

x86/PV: check GDT/LDT limits during emulation

Accesses beyond the LDT limit originating from emulation would trigger
the ASSERT() in pv_map_ldt_shadow_page(). On production builds such
accesses would cause an attempt to promote the touched page (offset from
the present LDT base address) to a segment descriptor one. If this
happens to succeed, guest user mode would be able to elevate its
privileges to that of the guest kernel. This is particularly easy when
there's no LDT at all, in which case the LDT base stored internally to
Xen is simply zero.

Also adjust the ASSERT() that was triggering: It was off by one to
begin with, and for production builds we also better use
ASSERT_UNREACHABLE() instead with suitable recovery code afterwards.
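
A minimal sketch of the kind of limit check involved, assuming 8-byte
descriptors and a table size expressed as a number of entries (the helper
name and parameters are hypothetical, not the ones used in Xen):

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed: each descriptor is 8 bytes, so byte offset / 8 = entry index. */
#define DESC_SIZE_SHIFT 3

/* Hypothetical helper: is the descriptor at 'offset' within the table? */
bool ldt_offset_in_limit(unsigned int offset, unsigned int nr_ents)
{
    /* Strictly less than: index nr_ents is already one past the end. */
    return (offset >> DESC_SIZE_SHIFT) < nr_ents;
}
```

Note that with no LDT at all (nr_ents == 0) every offset is rejected,
which is the case called out above as particularly easy to abuse.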

This is XSA-298.

Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 93021cbe880a8013691a48d0febef8ed7d3e3ebd
master date: 2019-10-31 16:08:16 +0100

xen/hypercall: Don't use BUG() for parameter checking in hypercall_create_continuation()

Since c/s 1d429034 "hypercall: update vcpu_op to take an unsigned vcpuid",
which incorrectly swapped 'i' for 'u' in the parameter type list, guests have
been able to hit the BUG() in next_args()'s default case.

Correct these back to 'i'.

In addition, make adjustments to prevent this class of issue from occurring in
the future - crashing Xen is not an appropriate form of parameter checking.

Capitalise NEXT_ARG() to catch all uses, to highlight that it is a macro doing
non-function-like things behind the scenes, and undef it when appropriate.
Implement a bad_fmt: block which prints an error, asserts unreachable, and
crashes the guest.
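
To illustrate the general idea of failing gracefully on a bad format
character, here is a toy argument collector. It is not
hypercall_create_continuation(): the function name, the 'l' specifier and
the -1 return value are assumptions for this sketch, and the real code
additionally crashes the guest:

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>

/*
 * Toy collector: 'i' consumes an unsigned int, 'l' an unsigned long.
 * Returns the number of arguments stored, or -1 on a bad format or
 * overflow, instead of taking the whole program down.
 */
int collect_args(unsigned long *out, size_t max, const char *fmt, ...)
{
    va_list ap;
    size_t n = 0;
    int rc = 0;

    va_start(ap, fmt);
    for (const char *p = fmt; *p != '\0' && rc == 0; p++) {
        if (n >= max) {
            rc = -1;
            break;
        }
        switch (*p) {
        case 'i':
            out[n++] = va_arg(ap, unsigned int);
            break;
        case 'l':
            out[n++] = va_arg(ap, unsigned long);
            break;
        default:
            rc = -1;    /* unknown specifier such as 'u': report, no BUG() */
            break;
        }
    }
    va_end(ap);

    return rc ? rc : (int)n;
}
```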

On the ARM side, drop all parameter checking of p. It is asymmetric with the
x86 side, and akin to expecting memcpy() or sprintf() to check their src/fmt
parameter before use. A caller passing "" or something other than a string
literal will be obvious during code review.

This is XSA-296.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Julien Grall <julien.grall@arm.com>
master commit: 0bf9f8d3e399a0e1d2b717f71b4776172446184b
master date: 2019-10-31 16:07:11 +0100

xen/arm: mm: Flush the TLBs even if a mapping failed in create_xen_entries

At the moment, create_xen_entries will only flush the TLBs if the full
range has successfully been updated. This may leave unwanted
entries in the TLBs if we fail to update some entries.
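
The shape of the fix can be sketched with stand-ins (none of the toy_*
names exist in Xen; the "flush" here is just a counter): the loop may stop
early, but the flush runs on every exit path.

```c
#include <assert.h>

unsigned int flush_count;   /* counts calls to the stand-in flush */

void toy_flush_tlb(void)
{
    flush_count++;
}

/* Stand-in for a single entry update; fails at index 'fail_at'. */
int toy_update_entry(unsigned int idx, unsigned int fail_at)
{
    return idx == fail_at ? -1 : 0;
}

int toy_update_range(unsigned int nr, unsigned int fail_at)
{
    int rc = 0;

    for (unsigned int i = 0; i < nr; i++) {
        rc = toy_update_entry(i, fail_at);
        if (rc)
            break;      /* earlier entries have already been modified */
    }

    toy_flush_tlb();    /* flush regardless of rc */
    return rc;
}
```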

Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Andrii Anisov <andrii_anisov@epam.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit a189ef027dbb7a3c0dfe566137f05c06d6685fb9)

xen/arm: fix nr_pdxs calculation

pfn_to_pdx expects an address, not a size, as a parameter. Specifically,
it expects the end address, then the masks calculations compensate for
any holes between start and end. Thus, we should pass the end address to
pfn_to_pdx.

The initial pdx is stored in frametable_base_pdx, so we can subtract the
result of pfn_to_pdx(start_address) from nr_pdxs; we know that we don't
need to cover any memory in the range 0-start in the frametable.

Remove the variable `nr_pages' because it is unused.

Signed-off-by: Stefano Stabellini <stefanos@xilinx.com>
Reviewed-by: Julien Grall <julien.grall@arm.com>
CC: JBeulich@suse.com
(cherry picked from commit f51027be0688540aaab61513b06a8693a37e4c00)

xen/arm64: Correctly compute the virtual address in maddr_to_virt()

The helper maddr_to_virt() is used to translate a machine address to a
virtual address. To save some valuable address space, some part of the
machine address may be compressed.

In theory the PDX code is free to compress any bits so there is no
guarantee the machine index computed will always be greater than
xenheap_mfn_start. This would result in returning a virtual address that
is not part of the direct map and trigger a crash, at least on debug
builds, later on because of the check in virt_to_page().

A recently reverted patch (see 1191156361 "xen/arm: fix mask calculation
in pdx_init_mask") allows the PDX to compress more bits and triggered a
crash on AMD Seattle Platform.

Avoid the crash by keeping track of the base PDX for the xenheap and use
it for computing the virtual address.

Note that virt_to_maddr() does not need to have similar modification as
it is using the hardware to translate the virtual address to a machine
address.

Take the opportunity to fix the ASSERT() as the direct map base address
corresponds to the start of the RAM (this is not always 0).

Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit 612d476e74a314be514ee6a9744eea8db09d32e5)

xen/arm: vsmc: The function identifier is always 32-bit

On Arm64, the SMCCC function identifier is always stored in the first 32 bits
of the x0 register. The rest of the bits are not defined and should be
ignored.

This means the variable funcid should be a uint32_t rather than
register_t.
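
In other words, only the low 32 bits of the register value are meaningful.
A minimal sketch (the helper name is made up for this example):

```c
#include <assert.h>
#include <stdint.h>

/* Keep only the low 32 bits of x0; the upper bits are ignored. */
uint32_t smccc_funcid(uint64_t x0)
{
    return (uint32_t)x0;
}
```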

Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit 7f4217cc60574866cb90d67d9750228c6b86c91e)

xen/arm: p2m: Free the p2m entry after flushing the IOMMU TLBs

When freeing a p2m entry, all the sub-tree behind it will also be freed.
This may include intermediate page-tables or any l3 entry that requires
dropping a reference (e.g. for foreign pages). As soon as pages are freed,
they may be re-used by Xen or another domain. Therefore it is necessary
to flush *all* the TLBs beforehand.

While CPU TLBs will be flushed before freeing the pages, this is not
the case for IOMMU TLBs. This can be solved by moving the IOMMU TLBs
flush earlier in the code.

This wasn't considered as a security issue as device passthrough on Arm
is not security supported.

Signed-off-by: Julien Grall <julien.grall@arm.com>
Tested-by: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Release-acked-by: Juergen Gross <jgross@suse.com>
(cherry picked from commit 671878779741b38c5f2363adceef8de2ce0b3945)

xen/arm: Don't use _end in is_xen_fixed_mfn()

virt_to_maddr() is using the hardware page-table walk instructions to
translate a virtual address to a physical address. The function should
only be called on mapped virtual addresses.

_end points past the end of Xen binary and may not be mapped when the
binary size is page-aligned. This means virt_to_maddr() will not be able
to do the translation and therefore crash Xen.

Note there is also an off-by-one issue in this code, but the panic will
trump that.

Both issues can be fixed by using _end - 1 in the check.

Signed-off-by: Julien Grall <julien.grall@arm.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit 8dba9a81e7c62b8a7dbe023fffecd2e16cc20486)

xen/arm: setup: Calculate correctly the size of Xen

The current size of Xen is computed using _end - _start + 1. However,
_end is pointing one past the end of Xen, so the size of Xen is
off-by-one.
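
The convention at play is the usual half-open range [start, end). A small
sketch with hypothetical helpers (plain pointers stand in for the linker
symbols):

```c
#include <assert.h>
#include <stddef.h>

/* 'end' points one past the last byte, as _end does in the text above. */
size_t image_size(const char *start, const char *end)
{
    return (size_t)(end - start);   /* no "+ 1" for a half-open range */
}

/* Address of the last byte that is actually part of the image. */
const char *image_last_byte(const char *start, const char *end)
{
    assert(end > start);
    return end - 1;
}
```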

Signed-off-by: Julien Grall <julien.grall@arm.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit 08e2059facd78d5ffaf206ba06ac2017c4adeed4)

xen/arm: traps: Avoid using BUG_ON() to check guest state in advance_pc()

The condition of the BUG_ON() in advance_pc() is pretty wrong because
the bits [26:25] and [15:10] have a different meaning between AArch32
and AArch64 state.

On AArch32, they are used to store PSTATE.IT. On AArch64, they are RES0
or used for new features (e.g. ARMv8.0-SSBS, ARMv8.5-BTI).

This means a 64-bit guest will hit the BUG_ON() if it is trying to use
any of these features.

More generally, RES0 means that the bits are reserved for future use. So
crashing the host is definitely not the right solution.

In this particular case, we only need to know the guest was using 32-bit
Mode and the Thumb instructions. So replace the BUG_ON() by a proper
check.
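
A sketch of such a check, as a standalone predicate. The macro names and
the helper are assumptions for this example; the bit positions follow the
description above (bits [26:25] and [15:10] for the IT state) plus the
architectural mode bit 4 and Thumb bit 5:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PSR_MODE_BIT 0x10u        /* set when the guest is in AArch32 state */
#define PSR_THUMB    0x20u        /* T bit: Thumb instruction set */
#define PSR_IT_MASK  0x0600fc00u  /* bits [26:25] and [15:10] */

/* Only a 32-bit guest executing Thumb code has IT state to advance. */
bool needs_it_advance(uint32_t cpsr)
{
    if (!(cpsr & PSR_MODE_BIT) || !(cpsr & PSR_THUMB))
        return false;   /* AArch64 or ARM mode: bits mean something else */

    return (cpsr & PSR_IT_MASK) != 0;
}
```

A 64-bit guest with one of those bits set is simply skipped rather than
treated as an impossible state.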

Reported-by: Lukas Jünger <lukas.juenger@ice.rwth-aachen.de>
Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit 72615f2e6b98e861c08abb1d2b194126013d54fe)

arm: gic-v3: deactivate interrupts during initialization

On i.MX8, we implemented partition reboot which means Cortex-A reboot
will not impact M4 cores and System control Unit core. However GICv3 is
not reset because we also need to support A72 Cluster reboot without
affecting A53 Cluster.

The gic-v3 controller is configured with EOImode to 1, so during xen
reboot, there is a function call "smp_call_function(halt_this_cpu, NULL, 0);"
but halt_this_cpu never returns, that means other CPUs have no chance to
deactivate the SGI interrupt, because the deactivate_irq operation is at
the end of do_sgi. During the next boot of Xen, CPU0 will issue
GIC_SGI_CALL_FUNCTION to other CPUs. As the Active state for SGI is left
untouched during the reboot, the GIC_SGI_CALL_FUNCTION will still be active
on the non-boot CPUs. This means the interrupt cannot be triggered again
until it gets deactivated.

And according to IHI0069D_gic_architecture_specification, chapter
"8.11.3 GICR_ICACTIVER0, Interrupt Clear-Active Register 0", the RW
field of GICR_ICACTIVER0 resets to a value that is architecturally UNKNOWN.
So make sure all interrupts are deactivated during initialization by
clearing the state.
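
The clear-active registers are write-one-to-clear, so writing all ones
deactivates every interrupt in the bank. A toy model of that behaviour
(the struct and toy_* functions are invented for this sketch and do not
touch real hardware):

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of one 32-interrupt bank: bit n set means interrupt n active. */
struct toy_gic_bank {
    uint32_t active;
};

/* Write-one-to-clear: each 1 bit in 'val' deactivates that interrupt. */
void toy_icactiver_write(struct toy_gic_bank *bank, uint32_t val)
{
    bank->active &= ~val;
}

/* Initialization step: do not trust the previous state, clear every bit. */
void toy_gic_init(struct toy_gic_bank *bank)
{
    toy_icactiver_write(bank, 0xffffffffu);
    assert(bank->active == 0);
}
```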

Signed-off-by: Peng Fan <peng.fan@nxp.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit 0322e0db5b29a0d1ce4b452885e34023e3a4b00e)