dgit.raspbian.org Git

sched: don't let XEN_RUNSTATE_UPDATE leak into vcpu_runstate_get()

vcpu_runstate_get() should never return a state entry time with
XEN_RUNSTATE_UPDATE set. To avoid this let update_runstate_area()
operate on a local runstate copy.

As it is required to first set the XEN_RUNSTATE_UPDATE indicator in
guest memory, then update all the runstate data, and then at last
clear the XEN_RUNSTATE_UPDATE again it is much less effort to have
a local copy of the runstate data instead of keeping only a copy of
state_entry_time.

This problem was introduced with commit 2529c850ea48f036 ("add update
indicator to vcpu_runstate_info").

Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Julien Grall <julien.grall@arm.com>
master commit: f28c4c4c10bdacb1e49cc6e9de57eb1f973cbdf6
master date: 2019-09-26 18:04:09 +0200

ACPI/cpuidle: bump maximum number of power states we support

Commit 4c6cd64519 ("mwait_idle: Skylake Client Support") added a table
with 8 entries, which - together with C0 - rendered the current limit
too low. It should have been accompanied by an increase of the constant;
do this now. Don't bump by too much though, as there are a number of on-
stack arrays which are dimensioned by this constant.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Wei Liu <wl@xen.org>
master commit: ff22a91b4c45f9310d0ec0d7ee070d84a373dd87
master date: 2019-09-25 15:53:35 +0200

sched: fix freeing per-vcpu data in sched_move_domain()

In case of an allocation error of per-vcpu data in sched_move_domain()
the already allocated data is freed just using xfree(). This is wrong
as some schedulers need to do additional operations (e.g. the arinc653
scheduler needs to remove the vcpu-data from a list).

So instead xfree() make use of the sched_free_vdata() hook.

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Dario Faggioli <dfaggioli@suse.com>
master commit: b6656e6aa4dd5de537ce07ec16bfbbbb538b28b5
master date: 2019-09-25 15:52:53 +0200

libxc/x86: avoid certain overflows in CPUID APIC ID adjustments

Recent AMD processors may report up to 128 logical processors in CPUID
leaf 1. Doubling this value produces 0 (which OSes sincerely dislike),
as the respective field is only 8 bits wide. Suppress doubling the value
(and its leaf 0x80000008 counterpart) in such a case.

Note that while there's a similar overflow in intel_xc_cpuid_policy(),
that one is being left alone for now.

Note further that while it was considered to suppress the multiplication
by 2 altogether if the host topology already provides at least one bit
of thread ID within APIC IDs, it was decided to avoid more change here
than really needed at this point.

Also zap leaf 4 (and at the same time leaf 2) EDX output for AMD, as it
should have been from the beginning.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
libxc/x86: correct overflow avoidance check in AMD CPUID handling

Commit df29d03f1d ("libxc/x86: avoid certain overflows in CPUID APIC ID
adjustments" introduced a one bit too narrow mask when checking whether
multiplying by 1 (in particular in leaf 1) would result in overflow.

Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: df29d03f1d97bdde1bc0cea8ef8538d4f524b3ec
master date: 2019-09-24 10:50:33 +0200
master commit: c9c7ac508b3f65f7d5f9685893096a1b22d8b176
master date: 2019-09-25 15:50:58 +0200

vpci: honor read-only devices

Don't allow the hardware domain write access the PCI config space of
devices marked as read-only.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 79f9ba78380fb3f4bf509e5c726c6cdd76e00c4f
master date: 2019-09-17 16:13:39 +0200

x86/boot: silence MADT table entry logging

Logging disabled LAPIC / x2APIC entries with invalid local APIC IDs
(ones having "broadcast" meaning when used) isn't very useful, and can
be quite noisy on larger systems. Suppress their logging unless
opt_cpu_info is true.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 936b77255269b3b9b5685d565550e77d5080ac81
master date: 2018-09-03 17:51:40 +0200

ioreq: fix hvm_all_ioreq_servers_add_vcpu fail path cleanup

The loop in FOR_EACH_IOREQ_SERVER is backwards hence the cleanup on
failure needs to be done forwards.

Fixes: 97a5a3e30161 ('x86/hvm/ioreq: maintain an array of ioreq servers rather than a list')
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
master commit: 215f2576b0ac1bc18f3ff74e34f0d8379bda9040
master date: 2019-09-10 16:32:47 +0200

x86/cpuid: Fix handling of the CPUID.7[0].eax levelling MSR

7a0 is an integer field, not a mask - taking the logical and of the hardware
and policy values results in nonsense. Instead, take the policy value
directly.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@cirtrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: b50d78d0eaffb43d5f5ceeda55fa22c11f47d01b
master date: 2019-09-10 13:33:21 +0100

x86/shadow: don't enable shadow mode with too small a shadow allocation (part 2)

Commit 2634b997af ("x86/shadow: don't enable shadow mode with too small
a shadow allocation") was incomplete: The adjustment done there to
shadow_enable() is also needed in shadow_one_bit_enable(). The (new)
problem report was (apparently) a failed PV guest migration followed by
another migration attempt for that same guest. Disabling log-dirty mode
after the first one had left a couple of shadow pages allocated (perhaps
something that also wants fixing), and hence the second enabling of
log-dirty mode wouldn't have allocated anything further.

Reported-by: James Wang <jnwang@suse.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Tim Deegan <tim@xen.org>
master commit: 8b25551baa3307af0aa1ef8f7f43403f01c2c5d7
master date: 2019-09-05 09:56:42 +0200

x86: properly gate clearing of PKU feature

setup_clear_cpu_cap() is __init and hence may not be called post-boot.
Note that opt_pku nevertheless is not getting __initdata added - see
e.g. commit 43fa95ae6a ("mm: make opt_bootscrub non-init").

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 41c7700a00011ad08be3c9d71126b67e08e58ac3
master date: 2019-08-29 15:10:07 +0200

p2m/ept: pass correct level to atomic_write_ept_entry in ept_invalidate_emt

The level passed to ept_invalidate_emt corresponds to the EPT entry
passed as the mfn parameter, which is a pointer to an EPT page table,
hence the entries in that page table will have one level less than the
parent.

Fix the call to atomic_write_ept_entry to pass the correct level, ie:
one level less than the parent.

Fixes: 50fe6e73705 ('pvh dom0: add and remove foreign pages')
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>.
master commit: b806c91275fb1ab7696ebf033b56631693056c90
master date: 2019-08-28 16:57:36 +0200

x86/mm: correctly initialise M2P entries on boot

Since guest resource management work it's now possible to have a page
assigned to a domain without a valid M2P entry. Some paths in the code
rely on the fact a GFN returned from mfn_to_gfn() for such a page
is not valid as well, i.e. see arch_iommu_populate_page_table().

For systems without 512GB contiguous RAM M2P entries were already
correctly initialised on boot with INVALID_M2P_ENTRY (~0UL) but
on systems where M2P could be covered by a single 1GB page directory
0x77 poison was used instead. That eventually resulted in a crash
during IOMMU construction on systems without shared PTs enabled.

While here fix up compat M2P entries as well.

Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 6c093931a765803cfc7b0df466ee032760cc8020
master date: 2019-08-27 13:40:42 +0100

x86: Restore IA32_MISC_ENABLE on wakeup

Code in intel.c:early_init_intel() modifies IA32_MISC_ENABLE MSR. Those
modifications must be restored after resuming from S3 (see e.g. Linux wakeup
code), otherwise bad things may happen (e.g. wakeup code may cause #GP when
trying to set IA32_EFER.NXE [1]).

This bug was noticed on a ThinkPad x230 with NX disabled in the BIOS:
Xen could correctly boot, but crashed when resuming from suspend.
Applying this patch fixed the problem.

[1] Intel SDM vol 3: "If the execute-disable capability is not
available, a write to set IA32_EFER.NXE produces a #GP exception."

Signed-off-by: Michał Kowalczyk <mkow@invisiblethingslab.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: c3cfa5b3084d71bccd8360d044bea813688b587c
master date: 2019-08-19 15:07:34 +0100

x86/xpti: Don't leak TSS-adjacent percpu data via Meltdown

The XPTI work restricted the visibility of most of memory, but missed a few
aspects when it came to the TSS.

Given that the TSS is just an object in percpu data, the 4k mapping for it
created in setup_cpu_root_pgt() maps adjacent percpu data, making it all
leakable via Meltdown, even when XPTI is in use.

Furthermore, no care is taken to check that the TSS doesn't cross a page
boundary.  As it turns out, struct tss_struct is aligned on its size which
does prevent it straddling a page boundary.

Rework the TSS types while making this change.  Rename tss_struct to tss64, to
mirror the existing tss32 structure we have in HVM's Tast Switch logic.  Drop
tss64's alignment and __cacheline_filler[] field.

Introduce tss_page which contains a single tss64 and keeps the rest of the
page clear, so no adjacent data can be leaked.  Move the definition from
setup.c to traps.c, which is a more appropriate place for it to live.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
master commit: 7888440625617693487495a7842e6a991ead2647
master date: 2019-08-12 14:10:09 +0100

xen/page_alloc: Keep away MFN 0 from the buddy allocator

Combining of buddies happens only such that the resulting larger buddy
is still order-aligned. To cross a zone boundary while merging, the
implication is that both the buddy [0, 2^n-1] and the buddy
[2^n, 2^(n+1)-1] are free.

Ideally we want to fix the allocator, but for now we can just prevent
adding the MFN 0 in the allocator to avoid merging across zone
boundaries.

On x86, the MFN 0 is already kept away from the buddy allocator. So the
bug can only happen on Arm platform where the first memory bank is
starting at 0.

As this is a specific to the allocator, the MFN 0 is removed in the common code
to cater all the architectures (current and future).

[Stefano: improve commit message]

Reported-by: Jeff Kubascik <jeff.kubascik@dornerworks.com>
Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Stefano Stabellini <sstabellini@kernel.org>
Tested-by: Stefano Stabellini <sstabellini@kernel.org>
Signed-off-by: Stefano Stabellini <stefanos@xilinx.com>
master commit: 762b9a2d990bba1f3aefe660cff0c37ad2e375bc
master date: 2019-08-09 11:12:55 -0700

xen/link: Introduce .bss.percpu.page_aligned

Future changes are going to need to page align some percpu data.

Shuffle the exact link order of items within the BSS to give
.bss.percpu.page_aligned appropriate alignment, even on CPU0, which uses
.bss.percpu itself.

Insert explicit alignment such that there won't be a gap between
__per_cpu_start and the first actual per-CPU object. The POINTER_ALIGN
for __bss_end is to cover the lack of SMP_CACHE_BYTES alignment, as the
loops which zero the BSS use pointer-sized stores on all architectures.

Rework __DEFINE_PER_CPU() so the caller passes in all attributes, and
adjust DEFINE_PER_CPU{,_READ_MOSTLY}() to match. This has the added bonus
that it is now possible to grep for .bss.percpu and find all the users.

Finally, introduce DEFINE_PER_CPU_PAGE_ALIGNED() which specifies the
section attribute and verifies the type's alignment.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Make DEFINE_PER_CPU_PAGE_ALIGNED() verify the alignment rather than
specifying it. It is the underlying type which should be suitably aligned.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Julien Grall <julien.grall@arm.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 6c9639a72f0ca3a9430ef75f375877182281fdef
master date: 2019-08-09 16:36:58 +0200

xen/sched: fix memory leak in credit2

csched2_deinit() is leaking the run-queue memory.

Signed-off-by: Juergen Gross <jgross@suse.com>
Acked-by: Dario Faggioli <dfaggioli@suse.com>
master commit: 70f9dff51ee873cf65246d3e95b27e2e92ca137b
master date: 2019-08-07 17:21:14 +0100

x86/boot: Set Accessed bits in boot_cpu_{,compat_}gdt_table[]

There is no point causing the CPU to performed a locked update of the
descriptors on first use.

Reported-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: af292b41e9edc0a87f0205ece833e64808ec3883
master date: 2019-08-07 13:34:56 +0100

x86/apic: enable x2APIC mode before doing any setup

Current code calls apic_x2apic_probe which does some initialization
and setup before having enabled x2APIC mode (if it's not already
enabled by the firmware).

This can lead to issues if the APIC ID doesn't match the x2APIC ID, as
apic_x2apic_probe calls init_apic_ldr_x2apic_cluster which depending
on the APIC mode might set cpu_2_logical_apicid using the APIC ID
instead of the x2APIC ID (because x2APIC might not be enabled yet).

Fix this by enabling x2APIC before calling apic_x2apic_probe.

As a remark, this was discovered while I was trying to figure out why
one of my test boxes didn't report any iommu faults. The root cause
was that the iommu MSI address field was set using the stale value in
cpu_2_logical_apicid, and thus the iommu fault interrupt would get
lost. Even if the MSI address field gets sets to a correct value
afterwards as soon as a single iommu fault is pending no further
interrupts would get injected, so losing a single iommu fault
interrupt is fatal.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 260940578de348c38f18cadc6fa53f499e57919c
master date: 2019-08-07 12:09:51 +0200

x86/microcode: always collect_cpu_info() during boot

Currently cpu_sig struct is not updated during boot if no microcode blob
is specified by "ucode=[<interger>| scan]".

It will result in cpu_sig.rev being 0 which affects APIC's
check_deadline_errata() and retpoline_safe() functions.

Fix this by getting ucode revision early during boot and SMP bring up.
While at it, protect early_microcode_update_cpu() for cases when
microcode_ops is NULL.

Signed-off-by: Sergey Dyasli <sergey.dyasli@citrix.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 2bb2c55cf870e78bc7f514784b2cd8c947d8729c
master date: 2019-08-01 18:45:32 +0100

xen/spec-ctrl: Speculative mitigation facilities report wrong status

Booting with spec-ctrl=0 results in Xen printing "None MD_CLEAR".

(XEN) Support for HVM VMs: None MD_CLEAR
(XEN) Support for PV VMs: None MD_CLEAR

Add a check about X86_FEATURE_MD_CLEAR to avoid to print "None".

Signed-off-by: James Wang <jnwang@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 2adc580bd59f5c3034fd6ecacd5748678373f17a
master date: 2019-07-31 14:53:13 +0100

x86/boot: Fix build dependenices for reloc.c

c/s 201f852eaf added start_info.h and kconfig.h to reloc.c, but only updated
start_info.h in RELOC_DEPS.

This causes reloc.c to not be regenerated when Kconfig changes.  It is most
noticeable when enabling CONFIG_PVH and finding the resulting binary crash
early with:

  (d9) (XEN)
  (d9) (XEN) ****************************************
  (d9) (XEN) Panic on CPU 0:
  (d9) (XEN) Magic value is wrong: c2c2c2c2
  (d9) (XEN) ****************************************
  (d9) (XEN)
  (d9) (XEN) Reboot in five seconds...
  (XEN) d9v0 Triple fault - invoking HVM shutdown action 1

Reported-by: Paul Durrant <paul.durrant@citrix.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 78c0000c87ce498bf621914c0554b83fac3ee00d
master date: 2019-07-31 11:19:45 +0100

video: fix handling framebuffer located above 4GB

On some machines (for example Thinkpad P52), UEFI GOP reports
framebuffer located above 4GB (0x4000000000 on that machine). This
address does not fit in {xen,dom0}_vga_console_info.u.vesa_lfb.lfb_base
field, which is 32bit. The overflow here cause all kind of memory
corruption when anything tries to write something on the screen,
starting with zeroing the whole framebuffer in vesa_init().

Fix this similar to how it's done in Linux: add ext_lfb_base field at
the end of the structure, to hold upper 32bits of the address. Since the
field is added at the end of the structure, it will work with older
Linux versions too (other than using possibly truncated address - no
worse than without this change). Thanks to ABI containing size of the
structure (start_info.console.dom0.info_size), Linux can detect when
this field is present and use it appropriately then.

Since this change public interface and use __XEN_INTERFACE_VERSION__,
bump __XEN_LATEST_INTERFACE_VERSION__.

Note: if/when backporting this change to Xen <= 4.12, #if in xen.h needs
to be extended with " || defined(__XEN__)".

Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 9cf11fdcd91ff8e9cd038f8336cf21f0701e8b7b
master date: 2019-05-17 14:48:23 +0200

x86/altp2m: make sure EPTP_INDEX is up-to-date when enabling #VE

vmx_vmexit_handler() assumes that if
SECONDARY_EXEC_ENABLE_VIRT_EXCEPTIONS is set, that the value in
EPTP_INDEX is valid.  Unfortunately, the function which sets this bit
(vmx_vcpu_update_vmfunc_ve) doesn't actually set EPTP_INDEX; it will
only be set the next time vmx_vcpu_update_eptp() is called.

This means that if a vcpu makes a vmexit between these two points, the
EPTP_INDEX it reads will be invalid.  The first time this race happens
for a domain, EPTP_INDEX will most likely be zero, which is the index
for the "host" p2m -- and thus is often correct.  But the second time
this race happens, the value will typically be INVALID_ALTP2M, which
will hit the following BUG:

    BUG_ON(idx >= MAX_ALTP2M);

Worse, if for some reason the current altp2m was *not* `0` during this
window (say, because a toolstack changed the VM to a different view),
then the accounting of active vcpus for an altp2m will be thrown off.

Fix this by always updating EPTP_INDEX to the current altp2m index
when enabling #VE.

Reported-by: Razvan Cojocaru <rcojocaru@bitdefender.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Razvan Cojocaru <rcojocaru@bitdefender.com>
Tested-by: Razvan Cojocaru <rcojocaru@bitdefender.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>
master commit: 1dddfff4c39d3db17dfa709b1c57f44e3ed352e3
master date: 2018-08-02 12:12:43 +0200

x86/msi: fix loop termination condition in pci_msi_conf_write_intercept()

The for loop that deals with MSI masking is coded as follows:

for ( pos = 0; pos < entry->msi.nvec; ++pos, ++entry )

Thus the loop termination condition is dereferencing a struct pointer that
is being incremented by the loop.

A block of MSI entries stores the number of vectors in entry[0].msi.nvec,
with all subsequent entries using a value of 0. Therefore, for a block of
two or more MSIs will terminate the loop early, as entry[1].msi.nvec is 0.

However, for a single MSI, ++entry moves the pointer out of bounds, and a
bogus read is used for the termination condition. In the case that the
loop body gets entered, there are subsequent OoB writes which clobber
adjacent memory in the heap.

This patch simply initializes a stack variable to the value of
entry->msi.nvec before starting the loop and then uses that in the
termination condition instead.

Signed-off-by: Paul Durrant <paul.durrant@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 56ad626532eb7addeef2bb2f5f67a15756b5cee2
master date: 2019-07-02 12:00:42 +0100

x86/vvmx: set CR4 before CR0

Otherwise hvm_set_cr0() will check the wrong CR4 bits (L1 instead of L2
and vice-versa).

Signed-off-by: Sergey Dyasli <sergey.dyasli@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>
master commit: 3af3c95b81625adf7e6ea71c94b641424741eded
master date: 2019-06-28 13:17:53 +0100

x86/cpuid: leak OSXSAVE only when XSAVE is not clear in policy

This fixes booting of old non-PV-OPS kernels which historically
looked for OSXSAVE instead of XSAVE bit in CPUID to check whether
XSAVE feature is enabled. If such a guest appears to be started on
an XSAVE enabled CPU and the feature is explicitly cleared in
policy, leaked OSXSAVE bit from Xen will lead to guest crash early in
boot.

Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 902888922e6feda2c485cc4bdeffd0d6e6c26e14
master date: 2019-06-28 13:17:53 +0100

x86/SMP: don't try to stop already stopped CPUs

In particular with an enabled IOMMU (but not really limited to this
case), trying to invoke fixup_irqs() after having already done
disable_IO_APIC() -> clear_IO_APIC() is a rather bad idea:

RIP:    e008:[<ffff82d08026a036>] amd_iommu_read_ioapic_from_ire+0xde/0x113
RFLAGS: 0000000000010006   CONTEXT: hypervisor (d0v0)
rax: ffff8320291de00c   rbx: 0000000000000003   rcx: ffff832035000000
rdx: 0000000000000000   rsi: 0000000000000000   rdi: ffff82d0805ca840
rbp: ffff83009e8a79c8   rsp: ffff83009e8a79a8   r8:  0000000000000000
r9:  0000000000000004   r10: 000000000008b9f9   r11: 0000000000000006
r12: 0000000000010000   r13: 0000000000000003   r14: 0000000000000000
r15: 00000000fffeffff   cr0: 0000000080050033   cr4: 00000000003406e0
cr3: 0000002035d59000   cr2: ffff88824ccb4ee0
fsb: 00007f2143f08840   gsb: ffff888256a00000   gss: 0000000000000000
ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: e010   cs: e008
Xen code around <ffff82d08026a036> (amd_iommu_read_ioapic_from_ire+0xde/0x113):
  ff 07 00 00 39 d3 74 02 <0f> 0b 41 81 e4 00 f8 ff ff 8b 10 89 d0 25 00 00
Xen stack trace from rsp=ffff83009e8a79a8:
...
Xen call trace:
    [<ffff82d08026a036>] amd_iommu_read_ioapic_from_ire+0xde/0x113
    [<ffff82d08026bf7b>] iommu_read_apic_from_ire+0x10/0x12
    [<ffff82d08027f718>] io_apic.c#modify_IO_APIC_irq+0x5e/0x126
    [<ffff82d08027f9c5>] io_apic.c#unmask_IO_APIC_irq+0x2d/0x41
    [<ffff82d080289bc7>] fixup_irqs+0x320/0x40b
    [<ffff82d0802a82c4>] smp_send_stop+0x4b/0xa8
    [<ffff82d0802a7b2f>] machine_restart+0x98/0x288
    [<ffff82d080252242>] console_suspend+0/0x28
    [<ffff82d0802b01da>] do_general_protection+0x204/0x24e
    [<ffff82d080385a3d>] x86_64/entry.S#handle_exception_saved+0x68/0x94
    [<00000000aa5b526b>] 00000000aa5b526b
    [<ffff82d0802a7c7d>] machine_restart+0x1e6/0x288
    [<ffff82d080240f75>] hwdom_shutdown+0xa2/0x11d
    [<ffff82d08020baa2>] domain_shutdown+0x4f/0xd8
    [<ffff82d08023fe98>] do_sched_op+0x12f/0x42a
    [<ffff82d08037e404>] pv_hypercall+0x1e4/0x564
    [<ffff82d080385432>] lstar_enter+0x112/0x120

Don't call fixup_irqs() and don't send any IPI if there's only one
online CPU anyway, and don't call __stop_this_cpu() at all when the CPU
we're on was already marked offline (by a prior invocation of
__stop_this_cpu()).

Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Extend this to the kexec/crash path as well.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 6ff560f7f1f214fb89baaf97812c4c943e44a642
master date: 2019-06-18 16:35:35 +0200

x86/AMD: limit C1E disable family range

Just like for other family values of 0x17 (see "x86/AMD: correct certain
Fam17 checks"), commit 3157bb4e13 ("Add MSR support for various feature
AMD processor families") made the original check for Fam11 here include
families all the way up to Fam17. The involved MSR (0xC0010055),
however, is fully reserved starting from Fam16, and the two bits of
interest are reserved for Fam12 and onwards (albeit I admit I wasn't
able to find any Fam13 doc). Restore the upper bound to be Fam11.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 5c2926f576c9127a8d47217e0cafe00cc741c452
master date: 2019-06-18 16:34:51 +0200

x86/AMD: correct certain Fam17 checks

Commit 3157bb4e13 ("Add MSR support for various feature AMD processor
families") converted certain checks for Fam11 to include families all
the way up to Fam17. The commit having no description, it is hard to
tell whether this was a mechanical dec->hex conversion mistake, or
indeed intended. In any event the NB_CFG handling needs to be restricted
to Fam16 and below: Fam17 doesn't really have such an MSR anymore. As
per observation it's read-zero / write-discard now, so make PV uniformly
(with the exception of pinned Dom0 vCPU-s) behave so, just like HVM
already does.

Mirror the NB_CFG behavior to MSR_FAM10H_MMIO_CONF_BASE as well, except
that here the vendor/model check is kept in place (for now at least).

A non-MMCFG extended config space access mechanism still appears to
exist, but code to deal with it will need to be written down the road,
when it can actually be tested.

Reported-by: Pu Wen <puwen@hygon.cn>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: e0fbf3bf9871b00fa526c4ed893604e7ad6c3090
master date: 2019-06-18 16:33:53 +0200

x86/pv: Fix undefined behaviour in check_descriptor()

UBSAN reports:

  (XEN) ================================================================================
  (XEN) UBSAN: Undefined behaviour in x86_64/mm.c:1108:31
  (XEN) left shift of 255 by 24 places cannot be represented in type 'int'
  (XEN) ----[ Xen-4.13-unstable  x86_64  debug=y   Tainted:    H ]----
  (XEN) CPU:    60
  (XEN) RIP:    e008:[<ffff82d0802a54ce>] ubsan.c#ubsan_epilogue+0xa/0xc2
  <snip>
  (XEN) Xen call trace:
  (XEN)    [<ffff82d0802a54ce>] ubsan.c#ubsan_epilogue+0xa/0xc2
  (XEN)    [<ffff82d0802a6009>] __ubsan_handle_shift_out_of_bounds+0x15d/0x16c
  (XEN)    [<ffff82d08033abd7>] check_descriptor+0x191/0x3dd
  (XEN)    [<ffff82d0804ef920>] do_update_descriptor+0x7f/0x2b6
  (XEN)    [<ffff82d0804efb75>] compat_update_descriptor+0x1e/0x20
  (XEN)    [<ffff82d0804fa1cc>] pv_hypercall+0x87f/0xa6f
  (XEN)    [<ffff82d080501acb>] do_entry_int82+0x53/0x58
  (XEN)    [<ffff82d08050702b>] entry_int82+0xbb/0xc0
  (XEN)
  (XEN) ================================================================================

As this is a constant, express it in longhand for correctness, and consistency
with the surrounding code.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: bd5be40ce2307ea5e8f52e3103d1b48ca9dfdce9
master date: 2019-06-06 20:04:33 +0100

x86/irq: Fix undefined behaviour in irq_move_cleanup_interrupt()

UBSAN reports:

  (XEN) ================================================================================
  (XEN) UBSAN: Undefined behaviour in irq.c:682:22
  (XEN) left shift of 1 by 31 places cannot be represented in type 'int'
  (XEN) ----[ Xen-4.13-unstable  x86_64  debug=y   Not tainted ]----
  (XEN) CPU:    16
  (XEN) RIP:    e008:[<ffff82d0802a54ce>] ubsan.c#ubsan_epilogue+0xa/0xc2
  <snip>
  (XEN) Xen call trace:
  (XEN)    [<ffff82d0802a54ce>] ubsan.c#ubsan_epilogue+0xa/0xc2
  (XEN)    [<ffff82d0802a6009>] __ubsan_handle_shift_out_of_bounds+0x15d/0x16c
  (XEN)    [<ffff82d08031ae77>] irq_move_cleanup_interrupt+0x25c/0x4a0
  (XEN)    [<ffff82d08031b585>] do_IRQ+0x19d/0x104c
  (XEN)    [<ffff82d08050c8ba>] common_interrupt+0x10a/0x120
  (XEN)    [<ffff82d0803b13a6>] cpu_idle.c#acpi_idle_do_entry+0x1de/0x24b
  (XEN)    [<ffff82d0803b1d83>] cpu_idle.c#acpi_processor_idle+0x5c8/0x94e
  (XEN)    [<ffff82d0802fa8d6>] domain.c#idle_loop+0xee/0x101
  (XEN)
  (XEN) ================================================================================

Switch to an unsigned shift, and correct the surrounding style.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: 0bf4a2560dd24a7a1285727a900b52adcb4594fb
master date: 2019-06-06 20:04:32 +0100

x86/spec-ctrl: Knights Landing/Mill are retpoline-safe

They are both Airmont-based and should have been included in c/s 17f74242ccf
"x86/spec-ctrl: Extend repoline safey calcuations for eIBRS and Atom parts".

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: e2105180f99d22aad47ee57113015e11d7397e54
master date: 2019-05-31 19:11:29 +0100

x86/vhpet: avoid 'small' time diff test on resume

It appears that even 64-bit versions of Windows 10, when not using syth-
etic timers, will use 32-bit HPET non-periodic timers. There is a test
in hpet_set_timer(), specific to 32-bit timers, that tries to disambiguate
between a comparator value that is in the past and one that is sufficiently
far in the future that it wraps. This is done by assuming that the delta
between the main counter and comparator will be 'small' [1], if the
comparator value is in the past. Unfortunately, more often than not, this
is not the case if the timer is being re-started after a migrate and so
the timer is set to fire far in the future (in excess of a minute in
several observed cases) rather then set to fire immediately. This has a
rather odd symptom where the guest console is alive enough to be able to
deal with mouse pointer re-rendering, but any keyboard activity or mouse
clicks yield no response.

This patch simply adds an extra check of 'creation_finished' into
hpet_set_timer() so that the 'small' time test is omitted when the function
is called to restart timers after migration, and thus any negative delta
causes a timer to fire immediately.

[1] The number of ticks that equate to 0.9765625 milliseconds

Signed-off-by: Paul Durrant <paul.durrant@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: b144cf45d50b603c2909fc32c6abf7359f86f1aa
master date: 2019-05-31 11:40:52 +0200

update Xen version to 4.11.3-pre

update Xen version to 4.11.2

xen/arm: time: cycles_t should be an uint64_t and not unsigned long

Since commit ca73ac8e7d "xen/arm: Add an isb() before reading CNTPCT_EL0
to prevent re-ordering", get_cycles() is now returning the number of
cycles and used in more callers.

While the counter registers is always 64-bit, get_cycles() will only
reutrn a 32-bit on Arm32 and therefore truncate the value. This will
result to weird behavior by both Xen and the Guest as the timer will not
be setup correctly.

This could be resolved by switch cycles_t from unsigned long to
unsigned int.

This change was originally introduced by
da3d55ae67225798c2ad8f42af2f432f6f2b2214 "console: avoid printing no or
null time stamps".

Signed-off-by: Julien Grall <julien.grall@arm.com>
[Stefano: improve commit message]
Signed-off-by: Stefano Stabellini <stefanos@xilinx.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>

x86: drop arch_evtchn_inject()

For whatever reason this was omitted from the backport of d9195962a6
("events: drop arch_evtchn_inject()").

Signed-off-by: Jan Beulich <jbeulich@suse.com>

XSM: adjust Kconfig names

Since the Kconfig option renaming was not backported, the new uses of
involved CONFIG_* settings should have been adopted to the existing
names in the XSA-295 series. Do this now, also changing XSM_SILO to just
SILO to better match its FLASK counterpart.

To avoid breaking the Kconfig menu structure also adjust XSM_POLICY's
dependency (as was also silently done on master during the renaming).

Signed-off-by: Jan Beulich <jbeulich@suse.com>

xen/arm: grant-table: Protect gnttab_clear_flag against guest misbehavior

The function gnttab_clear_flag is used to clear the access flags. On
Arm, it is implemented using a loop and guest_cmpxchg.

It is possible that guest_cmpxchg will always return a different value
than old. This can happen if the guest updated the memory before Xen has
time to do the exchange. Because of that, there are no way for to
promise the loop will end.

It is possible to make the current code safe by re-using the same
principle as applied on the guest atomic helper. However this patch
takes a different approach that should lead to more efficient code in
the default case.

A new helper is introduced to clear a set of bits on a 16-bits word.
This should avoid a an extra loop to check cmpxchg succeeded.

Note that a mask is used instead of a bit, so the helper can be re-used
later on for clearing multiple flags at the same time.

This is part of XSA-295.

Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Julien Grall <julien.grall@arm.com>
Signed-off-by: Stefano Stabellini <stefanos@xilinx.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>

xen/arm: Add performance counters in guest atomic helpers

Add performance counters in guest atomic helpers to be able to detect
whether a guest is often paused during the operations.

This is part of XSA-295.

Signed-off-by: Julien Grall <julien.grall@arm.com>
Acked-by: Stefano Stabellini <sstabellini@kernel.org>

xen: Use guest atomics helpers when modifying atomically guest memory

On Arm, exclusive load-store atomics should only be used between trusted
thread. As not all the guests are trusted, it may be possible to DoS Xen
when updating shared memory with guest atomically.

This patch replaces all the atomics operations on shared memory with
a guest by the new guest atomics helpers. The x86 code was not audited
to know where guest atomics helpers could be used. I will leave that
to the x86 folks.

Note that some rework was required in order to plumb use the new guest
atomics in event channel and grant-table.

Because guest_test_bit is ignoring the parameter "d" for now, it
means there a lot of places do not need to drop the const. We may want
to revisit this in the future if the parameter "d" becomes necessary.

This is part of XSA-295.

Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>

xen/cmpxchg: Provide helper to safely modify guest memory atomically

On Arm, exclusive load-store atomics should only be used between trusted
thread. As not all the guests are trusted, it may be possible to DoS Xen
when updating shared memory with guest atomically.

This patch adds a new helper that will update the guest memory safely.
For x86, it is already possible to use the current helper safely. So
just wrap it.

For Arm, we will first attempt to update the guest memory with the
loop bounded by a maximum number of iterations. If it fails, we will
pause the domain and try again.

Note that this heuristics assumes that a page can only
be shared between Xen and one domain. Not Xen and multiple domain.

The maximum number of iterations is based on how many times atomic_inc()
can be executed in 1uS. The maximum value is per-CPU to cater big.LITTLE
and calculated when the CPU is booting.

The maximum number of iterations is based on how many times a simple
load-store atomic operation can be executed in 1uS. The maximum
value is per-CPU to cater big.LITTLE and calculated when the CPU is
booting. The heuristic was randomly chosen and can be modified if
impact too much good-behaving guest.

This is part of XSA-295.

Signed-of-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Acked-by: Jan Beulich <jbeulich@suse.com>

xen/bitops: Provide helpers to safely modify guest memory atomically

On Arm, exclusive load-store atomics should only be used between trusted
thread. As not all the guests are trusted, it may be possible to DoS Xen
when updating shared memory with guest atomically.

This patch adds a new set of helper that will update the guest memory
safely. For x86, it is already possible to use the current helpers
safely. So just wrap them.

For Arm, we will first attempt to update the guest memory with the loop
bounded by a maximum number of iterations. If it fails, we will pause the
domain and try again.

Note that this heuristics assumes that a page can only be shared between
Xen and one domain. Not Xen and multiple domain.

The maximum number of iterations is based on how many times a simple
load-store atomic operation can be executed in 1uS. The maximum value is
per-CPU to cater big.LITTLE and calculated when the CPU is booting. The
heuristic was randomly chosen and can be modified if impact too much
good-behaving guest.

Note, while test_bit does not requires to use atomic operation, a
wrapper for test_bit was added for completeness. In this case, the
domain stays constified to avoid major rework in the caller for the
time-being.

This is part of XSA-295.

Signed-of-by: Julien Grall <julien.grall@arm.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>

xen/arm: Turn on SILO mode by default on Arm

On Arm, exclusive load-store atomics should only be used between trusted
thread. As not all the guests are trusted, it may be possible to DoS Xen
when updating shared memory with guest atomically.

Recent patches introduced new helpers to update shared memory with guest
atomically. Those helpers relies on a memory region to be be shared with
Xen and a single guest.

At the moment, nothing prevent a guest sharing a page with Xen and as
well with another guest (e.g via grant table).

For the scope of the XSA, the quickest way is to deny communications
between unprivileged guest. So this patch is enabling and using SILO
mode by default on Arm.

Users wanted finer graine policy could wrote their own Flask policy.

This is part of XSA-295.

Signed-off-by: Julien Grall <julien.grall@arm.com>
Acked-by: Jan Beulich <jbeulich@suse.com>

xen/xsm: Add new SILO mode for XSM

When SILO is enabled, there would be no page-sharing or event notifications
between unprivileged VMs (no grant tables or event channels).

Signed-off-by: Xin Li <xin.li@citrix.com>
Acked-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>

xen/xsm: Introduce new boot parameter xsm

Introduce new boot parameter xsm to choose which xsm module is enabled,
and set default to dummy. And add new option in Kconfig to choose the
default XSM implementation.

Signed-off-by: Xin Li <xin.li@citrix.com>
Acked-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>

xen/xsm: remove unnecessary #define

this #define is unnecessary since XSM_INLINE is redefined in
xsm/dummy.h, it's a risk of build breakage, so remove it.

Signed-off-by: Xin Li <xin.li@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>

xen/arm: cmpxchg: Provide a new helper that can timeout

Exclusive load-store atomics should only be used between trusted
threads. As not all the guests are trusted, it may be possible to DoS
Xen when updating shared memory with guest atomically.

To prevent the infinite loop, we introduce a new helper that can timeout.
The timeout is based on the maximum number of iterations.

It will be used in follow-up patch to make atomic operations on shared
memory safe.

This is part of XSA-295.

Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Signed-off-by: Stefano Stabellini <stefanos@xilinx.com>

xen/arm: bitops: Implement a new set of helpers that can timeout

Exclusive load-store atomics should only be used between trusted
threads. As not all the guests are trusted, it may be possible to DoS
Xen when updating shared memory with guest atomically.

To prevent the infinite loop, we introduce a new set of helpers that can
timeout. The timeout is based on the maximum number of iterations.

They will be used in follow-up patch to make atomic operations
on shared memory safe.

This is part of XSA-295.

Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>

xen/arm32: cmpxchg: Simplify the cmpxchg implementation

The only difference between each case of the cmpxchg is the size of
used. Rather than duplicating the code, provide a macro to generate each
cases.

This makes the code easier to read and modify.

While doing the rework, the case for 64-bit cmpxchg is removed. This is
unused today (already commented) and it would not be possible to use
it directly.

This is part of XSA-295.

Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>

xen/arm64: cmpxchg: Simplify the cmpxchg implementation

The only difference between each case of the cmpxchg is the size of
used. Rather than duplicating the code, provide a macro to generate each
cases.

This makes the code easier to read and modify.

This is part of XSA-295.

Signed-off-by; Julien Grall <julien.grall@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Signed-off-by: Stefano Stabellini <stefanos@xilinx.com>

xen/arm: bitops: Consolidate prototypes in one place

The prototype are the same between arm32 and arm64. Consolidate them in
asm-arm/bitops.h.

This change will help the introductions of new helpers in a follow-up
patch.

This is part of XSA-295.

Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>

xen/arm32: bitops: Rewrite bitop helpers in C

This is part of XSA-295.

Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Signed-off-by: Stefano Stabellini <stefanos@xilinx.com>

xen/arm64: bitops: Rewrite bitop helpers in C

This is part of XSA-295.

Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Signed-off-by: Stefano Stabellini <stefanos@xilinx.com>

xen/grant_table: Rework the prototype of _set_status* for lisibility

It is not clear from the parameters name whether domid and gt_version
correspond to the local or remote domain. A follow-up patch will make
them more confusing.

So rename domid (resp. gt_version) to ldomid (resp. rgt_version). At
the same time re-order the parameters to hopefully make it more
readable.

This is part of XSA-295.

Suggested-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Stefano Stabellini <sstabellini@kernel.org>

xen/arm: Add an isb() before reading CNTPCT_EL0 to prevent re-ordering

Per D8.2.1 in ARM DDI 0487C.a, "a read to CNTPCT_EL0 can occur
speculatively and out of order relative to other instructions executed
on the same PE."

Add an instruction barrier to get accurate number of cycles when
requested in get_cycles(). For the other users of CNPCT_EL0, replace by
a call to get_cycles().

This is part of XSA-295.

Signed-off-by: Julien Grall <julien.grall@arm.com>
Acked-by: Stefano Stabellini <sstabellini@kernel.org>

common: avoid atomic read-modify-write accesses in map_vcpu_info()

There's no need to set the evtchn_pending_sel bits one by one. Simply
write full words with all ones.

For Arm this requires extending write_atomic() to also handle 64-bit
values; for symmetry read_atomic() gets adjusted as well.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Julien Grall <julien.grall@arm.com>

events: drop arch_evtchn_inject()

Have the only user call vcpu_mark_events_pending() instead, at the same
time arranging for correct ordering of the writes (evtchn_pending_sel
should be written before evtchn_upcall_pending).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Julien Grall <julien.grall@arm.com>

xen/arm: mm: Set-up page permission for Xen mappings earlier on

Xen mapping is first create using a 2MB page and then shatterred in 4KB
page for fine-graine permission. However, it is not safe to break-down
superpage page without going to an intermediate step invalidating
the entry.

As we are changing Xen mappings, we cannot go through the intermediate
step. The only solution is to create Xen mapping using 4KB entries
directly. As the Xen should always access the mappings according with
the runtime permission, it is then possible to set-up the permissions
while create the mapping.

We are still playing with the fire as there are still some
break-before-make issue in setup_pagetables (i.e switch between 2 sets of
page-tables). But it should slightly be better than the current state.

Signed-off-by: Julien Grall <julien.grall@arm.com>
Reported-by: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
Reported-by: Jan-Peter Larsson <Jan-Peter.Larsson@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Tested-by: Matthew Daley <mattd@bugfuzz.com>
(cherry picked from commit 00c96d77422a4b84247bec5dadf434363d312cac)

libacpi: report PCI slots as enabled only for hotpluggable devices

DSDT for qemu-xen lacks _STA method of PCI slot object. If _STA method
doesn't exist then the slot is assumed to be always present and active
which in conjunction with _EJ0 method makes every device ejectable for
an OS even if it's not the case.

qemu-kvm is able to dynamically add _EJ0 method only to those slots
that either have hotpluggable devices or free for PCI passthrough.
As Xen lacks this capability we cannot use their way.

qemu-xen-traditional DSDT has _STA method which only reports that
the slot is present if there is a PCI devices hotplugged there.
This is done through querying of its PCI hotplug controller.
qemu-xen has similar capability that reports if device is "hotpluggable
or absent" which we can use to achieve the same result.

Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 6761965243b113230bed900d6105be05b28f5cea
master date: 2019-05-24 10:30:21 +0200

x86/IO-APIC: fix build with gcc9

There are a number of pointless __packed attributes which cause gcc 9 to
legitimately warn:

utils.c: In function 'vtd_dump_iommu_info':
utils.c:287:33: error: converting a packed 'struct IO_APIC_route_entry' pointer (alignment 1) to a 'struct IO_APIC_route_remap_entry' pointer (alignment 8) may result in an unaligned pointer value [-Werror=address-of-packed-member]
  287 |                 remap = (struct IO_APIC_route_remap_entry *) &rte;
      |                                 ^~~~~~~~~~~~~~~~~~~~~~~~~

intremap.c: In function 'ioapic_rte_to_remap_entry':
intremap.c:343:25: error: converting a packed 'struct IO_APIC_route_entry' pointer (alignment 1) to a 'struct IO_APIC_route_remap_entry' pointer (alignment 8) may result in an unaligned pointer value [-Werror=address-of-packed-member]
  343 |     remap_rte = (struct IO_APIC_route_remap_entry *) old_rte;
      |                         ^~~~~~~~~~~~~~~~~~~~~~~~~

Simply drop these attributes. Take the liberty and also re-format the
structure definitions at the same time.

Reported-by: Charles Arnold <carnold@suse.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: ca9310b24e6205de5387e5982ccd42c35caf89d4
master date: 2019-05-24 10:19:59 +0200

x86emul: add support for missing {,V}PMADDWD insns

Their pre-AVX512 incarnations have clearly been overlooked during much
earlier work. Their memory access pattern is entirely standard, so no
specific tests get added to the harness.

Reported-by: Razvan Cojocaru <rcojocaru@bitdefender.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Tested-by: Alexandru Isaila <aisaila@bitdefender.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 1a48bdd599b268a2d9b7d0c45f1fd40c4892186e
master date: 2019-05-16 13:43:17 +0200

x86/IRQ: avoid UB (or worse) in trace_irq_mask()

Dynamically allocated CPU mask objects may be smaller than cpumask_t, so
copying has to be restricted to the actual allocation size. This is
particulary important since the function doesn't bail early when tracing
is not active, so even production builds would be affected by potential
misbehavior here.

Take the opportunity and also
- use initializers instead of assignment + memset(),
- constify the cpumask_t input pointer,
- u32 -> uint32_t.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>
master commit: 6fafb8befa99620a2d7323b9eca5c387bad1f59f
master date: 2019-05-13 16:41:03 +0200

x86/boot: Fix latent memory corruption with early_boot_opts_t

c/s ebb26b509f "xen/x86: make VGA support selectable" added an #ifdef
CONFIG_VIDEO into the middle the backing space for early_boot_opts_t,
but didn't adjust the structure definition in cmdline.c

This only functions correctly because the affected fields are at the end
of the structure, and cmdline.c doesn't write to them in this case.

To retain the slimming effect of compiling out CONFIG_VIDEO, adjust
cmdline.c with enough #ifdef-ary to make C's idea of the structure match
the declaration in asm. This requires adding __maybe_unused annotations
to two helper functions.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 30596213617fcf4dd7b71d244e16c8fc0acf456b
master date: 2019-05-13 10:35:38 +0100

x86/svm: Fix handling of ICEBP intercepts

c/s 9338a37d "x86/svm: implement debug events" added support for introspecting
ICEBP debug exceptions, but didn't account for the fact that
svm_get_insn_len() (previously __get_instruction_length) can fail and may
already have raised #GP with the guest.

If svm_get_insn_len() fails, return back to guest context rather than
continuing and mistaking a trap-style VMExit for a fault-style one.

Spotted by Coverity.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Razvan Cojocaru <rcojocaru@bitdefender.com>
Acked-by: Brian Woods <brian.woods@amd.com>
master commit: 1495b4ff9b4af2b9c0f12cdb6491082cecf34f86
master date: 2019-05-13 10:35:37 +0100

drivers/video: drop framebuffer size constraints

The limit 1900x1200 do not match real world devices (1900 looks like a
typo, should be 1920). But in practice the limits are arbitrary and do
not serve any real purpose. As discussed in "Increase framebuffer size
to todays standards" thread, drop them completely.

This fixes graphic console on device with 3840x2160 native resolution.

Suggested-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
drivers/video: drop unused limits

MAX_BPP, MAX_FONT_W, MAX_FONT_H are not used in the code at all.

Suggested-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 19600eb75aa9b1df3e4b0a4e55a5d08b957e1fd9
master date: 2019-05-13 10:13:24 +0200
master commit: 343459e34a6d32ba44a21f8b8fe4c1f69b1714c2
master date: 2019-05-13 10:12:56 +0200

bitmap: fix bitmap_fill with zero-sized bitmap

When bitmap_fill(..., 0) is called, do not try to write anything. Before
this patch, it tried to write almost LONG_MAX, surely overwriting
something.

Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 93df28be2d4f620caf18109222d046355ac56327
master date: 2019-05-13 10:12:00 +0200

x86/vmx: correctly gather gs_shadow value for current vCPU

Currently the gs_shadow value is only cached when the vCPU is being scheduled
out by Xen. Reporting this (usually) stale value through vm_event is incorrect,
since it doesn't represent the actual state of the vCPU at the time the event
was recorded. This prevents vm_event subscribers from correctly finding kernel
structures in the guest when it is trapped while in ring3.

Refresh shadow_gs value when the context being saved is for the current vCPU.

Signed-off-by: Tamas K Lengyel <tamas@tklengyel.com>
Acked-by: Razvan Cojocaru <rcojocaru@bitdefender.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>
master commit: f69fc1c2f36e8a74ba54c9c8fa5c904ea1ad319e
master date: 2019-05-13 09:55:59 +0200

x86/mtrr: recalculate P2M type for domains with iocaps

This change reflects the logic in epte_get_entry_emt() and allows
changes in guest MTTRs to be reflected in EPT for domains having
direct access to certain hardware memory regions but without IOMMU
context assigned (e.g. XenGT).

Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: f3d880bf2be92534c5bacf11de2f561cbad550fb
master date: 2019-05-13 09:54:45 +0200

AMD/IOMMU: disable previously enabled IOMMUs upon init failure

If any IOMMUs were successfully initialized before encountering failure,
the successfully enabled ones should be disabled again before cleaning
up their resources.

Move disable_iommu() next to enable_iommu() to avoid a forward
declaration, and take the opportunity to remove stray blank lines ahead
of both functions' final closing braces.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Brian Woods <brian.woods@amd.com>
master commit: 87a3347d476443c66c79953d77d6aef1d2bb3bbd
master date: 2019-05-13 09:52:43 +0200

trace: fix build with gcc9

While I've not observed this myself, gcc 9 (imo validly) reportedly may
complain

trace.c: In function '__trace_hypercall':
trace.c:826:19: error: taking address of packed member of 'struct <anonymous>' may result in an unaligned pointer value [-Werror=address-of-packed-member]
826 | uint32_t *a = d.args;

and the fix is rather simple - remove the __packed attribute. Introduce
a BUILD_BUG_ON() as replacement, for the unlikely case that Xen might
get ported to an architecture where array alignment higher that that of
its elements.

Reported-by: Martin Liška <martin.liska@suse.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>
master commit: 3fd3b266d4198c06e8e421ca515d9ba09ccd5155
master date: 2019-05-13 09:51:23 +0200

xen/sched: fix csched2_deinit_pdata()

Commit 753ba43d6d16e688 ("xen/sched: fix credit2 smt idle handling")
introduced a regression when switching cpus between cpupools.

When assigning a cpu to a cpupool with credit2 being the default
scheduler csched2_deinit_pdata() is called for the credit2 private data
after the new scheduler's private data has been hooked to the per-cpu
scheduler data. Unfortunately csched2_deinit_pdata() will cycle through
all per-cpu scheduler areas it knows of for removing the cpu from the
respective sibling masks including the area of the just moved cpu. This
will (depending on the new scheduler) either clobber the data of the
new scheduler or in case of sched_rt lead to a crash.

Avoid that by removing the cpu from the list of active cpus in credit2
data first.

The opposite problem is occurring when removing a cpu from a cpupool:
init_pdata() of credit2 will access the per-cpu data of the old
scheduler.

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Dario Faggioli <dfaggioli@suse.com>
master commit: ffd3367ed682b6ac6f57fcb151921054dd4cce7e
master date: 2019-05-17 15:41:17 +0200

oxenstored: Don't re-open a xenctrl handle for every domain introduction

Currently, an xc handle is opened in main() which is used for cleanup
activities, and a new xc handle is temporarily opened every time a domain is
introduced. This is inefficient, and amongst other things, requires full root
privileges for the lifetime of oxenstored.

All code using the Xenctrl handle is in domains.ml, so initialise xc as a
global (now happens just before main() is called) and drop it as a parameter
from Domains.create and Domains.cleanup.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Christian Lindig <christian.lindig@citrix.com>
(cherry picked from commit 129025fe30934c6a04bbd9c05ade479d34ce4985)

xl: handle PVH type in apply_global_affinity_masks again

A call site in create_domain can call it with PVH type. That site was
missed during the review of 48dab9767.

Reinstate PVH type in the switch.

Reported-by: Julien Grall <julien.grall@arm.com>
Signed-off-by: Wei Liu <wei.liu2@citrix.com>
(cherry picked from commit 860d6e158dbb581c3aabc6a20ae8d83b325bffd8)
(cherry picked from commit b4f291b0ca914454cbac9fa5580bb35f8ab04eee)

tools/libxc: Fix issues with libxc and Xen having different featureset lengths

In almost all cases, Xen and libxc will agree on the featureset length,
because they are built from the same source.

However, there are circumstances (e.g. security hotfixes) where the featureset
gets longer and dom0 will, after installing updates, be running with an old
Xen but new libxc.  Despite writing the code with this scenario in mind, there
were some bugs.

First, xen-cpuid's get_featureset() erroneously allocates a buffer based on
Xen's featureset length, but records libxc's length, which may be longer.

In this situation, the hypercall bounce buffer code reads/writes the recorded
length, which is beyond the end of the allocated object, and a later free()
encounters corrupt heap metadata.  Fix this by recording the same length that
we allocate.

Secondly, get_cpuid_domain_info() has a related bug when the passed-in
featureset is a different length to libxc's.

A large amount of the libxc cpuid functionality depends on info->featureset
being as long as expected, and it is allocated appropriately.  However, in the
case that a shorter external featureset is passed in, the logic to check for
trailing nonzero bits may read off the end of it.  Rework the logic to use the
correct upper bound.

In addition, leave a comment next to the fields in struct cpuid_domain_info
explaining the relationship between the various lengths, and how to cope with
different lengths.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
(cherry picked from commit c393b64dcee6684da25257b033148740cb6d7ff0)

tools/xl: use libxl_domain_info to get domain type for vcpu-pin

Parsing the config seems to be an overkill for this particular task
and the config might simply be absent. Type returned from libxl_domain_info
should be either LIBXL_DOMAIN_TYPE_HVM or LIBXL_DOMAIN_TYPE_PV but in
that context distinction between PVH and HVM should be irrelevant.

Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
(cherry picked from commit 48dab9767d2eb173495707cb1fd8ceaf73604ac1)
(cherry picked from commit c59579d8319b776ae6243da1999737e2b4737710)

tools/libxl: correct vcpu affinity output with sparse physical cpu map

With not all physical cpus online (e.g. with smt=0) the output of hte
vcpu affinities is wrong, as the affinity bitmaps are capped after
nr_cpus bits, instead of using max_cpu_id.

Signed-off-by: Juergen Gross <jgross@suse.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
(cherry picked from commit 2ec5339ec9218fbf1583fa85b74d1d2f15f1b3b8)

tools/ocaml: Dup2 /dev/null to stdin in daemonize()

Don't close stdin in daemonize() but dup2 /dev/null instead.  Otherwise, fd 0
gets reused later:

  [root@idol ~]# ls -lav /proc/`pgrep xenstored`/fd
  total 0
  dr-x------ 2 root root  0 Feb 28 11:02 .
  dr-xr-xr-x 9 root root  0 Feb 27 15:59 ..
  lrwx------ 1 root root 64 Feb 28 11:02 0 -> /dev/xen/evtchn
  l-wx------ 1 root root 64 Feb 28 11:02 1 -> /dev/null
  l-wx------ 1 root root 64 Feb 28 11:02 2 -> /dev/null
  lrwx------ 1 root root 64 Feb 28 11:02 3 -> /dev/xen/privcmd
  ...

Signed-off-by: Christian Lindig <christian.lindig@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
(cherry picked from commit 677e64dbe315343620c3b266e9eb16623b118038)

tools/misc/xenpm: fix getting info when some CPUs are offline

Use physinfo.max_cpu_id instead of physinfo.nr_cpus to get max CPU id.
This fixes for example 'xenpm get-cpufreq-para' with smt=off, which
otherwise would miss half of the cores.

Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
(cherry picked from commit ffb60a58df48419c1f2607cd3cc919fa2bfc9c2d)

x86: fix build race when generating temporary object files

The rules to generate xen-syms and xen.efi may run in parallel, but both
recursively invoke $(MAKE) to build symbol/relocation table temporary
object files. These recursive builds would both re-generate the .*.d2
files (where needed). Both would in turn invoke the same rule, thus
allowing for a race on the .*.d2.tmp intermediate files.

The dependency files of the temporary .xen*.o files live in xen/ rather
than xen/arch/x86/ anyway, so won't be included no matter what. Take the
opportunity and delete them, as the just re-generated .xen*.S files will
trigger a proper re-build of the .xen*.o ones anyway.

Empty the DEPS variable in case the set of goals consists of just those
temporary object files, thus eliminating the race.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 761bb575ce97255029d2d2249b2719e54bc76825
master date: 2019-04-11 10:25:05 +0200

VT-d: posted interrupts require interrupt remapping

Initially I had just noticed the unnecessary indirection in the call
from pi_update_irte(). The generic wrapper having an iommu_intremap
conditional made me look at the setup code though. So first of all
enforce the necessary dependency.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 6c54663786d9f1ed04153867687c158675e7277d
master date: 2019-04-09 15:12:07 +0200

vm_event: fix XEN_VM_EVENT_RESUME domctl

Make XEN_VM_EVENT_RESUME return 0 in case of success, instead of
-EINVAL.
Remove vm_event_resume form vm_event.h header and set the function's
visibility to static as is used only in vm_event.c.
Move the vm_event_check_ring test inside vm_event_resume in order to
simplify the code.

Signed-off-by: Petre Pircalabu <ppircalabu@bitdefender.com>
Acked-by: Razvan Cojocaru <rcojocaru@bitdefender.com>
master commit: b32c0446b103aa801ee18780b2fdd78dfc0b9052
master date: 2019-04-05 15:42:03 +0200

xen/timers: Fix memory leak with cpu unplug/plug

timer_softirq_action() realloc's itself a larger timer heap whenever
necessary, which includes bootstrapping from the empty dummy_heap.  Nothing
ever freed this allocation.

CPU plug and unplug has the side effect of zeroing the percpu data area, which
clears ts->heap.  This in turn causes new timers to be put on the list rather
than the heap, and for timer_softirq_action() to bootstrap itself again.

This in practice leaks ts->heap every time a CPU is unplugged and replugged.

Implement free_percpu_timers() which includes freeing ts->heap when
appropriate, and update the notifier callback with the recent cpu parking
logic and free-avoidance across suspend.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
xen/cpu: Fix ARM build following c/s 597fbb8

c/s 597fbb8 "xen/timers: Fix memory leak with cpu unplug/plug" broke the ARM
build by being the first patch to add park_offline_cpus to common code.

While it is currently specific to Intel hardware (for reasons of being able to
handle machine check exceptions without an immediate system reset), it isn't
inherently architecture specific, so define it to be false on ARM for now.

Add a comment in both smp.h headers explaining the intended behaviour of the
option.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
timers: move back migrate_timers_from_cpu() invocation

Commit 597fbb8be6 ("xen/timers: Fix memory leak with cpu unplug/plug")
went a little too far: Migrating timers away from a CPU being offlined
needs to heppen independent of whether it get parked or fully offlined.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
xen/timers: Fix memory leak with cpu unplug/plug (take 2)

Previous attempts to fix this leak failed to identify the root cause, and
ultimately failed.  The cause is the CPU_UP_PREPARE case (re)initialising
ts->heap back to dummy_heap, which leaks the previous allocation.

Rearrange the logic to only initialise ts once.  This also avoids the
redundant (but benign, due to ts->inactive always being empty) initialising of
the other ts fields.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 597fbb8be6021440cd53493c14201c32671bade1
master date: 2019-04-08 11:16:06 +0100
master commit: a6448adfd3d537aacbbd784e5bf1777ab3ff5f85
master date: 2019-04-09 10:12:57 +0100
master commit: 1aec95350ac8261cba516371710d4d837c26f6a0
master date: 2019-04-15 17:51:30 +0100
master commit: e978e9ed9e1ff0dc326e72708ed03cac2ba41db8
master date: 2019-05-13 10:35:37 +0100

x86emul: suppress general register update upon AVX gather failures

While destination and mask registers may indeed need updating in this
case, the rIP update in particular needs to be avoided, as well as e.g.
raising a single step trap.

Reported-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 74f299bbd7d5cc52325b5866c17b44dd0bd1c5a2
master date: 2019-04-03 10:14:32 +0200

xen/sched: fix credit2 smt idle handling

Credit2's smt_idle_mask_set() and smt_idle_mask_clear() are used to
identify idle cores where vcpus can be moved to. A core is thought to
be idle when all siblings are known to have the idle vcpu running on
them.

Unfortunately the information of a vcpu running on a cpu is per
runqueue. So in case not all siblings are in the same runqueue a core
will never be regarded to be idle, as the sibling not in the runqueue
is never known to run the idle vcpu.

Use a credit2 specific cpumask of siblings with only those cpus
being marked which are in the same runqueue as the cpu in question.

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Dario Faggioli <dfaggioli@suse.com>
master commit: 753ba43d6d16e688f688e01e1c77463ea2c6ec9f
master date: 2019-03-29 18:28:21 +0000

x86/spec-ctrl: Introduce options to control VERW flushing

The Microarchitectural Data Sampling vulnerability is split into categories
with subtly different properties:

MLPDS - Microarchitectural Load Port Data Sampling
MSBDS - Microarchitectural Store Buffer Data Sampling
MFBDS - Microarchitectural Fill Buffer Data Sampling
MDSUM - Microarchitectural Data Sampling Uncacheable Memory

MDSUM is a special case of the other three, and isn't distinguished further.

These issues pertain to three microarchitectural buffers.  The Load Ports, the
Store Buffers and the Fill Buffers.  Each of these structures are flushed by
the new enhanced VERW functionality, but the conditions under which flushing
is necessary vary.

For this concise overview of the issues and default logic, the abbreviations
SP (Store Port), FB (Fill Buffer), LP (Load Port) and HT (Hyperthreading) are
used for brevity:

* Vulnerable hardware is divided into two categories - parts which suffer
   from SP only, and parts with any other combination of vulnerabilities.

* SP only has an HT interaction when the thread goes idle, due to the static
   partitioning of resources.  LP and FB have HT interactions at all points,
   due to the competitive sharing of resources.  All issues potentially leak
   data across the return-to-guest transition.

* The microcode which implements VERW flushing also extends MSR_FLUSH_CMD, so
   we don't need to do both on the HVM return-to-guest path.  However, some
   parts are not vulnerable to L1TF (therefore have no MSR_FLUSH_CMD), but are
   vulnerable to MDS, so do require VERW on the HVM path.

Note that we deliberately support mds=1 even without MD_CLEAR in case the
microcode has been updated but the feature bit not exposed.

This is part of XSA-297, CVE-2018-12126, CVE-2018-12127, CVE-2018-12130, CVE-2019-11091.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 3c04c258ab40405a74e194d9889a4cbc7abe94b4)

x86/spec-ctrl: Infrastructure to use VERW to flush pipeline buffers

Three synthetic features are introduced, as we need individual control of
each, depending on circumstances. A later change will enable them at
appropriate points.

The verw_sel field doesn't strictly need to live in struct cpu_info. It lives
there because there is a convenient hole it can fill, and it reduces the
complexity of the SPEC_CTRL_EXIT_TO_{PV,HVM} assembly by avoiding the need for
any temporary stack maintenance.

This is part of XSA-297, CVE-2018-12126, CVE-2018-12127, CVE-2018-12130, CVE-2019-11091.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 548a932ac786d6bf3584e4b54f2ab993e1117710)

x86/spec-ctrl: CPUID/MSR definitions for Microarchitectural Data Sampling

The MD_CLEAR feature can be automatically offered to guests. No
infrastructure is needed in Xen to support the guest making use of it.

This is part of XSA-297, CVE-2018-12126, CVE-2018-12127, CVE-2018-12130, CVE-2019-11091.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit d4f6116c080dc013cd1204c4d8ceb95e5f278689)

x86/spec-ctrl: Misc non-functional cleanup

* Identify BTI in the spec_ctrl_{enter,exit}_idle() comments, as other
mitigations will shortly appear.
* Use alternative_input() and cover the lack of memory cobber with a further
barrier.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 9b62eba6c429c327e1507816bef403ccc87357ae)

x86/boot: Detect the firmware SMT setting correctly on Intel hardware

While boot_cpu_data.x86_num_siblings is an accurate value to use on AMD
hardware, it isn't on Intel when the user has disabled Hyperthreading in the
firmware.  As a result, a user which has chosen to disable HT still gets
nagged on L1TF-vulnerable hardware when they haven't chosen an explicit
smt=<bool> setting.

Make use of the largely-undocumented MSR_INTEL_CORE_THREAD_COUNT which in
practice exists since Nehalem, when booting on real hardware.  Fall back to
using the ACPI table APIC IDs.

While adjusting this logic, fix a latent bug in amd_get_topology().  The
thread count field in CPUID.0x8000001e.ebx is documented as 8 bits wide,
rather than 2 bits wide.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit b12fec4a125950240573ea32f65c61fb9afa74c3)

x86/msr: Definitions for MSR_INTEL_CORE_THREAD_COUNT

This is a model specific register which details the current configuration
cores and threads in the package. Because of how Hyperthread and Core
configuration works works in firmware, the MSR it is de-facto constant and
will remain unchanged until the next system reset.

It is a read only MSR (so unilaterally reject writes), but for now retain its
leaky-on-read properties. Further CPUID/MSR work is required before we can
start virtualising a consistent topology to the guest, and retaining the old
behaviour is the safest course of action.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit d4120936bcd1695faf5b575f1259c58e31d2b18b)

x86/spec-ctrl: Reposition the XPTI command line parsing logic

It has ended up in the middle of the mitigation calculation logic. Move it to
be beside the other command line parsing.

No functional change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit c2c2bb0d60c642e64a5243a79c8b1548ffb7bc5b)

x86/spec-ctrl: Extend repoline safey calcuations for eIBRS and Atom parts

All currently-released Atom processors are in practice retpoline-safe, because
they don't fall back to a BTB prediction on RSB underflow.

However, an additional meaning of Enhanced IRBS is that the processor may not
be retpoline-safe. The Gemini Lake platform, based on the Goldmont Plus
microarchitecture is the first Atom processor to support eIBRS.

Until Xen gets full eIBRS support, Gemini Lake will still be safe using
regular IBRS.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: 17f74242ccf0ce6e51c03a5860947865c0ef0dc2
master date: 2019-03-18 16:26:40 +0000

x86/msr: Shorten ARCH_CAPABILITIES_* constants

They are unnecesserily verbose, and ARCH_CAPS_* is already the more common
version.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: ba27aaa88548c824a47dcf5609288ee1c05d2946
master date: 2019-03-18 16:26:40 +0000

x86/e820: fix build with gcc9

e820.c: In function ‘clip_to_limit’:
.../xen/include/asm/string.h:10:26: error: ‘__builtin_memmove’ offset [-16, -36] is out of the bounds [0, 20484] of object ‘e820’ with type ‘struct e820map’ [-Werror=array-bounds]
   10 | #define memmove(d, s, n) __builtin_memmove(d, s, n)
      |                          ^~~~~~~~~~~~~~~~~~~~~~~~~~
e820.c:404:13: note: in expansion of macro ‘memmove’
  404 |             memmove(&e820.map[i], &e820.map[i+1],
      |             ^~~~~~~
e820.c:36:16: note: ‘e820’ declared here
   36 | struct e820map e820;
      |                ^~~~

While I can't see where the negative offsets would come from, converting
the loop index to unsigned type helps. Take the opportunity and also
convert several other local variables and copy_e820_map()'s second
parameter to unsigned int (and bool in one case).

Reported-by: Charles Arnold <carnold@suse.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 22e2f8dddf5fbed885b5e4db3ffc9e1101be9ec0
master date: 2019-03-18 11:38:36 +0100

xen: Fix backport of "xen/cmdline: Fix buggy strncmp(s, LITERAL, ss - s) construct"

These were missed as a consequence of being rebased over other cmdline
cleanup.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

xen: Fix backport of "x86/tsx: Implement controls for RTM force-abort mode"

The posted version of this patch depends on c/s 3c555295 "x86/vpmu: Improve
documentation and parsing for vpmu=" (Xen 4.12 and later) to prevent
`vpmu=rtm-abort` impliying `vpmu=1`, which is outside of security support.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

tools/firmware: update OVMF Makefile, when necessary

[ This is two commits from master aka staging-4.12: ]

OVMF has become dependent on OpenSSL, which is included as a
submodule. Initialise submodules before building.

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Anthony PERARD <anthony.perard@citrix.com>
(cherry picked from commit b16281870e06f5f526029a4e69634a16dc38e8e4)

tools: only call git when necessary in OVMF Makefile

Users may choose to export a snapshot of OVMF and build it
with xen.git supplied ovmf-makefile. In that case we don't
need to call `git submodule`.

Fixes b16281870e.

Reported-by: Olaf Hering <olaf@aepfle.de>
Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Anthony PERARD <anthony.perard@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
(cherry picked from commit 68292c94a60eab24514ab4a8e4772af24dead807)

Arm/atomic: correct asm() constraints in build_add_sized()

The memory operand is an in/out one, and the auxiliary register gets
written to early.

Take the opportunity and also drop the redundant cast (the inline
functions' parameters are already of the casted-to type).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Julien Grall <julien.grall@arm.com>
(cherry picked from commit 51ceb1623b9956440f1b9943c67010a90d61f5c5)