xen.git
7 years ago
xen: Plumb an is_priv boolean into domain_create()
Andrew Cooper [Fri, 29 Jun 2018 16:28:13 +0000 (16:28 +0000)]
xen: Plumb an is_priv boolean into domain_create()

The current mechanism of setting dom0->is_privileged after construction means
that the is_control_domain() predicate returns false during construction.

In particular, this means that the CPUID Faulting special case in
init_domain_msr_policy() fails to take effect.  (In actual fact, faulting
support is advertised to dom0, but attempting to configure it is silently
ignored because of the dom0 special case in ctxt_switch_levelling().)

This could be implemented using a flag in xen_domctl_createdomain, but using
an extra boolean parameter like this means that we can't accidentally allow
domain_create() to create a second dom0 due to parameter mis-auditing.

While adjusting the setting of dom0->is_privileged, drop the redundant zeroing
of dom0->target.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Julien Grall <julien.grall@arm.com>
7 years ago
VMX: don't needlessly write CR4 guest/host mask
Jan Beulich [Mon, 2 Jul 2018 11:12:10 +0000 (13:12 +0200)]
VMX: don't needlessly write CR4 guest/host mask

In shadow mode the field never changes from ~0UL, so there's no need for
a VMWRITE (or an update of its cached value).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper@citrix.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>
7 years ago
x86: move per-vendor early CPU init declarations
Jan Beulich [Mon, 2 Jul 2018 11:11:33 +0000 (13:11 +0200)]
x86: move per-vendor early CPU init declarations

They're local to cpu/, so they belong in cpu/cpu.h (and some of them
have been out of use for quite some time). Drop the asm/setup.h
inclusions then as well.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
7 years ago
x86: remove dead code from cpuid4_cache_lookup()
Jan Beulich [Mon, 2 Jul 2018 11:10:52 +0000 (13:10 +0200)]
x86: remove dead code from cpuid4_cache_lookup()

... and make num_cache_leaves local to the only function using it.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
7 years ago
x86/HPET: drop useless check
Jan Beulich [Mon, 2 Jul 2018 11:10:19 +0000 (13:10 +0200)]
x86/HPET: drop useless check

Commit 9e051a840d ("x86/hpet: Improve handling of timer_deadline")
removed all code between for_each_cpu() and cpumask_test_cpu(),
rendering the latter pointless.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@ctirix.com>
7 years ago
schedulers: validate / correct global data just once
Jan Beulich [Mon, 2 Jul 2018 11:09:46 +0000 (13:09 +0200)]
schedulers: validate / correct global data just once

Also mark command line parsing routine __init.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Dario Faggioli <dfaggioli@suse.com>
7 years ago
svm: don't clear interception for MSRs required for introspection
Razvan Cojocaru [Mon, 2 Jul 2018 11:08:27 +0000 (13:08 +0200)]
svm: don't clear interception for MSRs required for introspection

This patch mirrors the VMX code that doesn't allow
vmx_clear_msr_intercept() to clear interception of MSRs that an
introspection agent is trying to monitor.

Signed-off-by: Razvan Cojocaru <rcojocaru@bitdefender.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
7 years ago
vpci/msi: fix update of bound MSI interrupts
Roger Pau Monné [Mon, 2 Jul 2018 11:07:55 +0000 (13:07 +0200)]
vpci/msi: fix update of bound MSI interrupts

The current update process for already bound MSI interrupts is wrong
because unmap_domain_pirq calls pci_disable_msi, which disables MSI
interrupts on the device. On the other hand map_domain_pirq doesn't
enable MSI, so the current update process for already enabled MSI
entries is wrong because the MSI control bit will be disabled by
unmap_domain_pirq and not re-enabled by map_domain_pirq.

In order to fix this avoid unmapping the PIRQs and just update the
binding of the PIRQ. A new arch helper to do that is introduced.

Note that MSI-X is not affected because unmap_domain_pirq only
disables the MSI enable control bit for the MSI case, for MSI-X the
bit is left untouched by unmap_domain_pirq.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
7 years ago
vpci/msi: split code to bind pirq
Roger Pau Monné [Mon, 2 Jul 2018 11:07:26 +0000 (13:07 +0200)]
vpci/msi: split code to bind pirq

And put it in a separate update function. This is required in order to
improve binding of MSI PIRQs when using vPCI.

No functional change.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
7 years ago
VT-d: reconcile iommu_inclusive_mapping and iommu=dom0-strict
Paul Durrant [Mon, 2 Jul 2018 11:06:49 +0000 (13:06 +0200)]
VT-d: reconcile iommu_inclusive_mapping and iommu=dom0-strict

The documentation for the iommu_inclusive_mapping Xen command line option
states:

"Use this to work around firmware issues providing incorrect RMRR entries"

Unfortunately this workaround does not function correctly if the dom0-strict
iommu option is also specified.

The documentation goes on to say:

"Rather than only mapping RAM pages for IOMMU accesses for Dom0, with this
 option all pages up to 4GB, not marked as unusable in the E820 table, will
 get a mapping established."

This patch modifies the VT-d hardware domain initialization code such that
the workaround will continue to function in dom0-strict mode, by mapping
all pages not marked as unusable *unless* they are RAM pages not assigned
to dom0.

NOTE: This patch modifies the test in drivers/passthrough/vtd/iommu.c from
      need_iommu() to is_pv_domain() since dom0-strict implies need_iommu()
      so we no longer want to gate invocation of vtd_set_hwdom_mapping()
      on that.
      It also exports the iommu_dom0_strict flag so that the implementation
      of vtd_set_hwdom_mapping() can test it explicitly. It would be
      possible to test need_iommu() instead, but it is more illustrative
      to test the original flag rather than one of its side-effects.

Signed-off-by: Paul Durrant <paul.durrant@citrix.com>
Reviewed-by: Roger Pau Monne <roger.pau@citrix.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>
7 years ago
VT-d: re-phrase logic in vtd_set_hwdom_mapping() for clarity
Paul Durrant [Mon, 2 Jul 2018 11:05:36 +0000 (13:05 +0200)]
VT-d: re-phrase logic in vtd_set_hwdom_mapping() for clarity

It is hard to reconcile the comment at the top of the loop in
vtd_set_hwdom_mapping() with the if statement following it. This patch
re-phrases the logic, preserving the semantics, but making it easier
to read.

The patch also modifies the Xen command line documentation to make it
clear that iommu_inclusive_mapping only applies to pages up to the 4GB
boundary.

NOTE: This patch also corrects the indentation of the printk() towards
      the end of vtd_set_hwdom_mapping().

Signed-off-by: Paul Durrant <paul.durrant@citrix.com>
Reviewed-by: Roger Pau Monne <roger.pau@citrix.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>
7 years ago
gnttab: silence table expansion message
Jan Beulich [Thu, 28 Jun 2018 10:49:32 +0000 (12:49 +0200)]
gnttab: silence table expansion message

This currently shows up for basically every domain, when originally it
was logged only when going beyond the default table size. Restore that
behavior.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
7 years ago
x86/XPTI: use %r12 to write zero into xen_cr3
Jan Beulich [Thu, 28 Jun 2018 10:48:47 +0000 (12:48 +0200)]
x86/XPTI: use %r12 to write zero into xen_cr3

Now that we zero all registers early on all entry paths, use that to
avoid a couple of immediates here.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Tested-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
7 years ago
libxc: remove xch parameter from xc_cpuid_policy
Roger Pau Monne [Thu, 28 Jun 2018 10:12:07 +0000 (12:12 +0200)]
libxc: remove xch parameter from xc_cpuid_policy

It's not used by the function or any of the helpers called by it.

Reported-by: Wei Liu <wei.liu2@citrix.com>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
7 years ago
libxc: do not return a value from xc_cpuid_policy
Roger Pau Monne [Thu, 28 Jun 2018 10:12:07 +0000 (12:12 +0200)]
libxc: do not return a value from xc_cpuid_policy

None of the called functions return any errors, so there's no point in
returning an int from xc_cpuid_policy.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
7 years ago
libxc: fix stale PVH comment
Roger Pau Monne [Thu, 28 Jun 2018 10:12:06 +0000 (12:12 +0200)]
libxc: fix stale PVH comment

PVHv2 uses the HVM path, not the PV one.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
7 years ago
x86/vmx: Drop VMX signal for full real-mode
Andrew Cooper [Wed, 23 May 2018 16:53:17 +0000 (16:53 +0000)]
x86/vmx: Drop VMX signal for full real-mode

The hvmloader code which used this signal was deleted 10 years ago (c/s
50b12df83 "x86 vmx: Remove vmxassist").  Furthermore, the value gets discarded
anyway because the HVM domain builder unconditionally sets %rax to 0 in the
same action it uses to set %rip to the appropriate entrypoint.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
7 years ago
x86/vmx: Defer vmx_vmcs_exit() as long as possible in construct_vmcs()
Andrew Cooper [Mon, 28 May 2018 14:02:34 +0000 (15:02 +0100)]
x86/vmx: Defer vmx_vmcs_exit() as long as possible in construct_vmcs()

paging_update_paging_modes() and vmx_vlapic_msr_changed() both operate on the
VMCS being constructed.  Avoid dropping and re-acquiring the reference
multiple times.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>
7 years ago
x86/vmx: Simplify PAT handling during vcpu construction
Andrew Cooper [Thu, 24 May 2018 13:15:32 +0000 (14:15 +0100)]
x86/vmx: Simplify PAT handling during vcpu construction

The host PAT value is a compile time constant, and doesn't need to be read out
of hardware.  Merge this if block into the previous block, which has an
identical condition.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
7 years ago
x86/pat: Simplify host PAT handling
Andrew Cooper [Thu, 24 May 2018 13:09:49 +0000 (14:09 +0100)]
x86/pat: Simplify host PAT handling

With the removal of the 32bit hypervisor build, host_pat is a constant value.
Drop the variable and the redundant cpu_has_pat predicate, and use a define
instead.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
7 years ago
pci: treat class 0 devices as endpoints
Roger Pau Monne [Tue, 8 May 2018 09:33:00 +0000 (11:33 +0200)]
pci: treat class 0 devices as endpoints

Class 0 devices are legacy pre-PCI 2.0 devices that didn't have a
class code. Treat them as endpoints, so that they can be handled by
the IOMMU and properly passed-through to the hardware domain.

Such a device has been seen on a Super Micro server; lspci -vv reports:

00:13.0 Non-VGA unclassified device: Intel Corporation Device a135 (rev 31)
Subsystem: Super Micro Computer Inc Device 0931
Flags: bus master, fast devsel, latency 0, IRQ 11
Memory at df222000 (64-bit, non-prefetchable) [size=4K]
Capabilities: [80] Power Management version 3

Arguably this is not a legacy device (since this is a new server), but
in any case Xen needs to deal with it.

Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
7 years ago
x86/shutdown: use ACPI reboot method for Dell PowerEdge R540
Ross Lagerwall [Mon, 14 May 2018 11:03:00 +0000 (13:03 +0200)]
x86/shutdown: use ACPI reboot method for Dell PowerEdge R540

When EFI booting the Dell PowerEdge R540 it consistently wanders into
the weeds and gets an invalid opcode in the EFI ResetSystem call. This
is the same bug which affects the PowerEdge R740 so fix it in the same
way: quirk this hardware to use the ACPI reboot method instead.

BIOS Information
    Vendor: Dell Inc.
    Version: 1.3.7
    Release Date: 02/09/2018
System Information
    Manufacturer: Dell Inc.
    Product Name: PowerEdge R540

Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
7 years ago
x86/setup: properly update PTEs if src/dst overlaps when relocating Xen image
Daniel Kiper [Fri, 20 Apr 2018 09:54:00 +0000 (11:54 +0200)]
x86/setup: properly update PTEs if src/dst overlaps when relocating Xen image

Commit 0d31d16 (x86/setup: do not relocate Xen over current Xen image
placement) disallowed src/dst image overlaps when relocating the Xen image.
Though it deliberately allowed the destination region between __image_base__
and (__image_base__ + XEN_IMG_OFFSET) to overlap with the end of the source
image. And here is the problem. If anything between __page_tables_start
and __page_tables_end in the source image lands in the overlap then some or
even all page table entries may not be updated. This usually means boom
in early boot which will be difficult to investigate. So, I think
that we have three choices to fix the issue:
  - drop XEN_IMG_OFFSET from
    if ( (end > s) && (end - reloc_size + XEN_IMG_OFFSET >= __pa(_end)) )
  - add XEN_IMG_OFFSET to xen_phys_start in PFN_DOWN(xen_phys_start)
    used in loops as one of conditions and replace ">" with ">=",
  - change PFN_DOWN(xen_phys_start) to PFN_DOWN(xen_remap_end_pfn)
    proposed in earlier version of this patch.

This patch implements the second option. This way we still allow source
and destination partial overlap as described above but PTEs are properly
updated now.

Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
7 years ago
unmodified_drivers: unplug the emulated devices at resume time
Olaf Hering [Tue, 12 Jun 2018 14:11:00 +0000 (16:11 +0200)]
unmodified_drivers: unplug the emulated devices at resume time

Since qemu-2.10 it is required to unplug emulated devices again after
a live migration. If this is not done, qemu's block-backend driver
will be unable to open the backing disk image because it is still held
busy by qemu's IDE driver. As a result the domU's block-frontend driver will
be unable to access the disks, and the domU has to be destroyed.
libxl is unable to detect the situation.

Apply the same workaround for this qemu bug that was done already
years ago in linux.git with commit 512b109ec962 ("xen: unplug the
emulated devices at resume time") to make sure xenlinux based domUs
can be migrated to unfixed hosts.

Signed-off-by: Olaf Hering <olaf@aepfle.de>
Acked-by: Jan Beulich <jbeulich@suse.com>
7 years ago
build: remove stray .*.d2 files during clean/distclean
Daniel Kiper [Tue, 19 Jun 2018 13:51:00 +0000 (15:51 +0200)]
build: remove stray .*.d2 files during clean/distclean

Otherwise e.g. xen/..xen-syms.0.o.d2 and xen/..xen-syms.1.o.d2 files
stay untouched because they are not listed in DEPS_RM variable.

Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
7 years ago
x86/cpuid: fix generation of auto cpuid header
Roger Pau Monne [Wed, 27 Jun 2018 14:33:00 +0000 (16:33 +0200)]
x86/cpuid: fix generation of auto cpuid header

The makefile rule to generate the cpuid-autogen.h header passes the
whole list of dependencies to gen-cpuid.py but only the first
dependency is actually needed.

So far this seems to be harmless.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
7 years ago
x86/idle: don't mix up ACPI and APIC IDs
Jan Beulich [Thu, 28 Jun 2018 07:08:38 +0000 (09:08 +0200)]
x86/idle: don't mix up ACPI and APIC IDs

Correct a log message and, to clarify code as well, rename the
respective function parameter too.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
7 years ago
x86: guard against #NM
Jan Beulich [Thu, 28 Jun 2018 07:08:04 +0000 (09:08 +0200)]
x86: guard against #NM

Just in case we still don't get CR0.TS handling right, prevent a host
crash by honoring exception fixups in do_device_not_available(). This
would in particular cover emulator stubs raising #NM.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
7 years ago
x86/HVM: don't cause #NM to be raised in Xen
Jan Beulich [Thu, 28 Jun 2018 07:07:06 +0000 (09:07 +0200)]
x86/HVM: don't cause #NM to be raised in Xen

The changes for XSA-267 did not touch management of CR0.TS for HVM
guests. In fully eager mode this bit should never be set when
respective vCPU-s are active, or else hvmemul_get_fpu() might leave it
wrongly set, leading to #NM in hypervisor context.

{svm,vmx}_enter() and {svm,vmx}_fpu_dirty_intercept() become unreachable
this way. Explicit {svm,vmx}_fpu_leave() invocations need to be guarded
now.

With no CR0.TS management necessary in fully eager mode, there's also no
need anymore to intercept #NM.

Reported-by: Charles Arnold <carnold@suse.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
7 years ago
libxl: restore passing "readonly=" to qemu for SCSI disks
Ian Jackson [Wed, 13 Jun 2018 14:54:53 +0000 (15:54 +0100)]
libxl: restore passing "readonly=" to qemu for SCSI disks

A read-only check was introduced for XSA-142, commit ef6cb76026 ("libxl:
relax readonly check introduced by XSA-142 fix") added the passing of
the extra setting, but commit dab0539568 ("Introduce COLO mode and
refactor relevant function") dropped the passing of the setting again,
quite likely due to improper re-basing.

Restore the readonly= parameter to SCSI disks.  For IDE disks this is
supposed to be rejected; add an assert.  And there is a bare ad-hoc
disk drive string in libxl__build_device_model_args_new, which we also
update.

This is XSA-266.

Reported-by: Andrew Reimers <andrew.reimers@orionvm.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
7 years ago
libxl: qemu_disk_scsi_drive_string: Break out common parts of disk config
Ian Jackson [Wed, 13 Jun 2018 14:51:36 +0000 (15:51 +0100)]
libxl: qemu_disk_scsi_drive_string: Break out common parts of disk config

The generated configurations are identical apart from, in some cases,
reordering of the id=%s element.  So, overall, no functional change.

This is part of XSA-266.

Reported-by: Andrew Reimers <andrew.reimers@orionvm.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
7 years ago
x86: Refine checks in #DB handler for faulting conditions
Andrew Cooper [Thu, 28 Jun 2018 07:04:20 +0000 (09:04 +0200)]
x86: Refine checks in #DB handler for faulting conditions

One of the fixes for XSA-260 (c/s 75d6828bc2 "x86/traps: Fix handling of #DB
exceptions in hypervisor context") added some safety checks to help avoid
livelocks of #DB faults.

While a General Detect #DB exception does have fault semantics, hardware
clears %dr7.gd on entry to the handler, meaning that it is actually safe to
return to.  Furthermore, %dr6.gd is guest controlled and sticky (never cleared
by hardware).  A malicious PV guest can therefore trigger the fatal_trap() and
crash Xen.

Instruction breakpoints are more tricky.  The breakpoint match bits in %dr6
are not sticky, but the Intel manual warns that they may be set for
non-enabled breakpoints, so add a breakpoint enabled check.

Beyond that, because of the restriction on the linear addresses PV guests can
set, and the fault (rather than trap) nature of instruction breakpoints
(i.e. can't be deferred by a MovSS shadow), there should be no way to
encounter an instruction breakpoint in Xen context.  However, for extra
robustness, deal with this situation by clearing the breakpoint configuration,
rather than crashing.

This is XSA-265

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
7 years ago
x86/mm: don't bypass preemption checks
Jan Beulich [Thu, 28 Jun 2018 07:03:09 +0000 (09:03 +0200)]
x86/mm: don't bypass preemption checks

While unlikely, it is not impossible for a multi-vCPU guest to leverage
bypasses of preemption checks to drive Xen into an unbounded loop.

This is XSA-264.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
7 years ago
x86/cpuid: Fix up stale comments
Andrew Cooper [Fri, 14 Jul 2017 15:27:26 +0000 (15:27 +0000)]
x86/cpuid: Fix up stale comments

 * There is no legacy path any more.  All static information is retrieved in
   the first pass.
 * d->arch.cpuids[] doesn't exist any more.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
7 years ago
x86/EFI: further correct FPU state handling around runtime calls
Jan Beulich [Tue, 26 Jun 2018 13:23:08 +0000 (15:23 +0200)]
x86/EFI: further correct FPU state handling around runtime calls

We must not leave a vCPU with CR0.TS clear when it is not in fully eager
mode and has not touched non-lazy state. Instead of adding a 3rd
invocation of stts() to vcpu_restore_fpu_eager(), consolidate all of
them into a single one done at the end of the function.

Rename the function at the same time to better reflect its purpose, as
the patch touches all of its occurrences anyway.

The new function parameter is not really well named, but
"need_stts_if_not_fully_eager" seemed excessive to me.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
7 years ago
libxl: fix return code in qmp_synchronous_send
Olaf Hering [Thu, 17 May 2018 14:29:57 +0000 (16:29 +0200)]
libxl: fix return code in qmp_synchronous_send

Use error code from libxl namespace, a plain -1 is not valid in this context.

Signed-off-by: Olaf Hering <olaf@aepfle.de>
Acked-by: Wei Liu <wei.liu2@citrix.com>
7 years ago
stubdom/vtpm: fix memcmp in TPM_ChangeAuthAsymFinish
Olaf Hering [Mon, 18 Jun 2018 12:55:36 +0000 (14:55 +0200)]
stubdom/vtpm: fix memcmp in TPM_ChangeAuthAsymFinish

gcc8 spotted this error:
error: 'memcmp' reading 20 bytes from a region of size 8 [-Werror=stringop-overflow=]

Signed-off-by: Olaf Hering <olaf@aepfle.de>
Reviewed-by: Samuel Thibault <samuel.thibault@ens-lyon.org>
7 years ago
x86/dom0: add extra RAM regions as UNUSABLE for PVH memory map
Roger Pau Monné [Tue, 26 Jun 2018 06:48:14 +0000 (08:48 +0200)]
x86/dom0: add extra RAM regions as UNUSABLE for PVH memory map

When running as PVH Dom0 the native memory map is used in order to
craft a tailored memory map for Dom0 taking into account its memory
limit.

Dom0 memory is always going to be smaller than the total amount
of memory present on the host, so in order to prevent Dom0 from
relocating PCI BARs over RAM regions mark all the RAM regions not
available to Dom0 as UNUSABLE in the memory map.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
7 years ago
x86/HVM: alter completion-needed checking
Jan Beulich [Tue, 26 Jun 2018 06:47:17 +0000 (08:47 +0200)]
x86/HVM: alter completion-needed checking

The function only looks at the ioreq_t, so pass it a pointer to just
that. Also use it in hvmemul_do_io().

Suggested-by: Paul Durrant <paul.durrant@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
7 years ago
x86/HVM: attempts to emulate FPU insns need to set fpu_initialised
Jan Beulich [Tue, 26 Jun 2018 06:41:08 +0000 (08:41 +0200)]
x86/HVM: attempts to emulate FPU insns need to set fpu_initialised

My original way of thinking here was that this would be set anyway at
the point state gets reloaded after the adjustments hvmemul_put_fpu()
does, but the flag should already be set before that - after all the
guest may never again touch the FPU before e.g. getting migrated/saved.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Paul Durrant <paul.durrant@citrix.com>
7 years ago
configure: Rerun autogen.sh (on stretch)
Ian Jackson [Mon, 25 Jun 2018 14:17:04 +0000 (15:17 +0100)]
configure: Rerun autogen.sh (on stretch)

This is just a version number update.

Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
7 years ago
README, Makefiles, Config.mk: Update for branching 4.11 vs 4.12-unstable
Ian Jackson [Mon, 25 Jun 2018 14:14:29 +0000 (15:14 +0100)]
README, Makefiles, Config.mk: Update for branching 4.11 vs 4.12-unstable

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
7 years ago
x86/EFI: fix FPU state handling around runtime calls
Jan Beulich [Thu, 21 Jun 2018 09:35:46 +0000 (11:35 +0200)]
x86/EFI: fix FPU state handling around runtime calls

There are two issues.  First, the nonlazy xstates were never restored
after returning from the runtime call.

Secondly, with the fully_eager_fpu mitigation for XSA-267 / LazyFPU, the
unilateral stts() is no longer correct, and hits an assertion later when
a lazy state restore tries to occur for a fully eager vcpu.

Fix both of these issues by calling vcpu_restore_fpu_eager().  As EFI
runtime services can be used in the idle context, the idle assertion
needs to move until after the fully_eager_fpu check.

Introduce a "curr" local variable and replace other uses of "current"
at the same time.

Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Tested-by: Juergen Gross <jgross@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
7 years ago
tools/libxc: retry hypercall in case of EFAULT
Juergen Gross [Mon, 18 Jun 2018 07:18:56 +0000 (09:18 +0200)]
tools/libxc: retry hypercall in case of EFAULT

A hypercall issued via the privcmd driver can very rarely return
-EFAULT even if the hypercall buffers are locked in memory. This
happens for hypercall buffers in user memory when the Linux kernel
is doing memory scans e.g. for page migration or compaction.

Retry the getpageframeinfo3 hypercall up to 2 times in case
-EFAULT is returned and the hypervisor might see invalid PTEs for
user hypercall buffers (which should be the case only if the kernel
doesn't offer a /dev/xen/hypercall node).

Signed-off-by: Juergen Gross <jgross@suse.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
7 years ago
tools/libxencalls: add new function to query hypercall buffer safety
Juergen Gross [Mon, 18 Jun 2018 07:18:55 +0000 (09:18 +0200)]
tools/libxencalls: add new function to query hypercall buffer safety

Add a new function to query whether hypercall buffers are always safe
to access by the hypervisor or might result in EFAULT.

Signed-off-by: Juergen Gross <jgross@suse.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
7 years ago
tools/libxencall: use hypercall buffer device if available
Juergen Gross [Mon, 18 Jun 2018 07:18:54 +0000 (09:18 +0200)]
tools/libxencall: use hypercall buffer device if available

Instead of using anonymous memory for hypercall buffers which is then
locked into memory, use the hypercall buffer device of the Linux
privcmd driver if available.

This has the advantage of needing just a single mmap() for allocating
the buffer, and page migration or compaction can't make the buffer
inaccessible for the hypervisor.

Signed-off-by: Juergen Gross <jgross@suse.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
7 years ago
x86/HVM: account for fully eager FPU mode in emulation
Jan Beulich [Fri, 15 Jun 2018 09:49:06 +0000 (11:49 +0200)]
x86/HVM: account for fully eager FPU mode in emulation

In fully eager mode we must not clear fpu_dirtied, set CR0.TS, or invoke
the fpu_leave() hook. Instead do what the mode's name says: Restore
state right away.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
7 years ago
x86/spec-ctrl: Mitigations for LazyFPU
Andrew Cooper [Thu, 7 Jun 2018 16:00:37 +0000 (17:00 +0100)]
x86/spec-ctrl: Mitigations for LazyFPU

Intel Core processors since at least Nehalem speculate past #NM, which is the
mechanism by which lazy FPU context switching is implemented.

On affected processors, Xen must use fully eager FPU context switching to
prevent guests from being able to read FPU state (SSE/AVX/etc) from previously
scheduled vcpus.

This is part of XSA-267 / CVE-2018-3665

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
7 years ago
x86: Support fully eager FPU context switching
Andrew Cooper [Thu, 7 Jun 2018 16:00:37 +0000 (17:00 +0100)]
x86: Support fully eager FPU context switching

This is controlled on a per-vcpu basis for flexibility.

This is part of XSA-267 / CVE-2018-3665

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
7 years ago
scripts/add_maintainers.pl: Don't call get_maintainers.pl with -f
Julien Grall [Tue, 5 Jun 2018 16:39:38 +0000 (17:39 +0100)]
scripts/add_maintainers.pl: Don't call get_maintainers.pl with -f

The option -f of scripts/get_maintainers.pl will return the maintainers
of a given file, *not* the list of maintainers if the file was a patch.

The output expected of add_maintainers is the latter, so drop the option
-f.

Signed-off-by: Julien Grall <julien.grall@arm.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Lars Kurth <lars.kurth@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
7 years ago
xen/sndif: Change stream's unique-id to string
Oleksandr Andrushchenko [Fri, 8 Jun 2018 06:08:31 +0000 (09:08 +0300)]
xen/sndif: Change stream's unique-id to string

Display and input protocols define the "unique-id" XenBus field as a string,
which is much more flexible for defining unique identifiers compared
to the integer used by the sound protocol. For example, this allows
UUIDs to be provided as unique IDs. Align the sound protocol with display
and input and redefine the "unique-id" field as a string.

Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
7 years ago
xen/displif: Add unique display connector identifier
Oleksandr Andrushchenko [Fri, 8 Jun 2018 06:08:30 +0000 (09:08 +0300)]
xen/displif: Add unique display connector identifier

If the frontend is configured to expose multiple connectors then the backend
may require a way to uniquely identify a concrete virtual connector within
the frontend. This is useful for use cases where a connector needs to be
matched to a physical display connector.
Add a XenBus "unique-id" node parameter, so this sort of use case can
be implemented.

Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
7 years ago xen/kbdif: Add unique input device identifier
Oleksandr Andrushchenko [Fri, 8 Jun 2018 06:08:29 +0000 (09:08 +0300)]
xen/kbdif: Add unique input device identifier

If the frontend is configured to expose multiple input device instances
then the backend may require a way to uniquely identify a concrete input
device within the frontend. This is useful for use-cases where a
virtual input device needs to be matched to a physical input device.
Add a XenBus "unique-id" node parameter, so this sort of use-case can
be implemented.

Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
7 years ago xen/kbdif: Move multi-touch device parameters to backend nodes
Oleksandr Andrushchenko [Fri, 8 Jun 2018 06:08:28 +0000 (09:08 +0300)]
xen/kbdif: Move multi-touch device parameters to backend nodes

In the current kbdif protocol definition multi-touch device parameters
are described as part of the frontend's XenBus configuration nodes while
they belong to the backend's configuration. Fix this by moving
the parameters to the proper section.

Fixes: b7a3ce49d528 ("xen/kbdif: add multi-touch support")
Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
Reported-by: Oleksandr Grytsov <oleksandr_grytsov@epam.com>
Reviewed-by: Oleksandr Grytsov <oleksandr_grytsov@epam.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
7 years ago x86/VT-x: Fix printing of EFER in vmcs_dump_vcpu()
Andrew Cooper [Thu, 31 May 2018 15:57:47 +0000 (16:57 +0100)]
x86/VT-x: Fix printing of EFER in vmcs_dump_vcpu()

This is essentially a "take 2" of c/s 82540b66ce "x86/VT-x: Fix determination
of EFER.LMA in vmcs_dump_vcpu()" because in hindsight, that change was more
problematic than useful.

The original reason was to fix the logic for determining when not to print the
PDPTE pointers.  However, mutating the efer variable (particularly LME and
LMA) before printing it interferes with diagnosing vmentry failures.

Instead of modifying efer, change the PDPTE conditional to use
VM_ENTRY_IA32E_MODE.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
7 years ago ocaml/xenstored: reduce use of unsafe conversions
Marcello Seri [Thu, 31 May 2018 13:05:37 +0000 (14:05 +0100)]
ocaml/xenstored: reduce use of unsafe conversions

The rationalisation of the Xs_ring interface in the xb library
allows further reducing the unsafe calls without introducing
copies. This patch also contains some further code cleanups.

Signed-off-by: Marcello Seri <marcello.seri@citrix.com>
Acked-by: Christian Lindig <christian.lindig@citrix.com>
Tested-by: Andrew Cooper <andrew.cooper3@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
7 years ago ocaml/libs/xb: Use bytes in place of strings for mutable buffers
Marcello Seri [Thu, 31 May 2018 13:05:36 +0000 (14:05 +0100)]
ocaml/libs/xb: Use bytes in place of strings for mutable buffers

Since OCaml 4.06.0, which made safe-string on by default, the compiler is
allowed to perform optimisations on immutable strings.  They should no
longer be used as mutable buffers, and bytes should be used instead.

The C stubs for Xs_ring have been updated to use bytes, and the interface
rationalised mimicking the new Unix module in the standard library (the
implementation of Unix.write_substring uses unsafe_of_string in the exact same
way, and both the write implementations are using the bytes as an immutable
payload for the write).

Signed-off-by: Marcello Seri <marcello.seri@citrix.com>
Acked-by: Christian Lindig <christian.lindig@citrix.com>
Tested-by: Andrew Cooper <andrew.cooper3@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
7 years ago x86/traps: Fix error handling of the pv %dr7 shadow state
Andrew Cooper [Fri, 1 Jun 2018 13:08:59 +0000 (14:08 +0100)]
x86/traps: Fix error handling of the pv %dr7 shadow state

c/s "x86/pv: Introduce and use x86emul_write_dr()" fixed a bug with IO shadow
handling, in that it remained stale and visible until %dr7.L/G got set again.

However, it neglected the -EPERM return in between these two hunks, introducing
a different bug in which a write to %dr7 which tries to set IO breakpoints
without %cr4.DE being set clobbers the IO state, rather than leaves it alone.

Instead, move the zeroing slightly later, which guarantees that the shadow
gets written exactly once, on a successful update to %dr7.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
7 years ago x86/CPUID: don't override tool stack decision to hide STIBP
Jan Beulich [Tue, 29 May 2018 10:39:24 +0000 (12:39 +0200)]
x86/CPUID: don't override tool stack decision to hide STIBP

Other than in the feature sets, where we indeed want to offer the
feature even if not enumerated on hardware, we shouldn't dictate the
feature being available if tool stack or host admin have decided to not
expose it (for whatever [questionable?] reason). That feature set side
override is sufficient to achieve the intended guest side safety
property (in offering - by default - STIBP independent of actual
availability in hardware).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
7 years ago x86: correct default_xen_spec_ctrl calculation
Jan Beulich [Tue, 29 May 2018 10:38:52 +0000 (12:38 +0200)]
x86: correct default_xen_spec_ctrl calculation

Even with opt_msr_sc_{pv,hvm} both false we should set up the variable
as usual, to ensure proper one-time setup during boot and CPU bringup.
This then also brings the code in line with the comment immediately
ahead of the printk() being modified saying "irrespective of guests".

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
7 years ago x86: suppress sync when XPTI is disabled for a domain
Jan Beulich [Tue, 29 May 2018 10:38:09 +0000 (12:38 +0200)]
x86: suppress sync when XPTI is disabled for a domain

Now that we have a per-domain flag we can and should control sync-ing in
a more fine grained manner: Only domains having XPTI enabled need the
sync to occur.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
7 years ago libxc/x86/PV: don't hand through CPUID leaf 0x80000008 as is
Jan Beulich [Tue, 22 May 2018 11:40:02 +0000 (05:40 -0600)]
libxc/x86/PV: don't hand through CPUID leaf 0x80000008 as is

Just like for HVM the feature set should be used for EBX output, while
EAX should be restricted to the low 16 bits and ECX/EDX should be zero.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
7 years ago tools/kdd: alternative way of muting spurious gcc warning
Marek Marczykowski-Górecki [Tue, 22 May 2018 19:47:45 +0000 (21:47 +0200)]
tools/kdd: alternative way of muting spurious gcc warning

Older gcc does not support #pragma GCC diagnostics, so use an alternative
approach: change the variable type to uint32_t (this code handles 32-bit
requests only anyway), which apparently also avoids gcc complaining about
this (otherwise correct) code.

Fixes 437e00fea04becc91c1b6bc1c0baa636b067a5cc "tools/kdd: mute spurious
gcc warning"

Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
Acked-by: Tim Deegan <tim@xen.org>
7 years ago docs/process/xen-release-management: Lesson to learn
Ian Jackson [Wed, 13 Dec 2017 11:58:00 +0000 (11:58 +0000)]
docs/process/xen-release-management: Lesson to learn

The 4.10 release preparation was significantly more hairy than ideal.
(We seem to have a good overall outcome despite, rather than because
of, our approach.)

This is the second time (at least) that we have come close to failure
by committing to a release date before the exact code to be released
is known and has been made and tested.

Evidently our docs make it insufficiently clear not to do that.

CC: Julien Grall <julien.grall@arm.com>
Acked-by: Juergen Gross <jgross@suse.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Lars Kurth <lars.kurth@citrix.com>
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
7 years ago docs/process: Add RUBRIC
Ian Jackson [Tue, 22 May 2018 16:39:52 +0000 (17:39 +0100)]
docs/process: Add RUBRIC

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Acked-by: Juergen Gross <jgross@suse.com>
7 years ago x86/traps: Dump the instruction stream even for double faults
Andrew Cooper [Thu, 24 May 2018 14:06:16 +0000 (15:06 +0100)]
x86/traps: Dump the instruction stream even for double faults

This helps debug #DF's which occur in alternative patches

Reported-by: George Dunlap <george.dunlap@eu.citrix.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
7 years ago x86/XPTI: fix S3 resume (and CPU offlining in general)
Jan Beulich [Mon, 28 May 2018 09:20:26 +0000 (11:20 +0200)]
x86/XPTI: fix S3 resume (and CPU offlining in general)

We should index an L1 table with an L1 index.

Reported-by: Simon Gaiser <simon@invisiblethingslab.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
7 years ago x86/HVM: correct mtrr_pat_not_equal()
Jan Beulich [Tue, 22 May 2018 14:01:26 +0000 (16:01 +0200)]
x86/HVM: correct mtrr_pat_not_equal()

The two vCPU-s differing in MTRR-enabled state means MTRR settings are
not equal. Both vCPU-s having MTRRs disabled means only PAT needs to be
compared. Along those lines for fixed range MTRRs. Differing variable
range counts likewise mean settings are different overall (even if
that's not a very reasonable setup to have).

Constify types and convert bool_t to bool.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
7 years ago x86: correct vCPU dirty CPU handling
Jan Beulich [Tue, 22 May 2018 14:00:32 +0000 (16:00 +0200)]
x86: correct vCPU dirty CPU handling

Commit df8234fd2c ("replace vCPU's dirty CPU mask by numeric ID") was
too lax in two respects: First of all it didn't consider the case of a
vCPU not having a valid dirty CPU in the descriptor table TLB flush
case. This is the issue Manuel has run into with NetBSD.

Additionally reads of ->dirty_cpu for other than the current vCPU are at
risk of racing with scheduler actions, i.e. single atomic reads need to
be used there. Obviously the non-init write sites then better also use
atomic writes.

Having to touch the descriptor table TLB flush code here anyway, take
the opportunity and switch it to be at most one flush_tlb_mask()
invocation.

Reported-by: Manuel Bouyer <bouyer@antioche.eu.org>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
7 years ago x86/spec-ctrl: Rename ARCH_CAPS.SSBD_NO to SSB_NO
Andrew Cooper [Wed, 28 Mar 2018 14:21:39 +0000 (15:21 +0100)]
x86/spec-ctrl: Rename ARCH_CAPS.SSBD_NO to SSB_NO

A last-minute rename of the feature occurred, and the patch committed to
staging was unfortunately stale.  Correct it.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
7 years ago x86/msr: Virtualise MSR_SPEC_CTRL.SSBD for guests to use
Andrew Cooper [Fri, 13 Apr 2018 15:42:34 +0000 (15:42 +0000)]
x86/msr: Virtualise MSR_SPEC_CTRL.SSBD for guests to use

Almost all infrastructure is already in place.  Update the reserved bits
calculation in guest_wrmsr(), and offer SSBD to guests by default.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
7 years ago x86/Intel: Mitigations for GPZ SP4 - Speculative Store Bypass
Andrew Cooper [Wed, 28 Mar 2018 14:21:39 +0000 (15:21 +0100)]
x86/Intel: Mitigations for GPZ SP4 - Speculative Store Bypass

To combat GPZ SP4 "Speculative Store Bypass", Intel have extended their
speculative sidechannel mitigations specification as follows:

 * A feature bit to indicate that Speculative Store Bypass Disable is
   supported.
 * A new bit in MSR_SPEC_CTRL which, when set, disables memory disambiguation
   in the pipeline.
 * A new bit in MSR_ARCH_CAPABILITIES, which will be set in future hardware,
   indicating that the hardware is not susceptible to Speculative Store Bypass
   sidechannels.

For contemporary processors, this interface will be implemented via a
microcode update.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
7 years ago x86/AMD: Mitigations for GPZ SP4 - Speculative Store Bypass
Andrew Cooper [Thu, 26 Apr 2018 09:56:28 +0000 (10:56 +0100)]
x86/AMD: Mitigations for GPZ SP4 - Speculative Store Bypass

AMD processors will execute loads and stores with the same base register in
program order, which is typically how a compiler emits code.

Therefore, by default no mitigating actions are taken, despite there being
corner cases which are vulnerable to the issue.

For performance testing, or for users with particularly sensitive workloads,
the `spec-ctrl=ssbd` command line option is available to force Xen to disable
Memory Disambiguation on applicable hardware.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
7 years ago doc: correct livepatch.markdown syntax
Juergen Gross [Tue, 8 May 2018 06:47:30 +0000 (08:47 +0200)]
doc: correct livepatch.markdown syntax

"make -C docs all" fails due to incorrect markdown syntax in
livepatch.markdown. Correct it.

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Misc fixes:
 * Insert real URLs
 * Drop trailing whitespace
 * Consistent alignment and indentation for code blocks and lists
 * Consistent capitalisation
 * Consistent use of `` blocks for command line arguments and function names
 * Rearrange things not to leave < and > in the text

No change in content.  The document now reads rather more consistently in HTML
and PDF form.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
7 years ago xl: show full value of cpu_khz in xl info output
Olaf Hering [Tue, 3 Apr 2018 11:14:11 +0000 (13:14 +0200)]
xl: show full value of cpu_khz in xl info output

The exact value of cpu_khz can only be obtained via 'xl dmesg', and
therefore can be lost after some time. 'xl info' truncates the value to
full MHz. Adjust the output to show the full khz value.
This helps the host admin to track how a host has calibrated itself. The
value of cpu_khz is used during live migration for the decision if
access to TSC should be emulated.

Commit eb5277a30e ("bitkeeper revision 1.959.1.4
(40d04a87acOb29u-5Y5OxMhHvP2x9g)" gives no hint why cpu_mhz instead of
cpu_khz was chosen.

Signed-off-by: Olaf Hering <olaf@aepfle.de>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
7 years ago Config.mk: Update QEMU to include build fixes
Anthony PERARD [Fri, 18 May 2018 16:17:54 +0000 (17:17 +0100)]
Config.mk: Update QEMU to include build fixes

This tag includes two build fixes:
- dump: Fix build with newer gcc
    Fix build with GCC-8
- Fix libusb-1.0.22 deprecated libusb_set_debug with libusb_set_option

Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
7 years ago xen/kbdif: Add features to disable keyboard and pointer
Oleksandr Andrushchenko [Wed, 2 May 2018 14:49:19 +0000 (17:49 +0300)]
xen/kbdif: Add features to disable keyboard and pointer

It is now not fully possible to control if and which virtual devices
are created by the frontend, e.g. keyboard and pointer devices
are always created and a multi-touch device is created if the
backend advertises multi-touch support. In some cases this
behavior is not desirable and better control over the frontend's
configuration is required.

Add new XenStore feature fields, so it is possible to individually
control the set of exposed virtual devices for each guest OS:
 - set feature-disable-keyboard to 1 if no keyboard device needs
   to be created
 - set feature-disable-pointer to 1 if no pointer device needs
   to be created

Keep old behavior by default.

Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
7 years ago scripts/add_maintainers.pl: New script
Lars Kurth [Fri, 11 May 2018 16:33:00 +0000 (17:33 +0100)]
scripts/add_maintainers.pl: New script

This provides a much better workflow when using git format-patch and
git send-email, with get_maintainer.pl.

The tool covers step 2 of the following workflow

  Step 1: git format-patch ... -o <patchdir> ...
  Step 2: ./scripts/add_maintainers.pl -d <patchdir>
          This overwrites  *.patch files in <patchdir>
  Step 3: git send-email -to xen-devel@lists.xenproject.org <patchdir>/*.patchxm

I manually tested all options and the most common combinations
on Mac.

Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: George Dunlap <George.Dunlap@eu.citrix.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Julien Grall <julien.grall@arm.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Tim Deegan <tim@xen.org>
Cc: Wei Liu <wei.liu2@citrix.com>
Signed-off-by: Lars Kurth <lars.kurth@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Acked-by: Lars Kurth <lars.kurth@citrix.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>
7 years ago vpci/msi: fix unbind loop
Roger Pau Monné [Wed, 16 May 2018 14:28:46 +0000 (16:28 +0200)]
vpci/msi: fix unbind loop

The current unbind loop on failure in vpci_msi_enable is wrong and
will only work correctly if the initial pirq is 0. Fix this by adding
a proper bound.

Reported-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
7 years ago x86/spec_ctrl: Introduce a new `spec-ctrl=` command line argument to replace `bti=`
Andrew Cooper [Thu, 26 Apr 2018 09:52:55 +0000 (10:52 +0100)]
x86/spec_ctrl: Introduce a new `spec-ctrl=` command line argument to replace `bti=`

In hindsight, the options for `bti=` aren't as flexible or useful as expected
(including several options which don't appear to behave as intended).
Changing the behaviour of an existing option is problematic for compatibility,
so introduce a new `spec-ctrl=` in the hopes that we can do better.

One common way of deploying Xen is with a single PV dom0 and all domUs being
HVM domains.  In such a setup, an administrator who has weighed up the risks
may wish to forgo protection against malicious PV domains, to reduce the
overall performance hit.  To cater for this usecase, `spec-ctrl=no-pv` will
disable all speculative protection for PV domains, while leaving all
speculative protection for HVM domains intact.

For coding clarity as much as anything else, the suboptions are grouped by
logical area; those which affect the alternatives blocks, and those which
affect Xen's in-hypervisor settings.  See the xen-command-line.markdown for
full details of the new options.

While changing the command line options, take the time to change how the data
is reported to the user.  The three DEBUG printks are upgraded to unilateral,
as they are all relevant pieces of information, and the old "mitigations:"
line is split in the two logical areas described above.

Sample output from booting with `spec-ctrl=no-pv` looks like:

  (XEN) Speculative mitigation facilities:
  (XEN)   Hardware features: IBRS/IBPB STIBP IBPB
  (XEN)   Compiled-in support: INDIRECT_THUNK
  (XEN)   Xen settings: BTI-Thunk RETPOLINE, SPEC_CTRL: IBRS-, Other: IBPB
  (XEN)   Support for VMs: PV: None, HVM: MSR_SPEC_CTRL RSB
  (XEN)   XPTI (64-bit PV only): Dom0 enabled, DomU enabled

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
7 years ago x86/cpuid: Improvements to guest policies for speculative sidechannel features
Andrew Cooper [Tue, 1 May 2018 10:59:03 +0000 (11:59 +0100)]
x86/cpuid: Improvements to guest policies for speculative sidechannel features

If Xen isn't virtualising MSR_SPEC_CTRL for guests, IBRSB shouldn't be
advertised.  It is not currently possible to express this via the existing
command line options, but such an ability will be introduced.

Another useful option in some usecases is to offer IBPB without IBRS.  When a
guest kernel is known to be compatible (uses retpoline and knows about the AMD
IBPB feature bit), an administrator with pre-Skylake hardware may wish to hide
IBRS.  This allows the VM to have full protection, without Xen or the VM
needing to touch MSR_SPEC_CTRL, which can reduce the overhead of Spectre
mitigations.

Break the logic common to both PV and HVM CPUID calculations into a common
helper, to avoid duplication.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
7 years ago x86/spec_ctrl: Explicitly set Xen's default MSR_SPEC_CTRL value
Andrew Cooper [Wed, 9 May 2018 12:59:56 +0000 (13:59 +0100)]
x86/spec_ctrl: Explicitly set Xen's default MSR_SPEC_CTRL value

With the impending ability to disable MSR_SPEC_CTRL handling on a
per-guest-type basis, the first exit-from-guest may not have the side effect
of loading Xen's choice of value.  Explicitly set Xen's default during the BSP
and AP boot paths.

For the BSP however, delay setting a non-zero MSR_SPEC_CTRL default until
after dom0 has been constructed when safe to do so.  Oracle report that this
speeds up boots of some hardware by 50s.

"when safe to do so" is based on whether we are virtualised.  A native boot
won't have any other code running in a position to mount an attack.

Reported-by: Zhenzhong Duan <zhenzhong.duan@oracle.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
7 years ago x86/spec_ctrl: Split X86_FEATURE_SC_MSR into PV and HVM variants
Andrew Cooper [Tue, 17 Apr 2018 13:15:04 +0000 (14:15 +0100)]
x86/spec_ctrl: Split X86_FEATURE_SC_MSR into PV and HVM variants

In order to separately control whether MSR_SPEC_CTRL is virtualised for PV and
HVM guests, split the feature used to control runtime alternatives into two.
Xen will use MSR_SPEC_CTRL itself if either of these features are active.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
7 years ago x86/spec_ctrl: Elide MSR_SPEC_CTRL handling in idle context when possible
Andrew Cooper [Mon, 7 May 2018 13:06:16 +0000 (14:06 +0100)]
x86/spec_ctrl: Elide MSR_SPEC_CTRL handling in idle context when possible

If Xen is virtualising MSR_SPEC_CTRL handling for guests, but using 0 as its
own MSR_SPEC_CTRL value, spec_ctrl_{enter,exit}_idle() need not write to the
MSR.

Requested-by: Jan Beulich <JBeulich@suse.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
7 years ago x86/spec_ctrl: Rename bits of infrastructure to avoid NATIVE and VMEXIT
Andrew Cooper [Mon, 30 Apr 2018 13:20:23 +0000 (14:20 +0100)]
x86/spec_ctrl: Rename bits of infrastructure to avoid NATIVE and VMEXIT

In hindsight, using NATIVE and VMEXIT as naming terminology was not clever.
A future change wants to split SPEC_CTRL_EXIT_TO_GUEST into PV and HVM
specific implementations, and using VMEXIT as a term is completely wrong.

Take the opportunity to fix some stale documentation in spec_ctrl_asm.h.  The
IST helpers were missing from the large comment block, and since
SPEC_CTRL_ENTRY_FROM_INTR_IST was introduced, we've gained a new piece of
functionality which currently depends on the fine grain control, which exists
in lieu of livepatching.  Note this in the comment.

No functional change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
7 years ago x86/spec_ctrl: Fold the XEN_IBRS_{SET,CLEAR} ALTERNATIVES together
Andrew Cooper [Tue, 17 Apr 2018 13:15:04 +0000 (14:15 +0100)]
x86/spec_ctrl: Fold the XEN_IBRS_{SET,CLEAR} ALTERNATIVES together

Currently, the SPEC_CTRL_{ENTRY,EXIT}_* macros encode Xen's choice of
MSR_SPEC_CTRL as an immediate constant, and chooses between IBRS or not by
doubling up the entire alternative block.

There is now a variable holding Xen's choice of value, so use that and
simplify the alternatives.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
7 years ago x86/spec_ctrl: Merge bti_ist_info and use_shadow_spec_ctrl into spec_ctrl_flags
Andrew Cooper [Tue, 17 Apr 2018 13:15:04 +0000 (14:15 +0100)]
x86/spec_ctrl: Merge bti_ist_info and use_shadow_spec_ctrl into spec_ctrl_flags

All 3 bits of information here are control flags for the entry/exit code
behaviour.  Treat them as such, rather than having two different variables.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
7 years ago x86/spec_ctrl: Express Xen's choice of MSR_SPEC_CTRL value as a variable
Andrew Cooper [Tue, 17 Apr 2018 13:15:04 +0000 (14:15 +0100)]
x86/spec_ctrl: Express Xen's choice of MSR_SPEC_CTRL value as a variable

At the moment, we have two different encodings of Xen's MSR_SPEC_CTRL value,
which is a side effect of how the Spectre series developed.  One encoding is
via an alias with the bottom bit of bti_ist_info, and can encode IBRS or not,
but not other configurations such as STIBP.

Break Xen's value out into a separate variable (in the top of stack block for
XPTI reasons) and use this instead of bti_ist_info in the IST path.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
7 years ago x86/spec_ctrl: Read MSR_ARCH_CAPABILITIES only once
Andrew Cooper [Thu, 26 Apr 2018 11:21:00 +0000 (12:21 +0100)]
x86/spec_ctrl: Read MSR_ARCH_CAPABILITIES only once

Make it available from the beginning of init_speculation_mitigations(), and
pass it into appropriate functions.  Fix an RSBA typo while moving the
affected comment.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
7 years ago tools/ocaml/libs/xc fix gcc-8 format-truncation warning
John Thomson [Tue, 15 May 2018 01:48:43 +0000 (11:48 +1000)]
tools/ocaml/libs/xc fix gcc-8 format-truncation warning

 CC       xenctrl_stubs.o
xenctrl_stubs.c: In function 'failwith_xc':
xenctrl_stubs.c:65:17: error: 'snprintf' output may be truncated before the last format character [-Werror=format-truncation=]
      "%d: %s: %s", error->code,
                 ^
xenctrl_stubs.c:64:4: note: 'snprintf' output 6 or more bytes (assuming 1029) into a destination of size 1028
    snprintf(error_str, sizeof(error_str),
    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      "%d: %s: %s", error->code,
      ~~~~~~~~~~~~~~~~~~~~~~~~~~
      xc_error_code_to_desc(error->code),
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      error->message);
      ~~~~~~~~~~~~~~~
cc1: all warnings being treated as errors
make[8]: *** [/build/xen-git/src/xen/tools/ocaml/libs/xc/../../Makefile.rules:37: xenctrl_stubs.o] Error 1

Signed-off-by: John Thomson <git@johnthomson.fastmail.com.au>
Acked-by: Christian Lindig <christian.lindig@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
7 years ago xen/kbdif: Add string constants for raw pointer
Oleksandr Andrushchenko [Wed, 2 May 2018 14:49:18 +0000 (17:49 +0300)]
xen/kbdif: Add string constants for raw pointer

Add missing string constants for {feature|request}-raw-pointer
to align with the rest of the interface file.

Fixes 7868654ff7fe ("kbdif: Define "feature-raw-pointer" and "request-raw-pointer")

Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
7 years ago docs/parse-support-md: Correctly process caveats in multi-status sections
Ian Jackson [Tue, 15 May 2018 14:41:14 +0000 (15:41 +0100)]
docs/parse-support-md: Correctly process caveats in multi-status sections

When SUPPORT.md uses the syntax
  Status, <some thing>: <support status>
the caveats were lost (not footnoted) because they were attached
only to <some thing>.

Caveats occur in running text, so they are necessarily part of a real
section, not an individual status line like that.  So attach them to
the RealSectNode, and look there for them.

Reported-by: Lars Kurth <lars.kurth@citrix.com>
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Acked-by: Lars Kurth <Lars.kurth@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
7 years ago docs/parse-support-md: Provide $sectnode->{RealSectNode}
Ian Jackson [Tue, 15 May 2018 14:39:03 +0000 (15:39 +0100)]
docs/parse-support-md: Provide $sectnode->{RealSectNode}

No functional change yet.

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Acked-by: Lars Kurth <Lars.kurth@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
7 years ago docs/parse-support-md: Rename RealSect to RealInSect
Ian Jackson [Tue, 15 May 2018 14:35:00 +0000 (15:35 +0100)]
docs/parse-support-md: Rename RealSect to RealInSect

This makes the distinction between insections and sectnodes clearer.

No functional change.

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Acked-by: Lars Kurth <Lars.kurth@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
7 years ago viridian: fix cpuid leaf 0x40000003
Paul Durrant [Fri, 11 May 2018 14:48:32 +0000 (15:48 +0100)]
viridian: fix cpuid leaf 0x40000003

The response to viridian leaf 3 needs to split a 64-bit mask across EAX and
EBX, with the low order 32 bits in EAX and the high order 32 bits in EBX.
To facilitate this a union of two uint32_t values and the mask (type
HV_PARTITION_PRIVILEGE_MASK) is allocated on stack as follows:

union {
    HV_PARTITION_PRIVILEGE_MASK mask;
    uint32_t lo, hi;
} u;

This, of course, is incorrect as both lo and hi will alias the low order
32 bits of the mask.

This patch wraps lo and hi in an anonymous struct to achieve the desired
effect.

NOTE: Fixing this also stops Windows making the HvGetPartitionId hypercall
      which was previously considered erroneous behaviour. Thus the
      hypercall handler is also modified to stop squashing the
      'unimplemented' warning for this hypercall.

Signed-off-by: Paul Durrant <paul.durrant@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
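Added illustration, not part of the Xen commit: a small C program showing why the union in the message aliases lo and hi onto the same 32 bits, beside the anonymous-struct layout the patch adopts. It assumes HV_PARTITION_PRIVILEGE_MASK is a 64-bit mask (a uint64_t stands in), a little-endian host as on x86, and the helper names are invented here.

```c
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>
/* Assumption: HV_PARTITION_PRIVILEGE_MASK is a 64-bit mask; uint64_t stands in. */
typedef uint64_t privilege_mask_t;
/* Layout the commit message calls incorrect: every union member starts at
 * offset 0, so lo and hi both overlay the low 32 bits of mask. */
union split_buggy {
    privilege_mask_t mask;
    uint32_t lo, hi;
};

/* Corrected layout: the anonymous struct (C11) places hi after lo, so on a
 * little-endian host lo covers bits 0..31 and hi covers bits 32..63. */
union split_fixed {
    privilege_mask_t mask;
    struct { uint32_t lo, hi; };
};

/* EAX/EBX pair the leaf would return, computed through the buggy layout. */
static void split_buggy(privilege_mask_t mask, uint32_t *eax, uint32_t *ebx)
{
    union split_buggy u = { .mask = mask };
    *eax = u.lo;
    *ebx = u.hi;
}

static void split_fixed(privilege_mask_t mask, uint32_t *eax, uint32_t *ebx)
{
    union split_fixed u = { .mask = mask };
    *eax = u.lo;
    *ebx = u.hi;
}

/* Endian-independent oracle: low half belongs in EAX, high half in EBX. */
static bool split_is_correct(privilege_mask_t mask, uint32_t eax, uint32_t ebx)
{
    return eax == (uint32_t)mask && ebx == (uint32_t)(mask >> 32);
}

int main(void)
{
    privilege_mask_t mask = 0x1122334455667788ULL; /* constructed sample value */
    uint32_t eax, ebx;
    split_buggy(mask, &eax, &ebx);
    printf("buggy: eax=%08x ebx=%08x ok=%d\n", eax, ebx, split_is_correct(mask, eax, ebx));
    split_fixed(mask, &eax, &ebx);
    printf("fixed: eax=%08x ebx=%08x ok=%d\n", eax, ebx, split_is_correct(mask, eax, ebx));
    return 0;
}
```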
7 years ago  libacpi: fixes for iasl >= 20180427
Roger Pau Monné [Wed, 9 May 2018 10:08:12 +0000 (11:08 +0100)]
libacpi: fixes for iasl >= 20180427

New versions of iasl have introduced improved C file generation, as
reported in the changelog:

iASL: Enhanced the -tc option (which creates an AML hex file in C,
suitable for import into a firmware project):
  1) Create a unique name for the table, to simplify use of multiple
SSDTs.
  2) Add a protection #ifdef in the file, similar to a .h header file.

The net effect of that on generated files is:

-unsigned char AmlCode[] =
+#ifndef __SSDT_S4_HEX__
+#define __SSDT_S4_HEX__
+
+unsigned char ssdt_s4_aml_code[] =

The above example is from ssdt_s4.asl.

Fix the build with newer versions of iasl by stripping the '_aml_code'
suffix from the variable name on generated files.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
7 years ago  x86/HVM: guard against emulator driving ioreq state in weird ways
Jan Beulich [Tue, 8 May 2018 17:12:56 +0000 (18:12 +0100)]
x86/HVM: guard against emulator driving ioreq state in weird ways

In the case where hvm_wait_for_io() calls wait_on_xen_event_channel(),
p->state ends up being read twice in succession: once to determine that
state != p->state, and then again at the top of the loop.  This gives a
compromised emulator a chance to change the state back between the two
reads, potentially keeping Xen in a loop indefinitely.

Instead:
* Read p->state once in each of the wait_on_xen_event_channel() tests,
* re-use that value the next time around,
* and insist that the states continue to transition "forward" (with the
  exception of the transition to STATE_IOREQ_NONE).

This is XSA-262.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
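Added illustration, not code from the Xen patch: a C model of the forward-only state rule the message describes, run over constructed sequences of values as a single fetch per round would observe them. STATE_IOREQ_NONE is the only state named above; the other three names and values are taken from Xen's public/hvm/ioreq.h, and the function names are invented here.

```c
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>
/* ioreq states as defined in Xen's public/hvm/ioreq.h. Only STATE_IOREQ_NONE
 * is named in the commit message; the other three come from that header. */
enum {
    STATE_IOREQ_NONE = 0,
    STATE_IOREQ_READY = 1,
    STATE_IOREQ_INPROCESS = 2,
    STATE_IORESP_READY = 3,
};
/* The rule the fix insists on: after a state has been observed, the emulator
 * may only move to a later state or back to STATE_IOREQ_NONE. A backwards
 * move is a protocol violation, not something to wait on again. */
static bool ioreq_transition_ok(uint8_t prev, uint8_t cur)
{
    if (cur == STATE_IOREQ_NONE)
        return true;
    return cur > prev && cur <= STATE_IORESP_READY;
}

/* Model of the wait loop over 'seen', a constructed sequence of values, one
 * per fetch of p->state: each value is read once and compared with the next
 * fetch rather than re-read. Returns the terminal state, or -1 on a
 * non-forward transition or if the sequence ends while still waiting. */
static int wait_for_io(const uint8_t *seen, size_t n)
{
    if (n == 0)
        return -1;
    uint8_t prev = seen[0];                 /* single read, kept for next round */
    for (size_t i = 1; ; i++) {
        if (prev == STATE_IOREQ_NONE || prev == STATE_IORESP_READY)
            return prev;                    /* request completed or withdrawn */
        if (i >= n)
            return -1;                      /* still READY/INPROCESS, no more data */
        uint8_t cur = seen[i];              /* the one fetch for this round */
        if (cur == prev)
            continue;                       /* unchanged: keep waiting */
        if (!ioreq_transition_ok(prev, cur))
            return -1;                      /* weird transition: stop, do not spin */
        prev = cur;
    }
}

int main(void)
{
    const uint8_t ok[] = { STATE_IOREQ_READY, STATE_IOREQ_INPROCESS, STATE_IORESP_READY };
    const uint8_t bad[] = { STATE_IOREQ_READY, STATE_IOREQ_INPROCESS, STATE_IOREQ_READY };
    return (wait_for_io(ok, 3) == STATE_IORESP_READY && wait_for_io(bad, 3) == -1) ? 0 : 1;
}
```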
7 years ago  x86/vpt: add support for IO-APIC routed interrupts
Xen Project Security Team [Tue, 8 May 2018 17:12:10 +0000 (18:12 +0100)]
x86/vpt: add support for IO-APIC routed interrupts

And modify the HPET code to make use of it. Currently HPET interrupts
are always treated as ISA and thus injected through the vPIC. This is
wrong because HPET interrupts when not in legacy mode should be
injected from the IO-APIC.

To make things worse, the supported interrupt routing values are set
to [20..23], which clearly falls outside of the ISA range, thus
leading to an ASSERT in debug builds or memory corruption in non-debug
builds because the interrupt injection code will write out of the
bounds of the arch.hvm_domain.vpic array.

Since the HPET interrupt source can change between ISA and IO-APIC,
always destroy the timer before changing the mode, or else Xen risks
changing it while the timer is active.

Note that vpt interrupt injection is racy in the sense that the
vIO-APIC RTE entry can be written by the guest in between the call to
pt_irq_masked and hvm_ioapic_assert, or the call to pt_update_irq and
pt_intr_post. Those are not deemed to be security issues, but rather
quirks of the current implementation. In the worst case the guest
might lose interrupts or get multiple interrupt vectors injected for
the same timer source.

This is part of XSA-261.

Address actual and potential compiler warnings. Fix formatting.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
7 years ago  x86/traps: Fix handling of #DB exceptions in hypervisor context
Andrew Cooper [Fri, 23 Mar 2018 17:03:42 +0000 (17:03 +0000)]
x86/traps: Fix handling of #DB exceptions in hypervisor context

The WARN_ON() can be triggered by guest activities, and emits a full stack
trace without rate limiting.  Swap it out for a ratelimited printk with just
enough information to work out what is going on.

Not all #DB exceptions are traps, so blindly continuing is not a safe action
to take.  We don't let PV guests select these settings in the real %dr7 to
begin with, but for added safety against unexpected situations, detect the
fault cases and crash in an obvious manner.

This is part of XSA-260 / CVE-2018-8897.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
7 years ago  x86/traps: Use an Interrupt Stack Table for #DB
Andrew Cooper [Thu, 22 Mar 2018 11:27:03 +0000 (11:27 +0000)]
x86/traps: Use an Interrupt Stack Table for #DB

PV guests can use architectural corner cases to cause #DB to be raised after
transitioning into supervisor mode.

Use an interrupt stack table for #DB to prevent the exception being taken with
a guest controlled stack pointer.

This is part of XSA-260 / CVE-2018-8897.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>